Object recognition is good enough for this kind of task now, IMO. The problem is controlling physical hardware, which is typically handled by reinforcement learning. That's hard for current techniques mainly because it's sequential (i.e. what you need to do now depends on the state you're in, which depends on what you did a second ago). Being sequential means you can't easily use GPUs to parallelize a huge amount of work, the way you can for image generation, where you produce the entire thing at once.
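To make the sequential point concrete, here's a toy sketch in plain Python. Everything in it is made up for illustration (`policy`, `step`, `rollout`, `render_pixel` and `render_image` are hypothetical names, not any real robotics or RL API), and the "image" half is a caricature of parallel work, not how actual image generators run.

```python
# Toy illustration only: all names here are hypothetical, not a real API.

def policy(state: float, target: float) -> float:
    """Choose an action from the current state (simple proportional control)."""
    return 0.5 * (target - state)


def step(state: float, action: float) -> float:
    """Toy dynamics: the action shifts the state directly."""
    return state + action


def rollout(start: float, target: float, n_steps: int) -> list[float]:
    """Sequential: step t cannot start until step t-1 has produced its state."""
    states = [start]
    for _ in range(n_steps):
        action = policy(states[-1], target)      # needs the latest state
        states.append(step(states[-1], action))  # which needed the last action
    return states


def render_pixel(i: int) -> int:
    """Independent work item: output i depends only on the index i."""
    return (i * 37) % 256


def render_image(n_pixels: int) -> list[int]:
    """No pixel depends on another, so these could all run at the same time."""
    return [render_pixel(i) for i in range(n_pixels)]


if __name__ == "__main__":
    print(rollout(0.0, 8.0, 3))
    print(render_image(4))
```

The loop in `rollout` has a data dependency from one iteration to the next, so extra parallel hardware doesn't shorten a single trajectory. The comprehension in `render_image` has no such dependency, which is the kind of work a GPU can spread across many cores.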