Object recognition is good enough for this kind of task now, IMO. The hard part is controlling physical hardware, which is typically handled with reinforcement learning. That's difficult for current techniques mainly because it's sequential: what you need to do now depends on the state you're in, which in turn depends on what you did a second ago. Since each step has to wait for the previous one to finish, you can't easily use GPUs to parallelize a huge amount of work the way you can for image generation, where you produce the entire output at once.
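
To make the sequential point concrete, here's a toy sketch. Everything in it is made up for illustration (`policy`, `env_step`, `rollout`, and `render_pixels` are hypothetical names, and the "dynamics" are just arithmetic), so treat it as a picture of the dependency structure, not as how any real RL system is written.

```python
# Toy illustration only: all names and dynamics here are hypothetical.

def policy(state):
    """Pick an action from the current state (simple proportional push toward 0)."""
    return -0.5 * state


def env_step(state, action):
    """Made-up dynamics: the next state depends on the current state and the action."""
    return state + action


def rollout(initial_state, num_steps):
    """Sequential: step t cannot start until step t-1 has produced its state."""
    states = [initial_state]
    for _ in range(num_steps):
        action = policy(states[-1])                  # needs the latest state
        states.append(env_step(states[-1], action))  # produces the next one
    return states


def render_pixels(width, height):
    """Independent: no pixel needs another pixel's value, so in principle
    every entry could be computed at the same time."""
    return [((x * y) % 256) / 255.0 for y in range(height) for x in range(width)]


if __name__ == "__main__":
    print(rollout(8.0, 3))
    print(len(render_pixels(4, 3)))
```

The loop in `rollout` can't be split across workers because each iteration reads the result of the one before it, while every entry in `render_pixels` can be computed independently. You can still run many separate rollouts side by side, but each individual one stays step-by-step.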