What makes the Deep Q architecture particularly interesting is its generality - without being given any game-specific information (other than, effectively, screenshots) it performed well on a wide range of games, such as Video Pinball, Pong, Breakout and Space Invaders. It also performed badly on many others, such as Centipede and Asteroids. This difference in relative performance highlights some interesting things about how Deep Q works, and points to some possible ways in which human cognition tackles the problem of learning.
"You are defeated. Instead of shooting where I was, you should have shot at where I was going to be."
By taking account of the historical associations between images, its own actions, and future images, Deep Q works out which of its possible actions ('go up', 'press fire' etc.) is likely to lead to the highest-value image next. Using this approach, Deep Q found some clever strategies. It worked out, for example, that having deep holes in the wall in 'Breakout' meant you could get the ball behind the bricks and net a large number of points. The principle is that game-positions with holes in the wall were high-value; so game-positions with near-holes in the wall were also high-value; consequently, it took the actions that seemed on past performance to be associated with turning near-holes into actual holes, and ended up using a 'drilling' strategy.
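The way value leaks backwards from 'holes' to 'near-holes' can be sketched with a toy example. This is not Deep Q itself, which uses a neural network over screen images; it is a minimal tabular Q-learning sketch on a hypothetical five-position chain, where only the final position pays a point. The function name, chain layout, and all parameter values are assumptions chosen for illustration.

```python
import random

def q_learning_chain(n_states=5, episodes=500, alpha=0.5, gamma=0.9,
                     epsilon=0.2, seed=0):
    """Tabular Q-learning on a hypothetical chain of positions 0..n_states-1.

    Actions: 0 = step left, 1 = step right. Reaching the last position
    pays 1 point and ends the episode; every other step pays nothing.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]  # q[state][action]
    goal = n_states - 1
    for _ in range(episodes):
        s = 0
        while s != goal:
            # Mostly act greedily, occasionally explore at random.
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1  # ties go right
            s_next = max(s - 1, 0) if a == 0 else s + 1
            r = 1.0 if s_next == goal else 0.0
            # One-step target: immediate points plus the discounted
            # value of the best action available in the next position.
            future = 0.0 if s_next == goal else gamma * max(q[s_next])
            q[s][a] += alpha * (r + future - q[s][a])
            s = s_next
    return q

q = q_learning_chain()
print([round(q[s][1], 3) for s in range(4)])  # value of 'step right' per position
```

Under these assumptions, the value of stepping right should approach roughly 0.73, 0.81, 0.9 and 1.0 from the first position to the last: positions next to the reward become valuable first, and positions next to those inherit a discounted share, which is the 'near-holes are also high-value' logic in miniature.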
So why did Deep Q do badly on some games and not others? As the paper makes clear, Deep Q is bad at long-term planning. Its assignment of values to game positions is based on their giving immediate points, or being likely to lead to game positions with immediate points if the right action is taken. So it did well in games where points came in at a steady trickle, e.g. for knocking out bricks or killing aliens. It did badly in games where points came in 'lumps' for achieving complex tasks - e.g. in 'Montezuma's Revenge' where one must get keys to open doors to reach ledges to get jewels and so on. Essentially, the greater the distance between the action and the reward, the harder it is for Deep Q to work out what it did right or wrong, and when.
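A rough sketch of why distance hurts, under deliberately generous assumptions: the agent below already walks straight to the reward every time, and values are updated one step at a time in the order the positions are visited. The function name, the chain setup, and the discount factor of 0.9 are hypothetical, and Deep Q's actual training procedure differs in many respects, so this is a toy illustration rather than a description of the paper's method.

```python
def backup_delay(distance, gamma=0.9, max_episodes=1000):
    """Return (episodes, start_value) for a chain whose only reward of
    1 point sits `distance` steps from the start.

    `episodes` is how many walks along the chain it takes before the
    start position's value becomes non-zero; `start_value` is that value.
    """
    values = [0.0] * distance  # value estimate for each non-terminal position
    for episode in range(1, max_episodes + 1):
        for s in range(distance):
            last_step = (s + 1 == distance)
            r = 1.0 if last_step else 0.0
            next_value = 0.0 if last_step else values[s + 1]
            # One-step backup: this position learns only from its
            # immediate successor, as that successor was last estimated.
            values[s] = r + gamma * next_value
        if values[0] > 0.0:
            return episode, values[0]
    return None, 0.0

for d in (1, 5, 20):
    print(d, backup_delay(d))
```

In this sketch the news of the reward travels back one position per walk, so a reward `distance` steps away needs `distance` walks to reach the start, and arrives shrunk by a factor of `gamma` for every step in between. An agent that has to stumble on the right sequence by exploration first, as in 'Montezuma's Revenge', is in a far worse position than this best case.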
Deep Q's achievement doesn't tell us that human decision-learning operates on a similar principle, but it tells us that it might. It opens up some interesting questions about the way we learn how to do things better. Perhaps most saliently, it points to the fact that better learning could be achieved using at least two separate routes: the 'anticipation' route, and the 'reward' route.
The 'anticipation' route involves getting better at working out how future states-of-the-world are linked to present ones. Humans do this through empirical analysis, most of which happens unconsciously, using models of the world, or hypotheses. Hypothesis-based reasoning is almost certainly a major advance over Deep Q-style approaches, which don't try to model the world as such, but it may only be advantageous in environments with a certain combination of regularity and randomness. With better anticipation, your actions can be linked to ever-more-distant rewards, which promotes better decision-making overall, at an additional cost in data-processing.
The 'reward' route involves getting better at decision-making by getting better at assigning values to states-of-the-world. This isn't about getting better at understanding how the world works - beyond a very short time-horizon, at least - but about getting better at realising what 'good' and 'bad' are. If you know that 'snakes are bad', and can at least recognise snakes, you don't need to learn how snakes work, what venom is and so on - you just need to know that running away from one tends to produce a situation in which there is 'less snake' overall.
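The snake example can be put in a few lines. This is a hypothetical sketch, not anything taken from Deep Q: the agent stands on a number line, the `value` function is a hand-written stand-in for a learned sense of 'less snake is better', and the only anticipation involved is a one-step look at where each move would leave it.

```python
def value(agent_pos, snake_pos):
    """Hypothetical value judgement: the further from the snake, the better.
    It encodes nothing about how snakes behave or why they are bad."""
    return abs(agent_pos - snake_pos)

def choose_move(agent_pos, snake_pos, moves=(-1, 0, 1)):
    """Pick the move whose immediate result has the highest value."""
    return max(moves, key=lambda m: value(agent_pos + m, snake_pos))

print(choose_move(5, 3))  # snake is to the left, so step right: 1
print(choose_move(2, 3))  # snake is to the right, so step left: -1
```

All of the useful behaviour here comes from the quality of `value`; improving the agent along the 'reward' route means improving that function, not teaching it anything more about the world.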
The two approaches have similar results - they bring 'actions' and 'rewards' closer to one another - but through different mechanisms. The psychological evidence suggests that human learning is based on a combination of these kinds of approaches, whether or not they are implemented in anything like a Deep Q-type architecture. Affect and belief are so entangled in our everyday reasoning that it's difficult for us to separate them experientially. Hard-wired reward mechanisms - such as physical pleasure and pain - seem to work precisely because they give us tactical rewards for strategically sound decision-making.
Perhaps what is most instructive about the parallels between Deep Q and human cognition is that humans seem to learn better when they are placed in the type of environment where Deep Q excels - i.e. one in which there is a continuous reinforcement of performance through regular small rewards and penalties, otherwise known as 'gamification'. That computer games are addictive, and that machines like Deep Q can learn how to play them well, is not, after all, a coincidence.