Friday, 27 February 2015

Similarities and Differences between Deep Q and Humans

Google's DeepMind team have developed a learning architecture ('Deep Q') that can learn how to play simple video games to the extent that it outperforms humans. Although the achievement doesn't represent a leap forward in terms of artificial intelligence concepts, it is impressive in engineering terms, and fun. Deep Q bolts together solutions to a number of diverse AI problems, such as image processing, learning, and decision-analysis, in a way that produces competent performance at a task to which we can easily relate.

What makes the Deep Q architecture particularly interesting is its generality - without being given any game-specific information (other than, effectively, screenshots) it performed well on a wide range of games, such as Video Pinball, Pong, Breakout and Space Invaders. It also performed badly on many others, such as Centipede and Asteroids. This difference in relative performance highlights some interesting things about how Deep Q works, and points to some possible ways in which human cognition tackles the problem of learning.

"You are defeated. Instead of shooting where I was, you should have shot at where I was going to be."
The first thing to note is that Deep Q is not in any way programmed to model how the games it plays actually work. It doesn't try to work out how space invaders behave, for example; in fact, strictly speaking, it doesn't even try to identify and classify the objects it sees. Instead, it works on the principle of assigning values to whole images (screenshots from the game) - or rather, processed abstractions of those images. A 'high-value' image is one that is close to giving Deep Q some points - such as one where it is about to shoot a space invader mothership - whereas a 'low-value' image is one associated with few points in the near future - such as one in which it is about to lose.

By taking account of the historical associations between images, its own actions, and future images, Deep Q works out which of its possible actions ('go up', 'press fire' etc.) is likely to result in its being given the highest-value image possible next. Using this approach, Deep Q found some clever strategies. It worked out, for example, that having deep holes in the wall in 'Breakout' meant you could get the ball behind the bricks and net a large number of points. The principle is that game-positions with holes in the wall were high-value; so game-positions with near-holes in the wall were also high-value; consequently, it took the actions that seemed, on past performance, to be associated with turning near-holes into actual holes, and ended up using a 'drilling' strategy.
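The way value 'leaks' backwards from point-scoring positions to the positions that lead to them can be sketched with a toy example. This is not Deep Q itself, which uses a neural network over processed screenshots; it is a minimal tabular Q-learning sketch on a made-up five-position corridor, where only stepping into the last position scores a point. The function name, the corridor, and all the parameter values (alpha, gamma, epsilon) are assumptions chosen for illustration.

```python
import random

def q_learning_chain(n_states=5, episodes=500, alpha=0.5, gamma=0.9,
                     epsilon=0.2, seed=0):
    """Toy tabular Q-learning: positions 0..n_states-1 in a corridor.

    The only reward (1 point) comes from stepping into the last position.
    Returns a table q[state][action] with action 0 = left, 1 = right.
    """
    rng = random.Random(seed)
    moves = (-1, +1)
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        while state != goal:
            # Mostly take the action that currently looks best,
            # but explore at random some of the time.
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = min(max(state + moves[action], 0), goal)
            reward = 1.0 if next_state == goal else 0.0
            # Value of where we landed: zero once the episode is over.
            best_next = 0.0 if next_state == goal else max(q[next_state])
            # Nudge the estimate towards reward plus discounted future value.
            q[state][action] += alpha * (reward + gamma * best_next
                                         - q[state][action])
            state = next_state
    return q

q_table = q_learning_chain()
# Learned value of each non-terminal position (best action from there).
values = [max(row) for row in q_table[:-1]]
print(values)
```

After training, positions nearer the scoring step carry higher values than positions further away, even though the further ones never score directly. That is the same logic by which near-holes in the Breakout wall come to look valuable.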

So why did Deep Q do badly on some games and not others?  As the paper makes clear, Deep Q is bad at long-term planning.  Its assignment of values to game positions is based on their giving immediate points, or being likely to lead to game positions with immediate points if the right action is taken. So it did well in games where points came in at a steady trickle, e.g. for knocking out bricks or killing aliens. It did badly in games where points came in 'lumps' for achieving complex tasks - e.g. in 'Montezuma's Revenge' where one must get keys to open doors to reach ledges to get jewels and so on. Essentially, the greater the distance between the action and the reward, the harder it is for Deep Q to work out what it did right or wrong, and when.
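One way to picture the 'distance' problem is through discounting, the standard device in this family of learning methods for making later rewards count for less than earlier ones. The sketch below is an illustration only: the discount factor of 0.99, the 100-step horizon and the two reward streams are assumptions, not figures taken from the paper, and discounting is only part of the story (a learner also has to stumble on the distant reward in the first place).

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards, each weighted by gamma ** (steps until it arrives)."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

def reward_weight(delay, gamma=0.99):
    """How much a reward arriving `delay` steps from now counts today."""
    return gamma ** delay

# Hypothetical reward streams over 100 steps, both paying 100 points in total.
steady_trickle = [1.0] * 100          # a point per step, e.g. a brick each time
single_lump = [0.0] * 99 + [100.0]    # nothing until one big payoff at the end

print("steady trickle:", discounted_return(steady_trickle))
print("single lump:   ", discounted_return(single_lump))
print("weight of a reward 10, 100, 1000 steps away:",
      [reward_weight(n) for n in (10, 100, 1000)])
```

Under these assumptions the steady trickle is worth more from the starting position than the single lump, and the weight given to a reward shrinks steadily as it moves further into the future, so the signal linking an early action to a late payoff becomes faint.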

Deep Q's achievement doesn't tell us that human decision-learning operates on a similar principle, but it tells us that it might. It opens up some interesting questions about the way we learn how to do things better.  Perhaps most saliently, it points to the fact that better learning could be achieved using at least two separate routes: the 'anticipation' route, and the 'reward' route.  

The 'anticipation' route involves getting better at working out how future states-of-the-world are linked to present ones. Humans do this through empirical analysis, most of which happens unconsciously, using models of the world, or hypotheses. Hypothesis-based reasoning is almost certainly a major advance over Deep Q-style approaches, which don't try to model the world as such, but it may only be advantageous in environments with a certain combination of regularity and randomness. With better anticipation, your actions can be linked to ever-more-distant rewards, which promotes better decision-making overall, at additional cost in data-processing.

The 'reward' route involves getting better at decision-making by getting better at assigning values to states-of-the-world. This isn't about getting better at understanding how the world works - beyond a very low time-horizon, at least - but about getting better at realising what 'good' and 'bad' are.  If you know that 'snakes are bad', and can at least recognise snakes, you don't need to learn how snakes work, what venom is and so on - you just need to know that running away from one tends to produce a situation in which there is 'less snake' overall.

The two approaches have similar results - they bring 'actions' and 'rewards' closer to one another - but through different mechanisms. The psychological evidence suggests that human learning is based on a combination of these kinds of approaches, whether or not they are implemented in anything like a Deep Q-type architecture. Affect and belief are so entangled in our everyday reasoning that it's difficult for us to separate them experientially. Hard-wired reward mechanisms - such as physical pleasure and pain - seem to work precisely because they give us tactical rewards for strategically-sound decision-making.

Perhaps what is most instructive about the parallels between Deep Q and human cognition is that humans seem to learn better when they are placed in the type of environment where Deep Q excels - i.e. one in which there is a continuous reinforcement of performance through regular small rewards and penalties, otherwise known as 'gamification'.  That computer games are addictive, and that machines like Deep Q can learn how to play them well, is not, after all, a coincidence.


Dr G said...


Interesting, as always. Your separation of ‘anticipation’ and ‘reward’ is appealing, but I’m not sure it reflects how people develop an understanding of the world. The ‘anticipation’ route sounds like accounts of how we construct meaning - a concept that really sets us apart (along with other things - such as the ability to recognise context, and to generate and consider alternative explanations).

With regard to the ‘reward’ route, I’m not convinced that it isn’t just part of the same thing. Assigning snakes as ‘bad’ involves an emotional component that will be associated with the ‘model’ of a snake that the person has constructed. So yes, they can avoid snakes, but at some level this is due to recognising what is perceived as something that means danger. Beyond this, it’s easier to predict the actions of entities if we can not only describe their characteristics (including behaviour), but also explain them. So I would suggest that beyond a simple level of perception-action, it IS about understanding how the world works.

I know you’ve stated that human learning is based on a combination of the two things, and that they are entangled, but I think these two things are actually just part of the same thing.

Your final comment about ‘reinforcement’ is an interesting one, but it sounds worryingly like behaviourism for my liking (largely rejected since the 1950s). Though some of the concepts appear to be useful for certain things - such as gamification - we shouldn’t forget about the basic principles of learning, and the critical component is feedback.

Nick Hare said...

I agree, although I think I wasn't clear enough that I was talking about how you could solve the problem of 'making a learner' (i.e. you can make them better at assigning values, or predicting the world, or both). I wasn't talking about how, as a learning agent, you should approach the problem of learning. I suppose the answer would be to hack your own architecture as effectively as possible.

On the question of whether the 'reward' and 'anticipation' routes are really the same thing, I think there is a big subjective difference between e.g. our aversion to heights, snakes etc., which doesn't have a rationale (at point-of-use) and our aversion to getting into debt or not wearing a seat-belt, which are based on our ability to anticipate risks.