Friday, 27 February 2015

Similarities and Differences between Deep Q and Humans

Google's DeepMind team have developed a learning architecture ('Deep Q') that can learn how to play simple video games to the extent that it outperforms humans. Although the achievement doesn't represent a leap forward in terms of artificial intelligence concepts, it is impressive in engineering terms, and fun. Deep Q bolts together solutions to a number of diverse AI problems, such as image processing, learning, and decision-analysis, in a way that produces competent performance at a task to which we can easily relate.

What makes the Deep-Q architecture particularly interesting is its generality - without being given any game-specific information (other than, effectively, screenshots) it performed well on a large range of games, such as Pinball, Pong, Breakout and Space Invaders. It also performed badly on many others, such as Centipede and Asteroids. This difference in relative performance highlights some interesting things about how Deep Q works, and points to some possible ways in which human cognition tackles the problem of learning.

"You are defeated. Instead of shooting where I was, you should have shot at where I was going to be."
The first thing to note is that Deep Q is not in any way programmed to model the way the games it is playing work.  It doesn't try to work out how space invaders behave, for example; in fact, strictly speaking it doesn't even try to identify and classify objects it is seeing at all. Instead, it works on the principle of assigning values to whole images (screenshots from the game) - or rather, processed abstractions of those images. A 'high-value' image is one that is close to giving Deep Q some points - such as one where it is about to shoot a space invader mothership - whereas a low-value image is one associated with few points in the near future - such as one in which it is about to lose.

By taking account of the historical associations between images, its own actions, and future images, Deep Q works out which of its possible actions ('go up', 'press fire' etc.) is likely to result in its being given the highest-value possible image next.  Using this approach, Deep Q found some clever strategies. It worked out, for example, that having deep holes in the wall in 'Breakout' meant you could get the ball behind the bricks and net a large number of points.  The principle is that game-positions with holes in the wall were high-value; so game-positions with near-holes in the wall were also high-value; consequently, it took the actions that seemed, on past performance, to be associated with turning near-holes into actual holes, and ended up using a 'drilling' strategy.
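The principle can be sketched as a tabular Q-learning update. This is a deliberate simplification: the real Deep Q uses a deep neural network over processed screen images rather than a lookup table, and the state and action names below are invented for illustration.

```python
# Hypothetical, tiny illustration of the value-update principle behind
# Deep Q. States and actions are invented; the real system learns over
# processed screen images with a neural network, not a table.
ALPHA = 0.5   # learning rate
GAMMA = 0.9   # how much future value 'bleeds back' into earlier states

actions = ["left", "right", "fire"]
q = {}  # (state, action) -> estimated value

def get_q(state, action):
    return q.get((state, action), 0.0)

def update(state, action, reward, next_state):
    # Nudge the value of a state-action pair towards the immediate
    # reward plus the best value obtainable from the next state.
    best_next = max(get_q(next_state, a) for a in actions)
    old = get_q(state, action)
    q[(state, action)] = old + ALPHA * (reward + GAMMA * best_next - old)

# A near-hole position inherits value from the hole position it leads to.
update("hole-in-wall", "fire", 10.0, "ball-behind-bricks")
update("near-hole", "fire", 0.0, "hole-in-wall")
print(get_q("near-hole", "fire"))  # positive: near-holes become valuable too
```

The second update is the interesting one: 'near-hole' earns no points directly, but acquires value purely because it leads to the high-value 'hole-in-wall' position - which is how the drilling strategy emerges without any model of bricks or balls.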

So why did Deep Q do badly on some games and not others?  As the paper makes clear, Deep Q is bad at long-term planning.  Its assignment of values to game positions is based on their giving immediate points, or being likely to lead to game positions with immediate points if the right action is taken. So it did well in games where points came in at a steady trickle, e.g. for knocking out bricks or killing aliens. It did badly in games where points came in 'lumps' for achieving complex tasks - e.g. in 'Montezuma's Revenge' where one must get keys to open doors to reach ledges to get jewels and so on. Essentially, the greater the distance between the action and the reward, the harder it is for Deep Q to work out what it did right or wrong, and when.
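The distance problem can be illustrated with discounting: if future rewards are discounted by a factor below 1 per step, a reward n steps away contributes only a gamma-to-the-n fraction of its size to the current position's value. The figures below are illustrative, not taken from the paper.

```python
GAMMA = 0.9  # discount factor per step (illustrative value)

def discounted_value(reward, steps_away):
    """Contribution of a future reward to the current position's value."""
    return reward * GAMMA ** steps_away

# A steady trickle of small, nearby rewards vs one big, distant lump.
trickle = sum(discounted_value(1, n) for n in range(1, 11))
lump = discounted_value(100, 50)
print(round(trickle, 2))  # ten 1-point rewards, 1-10 steps away
print(round(lump, 2))     # one 100-point reward, 50 steps away
```

Ten small nearby rewards sum to roughly 5.9, while the 100-point reward fifty steps away is worth about 0.5 from where the learner stands - which is why the keys-doors-ledges chains of 'Montezuma's Revenge' are nearly invisible to it.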

Deep Q's achievement doesn't tell us that human decision-learning operates on a similar principle, but it tells us that it might. It opens up some interesting questions about the way we learn how to do things better.  Most saliently, it suggests that better learning could be achieved via at least two separate routes: the 'anticipation' route, and the 'reward' route.

The 'anticipation' route involves getting better at working out how future states-of-the-world are linked to present ones.  Humans do this through empirical analysis - most of it unconscious - using models of the world, or hypotheses.  Hypothesis-based reasoning is almost certainly a major advance over Deep Q-style approaches, which don't try to model the world at all, but it may only be advantageous in environments with a certain combination of regularity and randomness.  Better anticipation allows your actions to be linked to ever-more-distant rewards, promoting better decision-making overall at an additional cost in data-processing.

The 'reward' route involves getting better at decision-making by getting better at assigning values to states-of-the-world. This isn't about getting better at understanding how the world works - beyond a very low time-horizon, at least - but about getting better at realising what 'good' and 'bad' are.  If you know that 'snakes are bad', and can at least recognise snakes, you don't need to learn how snakes work, what venom is and so on - you just need to know that running away from one tends to produce a situation in which there is 'less snake' overall.

The two approaches have similar results - they bring 'actions' and 'rewards' closer to one another - but through different mechanisms.  The psychological evidence suggests that human learning is based on a combination of these kinds of approaches, whether or not they are implemented in anything like a Deep-Q-type architecture. Affect and belief are so entangled in our everyday reasoning that it's difficult for us to separate them experientially.  Hard-wired reward mechanisms - such as physical pleasure and pain - seem to work precisely because they give us tactical rewards for strategically-sound decision-making.

Perhaps what is most instructive about the parallels between Deep Q and human cognition is that humans seem to learn better when they are placed in the type of environment where Deep Q excels - i.e. one in which there is a continuous reinforcement of performance through regular small rewards and penalties, otherwise known as 'gamification'.  That computer games are addictive, and that machines like Deep Q can learn how to play them well, is not, after all, a coincidence.

Friday, 20 February 2015

Intelligence and Hypotheses

Dan Lewis reports on an interesting side-effect of LED traffic lights - they don't melt the snow that lands in front of them, leading to obscured signals, accidents and so on.  Engineers are apparently looking into installing snow detectors coupled to heaters or other such systems.  A design change has revealed a hitherto-unnoticed benefit of filament bulbs - their waste heat - that will now need to be replicated by a control system.

Overflows are simple but effective control systems
Photo: Myles Smith

Simple control systems are everywhere.  Overflows regulating water levels, governors on steam engines, crossguards on swords preventing contact with the blade, thermostats switching heating on and off, and fuses burning out in the event of a power surge are all simple mechanical control systems.  They fulfil design objectives by causing a difference in system behaviour when certain thresholds are met.  Thanks to cheap processors, electronic control systems have become ubiquitous (do people even use the term 'smart phone' or 'smart TV' any more?) but this feels like cheating: the ingenuity of the mechanical systems is so much more striking.
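A thermostat of the kind mentioned above can be sketched in a few lines: behaviour changes only when a threshold is crossed, with no beliefs or models anywhere in the loop. The temperature thresholds are invented for illustration; the hysteresis gap is what stops the heater flickering on and off around a single set-point.

```python
# A minimal thermostat sketch: a threshold control system with
# hysteresis. The temperatures are invented for illustration.
ON_BELOW = 18.0   # switch heating on below this temperature
OFF_ABOVE = 21.0  # switch heating off above this temperature

def thermostat(temperature, heating_on):
    """Return the new heater state given the current temperature."""
    if temperature < ON_BELOW:
        return True
    if temperature > OFF_ABOVE:
        return False
    return heating_on  # between thresholds: leave the state alone

state = False
for temp in [19.0, 17.5, 19.5, 21.5]:
    state = thermostat(temp, state)
    print(temp, "->", "on" if state else "off")
```

There is nothing here that could be called evidence-gathering or belief: the same temperature (19.0 degrees, say) produces different behaviour depending only on which threshold was crossed last.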

Living organisms have - or perhaps are - abundant control systems.  From plant photoreceptors through to the human brain, evolution has found a wealth of solutions to the problem of getting an organism to reproduce, via the acquisition of energy.  But from a user perspective, the brain doesn't feel like a mechanical control system.  Overflow pipes don't form beliefs about the water level, thermostats don't gather evidence about the temperature, fuses don't make a decision when they burn out, and the traffic lights certainly didn't notice that they weren't melting snow any more.  Our goal-seeking behaviour seems in contrast to be mediated through hypotheses about the world - we entertain cognitive models of the world, attach probabilities to them, run simulations with them, become attached to them, try to bring them about or avert them.

Are we just a more sophisticated version of Dr Nim?
There are several possibilities to consider here.  One is that this difference is only one of degree; perhaps hypotheses are simply epiphenomena of some underlying, basically-mechanical method of producing decisions that is not really 'about' anything?  At the other extreme is the possibility that the hypothesis-driven approach to decision-making is in some way a fully-general end-of-the-line.  Perhaps, once you have the ability to construct and evaluate hypotheses, 'better' is simply a matter of the speed and volume with which you do it and not a matter of more generality.  If mechanical control systems work, this theory goes, it's because they do what a hypothesis-driven decision-maker would do under the same circumstances, if it were prepared to waste processing power on problems as simple as preventing a bath from overflowing.

The most interesting possibility, however, is that while hypotheses do represent a genuinely distinct cognitive technology from purely-mechanical systems, they are not fully general, and that there are ways to design an all-purpose, all-environments decision-maker that don't involve anything like hypotheses.  Perhaps our present cognitive limitations simply prevent us from imagining it?

A lot depends on which of these possibilities is true.  At present, we can understand why artificially-intelligent systems work, even if not exactly how.  It is not fallacious to describe (for example) face recognition algorithms as essentially constructing and testing hypotheses about face-possibilities within images.  If artificial intelligences greater than ours operate on the hypothesis paradigm then we will still be able to understand why they work, impressed though we might be with their speed of operation and the complexity of those hypotheses.  But if the hypothesis-based approach is itself surpassed by a better cognitive technology, we might find ourselves at much more of a loss.