Tuesday, 28 October 2014

Vagueness as an Optimal Communication Strategy

Imagine the following game.  You are given a random number between 1 and 100 and must communicate its value to a friend.  But you're only allowed to use the words 'low' and 'high'.  You win £100, less the difference between your number and your friend's guess.  If your number is 40 and she guesses 60, you'll win £80 (£100 less the £20 difference).  If your number is 40 and she guesses 90, you'll only win £50.  You're allowed to discuss strategy in advance, and we will assume that you are risk averse and so are interested in finding a way of maximising your guaranteed win (a 'maximin' strategy).    

One approach is this: you might agree in advance that you will say 'low' if the number is 1-50, and 'high' if it's 51-100.  Your friend would then guess 25 when you say 'low', and 75 when you say 'high'.  That way you'll never win less than £75.
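The guaranteed win under this scheme can be checked by brute force.  The following is a minimal sketch, assuming the number is a whole number from 1 to 100; the function name `guaranteed_win` and the layout of `scheme` are illustrative choices, not part of the game as described.

```python
def guaranteed_win(groups, prize=100):
    """Worst-case win for an agreed signalling scheme.

    `groups` maps each word to (lowest number, highest number, agreed guess).
    """
    worst = prize
    for low, high, guess in groups.values():
        for number in range(low, high + 1):
            # The win is the prize less the gap between number and guess.
            worst = min(worst, prize - abs(number - guess))
    return worst


scheme = {"low": (1, 50, 25), "high": (51, 100, 75)}
print(guaranteed_win(scheme))  # 75
```

The worst cases are the numbers 50 and 100, each of which is 25 away from the agreed guess.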

But what happens if you're given the number 50?  In this case, it might be better not to communicate at all.  That way your friend has no information and may choose to minimise losses by guessing 50, guaranteeing (she thinks) a win of at least £50 but in fact (as you know) giving you a win of £100. This would be a form of strategic vagueness: your friend is more likely to make a better decision if you don't use your limited lexicon of 'high' and 'low' at all.

"...I couldn't possibly comment."

Given that the aim of analytical communication is accurately to convey information, and that our language, while rich, is nevertheless finite in its expressive range, we should expect to find real-life situations in which a strategically vague response is more likely to induce the correct belief in the receiver of the message.  Of course there are venal reasons for vagueness, including avoidance of accountability.  There is also a range of other philosophical theories for the origin of linguistic and epistemic vagueness.  But the possibility that strategic vagueness might be an optimal communications strategy opens some interesting questions for analysts and their customers.    

Friday, 24 October 2014

Terror Threat Levels and Intervals Between Attacks

The shootings in Ottawa on 22 October, which came only two days after a vehicle attack in Quebec, are an anomaly given the phenomenally low incidence of terrorism within Canada.  According to the Global Terrorism Database, only 8 people have been killed in terrorist attacks within Canada since 1970.  The investigation is only just underway, but the attacks already seem to have some surprising features.  They are allegedly unconnected, at least in terms of any planning.  The terror threat level in Canada was raised to 'moderate' after the first attack, in response (the authorities say) to 'increased chatter' from radical groups, but the shooter on 22 October, Michael Zehaf-Bibeau, apparently has no direct connection to IS or AQ.  The details, when they emerge, will clearly hold lessons for intelligence analysts.

As in this case, terrorist threat levels often seem to change after an attack has happened.  This sometimes seems a bit like shutting the door after the horse has bolted.  But in fact, as the Canada incident illustrates, terrorist attacks really do come in clumps.  Canada's terrorist attacks are so infrequent that there is an average of about 230 days between them, since 1970.  A purely random process with this kind of interval would mean that only 3% of terrorist attacks would occur within the same week.  In Canada's case, though, nearly one in five terrorist attacks occurs within a week of the last one.  In other words, the rate of terrorist attacks rises by a factor of about six immediately following an attack.  A terrorist attack is therefore a relatively good indicator of another imminent attack - even disregarding any intelligence received - and the authorities almost certainly did the right thing in raising the threat level.
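The 3% figure can be reproduced with a short calculation.  This sketch assumes that 'purely random' means a Poisson process, so that the gaps between attacks follow an exponential distribution; the function name `share_within` is illustrative.

```python
import math


def share_within(days, mean_interval_days):
    """Probability that the next event of a Poisson process falls within `days`."""
    return 1 - math.exp(-days / mean_interval_days)


random_share = share_within(7, 230)
print(f"{random_share:.1%}")  # 3.0%

# Ratio of the observed share ("nearly one in five") to the random share.
print(round(0.2 / random_share, 1))  # roughly 6 to 7
```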

The effect is also clear in countries with relatively high levels of terrorism.  The chart below shows the observed frequency of intervals (in days) between attacks in Northern Ireland since 1970 (in blue) and the frequency that would be observed if attacks came purely randomly (in red).

As the chart shows, attacks are more likely to occur immediately after another attack (whether through reprisals or co-ordinated activity), although the effect is smaller than in Canada.  After about 2-3 days of peace the immediacy effect washes out and the daily frequency returns to the long-run average.

Wednesday, 22 October 2014

Cognitive Task Sequencing

While at Los Alamos, Richard Feynman designed an interesting way of debugging computer programs before the computers actually arrived:

From 'Surely You're Joking, Mr Feynman'

Effective application of cognitive structures and analytical techniques is somewhat like this.  Breaking an analytical approach to a question into a series of separate, logically-sequential, cognitive tasks makes each of the stages easier to perform and, as an added bonus, gives you an audit trail for the final answer.  The CIA's Tradecraft Primer suggests the following approach to task sequencing:

In the UK, the MOD and Cabinet Office intelligence analysis guide, Quick Wins for Busy Analysts, contains the following:

What these have in common is the evolution of a project from divergent, creative types of approach to convergent, critical, probabilistic assessment, or in other words from hypothesis generation to hypothesis testing.  Put simply, good analytical sequencing is the application of the scientific method.  In the words of Richard Feynman again:

"In general we look for a new law by the following process. First we guess it. Then we compute the consequences of the guess to see what would be implied if this law that we guessed is right. Then we compare the result of the computation to nature, with experiment or experience, compare it directly with observation, to see if it works. If it disagrees with experiment it is wrong. In that simple statement is the key to science."

A Simple Base Rate Formula

A number of studies have identified 'base rate neglect' as a significant factor undermining forecasting performance even among experts.  A 'base rate' is not easy to define precisely, but in essence it's a sort of 'information-lite' probability which you might assign to something if you had no specific information about the thing in question.  For example, since North Korea has had a nuclear test three times in about the last nine years, a base rate for another test in the next year would be about one in three, or 33%.  If you're asked to make a judgement about the probability of an event in a forthcoming period of time, you should first construct a base rate, then use your knowledge of the specifics to adjust the probability up or down.  It seems simplistic, but anchoring to a base rate has been shown significantly to improve forecasting performance.

If your arithmetic is rusty, you can use the following simple formula to get a base rate for the occurrence of a defined event:

How far AHEAD are you looking?  Call this 'A'.
How far BACK can you look to identify comparable events?  Call this 'B'.  (Make sure the units are the same as for 'A' - e.g. months, years.)
What NUMBER of events of this kind have happened over this timeframe, anywhere?  Call this 'N'
How big is the POPULATION of entities of which your subject of interest is a part?  Call this 'P' 

Your starting base rate is then given by: (A x N) / (B x P)

For example, suppose we were interested in the probability of a successful coup in Iran in the next five years.

How far AHEAD are we looking? 5 (years)
How far BACK can we look to identify comparable events?  68 (years)
What NUMBER of events of this kind have happened over this timeframe?  223 (successful coups since 1946, according to the Center for Systemic Peace)
How big is the POPULATION of entities (countries, in this case) of which Iran is a part?  The data cover 165 countries

The base rate is therefore: (5 x 223) / (68 x 165) = 0.099, or 9.9%, or more appropriately 'about one in ten'.
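The same arithmetic as a small function, for anyone who prefers to script it.  This is only a sketch of the formula above; the function and parameter names are illustrative.

```python
def base_rate(ahead, back, number, population):
    """Starting base rate: (A x N) / (B x P).  `ahead` and `back` share units."""
    return (ahead * number) / (back * population)


# Iran coup example: 5 years ahead, 68 years back, 223 coups, 165 countries.
print(round(base_rate(5, 68, 223, 165), 3))  # 0.099

# North Korea example: 1 year ahead, 9 years back, 3 tests, 1 country.
print(round(base_rate(1, 9, 3, 1), 2))  # 0.33
```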

Remember this is just a starting point, not a forecast.  And there isn't just one base rate for an event - it will depend on how you classify the event and how good your data are.  But doing this simple step first will help mitigate a significant bias.

(NB. If you're dealing with events that have no precedents, or if the events are relatively frequent compared to your forecast horizon, you have a different problem on your hands and shouldn't use a simple formula like the one above.)

Monday, 20 October 2014

What Coincidences Tell Us

Today I happened to walk down South Hill Park in Hampstead.  South Hill Park is greatly beloved of fact fans as it's associated with one of the most remarkable coincidences in London's history.  South Hill Park is relatively well-known as the street where Ruth Ellis, the last woman to be hanged in the UK, shot her lover outside the Magdala pub.  What is less well known is that the second-last woman to be hanged in the UK, Styllou Christofi, also (and completely unrelatedly) committed the murder of which she was convicted on South Hill Park, at house number eleven.  The cherry on the aleatory cake is that South Hill Park is, quite unusually, distinctively noose-shaped.

Ruth Ellis, Styllou Christofi, the noose-shaped South Hill Park

Hardened rationalists though we may be, we have to admit that this is an extremely tantalising confluence of events.  Great coincidences like these demand our attention.  The interesting thing about our reaction to coincidences is what it tells us about our cognitive machinery.

First, though, it is worth noting that, by-and-large, the occurrence of coincidences is not objectively remarkable.  Allowing that the South Hill Park coincidence has - let's say - a one-in-a-million probability, and that there are - getting very cavalier with our concepts - several billion facts, it's a certainty that a large number of the facts constitute coincidences.  Coincidences involving notorious people are a small proportion of the whole because there aren't that many notorious people, but we shouldn't, objectively, be that surprised.  

Why are we, then?  To answer this we need to work out what makes something a coincidence in the first place.  This is not as easy as it seems it should be.  Our first reaction is to say that a coincidence is some kind of low-probability event.  But this falls over very quickly on examination, since almost every event is low probability.  A bridge hand consisting of all thirteen spades is exactly as probable as a bridge hand consisting of 2-K of spades and the ace of hearts, or indeed any other particular combination of cards.  Discovering that all four people in your team share a birthday is exactly as probable as discovering that they all have some other specified but unremarkable combination of birthdays.
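To put a number on the bridge example: every fully specified thirteen-card hand is one of the same number of equally likely hands.  A quick check using only the standard library:

```python
from math import comb

# Number of distinct 13-card hands that can be dealt from a 52-card deck.
hands = comb(52, 13)

# Any one fully specified hand, remarkable or not, has this probability.
p_specific_hand = 1 / hands

print(hands)  # 635013559600
```

All thirteen spades and any named 'boring' hand each come up about once in 635 billion deals.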

What makes something a coincidence is not, then, directly the probability.  Coincidences seem instead to require data points with shared values across multiple fields.  This is a somewhat abstract way of putting things, but what it means is that:

(Name = Alice, Team = my team, Birthday = 1 June)
(Name = Bob, Team = my team, Birthday = 3 August)
(Name = Charlie, Team = my team, Birthday = 19 December)
(Name = Dave, Team = my team, Birthday = 5 June)

is not a coincidence, because the data have shared values in only one field ('Team'), but that

(Name = Alice, Team = my team, Birthday = 1 June)
(Name = Bob, Team = my team, Birthday = 1 June)
(Name = Charlie, Team = my team, Birthday = 1 June)
(Name = Dave, Team = my team, Birthday = 1 June)

is a coincidence, because the data have shared values in two fields ('Team' and 'Birthday').  It's not about probability - the probabilities of both these datasets are equal - but about features of the data.  The interesting thing is that, put like this, the connection between coincidence and hypothesis testing becomes very clear.  Broadly, shared data values provide evidence for a hypothesis that posits a lawlike relationship between the two fields; in this case, the data support (to an extent) hypotheses that I only hire people whose birthday is on 1 June, or that babies born in early summer are particularly suited to this kind of work, and so on.  It is only the low prior probability for hypotheses of this kind that leads us, ultimately, to dismiss them as possible explanations, and to accept the data as a 'mere' coincidence.
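The 'shared values across multiple fields' test is easy to state in code.  The sketch below is one illustrative reading of it, not a full definition of coincidence: `shared_fields` is a made-up helper that lists the fields on which every record agrees.

```python
def shared_fields(records):
    """Return the fields whose value is identical across every record."""
    first = records[0]
    return [field for field in first
            if all(record[field] == first[field] for record in records)]


ordinary = [
    {"Name": "Alice", "Team": "my team", "Birthday": "1 June"},
    {"Name": "Bob", "Team": "my team", "Birthday": "3 August"},
    {"Name": "Charlie", "Team": "my team", "Birthday": "19 December"},
    {"Name": "Dave", "Team": "my team", "Birthday": "5 June"},
]
# Same people, but with every birthday set to 1 June.
striking = [dict(record, Birthday="1 June") for record in ordinary]

print(shared_fields(ordinary))  # ['Team']
print(shared_fields(striking))  # ['Team', 'Birthday']
```

On this reading, a dataset starts to look like a coincidence once the list contains two or more fields.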

This doesn't of course stop us wondering at them.  And this is the interesting thing: it suggests that our sense of wonder at coincidences is how a currently-running unresolved hypothesis-generation algorithm expresses itself.  When we get the explanation, the coincidence - and the wonder - go away.  If I tell you that last night I was at a party and met not one but two old schoolfriends who I hadn't seen for twenty years, this is a remarkable coincidence until I tell you it was a school reunion, at which point it stops being interesting at least in one regard.  The data are the same, but the existence of the explanation rubs away the magic.  The feeling of awe we get from coincidences is the feeling of our brain doing science.

That's No Moon

Mimas, everyone's second-favourite moon, has been found to have an unorthodox wobble that is too big for a moon that size with a solid internal structure.  The authors of a paper in Science offer two explanations: that it has a rugby-ball-shaped core, or that it has an internal liquid ocean.  This is a great example of hypothesis generation, which is both a key part of analysis, and risky and difficult.  It is a fairly straightforward matter (assuming you know about moons and gravity and so on) to work out that the observed wobble isn't consistent with a solid core.  It's quite another matter to come up with possible hypotheses that are consistent with the evidence.  Not least, this is because (as a matter of logic) there are an infinite number of hypotheses that could account for any given set of evidence.  This means that there is no algorithm that can exhaustively generate all the possible explanations for a set of data.  How humans do it is still a matter for debate and undoubtedly one of the most impressive features of our cognitive architecture.

Logarithmic Probability Scores

Expressing probability as a percentage is something we're all used to doing, but in many ways it's of limited usefulness.  It's particularly inadequate for communicating small probabilities of the kinds considered in risk assessments such as the UK's National Security Risk Assessment, which looks at a range of risks whose probabilities might vary by orders of magnitude, from say one in a million to one in ten.  For these kinds of situations - where we're more interested in the order of magnitude than in the precise percentage - a logarithmic ('log') scale might be a better communication tool, and potentially more likely to support optimal decisions about risk.

On a log scale, a one point increase equates to an increase in magnitude by a constant factor, usually ten (since it makes the maths easier).  A log scale might start at '0' for a 1 in 1,000,000 probability, move to '1' for 1 in 100,000, and so on through to '5' for 1 in 10 (i.e. 10%) and '6' for 1 in 1 (i.e. 100% probable).  (A more mathematically-purist probability scale would actually top out at '0' for 100%, and use increasingly negative numbers for ever-lower probabilities.)  A log scale also brings the advantage that, if paired with a log scale for impact, expected impact - which is a highly decision-relevant metric - can be calculated by simply adding the scores up (since it's equivalent to multiplying together a probability and an impact).  One thing it can't do, though, is express a 0% probability (although arguably nothing should ever be assigned a 0% probability unless it's logically impossible).
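A minimal sketch of the 0-to-6 scale described above, together with the 'add the scores' property.  The impact scale here (log10 of a loss in pounds) and the example figures are assumptions made purely for illustration.

```python
import math


def log_score(probability, top=6):
    """Score on a base-10 scale: 0 for 1 in 1,000,000, `top` for certainty."""
    if not 0 < probability <= 1:
        raise ValueError("a log scale cannot express a probability of 0")
    return top + math.log10(probability)


def impact_score(loss):
    """Hypothetical impact scale: log10 of the loss in pounds."""
    return math.log10(loss)


p, loss = 0.001, 100_000_000
combined = log_score(p) + impact_score(loss)

# Adding the scores matches multiplying probability by impact,
# once the scale's offset of 6 is subtracted.
assert math.isclose(combined - 6, math.log10(p * loss))
print(round(log_score(0.1), 6), round(combined, 6))  # 5.0 11.0
```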

For these reasons, log scales are used to simplify communications in a number of risk-related domains where the objects of interest vary in order-of-magnitude.  The Richter Scale is a log scale.  The Torino scale is used to combine the impact probability and kinetic energy for near-earth objects:

Several intelligence analysis organisations have developed standardised lexicons for expressing probabilities, such as the UK Defence Intelligence's uncertainty yardstick (last slide) and the US National Intelligence Council's 'What we Mean When We Say' (p.5) guidance.  To my knowledge, however, there are no standardised log scales used in these areas.  There may be an argument for their development, to enable easier communication and comparison of risk judgements concerning high-impact, low-probability events across the intelligence community and government more widely.  

Friday, 17 October 2014

Pareto Redux: 80:20 in Information Gathering

The so-called 'Pareto Principle' states, of systems to which it applies, that 80% of something is distributed within just 20% of something else, and vice versa.  People will often mention this in the context of allocating effort: if 80% of work is done in the first 20% of time, it might be better to produce five things 80% well rather than one thing 100% well.  Although it's frequently and often inappropriately cited by charlatans as an excuse for doing things badly, the Pareto Principle does have a mathematical basis in that many systems and processes produce a 'power law' distribution that can sometimes roughly follow the '80:20 rule'.

Interestingly, information-gathering, at least within a fairly abstract model, is one of these processes.  'Information' is here defined fairly standardly as anything which would change a probability.  As is often the case, we can use an 'urn problem' to stand in for a range of real-life analytical and decision problems.  Here, there are two urns containing different-coloured balls:

One urn is chosen at random (50-50) and the other destroyed.  You need to work out whether Urn A or Urn B has been chosen - because you win £1m if you guess correctly.  Balls are repeatedly drawn out of the chosen urn and then replaced.

Every black ball that is drawn increases the probability that Urn A is the chosen one.  Every white ball concomitantly reduces it (and increases the probability of Urn B).  The impact on probability is very well-understood: each ball doubles the odds of its respective associated urn.  If we start with odds of 1-1 (representing 50% probability of each urn), a black ball will increase the odds of Urn A to 2-1.  A second black ball will increase them to 4-1.  If we then draw a white ball, the odds go back down to 2-1, and so on.  If Urn A was chosen, the number of black balls would gradually outpace the number of white balls (in the long run) and the assessed probability of Urn A would creep closer and closer to (but never actually equal) 100%.
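The updating rule can be sketched in a few lines.  The doubling factor is taken from the text; the function names and the 'B'/'W' encoding of draws are illustrative.

```python
def update_odds(draws, odds=1.0, factor=2.0):
    """Odds in favour of Urn A after a sequence of draws.

    'B' is a black ball (favours Urn A); 'W' is a white ball (favours Urn B).
    """
    for ball in draws:
        odds = odds * factor if ball == "B" else odds / factor
    return odds


def odds_to_probability(odds):
    """Convert odds of x-1 in favour into a probability."""
    return odds / (1 + odds)


print(update_odds("BBW"))  # 2.0
print(odds_to_probability(update_odds("BB")))  # 0.8
```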

Because the odds ratio - which is what information directly affects - is related to the probability in a non-linear fashion, we end up with a Pareto-Principle-type effect when looking at probability compared to information quantity (i.e. number of balls drawn).  In fact, the relationship between probability and information quantity, on the assumption that the target fact is true, looks like this:

The relationship between information and probability is fairly linear from 50% to about 80%.  Above 90% the curve steepens dramatically: each further percentage point of probability requires far more information than the last, and on the assumption that information continues to be drawn from the same sample space (e.g. the same urn, with replacement) the probability edges closer and closer to 100% with increasing amounts of information, without ever reaching it.

The implication is something that most analysts realise intuitively: there is a diminishing marginal return to information the closer you are to certainty, in terms of the impact it has on the probability of your target hypothesis.  The amount of information that will get you from 50% to 80% probability will only take you from 80% to about 94%, and from about 94% to about 98%.  Because the expected utility of a decision scales linearly with probability (in all but exceptional cases), there is indeed an 'optimal' level of information at which it will be better simply to make a decision and stop processing information.
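The 50%, 80%, 94%, 98% sequence follows directly from repeated doubling of the odds.  A short sketch, measuring information as the net number of doublings (the function name is illustrative):

```python
def probability_after(doublings):
    """Probability reached from 50% after a net number of odds doublings."""
    odds = 2.0 ** doublings
    return odds / (1 + odds)


for n in (0, 2, 4, 6):
    print(n, f"{probability_after(n):.1%}")
# 0 50.0%
# 2 80.0%
# 4 94.1%
# 6 98.5%
```

Each step of two doublings is the same quantity of information, but the probability gained shrinks from 30 points to about 14 to about 4.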

Resource-management under Uncertainty; Bats

The police have reported being over-stretched with the large number of investigations into possible Islamist terrorist activity.  In my experience most analytical organisations feel stretched most of the time, and analysts always worry about missing things.  Curiously though, I haven't heard about any organisations that attempt to match resources to uncertainty in an explicit sort of way.  Apart from in exceptional circumstances, analysts tend to be reallocated within organisations fairly slowly in response to changes in perceived threat levels, and analytical organisations very rarely change significantly in size in the short term.  This is in contrast to many naturally-evolved information-processing systems.

The analytical resource-management task is fairly easy to state in cost-benefit terms.  Analytical work on a topic has an expected benefit in terms of the extent to which it reduces uncertainty, and a cost.  It adds value because of the possibility of it changing a decision.  If you have £10 to bet on a two-horse race, with each horse valued at 1-1, and you currently believe Horse A has a 60% probability of winning, you will (if you have to) bet on it and expect to win £10 60% of the time, and lose £10 40% of the time, for a net expected profit of £2.  If you are offered new information that will accurately predict the winning horse 90% of the time, your winnings will (whatever it tells you) be expected to rise to £8.  This information would therefore have net value of £6 and if it costs less than that then you should buy it.
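The horse-race arithmetic can be laid out explicitly.  This sketch assumes that, once you have the tip, you always back the horse it names, so that your chance of being right equals the tip's 90% accuracy; the function name is illustrative.

```python
def expected_profit(stake, p_correct):
    """Expected profit from an even-money (1-1) bet won with probability p_correct."""
    return stake * p_correct - stake * (1 - p_correct)


without_info = expected_profit(10, 0.6)
with_info = expected_profit(10, 0.9)

print(round(without_info, 2))              # 2.0
print(round(with_info, 2))                 # 8.0
print(round(with_info - without_info, 2))  # 6.0, the most the tip is worth
```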

Although there is no easily-stated general solution to this problem, you can show how the optimal amount of resources to throw at an analytical problem will change, and roughly how significantly, when the problem parameters change.  When the costs and risks associated with a problem rise, analysis becomes more valuable.  When new information increases uncertainty (i.e. pushes the probabilities associated with an outcome away from 0 or 1), analysis becomes more valuable.  When information becomes easier (cheaper) to gather, analysis becomes more valuable.  The optimising organisation might attempt to measure these things and move resources accordingly - both between analytical topics (within analysis teams) and towards and away from analysis in general (within larger organisations of which only parts produce analysis).

Of course it's not as simple as that, and that's not very simple in the first place.  It's expensive to move analytical resources, not least because it takes time for humans to become effective at analysing a new topic.  This adds an additional dimension of complexity to the problem.  But it surprises me that firms and other analysis organisations don't attempt explicitly to measure the value their analysis adds - perhaps by looking at the relative magnitude of the decisions it helps support - because among other things this would give them a basis on which to argue for more resources, and a framework to help explain why surprises were missed when they were.

Animals have evolved interesting solutions to this problem that we might learn from.  As humans, we do this so naturally we barely notice it.  In high-risk situations - while driving, negotiating a ski run, or when walking around an antiques shop - our information processing goes into overdrive at the expense of our other faculties such as speech or abstract thought (one theory suggests this is why time seems to slow down when a disaster is unfolding).  Generally, humans sleep at night - switching information-processing to a bare minimum - when threat levels (from e.g. large predators) are lowest in the evolutionary environment.

A bat yesterday

Bats, however, make an interesting study because their information-gathering is easy to measure.  When bats are searching for insects they emit sonar pulses at intervals of 100ms.  This enables them to 'see' objects up to 30-40 metres away but with a resolution of only ten times a second.  When bats are in the final moments before capture (where the decision-informing benefit of frequent information is much higher) this pulse interval falls to 5ms - 200 times a second - but this pulse only provides information about objects less than about 1 metre away.  Sonar is expensive though.  A medium-sized bat needs about one moth an hour.  This would rise to around twenty moths an hour if its sonar were kept switched on at full power.  Bats have therefore found a solution to the problem of optimising information-acquisition from which analysis organisations perhaps have something to learn.

Tuesday, 14 October 2014

When are Conditional Probabilities Equal?

One of the simplest inferential mistakes is to assume a conditional probability is the same as its inverse.  For example, around 75% of criminals are male, but it would be a mistake to assume that 75% of males are criminal.  This seems fairly obvious.  Yet on the level of intuitive reasoning, we don't seem to be particularly effective at keeping the two distinct, as documented in (for example) Villejoubert and Mandel (2002).  A fallacy of this sort might have lain behind the UK's controversial counter-terrorism campaign in 2008: even assuming that the majority of terrorists exhibit suspicious behaviour, it doesn't remotely follow that the majority of people exhibiting suspicious behaviour are terrorists.

It's therefore worth remembering when inverse probabilities are equal to one another, so as to take inferential precautions when they are not.  A simple implication of Bayes' theorem is that the conditional probability of A given B is similar to the conditional probability of B given A when the probabilities of A and B are close to one another.  If there were approximately the same number of terrorists as there were people behaving suspiciously, and most terrorists behaved suspiciously, then one could infer that most suspiciously-behaving people were terrorists.

The opposite is also true.  If there is a big discrepancy between the probabilities of A and B, the conditional probabilities will also differ significantly, the exception being when there is zero probability that A and B can be true together.  
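Both cases can be seen by applying Bayes' theorem directly.  The figures below are invented for illustration only and are not estimates of anything real.

```python
def inverse_conditional(p_b_given_a, p_a, p_b):
    """P(A|B) from P(B|A) via Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


# P(A) and P(B) are close: the inverse stays near the original 0.75.
print(round(inverse_conditional(0.75, 0.10, 0.11), 3))  # 0.682

# P(A) is a hundredth of P(B): the inverse collapses.
print(round(inverse_conditional(0.75, 0.001, 0.10), 4))  # 0.0075
```

The ratio of the two conditionals is always P(A) / P(B), which is why the marginal probabilities are the thing to check before inverting.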

Modelling Over- (and Under-) Confidence

Overconfidence is a prevalent and well-documented phenomenon.  Its practical manifestation is the effect it has on people's probability judgements.  The overconfident analyst places probabilities on hypotheses or scenarios that are unwarrantedly close to 0 or 1, effectively expressing a higher level of certainty than the evidence supports.  Overconfidence exposes decision-makers to the risk of bad decisions and has an unambiguously negative effect on expected returns under uncertainty.

Overconfidence is also relatively easy to measure: it manifests itself as miscalibration, which occurs when (for example) significantly more than 10% of statements assigned a 10% probability actually turn out to be true.  This is easily depicted graphically, such as in Tetlock (2005):

Tetlock (2005)

Here, 'subjective probability' (the x-axis) is the probability actually placed by participants on the (numerous) statements in Tetlock's study.  'Objective frequency' is the observed frequency with which those statements subsequently came true.  The shape of the observed calibration curves (the dotted lines) has been replicated in a number of other studies including an unpublished small-sample survey of UK government analytical product my team conducted in 2012.

Analyses of overconfidence tend to take a descriptive statistical or psychological approach.  Psychologists have proposed various decompositions of the mechanisms driving overconfidence, including 'support theory', 'random support theory', 'probabilistic mental models' and others.  These aim parsimoniously to identify sets of heuristics that account for observed deviations of probabilistic judgements from observed frequency.

I have found that an abstract approach, rooted in the theory of inference, and almost certainly not novel, replicates some of the observed findings using a single parameter that might stand in for a range of mechanisms.  An ideal inference machine would produce a probability that is essentially a summary of the quantity of information received, expressed in terms of Shannon information content (a standard and powerful information-theoretic concept).  If we apply a single modifier to information content - as though we were over- or under-estimating the quantity of information received - this produces calibration curves that are similar to the observed data:

(The family of functions is: estimated probability = p^k / (p^k + (1-p)^k), where k is an 'overconfidence parameter' modifying information content, and p is the subjective probability.  This can be straightforwardly derived from the relationship between binary probability and Shannon information content.)  The simplicity of this function, which has only one parameter, might make it useful for modelling the impact of miscalibration on decisionmaking, whether or not it corresponds to any real psychological mechanism.
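A short sketch of this family of functions, with a check that it is equivalent to multiplying the log-odds of p by k.  The function names and the example values of k are illustrative.

```python
import math


def rescale(p, k):
    """p^k / (p^k + (1-p)^k), the one-parameter family given above."""
    return p**k / (p**k + (1 - p) ** k)


def log_odds(p):
    """Natural log of the odds corresponding to probability p."""
    return math.log(p / (1 - p))


# The transformation multiplies the log-odds by k.
assert math.isclose(log_odds(rescale(0.7, 2.0)), 2.0 * log_odds(0.7))

print(round(rescale(0.7, 2.0), 3))  # 0.845, pushed away from 0.5
print(round(rescale(0.7, 0.5), 3))  # 0.604, pulled towards 0.5
print(rescale(0.5, 3.0))            # 0.5, unchanged for any k
```

Values of k above 1 push probabilities towards 0 or 1, values below 1 pull them towards 0.5, and k = 1 leaves them unchanged.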

Sunday, 12 October 2014

Malware and Optimal Deception

The Stack reports on a talk by Professor Giovanni Vigna concerning the increasing sophistication of malware in its attempts to evade detection by effectively lying low on a target system.  Detection is the enemy of malware, because it is quickly followed by updates to anti-virus software to enable detection of its signature and subsequent removal from the ecosystem.  Of particular importance to the creators of malware is that it does not become active in a virtual machine of the kind used by anti-virus software to encourage safe activation.  Malware therefore needs to do two things that work against one another: to evade detection by staying quiet, and to deliver its payload by becoming active.

This is a very common problem in decision strategy when participants' interests are not aligned, and when there is relevant but hidden information.  Under these conditions, the interaction becomes what in game theory parlance is known as a 'signalling game'.  In the situation described above, there are actually two simultaneous signalling games going on: the malware is trying to signal that it is benign, and the anti-virus's virtual snare is trying to signal to the malware that it's a real machine ready to be infected.

Signalling games occur everywhere.  Interviewees want to appear more competent than they are.  Men and women want to appear more attractive than they are to potential suitors.  Manufacturers want their goods to look high-quality even when they aren't.  Both China and the US want to signal resolve over Taiwan and avoid revealing the true price they'd be willing to pay for their objectives.  The economist Robin Hanson holds the view, with considerable conceptual support, that signalling is the main driver of most human behaviour.  The game of poker - among many others - is fundamentally a signalling game; players with strong hands want to look mediocre, and players with weak hands want to look strong.

The theory of signalling games is a surprisingly recent invention, generally considered to have begun with George Akerlof's 1970 article 'The Market for Lemons'.  Signalling games can resolve one of three ways.  'Pooling' and 'separating' occur when everyone transmits either the same signature or unique signatures respectively.  The more-interesting mixed outcome is when only a proportion of players send the 'false' signal.  As a general rule, 'good' signal senders (non-malicious software) and receivers (anti-virus software) would like separating solutions in which they can identify one another.  'Bad' senders (malware) would like to be in a pooling or mixed solution in which they reap the rewards that should be due only to the 'good' players.

It turns out that in these situations a key determinant is signalling costs, and specifically whether or not the receiver can impose a higher cost on 'bad' types than on 'good' types.  This kind of solution occurs everywhere: it's easier for a clever person to get a good degree than a stupid one; it's easier for an attractive person to look nice than an ugly one; it's easier for a manufacturer of high-end cars to demonstrate their quality than a firm smashing out bangers; it's easier for a competent person to pass an interview than an incompetent one.  Good malware design, and good anti-virus detection, will take account of this by trying to find ways to make it costlier for virtual machines to pretend to be real, or for malware to look benign.
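The role of differential costs can be sketched with a toy model.  Everything here is an assumption for illustration rather than anything from the talk: a single reward for being believed 'good', one signalling cost per type, and no mixed outcomes.

```python
def signalling_outcome(reward, cost_good, cost_bad):
    """Classify a toy signalling game.

    Assumes the 'bad' type never finds the signal cheaper than the 'good'
    type (cost_good <= cost_bad).  Each type sends the signal only if the
    reward for being believed 'good' exceeds its own cost.
    """
    if cost_good > cost_bad:
        raise ValueError("toy model assumes cost_good <= cost_bad")
    good_sends = reward > cost_good
    bad_sends = reward > cost_bad
    if good_sends and not bad_sends:
        return "separating"  # the signal reliably identifies 'good' types
    return "pooling"         # both types behave alike; the signal says nothing


# Hypothetical numbers: a wide cost gap separates the types, no gap pools them.
print(signalling_outcome(reward=10, cost_good=2, cost_bad=15))
print(signalling_outcome(reward=10, cost_good=2, cost_bad=2))
```

In this sketch the receiver's lever is the gap between the two costs: widening it is what turns a pooling outcome into a separating one.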

Saturday, 11 October 2014

Unforeseen Consequences

"Facebook apologises to drag queens and transgender people after deleting profiles" is a surprising headline, in the technical sense of describing a set of events that would ex ante have attracted a relatively low probability.  However, the chances are it's not something anyone, including apparently Facebook's policy people, gave very much thought to.  Facebook was enforcing its 'real names only' policy by deleting clearly-pseudonymous profiles.  Apparently pseudonymous profiles are a mode de vie for many drag queens.  Who knew?  (Answer: the transgender community.)

In this regard, it's a bit like The Arab Spring, Russia's annexing of Crimea, or The Only Way is Essex.  These were all things that weren't so much dismissed as low probability as barely considered as possible outcomes.  This is a category of error now commonly known as 'failure of imagination'.  Unlike the various probabilistic reasoning biases, which are well-understood and relatively easily corrected for, there isn't a set of steps you can follow to guarantee immunity from failure of imagination.  The evidence concerning structured brainstorming, which is widely regarded as one of the easiest and most effective collaborative imaginative techniques, only suggests a limited impact on the productivity of ideas compared to other approaches.

Importantly, however, failure of imagination is sometimes a risk it's optimal to take.  Generating hypotheses and scenarios has a cognitive and sometimes economic cost.  For relatively easily-rectified decisions (such as Facebook's) it's not usually worth spending too much time thinking of all the outcomes associated with them.  By definition, surprising things don't happen very often.  In practice, we should always assume that alongside the things we have thought about, there is a small probability of some important thing that we haven't, and plan accordingly.    

The Difficulty of Decision-making with Potentially Catastrophic Outcomes

The UK government faces an archetypal policy challenge in considering whether or not to introduce Ebola screening at borders.  In one sense this is a fairly standard decision-problem.  If we decide not to screen, there is a range of possible outcomes, from no deaths or a handful through to a full-scale epidemic.  The probabilities of these outcomes are presumably estimable using standard epidemiological techniques applied to the specific pathology and etiology of Ebola.  If we decide to screen, we will (one presumes) reduce these probabilities across the board, for the price of an additional cost in time and resources.

The optimal decision will of course depend on the expected outcomes under the 'screen' and 'no screen' options.  The cost of screening will be fairly easy to estimate.  However, the uncertainty associated with epidemic outcomes makes optimal decision-making distinctly problematic.  This is particularly so when the probability distribution of the outcomes has a long tail that reaches to catastrophic dimensions.  Small changes in probabilities can have an extremely high impact on expected outcomes under these circumstances.  If, perhaps as the result of new clinical data, the probability of a million deaths goes from 0.1% to 0.2%, the expected impact would be equivalent to 1000 more deaths - quite possibly enough to change a policymaker's mind, and perhaps rightly so.
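The arithmetic behind that example can be checked in a few lines.  This is only a sketch: the single-outcome 'distribution' and the function name are illustrative assumptions, not an epidemiological model.

```python
def expected_deaths(outcomes):
    """Expected deaths for a list of (probability, deaths) pairs."""
    return sum(probability * deaths for probability, deaths in outcomes)


# Hypothetical catastrophic tail: one million deaths at 0.1% versus 0.2%.
before = expected_deaths([(0.001, 1_000_000)])
after = expected_deaths([(0.002, 1_000_000)])

# The tail probability moves by a tenth of a percentage point, yet the
# expected toll shifts by roughly a thousand deaths.
print(round(after - before))
```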

In extreme cases - and perhaps only in theory - expected impacts could in fact be undefined.  Suppose we have a 90% chance of an impact of 1 unit, a 9% chance of an impact of 10 units, a 0.9% chance of an impact of 100 units, a 0.09% chance of an impact of 1000 units and so on.  In this case, the expected impact is 0.9 + 0.9 + 0.9 + ...; in other words, it is apparently limitless.  This has an impact on optimal decision-making that is still the subject of debate after nearly 300 years.
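A quick way to see the series running away is to add up its first few terms exactly.  The function below is a hypothetical helper written for this illustration; it uses exact fractions so that no rounding hides the pattern.

```python
from fractions import Fraction


def partial_expectation(n_terms):
    """Expected impact contributed by the first n_terms outcomes of the series."""
    total = Fraction(0)
    for k in range(1, n_terms + 1):
        probability = Fraction(9, 10**k)  # 90%, 9%, 0.9%, ...
        impact = 10 ** (k - 1)            # 1, 10, 100, ...
        total += probability * impact     # every term contributes exactly 9/10
    return total


for n in (1, 10, 100):
    print(n, float(partial_expectation(n)))
```

Each extra outcome adds another 0.9 to the total, so the partial sum is 0.9 times the number of terms included and never settles down.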

The 'St Petersburg Paradox' was posed by Nicolas Bernoulli in 1713 and famously analysed by Daniel Bernoulli in 1738.  It originally concerned the following bet: I toss a coin, and if it's heads you win £1.  If it's tails, the pot doubles to £2 and we play again.  If it's heads this time, you win the £2, but if it's tails, the pot doubles to £4, and so on.  This also has an unbounded expectation, and in theory you should be willing to pay any amount of money to play it.  This is almost impossible to believe, yet if you simulate this game repeatedly, the average winnings per game do, indeed, inexorably continue to rise.
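The game is easy to simulate.  The sketch below assumes a fair coin and the payout scheme described above; the function names and the seed are arbitrary choices made for illustration.

```python
import random


def play_once(rng):
    """One game: the pot starts at £1 and doubles on every tail; heads pays out."""
    pot = 1
    while rng.random() < 0.5:  # tails, with probability 1/2
        pot *= 2
    return pot


def truncated_expectation(max_tosses):
    """Expected winnings counting only games that end within max_tosses tosses."""
    # Heads first appears on toss k with probability (1/2)**k and pays 2**(k-1),
    # so every possible toss contributes exactly 0.5 to the expectation.
    return sum((0.5 ** k) * 2 ** (k - 1) for k in range(1, max_tosses + 1))


def average_winnings(n_games, seed=0):
    """Mean payout over n_games simulated games (seeded for repeatability)."""
    rng = random.Random(seed)
    return sum(play_once(rng) for _ in range(n_games)) / n_games


for n in (100, 10_000, 1_000_000):
    print(n, average_winnings(n))
```

The truncated expectation grows by 50p for every additional toss allowed, which is why the simulated average tends to drift upwards as more games are played, although it does so in occasional jumps rather than smoothly.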

Ebola is not like this.  For one thing, UK deaths can't get much higher than 64 million - not that this outcome would be particularly palatable.  But the policy dilemma has similar features to that presented by the St Petersburg Paradox: the huge sensitivity of expectations to our epidemiological assumptions means it is very difficult to put a reasonable value on mitigation measures.

Thursday, 9 October 2014

The Length of Insurgencies

According to this very well-researched and highly relevant piece from RAND, the mean length of an insurgency is ten years.  Of course this figure is downwardly-biased because some insurgencies haven't ended (like FARC in Colombia, which has been going on for 47 years) and these are more likely to be at the upper end of the scale.  Ignoring for a moment all the specifics about the Islamic State insurgency, what in general should we expect its lifespan to be?

The various insurgencies in Syria and Iraq over the last decade are a bit like the bands of the 1970s/80s New Wave scene in New York in that many of them do not have clear start and end points.  (In other respects, the two things are quite distinct of course.)  According to Wikipedia, the modern history of the IS begins with al-Zarqawi in 1999, but as an insurgency it didn't take off until 2003.  In 2004 it affiliated with AQ to become AQ-I, then became ISI in 2006.  Its incarnation as IS is of course less than a year old.  So IS's insurgency is either less than a year old, or possibly eight years old, or eleven years old, or perhaps older.  What difference do these interpretations make to the expected time the insurgency has left?

The answer is 'not much'.  Looking at the dataset RAND used, which consists of structured data and judgements about 89 insurgencies beginning with the Chinese Civil War, the decay of insurgencies is fairly constant over time.  Unlike with, say, humans, knowing how long an insurgency has been going on for actually gives you very little information (by itself) about how long it has left.  For at least the first twenty years of an insurgency, the expected future lifespan remains at between 9 and 13 years.  The base-rate for an insurgency ending in any given year is only around 5-10%.
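This 'memoryless' behaviour is what you would expect if the chance of ending were the same every year.  The sketch below assumes exactly that, a constant annual hazard, which is a simplification of the RAND data rather than something taken from it; the function name and the cut-off horizon are illustrative.

```python
def expected_remaining_years(hazard, age, horizon=2000):
    """Expected further years for an insurgency that has already lasted `age` years.

    Assumes a constant probability `hazard` of ending in any given year.
    The sum is cut off at `horizon` years, beyond which the remaining
    probability is negligible for the hazards used here.
    """
    def p_total(years):
        # Probability of lasting exactly `years` years in total.
        return (1 - hazard) ** (years - 1) * hazard

    p_survived = (1 - hazard) ** age  # probability of lasting beyond `age`
    weighted = sum((t - age) * p_total(t) for t in range(age + 1, horizon))
    return weighted / p_survived


# Under this assumption the answer does not depend on the current age.
for age in (0, 1, 8, 11):
    print(age, round(expected_remaining_years(0.10, age), 2))
```

With a constant hazard the expected time remaining is simply one over the annual ending probability, whichever start date you pick for IS.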

All things being equal, then, we shouldn't expect the IS insurgency to finish sooner rather than later.  Of course, not all things are equal.  RAND's analysis points to a number of factors that promote earlier resolution of insurgencies, including withdrawal of government support, desertion, infiltration, intelligence penetration, and - perhaps significantly - use of terrorism, which they found often works against the perpetrators' chance of victory.  However, they also found relevant factors that lengthen insurgencies, including a weak government, and multiple parties involved.

What all this points to is a lot of uncertainty.  We should be suspicious of anyone claiming to have a very good idea about the prospects for IS, Iraq and the region in general.

Clarity on Type I and Type II Errors

The concepts of 'Type I' and 'Type II' errors are taught early on in all statistical analysis courses.  They are often referred to in intelligence contexts as well, and particularly in the field of warning, e.g. here and here.  It is important to realise that the Type I / Type II error distinction (a) is properly applied only to decisions, not to analysis, and (b) depends on an arbitrary demarcation between 'action' and 'inaction' which is difficult to support under careful examination.

A Type I error is typically described as a 'false positive'.  The archetypal example is to infer that someone has a disease, when they don't.  A Type II error is typically described as a 'false negative'.  This is where you give someone the all-clear, when in fact they have the disease.

The reason that this distinction is not applicable to analysis - assuming it's done properly - is that analysis is directed at computing a probability that something is true; e.g. the probability that your patient has cancer, or that the DPRK is about to invade South Korea.  There are all kinds of critical thinking errors you can commit to bias your probability estimate, but it doesn't make much sense to call any of them 'Type I' or 'Type II' errors.  For one thing, there is no 'positive' or 'negative' to consider, only a probability.

Type I and Type II errors don't come into the picture unless an action is involved - for example, the decision to prescribe chemotherapy or to mobilise troops.  If, when all the facts are known, you'd rather have taken a different course of action, you'd be a victim of one of these kinds of errors.  But it doesn't necessarily mean that the analysis was wrong, even though it was a component of the decision.  You could just have been unlucky.  Things which are 99% likely to be true are still false one time in a hundred.  (You'll see Type I and Type II errors in statistical analysis text-books, but this is in the context of the 'rejection' or 'non-rejection' of the 'null hypothesis' - i.e. a decision.)

The distinction between Type I and Type II errors is an arbitrary one.  There is no metaphysical distinction between 'acting' and 'not acting' (or 'acts' and 'omissions').  Prescribing nothing, or keeping troops at a low readiness level, are both actions.  The idea that they represent a 'baseline' can't be made terribly coherent in conceptual terms, and any attempt to do so is vulnerable to an arbitrary relabeling.

Imagine a metal detector which stays silent until it detects metal, when it emits a bleep.  Here, bleeping in the absence of metal would be a 'Type I error' and staying silent in the presence of metal would be a 'Type II error'.  Now imagine a 'metal-lessness detector' which continually bleeps until it detects the absence of metal, when it emits silence.  This time, the Type I and Type II errors are the other way round.  But this 'metal-lessness detector' is functionally identical to the metal detector in every respect.
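The relabelling can be made concrete in a short sketch.  The function and variable names are invented for this illustration; the point is only that the same physical event receives opposite labels depending on which signal is treated as the 'positive'.

```python
def classify(signal_given, condition_present):
    """Label an outcome relative to whatever the device counts as its positive."""
    if signal_given and not condition_present:
        return "Type I"   # false positive
    if not signal_given and condition_present:
        return "Type II"  # false negative
    return "correct"


# One physical event: the device bleeps, but there is no metal.
bleeps, metal = True, False

# Metal detector: the positive signal is a bleep, the condition is metal.
as_metal_detector = classify(signal_given=bleeps, condition_present=metal)

# Metal-lessness detector: the positive signal is silence, the condition is no metal.
as_metallessness_detector = classify(signal_given=not bleeps,
                                     condition_present=not metal)

print(as_metal_detector, as_metallessness_detector)
```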

Claims that Type I and Type II errors are relevant to analysis - and particularly claims that the distinction between the two is significant - should therefore be examined critically.

Tuesday, 7 October 2014

The Irreducible Vagueness of Truth Conditions

One aspect of taking part in the Good Judgment Project that I've found surprising is the difficulty of deciding whether a scenario has occurred even when the facts are known.  The GJP uses a variety of methods to extract judgments about the probability of scenarios from participants.  The following question - now resolved as 'yes' - is typical:

"Will national military force(s) from one or more countries intervene in Syria before 1 December 2014?"

The GJP provides clarification for certain common terms, in this case including the following definitions:

"An intervention, in this case, will be considered the use of national military force(s) to oppose armed groups in Syria (e.g., airstrikes, the deployment of troops). The provision of support services alone will not be sufficient (e.g., sending military advisers, providing surveillance, or transferring military equipment). For the purposes of this question, the countries represented must acknowledge the intervention, though they do not have to use this term specifically. Whether the intervention is done with the assent of the Syrian government is irrelevant. Inadvertent shelling or border crossing would not count. Outcome will be determined by credible open source media reporting (e.g., Reuters, BBC, AP)."

"National military forces refer to the ground, sea, and/or air components of a country’s official armed forces (e.g., the United States military). This generally excludes members of a country’s intelligence service as well as paramilitary groups, including insurgents, mercenaries, etc. As needed, in resolution criteria, GJP may specify that the definition includes additional entities, such as law enforcement forces."

Even 'before' merits clarification:

"'Before' should be interpreted to mean at or prior to the end (23:59:59 ET) of the previous day. For example, 'before 10 October' means any time up to 23:59:59 ET on 9 Oct. "

Not long after the question was posted, someone noticed an ambiguity: what about CIA drone strikes?  A legalistic approach might rule them out, but then this would seem to go against the spirit of the question.  Perhaps the clarifications were wrong?

This sort of discussion is very common in the GJP, and exposes a very important feature of intelligence - and indeed policy - work, namely that there is a vast amount of uncertainty and ambiguity about what we mean by concepts such as 'war', 'conflict', 'occupy', 'threat', 'intent', 'objective' to an extent that sometimes bleeds through into our picture of what is actually happening.

The logical positivists of the 1920s and 30s tried to draw a distinction between disagreements about language and definitions (they supposed that this was the role of philosophy), and disagreements about facts (this was the role of science).  Unfortunately, for deeper logical reasons, this distinction is itself an ambiguous one, as Quine discusses in his seminal essay 'Two Dogmas of Empiricism'.  There is, it seems, an irreducible grey area between disagreements about language, and those about facts.

This is an everyday problem for analysts in the workplace.  We can help matters, as the GJP does, by attempting to provide standard definitions for common terms.  But we need to learn to live with the edge cases, and consider definitional ambiguity as yet another source of uncertainty.

Monday, 6 October 2014

'Making the Call' is a Sub-Optimal Strategy

Should analysts ever 'make the call'?

'Making the call' means making a final decision for one option over another.  The phrase seems to originate from US sports, in the context of a decision about whether a foul has been committed or a ball was out.  It might or might not be related to 'calling' an outcome.  Or it could plausibly be related to 'calling' in poker, which ends a round of betting and pushes forward the action.

Decisions are largely binary.  They have to go one way or another.  The very idea of facing a decision implies rivalry between your options in the sense that you can only take one course of action and not lots of them.  But beliefs are continuous.  You can assign some belief (in the form of probability) to a large range of possible facts, or hypotheses.  A simple boundary between the role of the analyst and that of the decision-maker is that the former assigns probability to a range of possible facts, and the latter deploys resources in exactly one way.  Decision-makers 'make the call'.

Should analysts 'make the call'?  The best interpretation I can put on this idea is that it implies assigning all of your belief to a single hypothesis, or at least communicating to a decision-maker as though that's what you've done.  Certainty undoubtedly sells, in the media as well as in business and government.  But it is not a good strategy for making optimal decisions.  Good decision-making is about finding value, not about taking the option which is most likely to produce a positive outcome.

As with all decision-making under uncertainty, it's exactly like betting on a horse race.  A horse with a 10% chance of winning at odds of 20-1 is a profit-making bet, whereas a horse with a 60% chance of winning at odds of 3-1 on (i.e. 1-3) is a loss-making bet.  An analyst 'making the call' would declare the first horse a loser and the second a winner, and the wrong bets would be made.  Without the probabilities to hand, a gambler - or a strategic decision-maker - is not in a position to identify the optimal course of action.  Analysts need to stick to probabilities and let decision-makers make the call.
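The two bets can be checked directly.  This is a minimal sketch assuming a £1 stake and fractional odds expressed as profit per unit staked; the function name is illustrative.

```python
from fractions import Fraction


def expected_profit(p_win, odds, stake=1):
    """Expected profit on `stake` at fractional odds (profit per unit staked)."""
    return p_win * odds * stake - (1 - p_win) * stake


# Long shot: 10% chance of winning at 20-1.
long_shot = expected_profit(Fraction(1, 10), 20)

# Favourite: 60% chance of winning at 3-1 on, i.e. 1-3.
favourite = expected_profit(Fraction(3, 5), Fraction(1, 3))

print(float(long_shot), float(favourite))
```

The long shot returns an expected profit of £1.10 per £1 staked, while the favourite loses an expected 20p, even though it is six times more likely to win.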

Sunday, 5 October 2014

The Width / Length Puzzle in Big Data

I was recently sent a link to this (nearly year-old) talk by Harvard's Sendhil Mullainathan.  He discusses the difference between 'length' and 'width' of data.  Most discussions of big data focus on length - we have more data points.  Professor Mullainathan considers the implications of what happens when you have more 'width' - i.e. when you have access to more information about the data points you have - and makes some profound speculations about what it means for identification of hypotheses by machines.

If you were studying the relationship between, for example, heart disease and lifestyle, adding more people to your dataset would increase its length.  Measuring more things about those people - what time they eat, how much exercise they do and at what times, what their sleep patterns are and so on - would increase its width.  The new proliferation - near-ubiquity - of mobile sensors offers a great deal more data width.  This presents both a potential problem, and a puzzle.

The traditional approach to statistical inference involves the researcher 'suggesting' a type of hypothesis which they think (using expertise, common sense, or a whim) might be generating the data.  This could be something like 'heart disease is partly affected by smoking and BMI'.  Then the algorithms go to work: they find a 'best' model with which the data are most consistent, and also check whether the data are explicable by chance.  In the old days running these algorithms involved paper, pens, and slide rules.  Now of course they happen inside analytics software.

But the explosion of processing power has in the last couple of decades given us a new way of doing business.  We no longer have manually to propose a type of model, relying on expertise and so on.  As well as using machines to test particular hypotheses, we can now get them to generate them.  There isn't much more to this than the statistical equivalent of brute force: we try every possible combination of our variables (e.g. presence of heart disease, hours of sleep, cigarettes smoked, BMI etc.), then check each combination to see if it produces a powerful hypothesis.  This means we can test vast numbers of possible hypotheses without needing to know anything about the subject-matter.

This, then, is the problem: the old way of doing business strongly militates against finding spurious hypotheses - those which explain the data only by coincidence.  If you tested just three or four models, each of which had strong theoretical foundations, and found one which passed relevant tests, you could feel confident about having found something interesting.  If you test three or four billion models, you know that a proportion of them (around 5% of those with no real effect, using the traditional 5% significance threshold) will come out looking significant.
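The 5% figure can be illustrated by testing a large number of 'models' on pure noise.  The sketch below is an assumption-laden stand-in for real model search: each model is just a two-sided z-test of whether a sample of standard normal noise has zero mean, using the conventional 1.96 cut-off.

```python
import math
import random


def false_positive_rate(n_models, n_obs=50, seed=0):
    """Fraction of pure-noise 'models' that pass a 5% two-sided z-test."""
    rng = random.Random(seed)
    critical_z = 1.96  # conventional two-sided 5% threshold
    hits = 0
    for _ in range(n_models):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        # z-statistic for the sample mean, with the standard deviation known to be 1.
        z = (sum(sample) / n_obs) * math.sqrt(n_obs)
        if abs(z) > critical_z:
            hits += 1
    return hits / n_models


print(false_positive_rate(2000))
```

There is nothing to find in this data, yet about one model in twenty should still look 'significant'; scale that up to billions of candidate models and the spurious findings number in the millions.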

Everyone agrees that length is good.  It will always enable you to distinguish between hypotheses with more confidence.  To the new approach, width is also an obvious boon as it gives you even more possible relationships to play with.  However, to the traditional approach to statistical modelling, width is a big threat.  It means you will be able to estimate exponentially more possible relationships and end up with even more 'spurious correlations', most of which will lack theoretical foundations.

This, however, gives us our puzzle: length and width are fundamentally the same thing, namely more information.  How can this be good when added to the end of data, but bad when added to the side?

I think the answer is that traditional statistical and research doctrine is not designed with modern technology in mind.  Its advantages are that it can be done with paper and pen, can be taught as a set of replicable steps, and that it guards against bad science by discouraging 'data mining' under what amount to incredibly lax standards for 'significance'.

Taking a more robust approach eliminates this mismatch between doctrine and technology.  Particularly problematic is the idea of 'significance', which probably needs to be done away with in favour of a fully Bayesian approach incorporating prior probabilities.  The prior probability of any one of a billion possible models will be vanishingly small - compensating for the coincidentally-high information content of all the spurious correlations.      

Saturday, 4 October 2014

The Top Gear Coincidence Hypothesis

Top Gear are in trouble for having taken a car with an offensive number plate ('H982 FKL' - I suppose the right-hand stem of the 'H' looks like a '1') to Argentina.  The angry Argentines allege a deliberate insult.  Jeremy Clarkson proposes that it is a coincidence.  What are the probabilities that each is right?

As usual it pays to be as clear as possible about what the hypotheses and the evidence are.  I'll run with these:

Hypothesis 1: The number plate was deliberately chosen by one of the Top Gear production team to refer to the Falklands War;

Hypothesis 2: The number plate was not deliberately chosen by one of the Top Gear production team to refer to the Falklands War;

Evidence:  One of the cars taken to Argentina by Top Gear bore the number plate 'H982 FKL'

Denials and so forth are unlikely to have much information value (because of the Rice-Davies effect) so I'll forget about those.  And although we could describe the evidence differently - e.g. as 'One of the cars taken to Argentina had a number plate that could be interpreted as a reference to the Falklands War' - this would complicate matters without materially affecting the hypotheses' likelihood ratio.

So, in order of magnitude terms, what are the likelihoods here?

If the Top Gear team wanted to refer to the Falklands using a number plate, how likely would they be to choose that one?  This largely translates to: 'How many different number plates can you use to refer to the Falklands War?'  I suppose it could just about work with 'J982', 'L982', 'M982', 'N982' and other combinations such as 'GB82' or 'FW82'.  Similarly, you could come up with other combinations of the three letters that would do a similar job to 'FKL', like 'FKW', 'FLK', 'FK1' or whatever.  You might tentatively say that there are possibly 10-100 combinations that would do the job.  On the other hand, 'H982 FKL' might be the only number plate in existence along these lines.  I'd therefore say the likelihood of the evidence under hypothesis 1 is somewhere between 0.01 and 1, refinable if necessary through further research.

How likely is that number plate to have appeared under hypothesis 2, the 'coincidence' hypothesis?  This would require quite a bit of research into how number plates work, which is only slightly interesting.  Luckily City AM's Billy Ehrenberg has already had a go.  He gives it a whopping 100,000,000 to 1.  Although I don't entirely follow all of his reasoning, it's certainly not less than 1,000,000 to 1.  This gives us an evidential likelihood ratio of at least 10,000 and possibly 100,000,000 in favour of the 'conspiracy' rather than 'coincidence' hypothesis.

But what is the prior probability?  Ignoring the number plate, how probable is it that the Top Gear team decided to use number plates to refer to the Falklands War?  Given that there is a whole Wikipedia page on 'Top Gear Controversies', it cannot be reckoned terribly low.  Let's assume they intended a controversy with a probability of about 1 in 10, and, if they did, that there are perhaps 100 ways of doing it of which number plates are just one.  This would give us a prior of around 1 in 1000.

Crunching the numbers, I judge the probability of Hypothesis 1 to be above 90% and defensibly 99.999%.
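For anyone wanting to check the crunching, here is a minimal sketch of the calculation in odds form, using the prior of 1 in 1000 and the likelihood ratios of 10,000 and 100,000,000 estimated above.  The function name is illustrative.

```python
def posterior_probability(prior, likelihood_ratio):
    """Update a prior probability with a likelihood ratio via the odds form of Bayes' rule."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


prior = 1 / 1000  # 1 in 10 chance of intending a controversy, 1 in 100 ways of doing it

low = posterior_probability(prior, 10_000)        # most cautious likelihood ratio
high = posterior_probability(prior, 100_000_000)  # most generous likelihood ratio

print(f"{low:.3f} to {high:.5f}")
```

The cautious end of the range comes out at roughly 0.91 and the generous end at roughly 0.99999, which is where the 'above 90% and defensibly 99.999%' judgement comes from.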