## Tuesday, 19 May 2015

### Confidence and Probability Part 6: The 'Information' and 'Ignorance' Theories

In our last post, we looked at the theory that 'confidence' was related to uncertainty about a 'true' probability. This was the first of seven theories that have been put forward as to how confidence judgements should be interpreted. In this post, we'll look at the next two theories - 'Information' and 'Ignorance'.

These two theories are closely related, and suggest respectively that 'confidence' is related to the amount of information that has been used to form a judgement, or that it is related (inversely) to the amount of information missing from a judgement.

The 'Information' theory suggests that the difference between

"I think there's a 70% chance of rain, and I am very confident in that judgement"

and

"I think there's a 70% chance of rain, but am not very confident in that judgement"

should be interpreted as something like the difference between

"I have a lot of information suggesting there's a 70% chance of rain"

and

"I think there's a 70% chance of rain but I don't have much information to support that"

 Photo: Biswarup Ganguly

The 'Ignorance' theory, meanwhile, would parse them as something like the following:

"I've seen nearly all the available information, and it suggests there's a 70% chance of rain"

and

"I think there's a 70% chance of rain, but there's a lot more information I could get hold of that might give me a better idea".

As with all the theories of 'confidence', we'll put the 'Information' and 'Ignorance' ideas to three tests: internal coherence, usage by analysts, and decision-relevance.

#### Are the theories internally coherent?

Both the 'Information' and 'Ignorance' theories depend on a crucial claim: that information 'quantity' (either available, or missing) can be meaningfully separated from the probability judgements that the information helps support. This claim seems superficially plausible. These two statements do seem different in a relevant way:

1. Based on an article I read in a magazine, I'd say there's a 20% chance that William Shakespeare wasn't the author of the plays normally attributed to him.

2. I've read everything that's been published about Shakespeare, and I think there's a 20% chance that he wasn't the author of the plays normally attributed to him.

Let's assume our analysis is well-calibrated (that is to say, out of every five statements we attribute a 20% probability to, one turns out to be true). Is there a relevant difference between these two statements, in terms of the force with which I should act on the claim? It certainly feels like the latter claim is better supported - that it has more information in it, or perhaps lacks less information - but can we pin this feeling down to something measurable?

In the previous post, we touched on the mathematical field of information theory in the context of the claim that 'probability' does not refer to a measurable feature of systems, as such, but that it quantifies information, or, equivalently, uncertainty, about those systems. This view of information, first described by Claude Shannon in 1948, implies something very important: that you cannot meaningfully talk about a 'low information' 20% and a 'high information' 20%, any more than you can talk about someone being a 'short' 1.7m in height as opposed to a 'tall' 1.7m, because the information content of a judgement is encapsulated by the probability.

But might there be other ways of defining 'information' that don't entwine it so closely with probability, and which would accord with what the 'Information' and 'Ignorance' theories of confidence purport to describe? Possibly. Based on conversations with analysts, and on statements from survey participants, it seems that some of them have a model of 'available information' that treats it like a finite 'in-tray' of possibly-useful material, or 'raw data', which is then 'ingested' as evidence. In the 'in-tray' model, 'information' is measured not in terms of its impact on the probability of hypotheses of interest, but in terms of something like 'bytes ingested' versus 'bytes remaining'.

The in-tray model does make an interesting and possibly-useful distinction between probability, information used, and information remaining. It allows for 'useless information' that does nothing to change a probability. It also makes the distinction between information and ignorance meaningful: 'information' is how much of the in-tray you've ingested, while 'ignorance' is how much is left; it's possible to have low information and low ignorance, or high information and high ignorance - a situation that the Shannon measure doesn't admit.

We can loosely illustrate the differences between these interpretations as follows. Suppose there are three papers that might shed light on a hypothesis of interest - say, the efficacy of a certain drug in treating hayfever. The first, short, paper details a preliminary study that provides some evidence of effectiveness. The second, longer, paper outlines a large-scale clinical trial that provides strong evidence in its favour. The third, also lengthy, paper turns out to be of no relevance. The difference between the 'in-tray' and the Shannon interpretations of 'information' might look a bit like this:

The problem with the in-tray model is that the concepts it relies on are vulnerable to fairly-arbitrary changes in conditions. Inserting irrelevant data into the 'available data' bucket will raise one's ignorance without affecting one's judgements, while reading it will then add to 'information content' without affecting their evidential basis. We might be tempted to respond by changing our definition from 'information' to 'useful information' - but then we are back to square one, for how else can we define 'useful' information other than in reference to its effect on the probability of the things we are interested in?
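The vulnerability described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the page counts, the likelihood ratios and the 50% prior are hypothetical numbers chosen for the example (none come from the survey), and `Paper`, `posterior` and `in_tray` are made-up names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    name: str
    pages: int               # hypothetical 'in-tray' size
    likelihood_ratio: float  # hypothetical; 1.0 means the paper is irrelevant


def posterior(prior, papers_read):
    """Update a prior probability using the likelihood ratio of each paper read."""
    odds = prior / (1 - prior)
    for paper in papers_read:
        odds *= paper.likelihood_ratio
    return odds / (1 + odds)


def in_tray(papers_read, papers_unread):
    """Return (pages ingested, pages remaining): 'information' and 'ignorance'."""
    return (sum(p.pages for p in papers_read),
            sum(p.pages for p in papers_unread))


pilot = Paper("preliminary study", pages=5, likelihood_ratio=2.0)
trial = Paper("clinical trial", pages=40, likelihood_ratio=20.0)
irrelevant = Paper("irrelevant paper", pages=40, likelihood_ratio=1.0)

# The irrelevant paper sitting unread in the tray, then the same paper once read.
unread_case = (in_tray([pilot, trial], [irrelevant]),
               posterior(0.5, [pilot, trial]))
read_case = (in_tray([pilot, trial, irrelevant], []),
             posterior(0.5, [pilot, trial, irrelevant]))

print(unread_case)
print(read_case)
```

In this sketch the in-tray counts move from 45 pages read and 40 unread to 85 read and none unread, while the probability that the drug works is the same in both cases.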

In summary, the 'information' and 'ignorance' theories of confidence can only be made coherent if they rest on a concept of 'information content' that differs from the orthodox information-theoretical measure, which treats probability as a measure of information. But measures of information which divorce information from probability are vulnerable to seemingly-arbitrary distortion by 'useless' information.

#### Does the theory align with usage?

There is evidence that many analysts think of 'confidence' in terms of 'how much information' underlies a judgement. The responses to our survey, which invited analysts to define 'confidence', included the following:

"Confidence here means the extent to which, given your knowledge of the data you do have and (importantly) the known gaps, you assess that your statement/analysis is likely to be accurate."

"Confidence here means how well the probabilities are based on reliable statistics and good data samples. (Many of the supplementary bits of information are irrelevant because they do not change the available information.)"

"Strength of an opinion based on all available evidence, factoring in unknowns."

To the extent that the survey responses could be clearly classified, however, this interpretation was offered by only a small minority of respondents. One thing is clear, though: whatever analysts use 'confidence' to mean, they do not, on average, have in mind something related to the Shannon definition of 'information content'. The following chart shows the relationship between the information content implicit in the probabilities presented, and the mean subjective confidence scores survey respondents assigned to them.
If respondents' confidence scores were in some way capturing Shannon information, there would be a strong positive relationship between these two - and there isn't.

One of the survey questions was designed to test whether analysts thought of information in 'in-tray' terms. In it, analysts were presented with a set of situations which involved reading a number of papers about the effectiveness of a travel sickness drug. The scenarios varied by the number of papers that had been read, and the number of papers that had not been read. Crucially, in every case the relevant information (i.e. Shannon information) was the same - that 1000 people (in total, across all the papers that had been read) had been tested, and the drug was effective in 750 of those cases. The table below shows the mean subjective confidence scores (from 0-6) that were assigned by survey respondents in each of the high / low 'information' and 'ignorance' cases:
What these scores suggest is that analysts' confidence is affected not by information in the Shannon sense, but by information in the 'in-tray' sense - how much 'stuff' they had read, and how much 'stuff' they hadn't. The scores also indicate that 'information' was a more important driver than 'ignorance': reading 10 papers rather than 1 - even if the total data inside them were the same - raised scores significantly, but leaving papers unread had only a small effect.
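To see why the number of papers should not matter in the Shannon sense, here is a minimal sketch that pools the trial results before updating. It assumes a uniform prior over the drug's effectiveness and treats every paper as an independent sample from the same population. The split into ten papers of 100 patients each is a hypothetical breakdown, since the survey scenarios fix only the totals (1000 tested, 750 effective).

```python
from fractions import Fraction


def pooled_posterior_mean(papers):
    """Posterior mean effectiveness under an assumed uniform Beta(1, 1) prior.

    `papers` is a list of (number tested, number for whom the drug worked).
    """
    tested = sum(n for n, _ in papers)
    effective = sum(k for _, k in papers)
    return Fraction(effective + 1, tested + 2)


one_paper = [(1000, 750)]
ten_papers = [(100, 75)] * 10   # hypothetical split of the same total data

print(pooled_posterior_mean(one_paper))
print(pooled_posterior_mean(ten_papers))
```

Both calls return the same fraction, because only the pooled totals enter the calculation.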

These data may suggest that analysts are using an in-tray type model of 'information' - that if an analyst has read more things directed at a hypothesis, even if they are of limited relevance, their confidence goes up. This might indicate that analysts' perception of confidence is in some way related to their sense of having done 'due diligence' on an analytical problem - i.e. to their feeling of having 'done their job' - rather than to the amount of analytically-relevant information they've actually unearthed. There may, of course, be other explanations not tested by our survey.

#### Is it decision relevant?

Finally, if the 'information' and 'ignorance' theories of confidence were meaningful, would they be decision-relevant? Could knowing the confidence attached to a probability affect the optimal decision being made?

The answer is a qualified 'yes'. If 'confidence' communicates information content, and if that information content is not captured by the probability, then it might suggest to us whether we should act now (if there was already 'enough' information in the judgement) or collect and analyse more information (if there wasn't). On the other hand, merely knowing how much 'information' went into a probability - if this makes sense as an idea - wouldn't by itself tell you how valuable new information is likely to be, nor how much it would be expected to cost, both of which are key to deciding whether the next bit of information is worth investing in. We will look at these more-complex economic concepts later.

In the next post we'll examine the fourth theory of 'confidence' - the 'quality' theory, which posits that confidence is not reducible to a simple measure and instead captures summary judgements about the qualitative features of the evidence-base.

## Friday, 1 May 2015

### Confidence and Probability Part 5: The 'Uncertain Probability' Theory

(This is part 5 of a series of posts about probability and confidence that starts here.)

In the last post, we outlined the seven theories that have been put forward as candidates for what we mean when we talk about confidence in an analytical judgement. In this post, we'll look at the first theory - the 'Uncertain Probability' theory. This is the idea that there is a 'correct' probability, but that we might be unsure about its value. The more certain we are about the 'true' value, the higher our confidence will be. It's an intuitively-appealing theory that a significant proportion of survey respondents offered as a proposed interpretation. But it rests on a number of potentially-tricky suppositions.

As with all these theories, we'll apply three tests: internal coherence, alignment with usage, and decision relevance.

#### Is the theory internally coherent?

The problem is that if probability is a quantity of something, which we can measure to a higher or lower level of accuracy, it is one with some very strange properties. First, the 'correct' probability of a hypothesis seems to vary between people. If I've just tossed a coin, and looked at it, I know the probability of heads is either 0% or 100%. But if I haven't shown it to you or indicated the results in any way, you will quite rightly put a probability of 50% on that hypothesis. It's the same coin, but our probabilities are different.

Probability can also vary over time in ways that don't seem to involve the subject of the hypothesis changing in any way. Before the Curiosity Rover measured radiation levels en route to and on Mars, scientists could only put probabilities on whether these would be at acceptable levels to permit a manned mission. Now that we know that radiation exposure would be at the lower end of what was expected, we're more certain that the risk would be acceptable. Radiation levels haven't changed - but the probabilities we attach to them have.

What these examples suggest is that probability is not a feature of the world per se, or at least is not an objective feature of the subject of the hypothesis. The probability of a coin coming up heads isn't in the coin somewhere, and the probability of radiation levels for a trip to Mars being prohibitively dangerous didn't change because something to do with space radiation changed. If probability isn't a quantity 'out there' that we're measuring, what is it?

 Where is the 50% probability of heads?

##### Information

The element here that explains differences between probabilities - between people, or over time - is information. In the case of the coin, I had full information about it whereas you had no information. In the case of the Martian radiation, it was new information - not a change in the system per se - that made us revise our probability that a trip there would pose an acceptable radiation risk.

It's this insight - that probability and information are interrelated - that led to the development of 'information theory'. While working at Bell Labs in 1948, the mathematician Claude Shannon put forward the idea that probability could be used to quantify information. In doing so, he created a field of mathematics that underlies a vast range of phenomena and technology, including much of modern computing.

Under Shannon's approach, information is not simply related to probability - it's not 'one of the things' that affects our probability estimates - but is actually measured using it. The 'information content' of a piece of evidence is defined in terms of the probability of receiving it, as one of a number of possible 'signals'; 'Shannon entropy' is the average information content across all of those signals. The maths of information theory is too involved to cover here, but it does have an intuitive component. The key insight is that unusual or unexpected evidence will contain more information than unsurprising evidence, because it eliminates more possibilities.

Delaware: officially more surprising than California

For example, let’s say you’re interested in someone’s home address, and all you know is that they live somewhere in the US. You start by asking which state they live in. Before they answer, you know that it’s more likely they’ll say ‘California’, with its 38 million denizens, than ‘Delaware’, which has fewer than 1 million. For this exact reason, if they do actually live in Delaware, this will eliminate more possible addresses than if they live in California. ‘I live in Delaware’ is therefore a message with more information in it than ‘I live in California’. Shannon’s measure of information, which denominates it in ‘bits’, captures this insight in a precise mathematical way.
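The state example can be put into numbers with a short sketch. The information content of a message with probability p is -log2(p) bits. The California figure is the one quoted above; the US total (320 million) and the Delaware figure (0.9 million) are rough assumptions for illustration, and the sketch assumes the person is equally likely to be any US resident.

```python
import math


def surprisal_bits(probability):
    """Information content, in bits, of a message that had this probability."""
    return -math.log2(probability)


US_POPULATION = 320e6   # assumption: rough 2015 figure
CALIFORNIA = 38e6       # figure quoted in the text
DELAWARE = 0.9e6        # assumption: the text says only 'fewer than 1 million'

bits_california = surprisal_bits(CALIFORNIA / US_POPULATION)
bits_delaware = surprisal_bits(DELAWARE / US_POPULATION)

print(f"California: {bits_california:.2f} bits")
print(f"Delaware:   {bits_delaware:.2f} bits")
```

Under these assumptions 'I live in California' carries roughly 3 bits and 'I live in Delaware' roughly 8.5 bits; a 50:50 answer, by comparison, carries exactly 1 bit.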

What this implies, among many things, is that 'probability' isn't a thing 'out there' that we're trying to measure, as with other quantities; instead, probability is a thing 'in here' that summarises everything we know about a hypothesis. And so the concept of being uncertain about it is not easy to make work. There's no 'gap' between us and the information we have about a hypothesis (and therefore its probability) that could explain where the uncertainty could arise. And if we are tempted to say 'I'm not sure how much information I have' - well, this is strictly the same as having less information, because it means that there is a wider range of plausible hypotheses about how the evidence in front of us could have arisen.

Having said all this, there have been some attempts to put the idea of 'uncertain probability' into action. One of them is the system developed by Dempster and Shafer, which incorporates imprecise probability estimates. Another is the idea of 'Knightian Uncertainty', proposed by Frank Knight in 1921, which seeks to draw a distinction between 'measurable' and 'unmeasurable' uncertainty. Some have also advanced pragmatic grounds for expressing uncertainty about probability - because, for example, it discourages overconfidence in models, or because it is harder for non-specialist audiences to misinterpret. However, it is fair to say that these approaches are not widely accepted by probability theorists; an outline of them can be found here, and the comment thread gives a good sense of the debate about them.

##### The Frequentist Switcheroo

One particular interpretation of probability that sees it as a measurable feature of the world is 'frequentism', and it's important enough to merit special treatment here. Frequentism is the view that 'probability' can be equated with - i.e. is the same thing as - the frequency of an outcome in the 'long run'. The main alternative interpretation of probability is the Bayesian view, which is broadly the one set out above: that 'probability' measures your available information. Although frequentism became the dominant statistical approach in the first half of the last century, in the last few decades it's been largely driven out of polite society by the Bayesian interpretation. Not only is the Bayesian view almost certainly more coherent than frequentism (for the reasons set out above), it's also far more useful.

One question in our survey was designed specifically to capture analysts' attachment to the frequentist way of thinking. It was a fairly abstract question that put analysts in the following scenario:

"A box in front of you contains a large number of black balls and white balls, but you don’t know what their relative proportions are and you can’t look into the box. You have been asked to state the probability of the next ball being black. The information you have suggests the answer is 50%."

Analysts were then asked how confident they'd be in the statement "The probability of a black ball being drawn next is 50%" in four cases. In the first case, no balls had been drawn from the box. In the second case, two balls had already been drawn (with replacement). In the third, ten balls had been drawn, and in the fourth, one hundred balls had been drawn (again, with replacement). Importantly, in every case, exactly half of the balls drawn were black. In this question, analysts' responses were very strongly correlated with the number of balls drawn:

This suggests that in the context of this question at least, analysts are applying a 'frequentist' interpretation of probability, and surreptitiously substituting a different question for the one they were asked.  The question is about the following proposition:

A: The probability that the next ball will be black is 50%

and not:

B: Half the balls in the box are black

To hardcore frequentists, (B) is the same as (A).  To a Bayesian, they're very different: drawing increasing numbers of balls, assuming they are half black and half white, is evidence supporting B, but it does nothing for A, which is the proposition we actually care about in this scenario.
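The gap between (A) and (B) can be shown with a small Bayesian sketch. It assumes a uniform prior over the proportion of black balls in the box, which is one reading of 'the information you have suggests the answer is 50%' and is an assumption made for this example, not something the survey specified. The function names are invented for the illustration.

```python
from fractions import Fraction
import math


def predictive_black(n_drawn, n_black):
    """Proposition A: probability that the next ball is black (uniform prior assumed)."""
    return Fraction(n_black + 1, n_drawn + 2)


def posterior_sd_of_proportion(n_drawn, n_black):
    """Proposition B: spread of belief about the box's true proportion of black balls."""
    a = n_black + 1
    b = n_drawn - n_black + 1
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))


# The four survey cases: 0, 2, 10 and 100 balls drawn, half of them black.
results = {
    n: (predictive_black(n, n // 2), posterior_sd_of_proportion(n, n // 2))
    for n in (0, 2, 10, 100)
}

for n, (p_next, sd) in results.items():
    print(f"{n:>3} drawn: P(next is black) = {p_next}, sd of proportion = {sd:.3f}")
```

In all four cases the probability for the next ball is exactly 1/2. What the extra draws change is the spread of belief about the box's composition, which narrows as more balls are drawn.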

Frequentism is problematic for all sorts of reasons, and the debate between frequentists and Bayesians is so well-trodden that we won't cover it in detail here. A non-technical attack on frequentism can be found here.  But even if these views are wrong, and frequentism can be made coherent, for most analysts there is a much more significant obstacle to using a frequentist interpretation of probability, which is that it massively restricts the scope of probability in a way that would make it practically impossible to use.

The reason the frequentist interpretation is unsustainable for most real-world analysis is that it implies that statements like "climate change is almost certainly linked to CO2 emissions", "further Russian encroachment in Ukraine is unlikely", or "the new engine design will probably improve efficiency by around 2%" are literally meaningless. Because these are not repeatable experiments, words like 'almost certainly', 'unlikely' and 'probably' can't be given any interpretation at all. If you think these statements do mean something then you're almost certainly not a frequentist. (Of course, if you are a frequentist then that last statement is meaningless too.) So while their responses to the 'balls in a box' question might be coherent under a frequentist interpretation, the 'some analysts are frequentists' theory doesn't account for their reaction in these more-realistic kinds of scenario.

##### Summary

There are good grounds for thinking that the 'uncertain probability' theory is incoherent. Nevertheless, it would be unfair to present this as an entirely uncontroversial conclusion. The debate here is not about probability per se: the laws of probability have been accepted mathematically for several centuries, and their formalisation by Kolmogorov in 1933 has not been seriously challenged. What we are investigating here is instead the interpretation of probability - what probabilities actually mean. Under the Bayesian interpretation, uncertainty about probability is incoherent.  Under the frequentist interpretation - which is itself considered incoherent by most specialists - uncertainty about probability is possible, but only under very restricted circumstances involving repeatable experiments. So uncertainty about probability is either an incoherent idea, or it's applicable to such a narrow range of situations that it cannot account for confidence statements in most analytical circumstances.

#### Does the theory align with usage?

Despite being difficult to make coherent, many analysts think of their own confidence statements as expressing uncertainty about probability. Here are some of the responses from Aleph Insights' survey of analysts and their customers in late 2014:

"Confidence here (as I've used it) is an expression of my imprecise range would be for the stated probability as a odds for a fair bet (odds I'd be equally willing to accept to bet for or against)"

"How sure you are about your assessment. It could be thought of as a probability on a probability. i.e. how likely is it that your probability estimate corresponds to the true probability?"

"An expression of the range of expected answers (25% to 75% probability?) compared to my predicted answer - the narrower the range, the greater the confidence..."

In summary, many analysts do indeed have in mind something like 'certainty about the true probability' when expressing confidence about judgements.

#### Decision-relevance

The final test to which we will put the 'uncertain probability' theory is that of decision-relevance.  If uncertain probabilities were a thing, would they be relevant to a decision-maker?  In other words, is there some feature of a certain probability that will be different to that of an uncertain probability, of which we should take account?

The answer, regardless of which interpretation of probability one subscribes to, is 'no'. There are several ways of reaching this conclusion. In terms of observable outcomes, the properties of an 'uncertain probability' are entirely identical to those of a certain probability equal to its expected value. For example, let's suppose you have a coin that's certainly 50% likely to come up heads; by definition this will come up heads 50% of the time. Now let's suppose you select one of two coins at random: one comes up heads 60% of the time, and the other comes up heads 40% of the time. This time, the probability of heads is 'more uncertain'. (The scare quotes are here because, as we said above, this idea is probably incoherent.) Now, the probability of heads is... still 50% (50% × 40% + 50% × 60% = 50%). So in each case, you will toss a head exactly half the time. Any reason to take a course of action based on this probability will have exactly the same force in both circumstances.
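The arithmetic above can be checked with a short sketch that also prices a bet in both situations. The stakes (win 10 on heads, lose 10 on tails) are hypothetical, chosen only to show that the expected value is identical.

```python
from fractions import Fraction


def probability_of_heads(coins):
    """`coins` is a list of (chance of picking this coin, chance it lands heads)."""
    return sum(pick * heads for pick, heads in coins)


def expected_value(coins, payoff_heads, payoff_tails):
    """Expected return of a bet that pays out on a single toss."""
    p = probability_of_heads(coins)
    return p * payoff_heads + (1 - p) * payoff_tails


certain = [(Fraction(1), Fraction(1, 2))]
uncertain = [(Fraction(1, 2), Fraction(3, 5)),   # the 60% coin
             (Fraction(1, 2), Fraction(2, 5))]   # the 40% coin

print(probability_of_heads(certain), probability_of_heads(uncertain))
print(expected_value(certain, 10, -10), expected_value(uncertain, 10, -10))
```

Both setups give a probability of heads of 1/2 and an expected value of 0 for the bet.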

We can take a more intuitive approach.  There are two boxes in front of you, and you will randomly select one and choose a ball from it.  The first time you perform the experiment, both boxes have two white balls and two black balls.  The second time you perform the experiment, one box has three black balls and one white ball, and the other has one black ball and three white balls:

The first experiment seems to have a 'certain' probability of a black ball - 50%. The second experiment seems to have an 'uncertain' probability - either 25% or 75%. But crucially, in each experiment, we are choosing from eight possible balls, half of which are black and half of which are white. Our sense that one has a 'certain' probability and the other an 'uncertain' one does not, in fact, have any material bearing on the properties of the outcome, including the expected value of any decisions whose return depends in some way on that outcome.
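The 'eight possible balls' argument can be written out as a simple count. This sketch relies on the boxes being the same size and equally likely to be chosen, so that each of the eight balls is equally likely to be drawn; the string encoding of the boxes is just a convenience for the example.

```python
from fractions import Fraction
from itertools import chain


def black_share(boxes):
    """Share of black balls across all boxes (valid here because boxes are equal-sized)."""
    balls = list(chain.from_iterable(boxes))
    return Fraction(balls.count("B"), len(balls))


first_experiment = ["BBWW", "BBWW"]    # two black, two white in each box
second_experiment = ["BBBW", "BWWW"]   # 75% black in one box, 25% in the other

print(black_share(first_experiment), black_share(second_experiment))
```

Both experiments give 1/2, matching the point that the 'certain' and 'uncertain' setups are indistinguishable in terms of the outcome.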

#### In Summary

The 'uncertain probability' theory of confidence is probably incoherent, although it does accord with some analysts' usage.  It is not decision-relevant, so if it is true, there is no need to communicate confidence to customers.

In the next post, we'll look at the next two ideas: the 'Information' and 'Ignorance' theories of confidence.