Sunday, 5 October 2014

The Width / Length Puzzle in Big Data

I was recently sent a link to this (nearly year-old) talk by Harvard's Sendhil Mullainathan.  He discusses the difference between the 'length' and the 'width' of data.  Most discussions of big data focus on length: we have more data points.  Professor Mullainathan considers what happens when you have more 'width' - i.e. when you have access to more information about each of the data points you have - and makes some profound speculations about what this means for the identification of hypotheses by machines.

If you were studying the relationship between, for example, heart disease and lifestyle, adding more people to your dataset would increase its length.  Measuring more things about those people - what time they eat, how much exercise they do and at what times, what their sleep patterns are and so on - would increase its width.  The new proliferation - near-ubiquity - of mobile sensors offers a great deal more data width.  This presents both a potential problem, and a puzzle.

The traditional approach to statistical inference involves the researcher 'suggesting' a type of hypothesis which they think (using expertise, common sense, or a whim) might be generating the data.  This could be something like 'heart disease is partly affected by smoking and BMI'.  Then the algorithms go to work: they find a 'best' model with which the data are most consistent, and also check whether the data are explicable by chance.  In the old days running these algorithms involved paper, pens, and slide rules.  Now of course they happen inside analytics software.

But the explosion of processing power has in the last couple of decades given us a new way of doing business.  We no longer have to propose a type of model by hand, relying on expertise and so on.  As well as using machines to test particular hypotheses, we can now get them to generate hypotheses too.  There isn't much more to this than the statistical equivalent of brute force: we try every possible combination of our explanatory variables (e.g. hours of sleep, cigarettes smoked, BMI etc.), then check each combination to see whether it explains the outcome (here, the presence of heart disease) well.  This means we can test vast numbers of possible hypotheses without needing to know anything about the subject-matter.
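To make the brute-force idea concrete, here is a minimal sketch in Python.  The variable names are made up for illustration, and the code only enumerates the candidate models (every non-empty subset of the explanatory variables); a real search would also fit and score each one.  The loop at the end shows how quickly the number of candidates grows as the data get wider.

```python
from itertools import combinations


def candidate_models(variables):
    """Yield every non-empty subset of the explanatory variables."""
    for size in range(1, len(variables) + 1):
        for subset in combinations(variables, size):
            yield subset


# Hypothetical explanatory variables for the heart disease example.
variables = ["hours_of_sleep", "cigarettes_smoked", "bmi", "exercise_minutes"]

models = list(candidate_models(variables))
print(len(models))  # 2**4 - 1 = 15 candidate models

# Each extra variable doubles the number of subsets to try.
for width in (4, 10, 20, 30):
    print(width, 2**width - 1)
```

With only thirty measured variables there are already more than a billion subsets to test, and that is before considering interactions or transformations of the variables.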

This, then, is the problem: the old way of doing business strongly militates against finding spurious hypotheses - those which explain the data only by coincidence.  If you tested just three or four models, each of which had strong theoretical foundations, and found one which passed relevant tests, you could feel confident about having found something interesting.  If you test three or four billion models, you know that a proportion of them will come out looking significant by chance alone: under the conventional 5% significance threshold, roughly 5% of the models with no real relationship behind them.
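The 5% figure is easy to check by simulation.  The sketch below is an illustration under stated assumptions rather than a real analysis: it generates an outcome and 2,000 explanatory variables that are all pure noise, then tests each variable's correlation with the outcome at the conventional 5% level.  The sample sizes, the random seed, and the use of the Fisher z approximation for the test are all choices made here for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def looks_significant(r, n, z_crit=1.96):
    """Two-sided test at the 5% level using the Fisher z approximation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return abs(z) > z_crit


rng = random.Random(0)   # fixed seed so the run is repeatable
n_people = 100           # 'length' of the dataset
n_variables = 2000       # 'width': every variable is unrelated noise

outcome = [rng.gauss(0, 1) for _ in range(n_people)]

hits = 0
for _ in range(n_variables):
    noise = [rng.gauss(0, 1) for _ in range(n_people)]
    if looks_significant(pearson(noise, outcome), n_people):
        hits += 1

false_positive_rate = hits / n_variables
print(false_positive_rate)  # should land close to 0.05
```

None of these variables has anything to do with the outcome, yet around one in twenty passes the test.  Scale the width up to billions of candidate models and the absolute number of spurious 'discoveries' becomes enormous.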

Everyone agrees that length is good.  It will always enable you to distinguish between hypotheses with more confidence.  To the new approach, width is also an obvious boon as it gives you even more possible relationships to play with.  However, to the traditional approach to statistical modelling, width is a big threat.  It means you will be able to estimate exponentially more possible relationships and end up with even more 'spurious correlations', most of which will lack theoretical foundations.

This, however, gives us our puzzle.  Length and width are fundamentally the same thing: more information.  How can this be good when added to the end of the data, but bad when added to the side?

I think the answer is that traditional statistical and research doctrine is not designed with modern technology in mind.  Its advantages are that it can be done with paper and pen, can be taught as a set of replicable steps, and that it guards against bad science by discouraging 'data mining' under what amount to incredibly lax standards for 'significance'.

Taking a more robust approach eliminates this mismatch between doctrine and technology.  Particularly problematic is the idea of 'significance', which probably needs to be done away with in favour of a fully Bayesian approach incorporating prior probabilities.  The prior probability of any one of a billion possible models will be vanishingly small - compensating for the coincidentally high information content of all the spurious correlations.
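A rough sketch of how the prior does this compensating work, using the odds form of Bayes' rule (posterior odds = prior odds x Bayes factor).  The Bayes factor of 20 is an arbitrary illustrative number standing in for 'the data favour this model fairly strongly', not something derived from any real dataset; the two priors correspond to one of four theory-backed models and one of a billion machine-generated ones.

```python
def posterior_probability(prior, bayes_factor):
    """Update a prior probability using the odds form of Bayes' rule."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)


bayes_factor = 20  # illustrative assumption: same strength of evidence in both cases

# One of four models, each with strong theoretical foundations.
theory_backed = posterior_probability(1 / 4, bayes_factor)

# One of a billion models generated by brute force.
brute_force = posterior_probability(1e-9, bayes_factor)

print(round(theory_backed, 3))  # about 0.87
print(brute_force)              # about 2e-08
```

The same evidence that makes a theory-backed model quite credible leaves a brute-force model almost as improbable as it was before, which is the sense in which the prior offsets the flood of spurious correlations.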
