With the shortcomings of Google Flu Trends exposed last month, many have jumped at the chance to critique ‘big data’. A recent NY Times article on the subject has been widely circulated.
Rather than rattling off a grocery list of issues, this FT Magazine article provides some insight into the core problems with ‘big data’.
Most notably, it draws a distinction between ‘big data’ and ‘found data’:
But the “big data” that interests many companies is what we might call “found data”, the digital exhaust of web searches, credit card payments and mobiles pinging the nearest phone mast… Such data sets can be even bigger than the [Large Hadron Collider] data – Facebook’s is – but just as noteworthy is the fact that they are cheap to collect relative to their size, they are a messy collage of datapoints collected for disparate purposes and they can be updated in real time.
The ease and inexpensiveness of ‘found data’ lead to “theory-free analysis of mere correlations”, which often breaks down thanks to those old statistical curmudgeons: sampling error and sampling bias.
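The sampling-bias point is easy to make concrete with a toy simulation (all the numbers below are invented for illustration, not drawn from the FT article): a huge ‘found’ sample that over-represents heavy internet users can miss the true population rate by a wide margin, while a far smaller random sample lands close to it.

```python
import random

random.seed(0)

# Hypothetical population of 1,000,000: 10% are heavy internet users
# who support some policy at 80%; the other 90% support it at 40%.
population = [
    (True, random.random() < 0.8) if i < 100_000 else (False, random.random() < 0.4)
    for i in range(1_000_000)
]
true_rate = sum(supports for _, supports in population) / len(population)  # ~0.44

# A modest simple random sample: unbiased, so it tracks the truth.
srs = random.sample(population, 10_000)
srs_rate = sum(supports for _, supports in srs) / len(srs)

# 'Found data': heavy users are 50x more likely to leave digital exhaust,
# so they dominate the sample even though it is several times larger.
found = [p for p in population if random.random() < (0.5 if p[0] else 0.01)]
found_rate = sum(supports for _, supports in found) / len(found)

print(f"true:  {true_rate:.3f}")
print(f"SRS:   {srs_rate:.3f} (n={len(srs):,})")
print(f"found: {found_rate:.3f} (n={len(found):,})")
```

The ‘found’ sample is roughly six times larger than the random one, yet its estimate is badly skewed: more data collected the same biased way does not fix bias, it just shrinks the error bars around the wrong answer.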
The whole article is well worth the time for anyone seeking some insight into ‘big data’.