This story was written by Keith Dawson for UBM DeusM’s community Web site Business Agility, sponsored by IBM. It is archived here for informational purposes only because the Business Agility site is no more. This material is Copyright 2012 by UBM DeusM.

When Big-Data Becomes Toxic

Does more frequently sampled data always contain more noise and less signal? Nassim Taleb seems to believe so.

Author Nassim Taleb argues that more big-data is not always better, and can even be harmful. But we can't fully evaluate his argument until his book is published in November.

It will come as no surprise to members of this community that the big-data phenomenon, already huge, is gathering momentum. For example, this week brings news that NIST has called a big-data workshop to explore "key national priority topics" in support of the Federal government's recently announced R&D initiative for big-data. And Intel and MIT have just announced that the latter's Computer Science and Artificial Intelligence Laboratory will host the Intel Science and Technology Center for big-data.

A dissenting voice to the big-data chorus has been raised, and it's attracting a fair bit of commentary around the Internet. A blogger in Ottawa has published a passage that he claims was written by Nassim Taleb, author of the influential book The Black Swan. Taleb is a professor, an expert on randomness, a Wall Street heretic, and a best-selling author. The passage in the Farnam Street Blog linked above is said to be excerpted from Taleb's next book, Antifragile, to be published in November.

The blogger in question keeps his identity secret; he has not responded to an inquiry about how he came by the material purportedly written by Taleb. Assuming the passage is genuine, let's have a look at its claims.

In business and economic decision-making, data causes severe side effects -- data is now plentiful thanks to connectivity; and the share of spuriousness in the data increases as one gets more immersed into it. A not well-discussed property of data: it is toxic in large quantities -- even in moderate quantities... The more frequently you look at data, the more noise you are disproportionally likely to get (rather than the valuable part called the signal); hence the higher the noise to signal ratio.

The example Taleb gives is of sampling data once per year; assume that the data so gathered has a 1-to-1 ratio of noise to signal. Now increase the sampling to daily and the ratio goes to 95-to-5 in favor of noise, Taleb claims. With hourly sampling, you get 200 times as much noise as signal.
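
The excerpt doesn't show the arithmetic behind these ratios, but a standard way to make the intuition concrete is to model each observation as drift (the signal) plus random volatility (the noise). Over an interval of t years the drift contributes roughly in proportion to t, while the volatility contributes in proportion to the square root of t, so the noise-to-signal ratio grows as the sampling interval shrinks. The Python sketch below is one illustration of that reading, not Taleb's own calculation; the parameters MU and SIGMA are assumed values, chosen so that yearly sampling gives the 1-to-1 ratio in his example.

    import math

    # Illustrative sketch (one reading of the claim, not Taleb's own math):
    # model each observed change over an interval of t years as
    #   drift (signal) ~ MU * t   plus   volatility (noise) ~ SIGMA * sqrt(t),
    # so the noise-to-signal ratio grows like 1/sqrt(t) as t shrinks.

    MU = 0.10     # assumed annual drift -- the "signal"
    SIGMA = 0.10  # assumed annual volatility -- the "noise"
                  # MU == SIGMA makes the yearly ratio 1-to-1, as in the example

    for label, t in [("yearly", 1.0),
                     ("daily", 1.0 / 365),
                     ("hourly", 1.0 / (365 * 24))]:
        signal = MU * t
        noise = SIGMA * math.sqrt(t)
        print(f"{label:>6}: noise-to-signal ratio ~ {noise / signal:.0f}-to-1")

Under these assumptions the ratio works out to about 19-to-1 for daily sampling and 94-to-1 for hourly -- the same direction as Taleb's figures, though his exact numbers depend on assumptions the excerpt doesn't spell out.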

It doesn't seem possible to evaluate Taleb's argument based solely on the passage excerpted in the Farnam Street Blog; supporting threads of argument point to passages elsewhere in the book. This hasn't stopped bloggers around the Net from arguing for or against its validity.

The academic argument for the usefulness of big-data -- the bigger the better, and Web-scale best of all -- is well framed by the work of Google researcher Peter Norvig and colleagues: The Unreasonable Effectiveness of Data (PDF). Here is a video of Norvig arguing for big-data. He overreaches a bit with the implication that machine learning might replace the scientific method as a means of increasing man's knowledge about the world.

Big-data isn't going away. We'll need to learn how to handle it with sufficient care, avoiding the data paradoxes that our blogger Eric Christiansen describes. We'll need to develop higher-level tools to smooth the tasks of big-data cleaning and analysis. Taleb's theories, if they pan out, might add to our knowledge of big-data, but they are not going to derail it.