Counterpoint: Defining Big Data
About a month ago, I had the opportunity to attend the Big Data in Finance Day Chicago event at the Microsoft Technology Center Chicago. One of my many observations was that the industry apparently has not agreed on a definition of "Big Data". The best definition that I have personally heard, apart from this event:
- "When the size of the data itself becomes part of the problem"
The speakers at this event gave several:
- "Any data set that is too big or too cumbersome to work with in Excel, or takes too long to calculate."
- "Any data set that has you asking yourself the question, 'Does this belong in Hadoop?'"
- "Big Data = transactions + interactions + observations"
- "Big Data is not Hadoop"
The last definition listed above is not really a definition of Big Data, but I list it here because it seeks to define what Big Data is not, and sometimes narrowing down the options helps. It was also clear from this event, as it is with so many vendors, that implementations of Hadoop go hand in hand with working on Big Data problems.
It is not hard to argue that Big Data is not Hadoop. Hadoop is a solution, not a problem. And besides, Hadoop does not solve all problems in the Big Data space. So what is Big Data? Let's look at the first definition provided at this event. I have used Excel frequently in the past, but every time I need it for any nontrivial task, I typically need to revisit its limitations.
Excel's limitations have eased over time, making it possible to work with more data. The limitations for Excel 2003, for example, cap each worksheet (a workbook is a collection of worksheets) at "65,536 rows by 256 columns". The limitations for Excel 2010, however, make it clear that much more data can be stored in each worksheet than in the earlier release: "1,048,576 rows by 16,384 columns".
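As a rough illustration of the "does it fit in Excel?" test, here is a minimal Python sketch; the function name and structure are my own, not from any speaker at the event, and the limits are the published worksheet sizes quoted above:

```python
# Worksheet size limits by Excel release: (max rows, max columns)
EXCEL_LIMITS = {
    "Excel 2003": (65_536, 256),
    "Excel 2010": (1_048_576, 16_384),
}

def fits_in_excel(n_rows, n_cols, version="Excel 2010"):
    """Return True if a data set of the given shape fits in one worksheet."""
    max_rows, max_cols = EXCEL_LIMITS[version]
    return n_rows <= max_rows and n_cols <= max_cols

# A 2-million-row data set overflows even an Excel 2010 worksheet
print(fits_in_excel(2_000_000, 50))              # False
print(fits_in_excel(500_000, 50, "Excel 2003"))  # False: too many rows for 2003
print(fits_in_excel(50_000, 50, "Excel 2003"))   # True
```

By this definition, of course, the Big Data threshold moves every time Microsoft raises the worksheet limits, which is part of what makes the Excel-based definition unsatisfying.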
But is this really the definition of Big Data? I suppose 16,384 columns is a large number of columns, but 1,048,576 rows is not that large a number of rows, and I have typically seen the need for a far greater number of rows than I have for columns. It is very possible that the speaker summoned the Excel comparison because many business users are familiar with the tool, but I am not sure that this is the right approach.
The "Big Data = transactions + interactions + observations" equation apparently originated with Hortonworks. On the surface, I am not particularly fond of this definition, and the source unfortunately does not explain it well. He gives examples of what he considers transactions, interactions, and observations, but these areas overlap. For example, it is understandable that interactions might be seen in light of human users, but human users can be the source of transactions as well.
The main problem with this equation is that it does not indicate a threshold in terms of the amount of data that needs to be involved to reach Big Data status. For example, if I have a set of data that involves 2 transactions, 1 interaction, and 3 observations, is the set of data in the realm of Big Data? Perhaps what is not explained is that this equation is not intended to be mathematical, but if this is the case, the definition does not really become any clearer.
Apart from the definition cited at the beginning of this post, which I consider the best definition of Big Data, the definition that "Any data set that has you asking yourself the question, 'Does this belong in Hadoop?'" sits well with me too. This definition points to the size of a data set, which can be assumed to be larger than what alternative tools can handle, so the spirit of this definition is similar in that respect.
In the Review section of "The Wall Street Journal" this past weekend, I happened across an essay by Holly Finn entitled "New Gumshoes Go Deep With Data". Finn mentions an interesting quote by Peter Thiel: "Most of 'Big Data' is a fraud, because it is really 'dumb data'. For the most part, we would need something like artificial intelligence to turn the 'dumb data' into 'smart data', and the reality is that we are still pretty far from developing that sort of artificial intelligence."
In reading the rest of the essay (Peter Thiel is quoted about two-thirds into the essay), I was hoping that this quote would be saved somehow, but alas it was not. Is it possible for data to be smart or dumb? Does data in itself have intelligence? Later in the article, in mentioning Kaggle, the water gets even murkier as Finn mentions the world's best data scientists analyzing "daunting sets of information".
Perhaps this is the source of at least a portion of the confusion surrounding what Big Data is and is not. Data in itself is not smart or dumb. Data does not have intelligence. Data is not information. Data that has been properly cleansed can be the source of valuable information, but it does not provide information in itself. If the vast quantities of data that have become available in the world were information in the raw ready to be plucked to solve the world's problems, there would be no need for the statistician, and there would be no need for the data scientist.