What is Big Data? (November 6, 2014)
Quite recently, I overheard a project colleague of mine having a discussion with someone on the phone. Somewhat abruptly, he covered the mouthpiece of the phone and whispered a question to me: "Erik, are we doing big data on this project?" My short answer, without thinking too much about whether the term is a verb or a noun, was that, no, we were not. The immediate reply: "So what is big data?" In response, I gave the best concise answer that I have ever heard: "Big data is when the size of the data itself becomes a part of the problem."
Depending on your background and the circles you frequent, this may or may not be an obvious answer. However, as the term "big data" has increasingly permeated mainstream culture over the last couple of years, its meaning has evolved, and it has often taken on different meanings depending on the audience. Although the white paper does not explicitly use the term, the generally accepted first definition came from Doug Laney at META Group back in 2001, before the company was acquired by Gartner. This definition essentially discusses the increased volume, velocity, and variety of data, or "the three Vs."
Since then, other definitions have appeared with the same number of Vs but different components, and some have added additional Vs, all of which nod to this original definition. Others have proceeded in different directions, such as the one from Hortonworks, which describes big data as being composed of transactions, interactions, and observations. However, this definition does not communicate any type of threshold in terms of the amount of data that needs to be involved to reach big data status. Apart from the definition cited at the beginning of this post, the one that rubs me the right way describes big data as any data set that has you asking yourself whether it belongs in Hadoop. This definition points to the size of a data set, which can be assumed to be larger than what alternative, typically traditional, tools can handle, so its spirit is similar in this respect. Now that we have defined "big data" as a noun, to some extent, it seems reasonable to discuss the broader ecosystem in which the data itself resides, and the types of projects involved, which are now seemingly included in some definitions of the term.
The media bombards us on a daily basis with stories of how big data is changing the world. But what is it about big data specifically that is causing this reaction? Much of the time, these types of case studies are actually talking about the eventual use of analytics on data to solve problems. There is now more data out there than ever before, and it is growing exponentially. But is it the quantity of data itself that is solving these problems, or the approach to solving them? A comment in the introduction to a book called "Enterprise Analytics", written by a highly regarded author in this space, Thomas Davenport, points in the right direction. The last two words in the subtitle of this book, "Optimize Performance, Process, and Decisions Through Big Data", seemingly imply that the book is about solving big data problems through analytics. But what's this? The author states that although this book occasionally refers to big data, data scientists, and other issues related to the topic, it is based primarily on small data analytics, because many of the ideas from traditional analytics are highly relevant to big data analytics.
So here we have it. The main differentiator between small data and big data is the nature of the data itself. Big data projects involve different technologies, but if it were not for the volume, velocity, and variety of data involved, the conversation would just revolve around small data analytics. This raises the question of when the threshold between these two realms is crossed, and when to consider adopting different technologies as the data your firm processes evolves.
In my next blog post, I will delve into these areas a bit as they relate to the use of traditional and non-traditional databases.