Big Data's Got a Problem, But It Isn't Technology
The term big data is used by marketers and IT professionals alike, often in a haphazard and incorrect way. In this article we'll cover what big data really is, and what's just marketing fluff.
Big data is facing a big problem these days, and interestingly enough, it has nothing to do with technology. Nope, this is a public relations problem, wherein big data is a bit like Tom Cruise’s infamous couch-jumping antics on Oprah: Everyone was talking about it, but most people had no idea what it meant (and the rest probably didn’t care). For celebrities, obscure hype can be a welcome jackpot. When it comes to business and technology, however, buzzwords like big data don’t always bridge the gap between the CTO who wants to implement big data and the CEO who wants to know why.
A full definition of big data may still be up for debate, but what no one’s arguing about is that big data is getting bigger by the day, with corporate data exploding year over year and social media interactions stretching into the hundreds of millions per day. And as businesses of all kinds become increasingly digital, the amount of data out there is set to get bigger still. That’s why understanding how big data can help is so important. So let’s take a look at how big data might be defined - and why nailing down that definition is becoming increasingly valuable to businesses of all sizes. (Follow the online conversation around big data by checking out the Big Data Experts to Follow on Twitter.)
What Is Big Data?
Some just call any situation with "lots" of data big data. This is incorrect. While a large volume of information is part of the definition, it is incomplete. People have been processing large volumes of data for decades. Does this mean your 10 GB database from the '90s was big data because it seemed like a lot at the time?
I think we all know the answer to that question. So what, then, draws the line between a lot of data and big data? This concept was best explained by Doug Laney back in 2001 - yeah, sorry, big data isn't new! He referred to the "3 V’s" of big data: volume, velocity and variety. These V’s characterize the different aspects of big data and also represent its key challenges. In other words, they’re what anyone who attempts to implement big data must contend with. This framework also helps explain the types of software and technology required to address these challenges. Let’s look at each one in turn. (Get more insight on the 3 V's in Today's Big Data challenge Stems from Variety, Not Volume or Velocity.)
Volume
When it comes to making inferences about a group of people - in this case, a business's clients or a product’s consumers - sample size is key; the bigger the sample, the easier it becomes to find patterns in data and make generalizations about a group’s preferences, behaviors or other important metrics. In the past, gathering and crunching this much data just wasn’t possible. Now, however, as technology is increasingly able to process larger amounts of data, it’s creating value by analyzing it through what is known as big data analytics.
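To put a rough number on why sample size matters, here is a minimal sketch in Python. It assumes a simple random sample and uses the textbook normal-approximation margin of error for a proportion; the function name `margin_of_error` and the 50 percent preference rate are made up for illustration.

```python
import math


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion p observed in n samples.

    Assumes a simple random sample (normal approximation).
    """
    return z * math.sqrt(p * (1.0 - p) / n)


# Hypothetical case: half of the sampled customers prefer a product.
for n in (100, 10_000, 1_000_000):
    print(f"n={n:>9,}  margin of error = {margin_of_error(0.5, n):.4%}")
```

With 100 customers the estimate is only good to within roughly 10 percentage points; with a million it tightens to about a tenth of a point. That is the statistical payoff of volume, as long as the sample is representative.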
The biggest challenge when it comes to dealing with volume is that it means breaking away from conventional relational databases and moving toward solutions like massively parallel processing or Apache Hadoop, a platform that distributes data storage and processing across a cluster of servers. (Learn more about analytics platforms in this webinar, Hot Technologies of 2012: Analytics Platforms.)
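The idea behind spreading work over many servers can be sketched in a few lines of Python. This is a toy, single-machine illustration of the split, map and reduce pattern, not Hadoop's actual API; the function names and the sample purchase records are hypothetical.

```python
from collections import Counter
from functools import reduce


def split_into_shards(records, num_workers):
    """Deal records out round-robin, one shard per (pretend) server."""
    return [records[i::num_workers] for i in range(num_workers)]


def map_shard(shard):
    """Each server counts purchases in its own shard only."""
    return Counter(record["product"] for record in shard)


def count_purchases(records, num_workers=3):
    shards = split_into_shards(records, num_workers)
    # On a real cluster these calls would run in parallel on separate machines.
    partial_counts = [map_shard(shard) for shard in shards]
    # The reduce step merges the small per-server results into one answer.
    return reduce(lambda left, right: left + right, partial_counts, Counter())


purchases = [
    {"product": "book"}, {"product": "lamp"}, {"product": "book"},
    {"product": "tent"}, {"product": "book"}, {"product": "lamp"},
]
print(count_purchases(purchases))
```

The point of the design is that no single machine ever has to hold or scan the whole data set; only the compact partial results travel between servers.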
Velocity
Think about some of the companies that are considered forerunners in big data, such as Google and Facebook. Clearly, these companies have a lot of volume in terms of digital data, but the rate at which that data is being formed is also mind-blowingly fast, and in many cases, it’s accelerating. In August of 2012, Facebook revealed that its system was processing 2.5 billion pieces of content - and more than 500 terabytes of data - each day.
Velocity is all about how quickly data can be captured and crunched, because the quicker results are available, the faster companies can respond to them. In some business cases, even a minute would be completely unacceptable - the speed of turnaround is measured in seconds (or fractions of a second). A great example of this need for speed can be found in e-commerce. Think about how Amazon.com can take a customer's purchase, and by the time the confirmation screen renders, give them a customized recommendation for new products to buy. That sort of instantaneous processing is now the accepted norm. Velocity, therefore, is a challenge in big data because if data can't be crunched quickly enough, it may not be useful. (Read more in Big Data: How It's Captured, Crunched and Used to Make Business Decisions.)
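One common way to get answers within a fraction of a second is to do the heavy lifting as each event arrives, so that the answer is just a lookup when someone asks for it. The Python sketch below illustrates that idea with a simple "bought together" index. It is an assumption-laden toy, not a description of how Amazon or any other retailer actually builds recommendations; the class name `CoPurchaseIndex` and the sample orders are invented.

```python
from collections import Counter, defaultdict


class CoPurchaseIndex:
    """Keeps running counts of which items are bought together."""

    def __init__(self):
        self._pairs = defaultdict(Counter)

    def record_order(self, items):
        """Update the counts incrementally as each order streams in."""
        unique = set(items)
        for item in unique:
            for other in unique:
                if other != item:
                    self._pairs[item][other] += 1

    def recommend(self, item, k=3):
        """Return up to k items most often bought with `item` (a fast lookup)."""
        counts = self._pairs.get(item, Counter())
        return [name for name, _ in counts.most_common(k)]


index = CoPurchaseIndex()
index.record_order(["tent", "stove"])
index.record_order(["tent", "stove"])
index.record_order(["tent", "lamp"])
print(index.recommend("tent"))
```

Because the counts are updated on every order, nothing has to be recomputed from scratch while the customer waits for the confirmation page.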
Variety
If only data always presented itself as uniform, orderly and ready for processing in a relational database. However, the more data an organization collects, the more likely it is to come in different forms, such as text, images or sensor data. On the Web, different browsers, software and user settings can also lead to the collection of inconsistent data. Sure, you could clean things up and keep what’s useful, but big data generally aims to keep everything, which makes the variety of data a huge challenge in terms of setting up big data architecture. As a result, big data calls for more agile, less rigidly structured databases to extract and store diverse data. For those looking to implement big data infrastructure, this really means delving into some new and intimidating technologies, and putting a lot of work into making such diverse data useful.
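To make "keep everything, whatever its shape" a little more concrete, here is a minimal Python sketch that wraps each incoming item in a common envelope instead of forcing it into fixed table columns. The envelope fields (`kind`, `payload` and so on) and the sample inputs are assumptions chosen for illustration, not a standard format.

```python
import json
from typing import Any


def normalize(raw: Any) -> dict:
    """Wrap any input in a common envelope without discarding the original."""
    if isinstance(raw, bytes):
        # Images and other binary blobs: keep the bytes, note the size.
        return {"kind": "binary", "size": len(raw), "payload": raw}
    if isinstance(raw, str):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            parsed = None
        if isinstance(parsed, dict):
            # Sensor readings and similar records, whatever fields they carry.
            return {"kind": "structured", "fields": sorted(parsed), "payload": parsed}
        # Free text such as reviews or social media posts.
        return {"kind": "text", "length": len(raw), "payload": raw}
    return {"kind": "unknown", "payload": raw}


incoming = [
    '{"sensor": "dock-4", "temp_c": 21.5}',
    "Great product, fast shipping",
    b"\x89PNG\r\n",
]
for item in incoming:
    print(normalize(item)["kind"])
```

Nothing is thrown away here; the envelope simply records enough about each item that later processing can decide what to do with it.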
A Big Definition for a Big Challenge
In summary, think of big data as data that is unstructured and is therefore difficult to process using traditional database architectures. The way it comes at you is a bit like drinking from a fire hose, which is why the 3 V's model does such a great job of describing and defining it.
To be clear, some quibble with this and say that big data is still ill defined. In reality, it’s more that big data, as a concept, is too big and too complex to be encapsulated by a single term. Edd Dumbill, program chair for the O’Reilly Strata Conference, describes big data as "data that exceeds the processing capacity of database systems." That simple, concise definition says it all - at least in theory. In practice, the challenges that must be overcome in big data are so much more complicated.
As Marc Andreessen put it in an August 2011 piece for The Wall Street Journal, "all of the technology required to transform industries through software finally works and can be widely delivered at global scale." That’s generated a new need for the statistical approach, systems thinking and machine learning that come along with big data. So whatever the definition, it's clear that big data is one of the most important opportunities in IT.