When I started this article, I planned to list the different types of big data platforms. But after three days of attempting to corral all the different big data offerings (relational versus non-relational, SQL versus NoSQL and database versus framework) into some semblance of order, I decided to avoid that mess.
To add insult to injury, I had hoped to introduce the person who coined the term "big data" as part of the article. But I can’t even do that, because there is no agreed-upon answer. In fact, there’s a full-blown research project looking into who originally came up with the term. Instead, I'm going to take a look at some of the key ways big data is used. That's far more important, and it's more interesting and surprising than you might think.
How It Happened
Analysts using traditional data mining have been manipulating data for years. These same analysts are now finding it difficult to cope with the amount and the variety of data being saved by businesses, private organizations and government agencies.
Enter big data, the next evolutionary step in data mining. Big data was designed to handle the massive databases and myriad types of data being created in today’s digital world. If "massive" has you thinking about Google and all the data it collects, you would be in the ballpark. What may surprise you is that Google is only fourth on the Top Ten List of the world’s largest databases. As of January 2014, the World Data Centre for Climate topped the list with 220 terabytes of data, and it’s anyone’s guess as to the size of databases controlled by certain government agencies.
Of course, big data took off because it makes it possible to manipulate vast quantities of dissimilar data, and discover amazing — and amazingly detailed and personal — things. John Sumser, HR industry analyst, provides the following example:
"Today we create hypotheses and collect data. Tomorrow we’ll be doing the inverse. The constant, steady accumulation of data will enable us to look at data before we form questions. That means we’ll be getting answers to questions we didn’t know to ask. We will be unthinking a whole bunch of things we assume to be facts."
We've all heard about some of the creepy ways this data has been put to use, such as Target's ability to discern a young woman's pregnancy before her family even finds out. But big data is also being used for much less sinister causes. Here are a few industries that are leveraging it the most:
Loyalty cards and company credit cards are not issued just as a courtesy to customers. The data captured from the cards is processed by a big data platform, providing retailers with information that allows them to make better decisions about pricing, inventory control and customer incentives.
Big data comes into play because information accumulates very rapidly once you consider the number of customers, customer visits over time, product selections, number of stores and online shopping. This use of big data can have repercussions in terms of privacy, but it also provides a way for companies to better serve customers.
The banking industry has embraced big data whole-heartedly. Fraud detection is one reason. A customer’s history and transaction data can be used to detect any out-of-the-ordinary activity. For example, it is imperative that you tell your credit card provider when you are traveling outside the country (I learned this the hard way trying to rent a car in Sweden).
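To make "out-of-the-ordinary activity" a little more concrete, here is a minimal sketch in Python. It is not how any particular bank does it. It assumes a customer's history is just a list of past purchase amounts and uses a simple z-score as a stand-in for the much richer models (location, merchant, timing) a real fraud system would use. The function name and the threshold of 3.0 are illustrative assumptions.

```python
from statistics import mean, stdev


def is_unusual(history, amount, threshold=3.0):
    """Flag `amount` if it sits more than `threshold` standard deviations
    from the mean of the customer's past transaction amounts.

    Illustrative only: `history` is a hypothetical list of past amounts.
    """
    if len(history) < 2:
        return False  # not enough history to judge
    spread = stdev(history)
    if spread == 0:
        # Every past amount was identical, so any different amount stands out.
        return amount != history[0]
    return abs(amount - mean(history)) / spread > threshold


# Example: a customer who usually spends about 40 to 50 per purchase.
past_amounts = [42.0, 38.5, 51.2, 45.0, 40.3]
print(is_unusual(past_amounts, 44.0))   # a typical purchase
print(is_unusual(past_amounts, 900.0))  # far outside the usual range
```

The same idea extends to other signals, such as a card suddenly being used in a country the customer has never visited.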
Financial institutions also use big data to analyze transactional data, allowing bankers to determine the risk of financial assets based on market performance and customer behavior. This example from Research Pays mentions that big data is even helping locate new offices:
"SunTrust Bank, based in Florida, uses data analytics to determine not only the location of their next branch office, but also the optimal management qualifications required to operate one of their branches."
One obvious area where big data will help is in handling electronic health records safely and accurately across medical organizations. Having accurate records will provide patients with better service and decrease errors. The health care field, for obvious reasons, is adopting big data at a slower pace in order to conform to government regulations regarding patient confidentiality.
As mentioned earlier, big data is known for providing answers to unasked questions. In the health care field, this might mean finding a new drug or treatment that would not have been found otherwise. According to McKinsey & Company, big data could make the following possible in the not-so-distant future:
- Predictive modeling of biological processes and drugs becomes more sophisticated and widespread.
- Patients are identified to enroll in clinical trials based on more sources of information, such as social media.
- Trials are monitored in real time to rapidly identify safety or operational issues.
- Instead of rigid data silos that are difficult to exploit, data is captured electronically and flows easily between different units.
Big Data, Big Opportunity
While big data is being leveraged in some specific areas, it offers opportunity for all organizations in the following areas:
Big data’s ability to analyze social network posts (Twitter and Facebook, for example) in near real time gives companies, brands and organizations a unique opportunity to gauge customer or member loyalty and how customers feel about products and services.
The intersection between big data and social media offers organizations the ability to determine which customers have the most influence over other members of that particular social network. Studies have shown that these people are more important to the company than the top spenders.
No doubt about it, marketers love big data. The more data they have, the better they feel. What big data offers marketers that they did not have before is the ability to mine the tiniest detail about customer behavior towards their products. Marketing firm 360i says that big data has been helpful in:
- Retaining and upselling existing customers
- Identifying new customers
- Revealing new marketing opportunities
- Driving more profitable advertising
- Measuring the impact of campaigns more accurately
Next, a Look at IT and R&D
It is understandable that big data would play a role in today’s research and development departments. To get a better picture, I talked with Dr. Brad Rubin, an associate professor at the University of St. Thomas Graduate Programs in Software. After I audited a few of his classes, Rubin’s big data expertise was apparent.
Research and Development
Companies, universities and government agencies all benefit from big data’s ability to inhale vast amounts of unstructured data, giving scientists a better look at what is taking place. The famous quote by H. James Harrington comes to mind:
"Measurement is the first step that leads to control and eventually to improvement. If you can’t measure something, you can’t understand it. If you can’t understand it, you can’t control it. If you can’t control it, you can’t improve it."
Rubin offered an interesting story about how the university’s Hadoop-based big data platform was able to help with a research project led by Dr. Jadin Jackson of the University of St. Thomas. Jackson was trying to decipher several terabytes of rat brain EEG waveforms using a single Matlab workstation.
Rubin quickly offered Jackson use of the Hadoop cluster. It seemed like a win-win situation for both of the professors. Jackson would get his data processed 60 percent sooner, while Rubin and his students would gain valuable experience. The 192-core Hadoop cluster accomplished in one hour what the Matlab setup required 10 hours to complete. Plus, the cluster can do many of these analyses in parallel, further increasing productivity.
Here is the final report, which describes the research at length.
Just about any computing and networking device logs data. The amount of data being logged quickly becomes unwieldy. Big data can easily manage that amount of data, allowing administrators to monitor network activity, diagnose problems or, in the example Rubin gave me, look for certain network traffic patterns that would indicate malware activity.
If you are reading this article, it’s a fairly safe bet that you’re aware of the Heartbleed issue surrounding OpenSSL. Besides the technical problem, there is the concern that the vulnerability has existed for several years. Rubin mentioned that big data allows network administrators, working with data analysts, to create a program that will search all the network logs for malicious heartbeats. This EFF post mentions:
"Any network operators who have extensive packet logs can check for malicious heartbeats, which most commonly have a TCP payload of 18 03 02 00 03 01 or 18 03 01 00 03 01 (or perhaps even 18 03 03 00 03 01)."
To see how quickly log entries pile up, consider this sample output from a router's show audit command:
Router# show audit
*Sep 14 18:37:31.535:%AUDIT-1-RUN_VERSION:Hash:
*Sep 14 18:37:31.583:%AUDIT-1-RUN_CONFIG:Hash:
*Sep 14 18:37:31.595:%AUDIT-1-STARTUP_CONFIG:Hash:
*Sep 14 18:37:32.107:%AUDIT-1-FILESYSTEM:Hash:
*Sep 14 18:37:32.107:%AUDIT-1-HARDWARE_CONFIG:Hash:
If you follow the time stamps, the time interval for all those entries was less than one second. I would not even want to extrapolate that out for a day, let alone two years!
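For the curious, the extrapolation is easy enough to sketch in Python. The time stamps below are copied from the show audit output above. The assumption that a device keeps logging at that burst rate around the clock is mine and is almost certainly too pessimistic, so treat the result as an upper-end illustration rather than a measurement.

```python
from datetime import datetime

# Time stamps copied from the show audit output above.
STAMPS = [
    "Sep 14 18:37:31.535",
    "Sep 14 18:37:31.583",
    "Sep 14 18:37:31.595",
    "Sep 14 18:37:32.107",
    "Sep 14 18:37:32.107",
]


def span_seconds(stamps):
    """Seconds between the first and last time stamp (same year assumed)."""
    parsed = [datetime.strptime(s, "%b %d %H:%M:%S.%f") for s in stamps]
    return (parsed[-1] - parsed[0]).total_seconds()


def entries_per_day(stamps):
    """Extrapolate the observed logging rate to 24 hours (86,400 seconds)."""
    return len(stamps) / span_seconds(stamps) * 86_400


print(f"span: {span_seconds(STAMPS):.3f} seconds")
print(f"entries per day at that rate: {entries_per_day(STAMPS):,.0f}")
```

Five entries in 0.572 seconds works out to roughly 755,000 entries a day from a single device, which gives a sense of why two years of logs across a whole network calls for a big data platform.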
Something to Watch
If you check the job ads, there's a dire need for big data experts. I asked Rubin about this. He agreed, mentioning his students were excited about their prospects. I then realized that big data platforms, in particular those considered open source, are following a timeline very similar to how Linux became mainstream.
Universities embrace open-source versions of big-data platforms, in particular Hadoop, because they are free, and students can manipulate the source code. So the graduates who fill all those job openings are going to prefer working with open-source platforms, as it’s what they know the best. It will be interesting to watch.