Can there ever be too much data in big data?

Answer

The answer to the question is a resounding YES. There can absolutely be too much data in a big data project.

This can happen in several ways, which is why professionals often need to limit and curate data to get the right results. (Read 10 Big Myths About Big Data.)

In general, experts talk about differentiating the "signal" from the "noise" in a model. In other words, in a sea of big data, the data that actually carries the relevant insight becomes difficult to isolate. In some cases, you're looking for a needle in a haystack.

For example, suppose a company is trying to use big data to generate specific insights on one segment of its customer base and that segment's purchases over a specific time frame. (Read What does big data do?)

Taking in an enormous amount of data may pull in records that aren't relevant, or it might even introduce a bias that skews the results in one direction or another.

It also slows down the process dramatically, as computing systems have to wrestle with larger and larger data sets.

In many kinds of projects, it's important for data engineers to curate the data into restricted, specific data sets. In the case above, that would mean only the data for the customer segment being studied, only the data for the time frame being studied, and an approach that weeds out additional identifiers or background information that can confuse things or slow down systems. (Read Job Role: Data Engineer.)
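As a rough illustration of that kind of curation, here is a minimal Python sketch. Everything in it is an assumption made for the example: the record layout, the field names (segment, purchased_on, browser), the sample values and the curate helper are all hypothetical, not part of any particular big data tool.

```python
from datetime import date

# Hypothetical raw purchase records; every field name and value is made up.
RAW_PURCHASES = [
    {"customer_id": 1, "segment": "students", "purchased_on": date(2023, 3, 14),
     "amount": 20.0, "browser": "Firefox"},
    {"customer_id": 2, "segment": "retirees", "purchased_on": date(2023, 3, 20),
     "amount": 75.0, "browser": "Safari"},
    {"customer_id": 3, "segment": "students", "purchased_on": date(2022, 11, 2),
     "amount": 12.5, "browser": "Chrome"},
    {"customer_id": 4, "segment": "students", "purchased_on": date(2023, 5, 30),
     "amount": 48.0, "browser": "Chrome"},
]


def curate(records, segment, start, end, keep_fields):
    """Return only the rows for one segment and time window, with only the listed fields."""
    curated = []
    for row in records:
        if row["segment"] != segment:
            continue  # a different customer segment
        if not (start <= row["purchased_on"] <= end):
            continue  # outside the time frame being studied
        # Drop background fields (here, "browser") that the analysis does not need.
        curated.append({field: row[field] for field in keep_fields})
    return curated


curated = curate(
    RAW_PURCHASES,
    segment="students",
    start=date(2023, 1, 1),
    end=date(2023, 6, 30),
    keep_fields=("customer_id", "purchased_on", "amount"),
)
for row in curated:
    print(row)
```

Only the two student purchases from the first half of 2023 survive, and the browser field is gone. In a real pipeline the same filtering would more likely happen in the query or ingestion layer, so the irrelevant rows never reach the analysis at all.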

For more, let's look at how this works at the frontier of machine learning. (Read Machine Learning 101.)

Machine learning experts talk about something called "overfitting," in which an overly complex model leads to less effective results when the machine learning program is turned loose on new production data.

Overfitting happens when a complex model matches its initial training set too well, noise included, and so cannot easily adapt to new data.
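A small numeric experiment can make overfitting concrete. The sketch below is only an illustration under stated assumptions: the data is invented (a true relationship of y = 2x, with made-up alternating noise of +1 and -1 on the training points), and the two "models" are a least-squares straight line and a degree-5 polynomial that passes through every training point exactly. The function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns a prediction function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_interpolating_polynomial(xs, ys):
    """Lagrange polynomial through every training point (degree len(xs) - 1)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict


def mse(model, xs, ys):
    """Mean squared error of a prediction function on a data set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data: the true relationship is y = 2x.
TRAIN_X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
TRAIN_Y = [1.0, 1.0, 5.0, 5.0, 9.0, 9.0]   # 2x with noise +1, -1, +1, -1, +1, -1
TEST_X = [0.25, 1.25, 2.25, 3.25, 4.25]    # new, unseen inputs
TEST_Y = [2 * x for x in TEST_X]            # noise-free truth

line = fit_line(TRAIN_X, TRAIN_Y)
poly = fit_interpolating_polynomial(TRAIN_X, TRAIN_Y)

for name, model in (("line", line), ("degree-5 polynomial", poly)):
    print(name,
          "train MSE:", round(mse(model, TRAIN_X, TRAIN_Y), 3),
          "test MSE:", round(mse(model, TEST_X, TEST_Y), 3))
```

With these numbers the polynomial has zero error on the training set, because it reproduces the noise exactly, yet its error on the new points is larger than the simple line's. That gap between training and production performance is the symptom described above.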

Now technically, overfitting is caused not by the existence of too many data samples, but by the correlation of too many data points: a model with many features or parameters can latch onto patterns in the training set that are really just noise. But you could argue that having too much data can be a contributing factor to this type of problem as well, since every extra field that is collected adds another dimension for the model to fit. Dealing with this "curse of dimensionality" involves some of the same techniques that were used in earlier big data projects, as professionals tried to pinpoint what they were feeding IT systems.
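One simple way to pinpoint what gets fed into a system is to filter candidate fields before modeling. The sketch below uses a plain correlation filter, which is just one of many possible techniques and is named here as an assumption, not as the method any particular team uses. The feature names, the sample values and the 0.5 cutoff are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either column has no variation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def select_features(features, target, min_abs_corr=0.5):
    """Keep only the features whose correlation with the target is strong enough."""
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, target)) >= min_abs_corr
    ]


# Made-up columns for five customers; the target is their total spend.
FEATURES = {
    "visits": [1, 2, 3, 4, 5],
    "zip_last_digit": [3, 9, 1, 9, 3],         # unrelated background detail
    "is_customer": [1, 1, 1, 1, 1],            # constant, carries no information
    "days_since_last_order": [9, 7, 6, 3, 1],
}
SPEND = [10, 20, 30, 40, 50]

selected = select_features(FEATURES, SPEND)
print(selected)
```

Here the filter keeps visits and days_since_last_order and drops the two fields that say nothing about spend. A filter like this is crude, since it only detects linear relationships with one field at a time, but it shows the general idea of shrinking the number of dimensions before a model ever sees the data.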

The bottom line is that big data can be enormously helpful to companies, or it can become a major challenge. One aspect of this is whether the company has the right data in play. Experts know that it's not advisable to simply dump all data assets into a hopper and come up with insights that way. In new cloud-native and sophisticated data systems, there's an effort to control, manage and curate data in order to get more accurate and efficient use out of data assets.