Big data is a blanket term for the handling of very large volumes of data. We all understand that the larger the volume of data, the more complex it becomes to manage. Traditional database solutions often fail to manage large volumes of data properly because of their complexity and size. Managing large volumes of data and extracting real insight from them is therefore a challenging task. As we will see, the same "value" question applies to small data as well.

How Big Data Is Used

Conventional database solutions based on the RDBMS concept manage transactional data very well and are widely used in different applications. But when it comes to handling very large data sets (archived data in the terabyte or petabyte range), these solutions often fail: the data sets are simply too big, and most of the time they do not fit the architecture of traditional databases. Big data has therefore become a cost-effective approach to handling larger sets of data. From an organizational point of view, the use of big data falls into the following categories, which is where big data's real value resides:

  • Analytical Use
    Analysis of big data can reveal important insights hidden in data that would otherwise be too costly to process. For example, to gauge students' interest in a new topic, we could analyze daily attendance records together with other social and geographic facts. These facts are captured in the database; if we cannot access this data efficiently, we cannot see the results.
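The attendance example above can be sketched in a few lines. The records and field names here are hypothetical, just to show how a simple aggregation surfaces a trend once the data is accessible:

```python
from collections import Counter
from datetime import date

# Hypothetical daily attendance records: (student_id, topic, date).
records = [
    ("s1", "machine learning", date(2024, 3, 1)),
    ("s2", "machine learning", date(2024, 3, 1)),
    ("s3", "databases", date(2024, 3, 1)),
    ("s1", "machine learning", date(2024, 3, 2)),
    ("s2", "databases", date(2024, 3, 2)),
]

# Count how many attendance entries each topic drew.
interest = Counter(topic for _, topic, _ in records)

# The most attended topic is a rough proxy for trending interest.
trending_topic, count = interest.most_common(1)[0]
print(trending_topic, count)  # machine learning 3
```

In a real big data setting the same aggregation would run over billions of records distributed across a cluster, but the logic is the same.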

  • Enable New Products
    In the recent past, several new Web companies, such as Facebook, have used big data as the foundation for launching new products. We all know how popular Facebook is - it has successfully built a high-performance user experience on top of big data.

Where Is the Real Value?

Different big data solutions differ in how they store data, but in the end, they all store data in a flat file structure. In general, Hadoop consists of a file system - the Hadoop Distributed File System (HDFS) - along with operating-system-level abstractions and a MapReduce engine. A simple Hadoop cluster includes one master node and several worker nodes. The master node consists of the following:

  • Task Tracker
  • Job Tracker
  • Name Node
  • Data Node

The worker node consists of the following:

  • Task Tracker
  • Data Node

Some implementations have only the data node. The data node is where the data actually lives. HDFS stores large files (in the range of terabytes to petabytes) distributed across multiple machines. Reliability is achieved by replicating each piece of data across multiple hosts, so the data remains available even when one of the nodes is down. This also helps in achieving faster responses to queries, which is very useful for huge applications like Facebook. As users, we get a response to a chat request, for example, almost immediately. Consider a scenario where a user has to wait a long time while chatting: if messages and their responses are not delivered immediately, how many people will actually use these chat tools?
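The availability property described above - a read succeeds as long as any replica's host is alive - can be illustrated with a small sketch. This is not real HDFS code; the node names and the replication factor of 3 are illustrative:

```python
# Sketch: each block is replicated on several hosts; a read succeeds
# as long as at least one replica's host is up.
block_replicas = {
    "block-0": ["node-a", "node-b", "node-c"],  # replication factor 3
    "block-1": ["node-b", "node-c", "node-d"],
}
live_nodes = {"node-a", "node-c", "node-d"}  # node-b is down

def read_block(block_id):
    """Return the first live host holding a replica, or None."""
    for host in block_replicas[block_id]:
        if host in live_nodes:
            return host
    return None

print(read_block("block-0"))  # node-a
print(read_block("block-1"))  # node-c, because node-b is down
```

Even with node-b down, both blocks remain readable from surviving replicas - this is the behavior that keeps a service like chat responsive during partial failures.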

Going back to the Facebook example, if the data were not replicated across the cluster, such a responsive implementation would not be possible. Hadoop distributes the data across the machines in a cluster and stores each file as a sequence of blocks. All blocks are the same size except the last one, and both the block size and the replication factor can be configured as needed. Files in HDFS follow a write-once model and can be written by only one writer at a time. Decisions regarding the replication of blocks are made by the name node, which receives block reports and heartbeats from each data node. The heartbeats confirm that a data node is still available; the block reports list the blocks held on that node.
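The block layout described above - identical block sizes except possibly the last - is easy to sketch. The block size here is an arbitrary illustrative number, not an HDFS default:

```python
# Sketch of how a file is cut into fixed-size blocks: every block is
# the same size except possibly the last one, which holds the remainder.
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks needed for a file of file_size bytes."""
    full, remainder = divmod(file_size, block_size)
    blocks = [block_size] * full
    if remainder:
        blocks.append(remainder)  # only the last block may be smaller
    return blocks

print(split_into_blocks(350, 128))  # [128, 128, 94]
```

The name node then decides, per block, which data nodes should hold its replicas, using the block reports and heartbeats to know where blocks currently live.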

Another big data implementation, Cassandra, uses a similar distribution concept. Cassandra partitions data across the nodes of a cluster and can replicate it across data centers, so data can be placed close to the geographic location where it is used.
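Cassandra's partitioning is based on consistent hashing: each key is hashed onto a token ring, and each node owns a slice of that ring. The sketch below conveys the idea only - the node names, ring size, and hash choice are made up and are not Cassandra's actual implementation:

```python
import hashlib

# Illustrative token ring: four hypothetical nodes across two data
# centers, each owning a contiguous quarter of a 100-token ring.
NODES = ["dc1-node1", "dc1-node2", "dc2-node1", "dc2-node2"]

def token(key):
    """Hash a partition key deterministically onto the ring [0, 100)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 100

def owner(key):
    """Map a key's token to the node owning that slice of the ring."""
    return NODES[token(key) // 25]
```

Because the hash is deterministic, any node can compute which node owns a given key without consulting a central directory - a key difference from HDFS, where the name node tracks block placement.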

Sometimes Small Data Makes a Bigger (and Less Expensive) Impact

According to Rufus Pollock of the Open Knowledge Foundation, there is no point in creating hype around big data while small data is still where the real value lies.

As the name suggests, small data is a targeted subset drawn from a larger set of data. The small data approach shifts the focus from the sheer volume of data to its actual usage, countering the trend of reflexively moving toward big data. It helps in gathering data for specific requirements with less effort, making it the more efficient practice for many business intelligence implementations.

At its core, the concept of small data revolves around businesses that need actionable results: the results must be fetched quickly, and the subsequent action should be executed promptly. For such cases, we can often do without the heavyweight systems commonly used in big data analytics.

Consider the systems typically required for big data acquisition: a company might invest in large amounts of server storage, sophisticated high-end servers and the latest data mining applications to handle different bits of data, including dates and times of user actions, demographic information and more. This entire data set then moves to a central data warehouse, where complex algorithms sort and process the data into detailed reports.

These solutions have benefited many businesses in terms of scalability and availability, but some organizations find that adopting them requires substantial effort. It is also true that, in some cases, similar results can be achieved with a less elaborate data mining strategy.

Small data provides a way for organizations to step back from an obsession with the latest technologies that support ever more sophisticated business processes. Proponents of small data argue that, from a business point of view, it is important to use resources efficiently, so that overspending on technology can be avoided.

We have discussed the realities of big data and small data at length, but the most important part of the entire exercise is selecting the correct platform (big data or small data) for the use at hand. And the truth is that while big data can provide many benefits, it isn't always best.