The volume of big data is growing rapidly. From 2,500 exabytes in 2012, it is expected to reach 40,000 exabytes by 2020. Data storage is therefore a serious challenge, one that only cloud infrastructure is capable of handling. The cloud has become a popular option mainly because of its enormous storage capacity and because its terms of use do not impose long-term obligations on the subscriber. Cloud storage is typically offered as a subscription, with services lasting for a predetermined period. After that, the client is under no obligation to renew.

However, storing big data in the cloud opens up new security challenges that cannot be met with the security measures adopted for regular, static data. Though big data is not a novel concept, its collection and use have started to pick up pace only in recent years. In the past, big data storage and analysis were confined to large corporations and governments that could afford the infrastructure necessary for data storage and mining. Such infrastructure was proprietary and not exposed to general networks. Now, however, big data is cheaply available to all types of enterprises through public cloud infrastructure. As a result, new, sophisticated security threats have arisen, and they continue to multiply and evolve.

Security Issues in Distributed Programming Frameworks

Distributed programming frameworks process big data with parallel computation and storage techniques. In such frameworks, unauthenticated or modified mappers, which divide huge tasks into smaller sub-tasks whose results are later aggregated into a final output, can compromise data. Faulty or modified worker nodes, which take inputs from the mapper and execute the tasks, can compromise data by tapping the communication between the mapper and other worker nodes. Rogue worker nodes can also create copies of legitimate worker nodes. Because it is extremely difficult to identify rogue mappers or nodes in such a large framework, ensuring data security becomes even more challenging.

Most cloud-based data frameworks use NoSQL databases. NoSQL databases are well suited to handling huge, unstructured data sets, but from a security perspective many of them are poorly designed, having originally been built with almost no security considerations in mind. One of their biggest weaknesses is transactional integrity. Their authentication mechanisms are often weak, which leaves them vulnerable to man-in-the-middle or replay attacks. To make things worse, many NoSQL databases do not support third-party modules that could strengthen authentication. Because authentication is lax, data is also exposed to insider attacks. Attacks can go unnoticed and untracked because of poor logging and log analysis mechanisms.

Data and Transaction Log Issues

Data is usually stored in multi-tiered storage media. It is relatively easy to track data when the volume is small and static. But when the volume grows rapidly, auto-tiering solutions are employed. Auto-tiering solutions store data in different tiers but do not track where each item ends up, which is a security issue. For example, an organization may have confidential data that is rarely used. An auto-tiering solution will not distinguish between sensitive and non-sensitive data and will simply move the rarely accessed data to the lowest tier, which typically has the weakest security.

Data Validation Issues

In an organization, big data may be collected from various endpoints, including software applications and hardware devices. It is a big challenge to ensure that the data collected is not malicious. Anyone with malicious intentions may tamper with the device that provides the data or with the application that collects it. For example, an attacker may mount a Sybil attack on a system and then use the fake identities to feed malicious data to the central collection server or system. This threat is especially relevant in a bring your own device (BYOD) scenario, because users can connect their personal devices to the enterprise network.

Real-Time Big Data Security Monitoring

Real-time monitoring is a big challenge because you need to monitor both the big data infrastructure and the data it is processing. As pointed out earlier, big data infrastructure in the cloud is constantly exposed to threats. Malicious entities can modify the system to gain access to the data and then relentlessly generate false positives. Ignoring those alerts is extremely risky, because a genuine attack can hide among them. On top of this, these entities can mount evasion attacks to avoid detection, or use data poisoning to reduce the trustworthiness of the data being processed.

Strategies to Face Security Threats

Big data security strategies are still at a nascent stage, but they need to evolve quickly. The answers to the security threats lie in the network itself. The network components need to be fully trustworthy, and that can be achieved with strong data protection strategies. There should be zero tolerance for lax data protection measures. There should also be a strong, automated mechanism for collecting and analyzing event logs.

Improving Trustworthiness in Distributed Programming Frameworks

As pointed out earlier, untrusted mappers and worker nodes can compromise data security, so both need to be made trustworthy. To achieve this, mappers need to authenticate worker nodes regularly. When a worker node sends a connection request to a master node, the request is approved only if the worker has a predefined set of trust properties. Thereafter, the worker is reviewed regularly for compliance with trust and security policies.
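The admission step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the trust properties (`tls_enabled`, `disk_encrypted`, `agent_version`), the pre-shared key, and the function names are all hypothetical and do not come from any particular framework. The master sends a fresh random challenge, the worker proves it holds the key by returning an HMAC of that challenge, and only then are the worker's reported properties checked against the policy.

```python
import hashlib
import hmac
import secrets

# Hypothetical trust properties a master might require before admitting a worker.
REQUIRED_PROPERTIES = {"tls_enabled": True, "disk_encrypted": True}
MIN_AGENT_VERSION = (2, 0)  # assumed minimum version of the worker agent


def meets_trust_policy(properties: dict) -> bool:
    """Return True if the worker's reported properties satisfy the policy."""
    for key, expected in REQUIRED_PROPERTIES.items():
        if properties.get(key) != expected:
            return False
    return tuple(properties.get("agent_version", (0, 0))) >= MIN_AGENT_VERSION


def sign_challenge(shared_key: bytes, challenge: bytes) -> bytes:
    """Worker side: prove possession of the key by signing the challenge."""
    return hmac.new(shared_key, challenge, hashlib.sha256).digest()


def admit_worker(shared_key: bytes, properties: dict, respond) -> bool:
    """Master side: challenge the worker, verify its answer, then check policy.

    `respond` is a callable standing in for the network round trip to the worker.
    """
    challenge = secrets.token_bytes(32)  # fresh per request, so old answers cannot be replayed
    response = respond(challenge)
    expected = hmac.new(shared_key, challenge, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, response):  # constant-time comparison
        return False
    return meets_trust_policy(properties)
```

Running the same check again at intervals, rather than only at connection time, would correspond to the regular compliance review mentioned above.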

Strong Data Protection Policies

The security threats that arise from the inherently weak data protection in the distributed framework and in NoSQL databases need to be addressed. Passwords should be stored only as hashes produced by a secure hashing algorithm. Data at rest should always be encrypted and never left in the open, even after considering the performance impact. Hardware and bulk file encryption are faster, which could address the performance issues to an extent, but a hardware encryption appliance can also be breached by attackers. Given this, it is good practice to use SSL/TLS for connections between the client and the server and for communication across the cluster nodes. Additionally, the NoSQL architecture needs to allow pluggable third-party authentication modules.
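Two of these recommendations are easy to illustrate with the Python standard library. The sketch below is a minimal example, not a complete credential store: the iteration count is an assumed value that should be tuned to current guidance and hardware, and the function names are made up for illustration. It hashes passwords with a random salt using PBKDF2 and builds a TLS context that verifies the server certificate.

```python
import hashlib
import hmac
import os
import ssl

PBKDF2_ITERATIONS = 200_000  # assumed work factor; tune for your environment


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest); store both, never the password itself."""
    salt = os.urandom(16)  # a unique salt per password defeats precomputed tables
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, PBKDF2_ITERATIONS
    )
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the hash with the stored salt and compare in constant time."""
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, PBKDF2_ITERATIONS
    )
    return hmac.compare_digest(digest, expected)


def make_client_tls_context() -> ssl.SSLContext:
    """Build a client-side TLS context with certificate and hostname checks on."""
    context = ssl.create_default_context()  # verification is enabled by default
    context.minimum_version = ssl.TLSVersion.TLSv1_2  # refuse older protocol versions
    return context
```

The returned context can be passed to any standard-library client that accepts one, so that both client-to-server and node-to-node connections refuse unverified peers.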


Big data analytics can be used to monitor and identify suspicious connections to the cluster nodes and to mine the logs constantly for potential threats. Though the Hadoop ecosystem offers little built-in security monitoring, other tools may be used to monitor and identify suspicious activities, provided these tools meet certain standards. For example, such tools must conform to the Open Web Application Security Project (OWASP) guidelines. Real-time monitoring of events is expected to improve, with some developments already taking place. For example, the Security Content Automation Protocol (SCAP) is gradually being applied to big data. Apache Kafka and Storm promise to be good real-time monitoring tools.
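As a rough illustration of mining logs for suspicious connections, the following Python sketch counts failed connection attempts per source and flags sources that exceed a threshold. The event format (pairs of source address and outcome) and the threshold of five failures are assumptions made for this example. A production pipeline would read from a real log stream and apply far richer rules.

```python
from collections import Counter


def flag_suspicious_sources(events, max_failures=5):
    """Return the sorted list of sources with more than `max_failures` failures.

    `events` is an iterable of (source, outcome) pairs, where outcome is
    "ok" or "fail". This record layout is hypothetical.
    """
    failures = Counter(source for source, outcome in events if outcome == "fail")
    return sorted(source for source, count in failures.items() if count > max_failures)
```

The same counting logic could run over a sliding time window so that alerts reflect recent activity rather than the whole log history.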

Detect Outliers While Collecting Data

No system available today can completely prevent unauthorized intrusions at the time of data collection. However, intrusions can be significantly reduced. First, data collection applications must be developed to be as secure as possible, keeping in mind the BYOD scenario, in which the application may run on several untrusted devices. Second, determined attackers will likely breach even the strongest defenses and send malicious data to the central collection system, so there should be algorithms to detect and filter out such malicious inputs.
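One simple family of such algorithms is statistical outlier detection. The Python sketch below uses the modified z-score, based on the median and the median absolute deviation (MAD), to separate readings that sit far from the bulk of the data. It is only one possible technique, chosen here for illustration. The threshold of 3.5 is a commonly used default, and the function name is hypothetical. A filter like this would not catch malicious inputs crafted to look statistically normal.

```python
import statistics


def filter_outliers(values, threshold=3.5):
    """Split a non-empty list of numbers into (kept, rejected).

    A value is rejected when its modified z-score,
    0.6745 * |x - median| / MAD, exceeds `threshold`.
    """
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        # At least half the values equal the median; scores are undefined, keep all.
        return list(values), []
    kept, rejected = [], []
    for v in values:
        score = 0.6745 * abs(v - median) / mad
        (rejected if score > threshold else kept).append(v)
    return kept, rejected
```

For example, given the readings `[20.1, 19.8, 20.3, 20.0, 19.9, 500.0]`, the function keeps the first five values and rejects `500.0`.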


Big data vulnerabilities in the cloud are unique and cannot be addressed by traditional security measures. Big data protection in the cloud is still a nascent area, because certain best practices such as real-time monitoring are still developing and the practices that are available are not being strictly applied. Still, considering how lucrative big data is, security measures are sure to catch up in the near future.