Discovering Data Theft Using Hadoop and Big Data


Big data and Hadoop are being combined to identify data theft and put a stop to it.

Nowadays, the risk of data theft due to data exposure in companies and government agencies has increased drastically, with new cases being identified every day. This kind of data theft can be a huge blow to an organization, as it reveals confidential information and can result in the loss of large sums of money. Data is not easy to secure, and even many advanced techniques fail in the field. The most frightening thing about these thefts is that they are extremely hard to detect. Sometimes, it can take several months or even years to discover them. That is why organizations must take strong measures to keep their data safe. One such method is to use a combination of Hadoop and big data to detect fraudulent criminal websites and to alert other organizations as well.

Why Do We Need to Secure the Data?

As stated earlier, new instances of data theft are reported every day. These types of data theft can occur in any company, be it a government organization, business or even a dating website. It is estimated that data theft alone can result in the loss of substantial capital. How much, you may ask? About $455 billion annually!

Though the security systems that companies currently use can counter some simple data theft techniques, they still can’t counter more complex attempts or threats from inside organizations. Added to that, because these cases take so long to be identified, criminals can easily exploit the loopholes in the security systems.

How to Counter These Threats

As the number and complexity of these data thefts increase, hackers are finding new techniques to get around security systems. Organizations that maintain important confidential data must therefore change their current security architectures, which are able to respond only to simpler threats. Only a practical solution can help avoid these kinds of thefts. A company must be ready for any kind of theft, which means planning in advance so that it can respond to such a situation quickly and tackle it.

Many companies have taken the initiative to provide solutions that will allow other companies to protect their data against thieves. An example of such a company is Terbium Labs, which uses the novel method of utilizing big data and Hadoop to effectively detect and respond to such threats.

How Can Terbium’s New Technique Help in Securing Data?

The technology that Terbium uses to help companies respond to threats quickly is called Matchlight. It scans the Web, including its hidden parts, for a company’s confidential data and reports any finding to the user immediately. The application is highly accurate too. It creates unique signatures of the company’s confidential data, called “fingerprints,” and then matches them against the fingerprints of data found across the Web. In this way, big data can be used to identify instances of data theft by looking for evidence around the Web. If the data is found anywhere other than an authorized place, such as on the Internet, the Dark Web or a competing company’s website, Matchlight immediately informs the parent company about the stolen information and its location.


“Fingerprinting” Technology

Matchlight incorporates a special technology called fingerprinting, with which it can match large amounts of data without any hassles. The application first computes the fingerprints of the confidential data. These fingerprints are then stored in its database and regularly compared with the fingerprints of data gathered from around the Internet, which makes it possible to detect exposure of the data on the Web. If a matching signature is found, Matchlight automatically alerts the client company, which can then implement its planned security measures immediately.
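The article does not describe how Matchlight computes its fingerprints, so the following is only a minimal sketch of the general idea, assuming a common approach: hash every run of a few consecutive words and compare the resulting sets. The function names `fingerprints` and `match_ratio`, the window size of five words and the choice of SHA-256 are all illustrative assumptions, not Terbium's method.

```python
import hashlib


def fingerprints(text, size=5):
    """Return SHA-256 digests of every run of `size` consecutive words.

    Assumption: word-level windows and SHA-256 are used here purely for
    illustration; the real Matchlight algorithm is not public in the article.
    """
    words = text.lower().split()
    if len(words) < size:
        return set()
    return {
        hashlib.sha256(" ".join(words[i:i + size]).encode("utf-8")).hexdigest()
        for i in range(len(words) - size + 1)
    }


def match_ratio(client_prints, page_text, size=5):
    """Fraction of the client's fingerprints that also occur in page_text."""
    if not client_prints:
        return 0.0
    found = fingerprints(page_text, size)
    return len(client_prints & found) / len(client_prints)


if __name__ == "__main__":
    # Hypothetical confidential record and a hypothetical page that leaks it.
    secret = "account 4111 belongs to alice example of 12 main street"
    stored = fingerprints(secret)
    leaked_page = "for sale " + secret + " cheap"
    print(match_ratio(stored, leaked_page))
    print(match_ratio(stored, "nothing to see here at all today"))
```

A monitoring service built along these lines would raise an alert when the ratio for a crawled page exceeds some threshold. Exact hashing like this only catches verbatim copies; a production system would presumably need something more tolerant of reformatting.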

Which Data Types Does It Cover?

Matchlight can find any type of data, including picture files, text documents, applications and even source code. The solution is so powerful that it can process whole, highly complex data sets at once. Because of this, many companies are using Matchlight for data security, and Terbium’s database currently contains more than 340 billion fingerprints, a number that is increasing every day.

How Does Hadoop Help?

To handle the vast amounts of data in the database effectively, Terbium required a powerful big data processing platform, and chose Hadoop. However, the company needed a fast and efficient version of Hadoop. It concluded that an enterprise Hadoop distribution running in native code would be the most suitable option, and decided against a JVM version because that made the distribution heavy on resources.

The co-founder of Terbium, Mr. Danny Rogers, noted the importance of Hadoop. He said that the efficiency of Matchlight depends on the efficiency of data collection, which depends on Hadoop. This shows the importance of Hadoop in ensuring data security in organizations.

Prospects of Hadoop in the Field of Data Security

Terbium is fast gaining popularity, and some large Fortune 500 companies have already begun using the Matchlight service to track stolen data. These include healthcare companies, technology providers, banks and other financial service providers. The results are astounding as well. The companies have recovered about 30,000 credit card records and 6,000 new email addresses that had been stolen by attackers, all in the first few seconds of the first day. These were apparently for sale on the Dark Web.

Benefits of Using Hadoop for Discovering Stolen Data

This integration of machine learning, cloud-based databases and a highly reliable and accurate enterprise-grade Hadoop version can benefit companies in many ways. The cloud-based databases can accumulate a large amount of data, which the application uses, with the help of Hadoop, to match signatures across the Internet in seconds. Hadoop thus greatly enhances the speed of the overall search. As a result, companies will be able to find their stolen data in a few seconds, instead of the current average detection time of 200 days.
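The article does not show how the matching job is organized on Hadoop. As a rough illustration of why the MapReduce model suits this workload, the sketch below imitates the map, shuffle and reduce steps in plain Python: every fingerprint becomes a key, and a leak is reported wherever the same key arrives from both a client and a crawled page. The record layout, the `"client"` and `"web"` labels, and the function names are assumptions made for this example; it uses no Hadoop API.

```python
from collections import defaultdict


def map_phase(records):
    """Emit (fingerprint, (origin, name)) so equal fingerprints share a key."""
    for origin, name, fingerprint in records:
        yield fingerprint, (origin, name)


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Report (client, url) wherever a client fingerprint was seen on the web."""
    alerts = set()
    for values in groups.values():
        clients = [name for origin, name in values if origin == "client"]
        urls = [name for origin, name in values if origin == "web"]
        for client in clients:
            for url in urls:
                alerts.add((client, url))
    return sorted(alerts)


if __name__ == "__main__":
    # Hypothetical records: (origin, client name or URL, fingerprint).
    records = [
        ("client", "acme", "f1"),
        ("client", "acme", "f2"),
        ("web", "http://x.example/dump", "f2"),
        ("web", "http://y.example/page", "f9"),
    ]
    print(reduce_phase(shuffle(map_phase(records))))
```

Because each fingerprint is handled independently once grouped, the work can be spread across many machines, which is the property that lets a cluster compare billions of fingerprints quickly.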

Why MapR Distribution Only?

Matchlight uses only the MapR distribution of Hadoop, for a variety of reasons. First, this enterprise-grade version of Hadoop runs in native code and, as a result, uses every resource efficiently. Its storage costs are also very low, considering that it is cloud based. Furthermore, it is extremely fast, so it can easily help manage large numbers of data fingerprints. It also offers many additional business-grade features, such as state-of-the-art security, high reliability and easy backup and recovery.


Hadoop is proving to be extremely useful in the field of data security in organizations. Many companies use MapR to manage data effectively and to prepare a plan they can execute in case of a data theft. Many new companies are also emerging that promise to secure the data of these organizations, and even to identify data theft in a matter of seconds instead of months.



Kaushik Pal
Technology writer

Kaushik is a technical architect and software consultant with over 23 years of experience in software analysis, development, architecture, design, testing and training. He has an interest in new technologies and areas of innovation. He focuses on web architecture, web technologies, Java/J2EE, open source software, WebRTC, big data and semantic technologies. He has demonstrated expertise in requirements analysis, architectural design and implementation, technical use cases and software development. His experience has covered various industries such as insurance, banking, airlines, shipping, document management and product development, etc. He has worked on a wide range of technologies ranging from large scale (IBM…