Why Data Scientists Are Falling in Love with Blockchain Technology

Many will attest that data science and blockchain have the potential to revolutionize the financial sector, business, healthcare and industry. On one hand, blockchain is transforming traditionally centralized database systems into decentralized systems with better transparency, upgraded security, improved traceability and reduced cost (Read also: Blockchain Explained). On the other hand, data science is becoming increasingly vital to the decision-making processes of these sectors.

While the distinct advantages of these technologies are well charted, what is not well-explored is how they can complement each other. In this article, I describe a few challenges that data scientists usually face and the potential of blockchain to alleviate these challenges.

Data Challenges for Data Scientists

Since data has become one of the most valuable resources for businesses and governments, the demand for data scientists who can turn raw data into a usable asset is constantly growing. (Read also: Job Role: Data Scientist)

A data scientist collects, analyzes and interprets data to uncover insights that help organizations in their decision-making process. While pursuing their objectives, data scientists encounter several challenges (Read also: Challenges and Opportunities in Data Science) that hinder their progress. There are other challenges, such as the need for cross-domain expertise, but for the purposes of this article I focus on data-related challenges and group them into five categories:

Data authenticity: Data scientists collect data from multiple sources that are vulnerable to tampering and theft. The increasing importance of data has been accompanied by a sharp rise in data breaches. For example, in the U.S. the number of breached data records increased from around 67 million in 2005 to 164.7 million in 2019 (Read also: What is Data Integrity? Definition and Best Practices).

Many companies, including Yahoo, CAM4, Zoom, Twitter, Facebook and LinkedIn, have been victims of data breaches. As organizations increasingly rely on data scientists for vital decision making, data scientists must have authenticated data. Ideally, the data would have built-in authenticity, which is particularly essential for financial sectors and data hosting organizations. (Read also: The Biggest Data Breaches)

Data privacy: Data privacy is the biggest hurdle when it comes to data availability, especially for data scientists working with users’ data. With more and more jurisdictions adopting data privacy legislation such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), getting access to data is becoming difficult for data scientists. Although some data privacy technologies have emerged to support data scientists in their endeavors, the most promising ones, such as federated machine learning, require data to be secured in a distributed fashion. (Read also: US Data Protection and Privacy in 2020 – An Overview)

Data quality: Data scientists typically spend most of their time “data scrubbing” because they do not want to be caught in the mess of dirty data; no matter how well they analyze it, dirty data cannot give them what they want. Dirty data has many facets, such as duplicate or incorrect records, and typically arises from systems with poor data integrity and validity mechanisms. Stronger data integrity guarantees in database systems would certainly help data scientists perform genuine analysis on accurate data. (Read also: The Challenges of Data Quality)

Data access: Unregulated data access processes often haunt data scientists who are trying to reach the data they need. This inefficiency makes the cycle of data access and analysis cumbersome.

Real-time analysis: In many cases data scientists can get more value from data by analyzing it in real time. However, traditional data management systems do not support real-time data analysis, which keeps data scientists from gaining its advantages.

Blockchain: Solution to Data Challenges

A blockchain is essentially a distributed database system that maintains data on a peer-to-peer (P2P) network in a growing list of ordered units called blocks. Each block has a timestamp and a link to the previous block, and stores data in a tamper-evident, cryptographically protected form. Since its emergence as the ledger behind the electronic cash system of the cryptocurrency Bitcoin, the applications of blockchain have been growing rapidly in many sectors. Below, I describe key characteristics of blockchain as a solution to the aforesaid challenges of data scientists. (Read also: How Blockchain will Disrupt Data Science; Implications of Blockchain in Data Science; and Blockchain and Big Data: A Great Marriage)

Built-in authenticity: Being a distributed system, blockchain maintains multiple instances of the data, rather than a single copy. This helps blockchain resist data tampering and revision, since data authenticity can easily be verified. The blockchain retains a unique “fingerprint” for each of its blocks. The fingerprint is computed by applying a hash algorithm to the contents of the block. This process ensures data authenticity at two levels: first, data can be easily verified and second, the structure of the blockchain depends on the validity of the fingerprints, as they are used to link the blocks.
Data privacy protection: Blockchain can protect data privacy through cryptographic protocols while still allowing data scientists to utilize the data. (Read also: The Blockchain as Decentralized Security Framework) There are various ways in which blockchain can help data scientists access privacy-protected data for their particular endeavors. Two of these ways are:
- Homomorphic encryption, which is a form of encryption that allows computation to be performed on encrypted data, so there is no need to share the original data. It is often discussed alongside other privacy-preserving cryptographic techniques such as zero-knowledge proofs (ZKPs), of which zk-SNARKs are one kind. (Read also: Cryptography: Understanding Its Not-So-Secret Importance)
- Federated machine learning, which is a collaborative technique for analyzing data that is distributed across multiple devices without having to collect it in a central location. The technique shares the locally trained model of each distributed unit (i.e. a summary of the data's characteristics), rather than the actual data, to protect data privacy. The marriage of federated machine learning and blockchain for privacy-preserving data analysis has been used for analyzing data from IoT devices in tasks such as energy behaviour analysis of home appliances (Read also: Blockchained on-device federated learning).
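To make the federated idea concrete, here is a minimal sketch in plain Python, not taken from any of the systems cited above. It assumes three hypothetical devices that each hold a few private (x, y) readings, fit a tiny linear model locally, and send only the model parameters and a sample count to a coordinator. The names `local_update` and `federated_average` and the toy data are illustrative assumptions, and no blockchain is involved in this sketch; it shows only the averaging step.

```python
def local_update(weights, data, lr=0.1, epochs=20):
    """Run gradient descent for y ~ w*x + b on one device's private data."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        # Gradients of the mean squared error over this device's data only.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    # Only the model and the sample count leave the device, never the data.
    return (w, b), n


def federated_average(updates):
    """Combine local models, weighting each by its number of samples."""
    total = sum(n for _, n in updates)
    w = sum(model[0] * n for model, n in updates) / total
    b = sum(model[1] * n for model, n in updates) / total
    return (w, b)


# Hypothetical private datasets; every point lies on y = 2x + 1.
devices = [
    [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)],
    [(1.0, 3.0), (3.0, 7.0)],
    [(0.5, 2.0), (1.5, 4.0), (2.5, 6.0), (3.5, 8.0)],
]

global_model = (0.0, 0.0)
for _ in range(50):  # communication rounds
    updates = [local_update(global_model, data) for data in devices]
    global_model = federated_average(updates)

print(global_model)  # should be close to (2.0, 1.0)
```

In a blockchain-backed variant, the model updates rather than the raw readings would presumably be what gets recorded and verified on the chain; how that is done depends on the specific system.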
Data quality guarantee: The immutable nature of blockchain supports data consistency because once data has been recorded on the blockchain, it cannot be edited or deleted without detection. The cryptographic authenticity mechanism of blockchain also helps maintain the consistency of its data. To promote data accuracy, blockchain uses a decentralized consensus procedure for cross-checking data at its entry point.
Smooth data access: Blockchain can streamline data access for data scientists, since they can be admitted as participants in the network, at a defined level and under defined conditions, to reach the data they require. This makes their work more efficient and shortens the cycle of data access and analysis.
Real-time analysis: The ability of blockchain to maintain a record of every data transaction makes it a valuable resource for analyzing data in real time. The promise of this newly emerging resource has already been demonstrated in the case of cryptocurrency. (Read also: Liberland: The Country on the Blockchain — An Inside Look.)
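The fingerprint-and-link mechanism described under built-in authenticity can be illustrated with a minimal sketch. It is a toy, not how any particular blockchain is implemented: it assumes SHA-256 as the hash algorithm and a JSON encoding of each block, and the function names (`fingerprint`, `append_block`, `first_broken_link`) and sample readings are hypothetical.

```python
import hashlib
import json


def fingerprint(block: dict) -> str:
    """Return the SHA-256 hex digest of a block's canonical JSON form."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_block(chain: list, data: str, timestamp: int) -> None:
    """Append a block that stores the fingerprint of its predecessor."""
    prev_hash = fingerprint(chain[-1]) if chain else "0" * 64
    chain.append({
        "index": len(chain),
        "timestamp": timestamp,
        "data": data,
        "prev_hash": prev_hash,
    })


def first_broken_link(chain: list):
    """Return the index of the first block whose stored link no longer
    matches its predecessor's fingerprint, or None if all links hold."""
    for i in range(1, len(chain)):
        if chain[i]["prev_hash"] != fingerprint(chain[i - 1]):
            return i
    return None


chain = []
for ts, record in enumerate(["reading=20.1", "reading=20.4", "reading=19.8"]):
    append_block(chain, record, timestamp=ts)

print(first_broken_link(chain))  # None: every link is intact

chain[0]["data"] = "reading=99.9"  # tamper with an early block
print(first_broken_link(chain))  # 1: block 1 no longer matches block 0
```

Because each block stores its predecessor's fingerprint, editing an earlier block breaks the link held by the next one. This local check cannot catch a change to the most recent block, which has no successor yet; in a real network that gap is covered by the replicated copies and the consensus procedure described above.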

Final Thoughts

It’s true that blockchain data is verified and secured using cryptography. This restricts all unauthorized changes and hacks in the system. It removes the middlemen from the system so no one can make any unauthorized changes.

However, as Epiq Global points out, this doesn’t mean blockchain is fail-proof. If enterprise businesses utilize permissionless platforms (as is the case with Bitcoin), any endpoints that have vulnerabilities can be targeted by malicious threat actors. This raises the question of whether data scientists using these types of public blockchain are able to guarantee confidentiality, and whether the integrity of the data being ingested can be trusted. Further, can the computed results be relied upon?