Despite being one of the latest technology trends and the focus of so much attention in the business world, big data is not a true novelty. It has always existed; the difference today is that this immense vault of useful information and actionable insights is finally accessible. The newest data analysis methods and advancements in cloud computing have lowered the barrier to entry, allowing companies to put big data to work in driving their interests forward. By “cracking the safe,” technology enabled a true business revolution.
Long a mirage for analysts who tried to access the myriad of unique insights it might provide, the ever-expanding pool of data generated by databases, archives and internal sources grew into an ocean with the introduction of the internet and social media. Today, an immense volume of high-velocity data is produced every day, opening a world of possibilities and business opportunities for companies that exploit it for competitive advantage. With yearly revenue expected to reach $42 billion in 2018 and an accumulated volume of 44 zettabytes by 2020, big data is, hands down, the future of commerce.
The real potential of big data goes beyond the mere size of data itself. Its immense value lies in the chance to analyze these gargantuan data sets to empower almost every aspect of business, from operations to customer behavior, advertisement, workflow procedures, supply-chain management, and so on. Big data provides a clearer understanding of the overall picture, is statistically reliable, and is an irreplaceable tool to analyze past performance, optimize present processes and set future goals.
Defining Big Data: What It Is and What Its Characteristics Are
Big data is defined by its enormous size, diversity and speed. The three elements that qualify data as “big data” are, therefore, high volume, high velocity and high variety (the so-called “three V's”). A second set of V's, consisting of veracity, validity and volatility, has since been added to describe data quality and put it in context.
Volume is high by definition: big data sets are so large that specialized storage and parallel processing are necessary to handle them. A single computer (known as a “node”) is never enough, and clusters usually range from 10 to 100 nodes. Big data includes machine-to-machine information and record logs describing events such as commercial transactions, comments on social networks, posts and threads in a forum, web page clicks and impressions, and so on. Unlike traditional data (think of a shop’s inventory, for example), these event records do not change once written, but their cumulative size is, by definition, massive.
Velocity is also high because big data is extremely granular: it is produced in real time by devices and software connected to the web, such as large-scale transaction systems, IoT devices and sensors. The data flow is unstoppable, enormous and continuous. Organizations must analyze and consume big data in near real time to increase their efficiency and make decisions on the spot.
The last of the three “core” V's – variety – refers to the extremely diversified type and nature of big data. Most big data now comes from unstructured sources and is often used to complete missing pieces through data fusion. Experts and analysts must, therefore, deal with different formats of data coming from different sources (documents, images, videos, etc.), usually by using some form of automated system.
Veracity and Validity
Accurate analysis requires high-quality data, but the quality of captured data varies substantially. Noise, biases, inconsistencies and abnormalities may degrade data quality, as may any non-relevant “dirty” data that is collected or stored, which must therefore be filtered out. Veracity defines how trustworthy the data itself is, along with its source, type and processing method, while validity measures how correct and accurate the data is for its intended use.
Volatility refers to how long the data stays valid. Depending on its rate of change, volatility determines how long data should be stored and its overall lifetime. For example, data coming from social media is highly volatile since trending topics and opinions change in the blink of an eye and quickly become non-relevant to a given analysis. The more an information subset is predictable and unchangeable or, at least, repetitive (such as weather trends, for example), the less volatile the data is.
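Volatility can be enforced mechanically by discarding records older than a retention window matched to the data's rate of change. A minimal sketch, assuming records carry a `timestamp` field (the field name and the 24-hour window are illustrative, not prescribed by any particular system):

```python
from datetime import datetime, timedelta

def filter_stale(records, now, max_age):
    """Drop records older than the data set's volatility window."""
    cutoff = now - max_age
    return [r for r in records if r["timestamp"] >= cutoff]

now = datetime(2024, 6, 1, 12, 0)
posts = [
    {"id": 1, "timestamp": now - timedelta(minutes=30)},  # fresh trending topic
    {"id": 2, "timestamp": now - timedelta(days=3)},      # already stale
]
# Social media is highly volatile, so a short 24-hour window fits it.
fresh = filter_stale(posts, now, max_age=timedelta(hours=24))
```

A slow-changing source like weather trends would simply use a much larger `max_age` with the same logic.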
What Are the Advantages of Big Data over Traditional Data?
Even though traditional databases can store only limited amounts of data (usually a couple of terabytes), what really defines big data and sets it apart from traditional data is not just its size. Big data is also defined by its different uses, the advantages it offers over traditional data, the goals it achieves and the strategies employed when dealing with it. Let’s have a look at some of the reasons why so many businesses are now choosing big data over traditional data to skyrocket their productivity.
More Efficient and Affordable Data Architecture
Traditional data was stored on inefficient conventional disk drives and costly centralized database architectures in which a single computer system had to handle the entire workload. Big data solves this problem by employing distributed database architectures and software-defined scalable storage systems. Efficiency is vastly improved: large blocks of data are divided into smaller chunks, which are abstracted and computed by many different nodes in a network. The distributed database allows data to be moved more quickly (and with fewer resources) from one storage unit to another with no loss of performance. Microprocessors in distributed database systems are also cheaper and can reach computational power superior to that of a traditional centralized mainframe.
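The divide-and-distribute idea can be sketched in a few lines: split a data set into blocks, let each "node" work on its block independently, then combine the partial results. This is a toy simulation using threads in place of real network nodes; block size and node count are arbitrary illustrations:

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_blocks(data, block_size):
    """Divide a large data set into fixed-size blocks, as a distributed store would."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def node_work(block):
    """The computation each 'node' performs independently on its local block."""
    return sum(block)

data = list(range(1, 1001))            # a 'large' data set: 1..1000
blocks = split_into_blocks(data, 100)  # ten blocks, one per hypothetical node
with ThreadPoolExecutor(max_workers=10) as pool:
    partials = list(pool.map(node_work, blocks))
total = sum(partials)                  # combine the partial results
```

Because each block is processed independently, adding nodes scales the computation out rather than up, which is exactly the property centralized mainframes lack.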
Breaking the Vault of Unstructured Data
Traditional database systems use only structured data, so data analysts could extract useful information only from this highly searchable, clearly organized type of data. However, structured data represents only the tip of the iceberg, since it is limited to highly organized information that can be easily stored in relational databases (RDBs) and spreadsheets. Structured databases provide insights only at a very low level, since all information is strictly defined in terms of field type and name. Big data, on the other hand, makes full use of semi-structured and unstructured data, which represent a whopping 80 percent of all data available. Big data increases the variety of data that can be gathered and analyzed by adding videos, pictures, web logs, medical scans and NoSQL databases (just to name a few) to the fray. Metadata can be used to connect structured and unstructured data and transform them into consumable information.
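The role of metadata as the bridge between the two worlds can be illustrated with a small sketch: an unstructured document (an email body) carries a metadata key that links it back to a structured customer record. The field names (`customer_id`, `segment`) are hypothetical, chosen only for the example:

```python
# Structured side: a relational-style customer table, keyed by ID.
structured = {"C-1001": {"name": "Acme Corp", "segment": "retail"}}

# Unstructured side: free text, with metadata attached at capture time.
unstructured_docs = [
    {"meta": {"customer_id": "C-1001", "type": "support_email"},
     "body": "The checkout page keeps timing out on mobile."},
]

def enrich(doc, customers):
    """Join a structured customer record onto an unstructured document via its metadata."""
    record = customers.get(doc["meta"]["customer_id"], {})
    return {**doc, "customer": record}

enriched = [enrich(d, structured) for d in unstructured_docs]
```

The free text alone says nothing about who complained; combined with the structured record it becomes consumable information ("a retail customer is hitting mobile checkout timeouts").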
Scalability and Flexibility
Big data is flexible and scalable. Platforms like Hadoop and Spark can analyze massive amounts of data with no performance degradation at any level, while traditional SQL queries need to be integrated into larger, more expensive analytics frameworks. Big data scales out by distributing computation across multiple servers, which can easily be upgraded or increased in number, whereas traditional databases rely on the computational power of a single server. Flexibility is also a strong point, since data sets no longer need consistent structures. Data can be transformed quickly, allowing analytics to handle any type of data, even very different types at the same time.
Higher Data Quality and Accuracy
Traditional database systems cannot store all the data, so the actual amount of data that can be analyzed is reduced. Less available data translates into less accurate results and, in turn, lower quality. Big data offers real-time insights, and since voluminous amounts of data can be stored rather easily, the quality and accuracy of results are greatly improved.
Sources of Big Data
What are the principal sources of big data? There are many places from which a business can extract big data, such as the web, social media, databases, self-service data, business apps and more. Data sources are plentiful, but first, let’s summarize the principal types of data available.
- Structured data is organized information that resides in fixed fields within a record, database or file. Examples include phone numbers, ZIP codes, and user demographics such as gender or age.
- Unstructured data is raw, unorganized information that does not have a recognizable structure. It may contain text together with numbers, facts or dates with no identifiable internal structure. Examples include emails, social media posts, customer service interactions and multimedia content.
- Internal data is (usually unstructured) data that is archived behind an organization’s firewall.
- External data, instead, is all data that an organization does not own and that is collected from external sources.
- Open data is data acquired for free from an open-source repository, usually from the world wide web. Some examples are documents, videos and images acquired from government sources as well as non-government, not-for-profit organizations such as DBpedia, Wikipedia, DMOZ, Google and other projects.
There are many sources of big data. Here are some of the most important ones:
Internet of Things (IoT)
Data coming from the internet of things (IoT) devices is, for the most part, machine-generated data obtained from the sensors connected to them. Any device that can emit data – from webcams, to smartphones, computers, robots on a manufacturing line, and transit systems – can provide real-time information which can later be collected and extracted. Quality varies depending on the accuracy of the sensor, or the ability of the operator during manual manipulations.
Self-Service Data
Self-service data includes all the daily operations performed by ordinary people, from checking in at airports to making transactions at an ATM or paying a freeway toll. It is a huge mine of internal (and sometimes external) big data whose quality is frequently high, since it is usually unbiased. This data is typically stored in clouds that accommodate structured and unstructured data, which analysts later use to obtain real-time information and insights and improve business intelligence.
Business Transactions
Data produced as a result of business activities can be recorded in a mix of integrated traditional and modern databases. Business apps use APIs to produce structured internal data, which can be integrated with CRM systems and traditional structured, unstructured or hybrid databases. Volume is usually very high, and velocity can be just as fast, especially for larger organizations (think of a global fast-food chain recording millions of sales every second). Business transactions are usually the pulsating heart of business intelligence.
The Public Web
The public web is an external source of easily accessible open data. It is especially useful for businesses affected by fluctuating elements that do not depend on internal factors, such as currency values on stock exchanges or keyword search volumes on Google Trends. Web data is simply massive, yet extremely usable by any company that lacks the means to develop its own big data infrastructure, such as startups or small businesses. It includes public insights such as those provided by Wikipedia, open-source databases, and all data that can be drawn from social media such as Facebook, Instagram or Twitter. However, its quality is not always reliable, as qualitative aspects are much harder to measure than quantitative ones.
How Is Raw Big Data Collected and Analyzed?
Once raw data is collected, it must be stored, aggregated, processed and finally analyzed. Raw big data is stored and processed on software frameworks specialized in handling its unique mix of structured and unstructured data, and these frameworks form the basis of big data analytics. Hadoop, SAP HANA, Google F1, Facebook Presto, Cassandra, MongoDB, CouchDB and Neo4j are among the most widely used. Companies adopt any of these solutions to achieve different tasks or, more commonly, integrate two or more of them for different purposes. Let’s have a look at two of the most important ones, Hadoop and SAP HANA, and the best-known interface for integrating them, SAP HANA Vora.
A software framework specifically built to handle big data, Hadoop is used to store massive amounts of unstructured data and then digest them thanks to its superior parallel processing capabilities. Large data sets are stored in the Hadoop Distributed File System (HDFS) data lake, and then processed with a very efficient and redundant parallelized MapReduce programming model. Hadoop is a reliable and solid system which is very tolerant of both hardware and software failures. Since it’s open source and it can use any type of drive, it's also very cheap. However, it is not suited to extract information in real time and is not optimized to read small files because the block size for HDFS is typically in the 64-128 MB range.
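The MapReduce model that Hadoop parallelizes can be sketched in miniature: a map phase emits key-value pairs from each document, a shuffle phase groups values by key, and a reduce phase aggregates each group. This is a single-process simulation of the model (real Hadoop distributes each phase across nodes), using the classic word-count example:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle step: group all emitted counts by key, across every mapper."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: sum the grouped counts for each word."""
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data is big", "data is everywhere"]
mapped = chain.from_iterable(map_phase(d) for d in documents)
counts = reduce_phase(shuffle(mapped))  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

Because each map call and each reduce call is independent, the framework can run thousands of them concurrently across the cluster, which is what gives Hadoop its throughput on batch workloads.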
SAP HANA is a massively parallel processing (MPP) relational database management system known for its fast and reliable analytical reporting. By relying on in-memory, column-oriented data storage, it can store, process and retrieve big data much more quickly than Hadoop, allowing for real-time big data analytics. Albeit highly efficient and scalable, HANA is also extremely pricey, especially since it has strict hardware specifications that may cost up to $1 million even before taking software into account. To keep costs from inflating enormously, many companies choose to store the largest (or oldest) data sets on Hadoop and use HANA for processing newer data on the fly.
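Why does column orientation speed up analytical reporting? A toy comparison makes the idea concrete (the table and field names are invented for illustration; HANA's actual storage engine is far more sophisticated, with compression and vectorized execution):

```python
# Row-oriented layout: each record is stored together, so scanning one
# column still touches every full record.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 200.0},
]

# Column-oriented layout: each column is a contiguous array, which is
# what an analytical scan actually needs.
columns = {
    "order_id": [1, 2, 3],
    "region":   ["EU", "US", "EU"],
    "amount":   [120.0, 80.0, 200.0],
}

# Aggregating one column reads only that array, not the full records.
total_amount = sum(columns["amount"])
eu_total = sum(a for r, a in zip(columns["region"], columns["amount"])
               if r == "EU")
```

With millions of rows, reading a single contiguous column instead of whole records means far less memory traffic per query, which is the core of the in-memory analytics advantage.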
SAP HANA Vora
SAP HANA Vora is software that integrates the best of both worlds. Acting as a mediator between the two systems, it draws from the large unstructured data sets in Hadoop to build structured data hierarchies, integrates them with data from HANA, and then uses Apache Spark SQL to enable OLAP-style in-memory analysis. It allows processing both the “hot” structured data found in databases and the “cold” unstructured big data found in Hadoop in real time.
How Can Big Data Be Consumed?
What are the uses of big data? Industries of all kinds consume big data to make a profit, enhance their business intelligence, improve the efficiency of their processes, track consumer trends and make predictions. Some sectors, such as financial institutions, government and the public sector, started earlier than others and have already been using big data analytics for many years.
Business and Industrial Analytics
Big data can be consumed by manufacturers and companies for more proactive maintenance, reducing downtimes, monitoring the performance of their employees, improving the manufacturing line’s efficiency and identifying the top-performing product lines. Most of this data is usually collected by sensors, but the interesting thing is that it can also be generated outside of the core enterprise environment and stored in the cloud. For example, a car manufacturer may install sensors in the product line that recognize pre-failure temperature or load patterns to dispatch maintenance teams before breakage occurs. Or it may collect data sent back by smart car sensors after the vehicle has been sold, in order to have a better overview of a specific product line once it hits the road.
Feeding Artificial Intelligence
Big data is also becoming a fundamental piece in the evolution of artificial intelligence. Although big data is still “old-style computing” in a way, since it's all about collecting data rather than acting on and reacting to its results, the future of AI is strictly intertwined with its older brother. AI does, in fact, need data to be fed to its algorithms to allow its machine-learning capabilities to react and become smarter. Big data that is used to train AI, however, is highly processed, since it must be cleaned and purified from all unnecessary or duplicate information. Only by doing this are machines able to identify useful patterns reliably.
Customer Relations and Communication
Big data can be used to increase customer acquisition and improve the services offered to customers by looking at things from their point of view. By collecting data about customers’ demographics, transactions, preferences and behaviors from social media, text messages and emails, companies can better understand their customers’ needs and obtain a broader overview of where their marketing efforts should be concentrated.
Early Fraud Detection
Real-time internal and external data can be used by financial institutions and banks to detect unusual or suspicious behaviors and block fraudulent activities before they occur. For example, if a service is accessed from a remote country (say, China or South Africa when the client usually resides in the U.S.), the transaction can be declined, or the credit card blocked until confirmation is sent from a trusted device. Public authorities may also be alerted to verify the user’s identity and, if necessary, take immediate action.
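The country check described above is, at its core, a set of rules evaluated against a customer profile. A minimal rule-based sketch, assuming a profile with a `home_country` and a `typical_max` transaction size (both field names and the five-times threshold are illustrative; production systems layer statistical models on top of such rules):

```python
def flag_transaction(txn, profile):
    """Return the list of reasons a transaction looks suspicious (empty if none)."""
    reasons = []
    if txn["country"] != profile["home_country"]:
        reasons.append("access from outside home country")
    if txn["amount"] > profile["typical_max"] * 5:
        reasons.append("amount far above typical spending")
    return reasons

profile = {"home_country": "US", "typical_max": 200.0}
suspicious = flag_transaction({"country": "CN", "amount": 1500.0}, profile)
normal = flag_transaction({"country": "US", "amount": 45.0}, profile)
```

A non-empty reason list would trigger the downstream actions the text describes: declining the transaction, blocking the card, or requesting confirmation from a trusted device.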
Big Data and Privacy Issues
Privacy and security issues are a hot topic for any company that wants to work with big data. Exposure and data breaches can result in anything from embarrassment to lawsuits, especially after recent social media scandals (the Cambridge Analytica and Facebook case being the most notorious). Here is some advice for dodging the most avoidable privacy risks when dealing with big data.
The Importance of Transparency
If a company is collecting data from a group of people, it is important to be upfront with them. Full transparency must be granted at all times: subjects should know why they are being studied, where the information is drawn from and which analytic methods are used.
Data collected for predictive analytics should never be used to make determinations about the abilities of a given group, gender or minority. Decisions made using these technologies, especially when automation is involved, should never have a negative impact on individuals or generate bias. Otherwise, freedom of association may suffer, as people will try to avoid activities that could lead to unwanted categorization.
Security Must Come First
A company must ensure that the best security measures are taken at all times in order to preserve the full anonymity of all the subjects involved in the research. Anonymity should be guaranteed before data is collected, not later, so even if a breach occurs, at least privacy is protected.
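One common way to build anonymity in before collection is pseudonymization: replacing identifying fields with keyed hashes at ingestion time, so the stored records never contain raw identifiers. A minimal sketch (the secret key, field names and 16-character truncation are illustrative choices, and real deployments add key rotation and stricter handling):

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me-regularly"  # hypothetical key, kept outside the data store

def pseudonymize(record, pii_fields):
    """Replace identifying fields with keyed hashes before the record is stored."""
    safe = dict(record)
    for field in pii_fields:
        digest = hmac.new(SECRET_KEY, record[field].encode(), hashlib.sha256)
        safe[field] = digest.hexdigest()[:16]
    return safe

raw = {"email": "jane@example.com", "purchase": "laptop", "amount": 999}
stored = pseudonymize(raw, pii_fields=["email"])
```

Because the keyed hash cannot be reversed without the secret, a breach of the data store alone exposes purchase behavior but not identities, which is exactly the "protect privacy before collection" posture described above.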
The amount of big data is already massive, but it is expected to grow exponentially as new technologies such as increasingly pervasive IoT devices, drones and wearables enter the fray. Ninety percent of the big data in the world today has been generated in the last two years, and recent advancements in deep learning are playing a key role in helping businesses decode this precious goldmine of information. Big data and business analytics solutions are now mainstream technology, and together with AI and automation, they represent the foundation upon which the digital transformation process is built.