What is Schema-on-Read?
Schema-on-read is an approach to data management that stores data in its native form without making it conform to a database's existing structure or schema. The schema is applied only later, when the data is read.
With the growth of data lakes and the large quantities of unstructured data they store, schema-on-read has become an enabler of new data science use cases and machine learning (ML) models. Because it allows data to be categorized after ingestion, data teams gain more flexibility to store and analyze different types of raw information quickly and cost-effectively.
Schema-on-read is the opposite of schema-on-write, which applies a schema before data is ingested into a data lake or other large repository.
Key Takeaways
- Schema-on-read is a way of adding new data to a repository without changing it to fit pre-set categories and definitions.
- A schema can be applied to data as it is being read, allowing the same data to be used flexibly for multiple use cases.
- The opposite approach is called schema-on-write, where new data is formatted into a firm schema before it can be stored.
- Schema-on-read is ideal for modern analytics and machine learning applications and for ingesting large volumes of unstructured data into a data lake.
- When schema-on-read is used with unstructured data, additional security measures may be necessary.
Functionality and Features
Every database has a structure (the schema) that new data must eventually conform to – columns and rows organized into categories, definitions, dates, timestamps, demographics, transaction types, geographies, languages, and more.
With schema-on-read, the data structure is applied only during extract, transform, load (ETL), the three-step process by which a database ‘reads’ new data and combines it with what’s already there. This makes it possible to store unstructured data in a repository and structure it later, when it is actually needed.
The opposite process, schema-on-write, compels data management systems to apply a database schema before data is written into the system. With unstructured data, this can be time-consuming, adding cost and complicating project management in the development of new machine learning and artificial intelligence (AI) applications.
Schema-on-Read vs. Schema-on-Write
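The difference between the two approaches described above can be sketched in a few lines of Python. This is a minimal illustration, not the behavior of any particular database: the schema, the field names, and the helper functions (write_with_schema, ingest_raw, read_with_schema) are all hypothetical.

```python
import json

# Hypothetical schema: column name -> type used to cast the value.
TRIP_SCHEMA = {"trip_id": int, "fare": float, "city": str}


def write_with_schema(table, record, schema):
    """Schema-on-write: coerce and validate before the record is stored."""
    row = {}
    for column, cast in schema.items():
        if column not in record:
            raise ValueError(f"missing column: {column}")
        row[column] = cast(record[column])
    table.append(row)  # only conforming rows reach storage


def ingest_raw(store, line):
    """Schema-on-read: store the line exactly as it was received."""
    store.append(line)


def read_with_schema(store, schema):
    """Apply the schema only now, while the raw lines are being read."""
    for line in store:
        record = json.loads(line)
        yield {col: cast(record[col]) if col in record else None
               for col, cast in schema.items()}


raw_store = []
ingest_raw(raw_store, '{"trip_id": "7", "fare": "12.50", "city": "Oslo"}')
# This record has no city and carries an extra field; it is stored anyway.
ingest_raw(raw_store, '{"trip_id": "8", "fare": "9.00", "tip": "1.50"}')

rows = list(read_with_schema(raw_store, TRIP_SCHEMA))
```

In this sketch the second record would be rejected by write_with_schema because it lacks a city, whereas the schema-on-read path keeps the raw line untouched and fills the missing column with None only when the data is read.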
Schema-on-Read and Machine Learning
Most datasets used by large organizations are shared assets that different departments apply to a variety of known use cases. That requires the data to be in a one-size-fits-all schema that can be understood by a wide variety of users and applications.
Machine learning algorithms require something different. Because their use cases are highly specific, they need raw, unstructured data in order to uncover new patterns and unexpected relationships. If a dataset has been transformed during the ingestion process to fit an existing schema, it may become less valuable as a machine learning resource.
Schema-on-read is better suited to the data preparation needs of machine learning models because it allows data scientists great flexibility in schema design. It also accommodates faster and less costly ingestion of large volumes of unstructured data into data lakes.
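As a rough illustration of that flexibility, the sketch below reads the same raw records through two different schemas: one for general reporting and one that derives features for a model. The sample events, the column names, and the read_as helper are assumptions made up for this example.

```python
import json

# Raw events kept exactly as ingested (hypothetical sample data).
RAW_EVENTS = [
    '{"user": "u1", "amount": "20.00", "ts": "2024-03-01T08:15:00", "note": "late fee"}',
    '{"user": "u2", "amount": "5.50", "ts": "2024-03-01T21:40:00", "note": ""}',
]

# Each schema maps an output column to a function of the raw record.
REPORTING_SCHEMA = {
    "user": lambda r: r["user"],
    "amount": lambda r: float(r["amount"]),
}
ML_FEATURE_SCHEMA = {
    "amount": lambda r: float(r["amount"]),
    "hour_of_day": lambda r: int(r["ts"][11:13]),  # characters 11-12 hold the hour
    "has_note": lambda r: bool(r["note"]),
}


def read_as(raw_lines, schema):
    """Project each raw record through the schema chosen by the reader."""
    records = (json.loads(line) for line in raw_lines)
    return [{col: derive(rec) for col, derive in schema.items()} for rec in records]


report_rows = read_as(RAW_EVENTS, REPORTING_SCHEMA)
feature_rows = read_as(RAW_EVENTS, ML_FEATURE_SCHEMA)
```

Because the raw events are never rewritten, a data scientist could add or change a feature column later by editing only the schema used at read time.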
Schema-on-Read Example
Amazon Athena is one example of a data management tool that relies on schema-on-read. Used in conjunction with an Amazon S3 data lake, Athena can run ad hoc queries directly against the data stored in S3 without first having to aggregate it or load it into a separate database.
Unstructured, semi-structured, and structured datasets, including comma-separated values (CSV), JSON, and columnar data formats, can all be stored without imposing a schema at ingestion. It’s worth noting that structured data formats typically benefit from having a defined schema to ensure consistency.
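The idea of reading different file formats through one schema can be sketched in plain Python. This is not Athena's API or query syntax; it is a small stand-in that uses only the standard library, and the schema, sample data, and function names are hypothetical.

```python
import csv
import io
import json

# Hypothetical schema applied at read time: column -> cast function.
SCHEMA = {"trip_id": int, "distance_km": float}

# Two raw sources in different formats, stored as they arrived.
CSV_TEXT = "trip_id,distance_km\n1,3.2\n2,10.5\n"
JSON_LINES = '{"trip_id": 3, "distance_km": "4.0"}\n'


def apply_schema(record, schema):
    """Cast the columns named in the schema; ignore everything else."""
    return {col: cast(record[col]) for col, cast in schema.items()}


def read_csv(text, schema):
    return [apply_schema(rec, schema) for rec in csv.DictReader(io.StringIO(text))]


def read_json_lines(text, schema):
    return [apply_schema(json.loads(line), schema)
            for line in text.splitlines() if line]


rows = read_csv(CSV_TEXT, SCHEMA) + read_json_lines(JSON_LINES, SCHEMA)
```

Both sources end up as rows with the same typed columns, even though neither file was converted or restructured when it was stored.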
Schema-on-Read Use Cases
Using the above example in the context of a ride-hailing app, transactional data about each taxi trip can be loaded into the S3 data lake in real time. At first, the data captured is the same for every trip. In a later update, however, the developers decide to capture additional weather details for each trip.
Before the update, the app captured only generic details such as rain, snow, sleet, or sunshine. After the update, it also captures the time of day, the level of precipitation, and how long the precipitation lasted.
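With a schema-on-read process, the new fields are simply stored with each record as it is ingested. No existing schema has to be changed first, and queries that need the new details can include them when the data is read.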
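The ride-hailing scenario above might look like the following in Python. The field names (time_of_day, precipitation_mm, precipitation_minutes) and the read_trips helper are invented for illustration; the source does not specify how the app names or stores these details.

```python
import json

# Raw trip records as they might land in the data lake (hypothetical field names).
RAW_TRIPS = [
    '{"trip_id": 101, "weather": "rain"}',  # captured before the app update
    '{"trip_id": 102, "weather": "rain", "time_of_day": "evening", '
    '"precipitation_mm": 4.2, "precipitation_minutes": 35}',  # after the update
]

# Columns the reader wants, including the fields added by the update.
WEATHER_COLUMNS = [
    "trip_id",
    "weather",
    "time_of_day",
    "precipitation_mm",
    "precipitation_minutes",
]


def read_trips(raw_lines, columns):
    """Return one row per trip; fields a record never captured come back as None."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append({col: record.get(col) for col in columns})
    return rows


trips = read_trips(RAW_TRIPS, WEATHER_COLUMNS)
```

Older trips remain readable alongside newer ones: the reader sees None for details that were not captured before the update, and nothing had to be migrated at ingestion time.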
Schema-on-Read Security
Cybersecurity risks are similar for both schema-on-read and schema-on-write.
Because schema-on-read often handles large quantities of unstructured data with greater potential for containing personally identifiable information (PII), it may need additional tools and processes to ensure data privacy.
Any data management system should have strong governance capabilities, strict access controls, audit trails, and the ability to conduct comprehensive audits of users and activity.
Schema-on-Read Pros and Cons
Pros
- Allows faster ingestion of data than its sister approach, schema-on-write
- Scales up rapidly as data volumes grow
- Allows users with different needs to define schemas that are specific to a particular use case
- Allows data used for a variety of use cases to be stored in a single repository
Cons
- May require more computing power for processing large, unstructured datasets during reads
- Used with unstructured data, it may need additional security measures
The Bottom Line
Schema-on-read applies a database schema to data when it’s being read. By definition, that means database tables with different schemas can be populated with the same raw data.
The practical advantage of this approach is that it leaves the underlying data in its original, unsullied state. Because data transformation happens on demand, schema-on-read is well suited to data ingestion for machine learning and modern analytics applications.