What is Schema-on-Read? Definition, Features & How it Works

What is Schema-on-Read?

Schema-on-read is an approach to data management that stores data in its native form without making it conform to a database’s existing structure or schema. The schema is only applied later when the data is read.

Key Takeaways

Schema-on-read is a way of adding new data to a repository without changing it to fit pre-set categories and definitions.
A schema is applied to the data as it is being read, allowing the same data to be used flexibly for multiple use cases.
The opposite approach is called schema-on-write, where new data is formatted into a firm schema before it can be stored.
Schema-on-read is ideal for modern analytics and machine learning applications and for ingesting large volumes of unstructured data into a data lake.
When used with unstructured data, additional security measures may be necessary.

Functionality and Features

Every database has a structure (the schema) that new data must eventually conform to – columns and rows organized into categories, definitions, dates, timestamps, demographics, transaction types, geographies, languages, and more.

With schema-on-read, the data structure is applied only when the data is read, for example at query time or in a transform step that runs after the raw data has already been loaded (the extract, load, transform, or ELT, pattern). This makes it possible to store unstructured data as-is and structure it later, when it is actually needed.

The opposite process, schema-on-write, compels data management systems to apply a database schema before data is written into the system. With unstructured data, this can be time-consuming, adding cost and complicating project management in the development of new machine learning and artificial intelligence (AI) applications.
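As a minimal sketch of the read-time idea, the Python below keeps raw records untouched and applies a schema only when they are read. The names raw_store, ingest, and read_with_schema, along with the sample fields, are hypothetical and chosen purely for illustration.

```python
import json

# Hypothetical raw store: each record is kept exactly as it arrived.
raw_store = []

def ingest(line):
    """Store the raw text as-is; no parsing or validation at write time."""
    raw_store.append(line)

def read_with_schema(schema):
    """Apply a schema (field name -> converter) while reading.

    Fields missing from a record come back as None; fields outside the
    schema are ignored rather than rejected.
    """
    rows = []
    for line in raw_store:
        record = json.loads(line)
        rows.append({
            field: convert(record[field]) if field in record else None
            for field, convert in schema.items()
        })
    return rows

ingest('{"user": "a17", "amount": "12.50", "note": "first order"}')
ingest('{"user": "b42", "amount": "3", "country": "DE"}')

# The schema is chosen by the reader, not fixed by the store.
billing_schema = {"user": str, "amount": float}
billing_rows = read_with_schema(billing_schema)
print(billing_rows)
```

Both records are accepted even though their fields differ; the differences only matter once a reader decides which fields it cares about.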

Schema-on-Read vs. Schema-on-Write

Schema-on-read

Allows data science teams to defer structuring a new data set until the time of analysis. This is ideal for managing unstructured data destined for machine learning models, as it allows greater flexibility for ad hoc queries and makes schemas easier to update over time.

Because schema-on-read doesn’t depend on up-front data modeling or impose a rigid structure at write time, it is better suited to managing huge volumes of unstructured data.

Schema-on-write

Structures data before it’s written into storage. For data sets used by a wide variety of users and use cases, this approach ensures data is stored in a consistent format and makes queries faster and simpler.

Categorizing new data into a schema before it is allowed to be stored can be slow and costly, so schema-on-write is best suited to small amounts of structured data. If large amounts of unstructured data need to be ingested, it quickly becomes untenable.
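For contrast, a schema-on-write store can be sketched as a write path that validates and converts every record before accepting it. TABLE_SCHEMA, table, and write_with_schema are hypothetical names, and the exact-match rule is an assumption made to keep the example short.

```python
# Hypothetical fixed schema: every stored row must have exactly these fields.
TABLE_SCHEMA = {"user": str, "amount": float}

table = []

def write_with_schema(record):
    """Validate and convert a record before it is stored."""
    if set(record) != set(TABLE_SCHEMA):
        raise ValueError(f"record fields {sorted(record)} do not match schema")
    table.append(
        {name: convert(record[name]) for name, convert in TABLE_SCHEMA.items()}
    )

write_with_schema({"user": "a17", "amount": "12.50"})

try:
    # An extra field is rejected at write time instead of being stored.
    write_with_schema({"user": "b42", "amount": "3", "country": "DE"})
    rejected = False
except ValueError:
    rejected = True

print(table, rejected)
```

The stored rows are consistent and cheap to query, but any record that does not fit has to be reshaped or dropped before it can be stored.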

Schema-on-Read and Machine Learning

Most datasets used by large organizations are shared assets that different departments apply to a variety of known use cases. That requires the data to be in a one-size-fits-all schema that can be understood by a wide variety of users and applications.

Machine learning workloads often require something different. Because their use cases are highly specific, they tend to benefit from raw, unstructured data in which new patterns and unexpected relationships can still be found. If a data set has been transformed during the ingestion process to fit an existing schema, it may become less valuable as a machine learning resource.

Schema-on-read is better suited to the data preparation needs of machine learning models because it allows data scientists great flexibility in schema design. It also accommodates faster and less costly ingestion of large volumes of unstructured data into data lakes.

Schema-on-Read Example

Amazon Athena is one example of a data management tool that relies on schema-on-read. Used in conjunction with an Amazon S3 data lake, Athena can run ad hoc SQL queries directly against the files stored in S3, without first loading them into a separate database or transforming them.

Unstructured, semi-structured, and structured data sets can all be processed without imposing a schema when the data is written, including comma-separated values (CSV), JSON, and columnar data formats. It’s worth noting that structured data formats typically benefit from having a defined schema to ensure consistency.
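The idea of reading several raw formats through one schema can be illustrated in plain Python. This is an analogy rather than Athena's own interface: the sample data, field names, and helper functions below are all hypothetical.

```python
import csv
import io
import json

# Hypothetical raw files, shown as strings so the example is self-contained.
csv_raw = "trip_id,fare\n1,9.75\n2,14.00\n"
json_raw = '{"trip_id": 3, "fare": "7.20"}\n{"trip_id": 4, "fare": 11}\n'

# Read-time schema: field name -> converter.
schema = {"trip_id": int, "fare": float}

def apply_schema(record):
    """Convert one raw record into a typed row."""
    return {name: convert(record[name]) for name, convert in schema.items()}

def read_csv(text):
    return [apply_schema(row) for row in csv.DictReader(io.StringIO(text))]

def read_json_lines(text):
    return [apply_schema(json.loads(line)) for line in text.splitlines() if line]

# Both formats end up as rows with the same types, without rewriting the files.
trips = read_csv(csv_raw) + read_json_lines(json_raw)
total_fare = sum(trip["fare"] for trip in trips)
print(trips)
```

The stored text is never modified; the type conversion happens each time the data is read, which is also where the extra read-time cost comes from.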

Schema-on-Read Use Cases

Using the above example in the context of a ride-hailing app, transactional data about each taxi trip can be loaded into the S3 data lake in real-time. The data captured is the same each time. However, developers decide to capture additional weather details for each trip in a later update.

Before the update, the app captured only generic details such as rain, snow, sleet, or sunshine. After the update, it also captures the time of day, the precipitation level, and how long the precipitation lasted.

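With a schema-on-read process, the new fields are stored as they arrive, with no change to the data lake required at ingestion. They become available as soon as a schema that includes them is applied at read time, while existing queries that ignore them keep working.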
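A minimal sketch of the ride-hailing scenario, assuming trips are stored as JSON lines: the field names (weather, precip_mm, precip_minutes) and the read_trips helper are hypothetical, invented for this example.

```python
import json

# Hypothetical trip records as stored in the lake; field names are assumptions.
raw_trips = [
    # Recorded before the app update.
    '{"trip_id": 1, "fare": 9.75, "weather": "rain"}',
    # Recorded after the app update, with extra weather detail.
    '{"trip_id": 2, "fare": 14.0, "weather": "snow", '
    '"precip_mm": 4.2, "precip_minutes": 35}',
]

def read_trips(fields):
    """Project each raw record onto the requested fields at read time."""
    return [
        {name: record.get(name) for name in fields}
        for record in map(json.loads, raw_trips)
    ]

# The original query keeps working and never sees the new fields.
basic = read_trips(["trip_id", "weather"])

# A newer query opts in to the added detail; older trips yield None.
detailed = read_trips(["trip_id", "weather", "precip_mm", "precip_minutes"])
print(basic)
print(detailed)
```

Nothing about the stored records had to change when the app started sending more detail; only the readers that want the new fields need to know about them.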

Schema-on-Read Security

Cybersecurity risks are similar for both schema-on-read and schema-on-write.

Because schema-on-read often handles large quantities of unstructured data with greater potential for containing personally identifiable information (PII), it may need additional tools and processes to ensure data privacy.

Any data management system should have strong governance capabilities, strict access controls, audit trails, and the ability to conduct comprehensive audits of users and activity.

Schema-on-Read Pros and Cons

Pros

Allows faster ingestion of data than its sister approach, schema-on-write
Scales up rapidly as data volumes grow
Allows users with different needs to define schemas that are specific to a particular use case
Allows data used for a variety of use cases to be stored in a single repository

Cons

May require more computing power for processing large, unstructured datasets during reads
Used with unstructured data, it may need additional security measures

The Bottom Line

Schema-on-read applies a schema to data at the moment it is read rather than when it is stored. As a result, database tables with different schemas can be populated from the same raw data.
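To illustrate that last point, the sketch below fills two differently shaped tables from one set of raw records, using Python's built-in sqlite3 module with an in-memory database. The table names, column names, and sample events are hypothetical.

```python
import json
import sqlite3

# Hypothetical raw events; field names are assumptions for illustration.
raw_events = [
    '{"user_id": "a17", "amount": 12.5, "country": "DE"}',
    '{"user_id": "b42", "amount": 3.0, "country": "FR"}',
]
records = [json.loads(line) for line in raw_events]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE billing (user_id TEXT, amount REAL)")
conn.execute("CREATE TABLE geography (country TEXT)")

# Each table reads the same raw records through its own schema.
conn.executemany(
    "INSERT INTO billing VALUES (?, ?)",
    [(r["user_id"], r["amount"]) for r in records],
)
conn.executemany(
    "INSERT INTO geography VALUES (?)",
    [(r["country"],) for r in records],
)

billing = conn.execute(
    "SELECT user_id, amount FROM billing ORDER BY user_id"
).fetchall()
countries = conn.execute(
    "SELECT country FROM geography ORDER BY country"
).fetchall()
print(billing, countries)
```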

The practical advantage of this approach is that it leaves the underlying data in its original state. Data transformation happens on demand, which makes schema-on-read an effective data ingestion approach for machine learning and modern analytics applications.