Schema-on-Read

What is Schema-on-Read?

Schema-on-read is an approach to data management that stores data in its native form without making it conform to a database’s existing structure or schema. The schema is only applied later when the data is read.

With the growth of data lakes and the large quantities of unstructured data they store, schema-on-read has become an enabler of new data science use cases and machine learning (ML) models. Because it allows data to be categorized after ingestion, data teams gain more flexibility to store and analyze different types of raw information quickly and cost-effectively.

Schema-on-read is the opposite of schema-on-write, which applies a schema before data is ingested into a data lake or other large repository.

Key Takeaways

  • Schema-on-read is a way of adding new data to a repository without changing it to fit pre-set categories and definitions.
  • A schema is applied to the data as it is being read, allowing the same data to be used flexibly for multiple use cases.
  • The opposite approach is called schema-on-write, where new data is formatted into a firm schema before it can be stored.
  • Schema-on-read is ideal for modern analytics and machine learning applications and for ingesting large volumes of unstructured data into a data lake.
  • When used with unstructured data, additional security measures may be necessary.

Functionality and Features

Every database has a structure (the schema) that new data must eventually conform to – columns and rows organized into categories, definitions, dates, timestamps, demographics, transaction types, geographies, languages, and more.

With schema-on-read, the data structure is applied only when the data is read, whether by a query or by a transformation job that runs after the raw data has been loaded. This reverses the order of a traditional extract, transform, load (ETL) pipeline, which transforms data before it is stored. It makes it possible to store unstructured data as-is and structure it later, when it is actually needed.

The opposite process, schema-on-write, compels data management systems to apply a database schema before data is written into the system. With unstructured data, this can be time-consuming, adding cost and complicating project management in the development of new machine learning and artificial intelligence (AI) applications.
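The schema-on-read side of this contrast can be sketched in a few lines of Python using only the standard library. The record fields (trip_id, fare, city) and the read_with_schema helper are hypothetical names chosen for illustration; the point is only that raw lines are stored untouched and a schema is applied when they are read.

```python
import json

# Raw records are stored exactly as they arrive: no validation, no casting.
raw_store = [
    '{"trip_id": "17", "fare": "12.50", "city": "Toronto"}',
    '{"trip_id": "18", "fare": "8.00", "note": "airport run"}',
]

# The schema lives with the reader, not the store: field name -> type caster.
# (Hypothetical schema, for illustration only.)
trip_schema = {"trip_id": int, "fare": float, "city": str}


def read_with_schema(lines, schema):
    """Parse each raw line and project it onto the schema at read time."""
    for line in lines:
        record = json.loads(line)
        yield {
            # Fields absent from a record become None; extra fields are ignored.
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        }


rows = list(read_with_schema(raw_store, trip_schema))
print(rows)
```

The second record lacks a city and carries an extra note field, yet it is stored without complaint; the reader decides how to treat both cases.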

Schema-on-Read vs. Schema-on-Write

Schema-on-read
Lets data science teams defer the structuring of a new data set until the time of analysis. This is ideal for managing unstructured data destined for machine learning models, as it allows greater flexibility for ad hoc queries and makes schemas easier to update over time.

Because schema-on-read doesn’t rely on data modelers or create a rigid database, it is better suited to managing huge volumes of unstructured data.

Schema-on-write
Structures data before it’s written into storage. For data sets used by a wide variety of users and use cases, this approach ensures data is stored in a consistent format and makes queries faster and simpler.

Categorizing new data into a schema before it is allowed to be stored can be slow and costly. Schema-on-write is best suited to small amounts of structured data. If large amounts of unstructured data need to be ingested, it quickly becomes untenable.
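For contrast, schema-on-write can be sketched with Python's built-in sqlite3 module. This is an illustrative example under assumed names (the trips table and its columns are hypothetical): the table structure is fixed before any data arrives, so a record that does not fit is rejected at write time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is defined up front, before any data is ingested.
conn.execute(
    "CREATE TABLE trips ("
    "trip_id INTEGER PRIMARY KEY, fare REAL NOT NULL, city TEXT NOT NULL)"
)

incoming = [
    {"trip_id": 17, "fare": 12.5, "city": "Toronto"},
    {"trip_id": 18, "fare": 8.0},  # missing city: does not fit the schema
]

accepted, rejected = 0, 0
for record in incoming:
    params = {
        "trip_id": record.get("trip_id"),
        "fare": record.get("fare"),
        "city": record.get("city"),
    }
    try:
        conn.execute(
            "INSERT INTO trips (trip_id, fare, city) "
            "VALUES (:trip_id, :fare, :city)",
            params,
        )
        accepted += 1
    except sqlite3.IntegrityError:
        # The NOT NULL constraint blocks the write until the record is fixed.
        rejected += 1

stored = conn.execute("SELECT COUNT(*) FROM trips").fetchone()[0]
print(accepted, rejected, stored)
```

Everything that reaches the table is consistent and quick to query, but every nonconforming record has to be repaired or discarded before it can be stored.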

Schema-on-Read and Machine Learning

Most datasets used by large organizations are shared assets that different departments apply to a variety of known use cases. That requires the data to be in a one-size-fits-all schema that can be understood by a wide variety of users and applications.

Machine learning algorithms require something different. Because their use cases are highly specific, they need raw, unstructured data in order to uncover new patterns and unexpected relationships. If a data set has been transformed during the ingestion process to fit an existing schema, it may become less valuable as a machine learning resource.

Schema-on-read is better suited to the data preparation needs of machine learning models because it allows data scientists great flexibility in schema design. It also accommodates faster and less costly ingestion of large volumes of unstructured data into data lakes.
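As a rough illustration of that flexibility, the sketch below reads the same raw events through two different schemas: one for conventional reporting and one that derives model features. All field names and both view functions are hypothetical.

```python
import json

# The same raw events serve both readers; nothing is transformed at ingestion.
raw_events = [
    '{"trip_id": 1, "fare": "12.50", "pickup_hour": "8", "driver_note": "heavy traffic"}',
    '{"trip_id": 2, "fare": "30.00", "pickup_hour": "23", "driver_note": ""}',
]


def reporting_view(record):
    """Schema for shared reporting: a small set of familiar, typed columns."""
    return {"trip_id": int(record["trip_id"]), "fare": float(record["fare"])}


def feature_view(record):
    """Schema for a model: features derived from fields reporting ignores."""
    hour = int(record["pickup_hour"])
    return {
        "fare": float(record["fare"]),
        "is_night": hour >= 22 or hour < 6,
        "has_note": bool(record["driver_note"]),
    }


records = [json.loads(line) for line in raw_events]
report_rows = [reporting_view(r) for r in records]
feature_rows = [feature_view(r) for r in records]
print(report_rows)
print(feature_rows)
```

Had the events been reduced to the reporting columns at ingestion, the pickup hour and driver note would no longer be available to the feature view.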

Schema-on-Read Example

Amazon Athena is one example of a data management tool that relies on schema-on-read. Used in conjunction with an Amazon S3 data lake, Athena can run ad hoc queries directly against the data stored in S3, without first having to aggregate it or load it into a separate database.

Unstructured, semi-structured, and structured data sets can all be processed without imposing a schema up front, including comma-separated values (CSV), JSON, and columnar data formats. It’s worth noting that structured data formats typically benefit from having a defined schema to ensure consistency.
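The idea of querying files in different formats through one read-time schema can be illustrated in plain Python. This is a standard-library stand-in, not Athena's API; the in-memory strings stand in for stored files, and the field names and helper functions are assumptions made for the example.

```python
import csv
import io
import json

# Two raw "files" in different formats, stored as-is.
csv_raw = "trip_id,fare\n1,12.50\n2,8.00\n"
json_raw = '{"trip_id": 3, "fare": "15.25"}\n{"trip_id": 4, "fare": 6}\n'

# One schema, declared by the reader and applied to both formats.
schema = {"trip_id": int, "fare": float}


def apply_schema(record, schema):
    """Cast the schema's fields; anything else in the record is ignored."""
    return {name: cast(record[name]) for name, cast in schema.items()}


def read_csv(text, schema):
    return [apply_schema(row, schema) for row in csv.DictReader(io.StringIO(text))]


def read_json_lines(text, schema):
    return [apply_schema(json.loads(line), schema) for line in text.splitlines() if line]


trips = read_csv(csv_raw, schema) + read_json_lines(json_raw, schema)
total_fare = sum(trip["fare"] for trip in trips)
print(len(trips), total_fare)
```

Neither file was rewritten to match the other; the differences in format and typing are resolved only at the moment the query runs.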

Schema-on-Read Use Cases

Using the above example in the context of a ride-hailing app, transactional data about each taxi trip can be loaded into the S3 data lake in real time. At first, the same fields are captured for every trip. In a later update, however, developers decide to capture additional weather details for each trip.

Before the update, the app captured only generic details such as rain, snow, sleet, or sunshine. After the update, details are captured about times of day, precipitation levels experienced, and for what duration.

With a schema-on-read process, the new fields are simply stored with each record as it is ingested. No table has to be altered first, and queries can begin reading the new fields whenever they are needed.
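A small sketch of this scenario, with hypothetical field names (precipitation_mm, rain_minutes) standing in for the new weather details: records written before and after the update sit side by side, and the reader decides how to handle the fields that older records lack.

```python
import json

raw_trips = [
    # Written before the app update: only a generic weather label.
    '{"trip_id": 1, "weather": "rain"}',
    # Written after the update: extra detail, stored without any table change.
    '{"trip_id": 2, "weather": "rain", "precipitation_mm": 4.2, "rain_minutes": 35}',
]


def read_trip(line):
    """Apply the newer schema at read time; older records yield None."""
    record = json.loads(line)
    return {
        "trip_id": record["trip_id"],
        "weather": record["weather"],
        "precipitation_mm": record.get("precipitation_mm"),
        "rain_minutes": record.get("rain_minutes"),
    }


trips = [read_trip(line) for line in raw_trips]
print(trips)
```

Older trips remain readable under the new schema, with the missing details reported as None rather than blocking ingestion or requiring a backfill.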

Schema-on-Read Security

Cybersecurity risks are similar for both schema-on-read and schema-on-write.

Because schema-on-read often handles large quantities of unstructured data with greater potential for containing personally identifiable information (PII), it may need additional tools and processes to ensure data privacy.

Any data management system should have strong governance capabilities, strict access controls, audit trails, and the ability to conduct comprehensive audits of users and activity.

Schema-on-Read Pros and Cons

Pros

  • Allows faster ingestion of data than its sister approach, schema-on-write
  • Scales up rapidly as data volumes grow
  • Allows users with different needs to define schemas that are specific to a particular use case
  • Allows data used for a variety of use cases to be stored in a single repository

Cons

  • May require more computing power for processing large, unstructured datasets during reads
  • Used with unstructured data, it may need additional security measures

The Bottom Line

Schema-on-read applies a database schema to data when it’s being read. By definition, that means database tables with different schemas can be populated with the same raw data.

The practical advantage of this approach is that it leaves the underlying data in its original, unaltered state. Data transformation happens on demand, which makes schema-on-read well suited to data ingestion for machine learning and modern analytics applications.

Mark De Wolf
Technology Journalist

Mark is a freelance tech journalist covering software, cybersecurity, and SaaS. His work has appeared in Dow Jones, The Telegraph, SC Magazine, Strategy, InfoWorld, Redshift, and The Startup. He graduated from the Ryerson University School of Journalism with honors where he studied under senior reporters from The New York Times, BBC, and Toronto Star, and paid his way through uni as a jobbing advertising copywriter. In addition, Mark has been an external communications advisor for tech startups and scale-ups, supporting them from launch to successful exit. Success stories include SignRequest (acquired by Box), Zeigo (acquired by Schneider Electric), Prevero (acquired…