What Does Data Lakehouse Mean?
A data lakehouse is a unified storage architecture that combines the cost benefits of a data lake with the analytic benefits of a data warehouse.
An important purpose of a data lakehouse is to make it easier for machine learning engineers (MLEs) to use the same large data sets for different types of artificial intelligence (AI) workloads.
A data lakehouse architecture has five layers:
- Ingestion layer – pulls structured and unstructured data from a variety of sources.
- Storage layer – keeps all data at rest as objects in a single, low-cost repository, typically cloud object storage.
- Metadata layer – locates specific storage objects and applies a schema on read (see the sketch after this list).
- Application programming interface (API) layer – helps applications determine which data items are required to complete a particular task and how to retrieve them.
- Consumption layer – provides support for analytics and reporting.
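To make "schema on read" concrete, here is a minimal sketch in Python using PyArrow. The file name events.parquet and the column names are hypothetical, and in a real lakehouse the object would sit in cloud object storage rather than on a local disk:

```python
# Minimal schema-on-read sketch using PyArrow.
# File and column names are hypothetical, for illustration only.
import pyarrow.parquet as pq

# In a lakehouse, the object lives in low-cost storage; no schema is
# enforced when the file is written into the lake.
parquet_file = pq.ParquetFile("events.parquet")  # assumed local path for the demo

# The metadata layer's job: discover the schema at read time, not at load time.
print(parquet_file.schema_arrow)       # column names and types, from file metadata
print(parquet_file.metadata.num_rows)  # row count, also from metadata alone

# Read only the columns a given workload needs (column pruning).
table = parquet_file.read(columns=["user_id", "event_type"])
print(table.num_rows)
```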
Techopedia Explains Data Lakehouse
A data lakehouse allows the same unified storage layer to be used for multiple purposes — including predictive analytics, prescriptive analytics, deep learning and reporting.
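As an illustration of that point, the hypothetical sketch below runs a reporting-style aggregate and a simple machine-learning feature extraction against the same Parquet object using DuckDB. The file and column names are invented for the example, and DuckDB stands in for whatever query engine sits in the consumption layer:

```python
# One storage object, two workloads: reporting and ML feature preparation.
# File and column names are hypothetical; DuckDB is an illustrative choice.
import duckdb

# Workload 1: reporting - aggregate directly over the Parquet object with SQL.
report = duckdb.sql(
    "SELECT event_type, COUNT(*) AS n FROM 'events.parquet' GROUP BY event_type"
).df()
print(report)

# Workload 2: machine learning - derive a per-user feature from the very same
# object, with no copy into a separate warehouse.
features = duckdb.sql(
    "SELECT user_id, COUNT(*) AS events_per_user FROM 'events.parquet' GROUP BY user_id"
).df()
print(features.head())
```

The design choice this demonstrates is that both workloads read the same object in place, which is what lets a lakehouse avoid maintaining separate copies of the data for analytics and for machine learning.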
This emerging architecture uses metadata to combine the flexibility of a data lake with the benefits of a data warehouse. Popular data lakehouse vendors include:
Cloudera – an open source, open standards-based data lakehouse built on the Apache Iceberg open table format.
Databricks – the Databricks Lakehouse Platform can be delivered and managed as a service on AWS, Microsoft Azure and Google Cloud.
Dremio – provides fully managed services designed to help customers experiment with a lakehouse architecture at a lower total cost of ownership (TCO).
Snowflake – integrates subject-specific data marts, data warehouses and data lakes into a single source of truth (SSOT) that can be used to power different types of workloads.