What Does Data Lake Mean?
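A data lake is a centralized storage repository for large volumes of structured and unstructured data. A data lake has a flat architecture: instead of a hierarchy of files and folders, it uses object storage, in which each piece of data is stored as an object identified by a unique key and described by metadata.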
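As a rough illustration of what "flat" means here, the sketch below models object storage as a single namespace of keys. The FlatObjectStore class, its method names and the sample keys are all hypothetical and exist only for this example; real object stores expose their own APIs.

```python
# Toy in-memory stand-in for an object store (hypothetical, for illustration only).
class FlatObjectStore:
    def __init__(self):
        # One flat namespace: every object is addressed by a single key.
        self._objects = {}

    def put(self, key, data, metadata=None):
        # Store the raw bytes together with a copy of their descriptive metadata.
        self._objects[key] = (data, dict(metadata or {}))

    def get(self, key):
        # Return the raw bytes exactly as they were stored.
        return self._objects[key][0]

    def list_keys(self, prefix=""):
        # "Folders" are only a naming convention: listing is a prefix match on keys.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = FlatObjectStore()
store.put("sales/2024/orders.csv", b"id,amount\n1,9.50\n", {"format": "csv"})
store.put("sales/2024/clicks.json", b'{"user": "a", "page": "/home"}', {"format": "json"})
store.put("images/logo.png", b"\x89PNG...", {"format": "png"})

# Structured, semi-structured and binary data sit side by side under plain keys.
sales_keys = store.list_keys("sales/")
```

The slashes in the keys look like directories, but nothing in the store treats them as such. That is the sense in which the architecture is flat.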
Data lakes play an important role in helping data scientists visualize and analyze data from disparate sources in its native format. In data science, this is especially important when the scope of the data -- and its uses -- may not yet be fully known.
Although data lakes offer strong data access benefits, they require a management component to help users find the most relevant data, understand relationships and integrate heterogeneous data sources. Popular data lake platforms include:
- CoreLAKE -- a commercial, off-the-shelf (COTS) data lake platform for healthcare organizations.
- Qubole -- an open data lake platform for machine learning and ad hoc analytics.
- Azure Data Lake -- built on Hadoop YARN and optimized for the cloud.
- AWS Lake Formation -- allows users to access a centralized data catalog that describes available data sets and their appropriate usage.
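The management component described above can be pictured as a catalog that records what each data set is and how to find it. The following is a minimal sketch under stated assumptions: CatalogEntry, DataCatalog and the sample entries are hypothetical names invented for this example, not the API of any platform listed here.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    # Hypothetical record describing one data set stored in the lake.
    key: str
    data_format: str
    description: str
    tags: set = field(default_factory=set)


class DataCatalog:
    # Hypothetical catalog: maps storage keys to descriptive entries.
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.key] = entry

    def search(self, tag):
        # Return the keys of every data set carrying the given tag.
        return sorted(e.key for e in self._entries.values() if tag in e.tags)


catalog = DataCatalog()
catalog.register(CatalogEntry("sales/2024/orders.csv", "csv", "Order lines", {"sales", "finance"}))
catalog.register(CatalogEntry("sales/2024/clicks.json", "json", "Web click events", {"sales", "web"}))
catalog.register(CatalogEntry("hr/staff.parquet", "parquet", "Staff directory", {"hr"}))

# Users find relevant data by what it describes, not by where it happens to be stored.
sales_datasets = catalog.search("sales")
```

A production catalog would also track lineage, ownership and access rules; this sketch covers only the "find the most relevant data" part.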
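A data lake may also be referred to as a schema-agnostic or schema-less data repository, because data is stored without a predefined schema and structure is applied only when the data is read.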
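One common way to work with a schema-less repository is to apply a schema only at read time, an approach often called schema-on-read. The sketch below assumes two raw objects in different native formats; the object names, the read_amounts helper and the (id, amount) schema are hypothetical and chosen only for illustration.

```python
import csv
import io
import json

# Raw objects kept in their native formats, exactly as they arrived (sample data).
raw_objects = {
    "orders.csv": b"id,amount\n1,9.50\n2,20.00\n",
    "refunds.json": b'[{"id": 7, "amount": "3.25", "reason": "damaged"}]',
}


def read_amounts(name, raw):
    """Apply a schema (id: int, amount: float) only when the data is read."""
    text = raw.decode("utf-8")
    if name.endswith(".csv"):
        rows = csv.DictReader(io.StringIO(text))
    elif name.endswith(".json"):
        rows = json.loads(text)
    else:
        raise ValueError(f"no reader for {name}")
    # Fields outside the schema (such as "reason") are ignored, not rejected.
    return [(int(row["id"]), float(row["amount"])) for row in rows]


amounts = {name: read_amounts(name, raw) for name, raw in raw_objects.items()}
```

Nothing was validated or reshaped when the objects were stored. A different analysis could read the same bytes with a different schema, which is useful when the eventual uses of the data are not yet known.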