A data lake is a type of data repository that stores large and varied sets of raw data in its native format. Data lakes let you keep an unrefined view of your data. They are becoming a more common data management strategy for enterprises that want a single, large repository for all of their data.
Raw data is data that hasn’t yet been processed for a specific purpose. Data in a data lake isn’t defined until it is queried. Data scientists can access the raw data when they need it using more advanced analytics tools or predictive modeling.
All data is kept when using a data lake; none of it is removed or filtered prior to storage. The data might be used for analysis soon, in the future, or never at all. The same data can also be used many times for different purposes. By contrast, data that has already been refined for one specific purpose is difficult to reuse in a different way.
The term "data lake" was introduced by James Dixon, Chief Technology Officer of Pentaho. Describing this type of data repository as a lake makes sense because it stores a pool of data in its natural state, like a body of water that hasn’t been filtered or packaged. Data flows from multiple sources into the lake and is stored in its original format.
Data in a data lake isn’t transformed until it is needed for analysis. At that point, a schema is applied so the data can be analyzed. This is called "schema on read," because data is kept raw until it is ready to be used.
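The idea of schema on read can be illustrated with a minimal Python sketch. Everything named here (RAW_EVENTS, PURCHASE_SCHEMA, read_with_schema) is hypothetical and made up for illustration; it does not reflect any particular data lake product. The raw records are stored untouched, and a schema is only projected onto them when a reader asks a question.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (no schema enforced).
RAW_EVENTS = [
    '{"user": "ana", "amount": "19.99", "ts": "2024-01-05"}',
    '{"user": "bo", "amount": "5", "extra": {"coupon": "X1"}}',
    '{"user": "cy"}',
]

# A schema chosen by the reader at query time: field name -> (cast, default).
PURCHASE_SCHEMA = {"user": (str, ""), "amount": (float, 0.0)}


def read_with_schema(raw_lines, schema):
    """Parse raw records and project them onto the schema as they are read."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            name: cast(record[name]) if name in record else default
            for name, (cast, default) in schema.items()
        }


rows = list(read_with_schema(RAW_EVENTS, PURCHASE_SCHEMA))
total = sum(row["amount"] for row in rows)
```

A different analysis could read the same RAW_EVENTS with a different schema (for example, one that keeps the "ts" field), because nothing was discarded when the data was stored.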
Data lakes allow users to access and explore data in their own way, without needing to move the data into another system. Insights and reporting obtained from a data lake typically occur on an ad hoc basis, instead of regularly pulling an analytics report from another platform or type of data repository. However, users could apply schema and automation to make it possible to duplicate a report if needed.
Data lakes need to have governance and require continual maintenance to make the data usable and accessible. Without this upkeep, you risk letting your data become junk—inaccessible, unwieldy, expensive, and useless. Data lakes that become inaccessible for their users are referred to as "data swamps."
Data lake vs. data warehouse
Though they are often confused, data lakes and data warehouses are not the same and serve different purposes. Both are data storage repositories for big data, but this is where the similarities end. Many enterprises will use both a data warehouse and a data lake to meet their specific needs and goals.
A data warehouse provides a structured data model designed for reporting. This is one of the main differences between a data lake and a data warehouse. A data lake stores raw data, often unstructured, without a currently defined purpose.
Before data can be put into a data warehouse, it needs to be processed. Decisions are made about what data will or will not be included, and the data is fitted to the warehouse’s schema as it is stored. This is referred to as "schema on write."
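For contrast, schema on write can be sketched in a few lines of Python. The SaleRow and Warehouse names are hypothetical, and a real data warehouse enforces its schema through its own loading tools rather than code like this. The point of the sketch is only that records are validated and trimmed to the schema before they are stored, and anything that does not fit is kept out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SaleRow:
    """The fixed shape every stored row must have (hypothetical schema)."""
    user: str
    amount: float


class Warehouse:
    """Toy store that enforces SaleRow at write time."""

    def __init__(self):
        self.rows = []
        self.rejected = []

    def write(self, record):
        """Store the record only if it fits the schema; return True on success."""
        try:
            row = SaleRow(user=str(record["user"]), amount=float(record["amount"]))
        except (KeyError, TypeError, ValueError):
            self.rejected.append(record)
            return False
        self.rows.append(row)
        return True


warehouse = Warehouse()
incoming = [
    {"user": "ana", "amount": "19.99", "note": "gift"},  # extra field is dropped
    {"user": "bo"},                                        # missing amount
    {"user": "cy", "amount": "n/a"},                       # amount is not a number
]
results = [warehouse.write(record) for record in incoming]
```

Only the first record is stored, and its "note" field is lost because the schema has no place for it.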
The process of refining the data before storing it in a data warehouse can be time-consuming and difficult, sometimes taking months or even years, which also prevents you from collecting data right away. With a data lake, you can start collecting data immediately and figure out what to do with it in the future.
Because of their structure, data warehouses are more often used by business analysts and other business users who know what data they need in advance for regular reporting. A data lake is more often used by data scientists and analysts because they are performing research using the data, and the data needs more advanced filters and analysis applied to it before it can be useful.
Data lakes and data warehouses also typically use different hardware for storage. Data warehouses can be expensive, while data lakes can remain inexpensive despite their large size because they often use commodity hardware.
Data lake architecture
A data lake has a flat architecture because the data can be unstructured, semi-structured, or structured, and is collected from various sources across the organization. A data warehouse, by comparison, stores data in files or folders. You can have a data lake on-premises or in the cloud.
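One way to picture a flat architecture is as a single namespace of keys that point to opaque objects. The sketch below is a toy Python model under that assumption; FlatStore and its methods are hypothetical names, not the API of any real storage system. Anything that looks like a folder is just a shared key prefix.

```python
class FlatStore:
    """Toy flat namespace: every object lives under one key, with no nesting."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        """Store raw bytes under a key, whatever format they are in."""
        self._objects[key] = data

    def get(self, key):
        """Return the bytes exactly as they were stored."""
        return self._objects[key]

    def list(self, prefix=""):
        """Return all keys that start with the prefix, sorted."""
        return sorted(key for key in self._objects if key.startswith(prefix))


store = FlatStore()
store.put("crm/2024/contacts.csv", b"id,name\n1,Ana\n")     # structured
store.put("web/clickstream-0001.json", b'{"page": "/home"}')  # semi-structured
store.put("sensors/readings.bin", b"\x00\x01\x02")           # unstructured
crm_keys = store.list("crm/")
```

Because the store never inspects the bytes, structured, semi-structured, and unstructured data can sit side by side.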
Because of their architecture, data lakes offer massive scalability up to the exabyte scale. This is important because when creating a data lake you generally don’t know in advance the volume of data it will need to hold. Traditional data storage systems can’t scale in this way.
This architecture benefits data scientists, who are able to mine and explore data from across the enterprise and to share and cross-reference data, including heterogeneous data from different fields, to ask questions and find new insights. They can also take advantage of big data analytics and machine learning to analyze the data in a data lake.
Even though data does not have a fixed schema prior to storage in a data lake, data governance is still important to avoid a data swamp. Data should be tagged with metadata when it is put into the lake to ensure that it is accessible later.
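Tagging data with metadata on the way in can be sketched as a small catalog kept alongside the lake. The Python below is a minimal illustration only; CatalogEntry, Catalog, and the field names are assumptions chosen for the example, and the paths are made up.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CatalogEntry:
    """Metadata recorded about one object at ingestion time (hypothetical fields)."""
    path: str
    source: str
    data_format: str
    tags: set = field(default_factory=set)
    ingested_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class Catalog:
    """Toy metadata catalog: describes the data without changing it."""

    def __init__(self):
        self._entries = {}

    def register(self, path, source, data_format, tags=()):
        """Record metadata for an object as it is put into the lake."""
        entry = CatalogEntry(path, source, data_format, set(tags))
        self._entries[path] = entry
        return entry

    def find(self, tag):
        """Return the paths of all objects carrying the given tag, sorted."""
        return sorted(e.path for e in self._entries.values() if tag in e.tags)


catalog = Catalog()
catalog.register("raw/orders-2024.csv", "shop-db", "csv", tags={"sales", "pii"})
catalog.register("raw/clicks-0001.json", "website", "json", tags={"web"})
catalog.register("raw/refunds-2024.csv", "shop-db", "csv", tags={"sales"})
sales_paths = catalog.find("sales")
```

The raw files stay schema-free, but a later user can still discover what exists and where it came from by searching the catalog.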
Why choose Red Hat Data Services?
With Red Hat’s open, software-defined storage solutions, you can work more, grow faster, and rest easy knowing that your data—from important financial documents to rich media files—is stored safely and securely.
With scalable, cost-efficient software-defined storage, you can analyze huge lakes of data for better business insights. Red Hat’s software-defined storage solutions are all built on open source, and draw on the innovations of a community of developers, partners, and customers. This gives you control over exactly how your storage is formatted and used—based on your business’ unique workloads, environments, and needs.