AWS Lake Formation is a service that makes it easy to set up a secure data lake in days. A data lake is a centralized, curated, and secured repository that stores all our data, both in its original form and prepared for analysis. A data lake enables us to break down data silos and combine different types of analytics to gain insights and guide better business decisions.
However, setting up and managing data lakes today involves a lot of manual, complicated, and time-consuming tasks. This work includes loading data from diverse sources, monitoring those data flows, setting up partitions, turning on encryption and managing keys, defining transformation jobs and monitoring their operation, re-organizing data into a columnar format, configuring access control settings, deduplicating redundant data, matching linked records, granting access to data sets, and auditing access over time.
Creating a data lake with Lake Formation is as simple as defining our data sources and the data access and security policies we want to apply. Lake Formation then helps us collect and catalog data from databases and object storage, move the data into our new Amazon S3 data lake, clean and classify it using machine learning algorithms, and secure access to our sensitive data. Users can access a centralized data catalog that describes the available data sets and their appropriate usage, and can then analyze those data sets with their choice of analytics and machine learning services, such as Amazon Redshift, Amazon Athena, and (in beta) Amazon EMR for Apache Spark. Lake Formation builds on the capabilities available in AWS Glue.
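As a rough sketch of that flow using the AWS SDK for Python (boto3), the snippet below registers an S3 location with Lake Formation and grants an analyst role SELECT on a table that the catalog already describes. The bucket, role ARNs, region, database, and table names are placeholders chosen for illustration, not values from any real account.

```python
import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

# Register an S3 location as data lake storage (placeholder bucket and role ARNs).
lf.register_resource(
    ResourceArn="arn:aws:s3:::example-data-lake-bucket",
    RoleArn="arn:aws:iam::111122223333:role/ExampleLakeFormationRole",
)

# Grant an analyst role SELECT on a table already described in the Glue Data Catalog.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/ExampleAnalystRole"},
    Resource={"Table": {"DatabaseName": "example_db", "Name": "example_table"}},
    Permissions=["SELECT"],
)
```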
Benefits
Build data lakes quickly
With Lake Formation, we can move, store, catalog, and clean our data faster. We simply point Lake Formation at our data sources, and Lake Formation crawls those sources and moves the data into our new Amazon S3 data lake. Lake Formation organizes data in S3 around frequently used query terms and into right-sized chunks to increase efficiency, and converts data into columnar formats like Apache Parquet and ORC for faster analytics. In addition, Lake Formation uses built-in machine learning to deduplicate and find matching records (two entries that refer to the same thing), improving data quality.
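To illustrate the "point Lake Formation at our data sources" step, here is a minimal sketch that uses an AWS Glue crawler (the cataloging machinery Lake Formation builds on) to discover and catalog data under an S3 prefix. The crawler name, IAM role, database, and path are assumptions made for the example.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create a crawler that scans an S3 prefix and writes table definitions
# into a Glue Data Catalog database (all names are placeholders).
glue.create_crawler(
    Name="example-sales-crawler",
    Role="arn:aws:iam::111122223333:role/ExampleGlueCrawlerRole",
    DatabaseName="example_db",
    Targets={"S3Targets": [{"Path": "s3://example-data-lake-bucket/raw/sales/"}]},
)

# Run the crawler; once it finishes, the discovered tables can be permissioned
# through Lake Formation and queried from Athena, Redshift, or EMR.
glue.start_crawler(Name="example-sales-crawler")
```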
Simplify security management
We can use Lake Formation to centrally define security, governance, and auditing policies in one place, rather than doing so per service, and then enforce those policies for our users across their analytics applications. Our policies are implemented consistently, eliminating the need to configure them manually across security services like AWS Identity and Access Management and AWS Key Management Service, storage services like S3, and analytics and machine learning services like Redshift, Athena, and (in beta) EMR for Apache Spark. This reduces the effort of configuring policies across services and provides consistent enforcement and compliance.
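As one hedged example of what "define once, enforce everywhere" can look like in practice, the boto3 sketch below grants a data engineering role database-level permissions through the Lake Formation API and then lists the permissions on that database for audit purposes. The role ARN and database name are placeholders.

```python
import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

# Grant a data engineering role the ability to create and describe tables
# in one database; the grant is enforced wherever the catalog is used.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/ExampleDataEngineerRole"},
    Resource={"Database": {"Name": "example_db"}},
    Permissions=["CREATE_TABLE", "DESCRIBE"],
)

# Review who holds which permissions on the database, e.g. during an audit.
response = lf.list_permissions(Resource={"Database": {"Name": "example_db"}})
for perm in response["PrincipalResourcePermissions"]:
    print(perm["Principal"]["DataLakePrincipalIdentifier"], perm["Permissions"])
```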
Provide self-service access to data
With Lake Formation, we build a data catalog that describes the available data sets and which groups of users have access to each. This makes our users more productive by helping them find the right data set to analyze. By providing a catalog of our data with consistent security enforcement, Lake Formation makes it easier for our analysts and data scientists to use their preferred analytics service.
They can use EMR for Apache Spark (in beta), Redshift, or Athena on diverse data sets now housed in a single data lake. Users can also combine these services without having to move data between silos.
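For instance, once Lake Formation has granted an analyst access, running a query against a cataloged table from Athena might look like the boto3 sketch below. The database, table, query, and results bucket are assumptions made for illustration.

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Run a query against a table in the shared catalog (placeholder names).
query = athena.start_query_execution(
    QueryString="SELECT customer_id, COUNT(*) AS orders FROM example_table GROUP BY customer_id",
    QueryExecutionContext={"Database": "example_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
execution_id = query["QueryExecutionId"]

# Poll until the query finishes, then fetch the first page of results.
while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
```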