Build A Best Practice AWS Data Lake Faster with AWS Lake Formation
The world’s first gigabyte hard drive was the size of a refrigerator — and that wasn’t all that long ago. Clearly, technology has evolved, and so have our data storage and analysis needs. With data serving a key role in helping companies unearth intelligence that can provide a competitive advantage, solutions that allow organizations to end data silos and help create actionable business outcomes from intelligent data analysis are gaining traction.
According to the 2018 Big Data Trends and Challenges report by Dimensional Research, the share of firms with an average data lake size over 100 terabytes grew from 36% in 2017 to 44% in 2018. That trend is sure to continue, especially as cloud providers like AWS offer services such as the newly announced AWS Lake Formation, which helps streamline the process of creating and managing a data lake. In today’s blog, we take a look at the new AWS Lake Formation service and share our take on its features, its benefits, and the things we’d like to see in the next version.
What is AWS Lake Formation?
AWS Lake Formation is a newly announced AWS service designed to streamline the process of building a data lake in AWS, with the aim of standing up a full solution in just days. At a high level, AWS Lake Formation provides best-practice templates and workflows for creating data lakes that are secure and compliant and that operate effectively. The overall goal is a well-architected solution that identifies, ingests, cleans and transforms data while enforcing appropriate security policies, so firms can focus on gaining new insights rather than on building data lake infrastructure.
Before the release of AWS Lake Formation, organizations had to take several steps to build a data lake. Not only was the process time-consuming, but several of those steps proved difficult for the average operator. For example, users needed to set up their own Amazon S3 storage; deploy AWS Glue to prepare the data for analysis through automated extract, transform and load (ETL) jobs; configure and enforce security policies; ensure compliance; and more. Each step offered room for missteps, making data lake setup challenging and, for many, a process that took a month or longer.
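To make those manual steps concrete, here is a minimal Python sketch that lays them out as an ordered plan of AWS API calls. It is an illustration under our own assumptions, not a complete or recommended setup: the bucket, database, crawler and role names are hypothetical placeholders, the `raw/` prefix is our own convention, and compliance work is not covered at all. The operation names and parameters follow the boto3 S3 and Glue clients.

```python
"""Sketch: the manual data lake setup steps as an ordered plan of API calls."""


def build_manual_setup_plan(bucket, region, database, crawler, crawler_role_arn):
    """Return the manual steps as (service, operation, kwargs) tuples.

    Every resource name is a hypothetical placeholder supplied by the caller.
    """
    # Step 1: storage. us-east-1 must not be passed as a LocationConstraint.
    create_bucket = {"Bucket": bucket}
    if region != "us-east-1":
        create_bucket["CreateBucketConfiguration"] = {"LocationConstraint": region}

    return [
        ("s3", "create_bucket", create_bucket),
        # Step 2: block public access as a baseline security policy.
        ("s3", "put_public_access_block", {
            "Bucket": bucket,
            "PublicAccessBlockConfiguration": {
                "BlockPublicAcls": True,
                "IgnorePublicAcls": True,
                "BlockPublicPolicy": True,
                "RestrictPublicBuckets": True,
            },
        }),
        # Step 3: a Glue database to hold the catalog tables.
        ("glue", "create_database", {"DatabaseInput": {"Name": database}}),
        # Step 4: a Glue crawler that catalogs data under an assumed raw/ prefix.
        ("glue", "create_crawler", {
            "Name": crawler,
            "Role": crawler_role_arn,
            "DatabaseName": database,
            "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/raw/"}]},
        }),
    ]


def run_plan(clients, plan):
    """Execute each step, in order, against the client for its service."""
    for service, operation, kwargs in plan:
        getattr(clients[service], operation)(**kwargs)
```

With boto3 installed and credentials configured, `clients` would be something like `{"s3": boto3.client("s3"), "glue": boto3.client("glue")}`. Even in this trimmed form, four separate calls across two services have to be sequenced correctly, which is the kind of room for missteps described above.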
AWS Data Lake Benefits
AWS has addressed many of these challenges with AWS Lake Formation, which offers three key areas of benefit plus one supporting feature that we think is neat.
Templates – AWS Lake Formation provides templates for a number of tasks. We are most excited about the templates for AWS Glue, because this is an area where many organizations find they need to loop in AWS engineering for best-practice help. The Glue templates show that AWS really is listening to its customers and providing guidance where they need it most. Our AWS consulting team was also really happy to see templates that simplify data import and templates for managing long-running cron jobs. Because they are reusable, these templates should streamline each part of the data lake process.
Cloud Security Solutions – Data is the lifeblood of an organization, and for many companies it is the foundation of their IP. As a result, sound security (and compliance) must be a key consideration for any data lake solution. AWS is definitely singing from that hymn book with AWS Lake Formation: it offers security controls at the most granular of levels, securing not just the S3 bucket but the data catalog as well. For example, at the data catalog level, you could specify which columns of data a Lambda function can read, or revoke a user’s permissions to a specific database. (AWS notes that row-level tagging will be in a future version of the solution.)
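As a rough illustration of the two catalog-level examples above, the sketch below builds the request payloads for the `grant_permissions` and `revoke_permissions` operations of the boto3 `lakeformation` client. The helper function names, ARNs, database, table and column names are all hypothetical. We also assume the Lambda function is identified by its IAM execution role, since permissions are granted to IAM principals rather than to the function itself.

```python
"""Sketch: column-level grant and database-level revoke payloads."""


def column_read_grant(principal_arn, database, table, columns):
    """Payload letting a principal SELECT only the listed columns of a table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


def database_revoke(principal_arn, database, permissions):
    """Payload removing the given permissions on one database from a principal."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Database": {"Name": database}},
        "Permissions": list(permissions),
    }


# Hypothetical usage, assuming boto3 is installed and credentials are configured:
#
#   import boto3
#   lf = boto3.client("lakeformation")
#   lf.grant_permissions(**column_read_grant(
#       "arn:aws:iam::123456789012:role/report-lambda-role",  # Lambda execution role
#       "sales_db", "orders", ["order_id", "order_total"]))
#   lf.revoke_permissions(**database_revoke(
#       "arn:aws:iam::123456789012:user/former-analyst",
#       "sales_db", ["CREATE_TABLE", "ALTER", "DROP"]))
```

Keeping the payload builders separate from the client calls makes the access rules easy to review and unit test before anything is applied to a live catalog.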
Machine Learning Transformations – AWS Lake Formation provides ML algorithms that customers can use to create their own machine learning transformations. AWS cites record de-duplication as a use case, illustrating how ML can help clean and update data. We see this feature as particularly interesting to firms in industries like pharmaceuticals, where a company could, for example, use it to mine and predictively match chemical patterns to patients, and oil and gas, where ML can be applied to field-based data points to maximize oil production.
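To show what the de-duplication use case is up against, here is a small rule-based sketch in Python. To be clear, this is not the Lake Formation ML transformation and uses no machine learning: it is our own stand-in that only matches records whose fields are identical after simple normalization. The record layout and field names are hypothetical.

```python
"""Sketch: rule-based duplicate detection as a baseline for the ML use case."""
import re
from collections import defaultdict


def normalize(value):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", value.lower())
    return " ".join(cleaned.split())


def find_duplicates(records, key_fields):
    """Group record ids whose normalized key fields are identical."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(normalize(str(record[field])) for field in key_fields)
        groups[key].append(record["id"])
    # Only groups with more than one record are duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]


records = [
    {"id": 1, "name": "Acme, Inc.", "city": "Austin"},
    {"id": 2, "name": "ACME Inc", "city": " austin "},
    {"id": 3, "name": "Acme Incorporated", "city": "Austin"},
]
print(find_duplicates(records, ["name", "city"]))  # prints [[1, 2]]
```

The rules catch records 1 and 2, which differ only in case, punctuation and spacing, but miss record 3, which a person would likely read as the same company. Fuzzy matches of that kind are where a learned matching model is meant to help.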
Also neat, but not marquee-stealing, is the AWS Lake Formation feature that allows users to add metadata and tag data catalog objects. For developers, in particular, this is a nice-to-have feature as it will allow them to more easily search all this data. Separately, we also like that AWS Lake Formation users will only pay for the underlying services used and that there are no additional charges.
Ready to Swim?
One feature we’d like to see in an upcoming release of Lake Formation is integration with directory services like Active Directory (AD). This would further streamline data access control by ensuring permissions are revoked when, for example, an employee leaves the organization or changes workgroups.
Moreover, while AWS Lake Formation greatly streamlines the process of building a data lake, creating your own templates going forward may remain a challenge for some organizations. At Flux7, we teach organizations how to build, manage and maintain templates for this and many other AWS solutions, and we can help your team ensure its templates incorporate Well-Architected best practices on an ongoing basis.