Announced at the end of 2018 during the AWS re:Invent conference, Amazon’s Lake Formation service shows how the Cloud can significantly accelerate large-scale data analytics projects.
Diving into Big Data when you manage your own infrastructure is not an easy task. Several months can pass between the expression of a need and the move to production. During this period, structuring decisions have to be made, covering both the financial aspects of the project and the technical choice of the components that will make up the data pipeline.
The emergence of managed data analytics services from most major Cloud providers is a breath of fresh air in this context. Beyond the inherent benefits of the Cloud, such as elasticity and pay-as-you-go (PAYG) pricing, it becomes possible to decouple where the data is stored from where it is processed. Processing jobs can therefore be started very quickly, regardless of the exact nature or location of the clusters.
Cloud-native Big Data building blocks
Most of the major Cloud providers have developed platforms dedicated to Big Data. Google Dataproc and Amazon EMR (Elastic MapReduce), for instance, integrate Spark, Hadoop, Pig or Hive without requiring you to manage the underlying infrastructure. These platforms also integrate natively with the rest of each provider's vast catalog of data analytics services.
At AWS, Amazon EMR runs primarily on EC2 instances and stores its data mainly in S3, though it can also work with Redshift and many other stores. It also interfaces with Glue (data transformation), Athena (interactive query service), QuickSight (data visualization), Kinesis (real-time data collection and loading), and more. Amazon has also significantly upgraded S3 in recent years to make it a viable alternative to Hadoop clusters running HDFS.
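To make the storage/compute split concrete, here is a minimal Python sketch of querying data that sits in S3 through Athena, using the boto3 SDK. The database name, bucket and SQL statement are hypothetical placeholders, not taken from any real project, and the sketch assumes a table has already been catalogued over the S3 data.

```python
from typing import Any, Dict


def build_athena_query_request(sql: str, database: str, output_s3_uri: str) -> Dict[str, Any]:
    """Build the keyword arguments for Athena's start_query_execution call."""
    if not output_s3_uri.startswith("s3://"):
        raise ValueError("Athena writes its results to S3; expected an s3:// URI")
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3_uri},
    }


def run_query(request: Dict[str, Any]) -> str:
    """Submit the query and return its execution id (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request builder works without the SDK

    athena = boto3.client("athena")
    response = athena.start_query_execution(**request)
    return response["QueryExecutionId"]


# Hypothetical names, for illustration only.
request = build_athena_query_request(
    sql="SELECT event_type, COUNT(*) AS n FROM events GROUP BY event_type",
    database="example_lake_db",
    output_s3_uri="s3://example-athena-results/",
)
```

Nothing in this request names a cluster: the data stays in S3 and the query engine is provisioned by the service, which is the decoupling described above.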
AWS Lake Formation to speed up projects
Among the latest products in the Amazon services portfolio, AWS Lake Formation is specifically designed to accelerate the creation and configuration of a data lake. The goal is to reduce implementation time from a few months to a few days by centralizing the definition of security, governance and audit policies.
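As an illustration of what centralizing security can look like in practice, the following Python sketch prepares a table-level grant for the Lake Formation `grant_permissions` API exposed by boto3. The role ARN, database and table names are hypothetical, and this is one possible way to organize such a grant rather than a recommended setup.

```python
from typing import Any, Dict, List


def build_table_grant(principal_arn: str, database: str, table: str,
                      permissions: List[str]) -> Dict[str, Any]:
    """Build the keyword arguments for Lake Formation's grant_permissions call."""
    if not permissions:
        raise ValueError("at least one permission is required")
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": permissions,
    }


def apply_grant(grant: Dict[str, Any]) -> None:
    """Send the grant to Lake Formation (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the grant builder works without the SDK

    boto3.client("lakeformation").grant_permissions(**grant)


# Hypothetical principal and table, for illustration only.
grant = build_table_grant(
    principal_arn="arn:aws:iam::123456789012:role/ExampleAnalystRole",
    database="example_lake_db",
    table="events",
    permissions=["SELECT"],
)
```

The grant is expressed once, against the catalogued table, instead of being repeated in each service that reads the data.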
Lake Formation provides a template to configure all the services required to load data from various sources, define transformation tasks, clean the data with Machine Learning and reorganize it so that Data Scientists, or even Citizen Data Scientists, can use it with as little friction as possible.
Saving time: a competitive challenge
Amazon’s efforts in the Data Science segment illustrate how competitive the Analytics world has become, in an environment marked by:
- the continuous rise in Cloud investments (+50% by 2022 according to Gartner)
- the arrival of ‘new’ players such as Alibaba in a market traditionally occupied by a few key players
- the emergence of new projects and, with them, new specialized profiles
This seems to be a particularly good time to start mobilizing your data!