Azure Data Lake (ADL) Gen 2 Implementation Best Practices

Published on: November 18, 2021
Written by: Bhavna Gupta

Many businesses today need to collect data from varied sources and store large volumes of data of any type in a single destination. This data can later be used to inform data-driven decisions. Azure Data Lake Gen 2 is widely used for this purpose.

However, identifying the aspects to consider when designing a first Azure Data Lake can be a daunting task. There is no perfect solution, as the design varies with the business requirements and the scenario at hand, but there are some generic parameters that can be considered.

In this article, I will explain what Azure Data Lake Gen 2 is and share some best practices for using it efficiently.

What is Azure Data Lake (ADL) Gen 2?

Azure Data Lake Gen 2 is the evolution of Data Lake Storage Gen1, built on top of Blob storage with a hierarchical namespace. It is a cloud-based big data store that serves as a storage account for structured, semi-structured, and unstructured data.

Data of any type and size can be collected from a number of sources and stored in ADL, from where it can be further transformed and analysed.

Azure Data Lake (ADL) Gen 2 Implementation Best Practices

These are some implementation best practices for efficient use of Azure Data Lake Gen 2 services:

1. Environments

The number of environments varies with project constraints, but having separate DEV and PROD environments is preferable. This limits the risk of data loss from accidental deletions or unwanted amendments.

2. Backup and Recovery

Most organisations leverage the disaster recovery options available on Azure to ensure they are properly protected in case of a data centre outage. Depending on the requirement, any of these redundancy options can be selected: zone-redundant storage (ZRS), geo-zone-redundant storage (GZRS), geo-redundant storage (GRS), or read-access geo-redundant storage (RA-GRS).

However, if records are accidentally deleted, that deletion is also replicated. To protect against such events, consider keeping a separate backup area in the same data lake storage account.

A suggested backup folder structure is shown in the Directory Layout examples below.

Also, we should enable soft delete and set the retention period to at least 7 days. Soft delete allows us to recover data that was accidentally deleted within, in our case, the last 7 days.
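As a rough illustration of the recovery window, the logic can be sketched as follows (this is a hypothetical helper for reasoning about the window, not part of any Azure SDK; soft delete itself is enabled on the storage account):

```python
from datetime import datetime, timedelta

# Retention period suggested above; soft delete must be enabled on the account.
SOFT_DELETE_RETENTION_DAYS = 7

def is_recoverable(deleted_on: datetime, now: datetime) -> bool:
    """Return True if a soft-deleted item is still within the retention
    window and can therefore be undeleted."""
    return now - deleted_on <= timedelta(days=SOFT_DELETE_RETENTION_DAYS)
```

For example, a file deleted three days ago is still recoverable, while one deleted ten days ago is not.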

3. Data layers

Data layers are recommended only if there is a strong business requirement to transform the raw data into a more business-friendly form, or if the data needs cleansing.

Data can be stored in two layers:

  • Raw layer: data is stored in its native format, with no loss through aggregation or modification. This data can be used for error recovery, tracing data lineage, or exploratory analysis.
  • Curated layer: data is transformed and cleansed, and optimised for analytic consumption to ensure fast reads.

NOTE: Once UAT is complete and the PROD environment is in regular use, we can stop refreshing the DEV environment to help optimise cost, or, if required, continue refreshing both DEV and PROD.

If we stop refreshing the DEV environment for cost efficiency, we need to plan to sync the DEV environment from PROD periodically.

4. Directory Layout

The directory layout depends upon the business requirement. It can be based on:

  • Time series
  • Batch jobs
  • IoT data

The time series directory layout is generally preferred, as it enables us to retrieve the latest data easily and traverse back in time to a required dataset.

For better understanding, let us consider the following examples:

  • /prod/raw/datasource1/YYYY_MM_DD/filename1_yyyy_mm_dd.csv
  • /prod/raw/datasource1/YYYY_MM_DD/filename2_yyyy_mm_dd.csv
  • /prod/raw/datasource2/YYYY_MM_DD/filename1_yyyy_mm_dd.csv
  • /prod/raw/datasource2/YYYY_MM_DD/filename2_yyyy_mm_dd.csv
  • /backup/prod/raw/datasource1/datasource1_yyyy_mm_dd.zip
  • /backup/prod/raw/datasource2/datasource2_yyyy_mm_dd.zip

A similar folder structure can be used for the curated layer.
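To keep such paths consistent across pipelines, the naming convention can be captured in small helpers. The function names, and the .csv and .zip extensions, are illustrative, matching the examples above:

```python
from datetime import date

def raw_path(env: str, source: str, run_date: date, filename: str) -> str:
    """Build a time-series raw-layer path like the examples above."""
    d = run_date.strftime("%Y_%m_%d")  # e.g. 2021_11_18
    return f"/{env}/raw/{source}/{d}/{filename}_{d}.csv"

def backup_path(env: str, source: str, run_date: date) -> str:
    """Build the matching backup-area archive path for a data source."""
    d = run_date.strftime("%Y_%m_%d")
    return f"/backup/{env}/raw/{source}/{source}_{d}.zip"

print(raw_path("prod", "datasource1", date(2021, 11, 18), "filename1"))
# /prod/raw/datasource1/2021_11_18/filename1_2021_11_18.csv
```

Centralising the convention in one place makes it harder for individual jobs to drift from the agreed layout.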

5. Roles and Permissions

Azure Data Lake supports role-based access control (RBAC) and access control lists (ACLs) through integration with Azure Active Directory (AAD). RBAC controls access at the storage account and container level, whereas ACLs apply to directories and files.

Hence, we can control who has access to data in the DEV, PROD, or backup containers and, beyond that, restrict access down to the file level. As a best practice, assign security principals an RBAC role at the storage account or container level, then apply restrictive, selective ACLs at the directory and file level.
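ADLS Gen2 ACLs use a POSIX-style short form such as "user::rwx,group::r-x,other::---", with extra entries for specific AAD principals. A minimal sketch of composing that string (the helper itself is illustrative, not part of any SDK, and the object ID below is a made-up placeholder):

```python
def acl_spec(owner="rwx", group="r-x", other="---", named_users=None):
    """Compose a POSIX-style ACL string in the short form that ADLS Gen2
    accepts, e.g. 'user::rwx,group::r-x,other::---,user:<object-id>:r--'."""
    entries = [f"user::{owner}", f"group::{group}", f"other::{other}"]
    for object_id, perms in (named_users or {}).items():
        # Named entries grant a specific AAD security principal access.
        entries.append(f"user:{object_id}:{perms}")
    return ",".join(entries)

# Owner full access, group read/execute, one specific principal read-only:
print(acl_spec(named_users={"11111111-2222-3333-4444-555555555555": "r--"}))
```

Such a string can then be supplied when setting access control on a directory or file, for example via the `azure-storage-file-datalake` Python SDK.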

6. Cost Optimisation

Azure Data Lake is economical: there is no upfront cost, and it follows a pay-per-use model. The cost depends upon:

  • The data at rest and the number of gigabytes stored, and
  • The number of transactions (read and write) over that data.

It is good practice to estimate the costs involved before the project commences. Cost estimation can be performed using the Azure pricing calculator.
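The two cost drivers listed above can be combined into a simple back-of-the-envelope model. All prices here are placeholders, not real Azure rates; always take current figures for your region and tier from the Azure pricing calculator:

```python
def estimate_monthly_cost(gb_stored, read_ops, write_ops,
                          price_per_gb=0.02,
                          price_per_10k_reads=0.004,
                          price_per_10k_writes=0.05):
    """Rough monthly estimate combining data-at-rest and transaction costs.
    The default prices are placeholders, NOT real Azure rates."""
    storage = gb_stored * price_per_gb
    transactions = (read_ops / 10_000 * price_per_10k_reads
                    + write_ops / 10_000 * price_per_10k_writes)
    return storage + transactions
```

Even a crude model like this makes it obvious whether storage at rest or transaction volume will dominate the bill.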

REFERENCES 

Azure Data Lake Storage best practices: https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices
