The data lake has become a mainstay in data analytics architectures. By storing data in its native format, it allows organizations to defer the effort of structuring and organizing data upfront. This promotes data collection and serves as a rich platform for data analytics. Most data lakes are also backed by a distributed file system that enables massively parallel processing (MPP) and scales with even the largest data sets. However, the increase in data privacy regulations and demands on governance requires a new strategy. Simple tasks such as finding, updating, or deleting a record in a data lake can be difficult. They require an understanding of the data and typically involve an inefficient process of re-writing the entire data set, which can lead to resource contention and interruptions in critical analytics workloads.

Apache Spark has become one of the most widely adopted data analytics platforms. Earlier this year, its largest contributor, Databricks, open-sourced a library called Delta Lake. Delta Lake solves the problem of resource contention and interruption by creating an optimized, ACID-compliant storage repository that is fully compatible with the Spark API and sits on top of your existing data lake. Files are stored in Parquet format, which makes them portable to other analytics workloads. Optimizations like partitioning, caching, and data skipping are built in, so additional performance gains can be realized over native formats.

Delta Lake is not intended to replace a traditional domain-modeled data warehouse. It is, however, intended as an intermediate step for loosely structuring and collecting data. The schema can remain the same as the source system, and personally identifiable data such as email addresses, phone numbers, or customer IDs can easily be found and modified. Another important Delta Lake capability is Spark Structured Streaming support for both ingest and data changes. This creates a unified ETL process for both streaming and batch while helping to promote data quality.

Data Lake Lifecycle

  1. Ingest data directly from the source or into a temporary storage location (for example, Azure Blob Storage with Lifecycle Management).
  2. Use Spark Structured Streaming or scheduled jobs to load data into Delta Lake table(s) (see the sketch after this list).
  3. Maintain data in the Delta Lake table(s) to keep the data lake in compliance with data regulations.
  4. Perform analytics on the files stored in the data lake, either through Delta Lake tables in Spark or directly against the Parquet files after they have been put in a consistent state with the `VACUUM` command.
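
As a rough sketch of step 2, the PySpark snippet below streams newly landed JSON files into a Delta table; the storage path, schema, checkpoint location, and table location are illustrative assumptions rather than a definitive implementation.

%python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Incrementally read JSON files as they land in the temporary storage location.
# The container, path, and schema below are placeholders.
landed = (
    spark.readStream
    .schema("customerId STRING, email_address STRING, updated_at TIMESTAMP")
    .json("wasbs://landing@examplestorage.blob.core.windows.net/customers/")
)

# Continuously append the incoming records to a Delta table in the data lake.
query = (
    landed.writeStream
    .format("delta")
    .option("checkpointLocation", "/delta/_checkpoints/customers")
    .outputMode("append")
    .start("/delta/customers")
)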

Data Ingestion and Retention

The idea behind data retention is to establish policies ensuring that data which cannot be retained is automatically removed as part of the process.
By default, Delta Lake stores a change history of all data modifications. Two table properties control how long this history is kept: `delta.logRetentionDuration` (default: interval 30 days) and `delta.deletedFileRetentionDuration` (default: interval 1 week).

%sql
ALTER TABLE table_name SET TBLPROPERTIES ('delta.logRetentionDuration'='interval 240 hours', 'delta.deletedFileRetentionDuration'='interval 1 hour');
SHOW TBLPROPERTIES table_name;
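
To inspect the retained change history and clean up files that have aged past the retention window, something like the following can be run from a notebook; this is a minimal sketch assuming a Delta table named customers.

%python
# Assumes a Delta table named `customers`; `spark` is the notebook's SparkSession.
# DESCRIBE HISTORY lists the operations retained under delta.logRetentionDuration.
history = spark.sql("DESCRIBE HISTORY customers")
history.select("version", "timestamp", "operation").show(truncate=False)

# VACUUM physically removes data files that have been unreferenced for longer
# than delta.deletedFileRetentionDuration (168 hours matches the one-week default).
spark.sql("VACUUM customers RETAIN 168 HOURS")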

Load Data into Delta Lake

The key to Delta Lake is a SQL-style `MERGE` statement that is optimized to modify only the affected files. This eliminates the need to reprocess and re-write the entire data set.

%sql
MERGE INTO customers
USING updates
ON customers.customerId = updates.customerId
WHEN MATCHED THEN
      UPDATE SET email_address = updates.email_address
WHEN NOT MATCHED THEN
      INSERT (customerId, email_address) VALUES (updates.customerId, updates.email_address)
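
The same upsert can also be expressed through the Delta Lake Python API. The sketch below assumes the customers and updates tables from the SQL example and that the delta-spark library is available in the cluster.

%python
from delta.tables import DeltaTable

# Assumes the Delta table `customers` and a table `updates` with the same
# customerId and email_address columns as the SQL example above.
customers = DeltaTable.forName(spark, "customers")
updates_df = spark.table("updates")

(
    customers.alias("c")
    .merge(updates_df.alias("u"), "c.customerId = u.customerId")
    .whenMatchedUpdate(set={"email_address": "u.email_address"})
    .whenNotMatchedInsert(values={
        "customerId": "u.customerId",
        "email_address": "u.email_address",
    })
    .execute()
)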

Maintain Data in Delta Lake

Just as data can be updated or inserted, it can also be deleted. For example, if a list of opted_out_customers is maintained, data in related tables can be purged.

%sql
MERGE INTO customers
USING opted_out_customers
ON opted_out_customers.customerId = customers.customerId
WHEN MATCHED THEN DELETE

Summary

In summary, Databricks Delta Lake enables organizations to continue storing data in data lakes even when that data is subject to privacy and data regulations. With Delta Lake's performance optimizations and open Parquet storage format, data can be easily modified and accessed using familiar code and tooling. For more information, see the Databricks Delta Lake documentation, which includes Python syntax references and examples: https://docs.databricks.com/delta/index.html

AIS is working with a large organization that wants to discover relationships between its data and its business by iteratively integrating data from many sources into Azure Data Lake. The data will be analyzed by different groups within the organization to discover new factors that might affect the business. Data will then be published to the appropriate consumers using Power BI.

In the initial phase, the data lake ingests data from some of the Operational Systems. Eventually, data will be captured not only from all the organization’s systems but also from streaming IoT devices.

Azure Data Lake 

Azure Data Lake allows us to store a vast amount of data of various types and structures. Data can be analyzed and transformed by Data Scientists and Data Engineers. 

The challenge with any data lake system is preventing it from becoming a data swamp. To establish an inventory of what is in a data lake, we capture the metadata such as origin, size, and content type during ingestion. We also have the Interface Control Document (ICD) from the Operational Systems that describe the data definition of the source data. 
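
As a simplified, hypothetical sketch of that ingestion-time capture (the function name, file layout, and fields are assumptions, not the project's actual implementation):

import datetime
import json
import os


def capture_metadata(file_path: str, origin: str, metadata_dir: str) -> None:
    """Record origin, size, and content type for a newly landed file."""
    record = {
        "file_name": os.path.basename(file_path),
        "origin": origin,
        "size_bytes": os.path.getsize(file_path),
        "content_type": os.path.splitext(file_path)[1].lstrip(".") or "unknown",
        "ingested_at_utc": datetime.datetime.utcnow().isoformat(),
    }
    target = os.path.join(metadata_dir, os.path.basename(file_path) + ".metadata.json")
    with open(target, "w") as handle:
        json.dump(record, handle, indent=2)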

Logical Zones

The data in the data lake is segregated into logical zones to provide logical and physical separation, keeping the environment secure, organized, and agile. As the data progresses through the zones, various transformations are performed.

  • Landing Zone is where the original data files are stored untouched. No data is deleted from this zone, and access to it is limited.
  • Raw Zone is where data quality validation is applied based on the rules defined in the source ICD. Any data that fails validation moves to the Error Zone.
  • Curated Zone is where we store cleansed and transformed data that is ready for consumption. The transformation is done for different audiences, and within the zone, folders are created for each specialized change.
  • Error Zone is where we store data that failed validation. A notification is sent to the registered data curators when new data arrives.
  • Metadata Zone is where we keep track of the metadata of the source and the transformed data.
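
A purely illustrative layout of these zones as folders in the lake might look like the following; the account, container, and folder names are assumptions:

# Illustrative zone layout in Azure Data Lake Storage; the account, container,
# and folder names are assumptions, not taken from the project.
ZONE_PATHS = {
    "landing":  "abfss://lake@examplelake.dfs.core.windows.net/landing/{source}/{yyyy}/{mm}/{dd}/",
    "raw":      "abfss://lake@examplelake.dfs.core.windows.net/raw/{source}/",
    "curated":  "abfss://lake@examplelake.dfs.core.windows.net/curated/{consumer}/",
    "error":    "abfss://lake@examplelake.dfs.core.windows.net/error/{source}/",
    "metadata": "abfss://lake@examplelake.dfs.core.windows.net/metadata/",
}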

The source systems have security requirements that prevent access to sensitive data. When the folders are created, permissions are given to security groups in Azure Active Directory. The same security rules are applied to the subsequent folders.

Now that the data is in the data lake, we allow each consuming group to create their own transformation rules. The transformed data is then moved to the curated zone ready to be loaded to the Azure Data Warehouse.

Azure Data Factory

Azure Data Factory orchestrates the movement and transformation of data, as shown in the diagram below. When a file is dropped in the Landing Zone, an Azure Data Factory pipeline runs that consists of activities to Unzip, Validate, Transform, and Load the data into the data warehouse.

The unzipping is performed by a custom-code Azure Function activity rather than the Copy activity’s decompress functionality. Out of the box, Azure Data Factory can decompress only GZip, Deflate, and BZip2 files, but not Tar, Rar, 7Zip, or Lzip.
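
As a rough illustration of the custom-code approach for one of those formats, a Python helper for the Tar case might look like this; the function and its use of in-memory bytes are assumptions, and the blob handling around it is omitted:

import io
import tarfile


def extract_tar(archive_bytes: bytes) -> dict:
    """Extract a .tar or .tar.gz archive held in memory.

    Returns a mapping of member file names to raw bytes so the caller can
    write each file on to the Raw Zone.
    """
    extracted = {}
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as archive:
        for member in archive.getmembers():
            if member.isfile():
                extracted[member.name] = archive.extractfile(member).read()
    return extracted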

The basic validation rules, such as data range, valid values, and reference data, are described in the ICD. A custom Azure Function activity was created to validate the incoming data.
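
A minimal sketch of such rule-driven validation is shown below; the rule format and function name are assumptions rather than the project's actual implementation:

def validate_record(record: dict, rules: dict) -> list:
    """Return a list of validation errors for a single record.

    `rules` maps a field name to a dict that may contain `valid_values`,
    `min`, and `max` entries derived from the ICD.
    """
    errors = []
    for field, rule in rules.items():
        value = record.get(field)
        if "valid_values" in rule and value not in rule["valid_values"]:
            errors.append(f"{field}: {value!r} is not an allowed value")
        if "min" in rule and value is not None and value < rule["min"]:
            errors.append(f"{field}: {value!r} is below the minimum {rule['min']}")
        if "max" in rule and value is not None and value > rule["max"]:
            errors.append(f"{field}: {value!r} is above the maximum {rule['max']}")
    return errors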

Data is transformed using a Spark activity in Azure Data Factory for each consuming user. Each consumer has a folder under the Curated Zone.
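
As an illustrative example of one such per-consumer transformation (the storage paths and column names are assumptions, and a SparkSession named spark is assumed to exist inside the Spark activity):

# Read the validated data from the Raw Zone, project the columns a given
# consumer needs, and write the result to that consumer's Curated Zone folder.
raw_sales = spark.read.parquet("abfss://lake@examplelake.dfs.core.windows.net/raw/sales/")

finance_view = raw_sales.select("orderId", "orderDate", "amount")
finance_view.write.mode("overwrite").parquet(
    "abfss://lake@examplelake.dfs.core.windows.net/curated/finance/sales/"
)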

Data Processing Example

Tables in the Azure Data Warehouse were created based on the Curated Zone by executing the Generate Azure Function activity to create the data definition language (DDL). The script modifies the destination table if a new field has been added.
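
A simplified sketch of what such DDL generation could look like is shown below; the type mapping and function name are assumptions, not the actual script:

def generate_ddl(df, table_name: str) -> str:
    """Build a CREATE TABLE statement from a PySpark DataFrame's schema.

    The Spark-to-SQL type mapping below is deliberately simplified.
    """
    type_map = {
        "string": "NVARCHAR(4000)",
        "int": "INT",
        "bigint": "BIGINT",
        "double": "FLOAT",
        "timestamp": "DATETIME2",
        "date": "DATE",
    }
    columns = ", ".join(
        f"[{field.name}] {type_map.get(field.dataType.simpleString(), 'NVARCHAR(4000)')}"
        for field in df.schema.fields
    )
    return f"CREATE TABLE {table_name} ({columns})"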

Finally, the data is copied to the destination tables to be used by end-users and warehouse designers.

In each step, we captured business, operational, and technical metadata to help us describe the data in the lake. This metadata can be uploaded to a metadata management system in the future.

First Things First…What’s a Data Lake?

If you’re not already familiar with the term, a “data lake” is generally defined as an expansive collection of data that’s held in its original format until needed. Data lakes are repositories of raw data, collected over time, and intended to grow continually. Any data that’s potentially useful for analysis is collected from both inside and outside your organization, and is usually collected as soon as it’s generated. This helps ensure that the data is available and ready for transformation and analysis when needed. Data lakes are central repositories of data that can answer business questions…including questions you haven’t thought of yet.

Azure Data Lake

Azure Data Lake is actually a pair of services: The first is a repository that provides high-performance access to unlimited amounts of data with an optional hierarchical namespace, thus making that data available for analysis. The second is a service that enables batch analysis of that data. Azure Data Lake Storage provides the high performance and unlimited storage infrastructure to support data collection and analysis, while Azure Data Lake Analytics provides an easy-to-use option for an on-demand, job-based, consumption-priced data analysis engine.

We’ll now take a closer look at these two services and where they fit into your cloud ecosystem.