The data lake has become a mainstay of data analytics architectures. By storing data in its native format, it allows organizations to defer the effort of structuring and organizing data upfront. This encourages data collection and provides a rich platform for analytics. Most data lakes are also backed by a distributed file system that enables massively parallel processing (MPP) and scales to even the largest data sets. However, the rise of data privacy regulations and governance demands requires a new strategy. Simple tasks such as finding, updating, or deleting a record in a data lake can be difficult: they require an understanding of the data and typically involve an inefficient process of rewriting the entire data set. This can lead to resource contention and interruptions in critical analytics workloads.

Apache Spark has become one of the most widely adopted data analytics platforms. Earlier this year its largest contributor, Databricks, open-sourced a library called Delta Lake. Delta Lake solves the problem of resource contention and interruption by creating an optimized, ACID-compliant storage repository that is fully compatible with the Spark API and sits on top of your existing data lake. Files are stored in Parquet format, which keeps them portable to other analytics workloads. Optimizations such as partitioning, caching, and data skipping are built in, so additional performance gains are realized over native formats.
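
Because Delta Lake plugs into the standard Spark reader/writer API, an existing data set can be persisted as a Delta table with a one-line change of format. A minimal sketch; the paths, data set, and partition column below are assumptions for illustration, not from the original post:

%python
# spark is the SparkSession provided by the Databricks notebook environment.
# Read existing raw Parquet data (assumed path) and rewrite it as a Delta table.
events = spark.read.parquet("/data/raw/events")

(events.write
    .format("delta")                 # Delta instead of plain parquet
    .partitionBy("event_date")       # assumed partition column
    .save("/delta/events"))          # the underlying files remain ordinary Parquet

# The same location can now be read back as a Delta table.
delta_events = spark.read.format("delta").load("/delta/events")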

Delta Lake is not intended to replace a traditional, domain-modeled data warehouse. Rather, it is intended as an intermediate step to loosely structure and collect data. The schema can remain the same as the source system, and personally identifiable data such as email addresses, phone numbers, or customer IDs can easily be found and modified. Another important Delta Lake capability is Spark Structured Streaming support for both ingest and data changes. This creates a unified ETL pipeline for both streaming and batch while helping promote data quality.
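
Since Structured Streaming is central to this approach, here is a minimal sketch of what a streaming load into a Delta table could look like; the landing path, schema, and checkpoint location are hypothetical:

%python
from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical schema mirroring the source system.
schema = StructType([
    StructField("customerId", StringType()),
    StructField("email_address", StringType()),
])

raw = (spark.readStream
    .schema(schema)
    .json("/mnt/landing/customers/"))   # assumed landing location (e.g. mounted Blob Storage)

(raw.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/delta/customers/_checkpoints/ingest")  # assumed checkpoint path
    .start("/delta/customers"))         # assumed Delta table path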

Data Lake Lifecycle

  1. Ingest data directly from the source or from a temporary storage location (e.g. Azure Blob Storage with Lifecycle Management).
  2. Use Spark Structured Streaming or scheduled jobs to load data into Delta Lake table(s).
  3. Maintain data in Delta Lake tables to keep the data lake in compliance with data regulations.
  4. Perform analytics on files stored in the data lake using Delta Lake tables in Spark, or on the underlying Parquet files after they have been put in a consistent state using the `VACUUM` command (see the sketch after this list).
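
Step 4's `VACUUM` physically removes files that are no longer referenced by the table, leaving the underlying Parquet in a consistent state for other readers. A sketch using the Python API, with an assumed table path and the default 7-day (168-hour) retention:

%python
from delta.tables import DeltaTable

customers = DeltaTable.forPath(spark, "/delta/customers")  # assumed path
customers.vacuum(retentionHours=168)                       # remove unreferenced files older than the retention window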

Data Ingestion and Retention

The idea behind data retention is to establish policies that ensure data which cannot be retained is automatically removed as part of the process.
By default, Delta Lake keeps a change-data-capture history of all data modifications. Two settings control how long that history is retained: `delta.logRetentionDuration` (default interval 30 days) and `delta.deletedFileRetentionDuration` (default interval 1 week).

%sql
ALTER TABLE table_name SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 240 hours', 'delta.deletedFileRetentionDuration' = 'interval 1 hours');
SHOW TBLPROPERTIES table_name
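
The modification history these settings govern can also be inspected directly. A sketch with the Python API, using the same assumed path as above:

%python
from delta.tables import DeltaTable

customers = DeltaTable.forPath(spark, "/delta/customers")  # assumed path
customers.history().select("version", "timestamp", "operation").show()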

Load Data into Delta Lake

The key to Delta Lake is a SQL-style `MERGE` statement that is optimized to modify only the affected files. This eliminates the need to reprocess and rewrite the entire data set.

%sql
MERGE INTO customers
USING updates
ON customers.customerId = updates.customerId
WHEN MATCHED THEN
      UPDATE SET email_address = updates.email_address
WHEN NOT MATCHED THEN
      INSERT (customerId, email_address) VALUES (updates.customerId, updates.email_address)
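
The same upsert can be expressed with the Delta Lake Python API, which the documentation linked in the summary also covers. A sketch assuming the `customers` table lives at the path used earlier and `updates` is a registered table or view:

%python
from delta.tables import DeltaTable

customers = DeltaTable.forPath(spark, "/delta/customers")   # assumed path
updates = spark.table("updates")                            # assumed staging table/view

(customers.alias("customers")
    .merge(updates.alias("updates"), "customers.customerId = updates.customerId")
    .whenMatchedUpdate(set={"email_address": "updates.email_address"})
    .whenNotMatchedInsert(values={
        "customerId": "updates.customerId",
        "email_address": "updates.email_address"})
    .execute())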

Maintain Data in Delta Lake

Just as data can be updated or inserted, it can be deleted as well. For example, if a table of opted-out customers (`opted_out_customers`) is maintained, matching data in related tables can be purged.

%sql
MERGE INTO customers
USING opted_out_customers
ON opted_out_customers.customerId = customers.customerId
WHEN MATCHED THEN DELETE
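
The equivalent purge with the Python API is a merge that deletes on match; a sketch with the same assumed names:

%python
from delta.tables import DeltaTable

customers = DeltaTable.forPath(spark, "/delta/customers")   # assumed path
opted_out = spark.table("opted_out_customers")              # assumed opt-out table

(customers.alias("customers")
    .merge(opted_out.alias("opted_out_customers"),
           "opted_out_customers.customerId = customers.customerId")
    .whenMatchedDelete()
    .execute())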

Summary

In summary, Databricks Delta Lake enables organizations to continue to store data in data lakes even when it is subject to privacy and data regulations. With Delta Lake's performance optimizations and open Parquet storage format, data can be easily modified and accessed using familiar code and tooling. For more information, including Delta Lake and Python syntax references and examples, see the documentation: https://docs.databricks.com/delta/index.html

Like every business that’s dependent on consumer sales to fuel growth, you and your team members are probably constantly thinking about how you can make your organization’s sales processes fast and efficient enough to support the growth and customer retention that your executive team desires.

Well, we’ve figured out a way to do just that – our client organization is in the highly competitive insurance industry and needed a way to increase sales volumes. Enter AIS: we gave our client an automated way to provide customers with insurance rate quotes via a self-service web portal solution…resulting in the higher sales volumes they were seeking, while also reducing costs. Read More…

Microsoft Dynamics CRM is an interesting and powerful business application. A core out-of-the-box (OOTB) benefit of Dynamics CRM is the ability to extensively tailor the application to address business needs, and here there are two approaches to consider: development or customization. Determining which approach to take is the key to maximizing the benefits of Dynamics CRM while keeping costs low.

Dynamics CRM is a basic web user interface fronting a SQL Server database that manages relational data. However, it is flanked by a built-in array of basic analytical tools and extensive administrative features, such as auditing, which give it enterprise-level credentials. Throw in a customizable user interface (UI), and you have a tool that is capable of supporting both small businesses and multinational corporations. So it would be logical to assume that Dynamics CRM has a developer-friendly, structured architecture to support customizations.

However, the reality is a little more complicated and brings up some curious paradoxes about Dynamics CRM. Read More…

Good question. And we’ve got the answer.

Here at AIS, we’ve spent hundreds of thousands of hours envisioning, designing and constructing SharePoint-based solutions for our clients. With each new version of SharePoint, we make additional investments to deeply understand the new release’s capabilities.

We’ve taken a look at all the changes and enhancements in the latest version — and you just have to look at our blog archives to realize that a LOT has changed in 2013 — and put together a short, easy-to-read whitepaper that highlights the top new features that make SharePoint 2013 a must-have for your business, including:

  • Smarter Search
  • Simpler and Mobile-Ready UI
  • The game-changing SharePoint App Store Model
  • Better Workflow
  • Social SharePoint
  • Easy Migration Tools
  • Lower Costs
  • And much more!

Please CLICK HERE to download your free copy.

Not familiar with SharePoint as a business solution? Take a look at our SharePoint solutions on our website and contact us to learn more about how SharePoint can transform your organization.

One day, not so very long ago, Kevin and Tom stopped by for a visit and asked me, “Can we build a low-cost Content Management System (CMS) on .NET that serves up audio and video content? The site also needs to sell access to the A/V content, and oh…the CMS users will be non-technical and it has to work on the iPad too.” I replied that of course we could build such a system and would get back to them with a plan.

Then I thought: What did I just promise?

Read More…