Apache Delta lake format is one of the widely used enhancements of the plain Parquet format. The Parquet files are read-only and enable you just to append new data by adding new Parquet files into some folder. Apache Delta Lake format is based on Parquet, but it also enables you to update, delete, merge data with ACID guarantees. In many solutions, Delta format is a great choice when you need to gradually modify and refine your data.
This way you can store raw data in some bronze layer, and gradually refine, clean, and modify data and move them into other layers.
Serverless Synapse SQL pools in Azure Synapse Analytics enable you to read various formats that are stored in Azure Data Lake storage. However, Delta format is not supported. If you would like to have a native/built-in support for Delta format vote for this feature on Azure Feedback site.
There are some workarounds that enable you to read Delta Lake format in serverless SQL pool. One popular workaround is described in this article. In that article, you can see how to leverage manifests files that you can generate with Apache Spark pool in Azure Synapse Analytics or Azure DataBricks, and create views based on these manifest files.
One drawback of the approach described in James-Giang Nguyen article is that you must have Spark pool that will generate manifest file.
In this article, you will see another solution that doesn’t require Apache Spark or Azure DataBricks to generate a manifest file. This solution uses a logic implemented in PowerShell to generate a view on top of Delta files.
First, you need to have a data set in Delta format. The data set that I will use in this article is stored on Azure Data Lake storage.
In Delta lake structure, you can see a set of folders with the pattern puYear=* that contain Parquet files, and one folder called _delta_log where we have all changes that are made on the Parquet file. If you want to know more about the structure of this folder you can read this article.
You will also need to have one Synapse workspace (make sure that it is in the same region as your data lake storage). Synapse Analytics workspace will automatically provide a serverless SQL pool that you can use to query data on Azure Data Lake storage.
Make sure that you are able to read data on your Azure Data Lake storage from your serverless SQL pool in the Synapse Analytics workspace. You can create a credential with the URI of your Azure Data Lake storage and shared access signature that will be used to access data:
Now you are ready to create a T-SQL view on top of these Delta files.
Creating Delta Lake view using PowerShell
First, take the PowerShell script from this GitHub location. Download the script and open it in some editor/IDE that can execute PowerShell. I’m using VS Code. Make the following changes in the script:
- Set connection information to your serverless SQL endpoint with SQL username/password
- Define the name of the view that will be created.
- Specify the location of your Delta folder and optionally partitioning columns.
In the following image, you can see where to populate connection information.
Make sure that you have created your database in serverless Synapse SQL, and that you are putting here the valid credentials.
Then you need to specify the name of the view that will be generated and the URI of your Delta folder:
I would like to generate a view called dbo.green, and the path to my DeltaLake file is https://sqlondemandstorage.blob.core.windows.net/delta-lake/green/. My Delta Lake is partitioned using the puYear=*/puMonth=* folder pattern, so I will put these columns as partitioning columns.
Now you just need to run this PowerShell script.
Once you run the script, you will see a view called dbo.green in the database DeltaLakeDemo that enables you to read the data from Delta Lake storage.
If you connect to your serverless SQL endpoint and the database, you will see the view and will be able to query Delta Lake using this view:
With this view, you can run any T-SQL query on top of your delta folder.
This PowerShell script enables you to easily create a T-SQL view on top of Delta Lake files. This view enables you to read Delta Lake files using a serverless Synapse SQL pool.
Note that this view references the snapshot of the Delta Lake folder at the time of running this script. You would need to re-run the PowerShell script again if you change the data in the Delta folder in order to reference the latest data.
You can also use this view to reference the snapshot of your Delta data set at the time of creation. This might be useful for you if you always want to see the snapshot of your data. In this scenario, you can create new views for every new snapshot that you need.
IMPORTANT: Note that Delta Lake is not supported by serverless Synapse SQL pool. This is a workaround that uses a view created on Parquet files. This is an open-source solution and not a part of Azure service.
If you need this scenario and if you would like to have native/built-in support for Delta format in the Synapse Analytics pools, vote for this feature on Azure Feedback site.
Disclaimer: This is personal work and is not an official capability from either Azure Synapse Analytics, Apache Delta Lake, or Databricks. The PowerShell sample is open-sourced and you can modify it. If you notice some issues, file a bug in this repository: https://github.com/JocaPC/tsql-delta-lake-view/issues or feel free to send the pull request.