Dynamic SQL in Databricks and SQL Server

What is dynamic SQL?

Dynamic SQL is a programming technique where you build a general-purpose query as a string variable, then alter parts of that string at runtime (the actions it performs, the data it returns, or the objects it acts on) before the string is actually executed.

Some form of dynamic SQL execution exists on all technologies that support the SQL language.

In this article, I will focus on Microsoft SQL Server (On-Premises or Azure) and Databricks.

Use Cases

  1. In Azure Synapse Analytics (Serverless and Dedicated Pool) you can dynamically generate an external table pointing to a location in the data lake which is determined at runtime. This is ideal when you want the external table to only point to the latest file available in a lake, not all historical files.
  2. Maintenance – you can perform the same maintenance task across a large set of objects by making the object name in the maintenance code dynamic. Typical actions include rebuilding indexes, dropping tables that are no longer needed, clearing database caches, and so on (see the sketch just below this list).
  3. Pivot and Unpivot – you can create a stored procedure to pivot or unpivot a table structure based on the column names sent in as parameters. This means you never have to google how to do the actual pivot or unpivot again – joy!
  4. JSON Normalization – you can dynamically normalize a JSON object or array by extracting all the attributes from the JSON string and generating the code to persist the normalized structure.
  5. Optimized data retrieval – this is very useful for databases that serve requests from multiple users at a time. You can create a dynamic query that filters a column based on the value sent in from a front-end application at runtime. This means a reduced set of data is returned, improving performance and limiting the data exposed to each user.
These are but a few, I am not going to list all the use cases, I do have a life you know…
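
As an illustration of the maintenance use case, here is a minimal T-SQL sketch that generates and runs an ALTER INDEX REBUILD statement for every table in a schema (the Staging schema name is just an assumption for the example):

/*Minimal sketch: rebuild all indexes on every table in a (hypothetical) Staging schema*/
DECLARE @vSqlQuery NVARCHAR(MAX) = N'';

/*Build one ALTER INDEX statement per table by concatenating the object names at runtime*/
SELECT @vSqlQuery = @vSqlQuery
    + 'ALTER INDEX ALL ON '
    + QUOTENAME(s.[name]) + '.' + QUOTENAME(t.[name])
    + ' REBUILD;' + CHAR(13)
FROM sys.tables AS t
INNER JOIN sys.schemas AS s ON s.[schema_id] = t.[schema_id]
WHERE s.[name] = 'Staging';

/*Print first so the generated statements can be inspected, then execute*/
PRINT @vSqlQuery;
EXEC (@vSqlQuery);

The same pattern works for dropping obsolete tables or any other per-object maintenance action – only the string being built changes.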

Benefits

The main benefit is that you can increase the level of automation in your application by generating code dynamically, saving your developers the time of writing it all out manually – especially when the majority of the query structure stays the same, with only minor differences here and there.

Considerations

  1. Dynamic SQL can be hard to debug. I typically print out the generated SQL statement before adding the EXEC call at all, so I can inspect the generated code and run it manually before letting SQL run it for me.
  2. It is vulnerable to SQL injection attacks, especially if you rely on input from users rather than input from secure backend sources the users don’t interact with.
  3. It can be slower to execute, because the execution plan may have to be generated anew each time the statement text changes – depending on how much of the query is altered as part of your dynamic logic.

SQL Server Example

In this example, the query will filter the table based on a dynamic value passed in at runtime.
/*Variable to hold the value we will filter on at runtime.
  In a stored procedure this would be an input parameter; here it is hard-coded for the example.*/
DECLARE @vCustomerID INT = 123;

/*Variable to hold the query string we will alter at runtime then execute*/
DECLARE @vSqlQuery VARCHAR(4000);

/*Define the base version of the query.
  The CAST is needed because an INT cannot be concatenated onto a string directly.*/
SET @vSqlQuery = '
SELECT *
FROM [dbo].[Customers]
WHERE [Customers].[CustomerID] = ' + CAST(@vCustomerID AS VARCHAR(10));

/*Execute the code*/
EXEC (@vSqlQuery);
When we execute this code (or wrap it in a stored procedure and pass in the value 123), the final resulting query looks like the below.
SELECT *
FROM [dbo].[Customers]
WHERE [Customers].[CustomerID] = 123
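
To reduce the SQL injection risk mentioned in the considerations, the filter value can be bound as a real parameter instead of being concatenated into the string. Here is a minimal sketch of the same query using sp_executesql (the parameter name @pCustomerID is just illustrative):

/*The value is bound by sp_executesql rather than concatenated into the string*/
DECLARE @vCustomerID INT = 123;
DECLARE @vSqlQuery NVARCHAR(4000);

SET @vSqlQuery = N'
SELECT *
FROM [dbo].[Customers]
WHERE [Customers].[CustomerID] = @pCustomerID';

/*Print the statement first for debugging, then let sp_executesql bind the parameter*/
PRINT @vSqlQuery;
EXEC sp_executesql @vSqlQuery, N'@pCustomerID INT', @pCustomerID = @vCustomerID;

Note that only values can be parameterised this way – object names such as the table still have to be concatenated in, which is exactly where dynamic SQL earns its keep.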

Databricks Example

In Databricks, there are situations where using SQL is the better choice over Python or Scala to perform certain actions, e.g. when working with Delta tables and merging new data into them.
 
Some pre-requisites for the below example:
  1. You need to have a dataframe already defined earlier in the notebook and saved as a temporary view called sourceTemporaryView (see the snippet after this list).
  2. The dataframe must have a field called ProductCategory to partition on. The PARTITIONED BY line in the code below can also be made dynamic so that the partition column is passed in as a parameter.
  3. You must have a linked data lake where the data for this Delta table will be stored. This location is defined by the vDeltaTablePath variable and should ideally be the path to a mounted data lake location.
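As a minimal sketch of the first prerequisite (the dataframe name and the bronze.Product source table are hypothetical; only the view name sourceTemporaryView is assumed by the example below):

# Hypothetical source dataframe read earlier in the notebook;
# only the temporary view name matters for the example that follows.
someSourceDf = spark.read.table("bronze.Product")
someSourceDf.createOrReplaceTempView("sourceTemporaryView")
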
Here is an example of python code where you can dynamically generate a new Delta table using SQL based on the data in a pre-defined dataframe.
# Example values - in practice these would typically arrive as notebook parameters
pDataLakeContainer = "silver"
pTableName = "Product"
vDeltaTablePath = "/mnt/silver/WorldWideImporters/Batch/Product/"

vDeltaTableCreateStatement = 'CREATE TABLE IF NOT EXISTS ' \
+ pDataLakeContainer + '.' + pTableName + ' \n' \
+ 'USING DELTA ' + '\n' \
+ 'LOCATION \'' + vDeltaTablePath + '\' ' + '\n' \
+ 'PARTITIONED BY ( ProductCategory )' + '\n' \
+ 'AS' + '\n' \
+ 'SELECT * FROM sourceTemporaryView'

#Check final output
print('vDeltaTableCreateStatement: ' + vDeltaTableCreateStatement)

#Execute the SQL
spark.sql(vDeltaTableCreateStatement)
The final SQL query generated and executed will look something like this:
CREATE TABLE IF NOT EXISTS silver.Product 
USING DELTA 
LOCATION '/mnt/silver/WorldWideImporters/Batch/Product/' 
PARTITIONED BY ( ProductCategory )
AS
SELECT * FROM sourceTemporaryView

Closing Arguments

This article only gave a very high-level overview of the potential that dynamic SQL holds.
 
But remember, with great power, comes great responsibility.
 
Only use dynamic SQL when necessary as it comes with a cost – see the considerations above.

Ok, good luck, have fun, and don’t accidentally drop the production database using this information – I take no responsibility.

If you like what I do please consider supporting me on Ko-Fi
