Function to mount a storage account container to Azure Databricks

What does it mean to mount a storage account to Azure Databricks?

Databricks has a built-in “Databricks File System (DBFS)”. It is a distributed file system mounted onto your Databricks workspace, attached to your cluster and only accessible while the cluster is running, and it acts as an abstraction on top of object storage. The benefits of this approach include:
  • You can mount external storage to DBFS, giving you seamless access to the data without re-specifying the access credentials every time you want to access it
  • It allows you to interact with files using directory and file path semantics instead of cumbersome URLs that are unique to the storage service you are using (see the short example after this list)
  • It allows you to persist data after processing it, independently of the cluster's lifetime.
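
As a quick illustration of that path-based access, here is a minimal sketch; the mount point and file path below are placeholders and assume a container has already been mounted:

# List files on a mounted container using plain directory semantics
display(dbutils.fs.ls("/mnt/datalake_bronze"))

# Read a file through the mount point instead of a full abfss:// URL
df = spark.read.option("header", "true").csv("/mnt/datalake_bronze/sales/2023/")
display(df)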

DBFS root

The default storage location in DBFS is known as the DBFS root. Databricks stores its internal working objects here. It is advised not to work in this area; instead, mount your own storage as an additional mount point. The DBFS root is not intended for production customer data.

Mount object storage to DBFS

Manually writing the code to correctly mount your Azure Storage Account to Databricks can become cumbersome.
Here is a function you can use to ease this burden.
def mount_lake_container(pAdlsContainerName):
  
    """
    mount_lake_container: 
        Takes a container name and mounts it to Databricks for easy access. 
        Prints out the name of the mount point. 
        Uses a service principal to authenticate.
    Key Vault SecretScopeName = "KeyVault"
    Key Vault Secret with Data Lake Name: DataLakeStorageAccountName
    Key Vault Secret with ClientID = "DataLakeAuthServicePrincipleClientID"
    Key Vault Secret with ClientSecret = "DataLakeAuthServicePrincipleClientSecret"
    Key Vault Secret with TenantID = "DataLakeAuthServicePrincipleTenantID"
    """

    # KeyVault Secret Scope Name - use a variable because it is referenced multiple times
    vSecretScopeName = "KeyVault" # Fixed standardised name to ensure deployment from DEV to PROD is seamless.

    # Define the variables used for creating connection strings - Data Lake Related
    vAdlsAccountName = dbutils.secrets.get(scope=vSecretScopeName,key="DataLakeStorageAccountName") # e.g. "dummydatalake" - the storage account name itself
    vAdlsContainerName = pAdlsContainerName # e.g. rawdata, bronze, silver, gold, platinum etc.
    vMountPoint = "/mnt/datalake_" + vAdlsContainerName #fixed since we already parameterised the container name. Ensures there is a standard in mount point naming

    # Define the variables that hold the names of the Key Vault secrets storing the sensitive information we need for the connection via Service Principal auth
    vSecretClientID = "DataLakeAuthServicePrincipleClientID" # Name of the generic Key Vault secret containing the Service Principal application (client) ID.
    vSecretClientSecret = "DataLakeAuthServicePrincipleClientSecret" # Name of the generic Key Vault secret containing the Service Principal secret (password).
    vSecretTenantID = "DataLakeAuthServicePrincipleTenantID" # Name of the generic Key Vault secret containing the Tenant ID.

    # Get the actual secrets from Key Vault for the service principal
    vApplicationId = dbutils.secrets.get(scope=vSecretScopeName, key=vSecretClientID) # Application (Client) ID
    vAuthenticationKey = dbutils.secrets.get(scope=vSecretScopeName, key=vSecretClientSecret) # Application (Client) Secret Key
    vTenantId = dbutils.secrets.get(scope=vSecretScopeName, key=vSecretTenantID) # Directory (Tenant) ID

    # Using the secrets above, generate the URL to the storage account and the authentication endpoint for OAuth
    vEndpoint = "https://login.microsoftonline.com/" + vTenantId + "/oauth2/token" #Fixed URL for the endpoint
    vSource = "abfss://" + vAdlsContainerName + "@" + vAdlsAccountName + ".dfs.core.windows.net/"

    # Connecting using Service Principal secrets and OAuth
    vConfigs = {"fs.azure.account.auth.type": "OAuth", #standard
               "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider", #standard
               "fs.azure.account.oauth2.client.id": vApplicationId,
               "fs.azure.account.oauth2.client.secret": vAuthenticationKey,
               "fs.azure.account.oauth2.client.endpoint": vEndpoint}

    # Mount Data Lake Storage to Databricks File System only if the container is not already mounted
    # First generate a list of all mount points available already via dbutils.fs.mounts()
    # Then it checks the list for the new mount point we are trying to generate.
    if not any(mount.mountPoint == vMountPoint for mount in dbutils.fs.mounts()): 
      dbutils.fs.mount(
        source = vSource,
        mount_point = vMountPoint,
        extra_configs = vConfigs)

    # print the mount point used for troubleshooting in the consuming notebook
    print("Mount Point: " + vMountPoint)

Prerequisites

  1. You must have an Azure Key Vault-backed secret scope linked to your Databricks workspace (recommended), or alternatively your secrets stored in a Databricks-backed secret scope. This is where the sensitive values used for authentication are stored.
  2. You must already have an Azure Storage Account created with the Hierarchical Namespace enabled – this is what makes it a Data Lake (ADLS Gen2) and determines which endpoints can be used to connect to the storage account.
  3. You must have a Service Principal created in your Azure Active Directory tenant. This is the identity that will be used for authentication and authorization.
  4. The Service Principal must have the Storage Blob Data Contributor role on the storage account.
  5. The secret scope and secret names are hardcoded, so you should either set up your environment to match this code, or edit the function to match your environment (the snippet after this list can help you verify the scope and secret names).
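
To check that your secret scope and secret names line up with what the function expects, a quick sanity check along these lines can help (assuming the scope is called KeyVault, as in the function):

# List the secret scopes attached to this workspace
print(dbutils.secrets.listScopes())

# List the secret names (values stay hidden) inside the KeyVault scope
print(dbutils.secrets.list("KeyVault"))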

Usage

This function can now be called at any point to mount a container from your storage account to Databricks.
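
For example, to mount a container called bronze (the container name here is just a placeholder):

# Mounts the container and prints "Mount Point: /mnt/datalake_bronze"
mount_lake_container("bronze")

# The data is now reachable via standard file paths
display(dbutils.fs.ls("/mnt/datalake_bronze"))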
 
Tip: store this function in a separate notebook and use the %run magic command in any other notebook where you need it.
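
A sketch of that pattern, assuming the function is saved in a notebook called mount_functions in the same folder (the notebook name is hypothetical), with %run in a cell of its own:

%run ./mount_functions

A following cell can then call mount_lake_container directly, exactly as in the usage example above.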

Note: the function mounts a whole container, not a nested directory. To mount a specific directory instead, edit the vSource path in the function so that the directory name is appended after the final /, as sketched below.
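
As a sketch, the vSource line inside the function could be changed along these lines, where pDirectoryName is a hypothetical extra parameter you would add to the function signature:

# Hypothetical: mount a specific directory inside the container rather than the container root
vSource = "abfss://" + vAdlsContainerName + "@" + vAdlsAccountName + ".dfs.core.windows.net/" + pDirectoryName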


If you like what I do, please consider supporting me on Ko-Fi.
