Get the latest file from Azure Data Lake in Databricks
3/19/2022 · 1 min read
There are many ways to orchestrate a data flow in the cloud.
One such option is to have an independent process pull data from source systems and land the latest batch in an Azure Data Lake as a single file. The next layer, where you process the data, can be handled in many ways. The most independent approach is to have the processing layer fetch the latest file from the Data Lake on its own. This way the processing layer does not depend on a previous tool or service passing it the file path, which increases fault tolerance.
In Databricks, there is no built-in function to get the latest file from a Data Lake. Other libraries can provide such functionality, but it is advisable to stick to standardised libraries and code as far as possible.
Below are 2 functions that can work together to go to a directory in an Azure Data Lake and return the full file path of the last modified file. This file can then be processed as normal in Databricks.
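The two functions can be sketched as follows. This is a minimal sketch, not the original post's code: it assumes you are on Databricks, where `dbutils.fs.ls` returns `FileInfo` entries with a `modificationTime` field (available in recent Databricks Runtime versions) and marks directories with a trailing slash. The listing function is passed in as a parameter so the logic can be exercised outside Databricks; on a cluster you would pass `dbutils.fs.ls`.

```python
from collections import namedtuple

# Mirrors the shape of the FileInfo objects returned by dbutils.fs.ls
# (path, name, size, modificationTime in milliseconds since epoch).
FileInfo = namedtuple("FileInfo", ["path", "name", "size", "modificationTime"])

def get_dir_content(path, ls):
    """Recursively yield every file under `path`.

    `ls` is a listing function such as dbutils.fs.ls; directories are
    detected by the trailing slash dbutils puts on their paths.
    """
    for entry in ls(path):
        if entry.path.endswith("/"):
            yield from get_dir_content(entry.path, ls)
        else:
            yield entry

def get_latest_file(path, ls):
    """Return the full path of the most recently modified file under `path`."""
    files = list(get_dir_content(path, ls))
    if not files:
        raise FileNotFoundError(f"No files found under {path}")
    return max(files, key=lambda f: f.modificationTime).path
```

On a Databricks cluster you would call it against your Data Lake path, for example `get_latest_file("abfss://container@account.dfs.core.windows.net/landing/", dbutils.fs.ls)` (the container and account names here are placeholders), and pass the returned path straight into your normal read logic, e.g. `spark.read.csv(...)`.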
Thank you for reading my ramblings. If you want to, you can buy me a coffee here:
All content on this website is my own. I create the posts here to help the community as best I can. That doesn't mean I am always correct, or that the methods I show here are the best. We all change and learn as we grow, so if you see something you think I could have done better, please reach out! Let's share the knowledge and be kind to each other!