How to use Kedro on a Databricks cluster

Note: This documentation is based on Kedro 0.16.4; if you spot anything that is incorrect, please create an issue or pull request.

GitHub workflow with Databricks

This workflow posits that development of the Kedro project is done in a local environment under version control with Git. Commits are pushed to a remote server (e.g. GitHub, GitLab or Bitbucket).

Deployment of the (latest) code on the Databricks driver is accomplished through cloning and the periodic pulling of changes from the Git remote. The pipeline is then executed on the Databricks cluster.

While this example uses GitHub Personal Access Tokens (or their equivalents on Bitbucket, GitLab, etc.), you should be able to use your GitHub password as well, although this is less secure.

Firstly, you will need to generate a GitHub Personal Access Token with the relevant privileges.

Add your username and token to the environment variables of your running Databricks environment (all the following commands should be run inside a Notebook):

import os

os.environ["GITHUB_USER"] = "YOUR_USERNAME"
os.environ["GITHUB_TOKEN"] = "YOUR_TOKEN"
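Hard-coding a token in a notebook cell leaks the credential into the notebook's revision history. If your workspace has a Databricks secret scope configured, you can fetch the values from there instead and fall back to the environment variables above. This is only a sketch: the scope name and key names below are assumptions, so adjust them to your workspace.

```python
import os


def load_github_credentials(dbutils=None, scope="github"):
    """Return (user, token) for the git clone step.

    If a dbutils handle is passed (available inside Databricks
    notebooks), read from the given secret scope; the scope and key
    names here are assumptions -- adjust them to your workspace.
    Otherwise fall back to the GITHUB_USER / GITHUB_TOKEN environment
    variables set in the cell above.
    """
    if dbutils is not None:
        return (
            dbutils.secrets.get(scope=scope, key="user"),
            dbutils.secrets.get(scope=scope, key="token"),
        )
    return os.environ["GITHUB_USER"], os.environ["GITHUB_TOKEN"]
```

In a notebook you would call load_github_credentials(dbutils) and export the result into os.environ so the %sh cell below can expand it.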

Then clone your project to a directory of your choosing:

%sh mkdir -vp ~/projects/ && cd ~/projects/ &&
git clone https://${GITHUB_USER}:${GITHUB_TOKEN}@github.com/${GITHUB_USER}/your_project.git
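The %sh cell above relies on the shell expanding GITHUB_USER and GITHUB_TOKEN into an authenticated HTTPS URL. As a sanity check, the same URL can be built in Python; the owner/name slug passed in is a placeholder, not part of the original instructions.

```python
import os


def github_clone_url(repo_slug):
    """Build the authenticated HTTPS clone URL used by the %sh cell above.

    repo_slug is the "owner/name" part of the repository (a placeholder
    here); the credentials come from the environment variables set in
    the earlier cell.
    """
    user = os.environ["GITHUB_USER"]
    token = os.environ["GITHUB_TOKEN"]
    return f"https://{user}:{token}@github.com/{repo_slug}.git"
```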

Then change into your project directory from Python, since a cd inside a %sh cell does not persist to later cells:

import os

os.chdir(os.path.expanduser("~/projects/your_project"))

You’ll need to add the src directory to the path using:

import sys
import os.path

sys.path.append(os.path.abspath(os.path.expanduser("~/projects/your_project/src")))

Then, import and execute the run module to run your pipeline:

import your_project.run as run  # replace your_project with your project's package name

run.run_package()  # entry point generated by the Kedro 0.16 project template

To pull in updates to your code, change into your project directory and run git pull (the cd is needed because each %sh cell starts in the default directory):

%sh cd ~/projects/your_project && git pull

Detach and re-attach your Notebook, or reload the run module, for the changes to be picked up:

import importlib
run = importlib.reload(run)