How to deploy your Kedro pipeline on Apache Airflow with Astronomer

Note

This documentation is based on Kedro 0.17.1. If you spot anything that is incorrect then please create an issue or pull request.

This tutorial explains how to deploy a Kedro project on Apache Airflow with Astronomer. Apache Airflow is a popular open-source workflow management platform. Workflows in Airflow are modelled and organised as DAGs (directed acyclic graphs), which makes it a suitable engine to orchestrate and execute a pipeline authored with Kedro. Astronomer is a managed Airflow platform that lets users spin up and run an Airflow cluster in production. It also provides a set of tools to help users get started with Airflow locally.

The following sections describe how to run the example Iris classification pipeline on a local Airflow cluster with Astronomer.

Strategy

The general strategy to deploy a Kedro pipeline on Apache Airflow is to run every Kedro node as an Airflow task, while the whole pipeline is converted into a DAG for orchestration purposes. This approach mirrors the principles of running Kedro in a distributed environment.
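The node-to-task mapping can be illustrated with a minimal sketch in plain Python. It does not use the Kedro or Airflow APIs: the `Node` class, the `task_dependencies` helper and the node names are hypothetical stand-ins, and the dataset names are only loosely modelled on the Iris example. Each node becomes one task, and a task depends on whichever tasks produce its inputs.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a Kedro node: a name plus dataset names."""

    name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]


def task_dependencies(nodes: List[Node]) -> Dict[str, Set[str]]:
    """Map each task name to the set of task names that must finish first."""
    # Record which node produces each dataset.
    producer = {out: node.name for node in nodes for out in node.outputs}
    # A task depends on the producers of its inputs; raw inputs have no producer.
    return {
        node.name: {producer[i] for i in node.inputs if i in producer}
        for node in nodes
    }


# Toy pipeline; node names are illustrative, not taken from the starter.
pipeline = [
    Node(
        "split",
        ("example_iris_data",),
        ("example_train_x", "example_train_y", "example_test_x", "example_test_y"),
    ),
    Node("train", ("example_train_x", "example_train_y"), ("example_model",)),
    Node("predict", ("example_model", "example_test_x"), ("example_predictions",)),
    Node("report", ("example_predictions", "example_test_y"), ()),
]

deps = task_dependencies(pipeline)
for task, upstream in deps.items():
    print(task, "<-", sorted(upstream))
```

The resulting dependency map is the kind of edge set an orchestrator needs to schedule the tasks in a valid order.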

Prerequisites

To follow along with this tutorial, make sure you have Kedro and the Astro CLI installed; both are used in the steps below.

Project Setup

  1. Initialise an Airflow project with Astro. Let’s call it kedro-airflow-iris:

    mkdir kedro-airflow-iris
    cd kedro-airflow-iris
    astro dev init
    
  2. Create a new Kedro project using the pandas-iris starter. You can use the default value in the project creation process:

    kedro new --starter=pandas-iris
    
  3. Copy all files and directories under new-kedro-project, which was the default project name created in step 2, to the root directory so Kedro and Astro CLI share the same project root:

    cp -r new-kedro-project/* .
    rm -r new-kedro-project
    

    After this step, your project should have the following structure:

    .
    ├── Dockerfile
    ├── README.md
    ├── airflow_settings.yaml
    ├── conf
    ├── dags
    ├── data
    ├── docs
    ├── include
    ├── logs
    ├── notebooks
    ├── packages.txt
    ├── plugins
    ├── pyproject.toml
    ├── requirements.txt
    ├── setup.cfg
    └── src
    
  4. Install kedro-airflow~=0.4. We will use this plugin to convert the Kedro pipeline into an Airflow DAG.

    pip install kedro-airflow~=0.4
    
  5. Run kedro install to install all dependencies.
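As a quick sanity check after the setup steps, the sketch below compares a directory listing with the project tree shown in step 3. The `EXPECTED` list is copied from that tree; the `missing_entries` helper is a hypothetical name introduced here, not part of Kedro or the Astro CLI.

```python
import os
from typing import Iterable, List

# Top-level entries copied from the project tree shown in step 3.
EXPECTED = [
    "Dockerfile", "README.md", "airflow_settings.yaml", "conf", "dags",
    "data", "docs", "include", "logs", "notebooks", "packages.txt",
    "plugins", "pyproject.toml", "requirements.txt", "setup.cfg", "src",
]


def missing_entries(present: Iterable[str], expected: List[str] = EXPECTED) -> List[str]:
    """Return the expected entries that are absent from the given listing."""
    present_set = set(present)
    return [name for name in expected if name not in present_set]


# Run from the project root; an empty list means the layout matches the tree.
print(missing_entries(os.listdir(".")))
```

If directories such as conf or src are reported as missing, the copy in step 3 probably did not recurse into subdirectories.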

Deployment process

Step 1. Create a new configuration environment to prepare a compatible DataCatalog

  • Create a conf/airflow directory in your Kedro project

  • Create a catalog.yml file in this directory with the following content:

example_iris_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/iris.csv
example_train_x:
  type: pickle.PickleDataSet
  filepath: data/05_model_input/example_train_x.pkl
example_train_y:
  type: pickle.PickleDataSet
  filepath: data/05_model_input/example_train_y.pkl
example_test_x:
  type: pickle.PickleDataSet
  filepath: data/05_model_input/example_test_x.pkl
example_test_y:
  type: pickle.PickleDataSet
  filepath: data/05_model_input/example_test_y.pkl
example_model:
  type: pickle.PickleDataSet
  filepath: data/06_models/example_model.pkl
example_predictions:
  type: pickle.PickleDataSet
  filepath: data/07_model_output/example_predictions.pkl

This ensures that all datasets are persisted, so every Airflow task can read them without sharing memory. The example here assumes that all Airflow tasks share one disk; in a distributed environment you would need to use non-local filepaths.
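Kedro treats any dataset without a catalog entry as an in-memory dataset, which a separate Airflow task cannot read. The following sketch checks a list of dataset names against the catalog above, written out as a plain dictionary. The `unpersisted` helper and the `PIPELINE_DATASETS` list are hypothetical, and the parameter name in that list is an assumption about the starter's pipeline.

```python
from typing import Dict, Iterable, List

# Entries copied from conf/airflow/catalog.yml above, as a plain dict.
CATALOG: Dict[str, Dict[str, str]] = {
    "example_iris_data": {"type": "pandas.CSVDataSet", "filepath": "data/01_raw/iris.csv"},
    "example_train_x": {"type": "pickle.PickleDataSet", "filepath": "data/05_model_input/example_train_x.pkl"},
    "example_train_y": {"type": "pickle.PickleDataSet", "filepath": "data/05_model_input/example_train_y.pkl"},
    "example_test_x": {"type": "pickle.PickleDataSet", "filepath": "data/05_model_input/example_test_x.pkl"},
    "example_test_y": {"type": "pickle.PickleDataSet", "filepath": "data/05_model_input/example_test_y.pkl"},
    "example_model": {"type": "pickle.PickleDataSet", "filepath": "data/06_models/example_model.pkl"},
    "example_predictions": {"type": "pickle.PickleDataSet", "filepath": "data/07_model_output/example_predictions.pkl"},
}

# Assumed list of names the pipeline reads or writes (illustrative only).
PIPELINE_DATASETS = list(CATALOG) + ["params:example_test_data_ratio"]


def unpersisted(dataset_names: Iterable[str], catalog: Dict[str, Dict[str, str]]) -> List[str]:
    """Return dataset names that would live only in memory."""
    missing = []
    for name in dataset_names:
        # Parameters come from configuration, not from the catalog.
        if name == "parameters" or name.startswith("params:"):
            continue
        entry = catalog.get(name)
        if entry is None or entry.get("type") == "MemoryDataSet":
            missing.append(name)
    return missing


print(unpersisted(PIPELINE_DATASETS, CATALOG))
```

An empty result means every dataset passed between tasks has a file-backed entry.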

Step 2. Package the Kedro pipeline as an Astronomer-compliant Docker image

  • Step 2.1: Package the Kedro pipeline as a Python package so you can install it into the container later on:

kedro package

This step should produce a wheel file called new_kedro_project-0.1-py3-none-any.whl located at src/dist.
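The wheel's name follows the standard wheel naming scheme (distribution, version, Python tag, ABI tag, platform tag). The sketch below rebuilds the expected path so that it can be compared with the one used in the Dockerfile. The `wheel_filename` helper is a hypothetical name, the default tags are assumed to match a pure-Python package, and only the distribution name is escaped here.

```python
import re
from pathlib import Path


def wheel_filename(
    distribution: str,
    version: str,
    python_tag: str = "py3",
    abi_tag: str = "none",
    platform_tag: str = "any",
) -> str:
    """Build a wheel filename following the PEP 427 naming convention."""
    # PEP 427 replaces runs of characters other than word characters and dots.
    escaped = re.sub(r"[^\w\d.]+", "_", distribution)
    return f"{escaped}-{version}-{python_tag}-{abi_tag}-{platform_tag}.whl"


# Package name and version taken from the tutorial text above.
wheel_path = Path("src") / "dist" / wheel_filename("new_kedro_project", "0.1")
print(wheel_path.as_posix())
```

If your project uses a different package name or version, the path in the Dockerfile needs to change accordingly.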

  • Step 2.2: Add the src/ directory to .dockerignore, as it’s not necessary to bundle the entire code base with the container once we have the packaged wheel file.

echo "src/" >> .dockerignore

  • Step 2.3: Modify the Dockerfile to have the following content:

FROM quay.io/astronomer/ap-airflow:2.0.0-buster-onbuild

RUN pip install --user src/dist/new_kedro_project-0.1-py3-none-any.whl

Step 3. Convert the Kedro pipeline into an Airflow DAG with kedro airflow

kedro airflow create --target-dir=dags/ --env=airflow

Step 4. Launch the local Airflow cluster with Astronomer

astro dev start

If you visit the Airflow UI, you should now see the Kedro pipeline as an Airflow DAG:

[Image: kedro_airflow_dag.png]

[Image: kedro_airflow_dag_run.png]

Final thought

This tutorial walks you through the manual process of deploying an existing Kedro project on Apache Airflow with Astronomer. However, if you are starting a new project, consider using our astro-iris starter, which provides all of the boilerplate described above out of the box:

kedro new --starter=astro-iris