How to deploy your Kedro pipeline on Apache Airflow with Astronomer¶
Note: This documentation is based on Kedro 0.17.1. If you spot anything that is incorrect, please create an issue or pull request.
This tutorial explains how to deploy a Kedro project on Apache Airflow with Astronomer. Apache Airflow is an extremely popular open-source workflow management platform. Workflows in Airflow are modelled and organised as DAGs, making it a suitable engine to orchestrate and execute a pipeline authored with Kedro. Astronomer is a managed Airflow platform which lets users spin up and run an Airflow cluster in production with minimal effort. It also provides a set of tools to help users get started with Airflow locally as easily as possible.
The following discusses how to run the example Iris classification pipeline on a local Airflow cluster with Astronomer.
The general strategy for deploying a Kedro pipeline on Apache Airflow is to run every Kedro node as an Airflow task, while the whole pipeline is converted into a DAG for orchestration purposes. This approach mirrors the principles of running Kedro in a distributed environment.
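Conceptually, each node becomes one Airflow task that shells out to `kedro run --node=<name>`, and task ordering follows the pipeline's dependencies. The following is a minimal sketch of that mapping; the node names and dependencies here are hypothetical, and the real DAG is generated for you by the kedro-airflow plugin in step 3 below:

```python
# Sketch: map Kedro nodes to per-node run commands, as an Airflow DAG would.
# Node names and dependencies are illustrative placeholders; `kedro airflow
# create` derives the real ones from your pipeline.
nodes = {
    # node name -> upstream nodes it depends on
    "split": [],
    "train": ["split"],
    "predict": ["train"],
    "report_accuracy": ["predict"],
}

def task_command(node_name: str) -> str:
    """Each Airflow task runs exactly one Kedro node."""
    return f"kedro run --node={node_name}"

for name, upstream in nodes.items():
    deps = ", ".join(upstream) or "(nothing)"
    print(f"{task_command(name)}  # runs after: {deps}")
```

Because every task is an independent `kedro run` invocation, tasks share no in-process state, which is why step 1 below persists all intermediate datasets to disk.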
To follow along with this tutorial, make sure you have the Astro CLI and Docker installed, and a working Kedro installation (this guide was written against Kedro 0.17.1).
Initialise an Airflow project with Astro. Let's call it kedro-airflow-iris:

mkdir kedro-airflow-iris
cd kedro-airflow-iris
astro dev init
Create a new Kedro project using the pandas-iris starter. You can use the default values in the project creation process:
kedro new --starter=pandas-iris
Copy all files and directories under new-kedro-project (the default project name created in step 2) to the root directory, so Kedro and the Astro CLI share the same project root:

cp -r new-kedro-project/* .
rm -r new-kedro-project
After this step, your project should have the following structure:
.
├── Dockerfile
├── README.md
├── airflow_settings.yaml
├── conf
├── dags
├── data
├── docs
├── include
├── logs
├── notebooks
├── packages.txt
├── plugins
├── pyproject.toml
├── requirements.txt
├── setup.cfg
└── src
Install kedro-airflow~=0.4. We will use this plugin to convert the Kedro pipeline into an Airflow DAG:
pip install kedro-airflow~=0.4
Run kedro install to install all dependencies.
Step 1. Create a new configuration environment to prepare a compatible DataCatalog¶
Create a conf/airflow directory in your Kedro project.
Create a catalog.yml file in this directory with the following content:
example_iris_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/iris.csv

example_train_x:
  type: pickle.PickleDataSet
  filepath: data/05_model_input/example_train_x.pkl

example_train_y:
  type: pickle.PickleDataSet
  filepath: data/05_model_input/example_train_y.pkl

example_test_x:
  type: pickle.PickleDataSet
  filepath: data/05_model_input/example_test_x.pkl

example_test_y:
  type: pickle.PickleDataSet
  filepath: data/05_model_input/example_test_y.pkl

example_model:
  type: pickle.PickleDataSet
  filepath: data/06_models/example_model.pkl

example_predictions:
  type: pickle.PickleDataSet
  filepath: data/07_model_output/example_predictions.pkl
This ensures that all datasets are persisted, so every Airflow task can read them without needing to share memory. The example here assumes that all Airflow tasks share one disk; in a distributed environment you would need to use non-local filepaths.
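For example, in a distributed setup the same catalog entry could point at shared object storage instead. The bucket name below is illustrative, and this assumes the relevant filesystem dependencies (such as s3fs for S3 paths) are installed:

```yaml
example_train_x:
  type: pickle.PickleDataSet
  filepath: s3://my-bucket/05_model_input/example_train_x.pkl
```

Kedro datasets accept fsspec-style filepaths, so switching from local disk to shared storage is a catalog-only change.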
Step 2. Package the Kedro pipeline as an Astronomer-compliant Docker image¶
Step 2.1: Package the Kedro pipeline as a Python package so you can install it into the container later on:

kedro package

This step should produce a wheel file called new_kedro_project-0.1-py3-none-any.whl, located under src/dist/.
Step 2.2: Add the src directory to .dockerignore, as it's not necessary to bundle the entire code base with the container once we have the packaged wheel file:

echo "src/" >> .dockerignore
Step 2.3: Modify the Dockerfile to have the following content:
FROM quay.io/astronomer/ap-airflow:2.0.0-buster-onbuild
RUN pip install --user src/dist/new_kedro_project-0.1-py3-none-any.whl
Step 3. Convert the Kedro pipeline into an Airflow DAG with kedro airflow create¶
kedro airflow create --target-dir=dags/ --env=airflow
Step 4. Launch the local Airflow cluster with Astronomer¶
astro dev start
If you visit the Airflow UI, you should now see the Kedro pipeline as an Airflow DAG.
This tutorial walks you through the manual process of deploying an existing Kedro project on Apache Airflow with Astronomer. If you are just starting out, however, consider using our astro-iris starter, which provides all the aforementioned boilerplate out of the box:
kedro new --starter=astro-iris