Creating new projects with Kedro Starters

Note: This documentation is based on Kedro 0.16.3. If you spot anything that is incorrect, please create an issue or pull request.

When creating a new project, you might sometimes want to customise the starting boilerplate provided by kedro new to suit different use cases. For example, you might want to:

  • Add initial configuration, initialisation code and an example pipeline for PySpark
  • Add a docker-compose setup to launch Kedro next to a monitoring stack
  • Add deployment scripts and CI/CD setup for your targeted infrastructure

To address this need, we have added the ability to supply a starting project template to kedro new through the --starter flag.

Introducing Kedro starters

A Kedro starter is a Cookiecutter template containing boilerplate code for a Kedro project. Each starter should encode best practices and provide utilities to help users bootstrap a new Kedro project for a particular use case in the most effective way. For example, we have created a PySpark starter, which contains initial configuration and initialisation code for PySpark according to our recommended best practices.
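Concretely, a starter follows the standard Cookiecutter layout: a cookiecutter.json file declaring the template variables, and a templated project directory that is rendered into the new project. As an illustrative sketch only (the actual layout of the PySpark starter may differ):

cookiecutter.json
{{ cookiecutter.repo_name }}/
    conf/
    src/
        {{ cookiecutter.python_package }}/
    README.md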

To create a Kedro project using a starter, run:

kedro new --starter=<path-to-starter>

The path to the starter can be a local directory or a VCS repository, as long as it is supported by Cookiecutter.
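For instance, to use a starter that you keep in a local directory (the path below is purely illustrative), run:

kedro new --starter=/path/to/my-custom-starter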

For example, to create a project using the PySpark starter above, run:

kedro new --starter=https://github.com/quantumblack/kedro-starter-pyspark.git

If no starter is provided to kedro new, the default Kedro template will be used, as documented in Creating a new project.

Using starter aliases

For common starters maintained by the Kedro team, such as PySpark, we provide aliases so that users don’t have to specify the full path to the starter. For example, to create a project using the PySpark starter, you can simply run:

kedro new --starter=pyspark

To see a list of all supported aliases, run:

kedro starter list

List of official starters

The Kedro team maintains the following starters:

  • pandas-iris (https://github.com/quantumblacklabs/kedro-starter-pandas-iris): provides an example iris-classification pipeline built with Kedro
  • pyspark (https://github.com/quantumblacklabs/kedro-starter-pyspark): provides initial configuration and initialisation code for a Kedro pipeline using PySpark
  • pyspark-iris (https://github.com/quantumblacklabs/kedro-starter-pyspark-iris): provides all features of the pyspark starter, plus an example pipeline to train a machine learning model with Spark primitives

Using a starter’s version

By default, Kedro uses the latest commit on the default branch of the starter repository. However, if you want to use a specific version of a starter, you can pass a --checkout argument to the command as follows:

kedro new --starter=pyspark --checkout=0.1.0

The --checkout value can point to a branch, tag or commit in the starter repository. Under the hood, the value is passed to Cookiecutter’s --checkout flag.
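For example, assuming the corresponding refs exist in the starter repository (the branch name and commit hash below are placeholders), any of the following would work:

kedro new --starter=pyspark --checkout=develop
kedro new --starter=pyspark --checkout=0.1.0
kedro new --starter=pyspark --checkout=a1b2c3d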

Using a starter in interactive mode

By default, creating a new project from a starter launches in interactive mode. You will be prompted for the following variables, just as when running kedro new without any arguments:

  • project_name - A human-readable name for your new project
  • repo_name - A name for the directory that holds your project repository
  • python_package - A Python package name for your project package (see Python package naming conventions)

This mode assumes that the starter doesn’t require any additional configuration variables.
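For reference, a Cookiecutter template declares its variables in its cookiecutter.json file. A minimal sketch covering only the three default variables might look like this (the default values and derivations shown are illustrative, not copied from an official starter):

{
    "project_name": "New Kedro Project",
    "repo_name": "{{ cookiecutter.project_name.lower().replace(' ', '-') }}",
    "python_package": "{{ cookiecutter.repo_name.replace('-', '_') }}"
}

A starter that declares variables beyond these three cannot be fully configured in interactive mode; in that case, use a configuration file as described in the next section.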

Using a starter with a configuration file

As documented in Creating a new project from a configuration file, Kedro also supports specifying a configuration file when creating a project through the --config flag. You can use this flag seamlessly together with a starter:

kedro new --config=my_kedro_pyspark_project.yml --starter=pyspark

This is particularly useful when the starter requires more configuration than the default variables supported by the interactive mode.
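A minimal sketch of such a configuration file, with placeholder values, might look like the following (the exact set of supported keys is described in Creating a new project from a configuration file, and a given starter may expect additional keys):

output_dir: ~/projects
project_name: My PySpark Project
repo_name: my-pyspark-project
python_package: my_pyspark_project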