Creating new projects with Kedro Starters¶
Note: This documentation is based on Kedro 0.16.3; if you spot anything that is incorrect, please create an issue or pull request.
When creating a new project, sometimes you might want to customise the starting boilerplate provided by
kedro new to adapt to different use cases. For example, you might want to:
- Add initial configuration, initialisation code and example pipeline for PySpark
- Add a docker-compose setup to launch Kedro next to a monitoring stack
- Add deployment scripts and CI/CD setup for your targeted infrastructure
To address this need, we have added the ability to supply a starting project template to
kedro new through a --starter flag.
Introducing Kedro starters¶
A Kedro starter is a Cookiecutter template containing boilerplate code for a Kedro project. Each starter should encode best practices and provide utilities to help users bootstrap a new Kedro project for a particular use case in the most effective way. For example, we have created a
PySpark starter, which contains initial configuration and initialisation code for PySpark according to our recommended best practices.
To create a Kedro project using a starter, run:
kedro new --starter=<path-to-starter>
The path to the starter can be a local directory or a VCS repository, as long as it is supported by Cookiecutter.
For example, to create a project using the
PySpark starter above, run:
kedro new --starter=https://github.com/quantumblack/kedro-starter-pyspark.git
If no starter is provided to
kedro new, the default Kedro template will be used, as documented in Creating a new project.
Using starter aliases¶
For common starters maintained by the Kedro team, such as
PySpark, we provide aliases so that users don’t have to specify the full path to the starter. For example, to create a project using the
PySpark starter, you can simply run:
kedro new --starter=pyspark
To see a list of all supported aliases, run:
kedro starter list
List of official starters¶
The Kedro team maintains the following starters:
|Alias|Link to starter|Description|
|---|---|---|
|pandas-iris|https://github.com/quantumblacklabs/kedro-starter-pandas-iris|Provides an example iris-classification pipeline built with Kedro|
|pyspark|https://github.com/quantumblacklabs/kedro-starter-pyspark|Provides initial configuration and initialisation code for a Kedro pipeline using PySpark|
|pyspark-iris|https://github.com/quantumblacklabs/kedro-starter-pyspark-iris|Provides all features in the basic PySpark starter, plus an example pipeline to train a machine learning model with Spark primitives|
Using a starter’s version¶
By default, Kedro uses the latest commit on the default branch of the starter repository. However, if you want to use a specific version of a starter, you can pass a
--checkout argument to the command as follows:
kedro new --starter=pyspark --checkout=0.1.0
The --checkout value can point to a branch, tag or commit in the starter repository.
Under the hood, the value is passed to the
--checkout flag in Cookiecutter.
Using a starter in interactive mode¶
By default, creating a new project from a starter launches in interactive mode. You will need to provide the following variables, similar to running
kedro new without any argument:
- project_name - A human-readable name for your new project
- repo_name - A name for the directory that holds your project repository
- python_package - A Python package name for your project package (see Python package naming conventions)
This mode assumes that the starter doesn’t require any additional configuration variables.
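These variables correspond to the prompts a starter defines in its cookiecutter.json file. As an illustrative sketch only (not the actual file from any official starter), a minimal starter compatible with interactive mode might declare just the default variables, deriving repo_name and python_package from the project name as Cookiecutter templates commonly do:

```json
{
  "project_name": "New Kedro Project",
  "repo_name": "{{ cookiecutter.project_name.lower().replace(' ', '-') }}",
  "python_package": "{{ cookiecutter.project_name.lower().replace(' ', '_') }}"
}
```

A starter that adds further keys to this file would require extra input, which interactive mode does not cover; in that case, use a configuration file as described below.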
Using a starter with a configuration file¶
As documented in Creating a new project from a configuration file, Kedro also supports specifying a configuration file when creating a project through the
--config flag. You can use this flag with a starter seamlessly:
kedro new --config=my_kedro_pyspark_project.yml --starter=pyspark
This is particularly useful when the starter requires more configuration than the default variables supported by the interactive mode.
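For example, a configuration file for the command above might look like the following sketch. The values are placeholders; output_dir, project_name, repo_name and python_package are the variables the default template expects, and a starter may require additional keys of its own:

```yaml
# my_kedro_pyspark_project.yml (illustrative values)
output_dir: ~/projects
project_name: My PySpark Project
repo_name: my-pyspark-project
python_package: my_pyspark_project
```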