Set up the data

In this section, we discuss the data set-up phase, which is the second part of the standard development workflow. The steps are as follows:

  • Add datasets to your data/ folder, according to data engineering convention
  • Register the datasets with the Data Catalog, the registry of all data sources available to the project, which is defined in conf/base/catalog.yml. Registering datasets keeps your code reproducible when it references data in different locations and/or environments.

You can find further information about the Data Catalog in specific documentation covering advanced usage.

Add your datasets to data

The spaceflights tutorial makes use of fictional datasets of companies shuttling customers to the Moon and back. You will use the data to train a model to predict the price of shuttle hire. Before you train the model, however, you need to prepare the data through data engineering, which is the process of preparing data for model building by creating a master table.

The spaceflights tutorial has three files and uses two data formats: .csv and .xlsx. Download and save the files to the data/01_raw/ folder of your project directory.

Here are some examples of how you can download the files from GitHub to the data/01_raw directory inside your project:

Using cURL in a Unix terminal:

# reviews
curl -o data/01_raw/reviews.csv https://raw.githubusercontent.com/quantumblacklabs/kedro-examples/master/kedro-tutorial/data/01_raw/reviews.csv
# companies
curl -o data/01_raw/companies.csv https://raw.githubusercontent.com/quantumblacklabs/kedro-examples/master/kedro-tutorial/data/01_raw/companies.csv
# shuttles
curl -o data/01_raw/shuttles.xlsx https://raw.githubusercontent.com/quantumblacklabs/kedro-examples/master/kedro-tutorial/data/01_raw/shuttles.xlsx

Using cURL for Windows:

curl -o data\01_raw\reviews.csv https://raw.githubusercontent.com/quantumblacklabs/kedro-examples/master/kedro-tutorial/data/01_raw/reviews.csv
curl -o data\01_raw\companies.csv https://raw.githubusercontent.com/quantumblacklabs/kedro-examples/master/kedro-tutorial/data/01_raw/companies.csv
curl -o data\01_raw\shuttles.xlsx https://raw.githubusercontent.com/quantumblacklabs/kedro-examples/master/kedro-tutorial/data/01_raw/shuttles.xlsx

Using Wget in a Unix terminal:

# reviews
wget -O data/01_raw/reviews.csv https://raw.githubusercontent.com/quantumblacklabs/kedro-examples/master/kedro-tutorial/data/01_raw/reviews.csv
# companies
wget -O data/01_raw/companies.csv https://raw.githubusercontent.com/quantumblacklabs/kedro-examples/master/kedro-tutorial/data/01_raw/companies.csv
# shuttles
wget -O data/01_raw/shuttles.xlsx https://raw.githubusercontent.com/quantumblacklabs/kedro-examples/master/kedro-tutorial/data/01_raw/shuttles.xlsx

Using Wget for Windows:

wget -O data\01_raw\reviews.csv https://raw.githubusercontent.com/quantumblacklabs/kedro-examples/master/kedro-tutorial/data/01_raw/reviews.csv
wget -O data\01_raw\companies.csv https://raw.githubusercontent.com/quantumblacklabs/kedro-examples/master/kedro-tutorial/data/01_raw/companies.csv
wget -O data\01_raw\shuttles.xlsx https://raw.githubusercontent.com/quantumblacklabs/kedro-examples/master/kedro-tutorial/data/01_raw/shuttles.xlsx
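
If neither cURL nor Wget is available, the same three downloads can be sketched with the Python standard library. This is a minimal sketch, not part of the tutorial itself: the function names build_targets and download_all are hypothetical, and it assumes you run it from the root of your project so that the relative path data/01_raw resolves correctly. The URLs are the ones used in the commands above.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL shared by the three tutorial files (taken from the commands above).
BASE_URL = (
    "https://raw.githubusercontent.com/quantumblacklabs/kedro-examples/"
    "master/kedro-tutorial/data/01_raw/"
)
FILENAMES = ["reviews.csv", "companies.csv", "shuttles.xlsx"]


def build_targets(raw_dir="data/01_raw"):
    """Map each source URL to its destination path under raw_dir."""
    return {BASE_URL + name: Path(raw_dir) / name for name in FILENAMES}


def download_all(raw_dir="data/01_raw"):
    """Download every file, creating raw_dir first if it does not exist."""
    targets = build_targets(raw_dir)
    for url, destination in targets.items():
        destination.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, str(destination))
    return list(targets.values())
```

Calling download_all() from the project root should leave reviews.csv, companies.csv and shuttles.xlsx in data/01_raw/. Because pathlib handles path separators, the same call should work on both Unix and Windows.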

Register the datasets

You now need to register the datasets so they can be loaded by Kedro. All Kedro projects have a conf/base/catalog.yml file, and you register each dataset by adding a named entry to that file. An entry can specify the following:

  • File location (path)
  • Parameters for the given dataset
  • Type of data
  • Versioning
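
To make the four items above concrete, here is a sketch of one entry written as a Python dictionary, which mirrors the structure of the YAML in catalog.yml. The spaceflights entries in this section set only the type and the file location; the load_args and versioned values shown here are illustrative additions, and the helper missing_keys is hypothetical. Treating type and filepath as the required keys is an assumption that holds for simple file-based datasets such as pandas.CSVDataSet.

```python
from typing import Any, Dict, List

# One catalog entry as a dictionary. The dataset name ("companies") would be
# the top-level key in catalog.yml; these are the keys nested beneath it.
companies_entry: Dict[str, Any] = {
    "type": "pandas.CSVDataSet",              # type of data
    "filepath": "data/01_raw/companies.csv",  # file location (path)
    "load_args": {"sep": ","},                # parameters (illustrative value)
    "versioned": False,                       # versioning (illustrative value)
}

# Assumption: a file-based dataset needs at least these two keys.
REQUIRED_KEYS = ("type", "filepath")


def missing_keys(entry: Dict[str, Any]) -> List[str]:
    """Return the required keys that are absent from a catalog entry."""
    return [key for key in REQUIRED_KEYS if key not in entry]
```

A dictionary of this shape, keyed by dataset name, should be what kedro.io.DataCatalog.from_config accepts; check the API documentation for your Kedro version before relying on that.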

Kedro supports a number of different data types, which are listed in the API documentation. Kedro uses fsspec to read data from a variety of data stores, including local file systems, network file systems, cloud object stores and HDFS.

csv

For the spaceflights data, first register the csv datasets by adding this snippet to the end of the conf/base/catalog.yml file:

companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv

reviews:
  type: pandas.CSVDataSet
  filepath: data/01_raw/reviews.csv

To check whether Kedro can load the data correctly, open a kedro ipython session and run:

context.catalog.load("companies").head()

The command loads the dataset named companies (the top-level key in catalog.yml) from the underlying filepath data/01_raw/companies.csv into a pandas DataFrame and displays its first five rows. You can then use the DataFrame to experiment with the data.
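
If the load fails, it can help to inspect the raw file outside Kedro to confirm it was downloaded intact. The following is a rough standard-library analogue of looking at the first five rows, not what Kedro does internally: the function name peek_csv is hypothetical, and the sketch assumes the file is UTF-8 encoded with a header row.

```python
import csv
from itertools import islice


def peek_csv(path, n_rows=5):
    """Return the header and the first n_rows data rows of a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])           # first line is assumed to be the header
        rows = list(islice(reader, n_rows)) # at most n_rows rows after the header
    return header, rows
```

For example, header, rows = peek_csv("data/01_raw/companies.csv") returns the column names and up to five rows as lists of strings, without any type conversion.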

When you have finished, close the ipython session as follows:

exit()

xlsx

Now register the xlsx dataset by adding this snippet to the end of the conf/base/catalog.yml file:

shuttles:
  type: pandas.ExcelDataSet
  filepath: data/01_raw/shuttles.xlsx

To test that everything works as expected, load the dataset within a new kedro ipython session:

context.catalog.load("shuttles").head()

When you have finished, close the ipython session as follows:

exit()

Custom data

Kedro supports a number of datasets out of the box, but you can also add support for any proprietary data format or filesystem in your pipeline.

You can find further information about how to add support for custom datasets in specific documentation covering advanced usage.