Set up the data¶
In this section, we discuss the data set-up phase, which is the second part of the standard development workflow. The steps are as follows:
Add datasets to your data/ folder, according to data engineering convention
Register the datasets with the Data Catalog in conf/base/catalog.yml, which is the registry of all data sources available for use by the project. This ensures that your code is reproducible when it references datasets in different locations and/or environments.
You can find further information about the Data Catalog in specific documentation covering advanced usage.
Add your datasets to data¶
The spaceflights tutorial makes use of fictional datasets of companies shuttling customers to the Moon and back. You will use the data to train a model to predict the price of shuttle hire. However, before you get to train the model, you will need to prepare the data for model building by creating a model input table.
The spaceflights tutorial has three files and uses two data formats: .csv and .xlsx. Download and save the files (reviews.csv, companies.csv and shuttles.xlsx) to the data/01_raw/ folder of your project directory.
Here are some examples of how you can download the files from GitHub to the data/01_raw directory inside your project:
Using cURL in a Unix terminal:
# reviews
curl -o data/01_raw/reviews.csv https://kedro-org.github.io/kedro/reviews.csv
# companies
curl -o data/01_raw/companies.csv https://kedro-org.github.io/kedro/companies.csv
# shuttles
curl -o data/01_raw/shuttles.xlsx https://kedro-org.github.io/kedro/shuttles.xlsx
Using cURL for Windows:
curl -o data\01_raw\reviews.csv https://kedro-org.github.io/kedro/reviews.csv
curl -o data\01_raw\companies.csv https://kedro-org.github.io/kedro/companies.csv
curl -o data\01_raw\shuttles.xlsx https://kedro-org.github.io/kedro/shuttles.xlsx
Using Wget in a Unix terminal:
# reviews
wget -O data/01_raw/reviews.csv https://kedro-org.github.io/kedro/reviews.csv
# companies
wget -O data/01_raw/companies.csv https://kedro-org.github.io/kedro/companies.csv
# shuttles
wget -O data/01_raw/shuttles.xlsx https://kedro-org.github.io/kedro/shuttles.xlsx
Using Wget for Windows:
wget -O data\01_raw\reviews.csv https://kedro-org.github.io/kedro/reviews.csv
wget -O data\01_raw\companies.csv https://kedro-org.github.io/kedro/companies.csv
wget -O data\01_raw\shuttles.xlsx https://kedro-org.github.io/kedro/shuttles.xlsx
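If neither cURL nor Wget is available, the same three downloads can be sketched in Python using only the standard library. The base URL and file names are taken from the commands above; the function name download_raw_files and its fetch parameter are hypothetical names introduced for illustration and are not part of Kedro.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL and file names are copied from the cURL/Wget commands above.
BASE_URL = "https://kedro-org.github.io/kedro"
FILES = ("reviews.csv", "companies.csv", "shuttles.xlsx")


def download_raw_files(dest_dir="data/01_raw", fetch=urlretrieve):
    """Download each tutorial file into dest_dir and return the saved paths.

    `fetch` is any callable taking (url, filename). It defaults to
    urllib.request.urlretrieve and can be swapped out, e.g. for testing.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    saved = []
    for name in FILES:
        target = dest / name
        fetch(f"{BASE_URL}/{name}", str(target))
        saved.append(target)
    return saved
```

Calling download_raw_files() from the root of your project should leave the three files in data/01_raw, assuming the URLs are reachable from your machine.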
Register the datasets¶
You now need to register the datasets so they can be loaded by Kedro. All Kedro projects have a conf/base/catalog.yml file, and you register each dataset by adding a named entry into the .yml file. The entry should include the following:
File location (path)
Parameters for the given dataset
Type of data
Versioning
Kedro supports a number of different data types, and those supported can be found in the API documentation. Kedro uses fsspec to read data from a variety of data stores including local file systems, network file systems, cloud object stores and HDFS.
csv¶
For the spaceflights data, first register the csv datasets by adding this snippet to the end of the conf/base/catalog.yml file:
companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv

reviews:
  type: pandas.CSVDataSet
  filepath: data/01_raw/reviews.csv
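Before loading anything through Kedro, it can help to confirm that each registered filepath points at a file that actually exists. The following is a minimal sketch that is independent of Kedro: the CSV_ENTRIES dict simply mirrors the two catalog entries as plain Python, and missing_files is a hypothetical helper, not a Kedro function.

```python
from pathlib import Path

# Mirrors the two csv catalog entries as a plain dict (illustration only;
# Kedro itself reads these entries from conf/base/catalog.yml).
CSV_ENTRIES = {
    "companies": {
        "type": "pandas.CSVDataSet",
        "filepath": "data/01_raw/companies.csv",
    },
    "reviews": {
        "type": "pandas.CSVDataSet",
        "filepath": "data/01_raw/reviews.csv",
    },
}


def missing_files(entries, project_root="."):
    """Return the names of entries whose filepath is not a file under project_root."""
    root = Path(project_root)
    return [
        name
        for name, entry in entries.items()
        if not (root / entry["filepath"]).is_file()
    ]
```

Running missing_files(CSV_ENTRIES) from the project root returns an empty list when both files were downloaded to the expected location; any name it returns points to a path worth rechecking.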
To check whether Kedro can load the data correctly, open a kedro ipython session and run:
companies = catalog.load("companies")
companies.head()
Note
If this is the first kedro command you have executed in the project, you will be asked whether you wish to opt into usage analytics. Your decision is recorded in the .telemetry file so that subsequent calls to kedro in this project do not ask you again.
The command loads the dataset named companies (as per the top-level key in catalog.yml) from the underlying filepath data/01_raw/companies.csv into the variable companies, which is of type pandas.DataFrame. The head method from pandas then displays the first five rows of the DataFrame.
When you have finished, close the ipython session as follows:
exit()
xlsx¶
Now register the xlsx dataset by adding this snippet to the end of the conf/base/catalog.yml file:
shuttles:
  type: pandas.ExcelDataSet
  filepath: data/01_raw/shuttles.xlsx
  load_args:
    engine: openpyxl # Use modern Excel engine, will be default in Kedro 0.18.0
Note
The load_args are passed to the pd.read_excel function as keyword arguments; although not specified here, save_args would be passed to the pd.DataFrame.to_excel method.
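To make the keyword-argument forwarding concrete, here is a small sketch of the pattern. It is not Kedro's implementation: load_with_args is a hypothetical helper, and fake_read_excel is a stand-in for pd.read_excel that only echoes the arguments it receives, so the example runs without pandas or an Excel file.

```python
def load_with_args(reader, filepath, load_args=None):
    """Call reader(filepath, **load_args), mimicking how load_args are forwarded."""
    return reader(filepath, **(load_args or {}))


def fake_read_excel(path, engine=None, sheet_name=0):
    """Stand-in for pd.read_excel: reports the arguments instead of reading a file."""
    return {"path": path, "engine": engine, "sheet_name": sheet_name}


# The load_args mapping from the catalog entry becomes keyword arguments.
result = load_with_args(
    fake_read_excel,
    "data/01_raw/shuttles.xlsx",
    load_args={"engine": "openpyxl"},
)
print(result)
```

The printed dictionary shows engine set to openpyxl, while sheet_name keeps its default because the catalog entry does not override it.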
To test that everything works as expected, load the dataset within a new kedro ipython session and display its first five rows:
shuttles = catalog.load("shuttles")
shuttles.head()
When you have finished, close the ipython session as follows:
exit()
Custom data¶
Kedro supports a number of datasets out of the box, but you can also add support for any proprietary data format or filesystem in your pipeline.
You can find further information about how to add support for custom datasets in specific documentation covering advanced usage.