Standalone use of the DataCatalog


To make it easier to share a Jupyter notebook with others, you need to avoid hard-coded file paths used to load or save data. One way to explore data within a shareable Jupyter notebook is to take advantage of Kedro’s DataCatalog, although in the early phases of a project you may not want to use any other Kedro features.

The Kedro starter with the alias standalone-datacatalog (formerly known as mini-kedro) provides this minimal functionality. You can specify the sources required to load and save data using a YAML API. For example:

# conf/base/catalog.yml
example_dataset_1:
  type: pandas.CSVDataSet
  filepath: folder/filepath.csv

example_dataset_2:
  type: spark.SparkDataSet
  filepath: s3a://your_bucket/data/01_raw/example_dataset_2*
  credentials: dev_s3
  file_format: csv
  save_args:
    if_exists: replace

This makes it possible to interact with data within your Jupyter notebook, with code much like this:

df = catalog.load("example_dataset_1")
df_2 = catalog.load("example_dataset_2")
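The starter’s example notebook shows the real workflow; as a rough illustration of the idea behind the catalog, here is a minimal stand-in written with only the standard library. The `SimpleCatalog` class is invented for this sketch and is not part of Kedro’s API:

```python
import csv
import tempfile
from pathlib import Path

# Minimal stand-in for the idea behind Kedro's DataCatalog:
# a mapping from dataset names to load/save configuration, so
# notebook code refers to data by name rather than by file path.
# Illustration only -- not Kedro's implementation.
class SimpleCatalog:
    def __init__(self, config):
        self._config = config  # e.g. {"example_dataset_1": {"filepath": "..."}}

    def load(self, name):
        path = Path(self._config[name]["filepath"])
        with path.open(newline="") as f:
            return list(csv.DictReader(f))

    def save(self, name, rows):
        path = Path(self._config[name]["filepath"])
        with path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)

# Demo: save a small CSV through the catalog, then load it back by name.
tmp = Path(tempfile.mkdtemp())
catalog = SimpleCatalog({"example_dataset_1": {"filepath": str(tmp / "iris.csv")}})
catalog.save("example_dataset_1", [{"sepal_length": "5.1", "species": "setosa"}])
df = catalog.load("example_dataset_1")
print(df[0]["species"])  # setosa
```

The point is the indirection: the notebook only ever mentions `"example_dataset_1"`, and the file path lives in configuration, which is what makes the notebook shareable.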


Create a new project using the standalone-datacatalog starter:

$ kedro new --starter=standalone-datacatalog


The starter comprises a minimal setup to use the traditional Iris dataset with Kedro’s DataCatalog.

The starter contains:

  • A conf/ directory, which contains an example DataCatalog configuration (catalog.yml)

  • A data/ directory, which contains an example dataset identical to the one used by the pandas-iris starter

  • An example notebook showing how to instantiate the DataCatalog and interact with the example dataset

  • A blank README.md, which points to this page of documentation

Create a full Kedro project

When you later wish to build a full pipeline, you can use the same configuration, with the following steps:

1. Create a new empty Kedro project in a new directory

Let’s assume that the new project is created at /path/to/your/project:

kedro new

2. Copy the conf/ and data/ directories from your standalone-datacatalog starter project over to your new project

cp -fR {conf,data} /path/to/your/project
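If you prefer to script this step, the same copy can be done portably from Python with shutil. This is a self-contained sketch: the temporary source and destination directories stand in for your standalone-datacatalog project and the new project, and the catalog contents are just an example.

```python
import shutil
import tempfile
from pathlib import Path

# Stand-in directories so the sketch is runnable as-is; in practice,
# src is your standalone-datacatalog project and dst the new Kedro project.
src = Path(tempfile.mkdtemp())
dst = Path(tempfile.mkdtemp())
for d in ("conf", "data"):
    (src / d).mkdir()
(src / "conf" / "catalog.yml").write_text(
    "example_dataset_1:\n  type: pandas.CSVDataSet\n"
)

# Equivalent of `cp -fR {conf,data} /path/to/your/project`:
# copy both directories, merging into any that already exist.
for d in ("conf", "data"):
    shutil.copytree(src / d, dst / d, dirs_exist_ok=True)

print((dst / "conf" / "catalog.yml").exists())  # True
```

`dirs_exist_ok=True` (Python 3.8+) mirrors the `-f` behaviour of `cp`: it merges into directories that already exist in the new project instead of failing.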