kedro.io.DataCatalog

class kedro.io.DataCatalog(data_sets=None, feed_dict=None, transformers=None, default_transformers=None, journal=None, layers=None)[source]

Bases: object

DataCatalog stores instances of AbstractDataSet implementations to provide load and save capabilities from anywhere in the program. To use a DataCatalog, you need to instantiate it with a dictionary of data sets. Then it will act as a single point of reference for your calls, relaying load and save functions to the underlying data sets.
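The relay pattern described above can be sketched without Kedro. This is a minimal illustration, not Kedro's implementation: `InMemoryStore` and `MiniCatalog` are hypothetical stand-ins for a data set and for DataCatalog.

```python
from typing import Any, Dict


class InMemoryStore:
    """Hypothetical stand-in for a data set: it only knows how to load and save."""

    def __init__(self, data: Any = None) -> None:
        self._data = data

    def load(self) -> Any:
        return self._data

    def save(self, data: Any) -> None:
        self._data = data


class MiniCatalog:
    """Hypothetical stand-in for a catalog: relays calls to named stores."""

    def __init__(self, data_sets: Dict[str, InMemoryStore]) -> None:
        self._data_sets = dict(data_sets)

    def load(self, name: str) -> Any:
        # Look up the store by name and delegate the load to it.
        return self._data_sets[name].load()

    def save(self, name: str, data: Any) -> None:
        # Look up the store by name and delegate the save to it.
        self._data_sets[name].save(data)


catalog = MiniCatalog({"cars": InMemoryStore([1, 2, 3]), "boats": InMemoryStore()})
catalog.save("boats", catalog.load("cars"))
boats = catalog.load("boats")
```

Callers only ever deal with names; which object actually stores the data is decided once, when the catalog is built.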

Methods

DataCatalog.__init__([data_sets, feed_dict, …]) DataCatalog stores instances of AbstractDataSet implementations to provide load and save capabilities from anywhere in the program.
DataCatalog.add(data_set_name, data_set[, …]) Adds a new AbstractDataSet object to the DataCatalog.
DataCatalog.add_all(data_sets[, replace]) Adds a group of new data sets to the DataCatalog.
DataCatalog.add_feed_dict(feed_dict[, replace]) Adds instances of MemoryDataSet, containing the data provided through feed_dict.
DataCatalog.add_transformer(transformer[, …]) Add a DataSet Transformer to the DataCatalog.
DataCatalog.confirm(name) Confirm a dataset by its name.
DataCatalog.exists(name) Checks whether registered data set exists by calling its exists() method.
DataCatalog.from_config(catalog[, …]) Create a DataCatalog instance from configuration.
DataCatalog.list([regex_search]) List of all DataSet names registered in the catalog.
DataCatalog.load(name[, version]) Loads a registered data set.
DataCatalog.release(name) Release any cached data associated with a data set.
DataCatalog.save(name, data) Save data to a registered data set.
DataCatalog.shallow_copy() Returns a shallow copy of the current object.
__init__(data_sets=None, feed_dict=None, transformers=None, default_transformers=None, journal=None, layers=None)[source]

DataCatalog stores instances of AbstractDataSet implementations to provide load and save capabilities from anywhere in the program. To use a DataCatalog, you need to instantiate it with a dictionary of data sets. Then it will act as a single point of reference for your calls, relaying load and save functions to the underlying data sets.

Parameters:
  • data_sets (Optional[Dict[str, AbstractDataSet]]) – A dictionary of data set names and data set instances.
  • feed_dict (Optional[Dict[str, Any]]) – A feed dict with data to be added in memory.
  • transformers (Optional[Dict[str, List[AbstractTransformer]]]) – A dictionary of lists of transformers to be applied to the data sets.
  • default_transformers (Optional[List[AbstractTransformer]]) – A list of transformers to be applied to any new data sets.
  • journal (Optional[Journal]) – Instance of Journal.
  • layers (Optional[Dict[str, Set[str]]]) – A dictionary of data set layers. It maps a layer name to a set of data set names, according to the data engineering convention. For more details, see https://kedro.readthedocs.io/en/stable/06_resources/01_faq.html#what-is-data-engineering-convention
Raises:

DataSetNotFoundError – When transformers are passed for a non-existent data set.

Example:

from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import CSVDataSet

cars = CSVDataSet(filepath="cars.csv",
                  load_args=None,
                  save_args={"index": False})
io = DataCatalog(data_sets={'cars': cars})
Return type:None
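The `layers` argument maps each layer name to the set of data set names in that layer. A minimal sketch of that shape, with made-up layer and data set names, and a reverse lookup built from it:

```python
from typing import Dict, Set

# Layer names and data set names here are made up for illustration.
layers: Dict[str, Set[str]] = {
    "raw": {"cars", "boats"},
    "primary": {"cars_clean"},
}

# Invert the mapping to look up the layer of a given data set.
layer_of = {name: layer for layer, names in layers.items() for name in names}
```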
add(data_set_name, data_set, replace=False)[source]

Adds a new AbstractDataSet object to the DataCatalog.

Parameters:
  • data_set_name (str) – A unique data set name which has not been registered yet.
  • data_set (AbstractDataSet) – A data set object to be associated with the given data set name.
  • replace (bool) – Specifies whether replacing an existing DataSet with the same name is allowed.
Raises:

DataSetAlreadyExistsError – When a data set with the same name has already been registered.

Example:

from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import CSVDataSet

io = DataCatalog(data_sets={
                  'cars': CSVDataSet(filepath="cars.csv")
                 })

io.add("boats", CSVDataSet(filepath="boats.csv"))
Return type:None
add_all(data_sets, replace=False)[source]

Adds a group of new data sets to the DataCatalog.

Parameters:
  • data_sets (Dict[str, AbstractDataSet]) – A dictionary of DataSet names and data set instances.
  • replace (bool) – Specifies whether replacing an existing DataSet with the same name is allowed.
Raises:

DataSetAlreadyExistsError – When a data set with the same name has already been registered.

Example:

from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import CSVDataSet, ParquetDataSet

io = DataCatalog(data_sets={
                  "cars": CSVDataSet(filepath="cars.csv")
                 })
additional = {
    "planes": ParquetDataSet("planes.parq"),
    "boats": CSVDataSet(filepath="boats.csv")
}

io.add_all(additional)

assert io.list() == ["cars", "planes", "boats"]
Return type:None
add_feed_dict(feed_dict, replace=False)[source]

Adds instances of MemoryDataSet, containing the data provided through feed_dict.

Parameters:
  • feed_dict (Dict[str, Any]) – A feed dict with data to be added in memory.
  • replace (bool) – Specifies whether replacing an existing DataSet with the same name is allowed.

Example:

import pandas as pd

from kedro.io import DataCatalog

df = pd.DataFrame({'col1': [1, 2],
                   'col2': [4, 5],
                   'col3': [5, 6]})

io = DataCatalog()
io.add_feed_dict({
    'data': df
}, replace=True)

assert io.load("data").equals(df)
Return type:None
add_transformer(transformer, data_set_names=None)[source]

Add a DataSet Transformer to the DataCatalog. Transformers can modify the way Data Sets are loaded and saved.
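As a rough illustration of what modifying the way data is loaded and saved can mean, the sketch below wraps load and save callables with timing logic. It does not use Kedro: `TimingTransformer` and `storage` are hypothetical names, and the hook signatures are an assumption about how such a transformer might be shaped.

```python
import time
from typing import Any, Callable, Dict, List


class TimingTransformer:
    """Hypothetical transformer that records how long each load and save takes."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def load(self, data_set_name: str, load: Callable[[], Any]) -> Any:
        start = time.perf_counter()
        data = load()  # delegate to the wrapped load
        self.log.append(f"load {data_set_name}: {time.perf_counter() - start:.6f}s")
        return data

    def save(self, data_set_name: str, save: Callable[[Any], None], data: Any) -> None:
        start = time.perf_counter()
        save(data)  # delegate to the wrapped save
        self.log.append(f"save {data_set_name}: {time.perf_counter() - start:.6f}s")


# A plain dict plays the role of the underlying storage.
storage: Dict[str, Any] = {}
timer = TimingTransformer()

timer.save("cars", lambda data: storage.update(cars=data), [1, 2, 3])
loaded = timer.load("cars", lambda: storage["cars"])
```

The wrapped callables still do the actual work; the transformer only adds behaviour around them.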

Parameters:
  • transformer (AbstractTransformer) – The transformer instance to add.
  • data_set_names (Union[str, Iterable[str], None]) – The Data Sets to add the transformer to. Or None to add the transformer to all Data Sets.
Raises:
confirm(name)[source]

Confirm a dataset by its name.

Parameters:name (str) – Name of the dataset.
Raises:DataSetError – When the dataset does not have a confirm method.
Return type:None
exists(name)[source]

Checks whether registered data set exists by calling its exists() method. Raises a warning and returns False if exists() is not implemented.

Parameters:name (str) – A data set to be checked.
Return type:bool
Returns:Whether the data set output exists.
Raises:DataSetNotFoundError – When a data set with the given name has not yet been registered.
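The fallback described above (warn and return False when a data set cannot report its own existence) can be sketched in plain Python. This is an illustration, not Kedro's code: `safe_exists`, `WithExists` and `WithoutExists` are hypothetical.

```python
import warnings


class WithExists:
    """Hypothetical data set that can report whether its output exists."""

    def exists(self) -> bool:
        return True


class WithoutExists:
    """Hypothetical data set with no usable exists() check."""

    def exists(self) -> bool:
        raise NotImplementedError


def safe_exists(data_set) -> bool:
    """Return data_set.exists(), or warn and return False if it is not implemented."""
    try:
        return data_set.exists()
    except NotImplementedError:
        warnings.warn(f"exists() not implemented for {type(data_set).__name__}")
        return False


# Capture warnings so the fallback path can be observed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    found = safe_exists(WithExists())
    missing = safe_exists(WithoutExists())
```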
classmethod from_config(catalog, credentials=None, load_versions=None, save_version=None, journal=None)[source]

Create a DataCatalog instance from configuration. This is a factory method used to provide developers with a way to instantiate DataCatalog with configuration parsed from configuration files.

Parameters:
  • catalog (Optional[Dict[str, Dict[str, Any]]]) – A dictionary whose keys are the data set names and whose values are dictionaries with the constructor arguments for classes implementing AbstractDataSet. The data set class to be loaded is specified with the key type and its fully qualified class name. All kedro.io data sets can be specified by their class name only, i.e. their module name can be omitted.
  • credentials (Optional[Dict[str, Dict[str, Any]]]) – A dictionary containing credentials for different data sets. Use the credentials key in an AbstractDataSet to refer to the appropriate credentials as shown in the example below.
  • load_versions (Optional[Dict[str, str]]) – A mapping between dataset names and versions to load. Has no effect on data sets without enabled versioning.
  • save_version (Optional[str]) – Version string to be used for save operations by all data sets with enabled versioning. It must: a) be a case-insensitive string that conforms with operating system filename limitations, b) always return the latest version when sorted in lexicographical order.
  • journal (Optional[Journal]) – Instance of Journal.
Return type:

DataCatalog

Returns:

An instantiated DataCatalog containing all specified data sets, created and ready to use.

Raises:

DataSetError – When the method fails to create any of the data sets from their config.

Example:

from kedro.io import DataCatalog

config = {
    "cars": {
        "type": "pandas.CSVDataSet",
        "filepath": "cars.csv",
        "save_args": {
            "index": False
        }
    },
    "boats": {
        "type": "pandas.CSVDataSet",
        "filepath": "s3://aws-bucket-name/boats.csv",
        "credentials": "boats_credentials",
        "save_args": {
            "index": False
        }
    }
}

credentials = {
    "boats_credentials": {
        "client_kwargs": {
            "aws_access_key_id": "<your key id>",
            "aws_secret_access_key": "<your secret>"
        }
     }
}

catalog = DataCatalog.from_config(config, credentials)

df = catalog.load("cars")
catalog.save("boats", df)
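Requirement (b) on save_version is easiest to meet with fixed-width timestamps written most-significant-field first, because such strings sort lexicographically in chronological order. A minimal sketch follows; the format string is an assumption chosen for illustration, and it avoids ':' so the result stays usable as a filename.

```python
from datetime import datetime, timedelta, timezone

# Assumed format: fixed-width fields, most significant first, no ':' characters.
VERSION_FORMAT = "%Y-%m-%dT%H.%M.%S"

start = datetime(2020, 1, 31, 23, 59, 59, tzinfo=timezone.utc)
moments = [start + timedelta(seconds=offset) for offset in (0, 1, 86400 * 30)]
versions = [moment.strftime(VERSION_FORMAT) for moment in moments]

# Plain string sorting matches chronological order, so max() is the latest version.
latest = max(versions)
```

A free-form label such as "v9" followed by "v10" would break this property, since "v10" sorts before "v9" as a string.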
list(regex_search=None)[source]

List of all DataSet names registered in the catalog. This can be filtered by providing an optional regular expression which will only return matching keys.

Parameters:regex_search (Optional[str]) – An optional regular expression which can be provided to limit the data sets returned by a particular pattern.
Return type:List[str]
Returns:A list of DataSet names available which match the regex_search criteria (if provided). All data set names are returned by default.
Raises:SyntaxError – When an invalid regex filter is provided.

Example:

from kedro.io import DataCatalog

io = DataCatalog()
# get data sets where the substring 'raw' is present
raw_data = io.list(regex_search='raw')
# get data sets which start with 'prm' or 'feat'
feat_eng_data = io.list(regex_search='^(prm|feat)')
# get data sets which end with 'time_series'
models = io.list(regex_search='.+time_series$')
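To make the three patterns above concrete, the sketch below applies them to a made-up list of names with `re.search`. It assumes substring-style (search) matching, as the comments in the example suggest, and does not call Kedro; `filter_names` is a hypothetical helper.

```python
import re
from typing import List

# Made-up data set names for illustration.
names = ["raw_cars", "prm_cars", "feat_speed", "sales_time_series", "model_input"]


def filter_names(names: List[str], regex_search: str) -> List[str]:
    """Keep the names in which the pattern is found anywhere (re.search)."""
    pattern = re.compile(regex_search)
    return [name for name in names if pattern.search(name)]


raw_data = filter_names(names, "raw")
feat_eng_data = filter_names(names, "^(prm|feat)")
models = filter_names(names, ".+time_series$")
```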
load(name, version=None)[source]

Loads a registered data set.

Parameters:
  • name (str) – A data set to be loaded.
  • version (Optional[str]) – Optional argument for concrete data version to be loaded. Works only with versioned datasets.
Return type:

Any

Returns:

The loaded data as configured.

Raises:

DataSetNotFoundError – When a data set with the given name has not yet been registered.

Example:

from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import CSVDataSet

cars = CSVDataSet(filepath="cars.csv",
                  load_args=None,
                  save_args={"index": False})
io = DataCatalog(data_sets={'cars': cars})

df = io.load("cars")
release(name)[source]

Release any cached data associated with a data set.

Parameters:name (str) – A data set whose cached data is to be released.
Raises:DataSetNotFoundError – When a data set with the given name has not yet been registered.
save(name, data)[source]

Save data to a registered data set.

Parameters:
  • name (str) – A data set to be saved to.
  • data (Any) – A data object to be saved as configured in the registered data set.
Raises:

DataSetNotFoundError – When a data set with the given name has not yet been registered.

Example:

import pandas as pd

from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import CSVDataSet

cars = CSVDataSet(filepath="cars.csv",
                  load_args=None,
                  save_args={"index": False})
io = DataCatalog(data_sets={'cars': cars})

df = pd.DataFrame({'col1': [1, 2],
                   'col2': [4, 5],
                   'col3': [5, 6]})
io.save("cars", df)
Return type:None
shallow_copy()[source]

Returns a shallow copy of the current object.

Return type:DataCatalog
Returns:Copy of the current object.
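What "shallow" implies in general can be shown with Python's `copy` module: the copy is a new object, but the objects it holds are shared with the original. The sketch uses a hypothetical `Registry` class and makes no claim about how DataCatalog implements shallow_copy internally.

```python
import copy


class Registry:
    """Hypothetical container holding named objects."""

    def __init__(self, items: dict) -> None:
        self.items = items


original = Registry({"cars": [1, 2, 3]})
duplicate = copy.copy(original)

# The copy is a distinct object...
distinct = duplicate is not original
# ...but a shallow copy shares the same underlying attribute objects.
shared = duplicate.items is original.items
```

Because the contents are shared, a change made through one object to a shared item is visible through the other.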