kedro.io.DataCatalogWithDefault

class kedro.io.DataCatalogWithDefault(data_sets=None, default=None, remember=False)[source]

Bases: kedro.io.data_catalog.DataCatalog

A DataCatalog with a default DataSet implementation for any data set which is not registered in the catalog.

Methods

DataCatalogWithDefault.__init__([data_sets, …]) A DataCatalog with a default DataSet implementation for any data set which is not registered in the catalog.
DataCatalogWithDefault.add(data_set_name, …) Adds a new AbstractDataSet object to the DataCatalog.
DataCatalogWithDefault.add_all(data_sets[, …]) Adds a group of new data sets to the DataCatalog.
DataCatalogWithDefault.add_feed_dict(feed_dict) Adds instances of MemoryDataSet, containing the data provided through feed_dict.
DataCatalogWithDefault.add_transformer(…) Add a DataSet Transformer to the DataCatalog.
DataCatalogWithDefault.confirm(name) Confirm a dataset by its name.
DataCatalogWithDefault.exists(name) Checks whether registered data set exists by calling its exists() method.
DataCatalogWithDefault.from_config(catalog) To create a DataCatalogWithDefault from configuration, please use DataCatalogWithDefault.from_data_catalog together with DataCatalog.from_config.
DataCatalogWithDefault.from_data_catalog(…) Convenience factory method to create a DataCatalogWithDefault from a DataCatalog.
DataCatalogWithDefault.list([regex_search]) List of all DataSet names registered in the catalog.
DataCatalogWithDefault.load(name[, version]) Loads a registered data set.
DataCatalogWithDefault.release(name) Release any cached data associated with a data set.
DataCatalogWithDefault.save(name, data) Save data to a registered data set.
DataCatalogWithDefault.shallow_copy() Returns a shallow copy of the current object.
__init__(data_sets=None, default=None, remember=False)[source]

A DataCatalog with a default DataSet implementation for any data set which is not registered in the catalog.

Parameters:
  • data_sets (Optional[Dict[str, AbstractDataSet]]) – A dictionary of data set names and data set instances.
  • default (Optional[Callable[[str], AbstractDataSet]]) – A callable which accepts a single argument of type string, the key of the data set, and returns an AbstractDataSet. load and save calls on data sets which are not registered to the catalog will be delegated to this AbstractDataSet.
  • remember (bool) – If True, then store in the catalog any AbstractDataSets provided by the default callable argument. Useful when one wants to transition from a DataCatalogWithDefault to a DataCatalog: call DataCatalogWithDefault.to_yaml after all required data sets have been saved or loaded, and use the generated YAML file with a new DataCatalog.
Raises:

TypeError – If default is not a callable.

Example:

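from kedro.extras.datasets.pandas import CSVDataSet
from kedro.io import DataCatalogWithDefault

def default_data_set(name):
    # Build a CSVDataSet for any name that is not registered in the catalog
    return CSVDataSet(filepath='data/01_raw/' + name)

io = DataCatalogWithDefault(data_sets={},
                            default=default_data_set)

# load the file in data/01_raw/cars.csv
df = io.load("cars.csv")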
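The interplay between default and remember described above can be pictured with a small stand-alone sketch. It does not use Kedro and is not Kedro's implementation: ToyDataSet and ToyCatalogWithDefault are hypothetical names introduced only for illustration, and the in-memory storage is an assumption made to keep the example self-contained.

```python
from typing import Any, Callable, Dict, List, Optional


class ToyDataSet:
    """Hypothetical in-memory stand-in for an AbstractDataSet."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._data: Any = None

    def load(self) -> Any:
        return self._data

    def save(self, data: Any) -> None:
        self._data = data


class ToyCatalogWithDefault:
    """Illustrative only; this is not Kedro's implementation."""

    def __init__(
        self,
        data_sets: Optional[Dict[str, ToyDataSet]] = None,
        default: Optional[Callable[[str], ToyDataSet]] = None,
        remember: bool = False,
    ) -> None:
        if not callable(default):
            # Mirrors the documented TypeError for a non-callable default.
            raise TypeError("default must be a callable")
        self._data_sets = dict(data_sets or {})
        self._default = default
        self._remember = remember

    def _get(self, name: str) -> ToyDataSet:
        if name in self._data_sets:
            return self._data_sets[name]
        data_set = self._default(name)  # build a data set from its name
        if self._remember:
            self._data_sets[name] = data_set  # register it for later calls
        return data_set

    def save(self, name: str, data: Any) -> None:
        self._get(name).save(data)

    def load(self, name: str) -> Any:
        return self._get(name).load()

    def list(self) -> List[str]:
        return list(self._data_sets)


forgetful = ToyCatalogWithDefault(default=ToyDataSet, remember=False)
forgetful.save("cars", [1, 2, 3])
forgetful_names = forgetful.list()        # nothing was registered
forgetful_loaded = forgetful.load("cars")  # a fresh, empty toy data set

remembering = ToyCatalogWithDefault(default=ToyDataSet, remember=True)
remembering.save("cars", [1, 2, 3])
remembering_names = remembering.list()        # "cars" is now registered
remembering_loaded = remembering.load("cars")  # same instance, data kept
```

With a file-backed data set such as CSVDataSet, a load after a save would still find the file on disk even with remember=False; the toy data set loses its data only because it lives in memory.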
add(data_set_name, data_set, replace=False)

Adds a new AbstractDataSet object to the DataCatalog.

Parameters:
  • data_set_name (str) – A unique data set name which has not been registered yet.
  • data_set (AbstractDataSet) – A data set object to be associated with the given data set name.
  • replace (bool) – Specifies whether replacing an existing DataSet with the same name is allowed.
Raises:

DataSetAlreadyExistsError – When a data set with the same name has already been registered.

Example:

from kedro.extras.datasets.pandas import CSVDataSet
from kedro.io import DataCatalog

io = DataCatalog(data_sets={
                  'cars': CSVDataSet(filepath="cars.csv")
                 })

# register a second data set under a name that is not yet in use
io.add("boats", CSVDataSet(filepath="boats.csv"))
Return type:None
add_all(data_sets, replace=False)

Adds a group of new data sets to the DataCatalog.

Parameters:
  • data_sets (Dict[str, AbstractDataSet]) – A dictionary of DataSet names and data set instances.
  • replace (bool) – Specifies whether replacing an existing DataSet with the same name is allowed.
Raises:

DataSetAlreadyExistsError – When a data set with the same name has already been registered.

Example:

from kedro.extras.datasets.pandas import CSVDataSet, ParquetDataSet
from kedro.io import DataCatalog

io = DataCatalog(data_sets={
                  "cars": CSVDataSet(filepath="cars.csv")
                 })
additional = {
    "planes": ParquetDataSet(filepath="planes.parq"),
    "boats": CSVDataSet(filepath="boats.csv")
}

# register both data sets in one call
io.add_all(additional)

assert io.list() == ["cars", "planes", "boats"]
Return type:None
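The effect of the replace flag on add and add_all can be sketched without Kedro. The registry below is a hypothetical stand-in, not Kedro's implementation: ToyRegistry and AlreadyExistsError are names invented for this illustration, and plain strings take the place of data set instances.

```python
from typing import Any, Dict


class AlreadyExistsError(Exception):
    """Hypothetical stand-in for DataSetAlreadyExistsError."""


class ToyRegistry:
    """Illustrative only; this is not Kedro's implementation."""

    def __init__(self) -> None:
        self._data_sets: Dict[str, Any] = {}

    def add(self, name: str, data_set: Any, replace: bool = False) -> None:
        # A name that is already taken is rejected unless replace is True.
        if name in self._data_sets and not replace:
            raise AlreadyExistsError(f"'{name}' has already been registered")
        self._data_sets[name] = data_set

    def add_all(self, data_sets: Dict[str, Any], replace: bool = False) -> None:
        # Apply the same rule to every entry of the dictionary.
        for name, data_set in data_sets.items():
            self.add(name, data_set, replace)


registry = ToyRegistry()
registry.add("cars", "first")

try:
    registry.add("cars", "second")  # same name, replace left at False
    rejected = False
except AlreadyExistsError:
    rejected = True

# With replace=True the existing entry is overwritten and new ones are added.
registry.add_all({"cars": "third", "boats": "new"}, replace=True)
final_state = dict(registry._data_sets)
```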
add_feed_dict(feed_dict, replace=False)

Adds instances of MemoryDataSet, containing the data provided through feed_dict.

Parameters:
  • feed_dict (Dict[str, Any]) – A feed dict with data to be added in memory.
  • replace (bool) – Specifies whether replacing an existing DataSet with the same name is allowed.

Example:

import pandas as pd
from kedro.io import DataCatalog

df = pd.DataFrame({'col1': [1, 2],
                   'col2': [4, 5],
                   'col3': [5, 6]})

io = DataCatalog()
# wrap the DataFrame in a MemoryDataSet registered as 'data'
io.add_feed_dict({
    'data': df
}, replace=True)

assert io.load("data").equals(df)
Return type:None
add_transformer(transformer, data_set_names=None)

Add a DataSet Transformer to the DataCatalog. Transformers can modify the way Data Sets are loaded and saved.

Parameters:
  • transformer (AbstractTransformer) – The transformer instance to add.
  • data_set_names (Union[str, Iterable[str], None]) – The Data Sets to add the transformer to. Or None to add the transformer to all Data Sets.
Raises:
confirm(name)

Confirm a dataset by its name.

Parameters:name (str) – Name of the dataset.
Raises:DataSetError – When the dataset does not have a confirm method.
Return type:None
exists(name)

Checks whether registered data set exists by calling its exists() method. Raises a warning and returns False if exists() is not implemented.

Parameters:name (str) – A data set to be checked.
Return type:bool
Returns:Whether the data set output exists.
Raises:DataSetNotFoundError – When a data set with the given name has not yet been registered.
classmethod from_config(catalog, credentials=None, load_versions=None, save_version=None, journal=None)[source]

To create a DataCatalogWithDefault from configuration, please use:

DataCatalogWithDefault.from_data_catalog(
    DataCatalog.from_config(catalog, credentials),
    default)  # default: the callable required by from_data_catalog
Parameters:
  • catalog (Optional[Dict[str, Dict[str, Any]]]) – See DataCatalog.from_config
  • credentials (Optional[Dict[str, Dict[str, Any]]]) – See DataCatalog.from_config
  • load_versions (Optional[Dict[str, str]]) – See DataCatalog.from_config
  • save_version (Optional[str]) – See DataCatalog.from_config
  • journal (Optional[Journal]) – See DataCatalog.from_config
Raises:
classmethod from_data_catalog(data_catalog, default)[source]

Convenience factory method to create a DataCatalogWithDefault from a DataCatalog.

A DataCatalog with a default DataSet implementation for any data set which is not registered in the catalog.

Parameters:
  • data_catalog (DataCatalog) – The DataCatalog to convert to a DataCatalogWithDefault.
  • default (Callable[[str], AbstractDataSet]) – A callable which accepts a single argument of type string, the key of the data set, and returns an AbstractDataSet. load and save calls on data sets which are not registered to the catalog will be delegated to this AbstractDataSet.
Return type:

DataCatalogWithDefault

Returns:

A new DataCatalogWithDefault which contains all the AbstractDataSets from the provided data_catalog.

list(regex_search=None)

List of all DataSet names registered in the catalog. This can be filtered by providing an optional regular expression which will only return matching keys.

Parameters:regex_search (Optional[str]) – An optional regular expression which can be provided to limit the data sets returned by a particular pattern.
Return type:List[str]
Returns:A list of DataSet names available which match the regex_search criteria (if provided). All data set names are returned by default.
Raises:SyntaxError – When an invalid regex filter is provided.

Example:

from kedro.io import DataCatalog

io = DataCatalog()
# get data sets where the substring 'raw' is present
raw_data = io.list(regex_search='raw')
# get data sets which start with 'prm' or 'feat'
feat_eng_data = io.list(regex_search='^(prm|feat)')
# get data sets which end with 'time_series'
models = io.list(regex_search='.+time_series$')
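To see which names the three patterns above would select, the filtering can be reproduced with the standard re module. This is a sketch under assumptions, not Kedro's code: filter_names is a hypothetical helper, the data set names are made up, search-style matching is inferred from the "substring" comment in the example, and case handling is left out.

```python
import re
from typing import List, Optional


def filter_names(names: List[str], regex_search: Optional[str] = None) -> List[str]:
    """Hypothetical helper: keep the names that the pattern matches anywhere."""
    if regex_search is None:
        return list(names)  # no filter: every name is returned
    try:
        pattern = re.compile(regex_search)
    except re.error as exc:
        # Mirrors the documented SyntaxError for an invalid regex filter.
        raise SyntaxError(f"Invalid regular expression: {regex_search!r}") from exc
    return [name for name in names if pattern.search(name)]


# Made-up data set names, used only for this illustration.
names = ["raw_cars", "prm_cars", "feat_speed", "sales_time_series", "time_series"]

raw_data = filter_names(names, "raw")              # ['raw_cars']
feat_eng_data = filter_names(names, "^(prm|feat)")  # ['prm_cars', 'feat_speed']
models = filter_names(names, ".+time_series$")      # ['sales_time_series']
```

Note that '.+time_series$' requires at least one character before 'time_series', so a data set named exactly 'time_series' is not selected by that pattern.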
load(name, version=None)[source]

Loads a registered data set.

Parameters:
  • name (str) – A data set to be loaded.
  • version (Optional[str]) – Optional version to be loaded.
Return type:

Any

Returns:

The loaded data as configured.

Raises:

DataSetNotFoundError – When a data set with the given name has not yet been registered.

release(name)

Release any cached data associated with a data set.

Parameters:name (str) – A data set to be released.
Raises:DataSetNotFoundError – When a data set with the given name has not yet been registered.
save(name, data)[source]

Save data to a registered data set.

Parameters:
  • name (str) – A data set to be saved to.
  • data (Any) – A data object to be saved as configured in the registered data set.
Raises:

DataSetNotFoundError – When a data set with the given name has not yet been registered.

shallow_copy()[source]

Returns a shallow copy of the current object.

Return type:DataCatalogWithDefault
Returns:Copy of the current object.