kedro.io.DataCatalogWithDefault
-
class kedro.io.DataCatalogWithDefault(data_sets=None, default=None, remember=False)
A DataCatalog with a default DataSet implementation for any data set which is not registered in the catalog.

Methods
add(data_set_name, data_set[, replace]) – Adds a new AbstractDataSet object to the DataCatalog.
add_all(data_sets[, replace]) – Adds a group of new data sets to the DataCatalog.
add_feed_dict(feed_dict[, replace]) – Adds instances of MemoryDataSet, containing the data provided through feed_dict.
add_transformer(transformer[, data_set_names]) – Adds a DataSet Transformer to the DataCatalog.
confirm(name) – Confirms a dataset by its name.
exists(name) – Checks whether registered data set exists by calling its exists() method.
from_config(catalog[, credentials, …]) – To create a DataCatalogWithDefault from configuration, please use DataCatalogWithDefault.from_data_catalog.
from_data_catalog(data_catalog, default) – Convenience factory method to create a DataCatalogWithDefault from a DataCatalog.
list([regex_search]) – List of all DataSet names registered in the catalog.
load(name[, version]) – Loads a registered data set.
release(name) – Releases any cached data associated with a data set.
save(name, data) – Saves data to a registered data set.
shallow_copy() – Returns a shallow copy of the current object.
-
__init__(data_sets=None, default=None, remember=False)
DataCatalogWithDefault is deprecated and will be removed in Kedro 0.18.0. A DataCatalog with a default DataSet implementation for any data set which is not registered in the catalog.
- Parameters
data_sets (Optional[Dict[str, AbstractDataSet]]) – A dictionary of data set names and data set instances.
default (Optional[Callable[[str], AbstractDataSet]]) – A callable which accepts a single argument of type string, the key of the data set, and returns an AbstractDataSet. load and save calls on data sets which are not registered to the catalog will be delegated to this AbstractDataSet.
remember (bool) – If True, then store in the catalog any AbstractDataSets provided by the default callable argument. Useful when one wants to transition from a DataCatalogWithDefault to a DataCatalog: just call DataCatalogWithDefault.to_yaml after all required data sets have been saved or loaded, and use the generated YAML file with a new DataCatalog.
- Raises
TypeError – If default is not a callable.
Example:
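    from kedro.extras.datasets.pandas import CSVDataSet
    from kedro.io import DataCatalogWithDefault

    def default_data_set(name):
        return CSVDataSet(filepath='data/01_raw/' + name)

    # the default callable is only accepted by DataCatalogWithDefault
    io = DataCatalogWithDefault(data_sets={}, default=default_data_set)

    # load the file in data/01_raw/cars.csv
    df = io.load("cars.csv")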
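The interaction between default and remember can be illustrated without Kedro installed. The following is a minimal pure-Python sketch, not Kedro's implementation: FallbackCatalog and InMemoryStore are hypothetical names introduced here, and the lookup logic is an assumption based only on the parameter descriptions above.

```python
from typing import Any, Callable, Dict, List, Optional


class InMemoryStore:
    """Hypothetical stand-in for an AbstractDataSet: keeps data in memory."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._data: Any = None

    def load(self) -> Any:
        return self._data

    def save(self, data: Any) -> None:
        self._data = data


class FallbackCatalog:
    """Toy catalog that delegates unregistered names to a default factory."""

    def __init__(
        self,
        data_sets: Optional[Dict[str, InMemoryStore]] = None,
        default: Optional[Callable[[str], InMemoryStore]] = None,
        remember: bool = False,
    ) -> None:
        if default is not None and not callable(default):
            raise TypeError("default must be a callable")
        self._data_sets = dict(data_sets or {})
        self._default = default
        self._remember = remember

    def _resolve(self, name: str) -> InMemoryStore:
        # Registered data sets always win over the default factory.
        if name in self._data_sets:
            return self._data_sets[name]
        if self._default is None:
            raise KeyError(f"data set {name!r} is not registered")
        data_set = self._default(name)
        if self._remember:
            # Keep the generated data set so later calls reuse it.
            self._data_sets[name] = data_set
        return data_set

    def save(self, name: str, data: Any) -> None:
        self._resolve(name).save(data)

    def load(self, name: str) -> Any:
        return self._resolve(name).load()

    def list(self) -> List[str]:
        return list(self._data_sets)


# Without remember, every call builds a fresh data set from the factory.
forgetful = FallbackCatalog(default=InMemoryStore)
forgetful.save("cars", [1, 2, 3])

# With remember, the generated data set is registered on first use.
remembering = FallbackCatalog(default=InMemoryStore, remember=True)
remembering.save("cars", [1, 2, 3])
```

In this sketch the remembering catalog ends up with "cars" registered, which mirrors the transition path described for the remember parameter: once every required data set has been touched, the catalog holds an explicit entry for each of them.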
-
add(data_set_name, data_set, replace=False)
Adds a new AbstractDataSet object to the DataCatalog.
- Parameters
data_set_name (str) – A unique data set name which has not been registered yet.
data_set (AbstractDataSet) – A data set object to be associated with the given data set name.
replace (bool) – Specifies whether replacing an existing DataSet with the same name is allowed.
- Raises
DataSetAlreadyExistsError – When a data set with the same name has already been registered.
Example:
    from kedro.extras.datasets.pandas import CSVDataSet
    from kedro.io import DataCatalog

    io = DataCatalog(data_sets={
        'cars': CSVDataSet(filepath="cars.csv")
    })

    io.add("boats", CSVDataSet(filepath="boats.csv"))
- Return type
None
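The effect of the replace flag can be shown with a small stand-alone sketch. ToyRegistry and NameAlreadyRegistered are hypothetical names used only for illustration (the latter stands in for DataSetAlreadyExistsError). The behaviour is inferred from the parameter and Raises descriptions above, not copied from Kedro's source.

```python
from typing import Any, Dict


class NameAlreadyRegistered(Exception):
    """Hypothetical stand-in for DataSetAlreadyExistsError."""


class ToyRegistry:
    """Toy name-to-object registry with an opt-in replace flag."""

    def __init__(self) -> None:
        self._entries: Dict[str, Any] = {}

    def add(self, name: str, entry: Any, replace: bool = False) -> None:
        # Refuse to overwrite silently; the caller must opt in with replace=True.
        if name in self._entries and not replace:
            raise NameAlreadyRegistered(f"{name!r} has already been registered")
        self._entries[name] = entry

    def get(self, name: str) -> Any:
        return self._entries[name]


registry = ToyRegistry()
registry.add("boats", "boats.csv")

try:
    registry.add("boats", "boats_v2.csv")
    duplicate_rejected = False
except NameAlreadyRegistered:
    duplicate_rejected = True

# Opting in with replace=True overwrites the earlier entry.
registry.add("boats", "boats_v2.csv", replace=True)
```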
-
add_all(data_sets, replace=False)
Adds a group of new data sets to the DataCatalog.
- Parameters
data_sets (Dict[str, AbstractDataSet]) – A dictionary of DataSet names and data set instances.
replace (bool) – Specifies whether replacing an existing DataSet with the same name is allowed.
- Raises
DataSetAlreadyExistsError – When a data set with the same name has already been registered.
Example:
    from kedro.extras.datasets.pandas import CSVDataSet, ParquetDataSet
    from kedro.io import DataCatalog

    io = DataCatalog(data_sets={
        "cars": CSVDataSet(filepath="cars.csv")
    })
    additional = {
        "planes": ParquetDataSet("planes.parq"),
        "boats": CSVDataSet(filepath="boats.csv")
    }

    io.add_all(additional)

    assert io.list() == ["cars", "planes", "boats"]
- Return type
None
-
add_feed_dict(feed_dict, replace=False)
Adds instances of MemoryDataSet, containing the data provided through feed_dict.
- Parameters
feed_dict (Dict[str, Any]) – A feed dict with data to be added in memory.
replace (bool) – Specifies whether replacing an existing DataSet with the same name is allowed.
Example:
    import pandas as pd
    from kedro.io import DataCatalog

    df = pd.DataFrame({'col1': [1, 2],
                       'col2': [4, 5],
                       'col3': [5, 6]})

    io = DataCatalog()
    io.add_feed_dict({
        'data': df
    }, replace=True)

    assert io.load("data").equals(df)
- Return type
None
-
add_transformer(transformer, data_set_names=None)
Adds a DataSet Transformer to the DataCatalog. Transformers can modify the way Data Sets are loaded and saved.
- Parameters
transformer (AbstractTransformer) – The transformer instance to add.
data_set_names (Union[str, Iterable[str], None]) – The Data Sets to add the transformer to, or None to add the transformer to all Data Sets.
- Raises
DataSetNotFoundError – When a transformer is being added to a nonexistent data set.
TypeError – When transformer isn’t an instance of AbstractTransformer.
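To make "modify the way Data Sets are loaded and saved" concrete, here is a minimal pure-Python sketch of a wrapper that intercepts load and save calls. CallLogger is a hypothetical class, and the hook signatures (a data set name plus a callable that performs the real operation) are an assumption for illustration. This page does not document the AbstractTransformer interface.

```python
from typing import Any, Callable, Dict, List


class CallLogger:
    """Hypothetical transformer: records each load and save it sees."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def load(self, data_set_name: str, load: Callable[[], Any]) -> Any:
        # Record the call, then defer to the wrapped load operation.
        self.events.append(f"load:{data_set_name}")
        return load()

    def save(self, data_set_name: str, save: Callable[[Any], None], data: Any) -> None:
        # Record the call, then defer to the wrapped save operation.
        self.events.append(f"save:{data_set_name}")
        save(data)


# A plain dict plays the role of the underlying data set storage.
storage: Dict[str, Any] = {}
logger = CallLogger()

logger.save("cars", lambda data: storage.update(cars=data), [1, 2])
loaded = logger.load("cars", lambda: storage["cars"])
```

The wrapper never touches the storage itself. It only decides what happens before and after the wrapped callable runs, which is the general shape of behaviours such as timing, logging, or validating data.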
-
confirm(name)
Confirm a dataset by its name.
- Parameters
name (str) – Name of the dataset.
- Raises
DataSetError – When the dataset does not have a confirm method.
- Return type
None
-
exists(name)
Checks whether registered data set exists by calling its exists() method. Raises a warning and returns False if exists() is not implemented.
- Parameters
name (str) – A data set to be checked.
- Return type
bool
- Returns
Whether the data set output exists.
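The "warn and return False" fallback described for exists can be sketched as follows. This is an illustrative pattern only: safe_exists, AlwaysThere and NoExistenceCheck are hypothetical names, and signalling a missing implementation with NotImplementedError is an assumption rather than something stated on this page.

```python
import warnings


class AlwaysThere:
    """Hypothetical data set whose existence check is implemented."""

    def exists(self) -> bool:
        return True


class NoExistenceCheck:
    """Hypothetical data set that does not implement an existence check."""

    def exists(self) -> bool:
        raise NotImplementedError("exists() is not implemented")


def safe_exists(data_set) -> bool:
    """Return the data set's own answer, or warn and return False."""
    try:
        return bool(data_set.exists())
    except NotImplementedError:
        warnings.warn(
            f"exists() not implemented for {type(data_set).__name__}; "
            "assuming the output does not exist"
        )
        return False
```

Returning False rather than raising lets callers treat "cannot tell" the same way as "not there", at the cost of possibly redoing work that already exists.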
-
classmethod from_config(catalog, credentials=None, load_versions=None, save_version=None, journal=None)
To create a DataCatalogWithDefault from configuration, please use:
    DataCatalogWithDefault.from_data_catalog(
        DataCatalog.from_config(catalog, credentials))
- Parameters
catalog (Optional[Dict[str, Dict[str, Any]]]) – See DataCatalog.from_config
credentials (Optional[Dict[str, Dict[str, Any]]]) – See DataCatalog.from_config
load_versions (Optional[Dict[str, str]]) – See DataCatalog.from_config
save_version (Optional[str]) – See DataCatalog.from_config
journal (Optional[Journal]) – See DataCatalog.from_config
- Raises
ValueError – If you try to instantiate a DataCatalogWithDefault directly with this method.
-
classmethod from_data_catalog(data_catalog, default)
Convenience factory method to create a DataCatalogWithDefault from a DataCatalog. The result is a DataCatalog with a default DataSet implementation for any data set which is not registered in the catalog.
- Parameters
data_catalog (DataCatalog) – The DataCatalog to convert to a DataCatalogWithDefault.
default (Callable[[str], AbstractDataSet]) – A callable which accepts a single argument of type string, the key of the data set, and returns an AbstractDataSet. load and save calls on data sets which are not registered to the catalog will be delegated to this AbstractDataSet.
- Return type
DataCatalogWithDefault
- Returns
A new DataCatalogWithDefault which contains all the AbstractDataSets from the provided DataCatalog.
-
list(regex_search=None)
List of all DataSet names registered in the catalog. This can be filtered by providing an optional regular expression which will only return matching keys.
- Parameters
regex_search (Optional[str]) – An optional regular expression which can be provided to limit the data sets returned by a particular pattern.
- Return type
List[str]
- Returns
A list of DataSet names available which match the regex_search criteria (if provided). All data set names are returned by default.
- Raises
SyntaxError – When an invalid regex filter is provided.
Example:
    from kedro.io import DataCatalog

    io = DataCatalog()
    # get data sets where the substring 'raw' is present
    raw_data = io.list(regex_search='raw')
    # get data sets which start with 'prm' or 'feat'
    feat_eng_data = io.list(regex_search='^(prm|feat)')
    # get data sets which end with 'time_series'
    models = io.list(regex_search='.+time_series$')
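The three patterns above can be tried on a plain list of names with the standard re module. The helper below is a sketch under assumptions: filter_names is a hypothetical function, and matching with re.search is one plausible reading of "will only return matching keys", not a statement about how Kedro applies the pattern. The sample names are made up for illustration.

```python
import re
from typing import List, Optional


def filter_names(names: List[str], regex_search: Optional[str] = None) -> List[str]:
    """Return the names matching regex_search, or all names if it is None."""
    if regex_search is None:
        return list(names)
    try:
        pattern = re.compile(regex_search)
    except re.error as exc:
        # Mirror the documented behaviour of raising SyntaxError on a bad pattern.
        raise SyntaxError(f"Invalid regular expression: {regex_search!r}") from exc
    return [name for name in names if pattern.search(name)]


# Made-up data set names, for illustration only.
names = ["raw_cars", "prm_cars", "feat_speed", "sales_time_series"]

raw_data = filter_names(names, "raw")
feat_eng_data = filter_names(names, "^(prm|feat)")
models = filter_names(names, ".+time_series$")
```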
-
load(name, version=None)
Loads a registered data set.
- Parameters
name (str) – A data set to be loaded.
version (Optional[str]) – Optional version to be loaded.
- Return type
Any
- Returns
The loaded data as configured.
- Raises
DataSetNotFoundError – When a data set with the given name has not yet been registered.
-
release(name)
Release any cached data associated with a data set.
- Parameters
name (str) – A data set whose cached data is to be released.
- Raises
DataSetNotFoundError – When a data set with the given name has not yet been registered.
-
save(name, data)
Save data to a registered data set.
- Parameters
name (str) – A data set to be saved to.
data (Any) – A data object to be saved as configured in the registered data set.
- Raises
DataSetNotFoundError – When a data set with the given name has not yet been registered.
-