kedro.runner.ParallelRunner

class kedro.runner.ParallelRunner(max_workers=None, is_async=False, extra_dataset_patterns=None)[source]

ParallelRunner is an AbstractRunner implementation. It can be used to run the Pipeline in parallel groups formed by toposort. Please note that this runner implementation validates datasets using the _validate_catalog method, which checks whether any of the datasets are single-process only, using the _SINGLE_PROCESS dataset attribute.
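
Example (a hedged, minimal sketch; the node functions and dataset names are illustrative, and the MemoryDataset spelling assumes Kedro 0.19+):

    from kedro.io import DataCatalog, MemoryDataset
    from kedro.pipeline import node, pipeline
    from kedro.runner import ParallelRunner

    # Node functions must be defined at module level so that worker
    # processes can unpickle them.
    def double(x):
        return x * 2

    def square(x):
        return x ** 2

    if __name__ == "__main__":  # guard required by multiprocessing
        # Both nodes depend only on "numbers", so toposort places them in
        # the same group and ParallelRunner executes them concurrently.
        demo_pipeline = pipeline([
            node(double, inputs="numbers", outputs="doubled"),
            node(square, inputs="numbers", outputs="squared"),
        ])
        catalog = DataCatalog({"numbers": MemoryDataset(data=21)})

        runner = ParallelRunner(max_workers=2)
        results = runner.run(demo_pipeline, catalog)
        print(results)  # free outputs, e.g. {"doubled": 42, "squared": 441}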

Methods

run(pipeline, catalog[, hook_manager, ...])

Run the Pipeline using the datasets provided by catalog and save results back to the same objects.

run_only_missing(pipeline, catalog, hook_manager)

Run only the missing outputs from the Pipeline using the datasets provided by catalog, and save results back to the same objects.

__init__(max_workers=None, is_async=False, extra_dataset_patterns=None)[source]

Instantiates the runner by creating a Manager.

Parameters:
  • max_workers (int | None) – Number of worker processes to spawn. If not set, it is calculated automatically based on the pipeline configuration and the CPU core count. On Windows machines, max_workers cannot be larger than 61 and will be set to min(61, max_workers).

  • is_async (bool) – If True, the node inputs and outputs are loaded and saved asynchronously with threads. Defaults to False.

  • extra_dataset_patterns (dict[str, dict[str, Any]] | None) – Extra dataset factory patterns to be added to the DataCatalog during the run. This is used to set the default datasets to SharedMemoryDataset for ParallelRunner.

Raises:

ValueError – Raised when bad parameters are passed.
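
A short sketch of the constructor options described above; the "{default}" pattern mirrors the SharedMemoryDataset default mentioned under extra_dataset_patterns and should be treated as illustrative:

    from kedro.runner import ParallelRunner

    runner = ParallelRunner()               # worker count chosen automatically
    runner = ParallelRunner(max_workers=4)  # capped at min(61, max_workers) on Windows
    runner = ParallelRunner(
        is_async=True,  # load/save node data in background threads
        extra_dataset_patterns={"{default}": {"type": "SharedMemoryDataset"}},
    )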

run(pipeline, catalog, hook_manager=None, session_id=None)

Run the Pipeline using the datasets provided by catalog and save results back to the same objects.

Parameters:
  • pipeline (Pipeline) – The Pipeline to run.

  • catalog (DataCatalog) – The DataCatalog from which to fetch data.

  • hook_manager (PluginManager | None) – The PluginManager to activate hooks.

  • session_id (str | None) – The id of the session.

Raises:

ValueError – Raised when Pipeline inputs cannot be satisfied.

Return type:

dict[str, Any]

Returns:

Any node outputs that cannot be processed by the DataCatalog. These are returned in a dictionary, where the keys are defined by the node outputs.
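
As a hedged illustration of the return value, reusing demo_pipeline and catalog from the class-level sketch above: outputs registered in the catalog are saved there, while unregistered ("free") outputs are returned.

    runner = ParallelRunner()
    results = runner.run(demo_pipeline, catalog)  # names from the sketch above
    # Only outputs absent from the catalog come back, keyed by output name:
    sorted(results)  # ["doubled", "squared"]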

run_only_missing(pipeline, catalog, hook_manager)

Run only the missing outputs from the Pipeline using the datasets provided by catalog, and save results back to the same objects.

Parameters:
  • pipeline (Pipeline) – The Pipeline to run.

  • catalog (DataCatalog) – The DataCatalog from which to fetch data.

  • hook_manager (PluginManager) – The PluginManager to activate hooks.

Raises:

ValueError – Raised when Pipeline inputs cannot be satisfied.

Return type:

dict[str, Any]

Returns:

Any node outputs that cannot be processed by the DataCatalog. These are returned in a dictionary, where the keys are defined by the node outputs.
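
A hedged sketch of run_only_missing, again reusing the names from the class-level example. Unlike run(), hook_manager is required here; _create_hook_manager is a private Kedro helper used purely to satisfy that argument, and a real project would obtain the manager from its session.

    from kedro.framework.hooks.manager import _create_hook_manager

    runner = ParallelRunner()
    # Nodes whose persisted outputs already exist are skipped; only the
    # sub-pipeline needed to produce the missing outputs is executed.
    results = runner.run_only_missing(
        demo_pipeline, catalog, _create_hook_manager()  # names from the sketch above
    )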