
Welcome to Kedro’s documentation!
Introduction
Get Started
Tutorial
Kedro Project Setup
Data Catalog
- The Data Catalog
- Using the Data Catalog within Kedro configuration
- Specifying the location of the dataset
- Data Catalog *_args parameters
- Using the Data Catalog with the YAML API
- Adding parameters
- Feeding in credentials
- Loading multiple datasets that have similar configuration
- Transcoding datasets
- Transforming datasets
- Versioning datasets and ML models
- Using the Data Catalog with the Code API
- Kedro IO
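
For orientation, a minimal sketch of the Data Catalog's Code API (the YAML API listed above expresses the same configuration declaratively). The dataset names and file path are hypothetical, and the import location assumes a release where pandas datasets live in kedro.extras.datasets:

```python
# Minimal Data Catalog sketch (hypothetical dataset names and path).
from kedro.io import DataCatalog, MemoryDataSet
from kedro.extras.datasets.pandas import CSVDataSet

catalog = DataCatalog(
    data_sets={
        "example_iris_data": CSVDataSet(filepath="data/01_raw/iris.csv"),
        "intermediate_table": MemoryDataSet(),
    }
)

df = catalog.load("example_iris_data")   # reads the CSV into a pandas DataFrame
catalog.save("intermediate_table", df)   # keeps the DataFrame in memory
```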
Extend Kedro
- Custom datasets
- Scenario
- Project setup
- The anatomy of a dataset
- Implement the _load method with fsspec
- Implement the _save method with fsspec
- Implement the _describe method
- The complete example
- Integration with PartitionedDataSet
- Versioning
- Thread-safety
- How to handle credentials and different filesystems
- How to contribute a custom dataset implementation
- Dataset transformers
- Decorators
- Hooks
- Kedro plugins
- Create a Kedro starter
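
As a rough illustration of the dataset anatomy covered above, here is a sketch of an AbstractDataSet subclass implementing _load, _save and _describe on top of fsspec. The class name, file format and path handling are illustrative, not the documentation's own example:

```python
# Hypothetical custom dataset sketch: a plain-text dataset backed by fsspec.
from pathlib import PurePosixPath
from typing import Any, Dict

import fsspec
from kedro.io import AbstractDataSet


class TextDataSet(AbstractDataSet):
    def __init__(self, filepath: str):
        # Split "s3://bucket/key.txt" into protocol and path; default to local files
        protocol, path = fsspec.core.split_protocol(filepath)
        self._protocol = protocol or "file"
        self._filepath = PurePosixPath(path)
        self._fs = fsspec.filesystem(self._protocol)

    def _load(self) -> str:
        with self._fs.open(str(self._filepath), mode="r") as f:
            return f.read()

    def _save(self, data: str) -> None:
        with self._fs.open(str(self._filepath), mode="w") as f:
            f.write(data)

    def _describe(self) -> Dict[str, Any]:
        return dict(protocol=self._protocol, filepath=str(self._filepath))
```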
Logging
Development
Deployment
Tools Integration
- Build a Kedro pipeline with PySpark
- Centralise Spark configuration in conf/base/spark.yml
- Initialise a SparkSession in ProjectContext
- Use Kedro’s built-in Spark datasets to load and save raw data
- Use MemoryDataSet for intermediary DataFrame
- Use MemoryDataSet with copy_mode="assign" for non-DataFrame Spark objects
- Tips for maximising concurrency using ThreadRunner
- Use Kedro with IPython and Jupyter Notebooks/Lab
- How to use Kedro on a Databricks cluster
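
To illustrate the MemoryDataSet points in the PySpark section above, a small sketch with hypothetical dataset names; copy_mode="assign" stops Kedro from attempting to deep-copy lazily evaluated Spark objects between nodes:

```python
# Sketch of Code API catalog entries for a PySpark pipeline (hypothetical names/paths).
# Assumes a release where SparkDataSet lives in kedro.extras.datasets.spark.
from kedro.io import DataCatalog, MemoryDataSet
from kedro.extras.datasets.spark import SparkDataSet

catalog = DataCatalog(
    data_sets={
        # Raw data read and written through Kedro's built-in Spark dataset
        "raw_events": SparkDataSet(
            filepath="data/01_raw/events.parquet", file_format="parquet"
        ),
        # Intermediary Spark DataFrame held in memory without copying
        "filtered_events": MemoryDataSet(copy_mode="assign"),
    }
)
```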
FAQs
- Frequently asked questions
- What is Kedro?
- What are the primary advantages of Kedro?
- How does Kedro compare to other projects?
- What is the philosophy behind Kedro?
- What is data engineering convention?
- What version of Python does Kedro support?
- How do I upgrade Kedro?
- How can I use a development version of Kedro?
- How can I find out more about Kedro?
- How can I get my question answered?
- Kedro architecture overview
Resources
API Docs
- kedro: Kedro is a framework that makes it easy to build robust and scalable data pipelines by providing uniform project templates, data abstraction, configuration and pipeline assembly.
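
A minimal, hypothetical example of that pipeline assembly: two toy nodes wired together and run against an in-memory catalog (function and dataset names are illustrative, not from a real project template):

```python
# Minimal pipeline-assembly sketch with toy functions and in-memory data.
from kedro.io import DataCatalog, MemoryDataSet
from kedro.pipeline import Pipeline, node
from kedro.runner import SequentialRunner


def double(numbers):
    return [n * 2 for n in numbers]


def total(numbers):
    return sum(numbers)


pipeline = Pipeline(
    [
        node(double, inputs="raw_numbers", outputs="doubled_numbers"),
        node(total, inputs="doubled_numbers", outputs="grand_total"),
    ]
)

catalog = DataCatalog({"raw_numbers": MemoryDataSet([1, 2, 3])})
print(SequentialRunner().run(pipeline, catalog))  # {'grand_total': 12}
```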