kedro.contrib.decorators.pyspark.pandas_to_spark

kedro.contrib.decorators.pyspark.pandas_to_spark(spark)

Inspects the decorated function’s inputs and converts all pandas DataFrame inputs to Spark DataFrames.

Note that in the example below we have enabled spark.sql.execution.arrow.enabled. For this to work, you should first pip install pyarrow and add pyarrow to requirements.txt. Enabling this option makes the conversion between pandas and PySpark DataFrames much faster.

Parameters: spark (SparkSession) –

The SparkSession singleton object to use when creating the PySpark DataFrames. A possible pattern you can use here is the following:

spark.py

from pyspark.sql import SparkSession

def get_spark():
    # Build (or retrieve) the shared SparkSession, with Arrow-based
    # pandas conversion enabled.
    return (
        SparkSession.builder
        .master("local[*]")
        .appName("kedro")
        .config("spark.driver.memory", "4g")
        .config("spark.driver.maxResultSize", "3g")
        .config("spark.sql.execution.arrow.enabled", "true")
        .getOrCreate()
    )

nodes.py

from kedro.contrib.decorators.pyspark import pandas_to_spark
from spark import get_spark

@pandas_to_spark(get_spark())
def node_1(data):
    data.show()  # data is now a pyspark.sql.DataFrame
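
For illustration, here is a minimal sketch of how such a node might be invoked; the run.py file name and the sample DataFrame below are hypothetical, not part of the decorator’s API:

run.py

import pandas as pd

from nodes import node_1

# A plain pandas DataFrame; pandas_to_spark converts it to a
# pyspark.sql.DataFrame before the body of node_1 runs.
pdf = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
node_1(pdf)  # data.show() prints the converted Spark DataFrame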
Return type: Callable
Returns: The original function with any pandas DataFrame inputs converted to Spark DataFrames.