# Pipeline
`Pipeline` helps you structure your Spark jobs with ease. If you are building a simple pipeline using `Dataset`, you can see it as:
- a function reading a `Dataset` using a `SparkSession`,
- some transformations producing our output `Dataset`,
- an action extracting the result from the transformed `Dataset`.
To build a pipeline, you provide three functions describing the three steps above:
```scala
import zio.spark.sql._
import zio.spark.sql.implicits._ // needed for the $"column" syntax

val pipeline = Pipeline(
  load      = SparkSession.read.inferSchema.withHeader.withDelimiter(";").csv("test.csv"),
  transform = _.withColumn("new", $"old" + 1).limit(2),
  action    = _.count()
)
```
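If it helps to see the shape of each step, here is the same pipeline with the three parts bound to named values. This is a sketch assuming zio-spark's usual aliases (`DataFrame` for `Dataset[Row]`, ZIO's `Task`); the annotations are illustrative, not the library's exact signatures:

```scala
import zio._
import zio.spark.sql._
import zio.spark.sql.implicits._

// Step 1: an effectful load that needs a SparkSession.
val load: SIO[DataFrame] =
  SparkSession.read.inferSchema.withHeader.withDelimiter(";").csv("test.csv")

// Step 2: a pure transformation from the input to the output Dataset.
val transform: DataFrame => DataFrame =
  _.withColumn("new", $"old" + 1).limit(2)

// Step 3: an action turning the transformed Dataset into a result.
val action: DataFrame => Task[Long] =
  _.count()

val sameJob: SIO[Long] = Pipeline(load, transform, action).run
```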
It creates a description of our job; you can then turn it into a ZIO effect using `run`:
```scala
val job: SIO[Long] = pipeline.run
```
`SIO[Long]` is an alias for `ZIO[SparkSession, Throwable, Long]`, meaning the effect needs a `SparkSession` and returns a `Long` (the number of rows of the DataFrame).
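To actually run the job, you must provide a `SparkSession`. Here is a minimal sketch of a complete application, assuming the builder helpers shown in the zio-spark README (`localAllNodes` from `zio.spark.parameter` and `asLayer`):

```scala
import zio._
import zio.spark.parameter._
import zio.spark.sql._

object Main extends ZIOAppDefault {
  // Describe a local SparkSession and expose it as a ZLayer.
  private val session =
    SparkSession.builder
      .master(localAllNodes)
      .appName("pipeline-example")
      .asLayer

  // `job` is the SIO[Long] defined above; providing the layer
  // removes the SparkSession requirement from the environment.
  override def run: Task[Long] = job.provide(session)
}
```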