Pipeline

Pipeline helps you structure your Spark jobs with ease.

If you are building a simple pipeline around a Dataset, you can see it as:

  • a function reading a Dataset using a SparkSession.
  • some transformations producing the output Dataset.
  • an action extracting the result from the transformed Dataset.

So, to build a pipeline, you need to provide three functions describing the three steps above:

import zio.spark.sql._
import zio.spark.sql.implicits._ // brings in the $ column interpolator

val pipeline = Pipeline(
  load      = SparkSession.read.inferSchema.withHeader.withDelimiter(";").csv("test.csv"),
  transform = _.withColumn("new", $"old" + 1).limit(2),
  action    = _.count()
)
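
Under the hood, you can think of Pipeline as a small structure bundling these three functions. The following is an illustrative sketch only, not zio-spark's actual definition; the SketchedPipeline name is hypothetical:

import zio._
import zio.spark.sql._

// Illustrative sketch only, not zio-spark's actual Pipeline definition.
// load:      an effect producing the input Dataset (it needs a SparkSession)
// transform: a function from the input Dataset to the output Dataset
// action:    an effect extracting a result from the output Dataset
final case class SketchedPipeline[Source, Output, Result](
  load:      SIO[Dataset[Source]],
  transform: Dataset[Source] => Dataset[Output],
  action:    Dataset[Output] => Task[Result]
) {
  // Running the pipeline chains the three steps into a single effect.
  def run: SIO[Result] = load.map(transform).flatMap(action)
}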

Calling Pipeline(...) only creates a description of the job; you can then turn this description into a ZIO effect using run:

val job: SIO[Long] = pipeline.run

SIO[Long] is an alias for ZIO[SparkSession, Throwable, Long], meaning that the resulting effect needs a SparkSession and produces a Long (here, the number of rows of the DataFrame).
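
To actually execute the job, you have to provide a SparkSession. Here is a minimal sketch using a ZIO application; the builder calls below (localAllNodes, asLayer) follow zio-spark's published examples, and the application name is a placeholder:

import zio._
import zio.spark.parameter._
import zio.spark.sql._

object Main extends ZIOAppDefault {
  // Layer building the SparkSession; localAllNodes runs Spark locally
  // on all available cores.
  private val session =
    SparkSession.builder
      .master(localAllNodes)
      .appName("pipeline-example")
      .asLayer

  // job is the SIO[Long] defined above; provide feeds it the SparkSession.
  override def run = job.provide(session)
}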