Schemas

A DataFrame in Spark always has a schema.

Generally, there are two ways to create it (both approaches are sketched in plain Spark right after this list):

  • They infer the schema from the source files (not recommended, since the schema can change over time and the inference can be wrong)
  • They create a StructType and provide it to the DataFrameReader
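
For comparison, here is a minimal plain-Spark sketch of the two approaches. The `spark` session, the `people.csv` path, and the column set are hypothetical and only serve as an illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").appName("schemas").getOrCreate()

// Approach 1: let Spark infer the schema by scanning the data
// (for CSV, every column typically comes back as nullable).
val inferred = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("people.csv")

// Approach 2: hand an explicit StructType to the DataFrameReader.
val peopleSchema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = false)
))
val explicit = spark.read
  .option("header", "true")
  .schema(peopleSchema)
  .csv("people.csv")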

Defining your schema explicitly is a best practice, and you should always do it. However, creating a StructType by hand is tedious, even more so when you want to work with a Dataset[T].

For example, imagine you have the following CSV:

name;age;phone
john;10;
fella;30;+3360000000

If you want to manipulate it as a Dataset[Person], you have to write something like:

import zio.spark.sql._
import org.apache.spark.sql.types._

case class Person(name: String, age: Int, phone: Option[String])

val schema =
  StructType(
    Seq(
      StructField("name", StringType, nullable = false),
      StructField("age", IntegerType, nullable = false),
      StructField("phone", StringType, nullable = true)
    )
  )

val ds: Dataset[Person] = SparkSession.read.schema(schema).csv("path").as[Person].getOrThrow
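
Note that `as[Person]` can fail at analysis time (for example if a column name or type does not line up with the case class), which is why the chain ends with `getOrThrow` to surface that error.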

ZIO-Spark, using Magnolia, can derive the StructType from your case class automatically:

import zio.spark.sql._

case class Person(name: String, age: Int, phone: Option[String])

val ds: Dataset[Person] = SparkSession.read.schema[Person].csv("path").as[Person].getOrThrow
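
The derived StructType is expected to match the handwritten one above: non-Option fields map to non-nullable columns, while Option[String] maps to a nullable StringType column.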