Getting started
[IN PROGRESS]
Import the library:
resolvers += "Spark-tools" at "http://dl.bintray.com/univalence/univalence-jvm"
// replace version with a suitable one
libraryDependencies += "io.univalence" %% "spark-test" % "0.2+245-09a064d9" % Test
Create a DataFrame with a JSON string
We start by importing and extending SparkTest.
import io.univalence.sparktest.SparkTest
import org.scalatest.FunSuiteLike
class MyTestClass extends FunSuiteLike with SparkTest {}
Then, we can create our test:
class MyTestClass extends FunSuiteLike with SparkTest {
test("create df with json string") {
// create df with json string
val df = dfFromJsonString("{a:1}", "{a:2}")
}
}
Comparing DataFrames
To compare DataFrames, call the assertEquals method. It throws a SparkTestError if they are not equal.
For instance, this:
val dfUT = Seq(1, 2, 3).toDF("id")
val dfExpected = Seq(2, 1, 4).toDF("id")
dfUT.assertEquals(dfExpected)
… throws the following exception:
io.univalence.sparktest.SparkTest$ValueError: The data set content is different :
in value at id, 2 was not equal to 1
dataframe({id: 1})
dataframe({id: 2})
in value at id, 1 was not equal to 2
dataframe({id: 2})
dataframe({id: 1})
in value at id, 4 was not equal to 3
dataframe({id: 3})
dataframe({id: 4})
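The report above appears to pair rows by position and to list each value that differs. As a rough illustration of that style of report, the sketch below compares two plain Scala sequences index by index. It is not spark-test's implementation: `RowDiffSketch`, `Mismatch` and `mismatches` are hypothetical names, the column name `id` is hard-coded, and no Spark is involved.

```scala
object RowDiffSketch {
  // One mismatch: the row position plus the two values that differ.
  final case class Mismatch(index: Int, actual: Int, expected: Int) {
    // Mirrors the wording of the report above (expected value first).
    def message: String = s"in value at id, $expected was not equal to $actual"
  }

  // Pair rows by position and keep only the pairs whose values differ.
  // Assumes both sequences have the same length; zip drops any extra rows.
  def mismatches(actual: Seq[Int], expected: Seq[Int]): Seq[Mismatch] =
    actual.zip(expected).zipWithIndex.collect {
      case ((a, e), i) if a != e => Mismatch(i, a, e)
    }
}
```

Calling `RowDiffSketch.mismatches(Seq(1, 2, 3), Seq(2, 1, 4))` yields three mismatches, one per row, matching the three entries in the report.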
Testing with predicates
Another test helper is shouldForAll, which checks that every row satisfies a predicate.
It throws an AssertionError listing the rows that don't match the predicate.
This example:
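// Person is assumed to be a case class such as this one; only age is used by the predicate
case class Person(name: String, age: Int)
val rdd = sc.parallelize(Seq(Person("John", 19), Person("Paul", 17), Person("Emilie", 25), Person("Mary", 5)))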
rdd.shouldForAll(p => p.age > 18) // Paul and Mary are too young
… will throw this exception:
java.lang.AssertionError: No rows from the dataset match the predicate. Rows not matching the predicate :
Person(Paul,17)
Person(Mary,5)
Whereas this example:
val rdd = sc.parallelize(Seq(Person("John", 19), Person("Paul", 52), Person("Emilie", 25), Person("Mary", 83)))
rdd.shouldForAll(p => p.age > 18) // Everyone passes the predicate
… will pass!
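To make the behaviour of the two examples concrete, here is a minimal sketch of the same check on a plain Scala collection. It is an illustration under assumptions, not the library's code: `ForAllSketch` and its `shouldForAll` are hypothetical, the `Person` definition is assumed, and the error message is simplified.

```scala
object ForAllSketch {
  // Hypothetical stand-in for the Person type used in the examples above.
  final case class Person(name: String, age: Int)

  // Collect the rows that fail the predicate and fail only if there are any.
  def shouldForAll[T](rows: Seq[T])(predicate: T => Boolean): Unit = {
    val failing = rows.filterNot(predicate)
    if (failing.nonEmpty)
      throw new AssertionError(
        "Rows not matching the predicate:\n" + failing.mkString("\n"))
  }
}
```

With the first list of people, the call raises an AssertionError naming Paul and Mary; with the second list, it returns normally.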