Inner

Inner contains meaningful information about rows with keys on both Datasets. It means that in both Datasets, there is at least one row with exactly same set of keys.

Structure

case class Inner(countRowEqual: Long,
                 countRowNotEqual: Long,
                 countDeltaByRow: Map[Set[String], DeltaByRow],
                 equalRows: DescribeByRow) {
  @transient lazy val byColumn: Map[String, Delta] = {
    val m = implicitly[Monoid[Map[String, Delta]]]
    m.combineAll(countDeltaByRow.map(_._2.byColumn).toSeq :+ equalRows.byColumn.mapValues(d => {
      Delta(d.count, 0, Both(d, d), Describe.empty)
    }))
  }
}

countRowEqual & countRowNotEqual

type: Long

They are two counter which tell you how many rows have exactly the same values by incrementing “countRowEqual” and how many have at least one difference by incrementing “countRowNotEqual”.

countDeltaByRow

type: Map[Set[String], DeltaByRow]

Rows with at least one difference are are attached to the “countDeltaByRow”. It is a map which is composed by a set of String as a key. Each key is composed by columns where differences happens.

Nothing is better than an example, imagine the following Datasets :

  • Left Dataset :

    Key Col1 Col2
    key1 a 1
    key2 b 2
    key3 c 3
    key4 d 4
  • Right Dataset :

    Key Col1 Col2
    key1 a 1
    key2 x 2
    key3 y 4
    key4 z 5

In this case, you will obtain two elements in that map, one with the key [Col1] containing information about the line n°2 and one with the key [Col1, Col2] containing information about the line n°3 and n°4. Each of them has a DeltaByRow as a value.

equalRows

type: DescribeByRow

equalRows contains composition of Rows which are exactly equal don’t need a DeltaByRow since we will not compare them to each other, we directly can add them to a DescribeByRow.

byColumn

type: Map[String, Delta]

The value “byColumn” is an additional analysis which gives you information in a column oriented manner instead of a row oriented manner. Each column has its own Delta describing how data evolve for that particular row no matter if there is or no a difference.