How to read the content of nested data in Apache Spark SQL?

NestedDataManipulationTest.scala

Nested data structures are an interesting solution for data organization. They let us bring values with similar characteristics into logical common groups. Most of the time this improves raw data readability and simplifies the processing, since we can delegate the handling of one block to a single function. An example is the following JSON document representing a visit on a website:

{"user": "398Q0SnxodsqoL1383", "browser": {"name": "Mozilla Firefox", "version": "1.2.3", "lang": "en-EN"}, "geo": {"country": "France", "city": "Paris"}}

Apache Spark SQL also comes with a way to programmatically manipulate such data through the Row API:

"Row API" should "be used to manipulate nested data" in {
  import sparkSession.implicits._
  val usersVisits = Seq(
    ("u1", Browser("Firefox", "1.23", "en-EN")),
    ("u2", Browser("Firefox", "1.23", "fr-FR")),
    ("u3", Browser("Internet Explorer", "11.23", "en-EN")),
    ("u4", Browser("Chrome", "16.23", "en-EN"))
  ).toDF("user_id", "browser")

  // The nested browser struct is exposed as a Row, so its fields are read with getAs
  val visitCodes = usersVisits.map(row => {
    val browserContext = row.getAs[Row]("browser")
    s"${row.getAs[String]("user_id")}-${browserContext.getAs[String]("name")}-${browserContext.getAs[String]("version")}"
  }).collectAsList()

  visitCodes should have size 4
  visitCodes should contain allOf("u1-Firefox-1.23", "u2-Firefox-1.23", "u3-Internet Explorer-11.23",
    "u4-Chrome-16.23")
}
// Defined outside the test body so that Spark SQL can derive an encoder for it
case class Browser(name: String, version: String, lang: String)
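For comparison, the same visit codes can be built without the Row API by addressing the nested fields with dot notation. The snippet below is a sketch, not part of the test above: usersVisits stands for a DataFrame shaped like the one built in the test, and the sparkSession.implicits._ import is assumed to be in scope for the .as[String] conversion.

import org.apache.spark.sql.functions.{col, concat_ws}

// concat_ws joins the selected columns with a "-" separator, exactly like the
// string interpolation used in the Row API version
val visitCodesFromColumns = usersVisits
  .select(concat_ws("-", col("user_id"), col("browser.name"), col("browser.version")).as("visit_code"))
  .as[String]
  .collect()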