Sometimes your data may be stored in a nested hierarchy, like:
bartosz:/tmp/test-nested-wildcard$ tree
.
├── 11
│   ├── 11.json
│   ├── 22
│   │   ├── 22a.json
│   │   └── 22b.json
│   └── 33
│       └── 33.json
└── 12
    └── 12.json

4 directories, 5 files
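If you want to reproduce the layout above locally, a few shell commands are enough. The JSON contents are an assumption here: judging from the tests below, each file holds a single object with a "letter" field, but the exact letter-to-file mapping is illustrative.

```shell
# Recreate the sample directory tree under /tmp/test-nested-wildcard
mkdir -p /tmp/test-nested-wildcard/11/22 /tmp/test-nested-wildcard/11/33 /tmp/test-nested-wildcard/12
# File contents are assumed: one JSON object per file with a "letter" field
echo '{"letter": "A"}' > /tmp/test-nested-wildcard/11/11.json
echo '{"letter": "B"}' > /tmp/test-nested-wildcard/11/22/22a.json
echo '{"letter": "B"}' > /tmp/test-nested-wildcard/11/22/22b.json
echo '{"letter": "C"}' > /tmp/test-nested-wildcard/11/33/33.json
echo '{"letter": "D"}' > /tmp/test-nested-wildcard/12/12.json
```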
And if you're wondering whether you can read it with Apache Spark, yes, you can. A wildcard ("*") can be used at every level of the path; the pattern below, for instance, retrieves all files nested three levels deep, which in this layout means the files under the subdirectories of /tmp/test-nested-wildcard/11:
"Apache Spark" should "read nested data with 3 wildcards" in {
  val sparkSession = SparkSession.builder()
    .appName("Nested hierarchy").master("local[*]").getOrCreate()
  import sparkSession.implicits._

  val readData = sparkSession.read.json(s"${baseDir}/*/*/*")
    .map(row => row.getAs[String]("letter"))
    .collect()

  readData should have size 3
  readData should contain allElementsOf Seq("C", "B", "B")
}
You can also combine a literal prefix with a wildcard ("1*" in the example below, which matches both the 11 and 12 directories):
"Apache Spark" should "read nested data with a partial wildcard" in {
  val sparkSession = SparkSession.builder()
    .appName("Nested hierarchy").master("local[*]").getOrCreate()
  import sparkSession.implicits._

  val readData = sparkSession.read.json(s"${baseDir}/1*/*/*")
    .map(row => row.getAs[String]("letter"))
    .collect()

  readData should have size 3
  readData should contain allElementsOf Seq("C", "B", "B")
}
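The result is the same as before because only the 11 branch has files three levels deep: 12/12.json sits one level up and is not matched by 1*/*/*. You can verify the same glob semantics with plain shell expansion, since the Hadoop path resolution Spark relies on behaves like a filesystem glob here. The /tmp/glob-demo path is a hypothetical scratch directory, recreated with empty files just for the demonstration.

```shell
# Rebuild the same layout with empty placeholder files (contents irrelevant here)
mkdir -p /tmp/glob-demo/11/22 /tmp/glob-demo/11/33 /tmp/glob-demo/12
touch /tmp/glob-demo/11/11.json /tmp/glob-demo/11/22/22a.json \
      /tmp/glob-demo/11/22/22b.json /tmp/glob-demo/11/33/33.json \
      /tmp/glob-demo/12/12.json
# Expand the partial-wildcard pattern: only files three levels deep match
ls /tmp/glob-demo/1*/*/*
```

Only 22a.json, 22b.json and 33.json appear in the output; 11.json and 12.json are skipped because they sit too high in the tree.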