Sometimes your data may be stored in a nested hierarchy, like:
bartosz:/tmp/test-nested-wildcard$ tree .
.
├── 11
│   ├── 11.json
│   ├── 22
│   │   ├── 22a.json
│   │   └── 22b.json
│   └── 33
│       └── 33.json
└── 12
    └── 12.json

4 directories, 5 files
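If you want to reproduce the layout above, a setup like the following will do. It's only a sketch: the post doesn't show the files' contents, so the `writeJson` helper, the single `letter` field, and the specific letters per file are assumptions (chosen so that the depth-3 files hold "B", "B" and "C", matching the assertions in the tests below):

```scala
import java.nio.file.{Files, Paths}

object NestedDataSetup {
  // Assumed base directory, taken from the tree listing above
  val baseDir = "/tmp/test-nested-wildcard"

  // Hypothetical helper: writes one single-line JSON document per file,
  // with a "letter" field that the tests read back
  def writeJson(relativePath: String, letter: String): Unit = {
    val path = Paths.get(baseDir, relativePath)
    Files.createDirectories(path.getParent)
    Files.write(path, s"""{"letter": "$letter"}""".getBytes("UTF-8"))
  }

  def main(args: Array[String]): Unit = {
    writeJson("11/11.json", "A")      // letter is a guess; not read by the 3-wildcard tests
    writeJson("11/22/22a.json", "B")
    writeJson("11/22/22b.json", "B")
    writeJson("11/33/33.json", "C")
    writeJson("12/12.json", "D")      // letter is a guess; not read by the 3-wildcard tests
    println("created test files under " + baseDir)
  }
}
```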
And if you're wondering whether you can read it with Apache Spark, yes - you can. You can use a wildcard character ("*") at every level of the hierarchy and, for instance, retrieve all the data stored in the subdirectories of /tmp/test-nested-wildcard/11:
"Apache Spark" should "read nested data with 3 wildcards" in {
  val sparkSession = SparkSession.builder()
    .appName("Nested hierarchy").master("local[*]").getOrCreate()
  import sparkSession.implicits._

  val readData = sparkSession.read.json(s"${baseDir}/*/*/*")
    .map(row => row.getAs[String]("letter"))
    .collect()

  readData should have size 3
  readData should contain allElementsOf Seq("C", "B", "B")
}
You can also retrieve the data with a partial wildcard ("1*" in the example):
"Apache Spark" should "read nested data with a partial wildcard" in {
  val sparkSession = SparkSession.builder()
    .appName("Nested hierarchy").master("local[*]").getOrCreate()
  import sparkSession.implicits._

  val readData = sparkSession.read.json(s"${baseDir}/1*/*/*")
    .map(row => row.getAs[String]("letter"))
    .collect()

  readData should have size 3
  readData should contain allElementsOf Seq("C", "B", "B")
}
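To see why "1*" selects the directories it does, it helps to look at the glob semantics themselves. Spark delegates path resolution to Hadoop's FileSystem globbing, and for a simple pattern like "1*" ("any name starting with 1") java.nio's glob matcher follows the same convention, so we can sketch the matching without Spark at all:

```scala
import java.nio.file.{FileSystems, Paths}

object PartialWildcardSketch {
  def main(args: Array[String]): Unit = {
    // "glob:1*" matches any single path name starting with "1"
    val matcher = FileSystems.getDefault.getPathMatcher("glob:1*")

    // The top-level directories from the tree listing, plus one that
    // should NOT match
    val directories = Seq("11", "12", "22")
    val matched = directories.filter(name => matcher.matches(Paths.get(name)))

    println(matched)  // List(11, 12)
  }
}
```

Both 11 and 12 match "1*", but only 11 has files two more levels down, which is why the partial-wildcard test still reads the same 3 records as the previous one.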