How to read data from nested directories in Apache Spark SQL?

Sometimes your data may be stored in a nested hierarchy, like:

bartosz:/tmp/test-nested-wildcard$ tree
.
├── 11
│   ├── 11.json
│   └── 22
│       ├── 22a.json
│       ├── 22b.json
│       └── 33
│           └── 33.json
└── 12
    └── 12.json

4 directories, 5 files
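
For reference, here's a minimal sketch of how such a tree could be produced. The "letter" values for the three deepest files follow the assertions in the tests below (which of them carries "C" isn't pinned down there, so this mapping is just one plausible choice), while the letters for 11.json and 12.json are made up for illustration:

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

object NestedTreeSetup {
  def main(args: Array[String]): Unit = {
    val baseDir = "/tmp/test-nested-wildcard"
    // one single-line JSON document per file, each with a "letter" field
    val files = Map(
      s"$baseDir/11/11.json" -> "A",          // assumed value, not asserted on below
      s"$baseDir/11/22/22a.json" -> "B",
      s"$baseDir/11/22/22b.json" -> "B",
      s"$baseDir/11/22/33/33.json" -> "C",
      s"$baseDir/12/12.json" -> "A"           // assumed value, not asserted on below
    )
    files.foreach { case (path, letter) =>
      val target = Paths.get(path)
      Files.createDirectories(target.getParent)
      Files.write(target, s"""{"letter": "$letter"}""".getBytes(StandardCharsets.UTF_8))
    }
  }
}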

And if you're wondering whether you can read it with Apache Spark, the answer is yes. You can put a wildcard ("*") at every level of the path. For instance, the pattern below matches every entry nested exactly three levels under the base directory: the two JSON files in 11/22 plus the 11/22/33 directory, whose 33.json gets read as well. The files 11.json and 12.json sit only two levels deep, so they're skipped:

  "Apache Spark" should "read nested data with 3 wildcards" in {
    val sparkSession = SparkSession.builder()
      .appName("Nested hierarchy").master("local[*]").getOrCreate()
    import sparkSession.implicits._

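    // baseDir points at /tmp/test-nested-wildcard from the tree above; "*/*/*"
    // matches 11/22/22a.json, 11/22/22b.json and the 11/22/33 directory, whose
    // 33.json is read as well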
    val readData = sparkSession.read.json(s"${baseDir}/*/*/*").map(row => row.getAs[String]("letter"))
      .collect()

    readData should have size 3
    readData should contain allElementsOf Seq("C", "B", "B")
  }
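
These patterns are resolved with Hadoop's glob syntax, so besides "*" you also get "?", character classes like [12] and alternation like {a,b}. As a small sketch, assuming the same sparkSession, implicits import and baseDir as in the test above, this picks just two of the deep files:

    // Hadoop-style glob alternation: read only 22a.json and 22b.json,
    // leaving the 33 sub-directory out
    val abLetters = sparkSession.read.json(s"${baseDir}/11/22/{22a,22b}.json")
      .map(row => row.getAs[String]("letter"))
      .collect()

    abLetters should have size 2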

You can also use a partial wildcard ("1*" in the example below). It matches both the 11 and 12 directories, but since only 11 has entries nested deep enough to satisfy the rest of the pattern, the result is the same as before:

  "Apache Spark" should "read nested data with a partial wildcard" in {
    val sparkSession = SparkSession.builder()
      .appName("Nested hierarchy").master("local[*]").getOrCreate()
    import sparkSession.implicits._

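    // "1*" matches both the 11 and 12 directories, but only 11 has entries
    // nested deep enough to satisfy the remaining "/*/*"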
    val readData = sparkSession.read.json(s"${baseDir}/1*/*/*").map(row => row.getAs[String]("letter"))
      .collect()

    readData should have size 3
    readData should contain allElementsOf Seq("C", "B", "B")
  }
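
Finally, if you simply want every file under the base directory whatever its depth, and you're on Apache Spark 3.0 or later, the recursiveFileLookup option spares you from counting the levels. A sketch under the same assumptions as the tests above:

    // Spark 3.0+: pick up every JSON file under baseDir regardless of depth,
    // i.e. all 5 files from the tree above
    val allLetters = sparkSession.read
      .option("recursiveFileLookup", "true")
      .json(baseDir)
      .map(row => row.getAs[String]("letter"))
      .collect()

    allLetters should have size 5

Keep in mind that enabling recursiveFileLookup disables partition discovery, which is harmless here since the directory names don't encode partition columns anyway.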