How to read data from nested directories in Apache Spark SQL?

Sometimes your data may be stored in a nested hierarchy, like:

bartosz:/tmp/test-nested-wildcard$ tree
.
├── 11
│   ├── 11.json
│   └── 22
│       ├── 22a.json
│       ├── 22b.json
│       └── 33
│           └── 33.json
└── 12
    └── 12.json

4 directories, 5 files
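
For reference, here's a minimal sketch of how such a tree could be produced. The "letter" values for the three deepest files follow the assertions in the tests below (which of them carries "C" isn't pinned down there, so this mapping is just one plausible choice), while the letters for 11.json and 12.json are made up for illustration:

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

object NestedTreeSetup {
  def main(args: Array[String]): Unit = {
    val baseDir = "/tmp/test-nested-wildcard"
    // one single-line JSON document per file, each with a "letter" field
    val files = Map(
      s"$baseDir/11/11.json" -> "A",          // assumed value, not asserted on below
      s"$baseDir/11/22/22a.json" -> "B",
      s"$baseDir/11/22/22b.json" -> "B",
      s"$baseDir/11/22/33/33.json" -> "C",
      s"$baseDir/12/12.json" -> "A"           // assumed value, not asserted on below
    )
    files.foreach { case (path, letter) =>
      val target = Paths.get(path)
      Files.createDirectories(target.getParent)
      Files.write(target, s"""{"letter": "$letter"}""".getBytes(StandardCharsets.UTF_8))
    }
  }
}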

And if you're wondering whether you can read it with Apache Spark, the answer is yes. You can put a wildcard ("*") at every level of the path. For instance, the pattern below matches every entry nested exactly three levels under the base directory: the two JSON files in 11/22 plus the 11/22/33 directory, whose 33.json gets read as well. The files 11.json and 12.json sit only two levels deep, so they're skipped:

  "Apache Spark" should "read nested data with 3 wildcards" in {
    val sparkSession = SparkSession.builder()
      .appName("Nested hierarchy").master("local[*]").getOrCreate()
    import sparkSession.implicits._

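    // baseDir points at /tmp/test-nested-wildcard from the tree above; "*/*/*"
    // matches 11/22/22a.json, 11/22/22b.json and the 11/22/33 directory, whose
    // 33.json is read as well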
    val readData = sparkSession.read.json(s"${baseDir}/*/*/*").map(row => row.getAs[String]("letter"))
      .collect()

    readData should have size 3
    readData should contain allElementsOf Seq("C", "B", "B")
  }
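
These patterns are resolved with Hadoop's glob syntax, so besides "*" you also get "?", character classes like [12] and alternation like {a,b}. As a small sketch, assuming the same sparkSession, implicits import and baseDir as in the test above, this picks just two of the deep files:

    // Hadoop-style glob alternation: read only 22a.json and 22b.json,
    // leaving the 33 sub-directory out
    val abLetters = sparkSession.read.json(s"${baseDir}/11/22/{22a,22b}.json")
      .map(row => row.getAs[String]("letter"))
      .collect()

    abLetters should have size 2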

You can also use a partial wildcard ("1*" in the example below). It matches both the 11 and 12 directories, but since only 11 has entries nested deep enough to satisfy the rest of the pattern, the result is the same as before:

  "Apache Spark" should "read nested data with a partial wildcard" in {
    val sparkSession = SparkSession.builder()
      .appName("Nested hierarchy").master("local[*]").getOrCreate()
    import sparkSession.implicits._

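    // "1*" matches both the 11 and 12 directories, but only 11 has entries
    // nested deep enough to satisfy the remaining "/*/*"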
    val readData = sparkSession.read.json(s"${baseDir}/1*/*/*").map(row => row.getAs[String]("letter"))
      .collect()

    readData should have size 3
    readData should contain allElementsOf Seq("C", "B", "B")
  }
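
Finally, if you simply want every file under the base directory whatever its depth, and you're on Apache Spark 3.0 or later, the recursiveFileLookup option spares you from counting the levels. A sketch under the same assumptions as the tests above:

    // Spark 3.0+: pick up every JSON file under baseDir regardless of depth,
    // i.e. all 5 files from the tree above
    val allLetters = sparkSession.read
      .option("recursiveFileLookup", "true")
      .json(baseDir)
      .map(row => row.getAs[String]("letter"))
      .collect()

    allLetters should have size 5

Keep in mind that enabling recursiveFileLookup disables partition discovery, which is harmless here since the directory names don't encode partition columns anyway.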