Apache Spark SQL articles

on waitingforcode.com

Multiple SparkSession for one SparkContext

Some months ago bithw1 posted an interesting question on my Github about multiple SparkSessions sharing the same SparkContext. If you have similar interrogations, feel free to ask - maybe it will give a birth to more detailed post adding some more value to the community. This post, at least, tries to do so by answering the question. Continue Reading →

Apache Spark SQL, Hive and insertInto command

Some time ago on my Github bithw1 pointed out an interesting behavior of Hive integration on Apache Spark SQL. To not delve too much into details now, I can tell that the behavior was about not respected DataFrame schema. Our quick exchange ended up with an explanation but it also encouraged me to go much more into details to understand the hows and whys. Continue Reading →

Dealing with nested data in Apache Spark SQL

Nested data structure is very useful in data denormalization for Big Data needs. It avoids joins that we could use for several related and fully normalized datasets. But processing such data structures is not always simple. Fortunately Apache Spark SQL provides different utility functions helping to work with them. Continue Reading →

Spark SQL Cost-Based Optimizer

Prior to Spark 2.2.0 release, the data processing was based on a set of heuristic rules ignoring the typology of the data. But the most recent release brought a tool well known from the RDBMS world that is a Cost-Based Optimizer. Continue Reading →