I'm the author of Data Engineering Design Patterns (O'Reilly),
a Databricks MVP, and
a freelance data engineer specializing in Apache Spark and Databricks.
I help teams move from working pipelines to resilient architectures.
I'm currently accepting new projects for June 2026. Whether you need a 2-day architectural audit, a hands-on lead for a complex data engineering problem, or a workshop, let's discuss your project here.
There are different ways to create a DataFrame in Apache Spark SQL. The same applies to an empty Dataset, which may be useful if you prefer to deal with emptiness rather than missing values (null ...
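For illustration, a minimal sketch of two common ways to get an empty Dataset, assuming a local SparkSession and a hypothetical Customer case class:

import org.apache.spark.sql.{Encoders, Row, SparkSession}

// Hypothetical record type, used only for this illustration
case class Customer(id: Int, name: String)

object EmptyDatasetSketch extends App {
  val spark = SparkSession.builder().appName("empty-dataset").master("local[*]").getOrCreate()
  import spark.implicits._

  // Strongly typed empty Dataset; the schema comes from the case class
  val emptyCustomers = spark.emptyDataset[Customer]
  emptyCustomers.printSchema()

  // Untyped variant: an empty DataFrame built from an empty RDD plus an explicit schema
  val emptyDf = spark.createDataFrame(
    spark.sparkContext.emptyRDD[Row],
    Encoders.product[Customer].schema
  )
  assert(emptyDf.count() == 0)
}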
In your daily work you've certainly run into the Table or view not found problem. The error happens in situations like this: val customerIds = JoinHelper.insertCustomers(1) JoinHelper.insertOrde...
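One frequent cause of this error (an assumption here, since the excerpt is truncated) is querying a temporary view from a different SparkSession than the one that registered it; a minimal sketch:

import org.apache.spark.sql.SparkSession

object TempViewScopeSketch extends App {
  val spark = SparkSession.builder().appName("temp-view-scope").master("local[*]").getOrCreate()
  import spark.implicits._

  // The temporary view lives in the catalog of THIS session only
  Seq((1, 100.0)).toDF("customer_id", "amount").createOrReplaceTempView("orders")
  spark.sql("SELECT * FROM orders").show()

  // A new session has its own temp view registry, so the same query
  // fails there with AnalysisException: Table or view not found: orders
  val anotherSession = spark.newSession()
  // anotherSession.sql("SELECT * FROM orders") // would throw
}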
Sometimes the entries in the processed dataset can be duplicated. For instance, an IoT device can mistakenly send the same metrics more than once, or our ingestion step can badly format the message. Spark...
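A minimal sketch of deduplication with dropDuplicates, using made-up IoT metrics:

import org.apache.spark.sql.SparkSession

object DeduplicationSketch extends App {
  val spark = SparkSession.builder().appName("dedup").master("local[*]").getOrCreate()
  import spark.implicits._

  val metrics = Seq(
    ("device-1", "2024-01-01T10:00", 21.5),
    ("device-1", "2024-01-01T10:00", 21.5), // the same metric sent twice
    ("device-2", "2024-01-01T10:00", 19.0)
  ).toDF("device_id", "event_time", "temperature")

  // Keeps one arbitrary row per (device_id, event_time) pair, so it is
  // safe only when the duplicated payloads are really identical
  metrics.dropDuplicates("device_id", "event_time").show()
}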
Apache Spark SQL lets us manipulate JSON fields in many different ways. One of the features is field extraction from a stringified JSON with the json_tuple(json: Column, fields: String*) function: ...
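A short example of json_tuple on a hypothetical payload column; the extracted fields come back as separate columns:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.json_tuple

object JsonTupleSketch extends App {
  val spark = SparkSession.builder().appName("json-tuple").master("local[*]").getOrCreate()
  import spark.implicits._

  val logs = Seq("""{"user": "u1", "city": "Paris"}""").toDF("payload")

  // Each requested field of the stringified JSON becomes its own column
  logs.select(json_tuple($"payload", "user", "city").as(Seq("user", "city"))).show()
}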
A schema helps to ensure good data quality and to optimize data exploration. However, in some cases defining the schema may be hard. That's especially true if your dataset has a lot of nested levels, which...
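One way to sidestep hand-written schemas (a sketch only, with a hypothetical /tmp/events path) is to let Spark infer the nested schema from a sample of the records and reuse the result:

import org.apache.spark.sql.SparkSession

object SchemaInferenceSketch extends App {
  val spark = SparkSession.builder().appName("schema-inference").master("local[*]").getOrCreate()

  // Infer the nested schema from ~10% of the records instead of
  // writing the StructType by hand
  val inferred = spark.read
    .option("samplingRatio", "0.1")
    .json("/tmp/events/*.json")
  inferred.printSchema()

  // Reuse the inferred schema to avoid a second inference pass
  val withKnownSchema = spark.read.schema(inferred.schema).json("/tmp/events/*.json")
}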
You need to migrate SQL Server code that uses the STRING_AGG function. Unfortunately, it's a SQL Server-specific feature. To migrate it, you will have to use other transformations available in Apache Sp...
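One possible migration path (a sketch, not necessarily the article's solution) combines collect_list with concat_ws; note that collect_list gives no ordering guarantee, unlike STRING_AGG ... WITHIN GROUP:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, concat_ws}

object StringAggMigrationSketch extends App {
  val spark = SparkSession.builder().appName("string-agg").master("local[*]").getOrCreate()
  import spark.implicits._

  val orders = Seq((1, "book"), (1, "pen"), (2, "mug")).toDF("customer_id", "product")

  // SQL Server: SELECT customer_id, STRING_AGG(product, ',') FROM orders GROUP BY customer_id
  orders.groupBy($"customer_id")
    .agg(concat_ws(",", collect_list($"product")).as("products"))
    .show()
}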
Sometimes your data may be stored in a nested hierarchy, like: bartosz:/tmp/test-nested-wildcard$ tree . ├── 11 │ ├── 11.json │ └── 22 │ └── 22a.json │...
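A sketch of two ways to read such a tree, reusing the directory above and assuming JSON files; both options are standard DataFrameReader ones:

import org.apache.spark.sql.SparkSession

object NestedWildcardSketch extends App {
  val spark = SparkSession.builder().appName("nested-wildcard").master("local[*]").getOrCreate()

  // Wildcards match exactly one directory level per asterisk...
  val oneLevelDeep = spark.read.json("/tmp/test-nested-wildcard/*/*.json")

  // ...while recursiveFileLookup (Spark 3.0+) walks the whole tree
  val allLevels = spark.read
    .option("recursiveFileLookup", "true")
    .json("/tmp/test-nested-wildcard")
}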
Nested data structures are an interesting solution for data organization. They let us bring values with similar characteristics into common logical groups. Most of the time it improves raw data readabilit...
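For illustration, a minimal sketch that groups hypothetical address columns into a struct and reads them back with dot notation:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.struct

object NestedStructSketch extends App {
  val spark = SparkSession.builder().appName("nested-struct").master("local[*]").getOrCreate()
  import spark.implicits._

  val users = Seq(("u1", "Paris", "75001")).toDF("id", "city", "zip_code")

  // Group the address-related columns into a single nested struct...
  val nested = users.select($"id", struct($"city", $"zip_code").as("address"))
  nested.printSchema()

  // ...and access the nested fields back with dot notation
  nested.select($"address.city").show()
}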
To debug your Apache Spark SQL programs, or simply to understand better how they work, you can use the debugging features exposed through the org.apache.spark.sql.execution.debug package. One of them lets you se...
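A minimal sketch of the two extension methods that package adds to Dataset:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.debug._

object DebugSketch extends App {
  val spark = SparkSession.builder().appName("debug").master("local[*]").getOrCreate()
  import spark.implicits._

  val numbers = Seq(1, 2, 3).toDF("nr").filter($"nr" > 1)

  // Executes the query and prints per-operator tuple counts and column metadata
  numbers.debug()
  // Prints the Java source produced by whole-stage code generation
  numbers.debugCodegen()
}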
Using a UDF in a SQL statement or programmatically is quite easy: you either reference the function's name or simply call the object returned after registration. Using it in column-based operat...
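A short sketch contrasting the two usages, with a hypothetical upper-casing UDF:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object UdfSketch extends App {
  val spark = SparkSession.builder().appName("udf").master("local[*]").getOrCreate()
  import spark.implicits._

  val letters = Seq("a", "b").toDF("letter")

  // functions.udf returns a UserDefinedFunction that can be applied
  // directly in column-based operations...
  val upper = udf((s: String) => s.toUpperCase)
  letters.select(upper(col("letter")).as("upper_letter")).show()

  // ...while spark.udf.register makes the same function callable from SQL
  spark.udf.register("upper_udf", (s: String) => s.toUpperCase)
  letters.selectExpr("upper_udf(letter)").show()
}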
Spark SQL provides support for a lot of standard SQL operations, including the IN clause. It can be easily used by importing the implicits of the created SparkSession object: private val spark...
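A minimal sketch, with made-up data, of isin as the programmatic counterpart of the SQL IN clause:

import org.apache.spark.sql.SparkSession

object InClauseSketch extends App {
  val spark = SparkSession.builder().appName("in-clause").master("local[*]").getOrCreate()
  import spark.implicits._

  val users = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "letter")

  // Programmatic IN, thanks to the $ interpolator from the imported implicits
  users.filter($"letter".isin("a", "b")).show()

  // The same clause expressed as plain SQL
  users.createOrReplaceTempView("users")
  spark.sql("SELECT * FROM users WHERE letter IN ('a', 'b')").show()
}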
I appreciate Apache Spark SQL because you can use it either as a data engineer, with some programmatic logic, or as a data analyst, only by writing SQL queries. And sometimes writing these queries can...
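To illustrate the two audiences, a small sketch running the same aggregation once as SQL and once programmatically, on made-up orders data:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, lit, sum}

object SqlVsApiSketch extends App {
  val spark = SparkSession.builder().appName("sql-vs-api").master("local[*]").getOrCreate()
  import spark.implicits._

  Seq((1, 100.0), (2, 50.0)).toDF("order_id", "amount").createOrReplaceTempView("orders")

  // Analyst style: plain SQL
  spark.sql("SELECT COUNT(*) AS orders_count, SUM(amount) AS total FROM orders").show()

  // Engineer style: the programmatic equivalent
  spark.table("orders").agg(count(lit(1)).as("orders_count"), sum($"amount").as("total")).show()
}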