Spark SQL tips

How to use the IN clause in a Spark SQL query?

Spark SQL supports many standard SQL operations, including the IN clause. It is easy to use once you import the implicits of the created SparkSession object: private val spark...

Continue Reading →
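
Since the teaser stops before its code, here is a minimal sketch of the idea, assuming a local SparkSession and a made-up (id, letter) dataset. The object and method names are hypothetical; Column.isin is the programmatic counterpart of the SQL IN clause.

```scala
import org.apache.spark.sql.SparkSession

object InClauseSketch {
  // Hypothetical helper: returns the letters whose id appears in `ids`.
  def lettersWithIds(spark: SparkSession, ids: Seq[Int]): Seq[String] = {
    import spark.implicits._ // brings toDF and the $"column" syntax into scope

    // Made-up sample data, only for illustration
    val letters = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "letter")
    letters
      .where($"id".isin(ids: _*)) // programmatic equivalent of "id IN (...)"
      .orderBy($"id")
      .select($"letter")
      .as[String]
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]").appName("in-clause-sketch").getOrCreate()
    println(lettersWithIds(spark, Seq(1, 3)))
    spark.stop()
  }
}
```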

How to deduplicate entries in Spark SQL?

Sometimes the processed dataset contains duplicated entries. For instance, an IoT device can mistakenly send the same metrics more than once, or our ingestion step can format the message badly. Spark...

Continue Reading →
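
One possible way to do it, sketched under the assumption that two entries are duplicates when all the listed columns match, is Dataset.dropDuplicates. The device metrics below and the object name are invented for the example.

```scala
import org.apache.spark.sql.SparkSession

object DeduplicationSketch {
  // Hypothetical helper: removes rows that repeat the same (device, value) pair.
  def distinctMetrics(spark: SparkSession): Seq[(String, Int)] = {
    import spark.implicits._

    // Made-up metrics; "device1" sent the same value twice
    val metrics = Seq(("device1", 10), ("device1", 10), ("device2", 20))
      .toDF("device", "value")

    metrics
      .dropDuplicates("device", "value") // keeps one row per distinct pair
      .orderBy("device")
      .as[(String, Int)]
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]").appName("dedup-sketch").getOrCreate()
    println(distinctMetrics(spark))
    spark.stop()
  }
}
```

Passing only a subset of the columns to dropDuplicates keeps a single row per value of that subset, without any guarantee about which of the duplicated rows survives.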

How to deal with "org.apache.spark.sql.AnalysisException: Table or view not found" error?

In your daily work you've certainly met the Table or view not found problem. The error happens in situations like this: val customerIds = JoinHelper.insertCustomers(1) JoinHelper.insertOrde...

Continue Reading →
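
The teaser is cut before the cause is explained, so the sketch below only illustrates one common trigger, which is not necessarily the one the article covers: querying a name that was never registered in the session. The view name customers_sketch and the object are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

object TableNotFoundSketch {
  // True when the query cannot be analyzed, e.g. because the view is unknown.
  def queryFails(spark: SparkSession): Boolean =
    Try(spark.sql("SELECT * FROM customers_sketch")).isFailure

  // Registers the view first, then runs the same query.
  def countAfterRegistration(spark: SparkSession): Long = {
    import spark.implicits._
    Seq((1, "customer 1")).toDF("id", "name")
      .createOrReplaceTempView("customers_sketch")
    spark.sql("SELECT * FROM customers_sketch").count()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]").appName("table-not-found-sketch").getOrCreate()
    println(s"fails before registration: ${queryFails(spark)}")
    println(s"rows after registration: ${countAfterRegistration(spark)}")
    spark.stop()
  }
}
```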

How to read the content of nested data in Apache Spark SQL?

Nested data structures are an interesting solution for data organization. They let us bring values with similar characteristics into common logical groups. Most of the time it improves raw data readabilit...

Continue Reading →
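
As an illustration, nested fields can be reached with the dot notation. This is a minimal sketch with invented case classes (SketchUser, SketchAddress) and data, not the example from the full article.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical nested structure: every user carries an address group
case class SketchAddress(city: String, zip: String)
case class SketchUser(name: String, address: SketchAddress)

object NestedDataSketch {
  // Reads the nested "city" field of every user.
  def cities(spark: SparkSession): Seq[String] = {
    import spark.implicits._

    val users = Seq(
      SketchUser("user1", SketchAddress("Paris", "75001")),
      SketchUser("user2", SketchAddress("Warsaw", "00-001"))
    ).toDS()

    users
      .select($"address.city") // "parent.child" addresses a nested field
      .as[String]
      .collect()
      .toSeq
      .sorted
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]").appName("nested-data-sketch").getOrCreate()
    println(cities(spark))
    spark.stop()
  }
}
```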

How to generate a schema from a case class in Apache Spark SQL?

A schema helps to ensure good data quality and to optimize data exploration. However, in some cases defining a schema may be hard. That's especially true if your dataset has a lot of nested levels which...

Continue Reading →
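
One way to derive the schema, shown here as a sketch and not necessarily the approach of the full article, goes through the encoder of the case class. SketchLetter and SketchPosition are made-up types.

```scala
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.StructType

// Hypothetical case classes with one nested level
case class SketchPosition(row: Int, column: Int)
case class SketchLetter(id: Int, symbol: String, position: SketchPosition)

object SchemaFromCaseClassSketch {
  // The product encoder already knows the structure of the case class.
  def letterSchema: StructType = Encoders.product[SketchLetter].schema

  def main(args: Array[String]): Unit = {
    // treeString renders the nested fields in a readable form
    println(letterSchema.treeString)
  }
}
```

The resulting StructType can then be passed to a reader, for instance with spark.read.schema(...), so that the nested levels do not have to be declared by hand.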

How to write advanced SQL queries without escaping characters in Apache Spark SQL?

I appreciate Apache Spark SQL because you can use it either as a data engineer, with some programmatic logic, or as a data analyst, only by writing SQL queries. And sometimes writing these queries can...

Continue Reading →
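
The teaser does not reveal the technique, so the following is only a guess at the idea: Scala's triple-quoted strings let a query contain both single and double quotes without any backslash. The view name words_sketch and the data are hypothetical, and the sketch assumes Spark's default setting where a double-quoted value is a string literal.

```scala
import org.apache.spark.sql.SparkSession

object RawSqlSketch {
  // Finds a word containing a single quote, with no escaping in the query text.
  def wordsWithApostrophe(spark: SparkSession): Seq[String] = {
    import spark.implicits._
    Seq("it's", "plain").toDF("word").createOrReplaceTempView("words_sketch")

    // Inside """...""" the quotes can be written as they are
    val query =
      """SELECT word
        |FROM words_sketch
        |WHERE word = "it's"
        |""".stripMargin

    spark.sql(query).as[String].collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]").appName("raw-sql-sketch").getOrCreate()
    println(wordsWithApostrophe(spark))
    spark.stop()
  }
}
```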

How to extract data from stringified JSON field in Apache Spark SQL?

Apache Spark SQL lets us manipulate JSON fields in many different ways. One of the features is field extraction from a stringified JSON with the json_tuple(json: Column, fields: String*) function: ...

Continue Reading →
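
A minimal sketch of json_tuple in action, assuming a made-up dataset where the second column holds the stringified JSON. The extracted values come back as strings; the object name and the sample document are invented.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.json_tuple

object JsonTupleSketch {
  // Extracts the "name" and "city" fields from the stringified JSON column.
  def extractFields(spark: SparkSession): Seq[(Int, String, String)] = {
    import spark.implicits._

    val events = Seq((1, """{"name": "user1", "city": "Paris"}"""))
      .toDF("id", "json")

    events
      .select($"id", json_tuple($"json", "name", "city")) // one column per field
      .toDF("id", "name", "city") // rename the generated columns
      .as[(Int, String, String)]
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]").appName("json-tuple-sketch").getOrCreate()
    println(extractFields(spark))
    spark.stop()
  }
}
```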

How to create an empty DataFrame in Apache Spark SQL?

There are different ways to create a DataFrame in Apache Spark SQL. This also applies to an empty Dataset, which may be useful if you prefer to deal with emptiness rather than missing values (null ...

Continue Reading →
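
Two of the possible ways are sketched below: one starts from a case class, the other from an explicit schema. EmptySketchRow and the object name are hypothetical, and the full article may list other variants.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical row type used only to carry the schema
case class EmptySketchRow(id: Int, letter: String)

object EmptyDataFrameSketch {
  // Empty DataFrame whose schema is derived from the case class.
  def fromCaseClass(spark: SparkSession): DataFrame = {
    import spark.implicits._
    spark.emptyDataset[EmptySketchRow].toDF()
  }

  // Empty DataFrame built from an empty RDD and an explicit schema.
  def fromSchema(spark: SparkSession): DataFrame = {
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("letter", StringType)
    ))
    spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]").appName("empty-dataframe-sketch").getOrCreate()
    fromCaseClass(spark).printSchema()
    fromSchema(spark).printSchema()
    spark.stop()
  }
}
```

spark.emptyDataFrame is a third option, but it comes without any column.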

How to use a User Defined Function in column-related operations in Apache Spark SQL?

Using a UDF in a SQL statement or programmatically is quite easy because you either use the function's name or simply call the object returned after the registration. Using it in column-based operat...

Continue Reading →
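
For the column-based case, a possible sketch is to wrap a Scala function with functions.udf and apply the returned object to a Column. The upper-casing function, the data and the names are invented for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object UdfColumnSketch {
  // Adds an "upper" column computed by a user defined function.
  def upperCased(spark: SparkSession): Seq[(String, String)] = {
    import spark.implicits._

    // The returned object can be applied directly to Column instances
    val toUpper = udf((letter: String) => letter.toUpperCase)

    Seq("a", "b").toDF("letter")
      .withColumn("upper", toUpper($"letter"))
      .orderBy($"letter")
      .as[(String, String)]
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]").appName("udf-column-sketch").getOrCreate()
    println(upperCased(spark))
    spark.stop()
  }
}
```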

How to show the generated code?

To debug your Apache Spark SQL programs, or even to better understand how they work, you can use the debugging features exposed through the org.apache.spark.sql.execution.debug package. One of them lets you se...

Continue Reading →
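
A short sketch of that package, assuming a trivial query built with spark.range. debugCodegen() prints the generated code to the standard output, while codegenString returns it as a String; the object and method names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.debug._

object GeneratedCodeSketch {
  // Returns the generated code of a simple filter as text.
  def generatedCode(spark: SparkSession): String = {
    val evenIds = spark.range(0, 10).filter("id % 2 = 0")
    codegenString(evenIds.queryExecution.executedPlan)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]").appName("generated-code-sketch").getOrCreate()
    // Same information, printed directly thanks to the debug implicits
    spark.range(0, 10).filter("id % 2 = 0").debugCodegen()
    spark.stop()
  }
}
```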

How to read data from nested directories in Apache Spark SQL?

Sometimes your data may be stored in a nested hierarchy, like: bartosz:/tmp/test-nested-wildcard$ tree . ├── 11 │ ├── 11.json │ └── 22 │ ├── 22a.json ...

Continue Reading →
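
The directory name in the teaser hints at wildcards, so here is a sketch of that idea with one glob per nesting level. It builds its own small temporary hierarchy with invented content, because the real listing is truncated; the object and helper names are hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import org.apache.spark.sql.SparkSession

object NestedDirectoriesSketch {
  private def writeJson(file: Path, content: String): Unit = {
    Files.createDirectories(file.getParent)
    Files.write(file, content.getBytes(StandardCharsets.UTF_8))
  }

  // Creates <tmp>/11/11.json and <tmp>/11/22/22a.json with made-up content.
  def createSampleTree(): Path = {
    val base = Files.createTempDirectory("nested-sketch")
    writeJson(base.resolve("11").resolve("11.json"), """{"id": 1}""")
    writeJson(base.resolve("11").resolve("22").resolve("22a.json"), """{"id": 2}""")
    base
  }

  // Reads the files of both levels by passing one glob per depth.
  def countRows(spark: SparkSession, base: Path): Long =
    spark.read.json(s"$base/*/*.json", s"$base/*/*/*.json").count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]").appName("nested-directories-sketch").getOrCreate()
    println(countRows(spark, createSampleTree()))
    spark.stop()
  }
}
```

Spark 3.0 and later also provide the recursiveFileLookup read option, which avoids listing every depth explicitly.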