Spark SQL supports many standard SQL operations, including the IN clause. It is easy to use once you import the implicits of the created SparkSession object: private val spark...
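As a minimal sketch of the IN clause mentioned above (not the code from the post itself, which is cut off here), the example below filters a small in-memory DataFrame first with `Column.isin` and then with a plain SQL `IN`. The `letters` dataset, its column names, and the local master are assumptions made only for illustration.

```scala
import org.apache.spark.sql.SparkSession

object InClauseSketch {
  def main(args: Array[String]): Unit = {
    // Local session used only for this illustration
    val spark = SparkSession.builder()
      .appName("in-clause-sketch")
      .master("local[*]")
      .getOrCreate()
    // Brings the $"..." column syntax and toDF into scope
    import spark.implicits._

    // Hypothetical sample data
    val letters = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "letter")

    // Programmatic counterpart of SQL's IN: keeps rows whose id is 1 or 3
    letters.filter($"id".isin(1, 3)).show()

    // The same filter expressed as a SQL query on a temporary view
    letters.createOrReplaceTempView("letters")
    spark.sql("SELECT id, letter FROM letters WHERE id IN (1, 3)").show()

    spark.stop()
  }
}
```

Both forms should resolve to the same filter; which one to use is mostly a matter of whether the surrounding code is SQL-based or programmatic.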
Sometimes the entries in a processed dataset can be duplicates. For instance, an IoT device can send the same metrics more than once by mistake, or our ingestion step can format the message badly. Spark...
In your daily work you have certainly encountered the "Table or view not found" problem. The error happens in situations like this: val customerIds = JoinHelper.insertCustomers(1) JoinHelper.insertOrde...
Nested data structures are an interesting solution for data organization. They let us bring values with similar characteristics into common logical groups. Most of the time it improves raw data readabilit...
A schema helps to ensure good data quality and to optimize data exploration. However, in some cases defining a schema may be hard. That's especially true if your dataset has a lot of nested levels which...
I appreciate Apache Spark SQL because you can use it either as a data engineer, with some programmatic logic, or as a data analyst, only by writing SQL queries. And sometimes writing these queries can...
Apache Spark SQL lets us manipulate JSON fields in many different ways. One of the features is field extraction from a stringified JSON with the json_tuple(json: Column, fields: String*) function: ...
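To illustrate the `json_tuple` signature quoted above, here is a small sketch that extracts two top-level fields from a stringified JSON column. The `events` dataset and the `user` and `action` field names are hypothetical, chosen only for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.json_tuple

object JsonTupleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("json-tuple-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stringified JSON documents
    val events = Seq(
      """{"user": "u1", "action": "click"}""",
      """{"user": "u2", "action": "view"}"""
    ).toDF("json")

    // json_tuple produces one output column per requested field;
    // as(Seq(...)) names them instead of keeping the generated column names
    val extracted = events.select(
      json_tuple($"json", "user", "action").as(Seq("user", "action"))
    )
    extracted.show()

    spark.stop()
  }
}
```

Note that `json_tuple` reads top-level fields and returns them as strings, so nested values usually need a further parsing step.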
There are different ways to create a DataFrame in Apache Spark SQL. This also applies to an empty Dataset, which may be useful if you prefer to deal with emptiness rather than missing values (null ...
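A few ways of building an empty Dataset or DataFrame can be sketched as follows. This is an illustration under assumptions, not necessarily the list of approaches the post goes on to describe; the `Customer` case class and its fields are hypothetical.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hypothetical record type, declared at the top level so an Encoder can be derived
case class Customer(id: Long, name: String)

object EmptyDatasetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("empty-dataset-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Typed empty Dataset; the schema comes from the case class
    val emptyTyped = spark.emptyDataset[Customer]

    // Same result, starting from an empty local collection
    val emptyFromSeq = Seq.empty[Customer].toDS()

    // Untyped empty DataFrame with an explicitly declared schema
    val schema = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("name", StringType, nullable = true)
    ))
    val emptyWithSchema = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

    // Empty DataFrame without any column
    val emptyNoSchema = spark.emptyDataFrame

    emptyTyped.printSchema()
    emptyFromSeq.printSchema()
    emptyWithSchema.printSchema()
    emptyNoSchema.printSchema()

    spark.stop()
  }
}
```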
Using a UDF in a SQL statement or programmatically is quite easy because you either refer to the function's name or simply call the object returned after the registration. Using it in column-based operat...
To debug your Apache Spark SQL programs, or even to better understand how they work, you can use the debugging features exposed through the org.apache.spark.sql.execution.debug package. One of them lets you se...
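As a hedged sketch of how this package is typically used, importing its members adds `debug()` and `debugCodegen()` to any Dataset. The query below is a made-up example, and the feature the truncated sentence above refers to may be a different one.

```scala
import org.apache.spark.sql.SparkSession
// Brings the implicit debugging helpers for Datasets into scope
import org.apache.spark.sql.execution.debug._

object DebugPackageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("debug-package-sketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical query used only to have a physical plan to inspect
    val evenNumbers = spark.range(0, 10).filter("id % 2 = 0")

    // Prints the Java code generated for the whole-stage codegen subtrees
    evenNumbers.debugCodegen()

    // Executes the query and prints per-operator debugging information
    evenNumbers.debug()

    spark.stop()
  }
}
```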
Sometimes your data may be stored in a nested hierarchy, like: bartosz:/tmp/test-nested-wildcard$ tree . ├── 11 │ ├── 11.json │ ├── 22 │ ├── 22a.json │...