One of the points I wanted to cover during my talk, but for which I didn't have enough time, was the dilemma between using a local deduplication and Apache Spark's dropDuplicates method to avoid integrating duplicated logs. That will be the topic of this post.
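To make the Spark side of that dilemma concrete, here is a minimal sketch with invented column names and log values, showing how dropDuplicates keeps a single row per combination of the listed columns:

```scala
import org.apache.spark.sql.SparkSession

object DropDuplicatesExample extends App {
  val spark = SparkSession.builder()
    .appName("dropDuplicates example").master("local[*]").getOrCreate()
  import spark.implicits._

  // hypothetical log entries; the first one was delivered twice
  val logs = Seq(
    ("user1", "home.html", "2019-10-01 10:00:00"),
    ("user1", "home.html", "2019-10-01 10:00:00"),
    ("user2", "cart.html", "2019-10-01 10:05:00")
  ).toDF("user_id", "page", "event_time")

  // keeps one row per (user_id, page, event_time) combination
  logs.dropDuplicates("user_id", "page", "event_time").show(false)
}
```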
When I was preparing the demo code for my talk about sessionization at Spark AI Summit 2019 in Amsterdam, I wrote the first version of the code with the DataFrame abstraction. I didn't have type safety but the data manipulation was quite clear thanks to the mapping. Later, I tried to rewrite the code with Dataset and I got type safety but sacrificed a little bit of clarity. Let me delve deeper into that in this post.
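To illustrate the trade-off with a simplified sketch (the Visit model below is hypothetical, not the code from the talk): the DataFrame version resolves columns by name at runtime, so a typo surfaces only during execution, while the Dataset version is checked by the compiler:

```scala
import org.apache.spark.sql.SparkSession

// hypothetical visit model, not the one used in the talk
case class Visit(userId: String, page: String)

object DataFrameVsDatasetExample extends App {
  val spark = SparkSession.builder()
    .appName("DataFrame vs Dataset").master("local[*]").getOrCreate()
  import spark.implicits._

  val visits = Seq(Visit("user1", "home.html"), Visit("user2", "cart.html"))

  // DataFrame: the column is resolved by name at runtime
  val userIdsFromDataFrame = visits.toDF()
    .map(row => row.getAs[String]("userId"))

  // Dataset: the field access is verified at compile time
  val userIdsFromDataset = visits.toDS()
    .map(visit => visit.userId)

  userIdsFromDataFrame.show(false)
  userIdsFromDataset.show(false)
}
```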
Exceptions are our daily pain, but exceptions that are hard to explain are even worse. I faced one of them the day I was integrating Apache Spark SQL on EMR.
This new post about Apache Spark SQL will give some hands-on use cases of date functions.
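As a teaser, here is a small example with an invented orders dataset using a few of those functions (to_date, date_add, datediff); the column names and the 7-day delivery delay are assumptions made for the illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, current_date, date_add, datediff, to_date}

object DateFunctionsExample extends App {
  val spark = SparkSession.builder()
    .appName("date functions").master("local[*]").getOrCreate()
  import spark.implicits._

  val orders = Seq(("order1", "2019-10-01"), ("order2", "2019-10-15"))
    .toDF("order_id", "creation_date")

  orders
    .withColumn("creation_date", to_date(col("creation_date"), "yyyy-MM-dd"))
    .withColumn("expected_delivery", date_add(col("creation_date"), 7)) // creation date + 7 days
    .withColumn("age_in_days", datediff(current_date(), col("creation_date")))
    .show(false)
}
```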
I wanted to write this post right after the one about aggregation modes, but I didn't. Before explaining the different aggregation strategies, I prefer to clarify the aggregation internals. It should help you better understand the next part.
At the end of 2018 I published a post about code generation in Apache Spark SQL where I answered the questions about who, when, how and what. But I omitted the "why", and cozos created an issue on my Github asking to complete the article. That's something I will try to do here.
There are 2 popular ways to come to the data engineering field. Either you were a software engineer fascinated by the data domain and its problems (as I was), or you evolved from a BI developer. The big advantage of the latter path is that these people spent a lot of time writing SQL queries, so their knowledge of SQL functions is much better than that of people from the first category. This post is written by an engineer who came to data from software and who discovered that aggregations are not only about simple arithmetic values but also about distributions and collections.
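To show what I mean, here is a minimal sketch with made-up visit data, where the aggregation produces collections (collect_list, collect_set) next to a classic arithmetic value:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, collect_list, collect_set}

object CollectionAggregationsExample extends App {
  val spark = SparkSession.builder()
    .appName("collection aggregations").master("local[*]").getOrCreate()
  import spark.implicits._

  val visits = Seq(
    ("user1", "home.html", 10), ("user1", "cart.html", 25),
    ("user2", "home.html", 5), ("user2", "home.html", 8)
  ).toDF("user_id", "page", "duration_seconds")

  visits.groupBy("user_id").agg(
    avg("duration_seconds").as("avg_duration"),      // classic arithmetic aggregate
    collect_list("page").as("visited_pages"),         // array keeping duplicates
    collect_set("page").as("distinct_visited_pages")  // deduplicated array
  ).show(false)
}
```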
Partitioning is the most popular method to divide a dataset into smaller parts. It's important to know that it can be complemented by another technique called bucketing.
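As a quick illustration of how the two techniques combine, here is a sketch with an invented orders dataset; the table name, schema and number of buckets are assumptions made for the example:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object PartitioningAndBucketingExample extends App {
  val spark = SparkSession.builder()
    .appName("partitioning and bucketing").master("local[*]").getOrCreate()
  import spark.implicits._

  val orders = Seq(
    ("order1", "user1", "2019-10-01"), ("order2", "user2", "2019-10-01"),
    ("order3", "user1", "2019-10-02")
  ).toDF("order_id", "user_id", "creation_date")

  orders.write
    .mode(SaveMode.Overwrite)
    .partitionBy("creation_date") // one directory per creation date
    .bucketBy(4, "user_id")       // 4 buckets per partition, hashed on user_id
    .sortBy("user_id")            // optionally sorts the rows inside every bucket
    // bucketing metadata has to live in the catalog, hence saveAsTable instead of save
    .saveAsTable("bucketed_orders")
}
```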
When I was preparing my talk about Apache Spark customization, I wanted to talk about User Defined Types. After some digging, I saw that there are some UDTs in the source code and one of them is VectorUDT. And it led me to the topic of this post, which is vectorization.
When I was writing the posts about Apache Spark SQL customization through extensions, I found a method to define custom catalog listeners. Since it was my first contact with this feature, I decided to explore it before playing with it.
Last time I presented ANTLR and how Apache Spark SQL uses it to convert textual SQL expressions into internal classes. In this post I will write a custom parser.
I started the series about Apache Spark SQL customization from the last parts of query execution, which are logical and physical plans. But you must know that before the framework generates these plans, it must first parse the query.
Last time I presented the basics of code generation in the physical plans of Apache Spark SQL. This time I will try to write a physical plan executing a UNION operation as a JOIN, without code generation.
I'm very happy when readers comment on my posts or tweets. A lot of those discussions become the topics of new posts. That's the case here, where I try to figure out whether the Apache Spark SQL Avro source is compatible with other applications using this serialization format.
In my previous post, I explained how to implement a custom physical plan execution. However, that first version didn't use generated code, which is also an interesting option for customizing Apache Spark. And it's the feature that I will cover in this post.
If you follow the Apache Spark SQL category on my blog, you can see a lot of posts about customizing this framework. After the recently published posts about custom logical rules, it's time to explore another part: planner strategies.
Several weeks ago, when I was checking new "apache-spark" tagged questions on StackOverflow, I found one that caught my attention. The author was saying that the randomSplit method doesn't divide the dataset equally and that, after merging the splits back, the number of lines was different. Even though I wasn't able to answer at that moment, I decided to investigate this function and find the possible reasons for that error.
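For context, here is a minimal sketch of the method in question, on an artificial dataset; the comments only summarize the behavior investigated in the post:

```scala
import org.apache.spark.sql.SparkSession

object RandomSplitExample extends App {
  val spark = SparkSession.builder()
    .appName("randomSplit example").master("local[*]").getOrCreate()
  import spark.implicits._

  val numbers = (1 to 1000).toDF("number")

  // the weights are normalized, so 0.8/0.2 means roughly 80% and 20% of the rows
  val Array(training, test) = numbers.randomSplit(Array(0.8, 0.2), seed = 1L)

  // the split is probabilistic, so the sizes are only approximately 800 and 200;
  // the splits may also be recomputed between actions, which is where the
  // surprises described in the question can come from
  println(s"training=${training.count()}, test=${test.count()}, total=${numbers.count()}")
}
```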
Some time ago I was involved in a discussion about testing Apache Spark SQL code. In this post, I would like to share my observations about this topic.
In the previous post about Apache Spark SQL custom optimizations, I presented a rule transforming the UNION operator into a JOIN. At that time I only wrote a simple version working with 2 datasets. In this post, I will share its improved version.
The most popular partitioning strategy divides the dataset by a hash computed from one or more values of the record. However, other partitioning strategies exist as well, and one of them is range partitioning, implemented in Apache Spark SQL with the repartitionByRange method described in this post.
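As a preview, here is a small sketch on an artificial dataset showing that repartitionByRange assigns contiguous ranges of the sort key to every partition:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, min, spark_partition_id}

object RepartitionByRangeExample extends App {
  val spark = SparkSession.builder()
    .appName("repartitionByRange example").master("local[*]").getOrCreate()
  import spark.implicits._

  val numbers = (1 to 100).toDF("number")

  // every target partition receives a contiguous interval of "number" values
  numbers.repartitionByRange(4, $"number")
    .withColumn("partition_id", spark_partition_id())
    .groupBy("partition_id")
    .agg(min("number").as("min_number"), max("number").as("max_number"))
    .orderBy("partition_id")
    .show(false)
}
```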