Several weeks ago I played with watermarks, just to recall some concepts. I wrote a query and... the watermark didn't work! Of course, my query was wrong, but this intrigued me enough to write this short article.
When I was playing with my data-generator and Apache Spark Structured Streaming, I was surprised by one behavior that I would like to share and explain in this post. Without delving into the details right now, the story will be about the use of nested structures in several operations.
After my January talk about Apache Kafka integration in Structured Streaming, I got an interesting question offline. The question was: how do you process 2 topics simultaneously with Structured Streaming? The "small" problem was the fact that both had different schemas.
The _SUCCESS file, generated by Apache Spark SQL when you successfully generate a dataset, often puzzles newcomers. Why does the framework need this file? How is it generated? I will cover these aspects in this article.
A few weeks ago, when I was preparing a talk for a local meetup, I wanted to list the most common operations we can do with Spark for newcomers. And I found one I hadn't used before, namely sortWithinPartitions.
Some time ago I watched an interesting Devoxx France 2019 talk about poison pills in streaming systems presented by Loïc Divad. I learned a few interesting patterns, like the sentinel value, that may help to deal with corrupted data, but the talk was focused on Kafka Streams. And since I didn't find a corresponding resource for Apache Spark Structured Streaming [and also because I'm simply an Apache Spark enthusiast ;)], I decided to write one, trying to implement Loïc's ideas in the Structured Streaming world.
I didn't know that join reordering is quite an interesting, though complex, topic in Apache Spark SQL. Queries can not only be transformed into ones using JOIN ... ON clauses; they can also be reordered according to the star schema, which we'll try to see in this post.
In my previous post I explained how Apache Spark can reorder JOINs based on the logical plan. Today I'll focus on another aspect of reordering which uses cost estimation for the proposed plans.
One of the reasons why I like my blogging activity is that from time to time the exchange is bidirectional. It happens mostly on GitHub but also in the comments under the posts, and I appreciate the situation when I don't know the answer and must dig a little to explain it in a blog post :) I wrote this one thanks to an issue bithw1 created on my Spark playground repository (thank you for another interesting question btw :)).
Even though I've already written a few posts about Apache Kafka as a data source in Apache Spark Structured Streaming, I still had some questions in my head. In this post I will try to answer them and leave this Kafka integration in Spark topic for later investigation.
Containers have been with us, data engineers, for several years. The concept was already introduced on YARN, but the technology that really made them popular was Docker. In this post I will focus on Docker's recommended practices to make our Apache Spark images better.
When I discovered microk8s I was delighted! An easy installation in very few steps and you can start to play with Kubernetes locally (tried on Ubuntu 16). However, running Apache Spark 2.4.4 on top of microk8s is not a piece of cake. In this post I will show you 4 different problems you may encounter, and propose possible solutions.
I've written a lot about data sources, including Apache Kafka. However, Apache Spark is not only about sources but also about targets, called sinks. In this post I will focus on Apache Kafka sink integration and try to answer some questions in FAQ mode.
On the one hand, I appreciate JSON for its flexibility, but on the other, I hate it for exactly the same thing. It's particularly painful when you work on a project without good data governance. The most common pain is an inconsistent field type - Spark can manage that by getting the most common type. Unfortunately, it's a little bit trickier for less common problems, for instance when the same field is written with different letter casing.
Under one of my posts I got an interesting question about Apache Spark ignoring the maxPartitionBytes configuration entry for text-based data sources. In this post I will try to answer it.
When the unit tests work on "your machine" but fail on your colleague's, you know you did something wrong. When the failures are not about test assertions but about technical reasons, the "something wrong" transforms into "something strange". And it may happen with Apache Spark as well.
Have you ever wondered why, when you write "2019-05-10T20:00", Apache Spark considers it a timestamp field? Defining it as a TimestampType is one of the reasons, but another question here is how Apache Spark converts a string into the timestamp type. I will give you some hints in this blog post.
In my previous post I showed you the writing and reading parts of my custom state store implementation. Today it's time to cover the data reprocessing and also the limits of the solution.
In my previous post I introduced the classes involved in the interactions with the state store, and also showed the big picture of the implementation. Today it's time to write some code :)
In my latest Spark+AI Summit 2019 follow-up posts I'm implementing a custom state store. The extension is inspired by the default state store. While analyzing the code, one of the places that intrigued me was the put(key: UnsafeRow, value: UnsafeRow) method. Keep reading if you're curious why.