One of the steps in my preparation for the GCP Data Engineer certificate was working through the book "Google BigQuery: The Definitive Guide: Data Warehousing, Analytics, and Machine Learning at Scale". And to be honest, I didn't expect that knowing Apache Spark would help me so much in understanding the architectural concepts. If you don't believe me, I will try to convince you in this blog post.
The pivot operation presented 2 weeks ago transforms some cells into columns. The reverse operation is called stack, and it's time to see how it works!
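To give a first idea before diving in, here is a minimal sketch of the stack function in Apache Spark; the dataset and column names are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Minimal illustration of stack; the dataset and column names are invented.
val spark = SparkSession.builder.appName("stack-demo").master("local[*]").getOrCreate()
import spark.implicits._

// A "pivoted" dataset: one column per month.
val pivoted = Seq(("user1", 10, 20), ("user2", 5, 15)).toDF("user", "jan", "feb")

// stack(n, label1, col1, label2, col2, ...) turns the month columns back into rows,
// producing one (month, amount) pair per listed column.
val stacked = pivoted.selectExpr("user", "stack(2, 'jan', jan, 'feb', feb) AS (month, amount)")
stacked.show()
```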
As you know from the last 2020 blog post, one of my new goals is to become proficient at working with AWS, Azure and GCP data services. One of the building blocks of that process is finding patterns and identifying the differences. Before doing that exercise for BigTable (GCP) and DynamoDB (AWS), I thought both were pretty much the same. However, you can't imagine how wrong I was with that assumption!
I wish I could say one day: "I optimized Apache Spark pipelines in all possible ways". But I'm aware of the reality and know that this can be very hard to achieve. That's why I decided to rely on the experience shared by seasoned Spark users at Spark+AI and, more recently, Data+AI Summit, and write a summary list of interesting optimization tips from past talks.
"DataOps", this term is present in my backlog since a while already and I postponed it multiple times. But I finally found some time to learn more about it and share my thoughts with you.
If you came to data engineering after a BI career, you certainly know what a pivot is. It was not my case, and I was quite amazed by this operation that transforms values from rows into columns. If you want to understand how it's possible, this article presents some internals of pivoting data in Apache Spark.
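As a teaser, here is a minimal pivot example; the sales dataset and column names are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

// Hypothetical sales dataset used only to illustrate pivot.
val spark = SparkSession.builder.appName("pivot-demo").master("local[*]").getOrCreate()
import spark.implicits._

val sales = Seq(("user1", "jan", 10), ("user1", "feb", 20), ("user2", "jan", 5))
  .toDF("user", "month", "amount")

// pivot turns the distinct values of "month" into columns; listing them
// explicitly avoids an extra job to collect the distinct values first.
val pivoted = sales.groupBy("user").pivot("month", Seq("jan", "feb")).agg(sum("amount"))
pivoted.show()
```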
One reason to consider a custom state store is performance, or rather the unpredictable execution time caused by the default state store implementation sharing memory with Apache Spark task execution. To overcome that, you can try to switch the state store implementation to an off-heap-based one, like RocksDB.
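As a rough sketch of what that switch looks like, the state store backend is selected with the spark.sql.streaming.stateStore.providerClass property; the provider class below is the RocksDB implementation bundled with Spark 3.2+, so for a third-party backend you would put its class name there instead.

```scala
import org.apache.spark.sql.SparkSession

// Switching the state store backend is a configuration change; the provider
// class below is the RocksDB implementation shipped with Spark 3.2+ — replace
// it with the provider class of a third-party backend if you use one.
val spark = SparkSession.builder
  .appName("rocksdb-state-store")
  .config("spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
  .getOrCreate()
```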
Very often you will find Apache Spark performance tips related to the hardware (memory, GC) or the configuration parameters (shuffle partitions number, broadcast join threshold). But they're not the only ones you can apply. Moreover, IMO, you should start with the ones presented in this article and optimize your pipeline code before moving on to more complicated hardware and configuration tuning.
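For reference, this is what the two configuration parameters mentioned above look like in practice; the values are arbitrary placeholders, not recommendations.

```scala
import org.apache.spark.sql.SparkSession

// The two configuration knobs mentioned above; the values are arbitrary
// placeholders, not tuning advice.
val spark = SparkSession.builder
  .appName("config-tuning")
  .config("spark.sql.shuffle.partitions", "400")               // shuffle partitions number
  .config("spark.sql.autoBroadcastJoinThreshold", "52428800")  // broadcast join threshold (50 MB)
  .getOrCreate()
```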
Last December I passed the GCP Data Engineer exam and got my certification as a late Christmas gift! As for the AWS Big Data specialty, I would like to share with you some feedback from my preparation process. Spoiler alert: I did it without any online course!
After the introductory part, it's time to share what I learned from the custom state store implementation.
It's time for the second update with the news on cloud data services. This time too, a lot of things happened!
After the previous introductory posts, it's time to delve into the state store API and implement our own custom state store.
Since there are already 2 open source implementations of a RocksDB state store, I decided to use another backend to illustrate how to customize the state store in Structured Streaming. Initially, I wanted to try Badger, which is the store behind the DGraph database, but I didn't find any Java-facing interface, and dealing with the Java Native Interface or any other wrapper was not an option. Fortunately, I ended up finding MapDB, a Kotlin-based - and hence Java-facing - embedded database.
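To give an idea of what MapDB feels like, here is a minimal, standalone sketch of its embedded key-value API, independent of the state store integration itself; the file path and map name are arbitrary.

```scala
import org.mapdb.{DBMaker, Serializer}

// A minimal, standalone look at MapDB's embedded key-value API; the file
// path and map name are arbitrary and unrelated to the state store code.
val db = DBMaker
  .fileDB("/tmp/mapdb-demo.db")
  .transactionEnable() // enables commit/rollback semantics
  .make()

val states = db.hashMap("states", Serializer.STRING, Serializer.STRING).createOrOpen()
states.put("key1", "value1")
db.commit()
db.close()
```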
The title of this blog post is perhaps one of the first problems you will encounter with PySpark (it was for me). Even though it's quite mysterious, it makes sense once you take a look at the root cause.
I don't know about you, but when I first saw code with the createTempView method, I thought it created a temporary table in the metastore. That's not true, and in this blog post you will see why.
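A quick way to check it yourself (a minimal sketch; the DataFrame and view name are arbitrary): the view shows up in the session catalog as a temporary object and is invisible to a new session, which wouldn't be the case for a metastore table.

```scala
import org.apache.spark.sql.SparkSession

// A temporary view lives in the session catalog, not in the metastore;
// the DataFrame and view name here are arbitrary.
val spark = SparkSession.builder.appName("temp-view-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "letter")
df.createTempView("letters")

// listTables shows the view as temporary, with no database attached.
spark.catalog.listTables().show()

// A new session does not see it, because the view is session-scoped.
val otherSession = spark.newSession()
println(otherSession.catalog.tableExists("letters")) // false
```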