I like writing code, and I'm very happy each time there is a data processing job with some business logic to write. However, with time I've learned to appreciate the Open Source contributions that enhance my daily work. The Mack library, the topic of this blog post, is one of those recently discovered projects.
Compaction is also a feature present in Apache Iceberg. However, it works a little differently from the Delta Lake compaction presented last time. Why? Let's see in this new blog post!
The small files problem is well known in data systems. Fortunately, modern table file formats have built-in features to address it. In this new series we'll see how.
It's time to start the 4th part of the Table file formats series. This time the topic will be Change Data Capture, that is, how to stream all changes made to the table. As for the 3rd part, I'm going to start with Delta Lake.
After Delta Lake and Apache Iceberg, it's time to see the reading part of Apache Hudi. Despite an apparent similarity with the aforementioned table formats, Apache Hudi has an interesting reading particularity related to its different table types.
Last week you could read about data reading in Delta Lake. Today it's time to cover this part in Apache Iceberg!
In the previous blog post about Delta Lake you discovered the logic for the writing part. In the meantime, Delta Lake 2 was released, and it's this brand new version that I'm going to use to share some findings related to data reading.
It's time for the last data generation part of the ACID file formats series. This time we'll see how Delta Lake writes new files.
Last time you discovered data writing in Apache Hudi. Today it's time to see the 2nd file format from my list, Apache Iceberg.
It's only when I was preparing the 2nd blog post of the series that I realized how bad my initial plan was. The article you're currently reading had been initially planned as the 6th of the series. But indeed, how could we understand more advanced features without discovering the writing path first?
One reason to consider a custom state store is performance, or rather the unpredictable execution time caused by the memory shared between the default state store implementation and Apache Spark task execution. To overcome that, you can try to switch the state store implementation to an off-heap-based one, like RocksDB.
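As a minimal sketch of that switch, the snippet below only builds the configuration entry and renders it as `spark-submit` arguments. It assumes Apache Spark 3.2 or later, where a RocksDB provider class is bundled with Structured Streaming; the helper names `state_store_conf` and `to_submit_args` are hypothetical and exist only for this illustration.

```python
# Provider class bundled with Apache Spark 3.2+ (assumption: your Spark
# version ships it; older versions need a third-party implementation).
ROCKSDB_PROVIDER = (
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider"
)


def state_store_conf(provider_class=ROCKSDB_PROVIDER):
    """Return the Spark setting that selects the state store implementation."""
    return {"spark.sql.streaming.stateStore.providerClass": provider_class}


def to_submit_args(conf):
    """Render a configuration dict as spark-submit '--conf key=value' pairs."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_submit_args(state_store_conf())
print(" ".join(submit_args))
```

The same key can be set on the session builder instead of the command line. Passing the class name of a custom provider to `state_store_conf` is how a home-made state store would be plugged in.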
Since there are already 2 Open Source implementations of a RocksDB state store, I decided to use another backend to illustrate how to customize the state store in Structured Streaming. Initially, I wanted to try Badger, which is the store behind the Dgraph database, but I didn't find any Java-facing interface, and dealing with the Java Native Interface or any other wrapper was not an option. Fortunately, I ended up finding MapDB, a Kotlin-based embedded database that therefore exposes a Java-facing interface.
At first glance, managing user access in PostgreSQL is easy: you simply execute a CREATE USER, grant some privileges, assign a role, and often that's all. However, after some time "permission denied" errors can appear as new objects are created that the user has no privileges on. To mitigate the maintenance burden in that case, PostgreSQL provides the ALTER DEFAULT PRIVILEGES command.
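To make the idea concrete, here is a small sketch that only builds the SQL text of such a statement; it does not connect to a database. The role and schema names (`app_owner`, `reporting_user`, `public`) and the helper `default_privileges_sql` are hypothetical, and only table privileges are covered.

```python
import re

# Table privileges accepted by this illustrative helper.
TABLE_PRIVILEGES = {"SELECT", "INSERT", "UPDATE", "DELETE", "ALL"}
IDENTIFIER = re.compile(r"^[a-z_][a-z0-9_]*$")


def default_privileges_sql(owner, schema, privilege, grantee):
    """Build an ALTER DEFAULT PRIVILEGES statement for future tables.

    Tables created later by `owner` in `schema` will automatically
    carry `privilege` for `grantee`.
    """
    for name in (owner, schema, grantee):
        if not IDENTIFIER.match(name):
            raise ValueError(f"unsupported identifier: {name!r}")
    if privilege not in TABLE_PRIVILEGES:
        raise ValueError(f"unsupported privilege: {privilege!r}")
    return (
        f"ALTER DEFAULT PRIVILEGES FOR ROLE {owner} IN SCHEMA {schema} "
        f"GRANT {privilege} ON TABLES TO {grantee};"
    )


statement = default_privileges_sql("app_owner", "public", "SELECT", "reporting_user")
print(statement)
```

Note that default privileges apply only to objects created after the statement runs; existing tables still need an explicit GRANT.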
Data processing in Gnocchi is strongly tied to the index information. Metrics and resources, covered just below, are among its most valuable assets.
The specificity of Gnocchi is the precomputation of the measures. It doesn't allow ad-hoc queries but, on the other hand, provides pretty good reading performance. However, as new time series points come in, the old ones aren't kept indefinitely.
In the recent posts about Gnocchi we often came across the concept of archive policy. As one of the main points of this system, it merits its own explanation.
Gnocchi writes data partitioned by split key. But such split data must often be merged back for reading operations. This post focuses on the "how" and "when" of this process.
To facilitate parallel processing, Apache Spark and Apache Kafka have their concept of partitions, Apache Beam works with bundles, and Gnocchi deals with sacks. Despite the different naming, sacks are to Gnocchi what partitions are to Spark or Kafka - the unit of work parallelization.
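The partition analogy can be sketched in a few lines. The code below is an illustration, not Gnocchi's actual code: it assumes a metric is placed in a sack by taking its UUID modulo the number of sacks, uses 128 sacks as an assumed value, and distributes sacks to workers with a simple round-robin rule that is purely hypothetical.

```python
import uuid
from collections import defaultdict

NUM_SACKS = 128  # assumed value for this sketch


def sack_for_metric(metric_id, num_sacks=NUM_SACKS):
    """Map a metric UUID to a sack number (assumed modulo-based placement)."""
    return metric_id.int % num_sacks


def group_by_sack(metric_ids, num_sacks=NUM_SACKS):
    """Group metrics by sack, the unit a worker picks up as a whole."""
    sacks = defaultdict(list)
    for metric_id in metric_ids:
        sacks[sack_for_metric(metric_id, num_sacks)].append(metric_id)
    return dict(sacks)


def sacks_for_worker(worker_index, num_workers, num_sacks=NUM_SACKS):
    """Hypothetical round-robin assignment of sacks to workers."""
    return [s for s in range(num_sacks) if s % num_workers == worker_index]


metrics = [uuid.UUID(int=i) for i in (1, 129, 130)]
grouped = group_by_sack(metrics)
print({sack: len(ids) for sack, ids in grouped.items()})
```

Because every measure of a given metric lands in the same sack, two workers holding different sacks never process the same metric concurrently, which is the same guarantee partitions give in Spark or Kafka.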
Even though carbonara is mostly known as an Italian pasta dish, in the context of Gnocchi it means a completely different thing. Carbonara is the name of the storage format for time series points in Gnocchi.
One of the reasons behind the choice of Gnocchi as the time series database to study was its built-in horizontal scalability. At the moment of making that choice I was relying only on the official documentation. Now is a good moment to come back and analyze the horizontal scalability myself.