Storage articles

Looking for something else? Check the categories of Storage:

Apache Avro Apache Cassandra Apache Hudi Apache Iceberg Apache Parquet Apache ZooKeeper Delta Lake Elasticsearch Embedded databases HDFS MySQL PostgreSQL Time series

If not, below you can find all articles belonging to Storage.

ACID file formats - writing: Delta Lake

It's time for the last data generation part of the ACID file formats series. This time we'll see how Delta Lake writes new files.

Continue Reading →

ACID file formats - writing: Apache Iceberg

Last time you discovered data writing in Apache Hudi. Today it's time to see the 2nd file format from my list, Apache Iceberg.

Continue Reading →

ACID file formats - writing: Apache Hudi

It's only when I was preparing the 2nd blog post of the series that I realized how bad my initial plan was. The article you're currently reading had been initially planned as the 6th of the series. But indeed, how could we understand more advanced features without discovering the writing path first?

Continue Reading →

Data+AI Summit follow-up post: Why RocksDB rocks?

One reason why you can think about using a custom state store is the performance issues, or rather unpredictable execution time due to the shared memory between the default state store implementation and Apache Spark task execution. To overcome that, you can try to switch the state store implementation to an off-heap-based one, like RocksDB.

Continue Reading →

Data+AI follow-up: Introduction to MapDB

Since there are already 2 Open Source implementations for RocksDB state store, I decided to use another backend to illustrate how to customize the state store in Structured Streaming. Initially, I wanted to try with Badger which is the store behind DGraph database but didn't find any Java-facing interface and dealing with the Java Native Interface or any other wrapper, was not an option. Fortunately, I ended up by finding MapDB, a Kotlin-based - hence a Java-facing interface - embedded database.

Continue Reading →

ALTER DEFAULT PRIVILEGES in PostgreSQL

At first glance, managing users access in PostgreSQL is easy, you simply execute a CREATE USER, give him some grants, assign a role, and often that's all. However, after some time "permission denied" errors can appear as new objects are created and not owned by the user. To mitigate the maintenance burden for that case, PostgreSQL proposes ALTER DEFAULT privileges operator.

Continue Reading →

Resources and metrics in Gnocchi

Data processing in Gnocchi is strongly related to the index information. One of such valuable assets are metrics and resources, covered just below.

Continue Reading →

Cleaning old measures in Gnocchi

The specificity of Gnocchi is the precomputation of the measures. It doesn't allow ad-hoc queries but in the other side provides pretty good reading performance. However, as new time series points are coming, the old ones aren't kept with them.

Continue Reading →

Archive policy in Gnocchi

In the recent posts about Gnocchi we could often meet the concept of archive policy. However, as one of the main points in this system, it merits its own explanation.

Continue Reading →

Reading aggregates in Gnocchi

Gnocchi writes data partitioned by split key. But often such splitted data must be merged back for reading operations. This post focuses on "how" and "when" of this process.

Continue Reading →

Sacks - data parallelization unit in Gnocchi

To facilitate parallel processing Apache Spark and Apache Kafka have their concept of partitions, Apache Beam works with bundles and Gnocchi deals with sacks. Despite the different naming, the sacks are the same for Gnocchi as the partitions for Spark or Kafka - the unit of work parallelization.

Continue Reading →

Carbonara storage format

Even though carbonara is mostly known as an Italian pasta dish, in the context of Gnocchi it means completely different thing. Carbonara is the name of time points storage format in Gnocchi.

Continue Reading →

Horizontal scalability in Gnocchi

One of the reasons behind the choice of Gnocchi as time series database to study was its naturally provided horizontal scalability. At the moment of making that choice I was relying only on the official documentation. Now it's a good moment to come back and analyze the horizontal scalability by myself.

Continue Reading →

Gnocchi architecture

Understanding the architecture is the key of working properly with any distributed system. It's why the series of post about Gnocchi starts by exploring its components.

Continue Reading →

Choosing time-series database for study

In order to learn a new thing, nothing better than try it. However in some cases the choice of the tool to study is not easy. It's especially true in the context of data storage and though also in the context of time-series databases introduced in one of previous posts.

Continue Reading →

Time series - general notes

Temporal data is a little bit particular. It can be generated very frequently, as for instance every 500 ms or less. It's then important to store it efficiently and to allow quick and flexible reads. It's also important to know the specificities of time-series as a popular case of temporal data.

Continue Reading →

Range query algorithm in Apache Cassandra

When I was learning about the secondary index in Cassandra, I've found the mention of special Cassandra's algorithm used to range and secondary index queries. After some time passed on exploring secondary index mechanism, it's a good moment to discover the algorithm making it work.

Continue Reading →

Compression in Parquet

Last time we've discovered different encoding methods available in Apache Parquet. But the encoding is not the single technique helping to reduce the size of files. The other one, very similar, is the compression.

Continue Reading →

Nested data representation in Parquet

Working with nested structures appears as a problem in column-oriented storage. However, thanks to Google's Dremel solution, this task can be solved efficiently.

Continue Reading →

Encodings in Apache Parquet

An efficient data storage is one of success keys of a good storage format. One of methods helping to improve that is an appropriate encoding and Parquet comes with several different methods.

Continue Reading →