Below you can find all articles belonging to Storage.
In recent posts about Gnocchi we have often come across the concept of an archive policy. However, as one of the main concepts in this system, it deserves its own explanation.
Gnocchi writes data partitioned by split key. But such split data must often be merged back for read operations. This post focuses on the "how" and "when" of this process.
To facilitate parallel processing, Apache Spark and Apache Kafka have their concept of partitions, Apache Beam works with bundles, and Gnocchi deals with sacks. Despite the different naming, sacks are for Gnocchi what partitions are for Spark or Kafka - the unit of work parallelization.
Even though carbonara is mostly known as an Italian pasta dish, in the context of Gnocchi it means something completely different. Carbonara is the name of the time point storage format in Gnocchi.
One of the reasons behind the choice of Gnocchi as the time series database to study was its natively provided horizontal scalability. At the moment of making that choice I was relying only on the official documentation. Now it's a good moment to come back and analyze that horizontal scalability myself.
Understanding the architecture is key to working properly with any distributed system. That's why the series of posts about Gnocchi starts by exploring its components.
In order to learn a new thing, there is nothing better than trying it. However, in some cases the choice of the tool to study is not easy. This is especially true in the context of data storage, and thus also in the context of the time series databases introduced in one of the previous posts.
Temporal data is a little bit particular. It can be generated very frequently, for instance every 500 ms or less. It's therefore important to store it efficiently and to allow quick and flexible reads. It's also important to know the specificities of time series as a popular kind of temporal data.
When I was learning about secondary indexes in Cassandra, I found a mention of a special Cassandra algorithm used for range and secondary index queries. After some time spent exploring the secondary index mechanism, it's a good moment to discover the algorithm that makes it work.
Last time we discovered the different encoding methods available in Apache Parquet. But encoding is not the only technique that helps to reduce file size. The other, very similar, one is compression.
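As a quick illustration of that relationship, here is a minimal sketch assuming the pyarrow library (not part of the original post, and the file names are made up): the columns are first encoded (dictionary encoding is on by default) and the chosen codec then compresses the encoded pages.

```python
# Minimal sketch, assuming pyarrow: compression is applied on top of the encoded pages.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "event_time": pa.array([1, 2, 3], type=pa.int64()),
    "value": pa.array([10.5, 11.0, 9.8], type=pa.float64()),
})

# Same data, same encodings - only the compression codec differs.
pq.write_table(table, "events_snappy.parquet", compression="snappy")
pq.write_table(table, "events_uncompressed.parquet", compression="none")
```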
Working with nested structures is a challenge in column-oriented storage. However, thanks to Google's Dremel solution, this task can be handled efficiently.
Efficient data storage is one of the keys to a good storage format. One of the methods helping to achieve it is appropriate encoding, and Parquet comes with several different encodings.
When I started to play with Apache Parquet I was surprised by its 2 versions of writers. Before approaching the rest of the planned topics, it's a good moment to explain these different versions better.
Previously we focused on the types available in Parquet. This time we can move forward and analyze how the framework stores the data in files.
Data in Apache Parquet files is written against a specific schema. And wherever there is a schema, there are also data types for the fields composing it.
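A minimal sketch, again assuming pyarrow and purely hypothetical field names, showing that a typed schema is declared (or inferred) before any row is written and travels with the file:

```python
# Minimal sketch, assuming pyarrow: every Parquet file carries an explicit, typed schema.
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    pa.field("user_id", pa.int64()),
    pa.field("login", pa.string()),
    pa.field("active", pa.bool_()),
])

table = pa.table(
    {"user_id": [1, 2], "login": ["alice", "bob"], "active": [True, False]},
    schema=schema,
)
pq.write_table(table, "users.parquet")

# The schema (with its field types) can be read back without loading the data.
print(pq.read_schema("users.parquet"))
```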
Very often appropriate storage is as important as the data processing pipeline. And among the different possibilities, we can still store the data in files. Thanks to different formats, such as column-oriented ones, some actions in the read path can be optimized.
Some time ago I tried to create a Docker image with Cassandra and some other programs. For the "others" the operation was quite easy, but Cassandra caused some problems because of several configuration properties.
The recovery process in HDFS helps to achieve fault tolerance. It concerns the worker pipeline as well as blocks.
In the previous posts we learned a lot about HDFS transaction logs, operations on closed files, and so on. Thanks to that, we can now take a look at the data structures of the NameNode and DataNode.
HDFS is not an exception in the Big Data world and, like other actors, it also uses checkpoints.