Looking for something else? Check the categories of Storage:
Apache Avro Apache Cassandra Apache Hudi Apache Iceberg Apache Parquet Apache ZooKeeper Delta Lake Elasticsearch Embedded databases HDFS MySQL PostgreSQL Time series
If not, you can find all articles belonging to Storage below.
Data in Apache Parquet files is written against a specific schema. And wherever there is a schema, there are also data types for the fields composing it.
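To make that coupling concrete, here is a minimal sketch using parquet-mr's MessageTypeParser; the user schema itself is hypothetical:

```java
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class ParquetSchemaExample {
    public static void main(String[] args) {
        // Every field declares a repetition (required/optional/repeated)
        // and a primitive type, optionally refined by a logical type (UTF8)
        MessageType schema = MessageTypeParser.parseMessageType(
            "message user {\n" +
            "  required int64 id;\n" +
            "  optional binary login (UTF8);\n" +
            "  repeated int32 scores;\n" +
            "}");
        System.out.println(schema);
    }
}
```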
Very often an appropriate storage is as important as the data processing pipeline itself. And among the different possibilities, we can still store the data in files. Thanks to different formats, such as column-oriented ones, some actions on the read path can be optimized.
Some time ago I tried to create a Docker image with Cassandra and some other programs. For the "others" the operation was quite easy, but Cassandra caused some problems because of several configuration properties.
The recovery process in HDFS helps to achieve fault tolerance. It concerns the write pipeline as well as blocks.
In previous posts we could learn a lot about HDFS transaction logs, operations on closed files and so on. Thanks to that, we can now take a look at the data structures of the NameNode and the DataNode.
HDFS is not an exception in the Big Data world: like other actors, it also uses checkpoints.
HDFS is not a tool well suited to storing a lot of small files. Even so, some methods exist to handle small files better.
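One of those methods is packing many small payloads into a container format such as a SequenceFile, so the NameNode tracks one big file instead of thousands of tiny ones. A minimal sketch, assuming a reachable HDFS and hypothetical paths:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilesPacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Each small file becomes one (file name -> content) record
        // inside a single, large SequenceFile
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("/data/packed.seq")),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            byte[] content = "small file content".getBytes("UTF-8");
            writer.append(new Text("small_file_1.txt"), new BytesWritable(content));
        }
    }
}
```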
Implementing snapshots in a distributed file system is not a simple job. It must take into account different aspects, such as file deletion or content changes, and keep the file system consistent across them.
Hadoop 2.3.0 brought an in-memory caching layer to HDFS. Even if this is quite an old feature (released in 02/2014), it's still beneficial to know it.
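As a hedged preview of the centralized cache management API, here is a minimal sketch; it assumes fs.defaultFS points at an HDFS cluster, and the pool and path names are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;
import org.apache.hadoop.hdfs.protocol.CachePoolInfo;

public class HdfsCachingExample {
    public static void main(String[] args) throws Exception {
        // Caching is exposed through DistributedFileSystem, not the generic FileSystem
        DistributedFileSystem fs =
            (DistributedFileSystem) FileSystem.get(new Configuration());
        // A cache pool groups directives and carries permissions and limits
        fs.addCachePool(new CachePoolInfo("hot_data"));
        // The directive asks DataNodes to keep this path's blocks in memory
        fs.addCacheDirective(new CacheDirectiveInfo.Builder()
            .setPath(new Path("/warehouse/frequently_read"))
            .setPool("hot_data")
            .build());
    }
}
```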
Making an immutable distributed file system is easier than building a mutable one. HDFS, even if it was initially destined for unchanging data, supports mutability through 2 operations: append and truncate.
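A minimal sketch of both operations through Hadoop's FileSystem API; the file path is hypothetical, and truncate requires Hadoop 2.7+:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsMutabilityExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/events.log");
        // append: add bytes at the end of an existing file
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("new event\n");
        }
        // truncate: cut the file down to the given length; returns true
        // when the operation completed immediately, without block recovery
        boolean done = fs.truncate(file, 1024L);
        System.out.println("Truncate finished immediately: " + done);
    }
}
```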
The edit log would be useless without its complementary structure called the FSImage.
HDFS stores everything that happens in transaction log files. They're used during checkpoints and file system recovery, so they occupy quite an important place in the HDFS architecture.
Replicas and blocks are HDFS entities with more than 1 state in their lifecycle. Being aware of these states helps a lot in understanding what happens when a new file is written.
HDFS is a fault-tolerant and resilient distributed file system. That wouldn't be possible without, among other things, block replication.
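For illustration, a minimal sketch of how the replication factor can be set globally and changed per file; the path and factors are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.replication defines the default replication factor (3 by default)
        conf.set("dfs.replication", "3");
        FileSystem fs = FileSystem.get(conf);
        // The factor can also be changed per file, after the fact
        fs.setReplication(new Path("/data/important.csv"), (short) 5);
    }
}
```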
The previous article presented theoretical information about HDFS files. This post deepens the topic.
Files in HDFS are different from files in a local file system. They're fault-tolerant, can be stored in different abstractions, and are based on quite big blocks compared to the blocks of a local file system.
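To give a sense of scale, a minimal sketch: HDFS blocks are typically 128 MB, against the few kilobytes of a local file system block (the explicit dfs.blocksize override below is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // 128 MB blocks vs the ~4 KB blocks of a typical local file system
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Default block size: "
            + fs.getDefaultBlockSize(new Path("/")));
    }
}
```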
HDFS is one of the most popular distributed file systems nowadays. It differs from other, older distributed file systems thanks to its reliability.
Before writing some code against Apache Cassandra, we'll try to explore a very interesting dependency - cassandra-driver-mapping.
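As a hedged preview, a minimal sketch with the DataStax Java driver 3.x mapping module; the keyspace, table, and entity are hypothetical:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.mapping.Mapper;
import com.datastax.driver.mapping.MappingManager;
import com.datastax.driver.mapping.annotations.PartitionKey;
import com.datastax.driver.mapping.annotations.Table;

public class MappingExample {
    // Hypothetical entity mapped to the test.users table
    @Table(keyspace = "test", name = "users")
    public static class User {
        @PartitionKey
        private String login;
        private int age;
        // The mapper reads and writes fields through these accessors
        public String getLogin() { return login; }
        public void setLogin(String login) { this.login = login; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }

    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            Mapper<User> mapper = new MappingManager(session).mapper(User.class);
            User user = new User();
            user.setLogin("test_user");
            user.setAge(30);
            mapper.save(user);                   // INSERT generated by the mapper
            User read = mapper.get("test_user"); // SELECT by partition key
            System.out.println(read.getAge());
        }
    }
}
```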
I/O operations are slower than memory lookups. That's the reason why a memory cache helps to improve performance, in Cassandra too.
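For instance, key and row caching can be tuned per table; a minimal sketch (Cassandra 2.1+ caching syntax, the test.users table is hypothetical):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CassandraCachingExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // Cache all partition keys and up to 100 rows per partition
            session.execute("ALTER TABLE test.users WITH caching = "
                + "{'keys': 'ALL', 'rows_per_partition': '100'}");
        }
    }
}
```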
One of the interesting data types used in Apache Cassandra is collections. In our models we can freely use maps, sets, or lists.
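A minimal sketch showing one column of each collection type; the keyspace and table are hypothetical:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CollectionsExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("test")) {
            // list<int>, set<text> and map<text, int> columns side by side
            session.execute("CREATE TABLE IF NOT EXISTS players ("
                + "login text PRIMARY KEY, "
                + "scores list<int>, "
                + "teams set<text>, "
                + "stats map<text, int>)");
            // Collection literals: [..] for lists, {..} for sets and maps
            session.execute("INSERT INTO players (login, scores, teams, stats) "
                + "VALUES ('player_1', [10, 20], {'teamA'}, {'wins': 3})");
        }
    }
}
```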