Looking for something else? Check the categories of Storage:
Apache Avro Apache Cassandra Apache Hudi Apache Iceberg Apache Parquet Apache ZooKeeper Delta Lake Elasticsearch Embedded databases HDFS MySQL PostgreSQL Time series
If not, you can find all articles belonging to Storage below.
Data in Apache Parquet files is written against a specific schema. And wherever there is a schema, there are also data types for the fields composing it.
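To make that coupling concrete, here is a minimal sketch using parquet-mr's MessageTypeParser; the user schema itself is hypothetical:

```java
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class ParquetSchemaExample {
    public static void main(String[] args) {
        // Every field declares a repetition (required/optional/repeated)
        // and a primitive type, optionally refined by a logical type (UTF8)
        MessageType schema = MessageTypeParser.parseMessageType(
            "message user {\n" +
            "  required int64 id;\n" +
            "  optional binary login (UTF8);\n" +
            "  repeated int32 scores;\n" +
            "}");
        System.out.println(schema);
    }
}
```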
Very often an appropriate storage is as important as the data processing pipeline itself. And among the different possibilities, we can still store the data in files. Thanks to different formats, such as column-oriented ones, some actions on the read path can be optimized.
Some time ago I tried to create a Docker image with Cassandra and some other programs. For the "others" the operation was quite easy, but Cassandra caused some problems because of several configuration properties.
The recovery process in HDFS helps to achieve fault tolerance. It concerns the write pipeline as well as blocks.
In previous posts we could learn a lot about HDFS transaction logs, operations on closed files and so on. Thanks to that, we can now take a look at the data structures of the NameNode and the DataNode.
HDFS is not an exception in the Big Data world: like other actors, it also uses checkpoints.
HDFS is not a tool well suited to storing a lot of small files. Even so, some methods exist to handle small files better.
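One of those methods is packing many small payloads into a container format such as a SequenceFile, so the NameNode tracks one big file instead of thousands of tiny ones. A minimal sketch, assuming a reachable HDFS and hypothetical paths:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilesPacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Each small file becomes one (file name -> content) record
        // inside a single, large SequenceFile
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("/data/packed.seq")),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            byte[] content = "small file content".getBytes("UTF-8");
            writer.append(new Text("small_file_1.txt"), new BytesWritable(content));
        }
    }
}
```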
Implementing snapshots in a distributed file system is not a simple job. It must take into account different aspects, such as file deletion or content changes, and keep the file system consistent across them.
Hadoop 2.3.0 brought an in-memory caching layer to HDFS. Even if this is quite an old feature (released in 02/2014), it's still beneficial to know it.
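As a hedged preview of the centralized cache management API, here is a minimal sketch; it assumes fs.defaultFS points at an HDFS cluster, and the pool and path names are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;
import org.apache.hadoop.hdfs.protocol.CachePoolInfo;

public class HdfsCachingExample {
    public static void main(String[] args) throws Exception {
        // Caching is exposed through DistributedFileSystem, not the generic FileSystem
        DistributedFileSystem fs =
            (DistributedFileSystem) FileSystem.get(new Configuration());
        // A cache pool groups directives and carries permissions and limits
        fs.addCachePool(new CachePoolInfo("hot_data"));
        // The directive asks DataNodes to keep this path's blocks in memory
        fs.addCacheDirective(new CacheDirectiveInfo.Builder()
            .setPath(new Path("/warehouse/frequently_read"))
            .setPool("hot_data")
            .build());
    }
}
```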
Making an immutable distributed file system is easier than building a mutable one. HDFS, even if it was initially destined for unchanging data, supports mutability through 2 operations: append and truncate.
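A minimal sketch of both operations through Hadoop's FileSystem API; the file path is hypothetical, and truncate requires Hadoop 2.7+:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsMutabilityExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/events.log");
        // append: add bytes at the end of an existing file
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("new event\n");
        }
        // truncate: cut the file down to the given length; returns true
        // when the operation completed immediately, without block recovery
        boolean done = fs.truncate(file, 1024L);
        System.out.println("Truncate finished immediately: " + done);
    }
}
```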
The edit log would be useless without its complementary structure called the FSImage.
HDFS stores everything that happens in transaction log files. They're used during checkpoints and file system recovery, so they occupy quite an important place in the HDFS architecture.
Replicas and blocks are HDFS entities with more than 1 state in their lifecycle. Being aware of these states helps a lot in understanding what happens when a new file is written.
HDFS is a fault-tolerant and resilient distributed file system. That wouldn't be possible without, among other things, block replication.
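For illustration, a minimal sketch of how the replication factor can be set globally and changed per file; the path and factors are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.replication defines the default replication factor (3 by default)
        conf.set("dfs.replication", "3");
        FileSystem fs = FileSystem.get(conf);
        // The factor can also be changed per file, after the fact
        fs.setReplication(new Path("/data/important.csv"), (short) 5);
    }
}
```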
The previous article presented theoretical information about HDFS files. This post deepens the topic.
Files in HDFS are different from files in a local file system. They're fault-tolerant, can be stored in different abstractions, and are based on quite big blocks compared to the blocks of a local file system.
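To give a sense of scale, a minimal sketch: HDFS blocks are typically 128 MB, against the few kilobytes of a local file system block (the explicit dfs.blocksize override below is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // 128 MB blocks vs the ~4 KB blocks of a typical local file system
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Default block size: "
            + fs.getDefaultBlockSize(new Path("/")));
    }
}
```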
HDFS is one of the most popular distributed file systems nowadays. It differs from other, older distributed file systems thanks to its reliability.
Before writing some code against Apache Cassandra, we'll try to explore a very interesting dependency - cassandra-driver-mapping.
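As a hedged preview, a minimal sketch with the DataStax Java driver 3.x mapping module; the keyspace, table, and entity are hypothetical:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.mapping.Mapper;
import com.datastax.driver.mapping.MappingManager;
import com.datastax.driver.mapping.annotations.PartitionKey;
import com.datastax.driver.mapping.annotations.Table;

public class MappingExample {
    // Hypothetical entity mapped to the test.users table
    @Table(keyspace = "test", name = "users")
    public static class User {
        @PartitionKey
        private String login;
        private int age;
        // The mapper reads and writes fields through these accessors
        public String getLogin() { return login; }
        public void setLogin(String login) { this.login = login; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }

    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            Mapper<User> mapper = new MappingManager(session).mapper(User.class);
            User user = new User();
            user.setLogin("test_user");
            user.setAge(30);
            mapper.save(user);                   // INSERT generated by the mapper
            User read = mapper.get("test_user"); // SELECT by partition key
            System.out.println(read.getAge());
        }
    }
}
```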
I/O operations are slower than memory lookups. That's the reason why a memory cache helps to improve performance, in Cassandra too.
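For instance, key and row caching can be tuned per table; a minimal sketch (Cassandra 2.1+ caching syntax, the test.users table is hypothetical):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CassandraCachingExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // Cache all partition keys and up to 100 rows per partition
            session.execute("ALTER TABLE test.users WITH caching = "
                + "{'keys': 'ALL', 'rows_per_partition': '100'}");
        }
    }
}
```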
One of the interesting data types used in Apache Cassandra is collections. In our models we can freely use maps, sets, or lists.
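A minimal sketch showing one column of each collection type; the keyspace and table are hypothetical:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CollectionsExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("test")) {
            // list<int>, set<text> and map<text, int> columns side by side
            session.execute("CREATE TABLE IF NOT EXISTS players ("
                + "login text PRIMARY KEY, "
                + "scores list<int>, "
                + "teams set<text>, "
                + "stats map<text, int>)");
            // Collection literals: [..] for lists, {..} for sets and maps
            session.execute("INSERT INTO players (login, scores, teams, stats) "
                + "VALUES ('player_1', [10, 20], {'teamA'}, {'wins': 3})");
        }
    }
}
```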