Storage articles

Looking for something else? Check the categories of Storage:

Apache Avro, Apache Cassandra, Apache Hudi, Apache Iceberg, Apache Parquet, Apache ZooKeeper, Delta Lake, Elasticsearch, Embedded databases, HDFS, MySQL, PostgreSQL, Time series

If not, below you can find all articles belonging to Storage.

Introduction to Apache Cassandra

After some articles about data ingestion and serialization in Big Data applications, it's time to start learning about storage. This part begins with Apache Cassandra.

This article presents the basic concepts of Apache Cassandra. The first part explains the architecture and general concepts of this solution. The second part focuses on developer topics and describes the main points of data organization.

Continue Reading →

Watches in Apache ZooKeeper

A lot of programming tools implement an event-driven approach. Apache ZooKeeper is no exception to this rule, thanks to its system of watchers.

Continue Reading →

ACL in Apache ZooKeeper

Apache ZooKeeper is very often compared to a distributed file system. Since every file system has a way to handle file permissions, ZooKeeper, as a kind of file system, is no different.

Continue Reading →

Asynchronous operations in Apache ZooKeeper

Sometimes network latency can slow down the communication between Apache ZooKeeper and its clients. That's one of the reasons to use asynchronous operations for zNode manipulation.

Continue Reading →

Manipulate zNodes in Apache ZooKeeper

Until now we've seen how to create zNodes. But creation is not the only thing that Apache ZooKeeper does.

Continue Reading →

Session in Apache ZooKeeper

A client connects to a ZooKeeper server and maintains a session. There are several things to know about ZooKeeper sessions, and we'll explore them in this article.

Continue Reading →

zNode in Apache ZooKeeper

As already mentioned, zNodes are a key part of Apache ZooKeeper. They store information shared among different servers directly (as binary data) or indirectly (as parent directories).

Continue Reading →

Introduction to Apache ZooKeeper

Apache ZooKeeper usually works in the shadow of more visible Big Data tools, such as Apache Spark or Apache Kafka. However, it plays a very important role in system architecture.

Continue Reading →

Elasticsearch migration from 1.6 to 2.2

Elasticsearch 2.2.0 was released at the beginning of February 2016. Because my POC project was stuck on 1.6, I decided to upgrade, though not without surprises and some code rework.

Continue Reading →

Serialization and deserialization with schemas in Apache Avro

After the theoretical introduction to Apache Avro, we can see how it can be used.

Continue Reading →

Introduction to Apache Avro

Previously we learned why serialization frameworks can facilitate work in distributed systems, where data comes from several different sources. Now it's a good time to discover some real tools used in the serialization step. As mentioned, the chosen tool is Apache Avro.

Continue Reading →

Introduction to serialization in Big Data

NoSQL solutions are very often associated with the word schemaless. Sometimes the absence of a schema can lead to maintenance or backward compatibility problems. One of the solutions to these issues in Big Data systems is a serialization framework.

Continue Reading →

Reverse nested aggregation in Elasticsearch

Aggregations are a really powerful Elasticsearch feature. Besides the aggregations known from RDBMS, such as sum, min, max and count, they offer the possibility to apply aggregations at different levels. This is particularly useful with nested documents.

Continue Reading →

Parent-children relationship in Elasticsearch

Making links between entities is quite easy in relational databases. It's not a trivial task in document databases, which are adapted to less normalized data storage. Elasticsearch is no exception to this rule, but it defines some mechanisms to support parent-children relationships between documents.

Continue Reading →

Locks in Elasticsearch

Concurrency issues in Elasticsearch are often caused by the lack of support for ACID transactions. However, the search engine provides some locking mechanisms to deal with them.

Continue Reading →

Aggregations in Elasticsearch

Even though Elasticsearch is not a relational system, it can aggregate results. This operation is very helpful when we want to group a set of documents.

Continue Reading →

Routing in Elasticsearch

If you've worked with PHP frameworks like Zend or Symfony, you are certainly familiar with the concept of routing, which dispatches an HTTP request to the appropriate controller. Elasticsearch has a similar feature, also called routing.

Continue Reading →

Proximity matching in Elasticsearch

Elasticsearch, with its idea of an inverted index, is a kind of magic, infinitely deep hat in which we can hide millions of terms. However, sometimes these terms need to be analyzed with some logic, not just as plain words. This is where proximity matching comes to help.

Continue Reading →

Partial matching and ngrams in Elasticsearch

Elasticsearch matches only the terms defined in the inverted index. So even if we are looking for only the first two letters of a given term, we won't be able to find it with a standard match query. Instead, we should use partial matching, which Elasticsearch provides in different forms.
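To make the idea concrete, here is a minimal sketch in plain Python. It is not the Elasticsearch API; `edge_ngrams`, `build_index` and the sample documents are hypothetical names made up for illustration. It mimics an inverted index and shows why a two-letter lookup fails on whole terms but succeeds once prefixes (edge n-grams) are indexed too.

```python
def edge_ngrams(term, min_gram=1, max_gram=5):
    """Return the prefixes of `term` whose length is between min_gram and max_gram."""
    upper = min(max_gram, len(term))
    return [term[:n] for n in range(min_gram, upper + 1)]


def build_index(docs, tokenizer):
    """Build a toy inverted index: token -> set of ids of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            for token in tokenizer(word):
                index.setdefault(token, set()).add(doc_id)
    return index


# Hypothetical sample documents, used only for this illustration.
docs = {1: "Elasticsearch routing", 2: "Apache Cassandra"}

# Index storing whole terms only: an exact term lookup is the only way to match.
plain_index = build_index(docs, lambda word: [word])

# Index storing the edge n-grams of every term: short prefixes become terms too.
ngram_index = build_index(docs, edge_ngrams)

print(plain_index.get("el", set()))  # no document: "el" is not an indexed term
print(ngram_index.get("el", set()))  # document 1: "el" is a prefix of "elasticsearch"
```

The trade-off in this sketch is index size: every term is stored several times, once per prefix, in exchange for a cheap lookup at search time.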

Continue Reading →

Filtered queries in Elasticsearch

Queries in Elasticsearch are not limited to full-text search. They can also be filtered. And in the Elasticsearch world, a filter is a different operation than a query.

Continue Reading →