Big Data tools

Previously we looked at two popular architectures in Big Data systems: lambda and kappa. Because these concepts were new and took a while to explain, we deliberately left the tools aside.

In this article we will see which Big Data tools are well suited to the different layers of the lambda and kappa architectures. We start with the lambda architecture and then move on to kappa.

Lambda architecture tools

As explained in the article about Big Data architectures, the batch layer in the lambda architecture is responsible for manipulating the master dataset. It follows that all kinds of batch processing tools belong here, among others Hadoop MapReduce and Apache Spark. We can add two Apache projects that help to manage large datasets: Hive and Pig. Other tools used in the batch layer are those dedicated to schema declaration and serialization: Apache Thrift, Protocol Buffers and Apache Avro. As for the master dataset mentioned at the beginning of this paragraph, a solution commonly used to store it is the Hadoop Distributed File System (HDFS).

The speed layer plays a role similar to the batch layer. As with batch views, the data stored in the speed layer's real-time views is part of the response to a query, so it must be stored somewhere. Among the storage engines for speed layer views we find several popular databases: Cassandra, HBase, Redis, Elasticsearch and even MySQL (or another RDBMS). What they have in common is support for random reads and writes. This is needed because the speed layer works with incremental algorithms rather than the recomputation algorithms used by the batch layer. The data must also be processed by a stream processing framework that feeds these data stores: Apache Storm, Apache Spark Streaming or Apache Samza. This processing generates the real-time views that are later used by the serving layer.
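The difference between recomputation and incremental algorithms can be sketched in plain Python. This is only a minimal illustration under stated assumptions: the names `recompute_view` and `IncrementalView` are hypothetical, the page-view events are made up, and an in-memory dict stands in for a random-write store such as Redis or Cassandra. It is not tied to any of the frameworks listed above.

```python
from collections import Counter


def recompute_view(master_dataset):
    """Batch style: rebuild the whole view from the full master dataset."""
    return dict(Counter(event["page"] for event in master_dataset))


class IncrementalView:
    """Speed style: touch only the key affected by each new event."""

    def __init__(self):
        # Hypothetical stand-in for a store supporting random reads and writes.
        self.counts = {}

    def apply(self, event):
        page = event["page"]
        # Random read followed by a random write on a single key.
        self.counts[page] = self.counts.get(page, 0) + 1


# Made-up sample events, used only for illustration.
events = [{"page": "/home"}, {"page": "/about"}, {"page": "/home"}]

batch_view = recompute_view(events)

speed_view = IncrementalView()
for event in events:
    speed_view.apply(event)
```

Both approaches end with the same counts here. The recomputation version needs the whole dataset on every run, while the incremental version needs a store it can update key by key, which is why random writes matter for the speed layer.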

As for the serving layer, we know that it indexes and exposes the views generated by the two layers presented above. One of the expectations for this layer is low latency: responses should be delivered as quickly as possible. This is why the core components of this layer are low-latency databases such as ElephantDB, SploutSQL, Voldemort, Druid and Impala. These databases don't need to support random writes, and arguably shouldn't: supporting them would increase their complexity and could lead to unexpected bugs.
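To make the "no random writes" idea concrete, here is a small Python sketch of a serving layer that loads a batch view in bulk, exposes it read-only, and answers a query by combining it with a real-time view. The `ServingLayer` class and its methods are hypothetical names chosen for this example, and the summing merge is an assumption that only fits additive views such as counters; none of the databases above expose this exact interface.

```python
from types import MappingProxyType


class ServingLayer:
    """Exposes a batch view that is replaced wholesale, never updated in place."""

    def __init__(self):
        self._batch_view = MappingProxyType({})

    def load_batch_view(self, view):
        # Bulk load: copy the freshly computed view and swap it in atomically.
        self._batch_view = MappingProxyType(dict(view))

    def query(self, key, realtime_view):
        # Assumption: views hold counters, so merging means adding them up.
        return self._batch_view.get(key, 0) + realtime_view.get(key, 0)


serving = ServingLayer()
serving.load_batch_view({"/home": 100, "/about": 40})

# Hypothetical real-time view covering events the last batch run missed.
realtime_view = {"/home": 3}

home_total = serving.query("/home", realtime_view)
about_total = serving.query("/about", realtime_view)
```

Since the batch view is wrapped in a read-only mapping, the only way to change it is to load a new one, which keeps the serving side simple.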

Kappa architecture tools

The kappa architecture is a simplified version of the lambda architecture, so the two can share some tools. This is the case for the append-only, immutable log data store layer, where we can use Apache Kafka. The data is then processed by the stream processing layer. There we find almost the same low-latency systems as in the speed layer of the lambda architecture: Apache Storm, Apache Samza, Apache Spark Streaming, Amazon Kinesis or Apache Flink.
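The role of the append-only log can be illustrated with a toy Python version. This is a sketch under assumptions, not how Kafka is implemented: `AppendOnlyLog` and `build_view` are hypothetical names, the log lives in a Python list, and the records are invented. It shows how a stream job can build a view by reading the log from the first offset, and how a different view can be derived later by replaying the same records.

```python
class AppendOnlyLog:
    """Toy immutable log: records can be appended and read, never modified."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the appended record

    def read_from(self, offset=0):
        # Iterate over a copy so readers cannot alter the stored records list.
        return iter(self._records[offset:])


def build_view(log, transform):
    """Replay the log from the beginning and aggregate values per key."""
    view = {}
    for record in log.read_from(0):
        key, value = transform(record)
        view[key] = view.get(key, 0) + value
    return view


log = AppendOnlyLog()
for user, amount in [("a", 10), ("b", 5), ("a", 7)]:
    log.append({"user": user, "amount": amount})

# First version of the processing logic: number of orders per user.
order_counts = build_view(log, lambda r: (r["user"], 1))

# New logic later on: replay the same log to compute totals per user.
order_totals = build_view(log, lambda r: (r["user"], r["amount"]))
```

Because the log is never modified, changing the processing logic only requires running the new code over the same records again.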

The serving layer can be built on any available database, just like the same layer in the lambda architecture: key-value or column-oriented stores (HBase, Cassandra...) or even search engines (Solr, Elasticsearch).

Even though this article is short, it shows how many tools are available to build Big Data systems. It also shows that one framework can be used in several layers, which helps to reduce system complexity.
