Big Data architectures

Some years ago, when I built a small Hadoop/MapReduce proof of concept based on the Million Song Dataset (described on my old blog, in French), I deliberately left out the part about architecture. That was a mistake, because a well-designed architecture is as important as the code behind it.

That is why this article describes two architectures commonly found in Big Data systems: lambda and kappa. The first part describes the lambda architecture. The second part covers the kappa architecture. The last part is a table comparing the two. The article describes these architectures from a conceptual point of view only. The tools suited to them will be the topic of another article.

Lambda architecture

The core of the lambda architecture is composed of 3 layers: batch, serving and speed. The batch layer contains the main source of data, commonly called the master dataset. It is considered the source of truth and makes it possible to rebuild the system's data in case of failure. The master dataset is stored in a distributed filesystem (such as the Hadoop Distributed File System). The batch layer uses it to compute the batch view.

The batch view is then used by the serving layer. This layer makes the batch and speed layer views accessible to submitted queries. Its main goal is to serve results with low latency. The serving layer can be analyzed from 2 angles, each defined by one question:
- throughput: how many requests can be served simultaneously?
- latency: how much time is needed to generate the response to one request?

The serving layer's latency illustrates well the complexity of Big Data architectures. Since data is stored in a distributed filesystem, the nodes in the cluster do not necessarily carry the same load: some servers are used by more users, some by fewer. Consequently, the time needed to generate the final response can be dominated by the poor performance of a single server. For example, if one server takes 3 seconds to generate its response and two others take only 1 second, the user will wait 3 seconds, not 1, before seeing the final result.
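The 3-second example can be written down as a tiny sketch. It assumes that the query is sent to all nodes in parallel and that the final response needs every partial result; the function name `query_latency` is only illustrative.

```python
def query_latency(node_latencies_s):
    """Latency of a query fanned out to several nodes in parallel.

    Assumption: the final response can only be assembled once every
    node has answered, so the slowest node decides the total time.
    """
    return max(node_latencies_s)


# One slow server (3 s) and two fast ones (1 s each).
latency = query_latency([3.0, 1.0, 1.0])
print(latency)  # 3.0
```

Under this assumption, adding more fast nodes does not help: only speeding up (or bypassing) the slowest node reduces the response time.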

The speed layer is the last part of the lambda architecture used to generate query responses. As its name indicates, this layer is also responsible for returning results quickly. Normally, it stores less data than the batch layer. In addition, this data is fresher than the data stored in the master dataset. Once it reaches a certain age, speed layer data is moved to the master dataset. This event triggers a new recomputation of the batch views.
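To make the interplay between the views more concrete, here is a minimal sketch of how a query could combine them. The play counters, the keys and the `query` function are hypothetical; real systems store the views in dedicated databases rather than in dictionaries.

```python
# Hypothetical precomputed views: number of plays per song.
batch_view = {"song-a": 100, "song-b": 40}   # computed from the master dataset
speed_view = {"song-a": 3, "song-c": 1}      # only recent, not yet batched data


def query(song_id):
    """Answer a query by merging the batch view with the speed view."""
    return batch_view.get(song_id, 0) + speed_view.get(song_id, 0)


print(query("song-a"))  # 103
```

The batch view brings the bulk of the result and the speed view adds what arrived since the last batch computation.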

But data age is not the only difference between the speed and batch layers. Another one concerns the algorithms applied to manage data. The batch layer uses either an incremental approach (new data is appended to what already exists) or a recomputation approach (views are precomputed from scratch using old + new data). The speed layer is not suited to the second solution because it requires quick data access, and that is easier to achieve by adding new data than by computing everything from scratch (even if there is less data).
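The difference between the two approaches can be sketched with a simple counting view. The function names and the sample records are assumptions made for the illustration only.

```python
from collections import Counter


def recompute_view(all_records):
    """Recomputation: rebuild the whole view from old + new data."""
    return Counter(all_records)


def update_view_incrementally(view, new_records):
    """Incremental: only the new records are added to the existing view."""
    view.update(new_records)
    return view


old_records = ["song-a", "song-b", "song-a"]
new_records = ["song-b", "song-c"]

from_scratch = recompute_view(old_records + new_records)
incremental = update_view_incrementally(recompute_view(old_records), new_records)

# Both approaches end up with the same view, but the incremental one
# only had to read the 2 new records instead of all 5.
print(from_scratch == incremental)  # True
```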

A common point between the speed and batch layers is that both should be fault tolerant. It means that when some nodes go down, query execution should not be compromised. This is achieved through data replication. But there is still a risk that all the nodes storing copies of the same data go down. So a good replication design is mandatory to avoid this kind of rare, but possible and catastrophic, situation.
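As a rough illustration of the replication idea, the sketch below checks whether a piece of data is still readable after some nodes fail. The node names, the block name and the `is_available` helper are hypothetical.

```python
def is_available(replica_nodes, failed_nodes):
    """Data stays readable as long as at least one replica node is alive."""
    return any(node not in failed_nodes for node in replica_nodes)


# Hypothetical placement: one block copied on 3 different nodes.
block_replicas = {"node-1", "node-2", "node-3"}

print(is_available(block_replicas, {"node-1"}))                      # True
print(is_available(block_replicas, {"node-1", "node-2", "node-3"}))  # False
```

The second call is the catastrophic case: every node holding a copy is down, so replication alone no longer protects the query.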

Kappa architecture

The kappa architecture is similar to lambda. This paradigm is often described as a "simplified version of lambda architecture". It is based on the principle of a single master dataset shared by all layers. This common master dataset takes the form of an immutable, append-only log.

Data stored in the log is then moved to appropriate serving storage engines. Their role is the same as that of the serving layer in the lambda architecture: the data put in them is used directly to generate optimized responses to end-user queries. We can now deduce that the kappa architecture is composed of only 2 layers: stream processing and serving.
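The two layers can be sketched in a few lines. This is only an in-memory illustration: the `AppendOnlyLog` and `StreamJob` classes are hypothetical, and a dictionary stands in for the serving storage engine.

```python
class AppendOnlyLog:
    """Immutable, append-only log: records are added, never changed."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset):
        return self._records[offset:]


class StreamJob:
    """Stream processing layer: reads the log and feeds the serving view."""

    def __init__(self, log, serving_view):
        self.log = log
        self.serving_view = serving_view
        self.offset = 0  # position of the next record to process

    def poll(self):
        for record in self.log.read_from(self.offset):
            self.serving_view[record] = self.serving_view.get(record, 0) + 1
            self.offset += 1


log = AppendOnlyLog()
serving_view = {}  # stands in for the serving storage engine
job = StreamJob(log, serving_view)

for song in ["song-a", "song-b", "song-a"]:
    log.append(song)
job.poll()
print(serving_view)  # {'song-a': 2, 'song-b': 1}
```

Queries only ever read `serving_view`; the log itself is never queried directly.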

A potential issue immediately comes to mind: what should we do if we must recompute the serving layer? One commonly used solution consists of starting a new instance of the stream processing job and pointing its output at a new structure in the serving layer. Once this job has finished reprocessing the data, we can switch from the structure currently used in the serving layer to the one just generated. After that, the old stream processing job and the old structure in the serving layer are deleted.
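The reprocessing procedure can be sketched as follows. Everything here is hypothetical: the log content, the two job versions (`job_v1`, `job_v2`), the table names and the `active` pointer that tells queries which structure to read.

```python
# Immutable log shared by every version of the job.
log = ["Song A", "song a", "Song B"]


def job_v1(record):
    return record            # old code: keys are case-sensitive


def job_v2(record):
    return record.lower()    # new code: keys are normalized


def run_job(records, transform):
    """Replay the log from the beginning and build a fresh view."""
    view = {}
    for record in records:
        key = transform(record)
        view[key] = view.get(key, 0) + 1
    return view


# Current state: queries read the structure produced by the old job.
serving = {"views_v1": run_job(log, job_v1)}
active = "views_v1"

# 1. Start a new job instance writing to a new structure.
serving["views_v2"] = run_job(log, job_v2)
# 2. Once it has finished, switch queries to the new structure.
old, active = active, "views_v2"
# 3. Delete the old structure (the old job would be stopped here too).
del serving[old]

print(serving[active])  # {'song a': 2, 'song b': 1}
```

Because the log is immutable and kept in full, the new job can always rebuild the view from the first record.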

Lambda and kappa architecture comparison

We can note that the kappa architecture does not have the batch and speed layers of the lambda architecture. But to compare the two architectures more easily, nothing beats a simple and clear comparison table:


	Lambda	Kappa
layers	batch, serving and speed	stream processing and serving
data storage algorithms	incremental, recomputation	incremental
code maintenance	code supposed to produce the same results must be maintained in two layers (batch and speed)	code must be maintained in only one layer (stream processing job)
reprocessing need	depends on the applied algorithm, but reprocessing often happens when the speed layer starts to saturate	because of the incremental approach, reprocessing is needed only when code changes or errors are detected
data mutability	immutable	immutable
real-time processing	fresh	fresh data is reprocessed more quickly
serving layer	exists and is optimized for queries	exists and is optimized for queries
complexity	because of more layers and, possibly, more tools, the system can become complex very quickly	usually, fewer tools are used than in lambda, so complexity is (normally) lower

This article presented two architectures often implemented in Big Data systems: lambda and kappa. The first one, more complex, is composed of 3 layers: batch, serving and speed, which increases maintenance effort and system complexity. The kappa architecture is a simpler version of lambda because it has only the stream processing and serving layers. Both prefer to work on immutable data.