Originally, Big Data can seem to be strictly related to one tool - Hadoop. However, it's a misunderstanding of the concept because it hides more interesting stuff.
In this article we try to discover this "interesting stuff". Each part of it should help us to understand better the design and use of systems based on Big Data concept. The idea behind this article is to make a glossary of terms related to Big Data systems. It's the reason why, it's conceived as an ordered list of items. The list is only my personal point of view but your suggestions to enrich it are welcome.
Two main architecture paradigms are used in Big Data systems: lambda architecture and kappa architecture. Without going into details, the first paradigm is based on separation of data storage layers on 3 types: batch, serving and speed. The first layer is associated to main data storage system. It uses this data to precompute information used by serving layer and returned directly to final users. Speed layer stores data too. But unlike batch layer, speed layer uses more fresh data, entered into system not so long time ago. It also generates a part of response to the final user.
In the other side, kappa architecture is a kind of simplified version of lambda architecture. It can be thought as a lambda architecture without batch layer. The goal of this removal was to simplify code maintenance (only one layer for data processing). New data is appended into already existent data system. And it leads us to another point related to Big Data - updates.
We retrieve two algorithms of data manipulation: recomputation and incrementation. The first one is quite easy because it triggers the preparation for serving layer view on all existent data. The second is more complicated because it updates already prepared views with fresh data. Both are strictly related to another important term - human-fault tolerancy.
- Human-fault tolerant
Because there are a lot of data to process, the costs of human errors (coding bugs, bad implementations or whatever) are very important. Systems should be human-fault tolerant. It's achieved, mostly, by a specificity of stored data which is immutable. It means that, in the example of a visits counter, we won't make an incrementation of visits but, instead, we will prefer to store raw data about visit. And another point of Big Data will do the job of count - batch processing.
- Batch processing
Even if it's not a new concept, batch processing is plenty exploited in Big Data systems to, among others, prepare responses to user queries in serving layer. And queries are a key point of systems.
- Query-oriented systems
Big Data systems are designed around queries. To design them well, we need to know to which demand they should respond. Theses systems are formulated as computing functions of data, very often illustrated by pseudo-code snippet:
query = function(all data)
Computing real time views on petabytes of data is very difficult to achieve without pre-computation. It's one the reasons why real time views are mixed by speed and batch layer computations. In consequence, systems must be defined depending on supported queries. To do so, data must be sometimes denormalized.
For some queries, data must be specifically prepared, not always in not redundant format. The operation leading to this result is called denormalization. Typically in Big Data, denormalization is used to make static data serving as a query results. In the other side, normalized data is stored in master dataset, ie. a part from which views in serving layer are constructed.
- 4 Vs
Big Data is very often characterized by the concept of 4 Vs:
- Volume: the quantity of generated data which will, potentially, have to be processed by the system.
- Velocity: defines how often new data is generated.
- Variery: identifies which types of data (structured or unstructured) are ingested to the system. Among unstructured data we can distinguish images, videos, sounds. For structured ones, we could list text documents with defined structure (XML files respecting a schema etc.).
- Veracity: relates to a question on how well given data is representative. The idea of this 'V' is to establish a trust relation for processed data.
- Distributed nature
Previously described human-fault tolerancy is related to another concept of possible errors - hardware failures. In Big Data systems, data is stored in distributed filesystems, such as Haddop's HDFS. Distributed filesystems means that there are data replication to prevents against node blocking failures.
The concept of shared-nothing is related to parallel computation of stored data. An algorithm illustrating it is famous MapReduce. Simply speaking, is uses several servers to compute in parallel maneer, chunks of data. Once computed, they're assembled into format corresponding to expected response format.
- Horizontal scaling
Distributed filesystem is linked to another concept - horizontal scaling. More nodes we add to our cluster, more data in respective amount of time we are able to process.
- Low latency
This concept goes often with Big Data. It illustrates quick responses for search queries.
An important point to take in consideration is the data reprocessing need. This aspects describes the relation between code change and output used in query responses. Every time when code changes, output can change too. If it's the case, data must be reprocessed and new output must replace the old.
The article lists several important concepts related to Big Data systems. In some words, it presents every item, such as architecture (scaling, distributed filesystem), algorithms (incremental, shared-nothing, recomputation) or systems specificity (4V, query-oriented or human-fault).