Below you can find all articles belonging to the Data engineering category.
Even though the RDBMS is more and more complemented (replaced?) by NoSQL solutions, it remains an important piece of data processing. It's even more true with distributed databases such as BigQuery that support SQL standards, including correlated subqueries. Correlated subqueries are also implemented in other Big Data engines such as Spark SQL, and in more classical ones like PostgreSQL.
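As a quick illustration, here is a minimal sketch of a correlated subquery, run through Python's sqlite3 module so it stays self-contained; the employees table and its values are made up for the example, and the same construct works in BigQuery, Spark SQL or PostgreSQL:

```python
import sqlite3

# In-memory database with a made-up employees table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('alice', 'data', 120), ('bob', 'data', 90),
        ('carol', 'web', 80), ('dave', 'web', 100);
""")

# Correlated subquery: the inner SELECT references e1.department from the
# outer query, so it's logically re-evaluated for every outer row.
rows = conn.execute("""
    SELECT e1.name, e1.department, e1.salary
    FROM employees e1
    WHERE e1.salary > (SELECT AVG(e2.salary)
                       FROM employees e2
                       WHERE e2.department = e1.department)
""").fetchall()

print(rows)  # employees earning above their department's average
# [('alice', 'data', 120), ('dave', 'web', 100)]
```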
Modern data processing frameworks offer a wide range of features. At first glance their number can be scary. Fortunately, they can be discovered progressively, and they're often common to the most popular frameworks.
The Dynamo paper, already quoted here in other posts, was published in 2007 - 10 years ago. Even though time has passed, it still proposes interesting concepts to know for data-driven applications. One of them is the vector clock, used for conflict resolution.
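To make the idea concrete, here is a minimal vector clock sketch in Python; the dict-of-counters representation and the node names are assumptions made for illustration, not Dynamo's actual implementation:

```python
# Minimal vector clock sketch: each replica bumps its own counter on a
# local write, and clocks are compared to detect concurrent updates.
def increment(clock, node):
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated

def happens_before(a, b):
    # a causally precedes b if every counter in a is <= its pair in b.
    nodes = a.keys() | b.keys()
    return all(a.get(n, 0) <= b.get(n, 0) for n in nodes) and a != b

def merge(a, b):
    # Entry-wise maximum, applied once the conflict has been resolved.
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

c1 = increment({}, "node_a")       # first write, coordinated by node_a
c2 = increment(c1, "node_b")       # update seen and extended by node_b
c3 = increment(c1, "node_c")       # concurrent update on node_c

print(happens_before(c1, c2))                          # True: c2 descends from c1
print(happens_before(c2, c3), happens_before(c3, c2))  # False False: a conflict
print(merge(c2, c3))               # {'node_a': 1, 'node_b': 1, 'node_c': 1}
```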
Data organization in key-value oriented NoSQL databases is very often based on a pair of keys: a partition key and a sort key. However, these databases also offer another feature, called the secondary index, that can be a good alternative to the previously described index table pattern.
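A toy in-memory model can show both access paths; the orders schema, attribute names and helper function below are hypothetical, only sketching how a store such as DynamoDB exposes a primary key pair plus a secondary index:

```python
from collections import defaultdict

# Toy key-value table: a partition key selects the bucket, a sort key
# orders items inside it. A secondary index maps a non-key attribute
# back to primary keys, mimicking what the database maintains for you.
table = defaultdict(dict)        # user_id -> {order_date: order}
status_index = defaultdict(set)  # status  -> {(user_id, order_date)}

def put(user_id, order_date, order):
    table[user_id][order_date] = order
    status_index[order["status"]].add((user_id, order_date))

put("user_1", "2017-01-10", {"status": "shipped", "total": 35})
put("user_1", "2017-02-05", {"status": "pending", "total": 20})
put("user_2", "2017-01-20", {"status": "pending", "total": 50})

# Primary access path: partition key plus a range on the sort key.
january = {d: o for d, o in table["user_1"].items() if d < "2017-02-01"}
print(january)

# Secondary index access path: query by an attribute that is neither
# the partition key nor the sort key, without scanning the whole table.
print(status_index["pending"])   # both pending orders, across partitions
```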
Good write throughput and horizontal scalability are maybe the most visible advantages of NoSQL storage systems. However, people with a solid RDBMS background very often fall into the trap of indexes, which can't be created so easily. Fortunately, a lot of patterns helping to deal with this problem exist. One of them is the index table pattern.
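The sketch below illustrates the pattern in a few lines of Python: the application pays an extra write to keep a second, query-oriented table in sync with the primary one; the users-by-city schema is an assumption made for the example:

```python
# Index table pattern sketch: the store only supports lookups by primary
# key, so the application maintains a second table whose key is the
# attribute to query and whose values point back to primary keys.
users_by_id = {}        # primary table: user_id -> user document
user_ids_by_city = {}   # index table: city -> list of user_ids

def save_user(user_id, user):
    users_by_id[user_id] = user
    # The extra write here is the price paid for the fast read path below.
    user_ids_by_city.setdefault(user["city"], []).append(user_id)

def find_users_by_city(city):
    # Without the index table this would be a full scan of users_by_id.
    return [users_by_id[uid] for uid in user_ids_by_city.get(city, [])]

save_user("u1", {"name": "alice", "city": "Paris"})
save_user("u2", {"name": "bob", "city": "Warsaw"})
save_user("u3", {"name": "carol", "city": "Paris"})
print(find_users_by_city("Paris"))  # alice and carol
```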
Dealing with a lot of data is a time-consuming activity, but dealing with a lot of data while ensuring its high value is even more complicated. It's one of the reasons why data quality should never be neglected. After all, it's one of the components providing accurate business insights and facilitating strategic decisions.
The popularization of NoSQL data stores brought a new concept to data management, called polyglot persistence. The term is very similar to polyglot programming, and it's presented below.
Sequential writes have proven themselves in distributed data-driven systems. They usually perform better than random writes, especially in write-intensive systems. Besides their link to Big Data, sequential writes are also related to another type of system, the log-structured file system, defined in the late 1980s.
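The following sketch shows the core idea in its simplest form: an append-only log where every write lands sequentially at the end of the file and an in-memory index keeps each key's latest offset; the file name and record format are assumptions for the demo:

```python
import os

# Append-only log: every write goes sequentially to the end of the file,
# an in-memory index remembers each key's latest offset, and an update is
# just a new record (the old one stays behind, awaiting compaction).
LOG_PATH = "storage.log"
index = {}  # key -> byte offset of its most recent record

def append(key, value):
    with open(LOG_PATH, "ab") as log:
        index[key] = log.tell()  # sequential write: always at the end
        log.write(f"{key}={value}\n".encode("utf-8"))

def read(key):
    with open(LOG_PATH, "rb") as log:
        log.seek(index[key])     # one seek, no scan
        _, value = log.readline().rstrip(b"\n").split(b"=", 1)
        return value.decode("utf-8")

append("user_1", "alice")
append("user_2", "bob")
append("user_1", "alice_updated")  # update = append, never overwrite

print(read("user_1"))  # alice_updated
os.remove(LOG_PATH)    # clean up the demo file
```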
Counting the number of distinct elements can appear to be a simple task in classical web service-based applications. After all, we usually have to deal with a small subset of data that simply fits in memory and can be counted with data structures such as sets. But the same task is less obvious in Big Data applications, where approximation algorithms can come to the aid.
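One such approximation algorithm is HyperLogLog; below is a toy Python version, good enough to show how 1024 single-byte registers can estimate a distinct count within a few percent, instead of a set holding every element (the precision P = 10 is an arbitrary choice here):

```python
import hashlib

# Toy HyperLogLog: each element is hashed, the first P bits pick a
# register, and the register keeps the maximum "rank" (position of the
# first 1-bit) seen in the remaining bits.
P = 10                      # 2**10 = 1024 registers
M = 1 << P
registers = [0] * M

def add(value):
    h = int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")
    bucket = h >> (64 - P)                   # first P bits pick a register
    rest = h & ((1 << (64 - P)) - 1)
    rank = (64 - P) - rest.bit_length() + 1  # leading zeros in the rest + 1
    registers[bucket] = max(registers[bucket], rank)

def estimate():
    alpha = 0.7213 / (1 + 1.079 / M)         # bias correction constant
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    return int(raw)

for i in range(100_000):
    add(f"user_{i}")

print(estimate())  # close to 100000, within a few percent
```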
One of the previous posts presented partitioning strategies. Among the described techniques we could find hash partitioning based on the number of servers. The drawback of this method is its lack of flexibility: when a new server is added, we have to remap all the data. Fortunately an alternative to this "primitive" hashing exists, and it's called consistent hashing.
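Here is a minimal hash ring sketch showing the property that matters: when a fourth server joins, only the keys falling into its arcs move, instead of nearly all of them as with modulo hashing (the virtual-node count and server names are arbitrary assumptions):

```python
import bisect
import hashlib

VNODES = 100  # virtual nodes per server, smooths the key distribution

def _hash(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers):
        # Place each server at VNODES positions on the ring.
        self._ring = sorted(
            (_hash(f"{server}#{v}"), server)
            for server in servers for v in range(VNODES)
        )
        self._points = [point for point, _ in self._ring]

    def server_for(self, key):
        # First ring position clockwise from the key's hash, wrapping around.
        i = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["server_1", "server_2", "server_3"])
bigger = HashRing(["server_1", "server_2", "server_3", "server_4"])

keys = [f"key_{i}" for i in range(10_000)]
moved = sum(ring.server_for(k) != bigger.server_for(k) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")  # ~25%, not ~100% as with modulo
```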
Every data processing pipeline can have a source of contention. One of them can be the data's location: when all entries are read from a single place by dozens or hundreds of workers, the data source can respond more slowly. One of the solutions to this problem can be partitioning.
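A minimal sketch of that idea: the entries are hash-partitioned up front and each worker reads only its own slice, so no two workers compete for the same input (the record layout and partition count are assumptions for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

PARTITIONS = 4
entries = [{"id": i, "payload": f"event_{i}"} for i in range(20)]

# Hash-partition the entries: one independent slice per worker.
partitions = [[] for _ in range(PARTITIONS)]
for entry in entries:
    partitions[entry["id"] % PARTITIONS].append(entry)

def process(partition):
    # Each worker touches a disjoint subset: no shared read contention.
    return [entry["payload"].upper() for entry in partition]

with ThreadPoolExecutor(max_workers=PARTITIONS) as pool:
    results = list(pool.map(process, partitions))

print([len(p) for p in partitions])  # [5, 5, 5, 5]
```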
Since cloud providers gained popularity, serverless processing has become one of the serious alternatives to cluster-based data pipelines. It's often cheaper to have event-based applications than different processing jobs in clusters. However, using serverless (and not only serverless) in distributed and stateful computing can sometimes be difficult. But often one property can help with a lot of problems - idempotence.
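The sketch below shows the property at work for a hypothetical deposit handler under at-least-once delivery: remembering processed event ids makes a redelivered event a no-op (in production the set would live in a transactional store, not in memory):

```python
# Idempotence sketch: the handler is safe to call twice with the same
# event, which is exactly what at-least-once delivery requires.
balances = {"account_1": 100}
processed_event_ids = set()

def handle_deposit(event):
    if event["event_id"] in processed_event_ids:
        return  # duplicate delivery: applying it again would double-count
    balances[event["account"]] += event["amount"]
    processed_event_ids.add(event["event_id"])

event = {"event_id": "evt-42", "account": "account_1", "amount": 50}
handle_deposit(event)
handle_deposit(event)  # redelivered by the messaging layer: a no-op now

print(balances["account_1"])  # 150, not 200
```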
Almost every year a new concept of data-centric architecture appears. In 2014 the Kappa concept was published by Jay Kreps. One year later another concept emerged - the architecture called Zeta.
Previously we discovered two popular architectures in Big Data systems - Lambda and Kappa. Because they were new and pretty long concepts to explain, we deliberately ignored the tools.
When, some years ago, I did a small POC Hadoop/MapReduce project based on the Million Song Dataset (on my old blog, in French), I deliberately omitted the part about architecture. It was a mistake, because a correctly designed architecture is as important as the code written behind it.
At first, Big Data can seem to be strictly related to one tool - Hadoop. However, that's a misunderstanding of the concept, because it hides much more interesting stuff.