Articles about Data engineering on waitingforcode.com - articles for the pleasure of learning and discovery

May 27, 2018 • Big Data algorithms

Consensus problem and the Raft consensus algorithm

For many years the Paxos protocol was one of the most serious solutions to consensus problems. However, in 2013 Diego Ongaro and John Ousterhout from Stanford University proposed an alternative called Raft.

May 27, 2018 • SQL

Window functions in SQL

Window functions are another SQL feature that we'll probably discover while working with data-oriented applications. They can also be used in more classical programs, though.

May 20, 2018 • Big Data problems - solutions

Immutability and key-value storage

Immutability is a precious property of systems dealing with a lot of data. It's especially true when something goes wrong and we must recover quickly. Since the data is immutable, no cleaning step has to be executed and, with some additional computation power, the data can be regenerated efficiently.

May 13, 2018 • General Big Data

ACID 2.0

ACID is a well-known acronym for almost all developers who grew up with the RDBMS as their main storage. However, with the popularization of NoSQL and distributed computing, another ACID acronym appeared: ACID 2.0.

May 6, 2018 • Big Data algorithms

Conflict-Free Replicated Data Types - flags, graphs and maps

The previous post about Conflict-free Replicated Data Types presented some of the basic structures of this type. This one describes some recently uncovered types, such as flags, graphs and arrays.

May 6, 2018 • General Big Data

Immutability in Big Data

The value of immutability in Big Data is often difficult to understand at first glance. After all, it introduces some complexity, especially on the read path. But when the first problems appear and some of the data needs to be recomputed in order, immutability comes to the rescue.

April 28, 2018 • Big Data algorithms

Conflict-Free Replicated Data Type

Pessimistic replication requires synchronous communication between the main node writing the data and the replicas. However, in some cases optimistic replication can be more efficient and still guarantee the same final result. One of the solutions from this category is conflict-free replicated data types.
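
To make the "same final result" idea concrete, here is a minimal sketch of one of the simplest conflict-free replicated data types, a grow-only counter. The `GCounter` class and its method names are assumptions made for this illustration, not code taken from the article.

```python
from typing import Dict


class GCounter:
    """Grow-only counter: every replica increments only its own slot."""

    def __init__(self, replica_id: str) -> None:
        self.replica_id = replica_id
        self.counts: Dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a grow-only counter cannot decrease")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self) -> int:
        # The counter value is the sum of all per-replica slots.
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Element-wise maximum is commutative, associative and idempotent,
        # so replicas converge whatever the order of the merges.
        for replica, count in other.counts.items():
            self.counts[replica] = max(self.counts.get(replica, 0), count)


# Two replicas accept writes independently, then exchange their states.
a, b = GCounter("a"), GCounter("b")
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
print(a.value(), b.value())
```

Neither replica waits for the other before accepting a write, yet both report the same total once they have merged each other's state.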

April 28, 2018 • SQL

Hierarchical queries

SQL hides from most of its daily users a lot of interesting and powerful functions that are nevertheless not used very frequently. Hierarchical queries are one of them.

April 22, 2018 • Big Data algorithms

Frequency estimation with Count-min sketch

The HyperLogLog algorithm described some weeks ago is not the only approximate solution in the world of Big Data applications. Another one is Count-min sketch.

April 22, 2018 • SQL

Correlated subqueries

Even though the RDBMS is more and more often complemented (replaced?) by NoSQL solutions, it still remains an important piece of data processing. It's even more true with distributed databases such as BigQuery, which support SQL standards and therefore correlated subqueries. But they're also implemented in other Big Data engines such as Spark SQL and in more classical ones such as PostgreSQL.

April 15, 2018 • General Big Data

Data processing frameworks concepts

Modern data processing frameworks offer a wide range of features. At first glance this number can be scary. Fortunately, they can be discovered one at a time and are often common to the most popular frameworks.

March 31, 2018 • Big Data algorithms

Conflict resolution in distributed applications - vector clocks

The Dynamo paper, already quoted here in other posts, was published in 2007. That was 10 years ago. Even though time has passed, it still proposes concepts worth knowing for data-driven applications. One of them is vector clocks, used for conflict resolution.
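
As a rough illustration of how vector clocks help with conflict resolution, the sketch below compares two clocks and reports whether one version descends from the other or whether they are concurrent, which is the case that needs reconciliation. The `compare` function and the dictionary representation are assumptions made for this example, not the Dynamo implementation.

```python
from typing import Dict

# A vector clock maps a node name to the number of updates seen from that node.
VectorClock = Dict[str, int]


def compare(a: VectorClock, b: VectorClock) -> str:
    """Return 'equal', 'before', 'after' or 'concurrent' for clock a versus b."""
    nodes = set(a) | set(b)
    # A missing entry counts as 0: no update from that node has been seen yet.
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # b descends from a, so b can safely replace a
    if b_le_a:
        return "after"       # a descends from b
    return "concurrent"      # neither descends from the other: a real conflict


print(compare({"x": 1}, {"x": 2, "y": 1}))
print(compare({"x": 2, "y": 1}, {"x": 1, "y": 2}))
```

Only the "concurrent" outcome requires the application, or a merge function, to reconcile the two versions.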

March 25, 2018 • General Big Data

Secondary index in NoSQL data stores

Data organization in key-value oriented NoSQL databases is very often based on a pair of keys: a partition key and a sort key. However, these databases also offer another feature called a secondary index, which can be a good alternative to the previously described index table pattern.

February 25, 2018 • Big Data problems - solutions

Index table pattern in NoSQL

Good write throughput and horizontal scalability are perhaps the most visible advantages of NoSQL storage systems. However, people with a solid RDBMS background very often fall into the trap of indexes that can't be created so easily. Fortunately, a lot of patterns exist to help deal with this problem. One of them is the index table pattern.

February 25, 2018 • General Big Data

Introduction to data quality

Dealing with a lot of data is a time-consuming activity, but dealing with a lot of data and ensuring its high value is even more complicated. It's one of the reasons why data quality should never be neglected. After all, it's one of the components providing accurate business insights and facilitating strategic decisions.

February 18, 2018 • General Big Data

Polyglot persistence - definition and examples

The popularization of NoSQL data stores brought a new concept in data management called polyglot persistence. This term is very similar to polyglot programming and it'll be presented below.

February 11, 2018 • General Big Data

Log-structured file system

Sequential writes have proven themselves in distributed data-driven systems. Usually they perform better than random writes, especially in write-intensive systems. Besides their link to Big Data, sequential writes are also related to another type of system called log-structured file systems, which were defined in the late 1980s.

December 10, 2017 • Big Data algorithms

HyperLogLog explained

Counting the number of distinct elements can appear to be a simple task in classical web service-based applications. After all, we usually deal with a small subset of data that simply fits in memory and can be counted with data structures such as sets. But the same task is less obvious in Big Data applications, where approximation algorithms can come to the rescue.

December 10, 2017 • General Big Data

Dynamo paper and consistent hashing

One of the previous posts presented partitioning strategies. Among the described techniques was hash partitioning based on the number of servers. The drawback of this method was its lack of flexibility: when a new server is added, we have to remap nearly all the data. Fortunately, an alternative to this "primitive" hashing exists and it's called consistent hashing.
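
The difference can be sketched with a small experiment that counts how many keys change owner when a fifth server joins four existing ones. Everything here is an assumption made for illustration: the `HashRing` class, the use of MD5 as a stable hash, the 100 virtual nodes per server and the server names are not taken from the Dynamo paper.

```python
import bisect
import hashlib
from typing import Dict, List


def stable_hash(value: str) -> int:
    # MD5 is used only because it is deterministic across runs, not for security.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def modulo_owner(key: str, servers: List[str]) -> str:
    # "Primitive" hashing: the owner depends directly on the number of servers.
    return servers[stable_hash(key) % len(servers)]


class HashRing:
    """Consistent hashing ring with virtual nodes (hypothetical helper)."""

    def __init__(self, servers: List[str], vnodes: int = 100) -> None:
        self._points: List[int] = []
        self._owners: Dict[int, str] = {}
        for server in servers:
            for i in range(vnodes):
                point = stable_hash(f"{server}#{i}")
                self._owners[point] = server
                bisect.insort(self._points, point)

    def owner(self, key: str) -> str:
        # The first point clockwise from the key's hash owns the key.
        index = bisect.bisect_right(self._points, stable_hash(key)) % len(self._points)
        return self._owners[self._points[index]]


keys = [f"key-{i}" for i in range(10_000)]
old_servers = ["s1", "s2", "s3", "s4"]
new_servers = old_servers + ["s5"]

moved_modulo = sum(modulo_owner(k, old_servers) != modulo_owner(k, new_servers) for k in keys)
old_ring, new_ring = HashRing(old_servers), HashRing(new_servers)
moved_ring = sum(old_ring.owner(k) != new_ring.owner(k) for k in keys)
print(f"modulo hashing moved {moved_modulo} keys, consistent hashing moved {moved_ring}")
```

With modulo hashing most keys change owner, while on the ring only the keys taken over by the new server move, which is roughly one fifth of them in this setup.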

November 26, 2017 • General Big Data

Data partitioning strategies

Every data processing pipeline can have a source of contention. One of them can be the location of the data. When all entries are read from a single place by dozens or hundreds of workers, the data source can respond more slowly. One solution to this problem is partitioning.
