Big Data algorithms articles

Reservoir sampling for bounded and unbounded data

Every time when I see a new thing, I try to note it somewhere and come back later. The "later" is driven by how many times I will meet that thing. More often I see it in the books or conferences, earlier I deep delve into it. And that's the story of this post about reservoir sampling algorithm I met twice during last month.

Continue Reading β†’

Stable Bloom filter

Even though Bloom filter perfectly suits to bounded data it also has some interesting implementations for unbounded sources too. One of them is Stable Bloom filter.

Continue Reading β†’

Scalable Bloom filter

Bloom filter has a lot of versions addressing its main drawbacks - bounded source and add-only character. One of them is Scalable Bloom filter that fixes the first issue.

Continue Reading β†’

Bloom filter

After HyperLogLog and Count-min sketch it's time to cover another popular probabilistic algorithm - Bloom filter.

Continue Reading β†’

Cardinality estimation with linear probabilistic counting

This post follows the series about approximation algorithms. But unlike before, this time we'll focus on simpler solution, the linear probabilistic counting.

Continue Reading β†’

Consensus problem and the Raft consensus algorithm

During long years the Paxos protocol was one of most serious solutions for consensus problems. However in 2013 Diego Ongaro and John Ousterhout from Stanford University proposed an alternative called Raft.

Continue Reading β†’

Conflict-Free Replicated Data Types - flags, graphs and maps

Previous post about the Conflict-free Replicated Data Types presented some of basic structures of this type. This one will describe some of recently uncovered types such as: flags, graphs and arrays.

Continue Reading β†’

Conflict-Free Replicated Data Type

Pessimistic replication requires a synchronous communication between the main node writing the data and the replicas. However in some cases the optimistic replication can be more efficient and still guarantee the same final result. One of solutions from this category are conflict-free replicated data types.

Continue Reading β†’

Frequency estimation with Count-min sketch

HyperLogLog algorithm described some weeks ago is not the single one approximate solution in the world of Big Data applications. Another one is Count-min sketch.

Continue Reading β†’

Conflict resolution in distributed applications - vector clocks

Dynamo paper, already quoted here in other posts, was published in 2007. It's 10 years ago. Even though the time passed, it still proposes interesting concepts to know for data-driven applications. And one of them are vector clocks used to conflict resolution.

Continue Reading β†’

HyperLogLog explained

Counting the number of distinct elements can appear a simple task in classical web service-based applications. After all, we usually have to deal with a small subset of data that simply fits in memory and can be automatically counted with the data structures as sets. But the same task is less obvious in Big Data applications where the approximation algorithms can come to the aid.

Continue Reading β†’