Below you can find all articles belonging to Data engineering.
As mentioned many times in previous posts, one of the most challenging tasks in distributed graph processing is partitioning. The connected nature of graph components makes partitioning hard. Fortunately, researchers continue to propose solutions.
Until now we've discovered exclusively the concepts devoted to computing distributed graphs. But the compute part can't go without storage. And since storage in the context of graphs has its own specifics, it requires its own detailed explanation.
Over the last few weeks we've discovered a lot about graph data processing in the distributed world. However, we haven't yet learned about the problems graphs can solve, and that's as important as knowing the processing techniques. This post tries to make up for that delay.
The previously described vertex-centric model is not the only one used to process graph data. Another one uses subgraphs as the processing unit.
Use cases of streaming surprise me more and more. In my recent research about graph processing in the Big Data era I found a paper presenting a graph framework that works on vertices and edges directly from a stream. In case you've missed that paper, I'll try to present this idea to you.
Graph data processing, even though it seems to be less popular than streaming or file processing, is an important member of data-oriented systems. And like its "colleagues", it also has several different processing models. The first one described on this blog is called vertex-centric.
On this blog I've covered topics about relational databases, key-value stores, search engines and log systems. There are still some storage systems deserving some learning effort, and one of them is graphs, considered here in the context of data processing.
Distributed computing opened a lot of possibilities and horizontal scaling is only one of them. But at the same time it brought some new problems that we need to address when designing applications. Writing data to different data stores inside one transaction is one of them.
A lot of names in IT come from real-world phenomena. One example is epidemic protocols, aka gossip protocols, covered in this post.
Having data close to the computation has a lot of advantages. This idea, called data locality, is not new; it was popularized by Hadoop MapReduce. Despite that, it's worth recalling some of its main points and trying to adapt it to modern data pipelines, which most of the time are based on cloud services.
Some time ago I started a series of posts about immutability in data-oriented applications. One of the approaches helping to deal with it was based on version flags. But fortunately it's not the only solution - especially for those who don't like to mix valid and invalid data in a single place.
Querying big amounts of data has never been as simple as it is nowadays. Amazon Redshift and Azure SQL Data Warehouse are among the solutions. But using them wouldn't be possible without a more general concept known as MPP (Massively Parallel Processing).
Keeping different databases synchronized is not an easy task. Thankfully, some techniques exist to facilitate it, and one of them is called the Change Data Capture pattern.
Even though the Bloom filter is perfectly suited to bounded data, it also has some interesting implementations for unbounded sources. One of them is the Stable Bloom filter.
The Bloom filter has a lot of versions addressing its main drawbacks - bounded source and add-only character. One of them is the Scalable Bloom filter, which fixes the first issue.
After HyperLogLog and Count-min sketch it's time to cover another popular probabilistic algorithm - the Bloom filter.
This post follows the series about approximation algorithms. But unlike before, this time we'll focus on a simpler solution, linear probabilistic counting.
For many years the Paxos protocol was one of the most serious solutions for consensus problems. However, in 2013 Diego Ongaro and John Ousterhout from Stanford University proposed an alternative called Raft.
Window functions are another SQL feature that we'll probably discover while working with data-oriented applications. They can also be used in more classical programs though.
Immutability is a precious property of systems dealing with a lot of data. It's especially true when something goes wrong and we must recover quickly. Since the data is immutable, no cleaning step is needed and, with some additional computing power, the data can be regenerated efficiently.