Below you can find all articles belonging to Data engineering.
Change Data Capture (CDC) is a technique that helps to move smoothly from a classical, static data warehouse solution to a modern streaming-centric architecture. To do that, you can use solutions like Debezium, which connects an RDBMS or MongoDB to Apache Kafka. In this post, I will check whether CDC can also apply to other data stores like Apache Cassandra, Elasticsearch and AWS DynamoDB.
When I was analyzing the API of Gelly, I was quite surprised by its support for bipartite graphs. First, because I didn't know that data structure, and second, because it wasn't supported in the other frameworks I analyzed. Hence, I added that graph structure to my backlog and some time later wrote a post to explain it better.
In the previous post from the Big Data patterns implemented series, I wrote about a pattern called fan-in ingress. The idea was to consolidate the data coming from different sources. This time I will cover its companion called fan-out ingress, which does exactly the opposite.
Some time ago I found an interesting article describing the two faces of data pipeline synchronization: orchestration and choreography. The article ended with an interesting proposal to use both of them as a hybrid solution. In this post, I will try to implement that idea.
Stream processing is great because it provides low latency and quite fresh insight. On the other hand, we won't always need such latency, and in these situations batch processing will often be a better fit because of its apparently simpler semantics. Data architectures perceive the batch layer differently. Kappa, which is a streaming-based model, makes it optional when the streaming broker can guarantee long data retention. But if that's not the case, the data must be copied into some more persistent storage like a distributed file system.
The series about the implementation of Big Data patterns continues. This time I will focus on a streaming pattern called fan-in ingress.
Metadata is sometimes disregarded, but very often it helps to retrieve information more easily and quickly. One such use case is the metadata of Apache Parquet files, where the stats about each column's content are stored. Without parsing all the rows, the reader can know whether the file may contain what they are looking for. Metadata is also a part of one of the Big Data patterns called automated processing metadata insertion.
Some time ago I found a site listing Big Data patterns (link in "Read also" section). However, that site describes them from a very general point of view and it's not always easy to figure out the what, why and how. That's why I decided to start a new series of posts where I will try to describe these patterns and give some more technical context.
Two great features that I experienced when I was working with Dataflow were its serverless character and its auto-scalability. That's why, when I first saw the Apache Spark on Kubernetes initiative, I was more than happy at the prospect of one day writing pipelines that automatically adapt to the workload. That also encouraged me to explore horizontal scalability, and this post is the first result of my recent research on that topic.
I have already described the grouping sets feature in the context of Apache Spark. But natively they are a part of the SQL standard and that's why I would like to extend the previous post here. After all, you don't need Big Data to use them - even though nowadays it's difficult not to deal with it.
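To make the idea of stats-based skipping concrete, here is a minimal sketch in plain Python. It does not use the Parquet API: `ColumnStats`, `may_contain`, `files_to_read` and the file names are hypothetical names invented for this illustration, and the only assumption is that each file exposes the minimum and maximum value of a column.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical min/max summary of one column in one file."""
    min_value: int
    max_value: int


def may_contain(stats: ColumnStats, wanted: int) -> bool:
    # The stats can only rule a file out; a match still requires reading it.
    return stats.min_value <= wanted <= stats.max_value


def files_to_read(stats_per_file: Dict[str, ColumnStats], wanted: int) -> List[str]:
    # Keep only the files whose value range could include the wanted value.
    return [name for name, stats in stats_per_file.items() if may_contain(stats, wanted)]


# Assumed example data: stats of a "user_id" column in three files.
user_id_stats = {
    "part-0.parquet": ColumnStats(1, 100),
    "part-1.parquet": ColumnStats(101, 200),
    "part-2.parquet": ColumnStats(150, 300),
}

candidates = files_to_read(user_id_stats, 180)
print(candidates)
```

In this simplified model the first file is skipped without reading a single row, which is the benefit the metadata brings.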
Last time we discovered the INTERSECT operator. To recall it quickly, it returns the rows that are present in both of the combined datasets. Today we'll discover another operator that does the opposite and is called, depending on the vendor, MINUS or EXCEPT.
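The difference between the two operators can be sketched with Python's built-in `sqlite3` module. SQLite uses the EXCEPT spelling; the table names `left_set` and `right_set` and their content are assumptions made up for this illustration.

```python
import sqlite3

# In-memory database with two small, made-up datasets.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE left_set (id INTEGER);
    CREATE TABLE right_set (id INTEGER);
    INSERT INTO left_set VALUES (1), (2), (3);
    INSERT INTO right_set VALUES (2), (3), (4);
""")

# INTERSECT keeps the rows present in both datasets.
intersect_rows = [row[0] for row in conn.execute(
    "SELECT id FROM left_set INTERSECT SELECT id FROM right_set ORDER BY id"
)]

# EXCEPT keeps the rows of the first dataset that are absent from the second.
except_rows = [row[0] for row in conn.execute(
    "SELECT id FROM left_set EXCEPT SELECT id FROM right_set ORDER BY id"
)]

print(intersect_rows, except_rows)
conn.close()
```

Note that EXCEPT is not symmetric: swapping the two SELECT statements returns the rows that exist only in `right_set`.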
Key-value stores have the advantage of being a kind of distributed and highly available memory cache. But even though they're quite easy to manipulate thanks to the key-based access, they also come with some complicated tasks. One of them is choosing a strategy for picking a good key.
Some time ago I wrote a post about graph data processing with streams. That article was based on the X-Stream framework proposed by researchers at EPFL. On that occasion, I also mentioned the existence of a newer alternative to X-Stream, adapted to distributed workloads, called Chaos. I deliberately omitted the explanation of Chaos in the previous post because putting it alongside X-Stream would have introduced too many new concepts. But now, after some weeks of graph processing discoveries, I would like to return to the successor of X-Stream and present it in more detail.
With modern Big Data solutions like BigQuery or Apache Spark SQL, knowledge of advanced SQL concepts is important. After covering operations like window functions and grouping sets, it's time to show another interesting SQL feature, the INTERSECT operator.
The series about graph processing continues. Today it's time to analyze some major graph processing frameworks and choose the one that I'll present in more detail in upcoming posts.
Big Data often calls for denormalized storage. Joins are costly and it's often much more efficient to store all related information in a single row. Such rows with a lot of columns are called wide rows and they'll be explained in the sections below.
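As a rough illustration of the denormalization idea, the sketch below copies the related customer attributes into every order, so that a reader no longer needs a join. The `customers` and `orders` datasets and the `to_wide_rows` helper are hypothetical and only model the principle, not the storage layout of any particular data store.

```python
from typing import Any, Dict, List

# Made-up normalized datasets: orders reference customers by id.
customers: Dict[int, Dict[str, str]] = {
    1: {"name": "Alice", "country": "FR"},
    2: {"name": "Bob", "country": "PL"},
}
orders: List[Dict[str, Any]] = [
    {"order_id": 10, "customer_id": 1, "amount": 25.0},
    {"order_id": 11, "customer_id": 2, "amount": 40.0},
]


def to_wide_rows(orders: List[Dict[str, Any]],
                 customers: Dict[int, Dict[str, str]]) -> List[Dict[str, Any]]:
    """Pay the join cost once at write time by embedding customer columns."""
    wide_rows = []
    for order in orders:
        customer = customers[order["customer_id"]]
        wide_rows.append({
            **order,
            "customer_name": customer["name"],
            "customer_country": customer["country"],
        })
    return wide_rows


wide_rows = to_wide_rows(orders, customers)
print(wide_rows[0])
```

The trade-off in this sketch is duplicated customer data in exchange for reads that touch a single row.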
Because of its connected nature, the graph structure has its own branch in data mining. Thanks to this branch, we can get insight into relationships and dependencies between vertices.
As mentioned many times in previous posts, one of the most challenging tasks in distributed graph processing is partitioning. The connected nature of graph components makes partitioning hard. Fortunately, researchers continue to propose solutions.
Until now we've discovered exclusively the concepts devoted to computing distributed graphs. But the compute part can't go without storage. And since storage in the context of graphs is a topic of its own, it requires its own detailed explanation.
During the last few weeks we've discovered a lot about graph data processing in the distributed world. However, we haven't yet learned about the problems that graphs can solve, and that's as important as knowledge of the processing techniques. This post will try to make up for that delay.