Waiting for code

on waitingforcode.com

Index table pattern in NoSQL

Good write throughput and horizontal scalability are perhaps the most visible advantages of NoSQL storage systems. However, people with a solid RDBMS background often fall into a trap: in these systems, indexes cannot be created so easily. Fortunately, many patterns exist that help deal with this problem. One of them is the index table pattern.
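As a rough illustration of the idea, the sketch below keeps a second table, keyed by the attribute to query on, next to the primary table. It is a minimal in-memory sketch and not the implementation from the linked post: the `UserStore` class, the `city` attribute, and both method names are hypothetical, and plain dictionaries stand in for NoSQL tables.

```python
from collections import defaultdict


class UserStore:
    """Hypothetical store: a primary table plus a manually maintained index table."""

    def __init__(self):
        # Primary table: user_id -> record (lookups by key only).
        self.users = {}
        # Index table: city -> set of user_ids living in that city.
        self.users_by_city = defaultdict(set)

    def put(self, user_id, record):
        # If the user already exists, drop the stale index entry first.
        old = self.users.get(user_id)
        if old is not None:
            self.users_by_city[old["city"]].discard(user_id)
        # Write the primary table, then the index table.
        self.users[user_id] = record
        self.users_by_city[record["city"]].add(user_id)

    def find_by_city(self, city):
        # Read the index table, then resolve each id in the primary table.
        ids = self.users_by_city.get(city, ())
        return [self.users[user_id] for user_id in sorted(ids)]


store = UserStore()
store.put(1, {"name": "Ann", "city": "Paris"})
store.put(2, {"name": "Bob", "city": "Berlin"})
store.put(1, {"name": "Ann", "city": "Berlin"})  # Ann moves, index is updated
```

The cost of the pattern is visible in `put`: every write touches two tables, and in a real distributed store the two writes are usually not atomic, so the index may briefly lag behind the primary data.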

Introduction to data quality

Dealing with a lot of data is time-consuming, but dealing with a lot of data while ensuring its high value is even more complicated. That is one of the reasons why data quality should never be neglected. After all, it is one of the components that provide accurate business insights and facilitate strategic decisions.

RPC in Apache Spark

Communication is an important element of distributed systems. Cluster members rarely share hardware components, and the only way for them to communicate is to exchange messages in the client-server model.

Log-structured file system

Sequential writes have proven themselves in distributed data-driven systems. They usually perform better than random writes, especially in write-intensive systems. Besides their link to Big Data, sequential writes are also related to another type of system, the log-structured file system, which was defined in the late 1980s.
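To make the sequential-write idea concrete, here is a minimal sketch of an append-only log with an in-memory index. It is an assumption-laden toy, not a log-structured file system and not code from the linked post: the `AppendOnlyLog` class and its methods are hypothetical, and an `io.BytesIO` buffer stands in for a file opened in append mode.

```python
import io


class AppendOnlyLog:
    """Hypothetical key-value log: values are only ever appended, never overwritten."""

    def __init__(self):
        # Stands in for a file opened in append mode.
        self._log = io.BytesIO()
        # key -> (offset, length) of the most recent value for that key.
        self._index = {}

    def put(self, key, value):
        # Always write at the end of the log: a purely sequential write.
        offset = self._log.seek(0, io.SEEK_END)
        self._log.write(value)
        # An update just repoints the index; the old bytes stay in the log.
        self._index[key] = (offset, len(value))

    def get(self, key):
        # One index lookup, then one positioned read.
        offset, length = self._index[key]
        self._log.seek(offset)
        return self._log.read(length)

    def size(self):
        # Total bytes in the log, including superseded values.
        return self._log.seek(0, io.SEEK_END)


log = AppendOnlyLog()
log.put("a", b"v1")
log.put("b", b"xyz")
log.put("a", b"v2!")  # supersedes b"v1" without touching it
```

Because updates never modify earlier bytes, the log keeps growing; real log-structured systems reclaim the superseded space with a separate cleaning or compaction step, which this sketch omits.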

Joins in Apache Beam

Dealing with joins in relational databases is quite straightforward thanks to the underlying data structures (e.g. trees). However, joins are less convenient to work with in the data processing world, where schemaless and denormalized data rule.

Side output in Apache Beam

The possibility to define several additional inputs for a ParDo transform is not the only feature of this kind in Apache Beam. The framework also provides the possibility to define one or more extra outputs through structures called side outputs.