Looking for something else? Check the categories of Data engineering:
Apache Airflow, Big Data, Big Data algorithms, Big Data problems - solutions, Data engineering patterns, Databricks, General Big Data, General data engineering, Graphs, SQL
If not, below you can find all articles belonging to Data engineering.
A 3-day bug hunt on a 3-person team costs up to €7,200 in lost engineering time. This workshop teaches you to prevent that: unit tests, data tests, and integration tests for PySpark and Databricks Lakeflow, including Spark Declarative Pipelines.
Every data processing pipeline can have a source of contention. One of them is the location of the data. When dozens or hundreds of workers read all entries from a single place, the data source can respond more slowly. One solution to this problem is partitioning.
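As a rough illustration of the partitioning idea, the sketch below spreads records over several partitions with a stable hash, so that each worker could read its own slice instead of one shared location. The names `partition_for` and `partition_records`, the `user-N` keys, and the choice of CRC32 are assumptions made for this example and do not come from the article.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition number with a hash that is stable across runs."""
    # CRC32 is used here because Python's built-in hash() of str is salted per process.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_records(records, num_partitions):
    """Group (key, value) records by their partition number."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return dict(partitions)


# Hypothetical input: 1000 records keyed by a user identifier.
records = [(f"user-{i}", i) for i in range(1000)]
partitions = partition_records(records, num_partitions=4)

for partition_id in sorted(partitions):
    print(partition_id, len(partitions[partition_id]))
```

With such a layout, a worker assigned to partition 2 only needs to read `partitions[2]`, which reduces the pressure on any single storage location.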
Since cloud providers gained popularity, serverless processing has become a serious alternative to cluster-based data pipelines. Event-based applications are often cheaper than running separate processing jobs on clusters. However, using serverless (and not only serverless) in distributed and stateful computing can sometimes be difficult. One property often helps with many of these problems: idempotence.
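To make idempotence concrete, here is a minimal sketch assuming that every event carries a unique identifier. Writing the same event twice, for example after a retried serverless invocation, leaves the stored state unchanged. The `IdempotentSink` class and the event shape are hypothetical and only serve the illustration.

```python
class IdempotentSink:
    """In-memory sink where writes are keyed by event id, so retries overwrite instead of duplicating."""

    def __init__(self):
        self._rows = {}

    def write(self, event_id: str, payload: dict) -> None:
        # An upsert by key: applying the same write again produces the same state.
        self._rows[event_id] = payload

    def total_amount(self) -> int:
        return sum(row["amount"] for row in self._rows.values())

    def count(self) -> int:
        return len(self._rows)


sink = IdempotentSink()
sink.write("evt-1", {"amount": 10})
sink.write("evt-2", {"amount": 5})
sink.write("evt-1", {"amount": 10})  # simulated retry of the first event

print(sink.count(), sink.total_amount())
```

An append-only sink would have counted the retried event twice; keying the write by the event identifier is one simple way to make the operation safe to repeat.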
Almost every year a new concept of data-centric architecture appears. In 2014, Jay Kreps published the Kappa architecture. One year later, another concept emerged: the architecture called Zeta.
Previously we discovered two popular architectures in Big Data systems: lambda and kappa. Because they were new and fairly long concepts to explain, we deliberately ignored tools.
When I did a small Hadoop/MapReduce POC project some years ago, based on the Million Song Dataset (my old blog in French), I deliberately omitted the part about architecture. It was a mistake, because a correctly designed architecture is as important as the code written behind it.
At first, Big Data can seem to be strictly tied to one tool: Hadoop. However, that is a misunderstanding of the concept, because it hides more interesting things.