Big Data patterns implemented - Complex Logic Decomposition on waitingforcode.com

Versions: Apache Airflow 1.10.3

The KISS principle applies not only to software engineering but also to data pipelines. The Complex Logic Decomposition pattern illustrates this well.

This post begins with a short description of the pattern. Next, it shows an example of the implementation with Apache Airflow.

Definition

The goal of the Complex Logic Decomposition pattern is to simplify data processing logic. Instead of writing all the logic inside one application and deploying it as a whole with a distributed data processing framework, you opt for a divide-and-conquer approach. The logic is split into smaller applications that are much easier to understand.

A good use case for this pattern is a situation where you have to process the data with different paradigms, for example with iterative and non-iterative algorithms. Instead of combining their logic inside one program, you can place them in 2 separate steps. That lets you run these steps concurrently and then process the generated results however you want in a merge step.
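
To make the idea concrete outside of any orchestrator, here is a minimal plain-Python sketch. Everything in it is an assumption made for illustration: iterative_step is a toy label propagation that loops until nothing changes, non_iterative_step is a single-pass count, and merge_step combines both outputs. The thread pool only shows that the two branches are independent; for CPU-bound Python code it would not bring real parallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def iterative_step(edges):
    """Hypothetical iterative branch: spread the smallest label along the
    edges until no label changes (toy connected components)."""
    labels = {node: node for edge in edges for node in edge}
    changed = True
    while changed:
        changed = False
        for src, dst in edges:
            smallest = min(labels[src], labels[dst])
            if labels[src] != smallest or labels[dst] != smallest:
                labels[src] = labels[dst] = smallest
                changed = True
    return labels


def non_iterative_step(visits):
    """Hypothetical single-pass branch: count the visits of every user."""
    counts = {}
    for user in visits:
        counts[user] = counts.get(user, 0) + 1
    return counts


def merge_step(labels, counts):
    """Merge step: sum the visit counts per component label."""
    merged = {}
    for user, count in counts.items():
        component = labels.get(user, user)  # users without edges keep their own id
        merged[component] = merged.get(component, 0) + count
    return merged


def run_pipeline(edges, visits):
    # The branches do not depend on each other, so both are submitted at once.
    with ThreadPoolExecutor(max_workers=2) as executor:
        labels_future = executor.submit(iterative_step, edges)
        counts_future = executor.submit(non_iterative_step, visits)
        return merge_step(labels_future.result(), counts_future.result())


result = run_pipeline([(1, 2), (2, 3), (4, 5)], [1, 2, 3, 3, 4, 6])
```

Each function can be read, tested and replaced on its own, which is the readability gain the pattern is after.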

Example

Fortunately, implementing the Complex Logic Decomposition pattern is quite easy with modern orchestration tools. With Apache Airflow you can define it like this:

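from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# update_graph_data, execute_pregel, aggregate_with_sql and merge_results are
# assumed to be defined elsewhere; each one holds the logic of a single step.
# The DAG id and the start date below are illustrative values.
dag = DAG(dag_id='complex_logic_decomposition', start_date=datetime(2019, 7, 1),
          schedule_interval=None)

with dag:
    graph_data_update = PythonOperator(
        task_id='update_graph_data',
        python_callable=update_graph_data)
    pregel_computation = PythonOperator(
        task_id='pregel_computation',
        python_callable=execute_pregel)
    sql_aggregator = PythonOperator(
        task_id='sql_aggregator',
        python_callable=aggregate_with_sql)
    graph_sql_results_reducer = PythonOperator(
        task_id='graph_sql_results_reducer',
        python_callable=merge_results)

    # Graph branch: refresh the graph data, then run the iterative computation.
    graph_data_update >> pregel_computation >> graph_sql_results_reducer
    # SQL branch: independent of the graph branch; both end in the merge step.
    sql_aggregator >> graph_sql_results_reducer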
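
If you want to see which tasks these dependencies allow to run side by side, you can rebuild the same graph without Airflow. The sketch below is only an approximation: it uses graphlib from the Python standard library (3.9+) and groups the tasks into "waves", whereas a real scheduler starts a task as soon as its own upstream tasks have finished. The execution_waves helper is a hypothetical name introduced for this example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Every task maps to the set of tasks it waits for, mirroring the >> chains above.
dependencies = {
    'update_graph_data': set(),
    'sql_aggregator': set(),
    'pregel_computation': {'update_graph_data'},
    'graph_sql_results_reducer': {'pregel_computation', 'sql_aggregator'},
}


def execution_waves(task_dependencies):
    """Group tasks into waves; the tasks of one wave have no unmet dependency."""
    sorter = TopologicalSorter(task_dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to get a stable order
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(dependencies)
for number, tasks in enumerate(waves, start=1):
    print(number, tasks)
```

The first wave contains both update_graph_data and sql_aggregator, so the graph and SQL branches can start together, and the reducer comes last because it waits for both of them.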

The pattern increases readability but it can also hurt performance. Take the example of an Apache Spark SQL pipeline. If your 2 "branches" use the same input dataset, keeping them inside a single Apache Spark application lets you benefit from faster data access, since the dataset can be kept in memory between the jobs. Otherwise, each branch has to read it from its primary store, and this input loading step is very often one of the longest. So, as always, it's up to you to decide on the trade-off between performance and isolation.
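
The cost of a shared input can be illustrated with a small simulation. This is not Spark code: CountingStore is a hypothetical stand-in for the primary store that only counts how many times the full dataset is read, and the two aggregations are made up for the example.

```python
class CountingStore:
    """Hypothetical stand-in for the primary store; it counts full reads."""

    def __init__(self, rows):
        self._rows = rows
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._rows)


def total_amount(rows):
    return sum(amount for _, amount in rows)


def distinct_users(rows):
    return len({user for user, _ in rows})


def run_decomposed(store):
    # Each branch is a separate application, so each one loads the input itself.
    total = total_amount(store.load())
    users = distinct_users(store.load())
    return total, users


def run_combined(store):
    # A single application loads the input once and reuses it from memory.
    rows = store.load()
    return total_amount(rows), distinct_users(rows)


rows = [('a', 10), ('b', 5), ('a', 7)]
decomposed_store = CountingStore(rows)
combined_store = CountingStore(rows)
decomposed_result = run_decomposed(decomposed_store)
combined_result = run_combined(combined_store)
```

Both variants return the same result, but the decomposed one reads the store twice and the combined one only once. When loading the input dominates the runtime, that extra read is the price paid for isolation.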

To wrap up, the pattern works well when you have separate datasets that must be merged or when the data must be computed with different paradigms (e.g. iterative and non-iterative). In such cases, decomposing the workflow should help not only to understand it better but very likely also to execute it in parallel and optimize it. On the other hand, when the input dataset is shared by the logic, decomposing it may have a negative impact on the overall performance.

This week's #data engineering pattern is Complex Logic Decomposition with an example of a data pipeline https://t.co/Swfpwd9uHo
— Bartosz Konieczny (@waitingforcode) July 3, 2019