Big Data patterns implemented - Complex Logic Decomposition

KISS principle is valid not only for software engineering but also for data pipelines. The pattern called Complex Logic Decomposition illustrates this pretty well.

4-day workshop · In-person or online

What would it take for you to trust your Databricks pipelines in production?

A 3-day bug hunt on a 3-person team costs up to €7,200 in lost engineering time. This workshop teaches you to prevent that — unit tests, data tests, and integration tests for PySpark and Databricks Lakeflow, including Spark Declarative Pipelines.

Unit, data & integration tests
Medallion architecture & Lakeflow SDP
Max 10 participants · production-ready templates
See the full curriculum → €7,000 flat fee · cohort of up to 10
Bartosz Konieczny
Bartosz
Konieczny

This post begins with a short description of the pattern. Next, it shows an example of the implementation with Apache Airflow.

Definition

The goal of Complex Logic Decomposition pattern is to simplify data processing logic. Hence, instead of writing all the logic inside one application and deploy all of it with a distributed data processing framework, you will rather opt for a divide-and-conquer approach. The logic will be then split into smaller and much easier to understand applications.

A good use case of this pattern is the situation where you have to process the data in different paradigms, like with iterative and not iterative algorithms. Instead of combining the logic of them inside one program, you can locate them in 2 different steps. It will allow you to run these steps concurrently and do whatever you want with generated results in a kind of merge step.

Example

Fortunately, implementation of Complex Logic Decomposition pattern is quite easy with modern orchestration tools. With Apache Airflow you can define it like this:

with dag:
    graph_data_update = PythonOperator(
        task_id='update_graph_data',
        python_callable=update_graph_data)
    pregel_computation = PythonOperator(
        task_id='pregel_computation',
        python_callable=execute_pregel)
    sql_aggregator = PythonOperator(
        task_id='sql_aggregator',
        python_callable=aggregate_with_sql)
    graph_sql_results_reducer = PythonOperator(
        task_id='graph_sql_results_reducer',
        python_callable=merge_results)

    graph_data_update >> pregel_computation >> graph_sql_results_reducer
    sql_aggregator >> graph_sql_results_reducer

The pattern increases readability but it can also have a bad impact on the performance. Let's take the example of an Apache Spark SQL pipeline. If your 2 "branches" use the same input dataset, keeping it inside Apache Spark you can benefit from faster data access since it can be kept in memory between the jobs. Otherwise, you will need to read them from their primary store and very often this input loading step is one of the longest ones. So like always, it's up to you to make a decision about trade-offs performance vs isolation.

To wrap-up, the pattern works well in the cases where you have separate datasets that must be merged or when the data must be computed with different paradigms (e.g. iterative and not iterative). In such a case, decomposing the workflow should help not only to understand it better but very likely to execute it in parallel and optimize it. On the other side, when the input dataset is shared by the logic, decomposing it may have a negative impact on the overall performance.

Data Engineering Design Patterns

Looking for a book that defines and solves most common data engineering problems? I wrote one on that topic! You can read it online on the O'Reilly platform, or get a print copy on Amazon.

I also help solve your data engineering problems contact@waitingforcode.com đź“©