Let's Work Together

I am a freelance Data Engineer, author of Data Engineering Design Patterns, and a Databricks MVP. I help companies build scalable, maintainable, and cost-effective data platforms.


πŸ† Trust & Authority

  • Databricks MVP: Recognized for technical expertise and contributions to the Spark/Databricks ecosystem.
  • Author at O'Reilly: A deep dive into architectural trade-offs in Data Engineering Design Patterns.
  • Speaker: Frequent presenter at Data+AI Summit, Spark Summit, Delta Lake webinars, and NDC.

🛠️ How I Can Help You

I specialize in the "You build it, you run it" philosophy. I don't just deliver code; I deliver automated, tested, and observable systems.

Core Expertise

  • Databricks & Apache Spark: Optimization, migration, and architectural design.
  • Cloud Data Services: Expert-level implementation on AWS, Azure, and GCP.
  • Streaming & Batch: Real-time pipelines (Structured Streaming, Flink, Beam) and robust ETL/ELT.
  • Software Craftsmanship: Clean code (Scala, Python, Java), CI/CD (Gitlab, Jenkins), and IaC (Terraform).

📂 Recent Projects & Impact

I focus on solving complex architectural puzzles, not just connecting tools:

| Category | Problem | Tech stack | Result |
| --- | --- | --- | --- |
| Data cleansing | A stateful streaming job preparing data for the Silver layer, including standardization, reformatting, and deduplication. | Apache Spark Structured Streaming, AWS, Jenkins, Scala | |
| Data dispatching | A real-time serverless job classifying data as valid or invalid and dispatching it to dedicated streams. | AWS, Scala | Resolved cluster maintenance friction by migrating the dispatching logic to serverless functions. |
| Data dispatching | A real-time serverless job classifying data as valid or invalid and dispatching it to dedicated streams. | Apache Beam, GCP, Java, Cloud Build | Simplified the overall architecture by clearly classifying data as "ready for processing" or "requiring investigation", and centralized the dispatching logic in a single, scalable, fully serverless component instead of many isolated jobs doing the same work. |
| Data migration | Batch migration from Hadoop Hive to GCP BigQuery. | Apache Spark, GCP, Scala | Enabled decommissioning of the old and expensive Hadoop stack. |
| Data migration | Migrating PowerBI Dataflows to Databricks with PySpark. | Azure, PowerBI, Databricks, Python | Resolved a knowledge gap in the team responsible for this data transformation. With an as-code solution fully versioned on Github and a clear review process, the team members gained a better understanding of the overall work done with the data. |
| Data preparation | Data cleansing, normalization, and enrichment for a predictive ML use case. | Apache Airflow, Apache Spark SQL, Azure, Azure DevOps, Python | |
| Data privacy | GDPR right-to-be-forgotten system in the data warehouse based on GCP BigQuery. | Apache Airflow, GCP, SQL, Python | Ensured regulatory data erasure through scalable, automated, and auditable SQL-based deletion logic. |
| Data validation | Real-time, stateful data quality pipeline validating the order of the transformed events. | Apache Spark Structured Streaming, AWS, Scala | Improved the feedback provided by the customer support team. |
| Data visualization | Delivering sensor data to PowerBI in near real time. | Azure, Azure DevOps, Python | Helped the IoT monitoring team gain better insight into key health indicators for the IoT devices. |
| ELT | Various business-related, SQL-only batch pipelines, including session generation, data cleansing, and data preparation for data marts used by data analysts. | Apache Airflow, GCP, Github Actions, Python, SQL | Improved the resilience of the pipelines by implementing idempotency patterns (sketched after the table), reducing maintenance pressure on the team. Previously, team members had to perform manual actions on the tables, such as DELETEs or TRUNCATEs, before starting any reprocessing. |
| Reverse ETL | Real-time streaming pipeline integrating relevant events into an external CRM tool. | Apache Beam, GCP, Java, Cloud Build | Enabled real data insights for the marketing team, based on the data present in the data mart. |
| Ordered streaming data delivery | Streaming pipeline delivering multi-tenant data under ordering and maximum-latency constraints. | Apache Spark Structured Streaming, AWS, Gitlab CI, Scala | |
| Scaling automation | Automating a project deployment for new regions. | AWS, Databricks, Python | Reduced the deployment time from 2 weeks to 1 day by replacing manual configuration with automated Jinja-powered templates deployed from a configuration file (see the templating sketch after the table). |
| Sessionization | Hourly batch jobs generating sessions, followed by ingestion into the data warehouse layer. I also gave a talk on this topic. | Apache Airflow, AWS, Apache Spark SQL, Scala, Python, Jenkins | Implemented the backbone for future data use cases within the organization by rewriting an unsuccessful SQL-on-Teradata PoC into a more scalable cloud environment leveraging Apache Spark for distributed processing. |
| Sessionization | Real-time user session generation with session windows (see the session window sketch after the table). | Apache Beam, GCP, GCP Cloud Build, Java, Terraform | Improved the precision of geolocated recommendations by providing additional context to the backend service with session windows generated in near real time (< 2 min of latency). |
| Sessionization | Serverless sessionization pipeline relying on Change Data Capture to provide user activity insight in near real time. | AWS, Scala, Terraform | Established a single source of truth for audience analytics by integrating with an external audience measurement system, providing the organization with certified viewership validation for all broadcast channels. |
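
The idempotency pattern mentioned in the ELT row deserves a concrete illustration: every run rewrites its own slice of the target table, so a backfill can be replayed without any manual DELETE or TRUNCATE beforehand. The sketch below assumes a BigQuery target and the Python client; dataset, table, and column names are invented for the example and are not a client's actual schema.

```python
# Minimal sketch of an idempotent ELT step: a MERGE keyed on a stable identifier
# makes the load safe to re-run, so reprocessing a day converges to the same
# state instead of duplicating rows. All names below are illustrative.
from google.cloud import bigquery

MERGE_SQL = """
MERGE `analytics.daily_sessions` AS target
USING (
  SELECT session_id, user_id, session_start, session_end
  FROM `staging.raw_sessions`
  WHERE DATE(session_start) = @run_date
) AS source
ON target.session_id = source.session_id
WHEN MATCHED THEN UPDATE SET
  user_id = source.user_id,
  session_start = source.session_start,
  session_end = source.session_end
WHEN NOT MATCHED THEN INSERT (session_id, user_id, session_start, session_end)
  VALUES (source.session_id, source.user_id, source.session_start, source.session_end)
"""

def load_day(client: bigquery.Client, run_date: str) -> None:
    """Loads one logical date; safe to re-run for backfills."""
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("run_date", "DATE", run_date)]
    )
    client.query(MERGE_SQL, job_config=job_config).result()

if __name__ == "__main__":
    load_day(bigquery.Client(), "2024-05-01")
```

With this shape, the orchestrator only passes the execution date; nobody has to clean table state by hand before a reprocessing.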
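
The sessionization rows rely on session windows: a user's events are grouped together until a configurable gap of inactivity closes the session. The real-time project above was built with Apache Beam in Java; purely as an illustration of the concept, here is what the same idea looks like in Spark Structured Streaming, with a placeholder Kafka source and made-up gap and watermark durations.

```python
# Conceptual sketch of session windows in Spark Structured Streaming (>= 3.2).
# The Beam pipeline mentioned above applies the same idea; broker, topic, gap,
# and watermark values here are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("session-window-sketch").getOrCreate()

events = (spark.readStream
          .format("kafka")  # requires the spark-sql-kafka connector on the classpath
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "user-events")
          .load()
          .selectExpr("CAST(key AS STRING) AS user_id", "timestamp AS event_time"))

sessions = (events
            .withWatermark("event_time", "10 minutes")  # tolerate late events
            .groupBy(F.session_window("event_time", "5 minutes"), "user_id")  # 5-minute inactivity gap
            .agg(F.count("*").alias("events_in_session")))

# Console sink for the sketch only; a real pipeline writes to a stream or a table.
query = sessions.writeStream.outputMode("append").format("console").start()
query.awaitTermination()
```

The inactivity gap and the watermark are the two knobs that trade session completeness against end-to-end latency.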
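
For the scaling automation row, the gain came from config-driven templating: each deployable resource is described once as a Jinja template and rendered for every target region from a single configuration file. The snippet below only sketches that mechanism; the job specification, field names, and region list are invented for the example.

```python
# Illustration of config-driven deployment templating with Jinja2. Everything
# below (job spec shape, fields, regions) is made up for the example; in the
# real project the rendered specs were deployed automatically per region.
import json
from jinja2 import Template

JOB_TEMPLATE = Template("""
{
  "name": "ingestion-{{ region }}",
  "tags": {"region": "{{ region }}", "owner": "data-platform"},
  "notebook_task": {
    "notebook_path": "/pipelines/ingestion",
    "base_parameters": {"region": "{{ region }}", "bucket": "{{ bucket }}"}
  }
}
""")

# In the real setup this list lives in a versioned configuration file.
REGIONS = [
    {"region": "eu-west-1", "bucket": "raw-data-eu-west-1"},
    {"region": "us-east-1", "bucket": "raw-data-us-east-1"},
]

for region_config in REGIONS:
    job_spec = json.loads(JOB_TEMPLATE.render(**region_config))
    print(f"rendered job spec for {job_spec['name']}")
```

Onboarding a new region then becomes a one-line change in the configuration file plus a review, instead of days of manual setup.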

🤝 Working With Me

To ensure the best results, I typically work under the following model:

  • Role: I act as a Senior/Staff Data Engineer or Architectural Consultant.
  • Methodology: Async-first and documentation-heavy, with a strong emphasis on code reviews and mentoring your internal team.
  • Engagement: I prefer project-based milestones or long-term part-time advisory roles.
  • Location: 100% Remote (CET Timezone).

📩 Ready to chat?

If you have a data engineering challenge, especially one involving Databricks, Spark, or streaming, I'd love to hear about it.

To help me give you a quick "Yes/No" on fit, please include:

  • The nature of the technical problem.
  • Your current cloud/tech stack.
  • Expected timeline/urgency.

👉 Email me at contact@waitingforcode.com.