Cloud articles

Looking for something else? Check the categories of Cloud:

Data engineering on AWS Data engineering on Azure Data engineering on GCP Data engineering on the cloud

If not, below you can find all articles belonging to Cloud.

Object stores on the cloud

The next step of my multi-cloud exploration will be object stores. In the article I will try to find similarities between S3, Storage Account and GCS.

Continue Reading →

Dead-letter pattern on the cloud

Data is not always as clean as we would like it to be. The statement is even more true for semi-structured formats like JSON, where we feel working with a structure, but unfortunately, it's not enforced. Hence, from time to time, our code can unexpectedly fail. To handle this problem - as for many others - there is a pattern. It's called dead-letter qnd I will describe it below in the context of cloud services.

Continue Reading →

Azure Data Factory control flows in Apache Airflow

How to orchestrate your data pipelines on the cloud? Often, you will have a possibility to use managed Open Source tools like Cloud Composer on GCP or Amazon Managed Workflows for Apache Airflow on AWS. Sometimes, you will need to use cloud services like for Azure and its Data Factory orchestrator. Is it complicated to create Data Factory pipelines with the Apache Airflow knowledge? We'll see that in this blog post.

Continue Reading →

Streaming data sources on the cloud

Streaming broker is one of very common entry points for modern data systems. Since they're running on the cloud, and that one of my goals for this year is to acquire a multi-cloud vision, it's a moment to see what AWS, Azure and GCP propose in this field!

Continue Reading →

My journey to Azure Data Engineer Associate

I'm happy to complete my quest for data engineering certification on top of 3 major cloud providers. Last year I became AWS Big Data certified, in January a GCP Data Engineer, and more recently, I passed DP-200 and DP-201 and became an Azure Data Engineer Associate. Although DP-203 will soon replace the 2 exams, I hope this article will help you prepare for it!

Continue Reading →

Serverless streaming on AWS - an overview

If you already worked on AWS and tried to implement streaming applications, you certainly noticed one thing. There is no single way to do it! And if you didn't notice that, I hope that this blog post will convince you, and by the way, help you to get a better understanding of the available solutions.

Continue Reading →

What's new on the cloud for data engineers - part 3 (02-04.2021)

It's time for the 3rd part of "What's new on the cloud for data engineers" series. This time I will cover the changes between February and April.

Continue Reading →

Make your data disappear on the cloud

Even though the storage is cheap and virtually unlimited, it doesn't mean we have to store all the data all the time. And to deal with this lifecycle requirement, we can either write a pipeline that will remove obsolete records or we can rely on the cloud services offerings for data management. I propose a short overview of them in this blog post.

Continue Reading →

GCP Dataflow by an Apache Spark guy

Some months ago I wrote a blog post where I presented BigQuery from a perspective of an Apache Spark user. Today I will do the same exercise but applied to the same category of data processing frameworks. In other words, I will try to understand GCP Dataflow thanks to my Apache Spark knowledge!

Continue Reading →

AWS Redshift vs GCP BigQuery

Despite the recent architectural proposals with the lakehouse principle, a data warehouse is still an important part of a data system. But there is no "a single way" to do it and if you analyze the cloud providers, you will see various offerings like Redshift (AWS) or BigQuery (GCP), presented in this article.

Continue Reading →

GCP BigQuery by an Apache Spark guy

One of the steps in my preparation for the GCP Data Engineer certificate was the work with "Google BigQuery: The Definitive Guide: Data Warehousing, Analytics, and Machine Learning at Scale" book. And to be honest, I didn't expect that knowing Apache Spark will help me so much in understanding the architectural concepts. If you don't believe, I will try to convince you in this blog post.

Continue Reading →

GCP BigTable or AWS DynamoDB, yet another comparison

As you know from the last 2020 blog post, one of my new goals is to be proficient at working with AWS, Azure and GCP data services. One of the building blocks of the process is finding some patterns and identifying the differences. And before doing that exercise for BigTable (GCP) and DynamoDB (AWS), I thought both were pretty the same. However, you can't imagine how wrong I was with this assumption!

Continue Reading →

My journey to GCP Data Engineer

Last December I passed the GCP Data Engineer exam and got my certification as a late Christmas gift! As for AWS Big Data specialty, I would like to share with you some feedback from my preparation process. Spoiler alert: I did it without any online course!

Continue Reading →

What's new on the cloud for data engineers - part 2 (11.2020-01.2021)

It's time for the second update with the news on cloud data services. This time too, a lot of things happened!

Continue Reading →

Lakehouse and BigQuery?

You know me already, I'm a big fan of Apache Spark but also of all kinds of patterns. And one of the patterns that nowadays gains in popularity is lakehouse. Most of the time (always?), this pattern is implemented on top of an ACID-compatible file system like Apache Hudi, Apache Iceberg or Delta Lake. But can we do it differently and use another storage, like BigQuery?

Continue Reading →

What cloud features for data processing patterns (ETL/ELT)?

During my study of BigQuery I found an ETL pattern called feedback loop. Since I never heard about it before, I decided to spend some time and search for other ETL patterns and the cloud features we could use in them.

Continue Reading →

What's new on the cloud for data engineers - part 1 (08-10.2020)

Cloud computing is present in my life for 4 years and I never found a good system to keep myself up to date. It's even more critical at this moment, when I'm trying to follow what happens on the 3 major providers (AWS, Azure, GCP). Since blogging helped me to achieve that for Apache Spark, and by the way learn from you, I'm gonna try the same solution for the cloud.

Continue Reading →

An ideal cloud for a data engineer

I had a chance to use, for a longer or shorter period of time, 3 different cloud providers. In this post I would like to share with you, how my perfect cloud provider could look like.

Continue Reading →

DataFrames for analytics - Glue DynamicFrame

When I came to the data world, I had no idea what the data governance was. One of the tools which helped me to understand that was AWS Glue. I had a chance to work on it again during my AWS Big Data specialty exam preparation and that's at this moment I asked myself - "DynamicFrames?! What's the difference with DataFrames?" In this post I'll try to shed light on it.

Continue Reading →

AWS data services security - encryption, authentication and exposition

When I was preparing my AWS Big Data specialty certification, I was not comfortable with 2 categories, the visualization and security. Because of that I decided to work on them, starting with the latter one which can have a more direct impact. The work that I initiate with this post about data security practices on AWS data services.

Continue Reading →