Data engineering on the cloud articles

What's new on the cloud for data engineers - part 4 (05-08.2021)

It's time for the 4th part of the "What's new on the cloud for data engineers" series. This time I will cover the changes between May and August.

Cloud networking aspects for data engineers

Guess what topic I was afraid of at the beginning of my cloud journey as a data engineer? Networking! VPC, VPN, firewalls, ... I thought I would be able to live without the networking lessons from school, but how wrong I was! IMO, as a data engineer, you should know a bit about networking since it's often related to the security part of the architectures you'll design. In this article, I'll share some networking points I wish I had known before I started working on the cloud.

Costs management on the cloud

The easiest way to learn is by doing, but what if it involves leaving your credit card number beforehand? I've never been comfortable with that, but there is no other way to get hands-on experience on the cloud. Fortunately, it doesn't mean you can't control your expenses. In this article, we'll see how.

Serverless streaming processing on the cloud: Azure Stream Analytics vs AWS Kinesis Data Analytics

I wrote this blog post while preparing for Azure's DP-200 and DP-201 certifications. Why? To tidy up my thoughts, organize what I learned about Azure Stream Analytics, and compare it with what I knew about AWS Kinesis Data Analytics.

Data architectures on the cloud

I still haven't fully understood why the story of data architectures is a story of Greek letters. Over time, their context changed and they had to adapt from on-premises environments, often sharing the same main services, to the cloud. In this blog post, I will briefly present data architectures and try to map them to cloud data services on AWS, Azure and GCP. Spoiler alert, there will be more pictures than usual!

Windows to the clouds

Guess what? My time-consuming learning mode based on reading the documentation paid off again! This time on Azure, because while reading about Stream Analytics windows I discovered that I had missed some of them in the past. And since today is the day of the cloud, I will check whether the same types of windows exist on AWS and GCP streaming services and, if not, what the differences are.

AWS Redshift vs Azure Synapse Analytics

You know me already. I like to compare things to spot differences and similarities. This time, I will do this exercise for two cloud data warehouses: AWS Redshift and Azure Synapse Analytics.

Small data processing on the cloud

Believe it or not, data processing is not only about Big Data. Even though data is one of the most important assets for modern data-driven companies, there is still a need to process small data. And to do that, you will not necessarily use the same tools as for bigger datasets.

Object stores on the cloud

The next step of my multi-cloud exploration will be object stores. In this article, I will try to find similarities between S3, Storage Account and GCS.

Dead-letter pattern on the cloud

Data is not always as clean as we would like it to be. This is even more true for semi-structured formats like JSON, where it feels like we are working with a structure but, unfortunately, it's not enforced. Hence, from time to time, our code can fail unexpectedly. To handle this problem, as for many others, there is a pattern. It's called dead-letter and I will describe it below in the context of cloud services.
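
To give a rough idea of the pattern before going to the cloud services, here is a minimal Python sketch. It is only an illustration under my own assumptions: the `route_records` function, the required `user_id` field and the in-memory `dead_letters` list are hypothetical, and a real pipeline would write the rejected records to a dedicated queue, topic or bucket instead.

```python
import json


def route_records(raw_records):
    """Split raw JSON strings into valid records and dead letters."""
    valid, dead_letters = [], []
    for raw in raw_records:
        try:
            record = json.loads(raw)
            # Hypothetical validation rule: a record must be an object
            # that carries a "user_id" field.
            if not isinstance(record, dict) or "user_id" not in record:
                raise ValueError("missing required field: user_id")
            valid.append(record)
        except ValueError as error:
            # json.JSONDecodeError is a subclass of ValueError, so both
            # malformed JSON and failed validation end up here.
            # The raw input is kept next to the error for later replay.
            dead_letters.append({"raw": raw, "error": str(error)})
    return valid, dead_letters


valid, dead_letters = route_records(
    ['{"user_id": 1}', '{"user_id": ', '{"name": "x"}']
)
```

The point of the sketch is that a bad record never stops the processing of the good ones. It's set aside with enough context to investigate and reprocess it later.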

Streaming data sources on the cloud

A streaming broker is one of the most common entry points for modern data systems. Since these brokers also run on the cloud, and since one of my goals for this year is to acquire a multi-cloud vision, it's time to see what AWS, Azure and GCP offer in this field!

What's new on the cloud for data engineers - part 3 (02-04.2021)

It's time for the 3rd part of the "What's new on the cloud for data engineers" series. This time I will cover the changes between February and April.

Make your data disappear on the cloud

Even though storage is cheap and virtually unlimited, it doesn't mean we have to store all the data all the time. To deal with this lifecycle requirement, we can either write a pipeline that removes obsolete records or rely on the data management features offered by cloud services. I propose a short overview of them in this blog post.

AWS Redshift vs GCP BigQuery

Despite the recent architectural proposals based on the lakehouse principle, a data warehouse is still an important part of a data system. But there is no single way to build one, and if you look at the cloud providers, you will see various offerings like Redshift (AWS) and BigQuery (GCP), both presented in this article.

GCP BigTable or AWS DynamoDB, yet another comparison

As you know from the last blog post of 2020, one of my new goals is to be proficient at working with AWS, Azure and GCP data services. One of the building blocks of the process is finding patterns and identifying differences. Before doing that exercise for BigTable (GCP) and DynamoDB (AWS), I thought both were pretty much the same. However, you can't imagine how wrong I was with this assumption!

What's new on the cloud for data engineers - part 2 (11.2020-01.2021)

It's time for the second update with the news on cloud data services. This time too, a lot of things happened!

What cloud features for data processing patterns (ETL/ELT)?

During my study of BigQuery, I found an ETL pattern called feedback loop. Since I had never heard of it before, I decided to spend some time searching for other ETL patterns and the cloud features we could use in them.

What's new on the cloud for data engineers - part 1 (08-10.2020)

Cloud computing has been part of my life for 4 years and I have never found a good system to keep myself up to date. It's even more critical at this moment, when I'm trying to follow what happens at the 3 major providers (AWS, Azure, GCP). Since blogging helped me to achieve that for Apache Spark and, along the way, to learn from you, I'm gonna try the same solution for the cloud.

An ideal cloud for a data engineer

I have had a chance to use, for a longer or shorter period of time, 3 different cloud providers. In this post, I would like to share with you what my perfect cloud provider could look like.
