Cloud articles


GCP Dataflow by an Apache Spark guy

Some months ago I wrote a blog post where I presented BigQuery from the perspective of an Apache Spark user. Today I will repeat the exercise, this time with a service from the same category as Apache Spark: data processing frameworks. In other words, I will try to understand GCP Dataflow thanks to my Apache Spark knowledge!


AWS Redshift vs GCP BigQuery

Despite the recent architectural proposals built on the lakehouse principle, a data warehouse is still an important part of a data system. But there is no single way to build one, and if you look at the cloud providers, you will see various offerings, such as Redshift (AWS) and BigQuery (GCP), both presented in this article.


GCP BigQuery by an Apache Spark guy

One of the steps in my preparation for the GCP Data Engineer certificate was working through the book "Google BigQuery: The Definitive Guide: Data Warehousing, Analytics, and Machine Learning at Scale". And to be honest, I didn't expect that knowing Apache Spark would help me so much in understanding the architectural concepts. If you don't believe me, I will try to convince you in this blog post.


GCP BigTable or AWS DynamoDB, yet another comparison

As you know from the last blog post of 2020, one of my new goals is to become proficient at working with AWS, Azure and GCP data services. One of the building blocks of that process is finding patterns and identifying differences. Before doing that exercise for BigTable (GCP) and DynamoDB (AWS), I thought both were pretty much the same. However, you can't imagine how wrong I was with this assumption!


My journey to GCP Data Engineer

Last December I passed the GCP Data Engineer exam and got my certification as a late Christmas gift! As I did for the AWS Big Data specialty, I would like to share with you some feedback from my preparation process. Spoiler alert: I did it without any online course!


What's new on the cloud for data engineers - part 2 (11.2020-01.2021)

It's time for the second update with the news on cloud data services. This time too, a lot of things happened!


Lakehouse and BigQuery?

You know me already: I'm a big fan of Apache Spark but also of all kinds of patterns. And one of the patterns gaining popularity nowadays is the lakehouse. Most of the time (always?), this pattern is implemented on top of an ACID-compatible table format like Apache Hudi, Apache Iceberg or Delta Lake. But can we do it differently and use another storage, like BigQuery?


What cloud features for data processing patterns (ETL/ELT)?

During my study of BigQuery I found an ETL pattern called the feedback loop. Since I had never heard of it before, I decided to spend some time searching for other ETL patterns and the cloud features we could use to implement them.


What's new on the cloud for data engineers - part 1 (08-10.2020)

Cloud computing has been part of my life for 4 years and I have never found a good system to keep myself up to date. It's even more critical at this moment, when I'm trying to follow what is happening at the 3 major providers (AWS, Azure, GCP). Since blogging helped me to achieve that for Apache Spark, and by the way to learn from you, I'm going to try the same solution for the cloud.


An ideal cloud for a data engineer

I have had a chance to use, for longer or shorter periods of time, 3 different cloud providers. In this post I would like to share with you what my perfect cloud provider could look like.


DataFrames for analytics - Glue DynamicFrame

When I came to the data world, I had no idea what data governance was. One of the tools that helped me to understand it was AWS Glue. I had a chance to work with it again during my AWS Big Data specialty exam preparation, and that's when I asked myself: "DynamicFrames?! How do they differ from DataFrames?" In this post I'll try to shed some light on it.


AWS data services security - encryption, authentication and exposition

When I was preparing for my AWS Big Data specialty certification, I was not comfortable with 2 categories: visualization and security. Because of that I decided to work on them, starting with the latter, which can have a more direct impact. I am starting that work with this post about data security practices on AWS data services.


My journey to AWS Certified Big Data specialty

On January 10, 2020, I passed my AWS Certified Big Data specialty exam with an overall score of 82%. Despite the fact that it will soon be replaced (April 2020) by AWS Certified Data Analytics - Specialty, I'd like to share with you my learning process and some interesting resources.


My journey to AWS Cloud Practitioner

One of the goals in my 3-Levels List was to get 3 certificates: AWS Cloud Practitioner, AWS Big Data and GCP Data Engineer. I've already passed the first one and that's the reason I'm writing this blog post.


Loading data into Redshift with COPY command

One of the approaches to loading big volumes of data efficiently is to use bulk operations. The idea is to take all the records and put them into the data store at once. For this purpose, AWS Redshift exposes an operation called COPY.
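To make the bulk-load idea a bit more concrete, here is a minimal sketch that only builds the text of a COPY statement reading files from S3. The table name, bucket path and IAM role ARN are hypothetical placeholders, and the helper `build_copy_statement` is my own illustration, not part of any AWS library. It assumes the inputs are trusted values, since they are interpolated directly into the SQL.

```python
def build_copy_statement(table: str, s3_path: str, iam_role_arn: str,
                         file_format: str = "CSV") -> str:
    """Return a Redshift COPY statement that bulk-loads the files under s3_path.

    All arguments are assumed to be trusted values (no escaping is done here).
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS {file_format};"
    )


if __name__ == "__main__":
    # Hypothetical table, bucket and role, used only for illustration.
    print(build_copy_statement(
        table="events",
        s3_path="s3://my-bucket/events/2020/01/",
        iam_role_arn="arn:aws:iam::123456789012:role/MyRedshiftRole",
    ))
```

The generated statement would then be executed through any SQL client connected to the cluster, so that Redshift loads all the files under the prefix in a single operation instead of row-by-row inserts.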


AWS Kinesis Firehose, event time and batch layer

Last time I wrote about sending Apache Kafka data to the batch layer. This time I would like to do the same but with AWS technologies, namely Kinesis, Firehose and S3.


Listening EMR events with AWS Lambda

I really appreciate AWS services, and one of the main reasons is how easy they make it to implement event-driven systems. One interesting use case for these events is related to the EMR service, responsible for running Apache Spark pipelines. In this post I will try to trigger an action every time an EMR step completes successfully.
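As a rough illustration of the idea, the sketch below shows a Lambda handler that reacts only to successfully completed EMR steps. It assumes the function is triggered by an event rule and that the event has the `source`, `detail-type` and `detail.state` fields used here; that shape is my assumption about the EMR step status change event, and the "action" is just a placeholder log line.

```python
def is_successful_step(event: dict) -> bool:
    """Check whether the event describes an EMR step that completed successfully.

    The field names are an assumption about the shape of the EMR event.
    """
    return (
        event.get("source") == "aws.emr"
        and event.get("detail-type") == "EMR Step Status Change"
        and event.get("detail", {}).get("state") == "COMPLETED"
    )


def lambda_handler(event, context):
    """Entry point invoked by AWS Lambda for every matching event."""
    if not is_successful_step(event):
        return {"handled": False}

    detail = event["detail"]
    # Placeholder action: a real function could start another job,
    # send a notification, update a catalog, etc.
    print(f"Step {detail.get('stepId')} completed on cluster {detail.get('clusterId')}")
    return {
        "handled": True,
        "clusterId": detail.get("clusterId"),
        "stepId": detail.get("stepId"),
    }
```

Filtering inside the handler keeps the sketch self-contained; the same condition could also be expressed in the event rule pattern so that the function is invoked only for completed steps.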


AWS Lambda - does it fit in with data processing ?

Despite the recent criticism (cf. the "Serverless Computing: One Step Forward, Two Steps Back" link in the Read also section), the serverless movement is gaining popularity. Databricks proposes a serverless platform for running Apache Spark workflows, Google Cloud Platform comes with a similar service reserved for Dataflow pipelines and Amazon Web Services, ... In this post, I will summarize the good and bad sides of my recent experiences with AWS Lambda applied to data processing.


Doing data on AWS - overview

Open Source provides a lot of interesting tools to deal with Big Data: Apache Spark, Apache Kafka, Parquet, to name only a few. However, nowadays data platforms without cloud support are rarer and rarer. That's why this topic merits its own category and posts on this blog. To avoid going too quickly, the first article covers the services you can use to work with data on AWS.
