What's new on the cloud for data engineers - part 6 (01-04.2022)

It's time for the first cloud news blog post of the year. This summary lists the changes to data and data-related services released between January 1 and April 25.

Bartosz Konieczny

I'm covering the data services here, with a few exceptions that are mostly related to security services. I'm omitting version upgrades, which are very frequent, especially for the managed RDBMS services. This time, I'm also trying to highlight the most important features. The selection is subjective, though.

AWS

Athena

Aurora

Batch

Data Exchange

DocumentDB

DynamoDB

EC2

Although it's not a pure data service, EC2 got a few interesting auto-scaling updates:

Warm Pools

It's a pool of pre-initialized EC2 instances sitting alongside an Auto Scaling group. It helps decrease latency for applications with exceptionally long boot times, for example, because they write massive amounts of data to disk at startup.
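To make the idea more concrete, here is a minimal sketch of how a warm pool could be attached to an existing Auto Scaling group from Python. It assumes the boto3 SDK and its `put_warm_pool` operation on the `autoscaling` client; the group name `etl-workers`, the helper names, and the default sizes are hypothetical and only serve the illustration. The request is built separately from the call, so the snippet can be inspected without AWS credentials.

```python
from typing import Any, Dict


def build_warm_pool_request(
    group_name: str,
    min_size: int = 2,
    pool_state: str = "Stopped",
) -> Dict[str, Any]:
    """Build the keyword arguments for the Auto Scaling PutWarmPool call."""
    # Warm pool instances can wait in one of these states.
    if pool_state not in {"Stopped", "Running", "Hibernated"}:
        raise ValueError(f"Unsupported pool state: {pool_state}")
    if min_size < 0:
        raise ValueError("min_size must not be negative")
    return {
        "AutoScalingGroupName": group_name,
        "MinSize": min_size,
        "PoolState": pool_state,
    }


def attach_warm_pool(group_name: str) -> None:
    """Send the request. Needs boto3 and AWS credentials, so it is not called here."""
    import boto3  # imported lazily so the sketch loads without the SDK

    client = boto3.client("autoscaling")
    client.put_warm_pool(**build_warm_pool_request(group_name))


# "etl-workers" is a hypothetical Auto Scaling group name.
request = build_warm_pool_request("etl-workers")
print(request)
```

Keeping the instances in the `Stopped` state is usually the cheaper option, because you pay for the attached storage but not for the compute while they wait in the pool.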

Besides, EC2 also supports new instance types:

EMR

Quite a lot of updates for Kubernetes:

ACID file formats:

Lake Formation:

Studio:

Scaling:

EventBridge

FSx

Although it's not a pure data service, one of its recent updates brings something good for databases:

Glue

DataBrew:

FindMatches:

Jobs:

Schema registry:

Kendra

There are 3 new features for this search engine:

Keyspaces

Two changes for the clients:

Kinesis

Two new sinks are available in Firehose:

Lambda

Processing:

Security:

Ops/Others:

Macie

Managed data identifier

The service supports different managed data identifiers. Each of them is designed to detect a specific type of sensitive data, such as credit card numbers or passport numbers.
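To give an intuition of what a detector for one type of sensitive data can look like, here is a small sketch for credit card numbers. It is not Macie's implementation, which is not public; it only assumes a common approach where a regular expression finds candidates and the Luhn checksum filters out random digit sequences. All function names and the sample text are made up for the example.

```python
import re
from typing import List

# 13 to 19 digits, optionally separated by single spaces or dashes.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_card_numbers(text: str) -> List[str]:
    """Return normalized card-like numbers that pass the checksum."""
    matches = []
    for candidate in CARD_CANDIDATE.findall(text):
        digits = re.sub(r"[ -]", "", candidate)
        if luhn_valid(digits):
            matches.append(digits)
    return matches


sample = "order 4111 1111 1111 1111 shipped, ref 1234 5678 9012 3456"
print(find_card_numbers(sample))  # ['4111111111111111']
```

The second number in the sample has the right shape but fails the checksum, which shows why a pattern alone is not enough for this kind of detection.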

MSK

Performance:

Security:

Ops:

Redshift

Spectrum:

Data sharing:

Data types and language:

I/O:

Security:

Others:

S3

Security:

MemoryDB

Neptune

RDS

Oracle:

SQL Server:

MySQL:

PostgreSQL:

Global:

SNS

Step Functions

Storage Gateway

QuickSight

Azure

Backup

Batch

Cache for Redis

Cosmos DB

Synapse Link:

Security:

Misc:

Data Explorer

Data Box


Data Factory

Databricks

Functions

Some news for the serverless offering:

Monitor

Although it's not a pure data service, you might find some changes indirectly related to the data services:

Unsupported tables

Azure Monitor represents collected logs as tables. The supported tables are the ones currently available for exploration and export rules. An example is DatabricksWorkspace, which is an audit log for a Databricks workspace.

The unsupported tables, albeit coming from real Azure resources, are not yet fully available in export rules. For example, the Perf table stores performance counters from Windows and Linux agents. However, only the Windows data can be used in the export rules.

Key Vault

Although it's not a pure data service, it got an important quota update:

Purview

SQL Database

Hyperscale:

PostgreSQL:

SQL Managed Instance:

SQL Server on VM:

Security:

Misc:

Storage Account

Monitoring:

Misc:

Table Storage:

Stream Analytics

Synapse

GCP

BigQuery

Two major announcements for BigQuery:

SQL:

Other features:

Cloud Composer

Cloud Composer 2:

Bug fixes for:

Others:

Cloud Functions

Cloud SQL

SQL Server:

PostgreSQL:

Global:

Cloud Storage

Security:

Others:

Data Catalog

Data Fusion

Data Loss Prevention

New detectors and connections:

Dataflow

Dataplex

Dataproc

Firestore

IAM

Two interesting changes for this security service:

Pub/Sub

Spanner

Performance:

Other features:

Storage Transfer Service

Data innovation in the cloud continues. My top news items are Apache Iceberg on AWS, Kubernetes support on EMR, new data services on GCP (Dataplex, BigLake, Analytics Hub), optimized runtime environments (Dataflow Runner V2, Dataproc Serverless), and integrated ML capabilities (Stream Analytics). What are yours?
