It's time for the 5th part of the "What's new on the cloud for data engineers" series. This time I will cover the changes between September and December.
I'm covering the data services here, with a few exceptions, most often related to security services. For the updates, I'm omitting the version upgrades, which are quite frequent changes, especially for the managed RDBMS services. This time, I'm also trying to highlight the most important features. The view is subjective, though.
AWS
Athena
Query execution:
- EXPLAIN ANALYZE statement. It's available to get a more detailed view of the query execution plan. The plan includes fine-grained details such as CPU usage or the number of processed rows.
- Partition pruning support. Athena now connects to the Glue Data Catalog to look for the partitions relevant to the query execution.
- Cross-account federated query. A federated query can read data stored elsewhere than in an S3 bucket. Recently, this feature was extended with the ability to query data stores in different AWS accounts.
ACID:
- ACID transactions. In addition to the governed tables, Athena now supports Apache Iceberg as the transactional data format. The feature is currently in a public preview.
- Fine-grained access. Athena uses Lake Formation Data Filtering to implement cell-, row- and column-level fine-grained access for ACID-compliant governed tables.
Governed tables
Governed tables are Glue Data Catalog tables that support ACID transactions and benefit from data layout optimizations, such as small files compaction.
Console:
- A modernized UI. A more user-friendly and easier-to-use console is available.
- Step Functions connection. The modernized UI has a Workflows section with state machines involving Athena queries.
Aurora
Database Activity Streams:
- Graviton2 support. AWS Graviton2 processors are a new generation of Graviton processors providing improved performance. They're now supported by the Database Activity Streams.
- Babelfish available for PostgreSQL. This Open Source tool provides a capability for PostgreSQL to understand queries running on SQL Server. It's then a big facilitator for the data migration scenarios.
Database Activity Streams
An activity stream stores all change and access events. Change events represent data modification, such as INSERT or CREATE TABLE whereas access events represent data reads, such as SELECT statements.
Backup
New supported data stores:
- DocumentDB with MongoDB compatibility.
- DynamoDB.
- Neptune.
- S3 in public preview to create continuous backups or periodic snapshots of S3 buckets.
Batch
Two new features in the AWS Batch service:
- Step Functions integration. The Batch console shows Step Functions workflows that use Batch jobs.
- Fair-share scheduling. The First-In, First-Out (FIFO) policy was complemented with a fair-share policy where the service tries to allocate resources to the jobs equally, or based on the defined weights and priorities. The new scheduling should work well for queues mixing long-running and short-running jobs.
Data Exchange
New features:
- Automated export. Third-party data subscribers can now set up an auto-export feature to bring the new revisions of the subscribed datasets to their S3 buckets. Previously, the action required manual intervention or a dedicated data pipeline.
- Redshift integration. Subscribers can now query third-party datasets directly from Redshift without copying the data to the cluster. The datashare must be located in the data producer's Redshift, though. The feature is in public preview.
Data Sync
Data Sync is the service for data synchronization between on-premises and cloud storage, or between different cloud storage services. Recently it got some new connectors:
- HDFS support. It's now possible to transfer data between HDFS and S3, EFS or FSx for Windows File Server.
- FSx for Lustre support. Besides HDFS, the service now supports Amazon FSx for Lustre, which provides cost-effective, high-performance, scalable storage for compute workloads.
Database Migration Service
Source-related changes:
- Google Cloud SQL for MySQL is now supported as a migration source.
- MongoDB databases. The service can migrate multiple databases in a single task when using MongoDB or Amazon DocumentDB with MongoDB compatibility as a source.
- MongoDB segmented migration. By default, DMS uses a single thread to migrate data from a MongoDB or Amazon DocumentDB with MongoDB compatibility source. Depending on the dataset size, the process can be slow. To optimize it, it's now possible to use a segmented migration, i.e. to run a multi-threaded migration.
Target-related changes:
- Apache Kafka. DMS can now migrate multiple schemas into multiple Apache Kafka topics in a single task.
- Redis, Amazon ElastiCache for Redis, and Amazon MemoryDB for Redis are new available target data stores.
- Redshift. The migration performance should improve thanks to multithreaded full load task settings.
- S3. The migration performance should improve thanks to the parallel upload of different partitions. The feature migrates each partition's data concurrently.
Other changes:
- Time Travel. The feature provides an extra logging capability to investigate replication task issues. Under the hood, the service logs any of the selected replication activities (e.g. insert or update operations) on S3 as encrypted CSV files with fields such as raw data, source and target information, or the replication action time.
DocumentDB
- User-defined roles support. Previously the service supported only built-in roles. Now, it's possible to specify user-defined roles and for example enable a scenario where a user can access one collection as a reader and work with another one as a writer.
- JDBC driver. A JDBC driver is available. It can be used to run queries against DocumentDB from a BI tool (Tableau, MicroStrategy, and QlikView) or a data exploration tool (SQL Workbench).
- Geospatial data. It's now possible to store, query and index geospatial data.
- Support for $literal, $map, and $$ROOT. DocumentDB improved its compatibility with MongoDB by adding support for the $literal and $map operators, and the $$ROOT system variable.
- Graviton2-based instances support. DocumentDB with MongoDB compatibility supports Graviton2-based T4g.medium and R6g instance types. They can provide up to 30% performance improvement.
DynamoDB
New features of DynamoDB:
- Standard-Infrequent Access table class. This new storage class helps reduce storage costs for infrequently accessed data by up to 60%. It's a good choice for data requiring long-term storage and infrequent access, such as application logs or historical data (see the boto3 sketch after this list).
- DynamoDB Streams data-plane activity in CloudTrail. CloudTrail supports retrieving and filtering of DynamoDB Streams data-plane API activity. The feature gives more granular control of the activity logged in CloudTrail.
- Import/export in NoSQL Workbench for DynamoDB. The UI tool for DynamoDB can now import a CSV dataset into an existing data model and export query results in CSV format.
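To illustrate the new table class, here is a minimal boto3 sketch; the table name is hypothetical and it assumes a boto3 release that already exposes the TableClass parameter:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Hypothetical logs table created directly in the Standard-IA table class.
dynamodb.create_table(
    TableName="application-logs",
    AttributeDefinitions=[{"AttributeName": "log_id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "log_id", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",
    TableClass="STANDARD_INFREQUENT_ACCESS",
)
```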
EC2
Although it's not a pure data service, EC2 got a few interesting auto-scaling updates:
- EC2 Fleet with targeted On-Demand Capacity Reservations. Thanks to the feature you can reserve compute capacity in a specific Availability Zone for any duration.
- Predictive auto scaling. You can use custom metrics to predict the compute capacity needed by an Auto Scaling group. The feature will proactively add the capacity to meet the predicted load. The feature also supports metrics coming from a different service that might impact the needed compute capacity.
- Auto termination for Spot Instances when using Capacity Rebalancing. Previously, when a Spot Instance was at risk of interruption, EC2 Fleet or Spot Fleet was launching a replacement instance without terminating the risky one. Now, the termination is automatic and you don't need to do it on your own.
- Attribute-based instance type selection. The feature is useful for instance type flexible workloads on EC2 Spot Instances. In this mode you don't need to create a list of targeted instance types. Instead, you define a targeted capacity in terms of vCPU, memory or storage, and the service will select spot instances matching these attributes.
- Spot instance placement score. The score indicates the probability of success for a spot instance request in a Region or Availability Zone.
Besides, it also supports new instance types:
- C7g instances powered by Graviton3 processors in preview. They're optimized for workloads such as high performance computing (HPC), gaming, video encoding, and CPU-based machine learning inference.
- G5g instances powered by AWS Graviton2 processors are available. They're optimized for Android game streaming.
- M6a instances are generally available. They're optimized for web and application servers, back-end servers supporting enterprise applications (e.g. SAP Business Suite, MySQL, Microsoft SQL Server, and PostgreSQL databases), micro-services, multi-player gaming servers, caching fleets, as well as for application development environments.
- EC2 R6i instances are generally available. They're optimized for memory-intensive workloads, including SQL and NoSQL databases, distributed web scale in-memory caches like Memcached and Redis, in-memory databases like SAP HANA, and real-time big data analytics like Hadoop and Spark clusters.
- M6i instances are available. They're optimized for web and application servers, back-end servers supporting enterprise applications, gaming servers, caching fleets, as well as for application development environments.
- C6i instances are available. They're optimized for compute-intensive applications like batch processing, distributed analytics, high performance computing (HPC), ad serving, highly scalable multiplayer gaming, and video encoding.
- G5 instances are generally available. They're optimized for graphics intensive and machine learning use cases.
EMR
Serverless:
- EMR Serverless in preview. This new EMR runtime doesn't require any prior cluster configuration. Instead, the service will provision and scale the resources needed to run the job.
Cluster:
- Multiple custom Amazon Machine Images (AMI) supported. Previously, it was possible to use a single AMI in the instance group. Now, it's possible to define a different AMI for each instance type.
- Idle clusters auto-termination. EMR clusters support auto-termination with an idleness period. If there is no activity on the cluster within this time, the service automatically terminates the cluster.
Security:
- Write operations with Apache Ranger enabled. Earlier this year EMR added Apache Ranger support for securing access to S3 tables backed by Hive Metastore from Apache Spark SQL. Previously, this feature was limited only to the read operations. Now, it's possible to perform INSERT INTO, INSERT OVERWRITE, and ALTER TABLE.
Studio:
- Multi-language support. Starting from EMR 6.4.0 it's possible to use Python, Scala, Spark SQL and R within the same Jupyter notebook.
- Regulated workloads eligibility. EMR Studio is Health Insurance Portability and Accountability Act (HIPAA) eligible, and is Health Information Trust Alliance (HITRUST) certified. It can now be used to run healthcare workloads covered by these 2 standards.
- IAM-based authentication. In addition to the AWS Single Sign-On, Studio supports IAM-based authentication and IAM Federation.
- Easier execution of notebooks and scripts. Previously, executing Python scripts or other notebooks directly from the Studio required copying the files to the EMR cluster. This step is no longer necessary and both can be executed directly from the notebook.
- Subnet extension. Previously, the Studio only supported EMR clusters in the subnet selected while creating the workspace. Now, it's possible to connect it to any of the subnets specified for the Studio.
EventBridge
A schema management feature:
- Cross-account schema discovery. EventBridge stores schema definitions in a schema registry and maps them to Java, Python, and TypeScript code. The service adds new schemas automatically if the schema discovery is turned on. Recently, the schema discovery was extended with support for events sent from a different AWS account.
Glue
Data Brew:
- AppFlow integration. Thanks to this new integration Data Brew can work on the datasets exported from various SaaS platforms, such as Salesforce, Google Analytics, or Zendesk. The export can be manual (on-demand), event-driven, or scheduled.
- Custom SQL statements for Redshift and Snowflake. The service supports dataset creation from custom SQL queries executed on top of Redshift or Snowflake tables.
- Data quality rules. The users can now create data quality rules for the manipulated datasets. These rules can for example ensure that a column doesn't contain duplicate values or that its values fall within a specified range. The data quality evaluation is later exposed in the dashboard.
- PII data masking. Personally Identifiable Information (PII) can now be masked during data preparation. The service can detect the PII data from a data profiling job and mask it with one of the available transformations (substitution, hashing, encryption).
Crawlers:
- Event-driven crawlers. The crawlers support S3 Event Notifications as a source. Each time an S3 object is deleted or written, S3 sends a notification to an SQS queue used by the Glue crawler to detect new data to add or remove from the Data Catalog tables. If the queue is empty, the crawler stops immediately.
FindMatches:
- Incremental matching. The transform can now automatically match new data against existing data to better identify duplicates or similar records.
- Match scores output. The transform got a new option to output match scores and get a better idea of how the records in each grouping match each other.
Jobs:
- Integration with Data Catalog tables created with Schema Registry. Thanks to this new feature, Glue ETL streaming jobs can read Data Catalog tables created from Schema Registry to manage and enforce streaming data schemas.
Kendra
There are 3 new features for this search engine:
- Experience Builder. The builder is a UI tool to create search applications without writing any frontend code.
- Search Analytics dashboard. The dashboard provides a global view of the search performance and helps find areas of improvement.
- Custom Document Enrichment. The service now has a pre-processing capability to change the document before indexing. The pre-processing can use native features like field deletion or more complex and custom operations managed from a Lambda function.
Kinesis
Data Streams:
- On-Demand mode. Kinesis Data Streams got a new capacity mode that doesn't require capacity planning. In this new mode, you pay per GB of data written and read from your data streams, and hence don't need to set the number of shards for the stream. Under the hood, the service adapts the stream capacity to the current load.
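As a quick illustration, the sketch below creates a stream in the on-demand mode with boto3; the stream name and ARN are hypothetical and it assumes a boto3 release exposing StreamModeDetails:

```python
import boto3

kinesis = boto3.client("kinesis")

# No ShardCount is required in the on-demand mode.
kinesis.create_stream(
    StreamName="clickstream-events",
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)

# An existing provisioned stream can also be switched to on-demand.
kinesis.update_stream_mode(
    StreamARN="arn:aws:kinesis:eu-west-1:123456789012:stream/clickstream-events",
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)
```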
Firehose:
- Dynamic Partitioning. With this feature, Firehose can use any of the event attributes as a partition key while writing it to S3. The partition key can be defined as a JQ expression for JSON data, or as a custom code for any other input format.
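Below is a minimal sketch of the dynamic partitioning part of an Extended S3 destination configuration, assuming JSON events carrying a customer_id attribute; treat the exact values as illustrative:

```python
# Fragment of an ExtendedS3DestinationConfiguration passed to create_delivery_stream.
dynamic_partitioning = {
    "DynamicPartitioningConfiguration": {"Enabled": True},
    # The partition key is extracted from each record with a JQ expression...
    "ProcessingConfiguration": {
        "Enabled": True,
        "Processors": [{
            "Type": "MetadataExtraction",
            "Parameters": [
                {"ParameterName": "MetadataExtractionQuery",
                 "ParameterValue": "{customer_id: .customer_id}"},
                {"ParameterName": "JsonParsingEngine", "ParameterValue": "JQ-1.6"},
            ],
        }],
    },
    # ...and referenced in the S3 prefix to build the partitioned layout.
    "Prefix": "events/customer_id=!{partitionKeyFromQuery:customer_id}/",
    "ErrorOutputPrefix": "errors/!{firehose:error-output-type}/",
}
```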
Lake Formation
- Managed VPC endpoints support. Thanks to VPC endpoints you can authorize access for client applications and services inside of your VPC and on-premises using private IP connectivity. The feature also supports a fine-grained access control with VPC endpoint policies.
- Governed Tables support. This new type of tables ensures a consistent view of the data in case of concurrent operations and conflicts. They also have an optimized storage improving access performance.
- Row and cell-level permissions. This new security control is available through Amazon Athena, Amazon Redshift Spectrum, AWS Glue, and Amazon QuickSight.
Lambda
Triggers:
- Cross-account SQS trigger. A Lambda function can now be triggered by an SQS event coming from a different AWS account.
- Event filtering for SQS, DynamoDB and Kinesis. The feature makes it possible to process only the events matching a custom filtering expression associated with the event source mapping. Each mapping supports up to 5 filters that are combined with an OR operator. The event is passed to the Lambda instance if any of the filters matches; otherwise, it's dropped (see the filter sketch after this list).
- OffsetLag for Apache Kafka. The Lambda function reading data from Amazon MSK or a self-managed Apache Kafka broker got a new metric called OffsetLag to monitor the delay in processing. The metric measures the total number of messages waiting for processing.
- SQS partial batch response. Previously the SQS batch processing semantic was all-or-nothing. Now, it's possible to have partial results. The function must return a list of unsuccessfully processed items so that the next invocation only reprocesses these failed records (see the handler sketch after this list).
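The event filtering is configured on the event source mapping. A minimal boto3 sketch, with hypothetical ARNs and a single filter on an SQS message body, could look like this (it assumes a boto3 release exposing FilterCriteria):

```python
import json

import boto3

lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:eu-west-1:123456789012:orders-queue",
    FunctionName="process-orders",
    # Up to 5 filters, combined with OR; an event is delivered if any pattern matches.
    FilterCriteria={
        "Filters": [
            {"Pattern": json.dumps({"body": {"status": ["CONFIRMED"]}})},
        ]
    },
)
```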
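And here is a minimal handler sketch for the partial batch response, assuming the event source mapping has the ReportBatchItemFailures response type enabled; the process function is a hypothetical business operation:

```python
def process(body: str) -> None:
    # Hypothetical business logic; raises an exception on failure.
    ...


def handler(event, context):
    failed = []
    for record in event["Records"]:
        try:
            process(record["body"])
        except Exception:
            failed.append({"itemIdentifier": record["messageId"]})
    # Only the listed messages go back to the queue; the successfully processed ones are deleted.
    return {"batchItemFailures": failed}
```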
Graviton2 processor:
- Graviton2 support. The new processor is available to provide better performance at a lower cost for serverless workloads.
- CloudWatch Lambda Insight support. The Insight can monitor and troubleshoot Graviton2-powered Lambda functions.
Security:
- IAM authentication for MSK. Lambda functions can now use an IAM role to connect to Amazon MSK topics. It completes the already available SASL/SCRAM method.
- Mutual TLS authentication. It's now possible to provide a certificate to establish a trust relationship between a Lambda function and an Apache Kafka broker (self-managed or MSK) event source. In this mode, the verification is mutual. The broker sends a certificate to Lambda, and the Lambda instance sends a certificate to the broker.
Ops:
- Cross-account ECR support. The functions can now use container images stored in an ECR repository in a different AWS account. The feature can be useful for configurations with a centralized ECR repository shared by all AWS accounts.
Macie
- Sensitive data customization. Before running the sensitive data discovery job you can select a list of managed data identifiers to look for in the processing. For example, you could customize the job to look for personal health information (PHI), and personally identifiable information (PII).
Managed service for Prometheus
Although it's not a pure data service, it's worth noting the general availability of the managed version of Prometheus.
MSK
Serverless:
- MSK Serverless is in public preview. This new mode lets you run Apache Kafka without capacity planning and scales your cluster according to the current I/O workload.
Security:
- Multiple concurrent authentication modes. New and existing MSK clusters can now enable multiple authentication modes at the same time. The feature can be useful for authentication mode migrations or simply to support multiple modes simultaneously.
- Secure connection over the internet. It's now possible to query an MSK cluster from clients external to the VPC. The feature requires enabling public access, in-transit TLS encryption and one of the supported authentication methods (IAM, SASL/SCRAM, mutual TLS).
Others:
- Kafka Connect integration. Amazon MSK can now run fully managed Kafka Connect clusters. The feature scales automatically in response to the current load without the need to manage the infrastructure.
- 19 new metrics. The service publishes 19 new metrics to CloudWatch. Among them you'll find CPU, storage, and network usage information.
Neptune
Some new features for the graph database:
- JDBC driver. An Open Source JDBC driver to connect to Amazon Neptune is available.
- Graviton2-powered instances. General-purpose T4g and memory-optimized R6g database instances powered by the AWS Graviton2 processor are now generally available.
- Auto Scaling. The feature can automatically add or remove Read Replicas as a response to the load changes detected from CloudWatch metrics analysis.
- Extra data types in search. The service supports numbers and dates for full-text search. Previously, only string type was supported.
Redshift
Serverless:
- Redshift Serverless in preview. Instead of managing clusters, you can now create a serverless Redshift instance that will scale automatically to meet the load.
Client:
- Improved Open Source integration. Apache Airflow implemented RedshiftSQLHook and RedshiftSQLOperator to facilitate the interaction with Redshift, and SQLAlchemy got an updated Redshift dialect (see the DAG sketch after this list).
- Amazon Redshift RSQL. It's a command line tool supporting PostgreSQL's psql capabilities with some extra Redshift features. It can also run scripts if executed in batch mode.
- Query Editor V2. A new version of the web-based query editor is available. Besides classical query-related forms, the editor also supports SQL Notebooks that help organize multiple queries in a single document.
- RI migration feature. Amazon Redshift Console, CLI, and API have a new feature to migrate DS2 RI clusters to RA3 RI clusters.
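To give an idea of the Airflow integration, here is a minimal DAG sketch; the connection id is hypothetical and the import path can vary with the Amazon provider version:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.redshift_sql import RedshiftSQLOperator

with DAG("redshift_demo", start_date=datetime(2021, 12, 1), schedule_interval=None) as dag:
    create_table = RedshiftSQLOperator(
        task_id="create_table",
        redshift_conn_id="redshift_default",
        sql="CREATE TABLE IF NOT EXISTS visits (visit_id INT, visit_time TIMESTAMP);",
    )
```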
Data sharing:
- Cross-region data sharing. Data sharing was previously available for the clusters in the same and different AWS accounts. This time the feature was extended by the possibility to share data with clusters located in different regions. It's currently in preview for the clusters using RA3 node types.
- Performance improvements. Data sharing should perform better thanks to the performance enhancements, such as result caching and concurrency scaling.
Data types:
- VARBYTE. This first new type can be used to store variable-length binary strings. It takes one parameter defining the max allowed number of bytes (between 1 and 1,024,000).
- GEOGRAPHY. Thanks to this new type, Redshift supports 2 major spatial data types (the other is GEOMETRY).
Other features:
- Federated query for RDS. RDS for MySQL and Aurora for MySQL are new databases supported in the federated query capability. Previously the feature was in preview.
- AQUA cache for RA3.xlplus nodes. AQUA is a new distributed hardware-accelerated cache. Thanks to it, Redshift can run up to 10x faster for certain types of queries. The feature is now Generally Available for RA3.xlplus nodes.
- Automated Materialized View (AutoMV). This preview feature monitors the current workload and uses Machine Learning models to propose creating new materialized views and removing old ones.
- AmazonRedshiftAllCommandsFullAccess policy. A new IAM policy called AmazonRedshiftAllCommandsFullAccess is now available to quickly start playing with Redshift. It provides all required privileges to use Redshift-related services, such as S3, SageMaker, Lambda, Aurora and Glue. It facilitates the external tables manipulation and data loading.
- Concurrency scaling for writing. Concurrency scaling adds compute capacity to handle the extra load. Now, it's also supported for writing operations like COPY, INSERT, UPDATE, and DELETE.
S3
Events:
- Event notifications in EventBridge. S3 notifications can now be delivered directly to EventBridge. The new target provides various features, such as filtering and the fan-out pattern (event delivery to multiple other targets).
- S3-managed events. The service extended its support for notification types. It's now possible to catch a transition or expiration S3 Lifecycle policy event, or a move within the S3 Intelligent-Tiering storage class to Archive Access or Deep Archive Access tiers.
Storage classes:
- Glacier rename and pricing. S3 Glacier is now named S3 Glacier Flexible Retrieval and it includes free bulk retrievals that are ideal for getting the data once or twice per year. It also comes with a 10% storage price reduction.
- Archive Instant Access tier in S3 Intelligent-Tiering. S3 Intelligent-Tiering now includes an Archive Instant Access tier. The new tier provides cost savings for rarely accessed data, millisecond retrieval and high throughput performance.
Security:
- Policy warnings. S3 console reports security warnings, errors, and suggestions from IAM Access Analyzer. The feature can for example notify you about overly permissive access policies.
- Object Ownership. This new feature disables access control lists (ACLs) to simplify access management. The option works at the bucket level and, when enabled, it assigns object ownership to the bucket owner. In addition, only IAM policies can be used to grant permissions.
- S3 Access Points. Access Point aliases can be used to manage granular access to the S3 datasets shared from AWS Transfer Family.
S3 File Gateway:
S3 File Gateway
The gateway is a proxy providing access to virtually unlimited cloud storage over the SMB and NFS protocols.
- NFS file share auditing. NFS client operations, such as delete, read, write, rename, and change of permissions, are logged. The logs can be delivered to CloudWatch or Firehose for further analysis.
- Files closing on SMB shares. Previously the clients could leave the files in an open or locked state. Starting from now, the members of the GatewayAdmin local group have the permission needed to force-close these locked files.
Other features:
- Versions limit. You can set the maximum number of noncurrent versions to retain for versioned objects (see the lifecycle sketch after this list).
- Object size lifecycle rule. Recently S3 also got support for transitioning objects to different storage classes by their size.
- Strong consistency for S3 on Outposts. S3 on Outposts supports strong read-after-write and list-after-write consistency.
- Multi-Region Access Points performance improvement. You can notice up to 60% performance improvement when accessing datasets replicated across multiple AWS regions from Amazon S3 Multi-Region Access Points. The improvement relies on the intelligent routing of the request to the closest location with the requested object.
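The versions limit is expressed in a lifecycle rule. A hedged boto3 sketch, with a hypothetical bucket name and assuming a boto3 release exposing the NewerNoncurrentVersions field, could look like this:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-versioned-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "keep-last-5-noncurrent-versions",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "NoncurrentVersionExpiration": {
                # Keep at most 5 noncurrent versions; expire older ones 30 days
                # after they become noncurrent.
                "NewerNoncurrentVersions": 5,
                "NoncurrentDays": 30,
            },
        }]
    },
)
```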
OpenSearch
- Data streams support. Data streams are an abstraction simplifying time-series data management, including the rollover process to move old data to cold storage.
- Summarized view. The service has a new index transform to aggregate and store summarized views from large data sets.
- New console. A new version of the console is available and provides a better user experience for the domain creation and monitoring.
RDS
Oracle:
- New sqlnet.ora parameters. The new parameters control the encryption and checksum behavior on the client side.
Global:
- RDS on Outposts can export database logs to CloudWatch.
- RDS on Outposts supports backups on Outposts. The backups can now be created on the same Outpost as the RDS instance.
- Support for Customer-Managed Encryption Keys from a different account. It's now possible to export an RDS Snapshot to S3 and encrypt it with KMS keys located in a different AWS account.
Snow Family
- Tape data migration. It's a new capability of AWS Snowball Edge to migrate tape data to the cloud when the network connectivity is limited or expensive.
SNS
- Message batching for the writing. Data producers can write up to 10 messages in a single batch request for Standard and FIFO topics (see the boto3 sketch after this list).
- Token-based authentication. This new authentication mode targets Apple devices and provides faster communication than classical certificate-based mode.
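A minimal publishing sketch with boto3; the topic ARN is hypothetical and it assumes a boto3 release exposing publish_batch:

```python
import json

import boto3

sns = boto3.client("sns")

sns.publish_batch(
    TopicArn="arn:aws:sns:eu-west-1:123456789012:orders-topic",
    PublishBatchRequestEntries=[
        # Up to 10 entries per request; each entry needs a unique Id.
        {"Id": str(i), "Message": json.dumps({"order_id": i})}
        for i in range(10)
    ],
)
```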
SQS
- Dead Letter Queue redrive. With the feature, the dead-lettered messages can be moved back to the source queue.
Timestream
- Faster and more cost-effective writing. AWS's time series database added 3 capabilities to optimize writes:
- scheduled queries to schedule aggregate and other real-time analytics queries
- multi-measure records to store multiple time series measures in a single table row, instead of storing one measure per row
- magnetic storage writes to store late-arriving data instead of keeping a large memory store to process it
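For the magnetic storage writes, a minimal boto3 sketch could look like the following; the database and table names are hypothetical and it assumes a boto3 release exposing MagneticStoreWriteProperties:

```python
import boto3

timestream = boto3.client("timestream-write")

# Late-arriving records landing outside the memory store retention window
# are written to the magnetic store instead of being rejected.
timestream.update_table(
    DatabaseName="iot",
    TableName="sensor_measures",
    MagneticStoreWriteProperties={"EnableMagneticStoreWrites": True},
)
```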
Azure
Backup
Security:
- Multi-user authorization in public preview. The feature adds an extra layer of protection for critical operations, such as disabling soft delete or disabling the multi-user authorization itself. When enabled, the users performing these critical operations must first be granted the Contributor role on the Resource Guard.
- Long-term retention for Azure Database for PostgreSQL - Single Server. Previously in preview, the feature is now generally available. Backup data can be retained up to 10 years in the standard or archive tier.
Batch
- Pool Managed Identities. You can associate managed identities with Batch pools. The pool nodes will use these identities to access Azure resources.
- Account Authentication Control. The service API authentication can be limited to Azure Active Directory. Hence, API calls using access key authentication won't be accepted.
- Pool VM Extensions. It's now possible to associate VM extensions with Batch pools to automatically install these extensions on the pool nodes.
- Pool Availability Zone Support. A pool got a new node placement policy that allocates nodes across all the availability zones.
- Pool Node Ephemeral OS Disk Support. A new setting specifies ephemeral OS disk usage for the pool nodes. The ephemeral disk is created on local VM storage instead of a remote Azure Storage instance.
- Possibility to list supported VM sizes via API, CLI and PowerShell.
- Increased limit of applications for a Batch account from 20 to 200.
- Pools now support specifying the exact Marketplace VM image version used to create the nodes.
Cosmos DB
Cassandra API:
- Server-side retries for Cassandra API. Instead of applying rate-limiting (429 response), the service will now retry operations.
- Named indexes. You can now define a named index on a single column.
- Glowroot support. You can access Glowroot to monitor application performance. Cosmos DB Cassandra API got some new features to support Glowroot: truncate, lightweight transactions, and named indexes.
Cost management:
- Provisioned throughput spending limit. You can set a target budget on the provisioning that won't be exceeded.
- Azure Advisor recommendations. The service provides 2 new recommendations for cost saving and optimization. The first of them checks whether autoscaling is still better for the workloads rather than the manual throughput configuration. The second does the opposite and analyzes the historical usage patterns to detect if autoscale would be better than the manual configuration.
Other updates:
- Async IO support for Python. The Python SDK supports async IO for better concurrent task execution.
- Partial document update generally available. Previously, any update rewrote the entire document. With the partial update, it's possible to change only some of the document attributes.
- Indexing metrics. They show utilized and recommended indexed paths in the queries.
- Logic Apps Standard Connector. The connector includes a trigger for the change feed, as well as CRUD actions on the documents.
Data Explorer
A lot of announcements with general availability promotion of the preview features:
- Data copy with Data Factory or Synapse Analytics. A new connector in both services is available to read and write data to Data Explorer from Mapping Data Flows.
- Hot windows. It's a cache policy that adds a past period of data from cold storage to the hot cache for a better querying experience. For example, a table stores data from years 2000 - 2021 and only the last year is persisted in the hot cache. If you needed to analyze data from 2010, you could add only this year to the hot cache as a hot window. There is no need to cache the whole period from 2010 to 2021. It's then a useful feature for audit scenarios.
- Telegraf plugin. The plugin enables writing time-series data to Data Explorer from Telegraf agent. The feature is also present for Synapse Data Explorer but in public preview.
- Data Explorer Insights. The feature built on Azure Monitor Workbooks provides a unified view of your Azure Data Explorer performance, cache, ingestions and usage.
Data Factory
Two updates:
- Compute Optimized data flows will be retired on 31 August 2024. It's necessary to migrate them to General Purpose data flows before that date.
- Managed Virtual Network is generally available. Thanks to this feature Azure Integration Runtime can benefit from private connectivity to the data sources available in Data Factory.
Event Hubs
Two general availability announcements:
- Event Hubs Premium. The new tier offers better performance and predictability by leveraging the power of dedicated memory, storage and compute.
- EventHub action in Azure Monitor action groups. Azure Monitor alerts can now be delivered to an Event Hub.
Functions
Some news for the serverless offering:
- New Cosmos DB extension. The new extension is currently in preview and provides a bunch of interesting features, such as support for identity-based connections instead of the secret-based.
- New triggers and bindings. The Azure Functions service supports Azure Blob, Azure Queue, Azure Event Hubs, Azure Service Bus, and Azure Event Grid as the triggers and bindings.
- Azure SQL bindings. The service supports input and output bindings for Azure SQL. The feature is in preview.
- Dynamic concurrency. The feature is currently in preview for Azure Blob, Azure Queue, and Service Bus triggers. It helps adapt the number of instances to the workload.
HDInsight
Two network and one API change:
- Restricted public connectivity. The feature is supported in all regions and helps connect to the cluster privately, e.g. by bringing Private Link-enabled resources.
- Private Endpoint over Private Link. This new connection mode can be helpful when the Virtual Network peering is not available and the communication can't go over the public internet.
- API version update. The new API version called 2021-01-01 is generally available. In addition, the 2018-06-01 preview API will be retired on 30 November 2024.
Monitor
It's not a pure data change but it's worth noting the rename of action rules to alert processing rules.
Key Vault
Although it's not a pure data service, it got 2 important security updates:
- Automated key rotation. Thanks to this feature you can set a key rotation policy that will create a new version of the key on a scheduled basis. The feature is currently in preview.
- Azure governance policy. Azure Policy helps ensure that the Key Vault items are compliant with the organization policy. The feature is generally available.
Managed Instance for Apache Cassandra
The service providing a managed infrastructure for Apache Cassandra is now generally available.
Purview
- General availability. The service is now generally available.
- Snowflake and Amazon RDS connectors. Purview has Snowflake and Amazon RDS new connectors in preview.
- Dataset provisioning by data owner for Azure Storage. Thanks to this feature, data owners can manage data access to Azure Storage datasets directly from Purview.
- Microsoft Defender for Cloud integration. The integration adds an extra protection layer. The Microsoft Defender service can alert about any potential data security issues.
Service Bus
- Large messages. Premium tier namespaces now support messages of up to 100 MB. The previous limit was 1 MB. The change is generally available.
SQL Database
Hyperscale:
- Virtual Network connection with Private Link. Hyperscale (Citus) nodes can be securely connected to the Virtual Network with Private Link.
Database for PostgreSQL - Flexible Server:
- Intelligent tuning feature. The feature helps find the best storage parameters based on the usage patterns.
- New compute tiers (Ddsv4 and Edsv4) are available.
- Flexible Server deployment option is generally available.
- Geo-redundant backup and restore is in preview. The feature can restore a geo-redundant backup in a paired region in case of a region outage.
Database for MySQL:
- Flexible Server deployment option is generally available.
- Geo-redundant backup and restore is in preview. The feature can restore a geo-redundant backup in a paired region in case of a region outage.
SQL Managed Instance:
- New premium-series hardware. The hardware based on the 3rd Gen Intel® Xeon Scalable processor provides better compute and memory performance than standard-series (Gen 5) hardware.
- Admins are no longer the only ones who can access system-wide wait stats and resource stats. There are new roles (MS_ServerStateReader, MS_DefinitionReader, MS_ServerStateManager) that can access this information without being the server administrator.
- Virtual cluster and other dependent components are removed with last instance deletion in the subnet.
- Long-term backup retention. The backup can now be stored for up to 10 years.
- Backup storage configuration. It can be set to local, zone, or geo-redundant.
- Network traffic can be limited to a list of Storage Accounts with outbound firewall rules.
- Increased storage limit to 16 TB in General Purpose tier.
- Security control to prevent data exfiltration to Azure Storage. The feature relies on Virtual Network service endpoint policies.
- Distributed transactions across databases within multiple instances are generally available in .NET and T-SQL.
- Azure AD-only authentication. With the feature you can disable classical user/password authentication mechanisms.
- Linked servers support pass-through and Managed Identity authentication methods.
- A live instance can be moved to another subnet.
Storage Account
Security:
- Attribute-based Access Control (ABAC) in preview. The feature adds a possibility to extend access rules with the custom attribute-based conditions, such as blob tags or matching resource attributes with the principal attributes.
- Multi-region Storage Account access. Previously, a service endpoint enabled connections to the Storage Account only from Virtual Networks and subnets in the same region. The feature now works across any Azure region.
- Soft deletes for Azure Data Lake Storage. The feature that protects files and directories from accidental deletes is now generally available.
- Immutable storage with versioning for Blob Storage. With the immutable storage feature, blobs can be written only once and cannot be modified or deleted. You can also set a temporal or explicit lock on them. Now, the feature is generally available for blob versions.
Other features:
- Hierarchical namespace for existing accounts. Enabling hierarchical namespace for existing Storage Accounts is now generally available.
- Object replication for Premium Block Blob Storage. This new feature in preview enables replicating data from one storage account to another.
Stream Analytics
A few changes for High Availability and supported target data stores:
- Availability Zones with Dedicated Cluster. Dedicated Clusters now support Availability Zones. Thanks to this feature, in case of a disaster, the service will automatically fail over to other zones.
- Data Explorer support. A Stream Analytics job can now write data to a Data Explorer table.
- Azure Database for PostgreSQL support. A Stream Analytics job can now write data to an Azure Database for PostgreSQL table. The feature is available for Single Server, Flexible Server, or Hyperscale (Citus) deployment modes and is still in preview.
Synapse
Several changes for the service:
- Compute Optimized data flows retirement on 31 August 2024. As for Data Factory, it's necessary to switch to General Purpose by that date.
- Dynamic allocation for Apache Spark. The service supports autoscaling with a simple definition of the minimal and maximal number of executors. It complements the already available autoscaling based on the number of nodes.
- PREDICT functionality. The operator can be used to apply a prediction model on the data. It requires enabling SynapseML predict with the spark.synapse.ml.predict.enabled property.
- Delta Lake support. Serverless SQL pools can now read data in Delta Lake format.
- Custom partitioning for Synapse Link for Cosmos DB. The feature adds an extra partitioning capability that may be different from the transactional workload needs.
- Optimized engine for telemetry data. Synapse Data Explorer has a dedicated query engine optimized and built for log and time series data workloads.
- Pre-purchase plan. It's possible to pre-purchase Synapse compute for 1 year and save up to 28% compared to the pay-as-you-go pricing.
GCP
BigQuery
BigQuery Omni, a multi-cloud analytics solution, is now generally available.
Administration:
- Slot Estimator in preview. The tool helps reservation users estimate the right number of slots to purchase.
- Resource Charts generally available. Thanks to it, the administrators of reservation users can easily monitor the slot consumption, jobs concurrency, execution times, or errors across the entire organization.
- Session support. You can group several SQL activities across scripts or multi-statement transactions inside a common session.
- Authorized datasets. The feature authorizes access to all views in the dataset without configuring an authorized view for each of them separately.
- DDL column in INFORMATION_SCHEMA view. Thanks to this new column you can see the DDL statement used to create the table or view.
- Deleting the metadata for a specific job from CLI. The bq command line supports this delete operation.
IO:
- Parquet format for the export. The Parquet export format is now generally available.
- Table snapshots. The feature of creating a copy of a table at a specific point in time is now generally available.
- Storage Write API. This new API combines the functionality of high-throughput streaming ingestion and batch loading. It's now generally available.
- BigQuery Migration Service is in preview. The new BigQuery component adapts Teradata SQL queries to BigQuery SQL standard.
SQL:
- Geospatial functions. The service supports new geospatial functions: ST_EXTERIORRING, ST_INTERIORRINGS, ST_ANGLE, ST_AZIMUTH, ST_NUMGEOMETRIES, ST_GEOMETRYTYPE, ST_BUFFER, ST_BUFFERWITHTOLERANCE, ST_BOUNDINGBOX, ST_EXTENT, S2_COVERINGCELLIDS, S2_CELLIDFROMPOINT.
- New scripting statements: CASE to execute the first SQL statement evaluated to true, CASE ${search} to execute the first SQL statement where the ${search} matches a WHEN expression, FOR…IN for looping, LABELS to control jumps to the end of blocks or loops associated with the label, REPEAT to repeatedly execute a list of SQL statements until an expression evaluates to true (a bit like a while loop); see the scripting sketch after this list.
- Parameterized types. It's possible to define constraints on strings, bytes and numbers in script variables and columns.
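As an illustration of the scripting statements, the sketch below runs a REPEAT loop through the Python client; the project, dataset and table names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

script = """
DECLARE day_counter INT64 DEFAULT 0;
REPEAT
  SET day_counter = day_counter + 1;
  INSERT INTO `my_project.reporting.daily_rows` (day_number) VALUES (day_counter);
UNTIL day_counter >= 7
END REPEAT;
"""

# The whole script runs as a single multi-statement job.
client.query(script).result()
```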
Security:
- Row-level security for historical data. For time travel, if a table has or has had a row-level access policy, only table administrators can access historical data.
- Column-level encryption with KMS. Thanks to it, Cloud Key Management Service (Cloud KMS) can encrypt the keys that in turn encrypt the values within BigQuery tables. Therefore, the user must have access to the encryption key and the table to get the data. The feature is now generally available.
Pricing:
- 300 TB of data at no charge. BigQuery Storage Read API users can read up to 300 TB of data per month for free.
- The BigQuery Storage Read API is now charged for network egress.
BigQuery Transfer Service
- Audit Logging, Cloud Logging and Cloud Monitoring support. It's in preview.
Cloud Functions
- Minimum number of instances. The feature keeps some number of instances alive to address the cold start issues.
- Secret Manager integration. Secret references can now be passed to the function at the deployment time without writing any code to decrypt them in the function.
- Customer-managed encryption keys. This encryption method can be used to protect any data at rest used by the function.
Cloud SQL
SQL Server:
- Flexible Microsoft Active Directory integration. It's possible to integrate a SQL Server instance with an Active Directory domain located in a different project.
MySQL:
- CSV customization. It's now possible to customize formatting controls for CSVs for the import and export jobs.
- The mysqldump options. The feature makes it possible to specify mysqldump options during migration from external servers.
- Custom import to set up replication from large external databases.
PostgreSQL:
- CSV customization. It's now possible to customize formatting controls for CSVs for the import and export jobs.
- Enhanced support for multiline log entries in postgres.log. Previously, each line was written as a separate entry in Cloud Logging. Now they're written as a single entry.
- Native logical replication extension called pglogical is supported.
- Automatic IAM database authentication is now generally available. It's the recommended way for the most secure and reliable experience. In this mode, users only need to pass the IAM database username instead of the access token required by manual authentication.
- New extensions are generally available.
- auto_explain to automatically log execution plans of slow statements
- pg_cron to schedule commands from a database with a cron expression
- pg_hint_plan to specify query hints as comments
- pg_proctab to access operating system process tables from PostgreSQL
- Support for new flags: huge_pages, shared_buffers, wal_buffers, min_wal_size, max_pred_locks_per_page and max_pred_locks_per_relation.
Global:
- Out-of-disk recommender generally available. The feature generates recommendations to avoid the risk of downtime due to no free disk space left.
- Database minor version available when viewing information about an instance.
- Two new cost recommenders. The first identifies idle database instances in the project and generates recommendations to shut them down. The second detects overprovisioned instances and recommends a rightsizing operation.
- An instance nearly out of storage capacity will be automatically stopped to avoid losing the data.
- Access Approval is generally available. With this feature you have to explicitly approve access to your database instance for Google Support.
- Limited rate for backup operations (5 operations every 50 minutes at most).
Cloud Storage
Features:
- Object Versioning in the Cloud Console. The feature can now be fully managed from the Cloud Console.
- List Object V2. New list objects request is generally available. It includes some extra parameters such as continuation-token or start-after.
- Turbo replication in preview. This premium feature asynchronously replicates objects across regions within 15 minutes.
Security:
- Public Access Prevention. The feature is generally available and protects objects from being accidentally exposed to the public.
- Cloud External Key Manager (EKM) encryption support. The service can now be used to encrypt Cloud Storage data.
- orgpolicy.policy.get permission in GCS IAM roles. The permission allows the users to discover the project's organizational policy constraints.
Others:
- Object listing performance degradation fixed. The operation is no longer impacted by large-scale object deletion.
- XML API multipart uploads are immutable. The objects uploaded with this method cannot be rewritten or copied within GCS.
Data Fusion
Security:
- Role-Based Access Control (RBAC). RBAC is supported in preview to define what users can do at the namespace level.
- Customer-Managed Encryption Keys (CMEK) supported. It provides encryption capability for the pipeline logs and metadata, Dataproc cluster metadata, and data sinks, actions and sources.
Ops:
- Connections management. Admins can fully manage connections from Pipeline Studio, Wrangler, or the Namespace Admin page.
- Shielded VMs. You can now set Shielded VMs in the Dataproc provisioner configuration.
- Labels. Dataproc provisioner also supports labels.
- Apache Spark 3 is the new default engine for Dataproc cluster pipelines.
Other features:
- Dataproc cluster reuse. You can optimize pipeline run startup by reusing clusters from previous runs.
- Transformation pushdown. When enabled, Data Fusion will execute joins in BigQuery instead of Apache Spark.
- SAP data source. SAP is a new data source for batch-based and delta-based data extraction.
Data Loss Protection
New detectors and connections:
- Data profiler for BigQuery in preview. The data profiler scans all BigQuery tables and creates a dedicated data profile for each of them to identify sensitive and risky data locations.
- IMSI_ID infoType detector. It identifies users on a mobile network.
- ICCID_NUMBER infoType detector. It identifies each SIM card.
Dataflow
Dataflow Prime is available in preview. This new runtime environment executes Apache Beam pipelines in a serverless mode. Apart from that, there are 2 other announcements:
- Shielded VMs support. A Shielded VM gives a guarantee that the instance hasn't been compromised by boot- or kernel-level malware or rootkits.
- Zonal DNS for worker resources. Dataflow uses Zonal DNS for workers to provide the highest reliability guarantees around internal DNS registration.
Dataproc
- Log4j vulnerability fixes. GCP released several sub-minor image versions to address the Log4j 2 vulnerability.
- YARN activity detection turned on. The dataproc:dataproc.cluster-ttl.consider-yarn-activity property is set to true for image versions 1.4.64+, 1.5.39+, and 2.0.13+. Thanks to it, the service includes YARN activity in the idle time analysis of the Cluster Scheduled Deletion feature.
- Apache Spark configuration properties. spark.history.fs.gs.outputstream.type and spark.history.fs.gs.outputstream.sync.min.interval.ms are now available as Apache Spark properties to control GCS flush behavior for event logs.
Datastream
- General availability. The service is now generally available in all GCP regions.
- Customer-managed encryption key support.
Firestore
- Triggers for Cloud Function are generally available.
- DATA_READ and DATA_WRITE Data Access audit logs are generally available. They help monitor all read and write operations on the documents.
IAM
Two interesting changes for this security service:
- Workload identity federation extension. The feature can now be used with any SAML 2.0-compatible identity provider.
- Documentation enhancement. The documentation now provides a better explanation of how to choose the most appropriate predefined roles.
Pub/Sub
- Extended topic retention. Published messages can now be retained for 31 days at most.
Spanner
Ops:
- Support for modifying the leader region.
- Request and transaction tags. Both types of tags can now be assigned in the application code to facilitate query performance, transaction latency and lock contention troubleshooting.
- Time To Live policy. You can now define a TTL policy to remove irrelevant data from the tables (see the DDL sketch after this list).
- Statistics package customization. The query optimizer can use a different statistics package than the most recent one to ensure predictability in the query plans. Each new statistics package has its dedicated name that can be found in the INFORMATION_SCHEMA.SPANNER_STATISTICS table.
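The TTL policy is defined with a DDL statement. A minimal sketch through the Python client, with hypothetical instance, database, table and column names:

```python
from google.cloud import spanner

client = spanner.Client()
database = client.instance("demo-instance").database("demo-db")

operation = database.update_ddl([
    # Rows whose CreatedAt timestamp is older than 30 days become eligible for deletion.
    "ALTER TABLE Events ADD ROW DELETION POLICY (OLDER_THAN(CreatedAt, INTERVAL 30 DAY))"
])
operation.result()
```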
Other features:
- JSON data type support. Spanner can now work with the fields storing JSON documents.
- PostgreSQL interface in preview. The feature makes Spanner features accessible from the PostgreSQL ecosystem. During a Spanner database creation you can now choose the SQL dialect and configure it as PostgreSQL SQL dialect.
- Query page supports multiple query tabs. You're no longer limited to running one query at a time in a single query form.
- The django-spanner plugin is available. Spanner can be used as a backend database for Django Web framework.
- R2DBC driver is available in preview. R2DBC is a specification enabling non-blocking access to relational databases.
Spark on Google Cloud
It's an autoscaling serverless Spark service integrated with other GCP offerings. It's still in Private Preview.
Storage Transfer Service
- Export data from GCS to a POSIX file system. The feature is in preview.
- AWS Security Token Service (STS) integration. The integration enables data transfer from AWS S3 without passing long-term AWS S3 credentials, reducing at the same time the management overhead of these credentials.
- Data Lake Storage Gen 2 is generally available. It's now possible to transfer data from Azure Data Lake Storage Gen 2 to GCS.
- Manifest file. It's possible to create a manifest file with the list of all files to transfer from cloud and on-premises sources.
- Agent pools. You can use this feature to create isolated agent groups and run concurrent data transfer jobs. It's an alternative to creating multiple transfer projects.
- API for on-premises transfer management. The RESTful API to automate on-prem to GCS transfer workflows is now generally available.
- Detailed logs for copied objects. Individual objects copied from AWS S3, Azure Blob, and Azure Data Lake Gen 2 now have detailed logging information.
- gcloud command-line facility. The CLI supports creation and management of data transfer either directly from the terminal or from automated migration scripts. The feature is currently in preview.
This time too, a lot of exciting news for those of us working on the cloud. If I had to pick the most important announcements, I would definitely go for all the serverless changes for streaming brokers, data processing, and data warehouse services.