It's the last cloud data engineering news update of the year, and a lot of things came out, especially for stream processing!
Data Engineering Design Patterns
Looking for a book that defines and solves the most common data engineering problems? I'm currently writing
one on that topic and the first chapters are already available in 👉
Early Release on the O'Reilly platform
I also help solve your data engineering problems 👉 contact@waitingforcode.com 📩
This post covers the news released until December 17th. The remaining 2 weeks will be included in the first update blog post next year. As usual, I'm trying to present an opinionated list with the most important updates highlighted in bold.
AWS
Athena
Performance:
- A new query engine version 3 is available. It integrates the latest features from the Trino Open Source project, including 50 new SQL functions, 30 new features, and more than 90 query performance improvements.
- Query Result Reuse. This feature accelerates repeatedly executed queries by returning results for them from the cache layer.
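The Query Result Reuse feature is opt-in per query. A minimal sketch of what the request could look like, assuming the boto3 Athena client's start_query_execution call; the workgroup and output bucket names are hypothetical placeholders:

```python
def athena_query_params(sql: str, max_age_minutes: int = 60) -> dict:
    """Build parameters for the Athena StartQueryExecution API
    (boto3: client("athena").start_query_execution) with result reuse enabled."""
    return {
        "QueryString": sql,
        "WorkGroup": "primary",  # hypothetical workgroup
        "ResultConfiguration": {"OutputLocation": "s3://my-athena-results/"},  # hypothetical bucket
        # Return cached results if the same query ran within the last hour
        "ResultReuseConfiguration": {
            "ResultReuseByAgeConfiguration": {
                "Enabled": True,
                "MaxAgeInMinutes": max_age_minutes,
            }
        },
    }

params = athena_query_params("SELECT count(*) FROM events")
# boto3.client("athena").start_query_execution(**params)
```
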
Integration:
- Data source connector for IBM Db2 is available.
- Integration improvements with Apache Iceberg. The list of the integrated features includes commands like CREATE TABLE AS SELECT (CTAS), MERGE, and VACUUM, or views for Iceberg-managed tables.
- Apache Spark support. Athena supports interactive workloads running PySpark code.
- Amazon Managed Streaming for Kafka (MSK) and Apache Kafka connector. This new querying capability relies on the federated query feature and helps analyze data in motion without writing it to S3 beforehand.
Misc:
- AWS Lake Formation fine-grained access control support.
Aurora
- Snapshotless export to S3. Aurora clusters can export data to S3 in Apache Parquet format without creating a snapshot beforehand. There's no longer a need to manage snapshots after the data export.
- Simplified secure connection setup with EC2. The feature automates the secure connection setup between an Aurora cluster and an EC2 machine. It removes the requirement of creating the network configuration between these 2 components.
Backup
- Legal hold extended capability. You can now create a legal hold beyond defined retention policies. When enabled, the backups won't be automatically deleted after reaching the end of the retention policy.
- AWS CloudFormation integration. A new application-aware protection ensures a single Recovery Point Objective (RPO) for entire applications.
- Amazon Redshift support. It's now possible to use AWS Backup to schedule and restore Redshift manual snapshots.
- Backup delegation. Previously, only the management account could administer backups. Now, organization-wide administration can be delegated to a dedicated administrator account.
Batch
- Job report retention period extended from 24 hours to 7 days.
CloudWatch
- Amazon CloudWatch Logs protection for data in motion. The service supports ML- and pattern matching-based data protection for the sensitive data logged by your application. The feature can detect and hide the sensitive information before writing it to the service, reducing the risk of accidental exposure.
- Cross-account observability. The feature adds a possibility to monitor and troubleshoot applications across multiple AWS accounts.
Data Exchange
- AWS Lake Formation integration. Users can now find and subscribe to 3rd party datasets managed by AWS Lake Formation. The feature is in Preview.
- Amazon S3 integration. Users can now directly access datasets shared from an S3 bucket. The feature is in Preview.
Data Sync
- AWS DataSync Discovery. This new component should accelerate any on-premises-to-cloud migration. It analyzes the on-premises data and gives automated recommendations for migrating the storage layer to AWS services.
Database Migration Service
- Internet Protocol version 6 (IPv6) support for the replication instances.
- Schema Conversion integration. This feature allows you to convert the schema, views, stored procedures, and functions from a source database into the schema for the target database service.
DocumentDB
- Elastic Clusters are GA. Elastic Clusters make scaling DocumentDB collections much easier: you simply change the number of shards, and AWS manages the underlying infrastructure changes under the hood.
DynamoDB
- The actions limit per transaction increased from 25 to 100.
- Bulk imports from an S3 bucket. The import works with CSV, DynamoDB JSON and Amazon Ion file formats and complements the already existing export-to-S3 feature.
- DynamoDB Local integration with NoSQL Workbench to locally test DynamoDB operations.
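The bulk import can be triggered programmatically as well. A minimal sketch of the request parameters, assuming the DynamoDB ImportTable API (boto3: client("dynamodb").import_table); the bucket, prefix, and table names are hypothetical:

```python
def import_table_params(bucket: str, table_name: str) -> dict:
    """Build parameters for DynamoDB's ImportTable API call."""
    return {
        "S3BucketSource": {"S3Bucket": bucket, "S3KeyPrefix": "exports/"},  # hypothetical prefix
        "InputFormat": "CSV",  # also supports DYNAMODB_JSON and ION
        # The import always creates a brand new table, defined here
        "TableCreationParameters": {
            "TableName": table_name,
            "AttributeDefinitions": [{"AttributeName": "pk", "AttributeType": "S"}],
            "KeySchema": [{"AttributeName": "pk", "KeyType": "HASH"}],
            "BillingMode": "PAY_PER_REQUEST",
        },
    }

params = import_table_params("my-export-bucket", "imported_events")
# boto3.client("dynamodb").import_table(**params)
```
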
EC2
Although it's not a pure data service, EC2 got a few interesting updates if you run a data workload on it:
- Automated connection set-up with an RDS database. The solution implements the best security practices and automatically sets up security groups.
- Spot Instance interruption tests. The tests can be run directly from the Amazon EC2 console via the AWS Fault Injection Simulator integration.
EMR
Kubernetes:
- Job templates support to store and share parameters across job runs.
- Spark SQL scripts execution through the StartJobRun API support.
- GPU-based computing is available for EMR on EKS with the Nvidia RAPIDS Accelerator support.
Serverless:
- Real-time monitoring for EMR Serverless jobs with the native Spark and Hive Tez UIs.
- Real-time monitoring for EMR Serverless jobs with CloudWatch metrics.
- Cross-account S3 access and DynamoDB connector available for the Serverless Spark and Hive workloads.
- Spark properties configuration support within EMR Studio Jupyter Notebooks.
Others:
- Metastore check command optimization for Hive workloads. The optimization reduces the number of S3 file system calls when fetching partitions.
- Parquet Modular Encryption for Hive workloads. The feature enables encrypting and authenticating sensitive information in Parquet files.
- Checkpointing on S3 or HDFS for long-running fault-tolerant queries with Trino. The feature reduces wasted cluster resources by storing intermediary results and reusing them in case of a job restart.
- More performant integration with Amazon Redshift. EMR can now access data in Redshift clusters up to 10x faster thanks to operation pushdown. The solution also integrates with IAM for credential-less connections.
- Amazon SageMaker Data Wrangler can now use Amazon EMR Presto as the query engine.
EventBridge
- CloudFormation template generation from the rules and buses console pages.
- Event patterns generation from a schema.
- Amazon EventBridge Pipes is Generally Available. It simplifies point-to-point integrations between event producers and consumers.
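A Pipe wires a source to a target, with optional filtering and enrichment in between. A minimal sketch of the creation request, assuming the boto3 pipes client's create_pipe call; all names and ARNs below are hypothetical placeholders:

```python
import json

def pipe_params() -> dict:
    """Build parameters for the EventBridge Pipes CreatePipe API call."""
    return {
        "Name": "orders-pipe",  # hypothetical
        "RoleArn": "arn:aws:iam::123456789012:role/pipes-role",  # hypothetical
        "Source": "arn:aws:sqs:eu-west-1:123456789012:orders-queue",  # hypothetical
        "Target": "arn:aws:lambda:eu-west-1:123456789012:function:process-order",  # hypothetical
        # Only the messages matching the pattern reach the target
        "SourceParameters": {
            "FilterCriteria": {
                "Filters": [{"Pattern": json.dumps({"body": {"status": ["NEW"]}})}]
            }
        },
    }

params = pipe_params()
# boto3.client("pipes").create_pipe(**params)
```
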
Fault Injection Simulator
Although it's not a pure data service, it's good to know it to check the reliability of your infrastructure.
- Network connectivity disruption is a new fault type supported in the service. It includes disrupting all traffic, or limiting the disruption to traffic to/from a specific Availability Zone, VPC, custom prefix list, or service.
Glue
Crawlers:
- Incremental crawling for Amazon S3 data based on S3 Notifications read from a SQS queue.
- Support for crawling data stored in Snowflake.
Others:
- Ray integration. This new engine option is another way to process data in Python on Glue.
- Native support for Apache Hudi, Apache Iceberg and Delta Lake. The feature natively integrates them and removes the need to install the connectors.
- Data Quality in Preview. The feature analyzes the data stored in your data lake and recommends the most appropriate data quality rules for the datasets.
- Git integration for tracking Glue jobs history and facilitating their deployment.
- The Sensitive Data Detection feature can identify and process sensitive data for Japan and UK entities.
- Possibility to create custom visual transforms and share them in different jobs.
Kinesis
Firehose:
- Support for Amazon OpenSearch Serverless sink.
Data Analytics:
- 3 new container level metrics for Apache Flink runtime. It's now possible to visualize CPU Utilization, Memory Utilization, and Disk Utilization of Flink Task Managers.
Lake Formation
- Cross-account sharing version 3. The new version supports sharing AWS Glue Data Catalog resources to direct IAM principals (roles or users). Additionally, the release also brings sharing with AWS Organization/Org unit using LF-tags based sharing.
Lambda
Processing:
- Amazon S3 Object Lambda supports custom code to modify the results returned by S3 HEAD and LIST API requests. Previously the feature was available only for the GET requests.
- SnapStart for Java functions. It's a performance optimization that initializes the execution environment and caches it, avoiding the from-scratch initialization that can have a huge impact on latency-sensitive applications.
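SnapStart is enabled through the function configuration and applies to published versions. A minimal sketch, assuming boto3's update_function_configuration call; the function name is a placeholder:

```python
def snapstart_config(function_name: str) -> dict:
    """Build parameters for Lambda's UpdateFunctionConfiguration API call,
    enabling SnapStart on published versions."""
    return {
        "FunctionName": function_name,
        # The snapshot is taken when a version is published; use "None" to disable
        "SnapStart": {"ApplyOn": "PublishedVersions"},
    }

config = snapstart_config("my-java-function")  # hypothetical function name
# boto3.client("lambda").update_function_configuration(**config)
```
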
Security:
- Amazon Inspector supports AWS Lambda functions and provides continual and automated vulnerability assessments.
Ops/Others:
- AWS Parameters and Secrets Lambda Extension to retrieve parameters from AWS Systems Manager Parameter Store and secrets from AWS Secrets Manager without having to use the core SDK.
- Telemetry API availability for Lambda Extensions. It opens a possibility to easily connect a Lambda function to third-party tools such as Coralogix, Datadog, Dynatrace, Lumigo, New Relic, Sedai, Serverless.com, Site24x7, Sumo Logic, Sysdig, and Thundra.
Managed Grafana
- Prometheus Alertmanager rules visualization support.
- AWS CloudFormation support.
Managed Workflows for Apache Airflow
- New metrics for container, queue, and database are available in CloudWatch.
MemoryDB
- Data tiering helps reduce storage costs. The feature uses cheaper SSDs in the cluster and reduces the overall storage cost. It's available for the clusters running on the Graviton2-based R6gd nodes.
- System and Organization Controls (SOC) compliance.
MSK
Connect:
- Private DNS hostnames support.
Serverless:
- The serverless version is HIPAA eligible. The service can be used to store, process, and access protected health information.
Others:
- New low-cost storage tier. It reduces the storage cost by 50% and extends the previous options by offering virtually unlimited storage.
Neptune
Some new features for the graph database:
- General Availability of the Neptune Serverless.
- Concise Bounded Description query support for the SPARQL language. The query opens a way to find a subgraph whose leaf nodes are not blank nodes.
- Real-time inductive inference for Neptune ML. The feature makes it possible to apply existing predictions to new nodes in the graph.
OpenSearch
- Support for cross-VPC connectivity with AWS PrivateLink. The feature enables private connection to the OpenSearch Service without using public IPs or requiring traffic to traverse the Internet.
- OpenSearch Serverless is in Preview.
RDS
Oracle:
- Support for the instance store for temporary tablespaces and the Database Smart Flash Cache on M5d and R5d instances.
- Oracle Multitenant support in RDS Custom for Oracle.
SQL Server:
- TDE-enabled SQL Server Database Migration support.
- Cross Region Read Replica support.
- Linked server to Oracle. You can access external Oracle databases from your SQL Server RDS instance.
- Access to transaction log backups. The backups can also be used to restore the database to a specific point in time.
- Scaling storage support to run applications without interruption.
MySQL:
- An option to enforce the encrypted SSL/TLS connections to the database instances.
- Optimized Reads with up to 50% faster queries.
- Optimized Writes with up to 2x higher write throughput.
PostgreSQL:
- Cascaded read replicas support to get up to 30x more read capacity!
Global:
- M6i and R6i instance types for MySQL and PostgreSQL.
- Usage metrics against AWS service limits are published to CloudWatch.
- Internet Protocol Version 6 (IPv6) support for the newly created RDS instances.
- Easier access to Performance Insights for any time interval.
- Support for publishing events to encrypted SNS topics.
- RDS events now contain attributes for filtering with Amazon SNS.
- Concurrent copy limit increased to 20 snapshots per destination region.
- CloudFormation support for Multi-AZ deployments with 2 readable standbys.
- Up to 2x faster transaction commit latency in 12 extra regions for Multi-AZ option.
- General Purpose gp3 storage volumes support.
- Blue/Green Deployments for safer and simpler updates.
Redshift
Integration:
- Zero-ETL integration for Amazon Aurora. Put another way, you can access Amazon Aurora data directly from your Redshift cluster and combine it with the analytical data stored there.
- Informatica Data Loader integration for data uploads.
Others:
- Data sharing support for centralized access control with AWS Lake Formation in Preview. It simplifies data governance with this central permissions management space.
- Dynamic Data Masking in Preview. The feature impacts the way a query might return the result containing sensitive data to the user. For example, the result can contain fully masked or partially masked values.
- GA for direct data ingestion for Kinesis Data Streams and Managed Streaming for Apache Kafka (MSK). No more need to write a custom pipeline to copy data from these streaming sources to the data warehouse!
- Support for MERGE, ROLLUP, CUBE, and GROUPING SETS to simplify building multi-dimensional analytics.
- Auto-copy from S3 support. It enables continuous file ingestion rules from S3.
- Namespace and workgroup tagging for Redshift Serverless.
- Enhanced system logs with consistent durability. You can fetch up to 7 days of logs from the system tables (STL) and system views (SVL).
- CONNECT BY construct supported to create queries processing hierarchical data.
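The MERGE support mentioned above replaces the classical staging-table UPDATE-then-INSERT dance with a single statement. A sketch of what an upsert could look like; the table and column names are illustrative:

```python
def build_merge_sql(target: str, staging: str) -> str:
    """Assemble a Redshift MERGE statement upserting staged rows into a target table."""
    return (
        f"MERGE INTO {target}\n"
        f"USING {staging} ON {target}.id = {staging}.id\n"
        f"WHEN MATCHED THEN UPDATE SET amount = {staging}.amount\n"
        f"WHEN NOT MATCHED THEN INSERT (id, amount) VALUES ({staging}.id, {staging}.amount);"
    )

sql = build_merge_sql("sales", "sales_staging")
# Run it e.g. through the Redshift Data API's execute_statement call
```
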
S3
Storage classes:
- Restore throughput improved by up to 10x in Amazon S3 Glacier. It's possible thanks to the new transactions per second limit per account, which is 1000 after the update.
Outpost:
- Access Point aliases support. The aliases simplify access management by giving the possibility to create unique access controls for shared datasets.
- Object versioning support.
- S3 Lifecycle rules support.
Security:
- Access Points. Bucket owners can now share access to their datasets with access points created in other accounts.
- S3 Block Public Access will be automatically enabled and access control lists automatically disabled for new buckets created in April 2023 and later.
Other features:
- Active-passive configuration with Amazon S3 Multi-Region Access Points to improve the High Availability of the bucket in case of a failover.
- Objects encrypted with customer-provided keys (SSE-C) are supported in Amazon S3 Replication.
- Additional 34 new metrics for Amazon S3 Storage Lens to improve the visibility into object storage usage and activity.
- Amazon S3 Select improves the Trino performance by up to 9x for data queried from S3.
- Access Control Lists (ACLs) usage added to Amazon S3 server access logs and AWS CloudTrail.
Security Lake
It's a new service and for now, it's available in Preview. Amazon Security Lake simplifies collecting and analyzing security data sources from AWS CloudTrail management events, Amazon Virtual Private Cloud (Amazon VPC) Flow Logs, Amazon Route 53 Resolver query logs, and AWS Security Hub.
SNS
- Message data protection. The feature helps discover sensitive data in motion.
- Message signatures can now be based on SHA256 hashing. It complements the existing SHA1 algorithm.
- Payload-based message filtering. It extends the already available attribute-based filtering with the possibility to filter out irrelevant messages based on their content.
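Payload-based filtering reuses the existing filter policy syntax; the only change is the policy scope, which can now target the message body. A minimal sketch, assuming boto3's set_subscription_attributes call; the subscription ARN and policy fields are hypothetical:

```python
import json

def payload_filter_settings(subscription_arn: str) -> list:
    """Build the two SetSubscriptionAttributes calls switching a subscription
    to payload-based filtering."""
    policy = {"status": ["NEW"], "price": [{"numeric": [">", 100]}]}  # hypothetical fields
    return [
        # Match against the message body instead of message attributes
        {"SubscriptionArn": subscription_arn,
         "AttributeName": "FilterPolicyScope",
         "AttributeValue": "MessageBody"},
        {"SubscriptionArn": subscription_arn,
         "AttributeName": "FilterPolicy",
         "AttributeValue": json.dumps(policy)},
    ]

settings = payload_filter_settings("arn:aws:sns:eu-west-1:123456789012:orders:1234")
# for s in settings:
#     boto3.client("sns").set_subscription_attributes(**s)
```
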
SQS
- Server-Side Encryption with Amazon SQS-managed encryption keys is a new default encryption configuration for newly created queues.
- Increased throughput quota for FIFO High Throughput (HT) mode. The new limit is 6,000 transactions per second.
- Attribute-based access control (ABAC) using queue tags improves the flexibility of the access permissions.
Step Functions
- New execution observability features for Express Workflows. The visual improvements simplify tracing and root causing issues directly from the AWS Console.
- Simplified cross-account access to more than 220 AWS services invoked as a part of the workflows.
- Parallel workflows for data processing. You can now use Step Functions to process S3 data. The service launches and coordinates thousands of parallel workflows to process the identified dataset and write the result to S3. The processing logic still needs to be implemented elsewhere (e.g. AWS Lambda), though.
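This parallel processing is expressed directly in the state machine definition as a Distributed Map state. A sketch of what such a state could look like in the Amazon States Language, written as a Python dict; the bucket, prefixes, and Lambda ARN are hypothetical placeholders:

```python
def distributed_map_state(bucket: str, worker_lambda_arn: str) -> dict:
    """Sketch a Distributed Map state iterating over S3 objects."""
    return {
        "Type": "Map",
        # Enumerate the input dataset straight from S3
        "ItemReader": {
            "Resource": "arn:aws:states:::s3:listObjectsV2",
            "Parameters": {"Bucket": bucket, "Prefix": "input/"},
        },
        # Each item runs in its own child workflow execution
        "ItemProcessor": {
            "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS"},
            "StartAt": "ProcessObject",
            "States": {
                "ProcessObject": {"Type": "Task", "Resource": worker_lambda_arn, "End": True}
            },
        },
        "MaxConcurrency": 1000,
        # Write the aggregated results back to S3
        "ResultWriter": {
            "Resource": "arn:aws:states:::s3:putObject",
            "Parameters": {"Bucket": bucket, "Prefix": "output/"},
        },
        "End": True,
    }

state = distributed_map_state("my-data-bucket",
                              "arn:aws:lambda:eu-west-1:123456789012:function:worker")
```
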
Storage Gateway
- Increased cloud upload (up to 5.2 Gbps) and download (up to 8 Gbps) performance by 2x for Tape Gateway.
Transfer Family
- Post-upload processing of partially uploaded files. It complements the previous feature, which worked only on fully uploaded files.
QuickSight
QuickSight Q:
- Support for questions concerning datasets with Row level Security (RLS) is available in QuickSight Q.
- Automated data preparation for QuickSight Q. This AI-based feature automatically selects fields, classifies dimensions and measures, creates name labels, and applies column value formats.
- New question types for QuickSight Q for prediction ("forecast") and understanding of current trends ("why").
Misc:
- Missing Data control added for line and area charts.
- Customer Managed Keys (CMK) for SPICE data encryption.
- A new admin asset management console to interactively list and search all account assets, regardless of who owns them.
- Dashboards availability for seller reporting and insight in AWS Marketplace. Previously, the sellers had to download a CSV file and analyze the reporting locally.
- NULL parameter support.
- New styling options are available for line and marker in line charts.
- Small Multiples for line, bar and pie charts. The feature creates multiple versions of the base visual, presented side-by-side, with its data partitioned across these versions by a dimension.
- Databricks connectivity support.
- SPICE consumption monitoring available in CloudWatch.
- Cluster points for the Geospatial visual. They improve visibility by visually merging data points located close to each other.
- API capabilities are GA. A dashboard can now be managed like any software project.
- Paginated Reports to create, schedule and share multi-page reports.
Azure
Backup
- Reserved capacity for Azure Backup Storage. You can now purchase reserved capacity and save up to 24% on your backup storage costs.
- Smart tiering to vault-archive tier. This new backup policy adds an extra guarantee that the recommended recovery points will be moved to the vault-archive tier.
- Zone-redundant storage support.
- Enhanced soft-deletes. This new feature enforces the soft-delete capability by removing the possibility to disable it. As a consequence, soft-deleted data will remain in the soft-deleted state.
- Multi-user authorization for Backup vaults. It adds an extra security layer with a resource guard to ensure critical operations are performed with proper authorization.
- Immutable vaults in Public preview. It ensures the created recovery points cannot be deleted before their expiration.
- Support for creating backups of confidential VMs using Platform Managed Keys.
Cache for Redis
- Improved passive geo-replication on the Premium tier. It includes additional metrics and simplified failover initialization.
Cosmos DB
MongoDB:
- The 16MB limit per document is Generally Available. It's 8x more than before.
- Retryable writes from MongoDB driver are Generally Available.
- A new fine-grained, role-based permission model using role-based access control (RBAC) is available.
PostgreSQL:
- Azure Cosmos DB for PostgreSQL is Generally Available. The feature relies on the Citus extension to horizontally scale workloads.
- Cross-region read replicas General Availability.
- Azure Blob Storage integration via the pg_azure_storage extension. It simplifies various operations, such as uploading new documents. The feature is GA.
Cassandra:
- Materialized views are in Public Preview. The feature enables having tables with different primary/partition keys.
- Intra-account container copy. It allows seamless creation of offline copies of containers within your account.
Misc:
- Azure IoT Hub integration. You can now use Azure Cosmos DB as an endpoint for the IoT Hub workload. It reduces the complexity that previously required using an intermediate data processing component like Stream Analytics or an Azure Function.
- API renaming:
- Core (SQL) API → Azure Cosmos DB for NoSQL
- API for MongoDB → Azure Cosmos DB for MongoDB
- Azure Cosmos DB → Azure Cosmos DB for PostgreSQL
- Cassandra API → Azure Cosmos DB for Apache Cassandra
- Table API → Azure Cosmos DB for Table
- Gremlin API → Azure Cosmos DB for Apache Gremlin
Data Explorer
- Kusto Emulator is GA. It's a Docker Container encapsulating the Kusto Query Engine that enables local development and automated testing.
- OpenTelemetry exporter is available. You can use this vendor-neutral open-source observability framework to ingest telemetry data.
Database Migration
- Migration assessment for Oracle to Azure migration. An assessment helps identify the complexity of the migration workload before performing any real action.
- Migration assessment for SQL Server to Azure migration.
Event Hubs
- Log compaction support. The service automatically compacts events sharing the same partition key. The feature is currently in Public preview.
Functions
- Azure SQL trigger. You can now invoke an Azure Function in response to a row creation, update, or deletion in a SQL database.
- V2 programming model using Python. It makes the functions more Pythonic with some notable changes, like triggers and bindings declared as decorators or importing through blueprints.
- Increased maximum scale-out limits for Azure Functions Linux Elastic Premium plan.
- Isolated worker model support in the .NET Framework.
Monitor
- Ingestion-time transformations. It's now possible to apply custom transformations to the data collected through Diagnostic Settings, AMA and MMA agents, and Sentinel Connectors.
- Custom logs API. It's another way to customize the data sent to the Monitor service.
- Custom Log and IIS Log collection capability. It enables collecting the text-based logs generated in the services or applications.
Service Bus
- Performance improvements for the Premium tier with scaling partitions and more consistent low latency thanks to underlying architecture changes.
SQL Database
Hyperscale:
- You can migrate back to the general purpose tier from Hyperscale.
- Long-term backup retention.
- Geo-zone redundant storage option for an improved resilience of the backups.
- Import and export operations are GA.
- Premium series hardware based on the 3rd Gen Intel® Xeon Scalable processor and AMD EPYC™ 7763v (Milan) chipsets is in Preview.
PostgreSQL - Flexible Server:
- Fast restore. It can be useful for situations that don't require the latest data, such as testing or development.
- Geo-redundant backup and restore.
- Read replicas are in preview.
- Customer managed keys encryption for the service-managed keys.
- Azure Active Directory authentication support.
- Enhanced metrics with up to 93 days of history.
MySQL - Flexible Server:
- Read replicas are GA.
- Data at rest encryption with Customer-Managed Keys.
- Azure Active Directory authentication support.
- AMD hardware availability.
- Autoscale IO. There is no need for provisioning. Instead, the feature scales IO up and down according to the workload.
SQL Managed Instance:
- The approximate_percentile function to compute percentiles faster for large datasets is available.
- Geo-zone redundant storage for improved backup redundancy.
- Memory-optimized premium series instances to handle more demanding workloads. They offer over 2.5 times more memory per vCore compared to standard-series.
- A new TempDB configuration to configure the number of TempDB files and their growth increments can improve the Managed Instance performance.
- Backup portability to SQL Server 2022 is GA.
- The link for SQL Server 2022 is GA. You can use this to offload or scale-out R/O workloads and analytics.
- Stop & start capability to optimize costs.
- Zone-redundant deployments to improve the resiliency of the database. It uses Availability Zones to replicate your instances across multiple physical locations within an Azure region.
- Faster creation time. The first instance in a new subnet can be created in less than 30 minutes.
- New features are available at the subnet level. They include removing the public management endpoint, narrowing the scope of inbound and outbound rules imposed on the subnets, and trimming IP address space consumption.
- Distributed transactions across managed instances with Distributed Transaction Coordinator is supported.
- More granular monitoring of the database restore progress.
- Transactional replication can now be used to replicate data to tables in remote databases or from remote databases to Azure SQL Managed Instance.
- Time series capabilities for data analysis over time with time-windowing, aggregation, and filtering.
- T-SQL queries supported for Azure Data Lake files.
- Virtual clusters with a new set of capabilities, including subnet association (Resource Navigation Link → Service Association Link), added tag support, and an asynchronous DNS servers update API pattern.
- Additional transparency for automated and manual backups using the msdb_backupset tables.
- Resumable add table constraints enabled with a T-SQL execution.
- Log Replay Service can now be used to migrate the database to Azure SQL Managed Instance.
- Cross-subscription Point-In-Time database restore.
- Online database copy and database move operations across Azure SQL Managed Instances within Azure Subscription support.
- A new 128 vCore option on standard-series hardware is available to boost performance for demanding applications.
- Query Store hints. It's a method for shaping query plans and behavior without changing the application code.
Misc:
- Higher limits for auto-scale compute. You can now use up to 80 vCores in selected regions for Azure SQL Database serverless.
- An extension for offline migration from SQL Server to SQL Database.
- Azure Function integration with Azure SQL Database queries. You can now invoke the function or any other REST endpoint directly from SQL!
- MySQL extension for Azure Data Studio to connect to and modify MySQL databases along with your other databases.
Storage Account
Security:
- Attribute-based access control for standard Storage Accounts. The new security strategy supports defining access levels based on attributes associated with security principals, resources, and requests.
- Resource instance rules for access to Azure Storage are GA. With the feature you can restrict access to specific resources of select Azure services.
MISC
- Improved append capability for Immutable Storage for Blob Storage. This evolution gives a possibility to append data to the immutable blocks without compromising their write-once state.
- Immutable storage for Azure Data Lake Storage is GA.
- Customer-initiated Storage Account conversion from non-zonal redundancy (LRS/GRS) to zonal redundancy (ZRS/GZRS) is in preview.
- Encryption scopes on hierarchical namespace in Preview. The feature enables provisioning multiple encryption keys in a storage account with hierarchical namespace.
- SFTP support for Azure Blob Storage.
Stream Analytics
- No-code editor updates: Azure SQL Database integration as a new sink or data enrichment storage, runtime logs added to diagnostic logs, Delta Lake support as a sink, Event Hubs support as a sink, Azure Data Explorer support as a sink, support for multiple parameters in built-in functions, and a refreshed UI.
- Service Bus authentication is possible now with the managed identity.
- Azure Database for PostgreSQL is a supported output in the jobs.
- Improved performance. You can observe up to 45% gains thanks to the CPU boost.
- Managed private endpoint support to Synapse SQL output.
- User-assigned managed identities support. The change includes their creation and usage to access Cosmos DB.
- Azure Data Explorer output support.
- Exactly-once delivery for Azure Data Lake Storage Gen2 in Preview.
- Physical job diagram creation for easier troubleshooting.
- Job diagram simulator in Visual Studio Code.
Synapse
Several changes for the service:
- Azure Synapse Link for Azure Cosmos DB Gremlin API is in Public preview.
- Azure Synapse Link for SQL is Generally Available.
- Improved Azure Synapse Analytics Spark performance by up to 77%. It's made possible with the most recent changes, like moving to the latest Azure v5 VMs with improved CPU performance, increased temporary SSD throughput, and higher remote storage IOPS.
GCP
BigQuery
Administration/OPS:
- The Slot recommender helps optimize BigQuery usage for on-demand billing customers.
- The Slot estimator helps manage slot capacity based on historical performance metrics.
- Metrics for quota usage and limits of the Storage Write API's concurrent connections and throughput quotas are in Cloud Monitoring.
- Shuffle usage ratio is available in the admin resource charts.
- The projects.list API returns the number of items per page instead of the approximate total number of projects across all pages.
- Concurrent connections quotas are based on the project initiating the Storage Write API request and not the project containing the written BigQuery dataset.
IO:
- LOAD DATA statement support for loading data directly from an Amazon S3 bucket or Azure Blob Storage.
- ASCII control characters in CSV files and reference files with the expected table schema for Avro, ORC and Parquet-based external tables are 2 new data loading features.
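The cross-cloud LOAD DATA statement relies on a BigQuery Omni connection pointing at the external cloud. A sketch of what the statement could look like; the dataset, bucket, and connection names are illustrative:

```python
def load_data_sql(table: str, s3_uri: str, connection: str) -> str:
    """Assemble a BigQuery LOAD DATA statement reading Parquet files from S3."""
    return (
        f"LOAD DATA INTO {table}\n"
        f"FROM FILES (format = 'PARQUET', uris = ['{s3_uri}'])\n"
        f"WITH CONNECTION `{connection}`;"
    )

sql = load_data_sql("mydataset.sales", "s3://my-bucket/sales/*.parquet",
                    "aws-us-east-1.my-s3-connection")
# Run it with the BigQuery client, e.g. google.cloud.bigquery.Client().query(sql)
```
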
SQL:
- ROUND_HALF_EVEN rounding mode for NUMERIC and BIGNUMERIC columns.
- ST_ISCLOSED and ST_ISRING are new geography functions GA.
- Search indexes and SEARCH() function for search in the text and semi-structured data are GA.
- Wrapped keysets management functions for Cloud KMS keys are GA.
Security:
- Customer-managed encryption keys integration with CMEK organization policies.
- Private connectivity with Cloud SQL.
Omni:
- New quotas and limits: 1TB/day for total query result sizes and 10GB for the maximum result size for a query.
- Support for the on-demand pricing model for a limited duration.
Other features:
- Case insensitivity configuration with is_case_insensitive schema option.
- Multi-statement transactions are GA.
- Stored procedures for Apache Spark written in Python are available in BigQuery.
- Remote functions are GA. You can now call Cloud Functions or Cloud Run from the BigQuery queries.
- BigQuery migration assessment available for Amazon Redshift.
- The query execution graph that helps diagnose performance issues is in Preview.
- Support for querying Apache Iceberg tables.
- Object tables are available in Preview. They're read-only tables with metadata for unstructured data located in GCS. Typically, you can use them to analyze images, audio files, or other file types with BigQuery ML or BigQuery remote functions.
- Metadata caching is in Preview. It can improve performance for BigLake tables and object tables by avoiding listing objects from GCS.
- Analytics Hub is GA.
- New dashboard metrics. Among them you will find BI Engine Top Tables Cached Bytes, BI Engine Query Fallback Count, and Query Execution Count.
- Add data feature available in the Cloud console to search for and ingest data sources.
BigQuery Transfer Service
- Google Ads API support in preview.
Cloud Composer
Cloud Composer 2:
- Airflow triggerer and Deferrable Operators are in Preview.
- The Local Development CLI tool is available to test and develop on local Airflow environments.
- Data lineage is in Preview. The feature connects to Dataplex to track data movement through the system.
Security:
- Encryption with Customer-managed encryption keys (CMEK) supported in persistent disks of the environment's Redis queue.
Some of the bug fixes:
- Fixed a race condition where a task could be scheduled on a worker that was being scaled down.
- CPU limits for the FluentD environment are adjusted to avoid missing logs.
- Fixed OOM issues with improved file synchronization.
Others:
- Maintenance and other environment operations (snapshotting, configuration changes) are now displayed in the Monitoring Dashboard.
Cloud Functions
- The cloudfunctions.googleapis.com/v2 API can now read 1st gen functions with the get and list methods. To restrict the response to 2nd gen functions only, you can use the filter field, for example filter=environment="GEN_2".
Cloud SQL
SQL Server:
- Permanent time zone support for new instances.
MySQL:
- High-availability for the self-service migration.
- Query insights is GA. The feature helps detect, diagnose, and prevent query performance problems.
- Preview version of the recommenders for High number of open tables and High number of tables.
PostgreSQL:
- The log_timezone and TimeZone flags are supported.
- Preview version of the high transaction ID utilization recommender to reduce the risk of transaction ID wraparound.
- Additional metrics and events available in the System insights dashboard.
PostgreSQL and MySQL:
- Cascading Replicas are GA. With this feature, read replicas can have their own read replicas.
Global:
- Possibility to reuse an instance name immediately after deletion.
Cloud Storage
Security:
- Public access prevention is enabled by default for the new buckets.
- Google Cloud console can now suggest role recommendations and policy insights for buckets to help better understand permissions management.
Misc:
- Buckets tags are GA.
- gcloud storage CLI is GA. It provides faster uploads and downloads than gsutil.
- The Autoclass feature to automatically transition objects between storage classes based on their access patterns is available.
- In Preview you can see an expanded GCS monitoring dashboard. It supports alert creation and exposes various metrics (server and client error rates, write request counts, network ingress rates, and network egress rates).
- Turbo replication mode is available for all dual-regions.
Data Catalog
- Analytics Hub integration is GA. Data Catalog indexes the datasets shared in the service.
- Public tags are Generally Available. They have less strict access control for searching and viewing tags than private tags.
Data Fusion
- Asset level lineage in Preview for the integration with Data Catalog.
- Two new versions are available: 6.7.2 is the new GA release, whereas 6.8.0 is the most recent Preview release.
- DNS Resolution is GA. It lets you use domain names or hostnames for sources instead of their IP addresses.
Data Loss Protection
New detectors and connections:
- New infoType detectors are available: STREET_ADDRESS (a new detection model), OAUTH_CLIENT_SECRET, NEW_ZEALAND_IRD_NUMBER, and VAT_NUMBER.
- New data profiles now include by default the approximate percentage of non-null rows in which each infoType was detected, for infoTypes other than the predicted ones.
- New exclusion rule called ExcludeByHotword to exclude a finding from the inspection results when the column name matches a regular expression.
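To illustrate the ExcludeByHotword semantics: the real rule is declared in the DLP inspection configuration, but its column-name matching behaviour can be sketched in a few lines of Python (all names and the findings structure here are hypothetical):

```python
import re

def exclude_by_hotword(findings, column_regex: str):
    """Toy simulation of DLP's ExcludeByHotword rule: drop findings
    whose column name matches the given regular expression.
    The real rule lives in the DLP inspection config; this only
    illustrates the matching semantics."""
    pattern = re.compile(column_regex)
    return [f for f in findings if not pattern.search(f["column"])]

findings = [
    {"column": "user_email", "infoType": "EMAIL_ADDRESS"},
    {"column": "internal_test_email", "infoType": "EMAIL_ADDRESS"},
]
# Exclude findings coming from test columns:
print(exclude_by_hotword(findings, r"test"))
```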
Dataflow
- Regional placement for workers. The workers can now be placed in any zone of the region, which adds extra flexibility when one of the zones is unavailable.
- Upgraded VM Image. The new image addresses the vulnerability named Retbleed.
Dataplex
- BigLake integration is in Preview. You can use it to create BigLake tables instead of external ones.
- Auto data quality (Auto DQ) to define and measure data quality is in Preview.
- Data profiling is in Preview. It helps get a better understanding of the datasets and provide recommended data quality rules.
- Data exploration workbench called Explore is GA. It provides one-click access to Spark SQL scripts and Jupyter notebooks.
- Source and Sink plugins are GA in Cloud Data Fusion.
Dataproc
- A new metric called dataproc.googleapis.com/job/state is available to track the status of Dataproc jobs.
- The spark.dataproc.diagnostics.enabled property is supported in Dataproc Serverless. It enables auto diagnostics on Batch failure.
- Dataproc Serverless for Spark supports Artifact Registry with image streaming for fast application startup and autoscaling.
- Spark and System metrics are available in Dataproc Serverless for Spark. They include metrics from Spark-related components like block manager or DAG Scheduler.
- Spot VMs can be used as secondary workers without a maximum lifetime. In the previous versions, they had a 24-hour maximum lifetime.
- The DECOMMISSIONING, NEW, and SHUTDOWN node states are included in the /cluster/yarn/nodemanagers metric.
- Auto Zone Placement takes any reservation into account by default.
Datastream
- BigQuery data and schema updates support in preview.
- BigQuery supported as a migration destination.
- PostgreSQL supported as a migration source.
Firestore
- count() queries to determine a number of documents in a collection are available in preview.
- Time-to-live policies are Generally Available. The policy helps remove stale data automatically from the collection.
IAM
- Deny policies are GA.
- Google Cloud console exposes authentication activities. For example, you can use it to see when a service account and its keys were last used.
Pub/Sub
- Exactly-once delivery support is GA.
- BigQuery subscription changes:
- Support for writing string fields as TIMESTAMP, DATETIME, DATE, or TIME columns in BigQuery.
- Support for JSON type for all string fields.
- Support for the Avro logical types, such as timestamp-micros, date, and time.
- GA of the Kafka Connector library for Pub/Sub and Pub/Sub Lite.
- New monitoring dashboards for topics and subscriptions.
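Exactly-once delivery guarantees that a message is not redelivered after a successful ack, but keeping consumer side effects idempotent remains good practice, for example during the window before an ack succeeds. A minimal deduplication sketch (this is local simulation code, not the google-cloud-pubsub client API):

```python
class IdempotentConsumer:
    """Processes each message ID at most once, a pattern that
    complements Pub/Sub's exactly-once delivery guarantee.
    Local sketch only: message IDs and payloads are plain strings."""

    def __init__(self):
        self._processed = set()
        self.results = []

    def handle(self, message_id: str, data: str) -> bool:
        """Return True if the message was processed, False if skipped."""
        if message_id in self._processed:
            return False  # duplicate redelivery: skip the side effect
        self._processed.add(message_id)
        self.results.append(data)
        return True

consumer = IdempotentConsumer()
print(consumer.handle("m1", "order-created"))  # -> True
print(consumer.handle("m1", "order-created"))  # -> False (duplicate)
```

In production the seen-IDs set would live in durable storage (or you would rely on Pub/Sub's ack-based guarantee alone) rather than in process memory.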
Spanner
Change Data records:
- Change Data records template for Dataflow. You can use it to stream change records directly from a Spanner database to a Pub/Sub topic.
- Two new capture types for change records: NEW_VALUES, streaming only the new values, and NEW_ROW, which also includes the unchanged columns.
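The difference between the two capture types can be sketched with plain dictionaries (a local simulation, not the Spanner change streams API; the row and column names are made up):

```python
def capture_change(old_row: dict, updates: dict, capture_type: str) -> dict:
    """Illustrates the two new Spanner change stream value capture types:
    NEW_VALUES emits only the modified columns, while NEW_ROW emits the
    full row after the update, unchanged columns included."""
    if capture_type == "NEW_VALUES":
        return dict(updates)
    if capture_type == "NEW_ROW":
        return {**old_row, **updates}
    raise ValueError(f"unknown capture type: {capture_type}")

row = {"id": 1, "name": "Alice", "city": "Paris"}
update = {"city": "Lyon"}
print(capture_change(row, update, "NEW_VALUES"))  # -> {'city': 'Lyon'}
print(capture_change(row, update, "NEW_ROW"))     # -> full row with city='Lyon'
```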
Performance:
- A new default optimizer version number 5 is GA.
Querying:
- ARRAY_SLICE function to return a part of the input array is available.
- Support for JSONB data type.
- RETURNING statement. You can use it to return the columns modified as part of the executed DML statement.
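As a quick illustration of ARRAY_SLICE: unlike Python's half-open slicing, it takes inclusive start and end offsets. The sketch below assumes zero-based offsets with negative values counting from the end of the array, as I understand the Spanner semantics:

```python
def array_slice(arr, start, end):
    """Sketch of Spanner's ARRAY_SLICE semantics, assuming zero-based
    inclusive start/end offsets and negative offsets counting from the
    end of the array (unlike Python's half-open slicing)."""
    n = len(arr)
    if start < 0:
        start += n
    if end < 0:
        end += n
    if n == 0 or start > end:
        return []
    return arr[max(start, 0):end + 1]

print(array_slice([10, 20, 30, 40], 1, 2))    # -> [20, 30]
print(array_slice([10, 20, 30, 40], -2, -1))  # -> [30, 40]
```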
Security:
- Fine grained access control is in preview. The feature supports column- and table-level security with classical GRANT/REVOKE SQL statements.
Other features:
- Time to Live is supported in PostgreSQL-dialect databases.
- Cross-region and cross-project backup scenarios support.
- Concurrent database restore operations per instance increased from 5 to 10.
- Fixed the calculation rule for the Total Database Storage metric, which was underestimated. The billing impact should be less than 0.5% for the majority of impacted users.
- Custom instance configuration with optional read-only replicas support.
- Support for instance reconfiguration requests, including between regional and multi-regional configurations.
- Additional dashboards to troubleshoot latency and locks. You will find them in the Lock insights and Transaction insights dashboards.
- Preview integration between Spanner and Vertex AI. It enhances Spanner applications with ML capabilities.
- Increased number of mutations per commit from 20,000 to 40,000.
Storage Transfer Service
- Multipart upload for file system transfers is GA and enabled by default. This mode can speed up transfers of large files.
- GA Support for transferring data between file systems, including on-premises file systems and Filestore instances.
- Preview support for event-driven transfers for real-time replication from AWS S3 to Cloud Storage, and between Cloud Storage buckets.
- A transferJobs.delete method is available in the Storage Transfer Service REST API.
- Exporting data from GCS to a file system is GA.
- Preview support for moving data from an S3-compatible storage to GCS.
As usual, I highlighted my top picks from the most recent releases. I can see a lot of work on stream processing. The list of updates for Azure Stream Analytics has never been so long! AWS Athena can now query MSK, and Redshift can get data directly from MSK or Kinesis Data Streams! Besides, I also have a feeling that the cloud providers are starting to take open table formats (Apache Iceberg, Apache Hudi, Delta Lake) more and more seriously. The added support for Apache Iceberg and the BigLake changes announce some other major changes in 2023. And finally, I also have a feeling of déjà vu. The zero-ETL integration making Aurora data available in Redshift looks very similar to Azure Synapse Link and the more general concept called Hybrid Transactional/Analytical Processing, doesn't it?