What's new on the cloud for data engineers - part 12 (10.2023-02.2024) on waitingforcode.com

It's time for another part of "What's new on the cloud for data engineers". Let's see what happened in the last 5 months.

Bartosz Konieczny

AWS

Athena

Provisioned Capacity reservations. It's an alternative pricing model in which you pay for reserved compute capacity, independently of the volume of data scanned.
Provisioned Capacity monitoring on CloudWatch with additional metrics for used compute.
Support for S3 Express One Zone storage that can boost query performance by 2.1x compared to queries executed on the Standard storage class.
Performance improvement with cost-based optimizer (CBO). The feature leverages statistics from AWS Glue Data Catalog tables to create the best execution plans.
A new and improved JDBC driver was released in November.
Trusted identity propagation with AWS IAM Identity Center.

Aurora

Reduced database restart time by up to 65% thanks to deferring portions of the buffer pool initialization and validation process until after the database is already online and accepting connections.
Query plans from replica instances can be captured and integrated into query planning to improve execution time.
Write forwarding support from the secondary to primary region instances.
Federated query support for MySQL and MariaDB.
Zero-ETL integration with Redshift in Public Preview so that you can query Aurora transactional datasets directly from Redshift.
HypoPG extension support for creating hypothetical indexes to help evaluate the impact of indexes on real queries.
Support for h3-pg for geospatial indexing.
Better control over managed extensions by regular users with the rds_extension database role.

Batch

Access to the array size via AWS_BATCH_JOB_ARRAY_SIZE variable. The feature applies to array jobs that are a way to parallelize processing on AWS Batch.
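
As a minimal sketch of how an array job child could use the new variable, the snippet below splits a list of inputs between the children. It assumes the child index is read from AWS_BATCH_JOB_ARRAY_INDEX. The shard_for_child and current_shard helpers and the file names are hypothetical.

```python
import os


def shard_for_child(items, array_index, array_size):
    """Return the items that one array job child should process."""
    # Child i takes every array_size-th item, starting at position i.
    return [item for pos, item in enumerate(items) if pos % array_size == array_index]


def current_shard(items, env=os.environ):
    """Read the array job coordinates from the environment and pick the shard."""
    # Outside of an array job the variables are missing, so fall back to one shard.
    array_index = int(env.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))
    array_size = int(env.get("AWS_BATCH_JOB_ARRAY_SIZE", "1"))
    return shard_for_child(items, array_index, array_size)


if __name__ == "__main__":
    files = [f"input/part-{n:03d}.json" for n in range(10)]  # hypothetical inputs
    print(current_shard(files))
```

With the size available at runtime, the sharding logic no longer needs the array size passed in as a job parameter.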

CloudWatch

Support for custom data protection policies to protect the sensitive data in the logs.
Integration with AWS Lambda in response to the alarm state change to OK, ALARM or INSUFFICIENT_DATA.

Data Exchange

Support for data grants which are time-bounded read-only permissions for other AWS customers.
Support for provider-generated notification where data providers can notify their subscribers about any updates in the related datasets.

DataSync

Support for manifest files that contain all files to be used in the data transfer task.

Database Migration Service

Amazon Timestream support as a new target.
Amazon Relational Database Service for Db2 support as a new target.

DataZone

General Availability of DataZone. It's a data management service designed to catalog, discover, analyze, share, and govern data. The release also includes a dedicated console, AWS CloudFormation support, and HIPAA eligibility.

DocumentDB

Support for JSON schema validation with dedicated rules attached to the collections.
In-place major version upgrade without service interruption is available.
Vector search for storing, indexing, and searching representations of unstructured data.

DynamoDB

Zero-ETL integration with Amazon Redshift.
Zero-ETL integration with Amazon OpenSearch Service.
Support for ReturnValuesOnConditionCheckFailure parameter and deletion protection for DynamoDB local.

EMR

EMR Serverless:

Support for default configurations at the application level for a given EMR Serverless job.
Support for fine-grained data access control with Apache Spark for EMR Serverless via LakeFormation policies.
EMR Studio connection to EMR Serverless for interactive analytics support.

EMR Studio:

EMR Studio has a new interactive query editor for Athena.
Simplified EMR Studio creation workflow in the console.
EMR Studio integration with Amazon CodeWhisperer for smart code suggestions.
Support for customer managed keys for EMR Studio workspace storage.
Users can specify their own secrets in an EMR Studio Workspace and through user role permissions ensure that the secrets are only accessible by them. Previously Git secret credentials were available for any user with access to the Workspace.

EMR:

General Availability of Apache Flink on EMR on EKS.
Backup and restore capability for HBase tables managed on EMR. The feature relies on Amazon EMR WAL which is a durable managed storage layer that outlives the cluster.
Support for fine-grained access controls (FGAC) on Open Table Formats (OTFs) in LakeFormation for jobs running from Amazon EMR.
High availability for EMR on EC2 clusters with instance fleets configuration. The feature uses three On-Demand primary nodes to recover from a primary node failure.
General Availability of EMR on EKS Interactive Endpoints for running interactive workloads, such as EMR Studio.
Data processing accelerated by up to 4x thanks to S3 Express One Zone connectivity.

EventBridge

EventBridge Pipes supports logging to Amazon CloudWatch Logs, Amazon Simple Storage Service (Amazon S3), and Amazon Kinesis Data Firehose. Previously, only the metrics were available.
EventBridge EventBus read-only calls support in CloudTrail. Previously, only data mutation events were present.
Integration with Event Ruler v1.5.0 library for better filtering capabilities.

Glue

Support for new notebook kernel magics: assume_role, tags, session_type and matplot in Glue Interactive Sessions.
Support for IAM Conditionals in Glue Interactive Sessions.
Improved visual identification of custom transformations for Glue Studio with custom .
GitLab and Bitbucket are new members of the Git integration in Glue.
Native connectivity to Google BigQuery.
Native connectivity to Amazon OpenSearch Service.
Native connectivity to new databases: Teradata, SAP HANA, Azure SQL, Azure Cosmos DB, Vertica, and MongoDB.
Column-level statistics support for Glue Data Catalog. Other services, such as Redshift can leverage them to optimize Spectrum and Serverless queries.
Support for delegating KMS key permissions to an IAM role.
Support for automatic compaction for Apache Iceberg tables in Data Catalog.
Integration with Amazon Q to build data pipelines using natural language.
Data Quality can identify the records that failed a CustomSQL rule.
Anomaly detection and dynamic rules are now part of the Data Quality toolset.
AWS Glue serverless Apache Spark UI and AWS Glue observability metrics are now Generally Available for an improved observability of Apache Spark jobs.
Entity-level actions such as partial or full redaction, and encryption to manage sensitive data.
Multi-engine views that you can later query from Athena or EMR.
Faster and embedded interactive data preview experience for Glue Studio with Glue Data Preview.
Improved user experience with example jobs for visual ETL and notebooks, drag-and-drop for connecting nodes, a data-preview-focused layout, and a simpler UI for Glue.

Keyspaces

Frozen collections are available. The feature enables creating complex indexing structures since frozen collections' primary keys can contain other collections.
Data Manipulation Language (DML) query support in CloudTrail for auditing purposes.
Support for provisioned capacity mode for Multi-Region Replication tables.

Kinesis

Data Streams:

Cross-account access with AWS Lambda.
Support for the on-demand mode in RDS Database Activity Streams. Database Activity Streams captures events from the database, encrypts them, and uploads the records to an Amazon Kinesis data stream.
On-demand write throughput increased to 2GB/s, which is twice the previous limit.
Integration with EventBridge Pipes console.

Firehose:

Amazon Kinesis Data Firehose becomes Amazon Data Firehose. It's only a rebranding, but don't be surprised if you no longer find the former name in the documentation.
Zero-second buffering. With this feature, data is delivered as soon as it arrives in the buffer.
Support for Amazon Redshift Serverless as a target.
Support for Splunk configured with either an Application Load Balancer (ALB) or a Classic Load Balancer (CLB).
Support for Snowflake Snowpipe Streaming in Preview.
Support for delivering decompressed CloudWatch Logs to S3 and Splunk destinations.

Lake Formation

Hybrid Access Mode for Glue Data Catalog. The feature enables Lake Formation for a specific set of users without interrupting other existing users or workloads.
Support for permissions on subfields of nested tables using data filters.

Lambda

Faster scaling and more consistent throughput for Kafka event source.
Faster scale up thanks to an increased rate from 500 concurrent runs per minute to 1000 runs per 10 seconds.
Support for IAM access control for multi-VPC enabled Amazon MSK clusters.
Improved monitoring with a new ClaimedAccountConcurrency metric.
More intuitive user experience in the console for connecting to RDS and RDS Proxy.
Launch of a single pane view of metrics, logs, and traces in the Lambda console.
Support for viewing and exporting the function's template to AWS Application Composer.
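
To put the scale-up rates from the list above in perspective, here is a back-of-the-envelope sketch. It assumes a simplified model where capacity is granted in whole steps with no initial burst, so it is only an approximation of the service's real behavior. The seconds_to_reach helper and the 5,000-execution spike are hypothetical.

```python
import math


def seconds_to_reach(target_concurrency, step, step_seconds):
    """Seconds needed to add target_concurrency executions, one step at a time."""
    # Simplification: capacity arrives in whole steps, with no initial burst.
    steps_needed = math.ceil(target_concurrency / step)
    return steps_needed * step_seconds


if __name__ == "__main__":
    target = 5000  # hypothetical traffic spike
    before = seconds_to_reach(target, step=500, step_seconds=60)
    after = seconds_to_reach(target, step=1000, step_seconds=10)
    print(f"old rate: {before}s, new rate: {after}s")
```

Under these assumptions, the spike is absorbed in 50 seconds instead of 600.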

Managed Workflows for Apache Airflow

Shared VPC support via customer managed endpoints. It enables teams from different AWS accounts to create resources in a centrally managed VPC.

MemoryDB

Simplified cluster creation from the console by reducing the settings to configure by default.
IAM Authentication support.

MSK

Fully managed data delivery to S3 with Firehose.
MSK Replicator to enable cross-region and same-region MSK replication scenarios.
IAM connection support for all popular languages, including Java, Python, Go, JavaScript, and .NET.
Automatic storage capacity alerts that notify you when you are at risk of exhausting your storage capacity.
Trusted Advisor includes a check on the max recommended partitions in the broker.
Integration with Amazon EventBridge Pipes in the MSK service console.

Neptune

Neptune Analytics, an API for loading, querying, and analyzing graph data, is Generally Available. It removes the overhead of building and managing complex data pipelines for analytics.

OpenSearch

Integrated alerts and anomalies onto dashboard visualization line charts to optimize the user experience.
OpenSearch Service Integrations for improved integration with AWS services.
Support for Open Cybersecurity Schema Framework (OCSF) data format and custom logs. It simplifies integration with other AWS offerings such as Security Lake.
Four new language analyzer plugins are available: Nori (Korean), Sudachi (Japanese), Pinyin (Chinese), and STConvert Analysis (Chinese).
IPv6 support.
Multimodal support on Neural Search.
Two new domain statuses, Domain processing status, and Configuration change status, are available for a better monitoring of domain updates.
Support for TLS 1.3.
Support for combining lexical and semantic search for hybrid search score.
Improved accuracy and speed with efficient vector query filters for FAISS.

RDS

Oracle:

Support for Oracle Multitenant.

SQL Server:

Support for changing the server-level collation.
Support for Service Master Key Retention and Transparent Data Encryption.

MySQL:

Up to 3x higher write throughput with recent service releases.
Group Replication support for active-active replication.
Support for multi-source replication.

Db2:

Support for up to 5 000 database users.
Support of EBCDIC collation sequence.

Global:

Dedicated Log Volume for PostgreSQL, MySQL, and MariaDB databases.
Amazon RDS Optimized Writes can be enabled using RDS Blue/Green Deployments.
Aurora and RDS PostgreSQL support in Blue/Green Deployments.
Multi-AZ with two readable standbys support minor version upgrades with 1 second of downtime.
RDS Performance Insights generates recommendations on database performance and availability issues before they become critical.

Redshift

Administration:

Canonical Name (CNAME) or custom domain name support for an easier fail-over and connections setup.
Multi-AZ is generally available for RA3 clusters.
Smarter and faster sort and distribution key recommendations are possible thanks to the improvements of Redshift Advisor.
MaxRPU is a new cost control setting for Redshift Serverless. It defines the maximum compute level that can be allocated to the cluster resources.
AI-driven scaling and optimizations in Redshift Serverless.

Security:

Preview support for fine-grained access control capabilities to nested objects, such as the ones from SUPER field type.
Cross-account cross-VPC, custom domain name(CNAME), snapshot scheduling, cross-region copy (CRC), improved visibility for serverless billing in the Redshift console, and version tracking are new features in Redshift Serverless.
A new CONJUNCTION TYPE to support row-level security (RLS) policies and RLS on standard views and late binding views.
Support for metadata security that enables administrators to restrict the visibility of their catalog data based on user roles and permissions.
Integration with AWS Secrets Manager for an easier management of administrator credentials.
Support for routing user queries to the queues associated with their roles.

Client:

SUPER data type can now support up to 16MB.
General Availability for Apache Iceberg tables.
Support for incremental refresh for materialized views on Apache Iceberg and standard AWS Glue tables. An incremental refresh doesn't require rerunning the view creation query and overwriting the whole dataset.
Multidimensional Data Layouts is a new powerful table sorting mechanism that improves performance of repetitive queries.
Programmatic access to the Advisor API.
Integration with Visual Studio Code to execute SQL queries against a Redshift cluster.
Autocomplete suggestions available on Query Editor V2.
Multi-data warehouse writes through data sharing.
Provisioned concurrency scaling and serverless autoscaling now support Create Table As Select (CTAS).
Support for H3 Indexing and other spatial grid indexing functions.
Auto and incremental materialized views for base tables that are shared data.

S3

Last-Modified time for delete markers using S3 Head and Get APIs.
S3 Object Lambda integration with Amazon Athena. You can leverage this feature to mask data on the fly, or perform any transformations prior to returning data to the users.
Support for IPv6 on Amazon S3 on Outposts.
Amazon S3 Access Grants integration with identity providers, such as Active Directory, and with AWS Identity and Access Management (IAM) principals.
S3 Storage Lens groups to aggregate metrics using custom filters based on object metadata.
S3 Storage Lens aggregates activity and status code metrics by prefix.
Automatic date-based partitioning for S3 Server access logging.
S3 Batch Operations can manage buckets or prefixes in a single step.
S3 Connector for PyTorch.
Mountpoint for S3 can optimize repeated data access thanks to the underlying cache on EC2.
Mountpoint for S3 Container Storage Interface (CSI) driver. Thanks to it, a Kubernetes container can access S3 from a file system interface.
Mountpoint for Amazon S3 supports the new S3 Express One Zone storage class.
General Availability of S3 Express One Zone storage class that provides access speed up to 10 times faster and request costs up to 50% lower than Amazon S3 Standard.

SNS

Custom data identifiers support for data protection. The custom policy is a RegEx that the service will use to detect any sensitive attributes in the message. As an action, you can configure it to report or redact the values.
Increased FIFO topic throughput to 3000 messages per second.
Support for delivery status logging configuration with CloudFormation.
Support for delivering mobile push notifications through Google Firebase's HTTP V1 API.
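
The custom data identifiers idea from the list above can be illustrated with plain Python regular expressions. This is only a sketch of the detect-then-report-or-redact logic, not the SNS policy syntax. The EMP-123456 identifier format and the report and redact helpers are hypothetical.

```python
import re

# Hypothetical custom identifier: an internal employee ID such as "EMP-123456".
EMPLOYEE_ID = re.compile(r"EMP-\d{6}")


def report(message, pattern=EMPLOYEE_ID):
    """Count the sensitive values the pattern finds in the message."""
    return len(pattern.findall(message))


def redact(message, pattern=EMPLOYEE_ID, mask="#"):
    """Replace every match with a mask of the same length."""
    return pattern.sub(lambda match: mask * len(match.group(0)), message)


if __name__ == "__main__":
    text = "Badge EMP-123456 approved by EMP-654321"
    print(report(text))
    print(redact(text))
```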

SQS

Extended Client Library for Python for managing big payloads (up to 2GB). Internally, the library stores them on S3 and only sends a message with the object reference in SQS.
Increased throughput quota for FIFO High Throughput mode. The numbers may vary depending on the regions.
EventBridge Pipes console integration.
JSON protocol support for reducing end-to-end message latency and resource usage on the consumer side.
Logging data events support in CloudTrail.
Support for dead-letter queue redrive for FIFO queues.
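
The mechanism behind the Extended Client Library mentioned in the list above can be sketched with in-memory stand-ins for S3 and SQS. This illustrates the pattern only and is not the library's actual API. The InMemoryStore class, the send and receive helpers, the envelope field names, and the 256 KB cut-off are assumptions made for the example.

```python
import json
import uuid

THRESHOLD_BYTES = 256 * 1024  # assumed cut-off above which payloads are offloaded


class InMemoryStore:
    """Stand-in for an S3 bucket: maps object keys to payloads."""

    def __init__(self):
        self.objects = {}

    def put(self, key, body):
        self.objects[key] = body

    def get(self, key):
        return self.objects[key]


def send(queue, store, payload, threshold=THRESHOLD_BYTES):
    """Enqueue the payload, offloading it to the store when it is too large."""
    if len(payload.encode("utf-8")) > threshold:
        key = str(uuid.uuid4())
        store.put(key, payload)
        # Only a small pointer to the stored object travels through the queue.
        queue.append(json.dumps({"offloaded_key": key}))
    else:
        queue.append(json.dumps({"body": payload}))


def receive(queue, store):
    """Dequeue one message and resolve the pointer if the payload was offloaded."""
    envelope = json.loads(queue.pop(0))
    if "offloaded_key" in envelope:
        return store.get(envelope["offloaded_key"])
    return envelope["body"]
```

The consumer gets the original payload back in both cases, so the offloading stays transparent to the application code.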

Step Functions

Optimized Integration for EMR Serverless. It adds support for running synchronous jobs with 6 new API Actions (CreateApplication, StartApplication, StopApplication, DeleteApplication, StartJobRun, and CancelJobRun).
Support for restarting workflows from failure.
New TestState API to test a single step in the workflow.
Support for HTTPS endpoints.
Integration for additional 33 AWS services.

Transfer Family

Events published to EventBridge for SFTP connectors.
Events published to EventBridge for SFTP, FTPS, and FTP servers.
Events published to EventBridge for AS2 servers and connectors.
Static IPs for sending AS2 messages and MDNs, and for SFTP connectors.

Azure

Backup

Enhanced soft deletes. The feature strengthens the protection by making soft delete always-on and irreversible.
Long-term retention for backup of Azure Database for PostgreSQL - Flexible Server.
Cross Region restore for PostgreSQL backups is Generally Available.
Multi-user authorisation for Backup vaults.

Cache for Redis

Flush data operation to delete all cached data is in Public preview.
Update channel configuration for previewing upcoming changes. The feature is recommended to be enabled on non-critical environments.
Logging capability for all connection, disconnection, and authentication events occurring on the cache. The feature is available for the Enterprise tier.

Cosmos DB

Shared throughput databases can now merge unused partitions.
PgAudit extension for audit logging is Generally Available for Cosmos DB for PostgreSQL.
Customer-managed keys (CMK) support in Azure Cosmos DB for PostgreSQL.
Integration with Microsoft Copilot to transform user questions into Cosmos DB queries.
Azure Cosmos DB account can now be defined as a custom routing endpoint in Azure IoT Hub.
Enhanced migration experience with Azure Cosmos DB Migration for MongoDB extension for any MongoDB to Cosmos DB for MongoDB migrations.
General Availability of Azure Cosmos DB for MongoDB vCore.
Free tier of Azure Cosmos DB for MongoDB vCore is Generally Available.
Vector search in Azure Cosmos DB for MongoDB vCore is Generally Available.
Azure OpenAI Studio's "Use your data" integration with Azure Cosmos DB for MongoDB vCore.
Priority-based execution in Azure Cosmos DB to run some important requests before others.
Cross-account container copy for NoSQL API.
Support for autoscale, dynamic scaling per partition and per region.

Data Explorer

Ingestion connector from Splunk to Azure Data Explorer is in Public Preview.
Apache Flink connector is in Public Preview.
Kusto supports three new geospatial functions: geo_polygon_to_h3cells(), geo_angle() and geo_azimuth().
Support for seamless migration from VNet injected Data Explorer cluster to Private Endpoints.

Database Migration

"As on-premise" sizing recommendation in Azure Migrate SQL Discovery and Assessment that proposes the sizing based on the source instance.

Event Grid

New MQTT broker feature is now generally available.
A new system topic for Azure health resources and resource management events.
Public preview support for events from services like Microsoft Entra ID, Microsoft Outlook, and Microsoft Teams.

Functions

Unlimited execution time with Flex Consumption Plan. The feature is only available with early access preview.

HDInsight

Apache Flink on Azure HDInsight on AKS.

Managed Instance for Apache Cassandra

Preview features of Azure Managed Instance for Apache Cassandra, including vector search, multi-region deployment with turnkey replication, and reduced tail latencies.

Monitor

Alerts integration with Event Grid for Key Vault system events.
Support for ingesting JSON logs into Log Analytics from Azure Monitor Agent.

Service Bus

Partitioned namespaces are Generally Available for the Premium tier.

SQL Database

Hyperscale:

A new pricing strategy was put in place in December. It can lead to bills up to 35% lower.
Elastic pools to improve the performance and scalability with new premium-series hardware.
Standby replica without license costs.

PostgreSQL:

Storage auto-grow and online disk scaling are Generally Available.
Extension for Azure AI allows developers to leverage large language models (LLMs) and build rich PostgreSQL generative AI applications.
Private endpoints to bring the database into Virtual Network are in preview.
Performance and scalability enhancements thanks to Premium SSD, IOPS scaling, and near-zero downtime compute/storage scaling.
Improved disaster recovery features with Virtual Endpoints and "Promote to primary server" feature.
Server logs are now Generally Available.
Backup Long Term Retention is in Public Preview.

SQL Managed Instance:

Support for restoring backups from Amazon S3.
Double max log rate for Business Critical tier.
Always Encrypted has a new software-based solution that complements the hardware-based one available so far.
Improved throughput for transactional logs with a faster storage solution for Business Critical tier.

MySQL:

Flexible maintenance is in Public Preview.
Azure Private Link is now Generally Available.
New read-replica in universal regions is Generally Available.

Misc:

Free tier is in Public Preview.

Storage Account

TLS 1.2 is the new minimum TLS version for the service.
Azure Blob Storage Cold Tier support for Blob Batch operations is Generally Available.

Stream Analytics

No-code Editor changes, including enhanced operator operation experience, instructional bubbles for new jobs, or switch to query editor from the no-code editor.

Synapse

Custom partitioning in Synapse Link for Cosmos DB.
Synapse Link creation for existing MongoDB collections in Cosmos DB.
Synapse Link compatibility with Cosmos DB continuous backup.

GCP

Analytics Hub

Listings include data encrypted with customer-managed encryption keys.

BigLake

Cross-cloud features are in preview. They include materialized views over S3 metadata cache-enabled BigLake tables.

BigQuery

Administration/OPS:

Administrative query inspector is Generally Available to monitor slot utilization.
Resource utilization chart at the project level and filtering resource utilization data on different billing models are two new resource charts available in preview.
Clients of Enterprise or Enterprise Plus edition can use cached results from the same query issued by other users.
Operational Health administrative resource charts are now in preview.

IO:

Apache Hive connector is Generally Available for pipelines migration.
Support for referencing structured data in the materialized views over BigLake metadata cache-enabled tables.
Migration assessment for Apache Hive is in preview.
Migration assessment for Snowflake is available in preview.
Support for copying tables across regions.
Native support for Delta Lake for S3 and Azure tables.
Preview support for Data Manipulation Language (DML) operations to modify recent rows written by the Storage Write API.
Query performance insight about partition skew is in preview.
Slot estimation supports project level cost-optimal commitment and autoscale recommendations for on-demand workloads.

Security:

Authorized stored procedures are Generally Available. They enable sharing stored procedures without giving access to their tables.
IAM conditions support to control access to BigQuery resources.
Support for tagging BigQuery tables to conditionally grant or deny access with IAM policies.
Custom data masking is Generally Available in Enterprise Plus edition.
Custom data masking supports more functions, including SHA hash functions with salt.

SQL:

The ST_LINESUBSTRING and ST_HAUSDORFFDISTANCE geography functions are now Generally Available.
Additional methods to work with grouping sets are in preview: GROUP BY GROUPING SETS clause, GROUP BY CUBE clause, GROUP BY ROLLUP clause, and GROUPING function.
Support for describing columns of a view is Generally Available.
New views in INFORMATION_SCHEMA: TABLE_STORAGE_USAGE_TIMELINE and TABLE_STORAGE_USAGE_TIMELINE_BY_ORGANIZATION.
Support for text analysis configuration options, such as CREATE SEARCH INDEX DDL, LOG_ANALYZER and new PATTERN_ANALYZER analyzers, new TEXT_ANALYZE function.
New advanced text analysis functions are in preview: ML.BAG_OF_WORDS, ML.TF_IDF, BAG_OF_WORDS, TF_IDF, COSINE_DISTANCE, EUCLIDEAN_DISTANCE, EDIT_DISTANCE.
Vector search and vector indexes support in preview. It also brings a new VECTOR_SEARCH function.
ORGANIZATION_OPTIONS_CHANGES and PROJECT_OPTIONS_CHANGES are new views to display the history of configuration changes to the organization and project options.
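
For readers unfamiliar with the distance functions named in the list above, here is a plain-Python sketch of the textbook definitions behind COSINE_DISTANCE, EUCLIDEAN_DISTANCE and EDIT_DISTANCE. It assumes dense, equal-length, non-zero vectors and the classic Levenshtein distance. BigQuery's own argument types and options may differ, and the function names below are only illustrative.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def edit_distance(s, t):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(t) + 1))
    for i, char_s in enumerate(s, start=1):
        current = [i]
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]
```

For example, under these definitions the edit distance between "kitten" and "sitting" is 3, and two orthogonal vectors have a cosine distance of 1.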

Other features:

Entity resolution support that allows matching records across datasets even when a common identifier is missing.
Increased limit of the rows returned in Connected Sheets to 50 000 for pivot tables and for data extracts.
Native integration in Looker Studio is in preview.
Stored procedures for Apache Spark are in preview.

Streaming:

Support for Change Data Capture based on upsert and delete row operations that are streamed in real time by the BigQuery Storage Write API, is Generally Available.

BigQuery Transfer Service

General Availability for transferring data from Azure Blob Storage into BigQuery.
Preview for transferring campaign reporting and configuration from Display & Video 360 to BigQuery.
General Availability support for federated workforce identities when creating a data transfer from most data sources.

Cloud Composer

Bring your own bucket is Generally Available. You can use any GCS bucket for Cloud Composer.
The constraints/gcp.restrictServiceUsage constraint doesn't check the non-blockable services such as Logging and Monitoring anymore.
Configuration for the preferred Cloud SQL zone when creating a standard resilience environment.
Increased quotas for snapshot operations (up to 52 daily for an environment).
Environment creation doesn't start if the zone value is invalid.
Support for the dags list-import-errors command.
Data lineage support from Dataplex is Generally Available. It's enabled by default for newly created environments for Composer 2.1.2 version and above.
Task logs are saved only in Cloud Logging by default for newly created environments.

Cloud Functions

Support for Shared VPC Ingress is Generally Available.
Support for Eventarc events and Firestore for Cloud Functions 2nd gen.

Cloud SQL

SQL Server:

Bulk insert support for importing data for SQL Server 2022.
Support for importing transaction log backup to reduce downtime when migrating to Cloud SQL with backups.

MySQL:

Up to 35 days of retained transaction logs for point-in-time recovery for Enterprise Plus edition instances.
InnoDB page compression support.
New flags are supported: innodb_buffer_pool_dump_now, innodb_buffer_pool_load_abort, innodb_buffer_pool_load_now.
Support for IAM group authentication.

PostgreSQL:

The oracle_fdw extension, version 1.2 is now available to simplify accessing Oracle databases.
Data cache availability for Enterprise Plus edition instances.
New flags are generally available: autovacuum_vacuum_insert_scale_factor, autovacuum_vacuum_insert_threshold, effective_io_concurrency, hash_mem_multiplier, logical_decoding_work_mem, maintenance_io_concurrency, vacuum_failsafe_age, vacuum_multixact_failsafe_age.
The pgvector 0.5.1 extension is Generally Available to enable storing and searching for vector embeddings.

PostgreSQL and MySQL:

Support for configuring SSL mode.
Support for restoring backups across instances of different editions for Enterprise edition and Cloud SQL Enterprise Plus edition.
Simplified upgrade process from the Enterprise to Enterprise Plus edition with minimal disruption.
A new demote API is available to demote an existing standalone instance to be a read replica for an external database server.
Near-zero downtime planned maintenance on High Availability-enabled Cloud SQL Enterprise Plus instances with all combinations of public IP connectivity.
Support for upgrading SQL instances to use new network architecture.

Global:

Automatic update for read replicas when you perform self-service maintenance on the primary instance.
Private Service Connect supports cross-region read.

Cloud Storage

Object Retention Lock feature is available. It enables placing a retention configuration on individual objects, protecting them against removals or overwriting actions.
The uniformBucketLevelAccess constraint is enabled by default for newly created organizations. It opens access to the buckets only with bucket-level Identity and Access Management (IAM) permissions.
Support in the gcloud CLI for a user-defined prefix for naming temporary components in parallel composite uploads.
GCS Fuse is available on ARM64-based machines.
Configuration file can be used to configure mounting behavior of GCS Fuse instead of global options.
Log rotation configuration for GCS Fuse.
Improved parallelized uploads and downloads for Node.js and Python clients.
Managed folders available in Preview. You can use them to group related objects and set IAM policies to control access.
Autoclass feature support for existing buckets.
Regional endpoints are in preview. You can use them to route the request traffic to the region from a given endpoint.
The policy to restrict unencrypted HTTP access to GCS resources is Generally Available.
Changed egress bandwidth quotas. They now depend on a project history, including billing account's standing.
Improved monitoring for turbo replication performance with a new, real-time Maximum delay in turbo replication graph.

Data Fusion

Three patch revisions were released. They address issues like failing pipelines with Dataproc secondary workers, a KubeTwillRunnerService error on shutdown, and slow application deployments.

Data Loss Protection

A new FINANCIAL_ACCOUNT_NUMBER detector.
Changed sensitivity score from HIGH to MODERATE, and type category from PII to DEMOGRAPHIC, for COUNTRY_DEMOGRAPHIC.
The rowsLimitPercent for BigQuery sampling is approximate. For a hard limit you should use rowsLimit property.
Discovery scans can be configured to reprofile the data when the inspection template changes.

Dataflow

Resource-based billing support for Streaming Engine.
Improved scaling to up to 4 000 worker VMs.
Template for Spanner-to-BigQuery is Generally Available.
Template for Spanner to Vertex AI Vector Search is Generally Available.
Data sampling for unhandled exceptions. Thanks to this feature, you can see data processed when an unhandled exception happens.
New GPU types (NVIDIA® L4 and NVIDIA® A100 80 GB) are supported.
Job graph validation check feature that you can run to validate if the replacement job is valid before starting the new one.
Archival for completed Dataflow jobs.
Dashboard monitoring for Dataflow jobs at the project level.

Dataplex

BigLake integration is Generally Available. It enables upgrading a GCS bucket to managed, creating BigLake tables and Object tables instead of external tables.

Dataproc

General Availability of Dataproc Serverless for Spark Interactive sessions.
A new initial executor number for Serverless for Spark is determined as the max of spark.dynamicAllocation.initialExecutors and spark.executor.instances.
Additional encryption capabilities for Serverless Batch with a CMEK key. The encryption now covers job arguments.
The console supports Dataproc Spark Enhancements which are special GCP configuration properties for improved jobs execution.
Serverless GPU accelerators are Generally Available.
The dataproc.googleapis.com/node/yarn/nodemanager/health, dataproc.googleapis.com/job/yarn/vcore_seconds and dataproc.googleapis.com/job/yarn/memory_seconds metrics are now collected during the job execution on YARN.
Dataproc Flexible VMs are in Preview. With the feature you can define a prioritized list of secondary worker VM types that the service can choose if the primary types are not available.
Custom Dataproc images TTL extension from 60 to 365 days.
Autoscaling V2 is available for Dataproc Serverless. Besides, you can configure the autoscaling version with the spark.dataproc.scaling.version property.
Dataproc Jupyter Plugin is Generally Available in Vertex AI Workbench instance notebooks.
Customer Managed Encryption Keys support for Dataproc data, including staging bucket, persistent disk data, queries, or job arguments.
Customer Managed Encryption Keys for workflow template job arguments.
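The new rule for the initial executor number on Serverless for Spark can be sketched in a few lines of Python. This is only an illustration of the "max of both properties" logic described above, not the service's code. The function name is hypothetical, and the behavior when neither property is set is not covered by the release note, so the sketch simply refuses to guess.

```python
def initial_executor_count(spark_conf):
    """Return the larger of the two executor settings present in spark_conf."""
    keys = (
        "spark.dynamicAllocation.initialExecutors",
        "spark.executor.instances",
    )
    # Spark properties are usually passed as strings, hence the int() conversion.
    values = [int(spark_conf[key]) for key in keys if key in spark_conf]
    if not values:
        # Assumption: service defaults are out of scope for this sketch.
        raise ValueError("Neither executor property is set")
    return max(values)


batch_conf = {
    "spark.dynamicAllocation.initialExecutors": "2",
    "spark.executor.instances": "5",
}
print(initial_executor_count(batch_conf))
```

In practice this means that setting a high spark.executor.instances also raises the starting point of the dynamic allocation.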

Datastream

PostgreSQL ARRAY type support.
SSL/TLS encryption for PostgreSQL sources that don't require client certificates.
BLOB, CLOB, and NCLOB support for object data types in Oracle sources.
Permanently failed stream recovery.
Possibility to start a stream from a specific binary log position for MySQL.
Position-based recovery for Oracle sources.
Increased maximum event size for BigQuery to 20MB.
JA16SJIS character encoding support for Oracle sources.

Firestore

Collection-level exemption support for documents with many fields that don't require indexing.
The general availability of the sum() and average() aggregation functions.
Point-in-time recovery support to protect against accidental deletions or writes.
Index scans in Key Visualizer are Generally Available.
Non-default databases can be created and deleted in the Console.
Support for creating multiple databases in each project.

IAM

Identities from workforce and workload identity pools can be used in IAM deny policies.

Pub/Sub

Change Data Capture support for BigQuery subscription.

Spanner

A limit value is available for varchars in the information_schema.columns.spanner_type and information_schema.index_columns.spanner_type columns.
Directed reads feature is in preview; you can use it to route read-only transactions to a specific replica type or region.
INSERT OR IGNORE and INSERT OR UPDATE statements support.
ON CONFLICT DO NOTHING and ON CONFLICT DO UPDATE SET clauses support.
COSINE_DISTANCE() and EUCLIDEAN_DISTANCE() functions support.
Batch write is available in Preview. It provides a way to commit multiple mutations non-atomically in a single request with low latency.
Vertex AI integration supports Vertex AI Generative AI text embeddings.
Sampled query plans are Generally Available. You can use this feature to track and compare query performance over time.
FULL JOIN with USING statement is supported for PostgreSQL-dialect databases.
SELECT DISTINCT statement support for PostgreSQL databases.
Query Optimizer v6 is Generally Available and becomes a new default one.
General Availability of the table and index operations statistics for a better insight and monitoring.
General Availability of the PostgreSQL dialect emulator.
Automatic cleanup of long running transactions is in Preview with Java and Go client libraries.
Integration workflow with Vertex AI Vector Search to enable vector similarity search on Spanner data.
Support for new PostgreSQL functions: unnest, array_length, array(subquery), date_trunc, extract, spanner.date_bin, spanner.timestamptz_add, spanner.timestamptz_subtract.
Support for batch-oriented scans for an optimized throughput and performance.
Increased number of mutations per commit from 40 000 to 80 000.
Support of Spanner tables in Data Catalog is Generally Available.
Partition queries support for an improved and parallel query execution.
Managed autoscaler is in preview. The feature enables automatic adaptation of the compute capacity to the changing workloads.
Faster and more efficient data updates thanks to optimizations applied to the groups of statements in the ExecuteBatchDml API.
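For readers less familiar with the vector distance functions mentioned in the list, the sketch below shows the textbook definitions of cosine and Euclidean distance in plain Python. It illustrates the math only and is not Spanner's implementation. The function names are mine, and the handling of edge cases such as zero vectors is an assumption.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of the same length."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus the cosine similarity of two vectors of the same length."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: the distance is treated as undefined for zero vectors.
        raise ValueError("Cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

Cosine distance ignores the magnitude of the vectors and only compares their direction, which is why it is a common choice for text embeddings, while Euclidean distance is sensitive to both.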

Storage Transfer Service

Support for transferring data from Amazon S3 via a CloudFront domain.
Improved auto-scaling that gradually ramps up the number of requests. This should improve the handling of small files over the duration of the transfer.
Support for transfers from cloud and on-premises HDFS sources.

It's difficult to summarize all these changes, but among the most notable features you'll find S3 Express, Flink on EMR and HDInsight, vector search changes, and the Athena pricing evolution that gives more flexibility. See you in the next edition, probably in three months!

Data Engineering Design Patterns

Looking for a book that defines and solves most common data engineering problems? I wrote one on that topic! You can read it online on the O'Reilly platform, or get a print copy on Amazon.

I also help solve your data engineering problems: contact@waitingforcode.com 📩
