It's time for another part of "What's new on the cloud for data engineers". Let's see what happened in the last 4 months.
This 11th part covers all that happened between 28.05.2023 and 16.09.2023. As before, I've highlighted the most interesting news.
tl;dr: The blog post includes all major changes to data engineering-related services. If you don't have time to go through all of them, here is a short list of my top picks for the period:
- AWS: Managed Apache Flink in EMR.
- AWS: Timestamp-based starting position for Lambda on Apache Kafka.
- AWS: Support for querying Apache Iceberg tables.
- Azure: Entra ID is the new name for Azure Active Directory.
- Azure: General availability of the auto-scaling for Stream Analytics jobs.
- GCP: Data clean rooms for a simplified data sharing on BigQuery.
- GCP: Preview of BigQuery Studio for enhanced data discovery experience.
- GCP: Pub/Sub direct synchronization with GCS.
- GCP: Event-driven transfers for Storage Transfer Service.
- Athena for Apache Spark
- Custom Java libraries support.
- Support for the following table file formats: Apache Hudi 0.13, Apache Iceberg 1.2.1, and Linux Foundation Delta Lake 2.0.2.
- New ODBC driver with new features such as AWS IAM Identity Center integration for authentication or reading query results from S3.
- Local Write Forwarding is Generally Available for MySQL. The feature lets you send both read and write queries in the same transaction to a read replica. The writes are then automatically forwarded to the writer instance for execution.
- Simplified connectivity with AWS Lambda via an RDS Proxy.
- Min vCPUs support for Multi-Node Parallel Jobs. The feature retains a fixed number of vCPUs on a compute environment even if there are no active workloads.
- A new price-capacity-optimized allocation strategy for Spot Instances is available. You can use it to better balance the price and capacity.
- Support for copying data to and from Azure Blob Storage.
- Support for copying data to and from: DigitalOcean Spaces, Wasabi Cloud Storage, Backblaze B2 Cloud Storage, Cloudflare R2 Storage, and Oracle Cloud Storage.
- Detailed data transfer task reports. They help track and audit data transfers, monitor the chain of custody of the copied files, and troubleshoot transfer errors.
Database Migration Service
- Enhanced homogeneous migration capabilities with built-in native database tooling.
- Database Migration Service Serverless is Generally Available.
- An improved premigration assessment is available.
- Improved index management enabling faster index builds on collections and the ability to view index build statuses.
- Document compression with the LZ4 algorithm. Compressed documents can be up to 7x smaller than uncompressed ones.
- AWS Database Encryption SDK for DynamoDB is Generally Available. The feature enables encrypting particular attributes on a table before writing them physically to the table.
- Failed conditional writes can now return the failed item. You must set the ReturnValuesOnConditionCheckFailure parameter in the write operation.
- DynamoDB local was upgraded to version 2.0.
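The ReturnValuesOnConditionCheckFailure option mentioned above can be sketched with boto3-style parameters; the table and attribute names are hypothetical examples:

```python
import json

# Hypothetical conditional write: only overwrite an order that is still
# pending. On a condition failure, DynamoDB can now return the current
# item inside the ConditionalCheckFailedException, saving a GetItem call.
put_item_kwargs = {
    "TableName": "orders",
    "Item": {"order_id": {"S": "o-123"}, "status": {"S": "SHIPPED"}},
    "ConditionExpression": "#s = :pending",
    "ExpressionAttributeNames": {"#s": "status"},
    "ExpressionAttributeValues": {":pending": {"S": "PENDING"}},
    # The new parameter: return the failed item on condition failure.
    "ReturnValuesOnConditionCheckFailure": "ALL_OLD",
}

print(json.dumps(put_item_kwargs, indent=2))
```

Passing these kwargs to a boto3 DynamoDB client's put_item call would, on a failed check, surface the conflicting item in the exception response instead of requiring a separate read.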
- EMR on EKS:
- Support for managed Apache Flink is in Public Preview.
- Support for Spark Operator and spark-submit as new job submission models. They complement the already existing StartJobRun API.
- Container log rotation support. It's especially important for long-running jobs because it prevents running out of disk space.
- Custom job scheduling with Volcano and Apache Yunikorn schedulers.
- Programmatic execution of Jupyter notebooks for managed endpoints.
- Support for logs storage in CloudWatch.
- Support for secrets retrieval from AWS Secrets Manager.
- API support for retrieving Application UIs. The new method (get-dashboard-for-job-run) is available to retrieve the URL of the running jobs.
- Simplified fine-grained log configuration. It supports specifying custom Log4j2 settings for driver and executor logs for each EMR Serverless job run.
- Open Source connector for Kafka Connect is available.
- EventBridge Scheduler supports schedule deletion after completion. When enabled, the feature automatically removes schedules once they have completed.
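A minimal sketch of such a self-deleting schedule, expressed as boto3-style parameters for the EventBridge Scheduler create_schedule call; the names and ARNs are placeholders:

```python
# Hypothetical one-off schedule that removes itself once it has run.
create_schedule_kwargs = {
    "Name": "one-off-report",
    "ScheduleExpression": "at(2023-09-20T08:00:00)",
    "FlexibleTimeWindow": {"Mode": "OFF"},
    "Target": {
        "Arn": "arn:aws:lambda:eu-west-1:123456789012:function:report",
        "RoleArn": "arn:aws:iam::123456789012:role/scheduler-invoke",
    },
    # The new option: delete the schedule after completion instead of
    # leaving it to count against the account's schedule quota.
    "ActionAfterCompletion": "DELETE",
}

print(create_schedule_kwargs["Name"], create_schedule_kwargs["ActionAfterCompletion"])
```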
- Glue Crawlers support Apache Iceberg Tables and Apache Hudi Tables.
- Glue Streaming supports Kinesis Data Streams enhanced fan-out for event sources.
- Glue jobs support Snowflake connector out-of-the-box.
- Glue jobs support using DataBrew Recipes as steps in the flow of transformations.
- Glue for Ray is Generally Available.
- Glue Data Quality is Generally Available. The feature automatically measures and monitors data lake and data pipeline quality.
- Sensitive data detection can now detect 250 sensitive entity types from over 50 countries.
- 5 new visual transforms are available. The list includes: Record matching, Remove null rows, Extract string fragments from a regular expression, Parse JSON column, and Extract JSON path.
- Amazon Redshift Serverless support as a data source or target out-of-the-box.
- Data preview for streaming jobs. Now you can visualize the data output at each step of the authored pipeline.
- Cross-region table access supported for databases and tables stored in the Glue Data Catalog.
- Delegation of LF-Tag management support to non-Lake Formation administrators.
- Read-Only Administrator role with read-only permissions for Glue Data Catalog metadata and Lake Formation permissions is available.
- Support for time-travel for Kafka event sources. You can now start your Lambda from a specific timestamp for MSK or self-managed Apache Kafka instance.
- Support for enhanced filtering for Amazon EventBridge Pipes. The extra features include the ability to match against characters at the end of a value (suffix filtering), to ignore case (equals-ignore-case), and to have a single rule match if any conditions across multiple separate fields are true (OR matching).
- The service can detect and stop recursive loops in running functions. The feature tracks the number of times a function has been triggered by the same event. When this number reaches 16, Lambda stops the loop and the event is written to the Dead-Letter Queue.
- AWS X-Ray tracing supported for SnapStart-enabled functions.
- The console code editor now includes a read-only file listing all available environment variables.
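The timestamp-based starting position for Kafka event sources mentioned above can be sketched as boto3-style parameters for create_event_source_mapping; the cluster ARN, topic, and function name are placeholders:

```python
from datetime import datetime, timezone

# Replay events from a specific point in time instead of TRIM_HORIZON
# or LATEST (values below are made-up examples).
replay_from = datetime(2023, 9, 1, 12, 0, tzinfo=timezone.utc)

mapping_kwargs = {
    "FunctionName": "orders-consumer",
    "EventSourceArn": "arn:aws:kafka:eu-west-1:123456789012:cluster/demo/abc",
    "Topics": ["orders"],
    "StartingPosition": "AT_TIMESTAMP",
    "StartingPositionTimestamp": replay_from,
}

print(mapping_kwargs["StartingPosition"], replay_from.isoformat())
```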
- In-place version upgrades support for new versions of Apache Airflow.
- OpenSearch Trace Analytics support.
- Support for the customer-managed configuration for RabbitMQ brokers.
- Query scheduling for recurring queries and single sign-on support for Redshift Serverless.
- QUALIFY clause support in the SELECT statement. It allows you to apply filtering conditions to the result of a window function without using a subquery.
- Improved encryption experience to RA3 node types. The feature reduces the overall encryption time and improves the availability of the warehouse during the encryption process. The process is now up to 5x faster on big datasets.
- Native console integration with ThoughtSpot interface.
- Support for querying Apache Iceberg tables.
- Amazon Redshift integration for Apache Spark with AWS Secrets Manager integration and support for Parquet writes.
- Cross-region data sharing support through Lake Formation.
- Automatic mounting of Glue Data Catalog. There is no more need to create an external schema in Amazon Redshift to use the data lake tables cataloged in AWS Glue Data Catalog.
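To illustrate the QUALIFY clause mentioned above, here is the kind of query it enables (table and column names are hypothetical) together with a plain-Python model of what it computes, keeping only the latest row per user:

```python
# SQL sketch (hypothetical schema):
#   SELECT user_id, event_time
#   FROM events
#   QUALIFY ROW_NUMBER() OVER (
#       PARTITION BY user_id ORDER BY event_time DESC) = 1;
#
# The same "latest row per group" logic in plain Python:
rows = [
    {"user_id": "a", "event_time": "2023-09-01"},
    {"user_id": "a", "event_time": "2023-09-03"},
    {"user_id": "b", "event_time": "2023-09-02"},
]

latest = {}
for row in rows:
    best = latest.get(row["user_id"])
    if best is None or row["event_time"] > best["event_time"]:
        latest[row["user_id"]] = row

result = sorted(latest.values(), key=lambda r: r["user_id"])
print(result)
```

Without QUALIFY, the ROW_NUMBER() filter would require wrapping the query in a subquery or CTE, since window functions can't appear in WHERE.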
- Dual-layer server-side encryption for compliance workloads. Each layer of encryption uses a different implementation of the 256-bit Advanced Encryption Standard with Galois Counter Mode (AES-GCM) algorithm. DSSE-KMS uses AWS Key Management Service (KMS) to generate data keys, allowing customers to control their customer managed keys by setting permissions per key and specifying key rotation schedules.
- ACLs can be included in the inventory reports as object metadata.
- S3 Glacier Flexible Retrieval feature improves data restore time by up to 85%.
- Mountpoint for S3 is Generally Available. It's a new open source file client that delivers high-throughput access to S3. It's perfectly adapted to the Big Data workloads with the support of sequential and random read operations on existing files and sequential write operations for creating new files.
- The service is Generally Available.
- A new AWS Snowball Edge Storage Optimized device is available. It has higher capacity (210TB instead of 80TB) and several performance improvements that accelerate the migration.
- Versions and Aliases are available for workflows. The feature impacts the CI/CD process by giving the capability to maintain multiple versions of the workflows, track their usage for each execution, and create aliases that route traffic between them.
- Enhanced error handling. The feature comprises custom error messages in Fail states with runtime details and maximum limit on retry intervals.
- Support for dead-letter queue redrive via SDK or CLI. Three following actions are available for moving the messages from the queue: StartMessageMoveTask, CancelMessageMoveTask, and ListMessageMoveTasks.
- JSON protocol support in Preview. The new protocol reduces the application's client-side CPU and memory usage, and cuts end-to-end message processing latency by up to 23%.
- Increased throughput quota for FIFO High Throughput mode. The mode now supports up to 9,000 transactions per second.
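The DLQ redrive API mentioned above can be sketched as boto3-style parameters for StartMessageMoveTask; the ARN is a placeholder. Omitting DestinationArn sends messages back to their original source queues:

```python
# Hypothetical redrive of a dead-letter queue back to its source queues.
move_task_kwargs = {
    "SourceArn": "arn:aws:sqs:eu-west-1:123456789012:orders-dlq",
    # Optional throttle so the replayed messages don't flood consumers.
    "MaxNumberOfMessagesPerSecond": 50,
}

print(move_task_kwargs)
```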
- Support for ingesting events from Amazon Security Lake.
- Support for converting instances to CDB architecture.
- Support for using AWS CloudFormation Templates.
- Support for replicas in Single-tenant instances.
- Time zone auto-propagate support for Single-tenant instances.
- SQL Server:
- Secondary host metrics support in Enhanced Monitoring.
- Local Time Zones support.
- HypoPG support for creating hypothetical indexes.
- Progress indicator for the storage optimization process.
- PostgreSQL and MySQL:
- Optimized Reads on Multi-AZ deployment option with 2 readable standby database instances.
- Interactive Performance Insights analysis for a time period of your choice.
Managed Workflows for Apache Airflow
- Support for Customer-Defined Partition Keys.
- Geospatial heat map style that improves readability of points on maps visuals.
- API support to automate assets deployment. The support covers dashboards, analyses, datasets (including ingestion schedules), data sources, themes, and VPC configurations.
- Standardized user-level cost and usage data in AWS Billing Cost and Usage Reports.
- Axis customization options for small multiples and radar chart.
- Snapshot Export API support.
- A new Analysis file menu.
- Embedded callback actions for a better integration between dashboards and your applications.
- Hierarchy layout for pivot tables.
- Scheduled and programmatic export to Excel format.
- Microsoft Entra ID is the new name for Azure Active Directory (Azure AD).
- Cross-Region Restore for PostgreSQL backups is supported in Public Preview.
Cache for Redis
- Support for up to 30 shards is in Public Preview.
- JSON support for active geo-replication is Generally Available. The feature enables creating a globally synchronized network of caches using the RedisJSON module to store and search JSON-style data.
- MongoDB: intra-account collection copy is in Public Preview.
- Pgvector extension is Generally Available. It enables advanced vector operations, simplifying building and training machine learning models.
- Azure AD integration is in Public Preview.
- Vercel integration is in Public Preview.
- Input and output bindings for Azure Functions are in Public Preview.
- Kusto Emulator on Linux is Generally Available. It enables local development and automated testing.
- NLog Sink is Generally Available. For those who don't know it, NLog is a flexible and free logging platform for various .NET platforms, including .NET Standard.
- DropMappedField transformation in data mappings is Generally Available.
- Managed ingestion from Azure Cosmos DB is Generally Available.
- PostgreSQL, MySQL and CosmosDB SQL external tables support is Generally Available.
- Azure Portal support for creating and managing migrations is in Public Preview.
- Upgrade enhancements for AKS are Generally Available. They include notifications when an upgrade completes, is canceled, or fails, and notices when a cluster is going out of support or is already out of support.
- Redis extension is in Public Preview. You can use it as a new trigger for the functions which should simplify using the service for write-behind cache architectures.
- Several enhancements for autoscaling. They include an improved feedback loop for scaling decisions, significantly lower scaling latency, and support for recommissioning decommissioned nodes.
- Ingesting events from Azure Event Hubs to Monitor Logs is in Public Preview.
- Improved table-level RBAC is in Public Preview. The new method allows granular RBAC also for custom log tables. It works by assigning permissions to the table sub-resource under the workspace resource.
- Agent Health experience is in Public Preview to monitor the health of on-premises and cloud Azure Monitor Agents.
- Customizable cost optimization settings for Azure Monitor container insights are Generally Available.
- A new 128 vCore option on standard-series hardware.
- Higher-end compute and memory choices with premium-series hardware.
- Feedback persistence and a new feedback algorithm for an improved memory grant feedback experience.
- Integration with PowerBI is in Public Preview.
- 32 TB storage is Generally Available.
- Storage autogrow and online disk scaling are in Public Preview.
- SQL Managed Instance:
- A new Intel SGX-enabled hardware with up to 40 vCores.
- Read replica HA is Generally Available.
- MySQL extension for Azure Data Studio is Generally Available.
- Online migrations feature is Generally Available.
- No-Code Editor available in the Portal. The feature is in Public Preview.
- Autoscaling Stream Analytics jobs is Generally Available.
- SQL and query changes:
- TRUNCATE TABLE support for multi-statement transactions is Generally Available.
- The quantified LIKE operator is in Preview. It can check the search value against at least one pattern (LIKE ANY, LIKE SOME) or against all patterns (LIKE ALL).
- New JSON functions are Generally Available: JSON_ARRAY, JSON_ARRAY_APPEND, JSON_ARRAY_INSERT, JSON_OBJECT, JSON_REMOVE, JSON_SET, JSON_STRIP_NULLS, LAX_BOOL, LAX_FLOAT64, LAX_INT64, LAX_STRING.
- New functions are Generally Available in queries and materialized views: HAVING MAX and HAVING MIN clauses for the ANY_VALUE function, the MAX_BY function, and the MIN_BY function.
- The array subscript operator can return an array element directly by a bare index, without the OFFSET or ORDINAL wrappers.
- The new struct subscript operator is Generally Available. It enables STRUCT field access by index, offset, or ordinal.
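A rough Python model of the quantified LIKE semantics from the list above, translating the SQL wildcards % and _ into fnmatch's * and ?; this is an illustration only, not the BigQuery implementation:

```python
from fnmatch import fnmatchcase

def sql_like(value: str, pattern: str) -> bool:
    # Translate SQL wildcards to fnmatch wildcards (case-sensitive).
    return fnmatchcase(value, pattern.replace("%", "*").replace("_", "?"))

def like_any(value: str, patterns: list[str]) -> bool:
    # LIKE ANY / LIKE SOME: at least one pattern matches.
    return any(sql_like(value, p) for p in patterns)

def like_all(value: str, patterns: list[str]) -> bool:
    # LIKE ALL: every pattern matches.
    return all(sql_like(value, p) for p in patterns)

patterns = ["%Intro%", "%SQL%"]
print(like_any("Intro to Python", patterns))  # one pattern matches
print(like_all("Intro to SQL", patterns))     # both patterns match
```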
- Security changes:
- GRANT and REVOKE access for materialized views with a SQL statement is Generally Available.
- Deny access support for bigquery.googleapis.com/capacityCommitments.*, bigquery.googleapis.com/bireservations.*, bigquery.googleapis.com/reservationAssignments.*, bigquery.googleapis.com/reservations.*, bigquery.googleapis.com/[datasets, tables, models, routines, jobs, connections].delete, bigquery.googleapis.com/datasets.[createTagBinding, listTagBinding], bigquery.rowAccessPolicies.[create, delete, update, setIamPolicy]
- Custom data masking routines feature using the REGEX_REPLACE function is in Preview.
- Data clean rooms are in Preview. They provide a secure environment where different tenants can share, join, and analyze their datasets without physically moving them across accounts.
- IO-related changes:
- Apache Iceberg tables support is Generally Available.
- Metadata caching is Generally Available. The feature avoids listing objects from GCS for BigLake tables and object tables.
- Manifest files for external tables support is Generally Available.
- The LOAD DATA statement for loading Avro, CSV, newline-delimited JSON, ORC, or Parquet files is Generally Available.
- EXPORT DATA statement support for moving BigQuery data to Bigtable is in Preview.
- Primary and foreign key table constraints are Generally Available.
- BigQuery Storage Write API multiplexing is Generally Available. You can use the feature to write multiple destination tables with shared connections.
- User-defined functions support for exporting BigQuery data as Protobuf columns is Generally Available.
- Cross-region dataset replication is in Preview.
- BigQuery DataFrames, a Python API with partial Pandas and scikit-learn compatibilities is in Preview.
- Analytics Hub:
- tracking usage metrics of shared datasets is Generally Available.
- support for routines in linked datasets is in Preview.
- subscriptions management is in Preview.
- Query performance insights about high cardinality joins feature is Generally Available.
- Duet AI in BigQuery is in Preview. The feature helps complete, generate, and explain SQL queries.
- BigQuery Studio is in Preview. It simplifies data discovery by providing Python notebooks and asset management for notebooks and saved queries.
- Federated dataset support for AWS Glue databases.
- Scanning tables to create data profiles and monitor data quality is Generally Available.
- Search indexes support to optimize some queries with the equal operator (=), IN operator, LIKE operator, and STARTS_WITH function is in Preview.
- Administration and ops-related changes:
- Query execution graph is Generally Available.
- A fail-safe period that retains deleted data for an additional 7 days after the time travel window is in Preview.
- Time-travel window configuration is Generally Available. You can now set the time travel between 2 and 7 days.
- The INFORMATION_SCHEMA views that show table storage metadata are Generally Available.
- Cost-optimal commitment and autoscale recommendations based on editions pricing and historical performance metrics are part of the slot estimator. The feature is in Preview.
- Query queues are Generally Available.
- Physical bytes can be used for storage billing. The feature is Generally Available.
- BigLake Metastore is Generally Available. It can be used to access and manage Apache Iceberg tables metadata from multiple sources.
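The configurable time travel window from the list above is set in hours on the dataset; a minimal DDL sketch with a hypothetical dataset name:

```python
# Time travel is configured per dataset via max_time_travel_hours,
# a multiple of 24 between 48 (2 days) and 168 (7 days).
days = 2
ddl = (
    "ALTER SCHEMA mydataset "
    f"SET OPTIONS (max_time_travel_hours = {days * 24});"
)

print(ddl)
```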
- The number of web server workers in Cloud Composer 2 can now be set dynamically based on available web server CPU and memory.
- A new composer.environments.executeAirflowCommand permission is required to run Airflow CLI commands through the gcloud environments run command.
- High Resilience mode can be enabled and disabled for an existing environment.
- Airflow CLI commands can be run through Cloud Composer API.
- Maintenance windows are Generally Available.
- Airflow CLI commands no longer require access to the control plane of your environment's cluster.
- Logs produced in callbacks are visible in Cloud Logging.
- A new metric, composer.googleapis.com/workflow/task_runner/terminated_count, was added to monitor the exit codes of task runners.
- DataprocSubmitJobOperator supports data lineage for Hive, SparkSQL, Presto, and Trino jobs.
- Two Airflow triggerers can be run in any environment. Previously this was only possible in High Resilience configurations.
- Maven wrappers are supported for the Java runtime.
- Cloud Functions 2nd generation:
- Deterministic URLs support is Generally Available.
- Customer-managed encryption keys support is Generally Available.
- Serverless VPC Access connector can be created and configured directly from the Create form in the console. It's in Preview.
- Performance recommendations support is Generally Available. It analyzes cold starts and suggests setting up minimum instances to improve the performance.
- SQL Server:
- Support for importing and exporting differential database backups to reduce migration downtime and increase export frequency.
- A security vulnerability fix for the issue of getting sysadmin privileges through trigger creation in the tempdb database.
- Support for up to 500,000 tables on instances with 32+ cores and 200 GB+ of memory.
- Increased log retention for the Enterprise Plus edition, for up to 35 days.
- Support for migrating large MySQL databases using Database Migration Service.
- Query insights can be enabled for multiple instances at a time.
- The pgvector extension is Generally Available.
- Support for setting password policies for local database users.
- Point-in-time recovery support for recovering an unavailable Cloud SQL instance.
- PostgreSQL and MySQL:
- Cloud SQL System insights dashboard is Generally Available with even more metrics.
- Two editions to support various business and application needs: Cloud SQL Enterprise Plus edition and Cloud SQL Enterprise edition.
- Support for canceling the import and export of data.
- Private Service Connect is Generally Available. It enables connecting to a Cloud SQL instance from multiple VPC networks that belong to different groups, teams, projects, or organizations.
- Cloud SQL Proxy Operator is Generally Available.
- Default maintenance windows support.
- Non-RFC 1918 IP address ranges support.
- Support for re-encrypting existing Cloud SQL CMEK-enabled primary instance or replica with a new primary key version.
- Multiple categories of API rate quotas.
- Support for rewriting and copying objects created with XML API multipart uploads.
- The gcloud storage list and describe commands changed the format of some returned metadata.
- Autoclass feature changes starting October 16, 2023.
- By default, new buckets with that class will transition objects between the Standard and Nearline classes only.
- Incompatibility with the matchesStorageClass condition in Object Lifecycle Management.
- Cost changes for transitioning data from Coldline or Archive to Standard; previously free, this operation will now be charged as a Class A operation at the Standard storage rate.
- Class B operations, such as reading object data, will be charged at the Standard storage rate.
- Autoclass-specific SKUs will be used for billing objects in Autoclass buckets after October 30, 2023.
- Custom audit logging is Generally Available.
- A new limit of 10 HMAC keys per service account.
- Cloud Storage FUSE is Generally Available.
- A new role (roles/storage.objectUser). It allows creating, viewing, listing, updating, and deleting objects and their metadata, without granting access to the object's ACLs.
- Manifest files are available in Storage Insights. A manifest is generated when an inventory report is split into shards.
- Local endpoints can be used to perform operations compliant with International Traffic in Arms Regulations (ITAR).
- Workforce identity federation is Generally Available.
- Support for extracting data through CDS views for the SAP ODP plugin.
- General Availability of the 6.9.2 release.
Data Loss Protection
- New detectors and connections:
- PORTUGAL_SOCIAL_SECURITY_NUMBER infoType detector
- CROATIA_PERSONAL_ID_NUMBER infoType detector
- The subscription pricing mode for the discovery service is Generally Available. It offers predictable and consistent costs, regardless of your data growth, by letting you reserve compute time for profiling.
- Enrichment for manually curated metadata in Dataplex with insights from Sensitive Data Protection data profiles.
- Support for Confidential VMs for the workers. These VMs use AMD Secure Encrypted Virtualization (SEV) for enhanced performance and security for high-memory workloads.
- Dynamic thread scaling is Generally Available as a part of the vertical scaling features. The feature enables running more tasks in parallel on a single worker node.
- Support for NVIDIA Multi-Process Service (MPS) when running multiple SDK processes on a shared Dataflow GPU. The feature is an option of the service.
- General Availability for the: BigQuery to Bigtable, Pub/Sub to Splunk, MySQL to BigQuery, PostgreSQL to BigQuery, and SQL Server to BigQuery templates.
- Capability to update a streaming job in flight, without stopping it.
- Streaming stragglers are now visible in the Google Cloud console.
- Automatic data quality and data profiling are Generally Available.
- General Availability of the Data Lineage for Dataproc. The feature captures data transformations in Spark jobs and publishes them to Dataplex Lineage.
- Serverless improvements:
- Interactive sessions detail and list pages are available in the Google Cloud console.
- Preview release of the Spark Interactive sessions and the Dataproc Jupyter Plugin.
- Improved the reliability of the compute node initialization with a Premium disk tier option.
- Clusters with a driver node group can configure YARN queues with user-limit-factor set to 2. A single user can then burst to 2x utilization of capacity.
- Increased maximum event size to 10 MB for the BigQuery sink and 30 MB for GCS.
- ENUM and CITEXT data types support for PostgreSQL sources.
- BigQuery Migration Toolkit is available. It simplifies migrating from the "Dataflow Datastream to BigQuery template" to the Datastream native BigQuery replication solution.
- Support for OR queries is Generally Available.
- Point-in-time recovery (PITR) support is in Preview. The feature provides extra protection against accidental deletion or writes.
- Multiple databases are now in Preview.
- Scheduled backups are now in Preview.
- View and list multiple databases are available from the Google Cloud console directly.
- Heatmap pattern visualization for index keys from Key Visualizer is in Preview.
- Browser-based sign-in with the Google Cloud CLI support in the Workforce identity federation.
- Service agent creation can now be triggered.
- Uniform bucket-level access is no longer a requirement for Credential Access Boundaries.
- Payload unwrapping for push subscriptions is available. An unwrapped message is delivered as an HTTP body without the metadata. Otherwise, the message is a JSON payload containing the metadata as well.
- GCS subscriptions in Pub/Sub, writing messages directly to an existing bucket, are Generally Available.
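To show what the payload unwrapping mentioned above changes for a push endpoint, here is a sketch of both delivery shapes; the values are made-up examples:

```python
import base64
import json

data = b'{"order_id": 123}'

# Default (wrapped) push delivery: a JSON envelope with base64-encoded
# data plus message metadata.
wrapped_body = json.dumps({
    "message": {
        "data": base64.b64encode(data).decode(),
        "messageId": "1234567890",
        "attributes": {"origin": "checkout"},
    },
    "subscription": "projects/demo/subscriptions/orders-push",
})

# Unwrapped delivery: the raw message bytes become the HTTP body as-is,
# so the endpoint doesn't need any Pub/Sub-specific parsing.
unwrapped_body = data

print(wrapped_body)
print(unwrapped_body)
```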
- Query Optimizer version 6 is Generally Available, but version 5 remains the default optimizer.
- Spanner Vertex AI integration is Generally Available.
- Support for cascading deletes for foreign keys.
- Duet AI integration in Spanner Studio is in Preview.
- Enhanced query editor in Google Cloud console with full support for SQL, DML, and DDL operations.
- Progress of long-running operations, including backups, restores, and schema updates, is Generally Available.
- Deletion protection is Generally Available. It can be enabled to prevent an accidental deletion of databases.
- Definer's rights view support. It adds additional security functionality by providing different privileges on the view and the underlying schema objects.
- Fine-grained access control is now available for PostgreSQL-dialect databases.
- New PostgreSQL functions: ARRAY_UPPER, QUOTE_IDENT, SUBSTRING, REGEXP_MATCH, REGEXP_SPLIT_TO_ARRAY(string, pattern [, flags]), TO_CHAR(timestamptz, format), TO_CHAR(double, format), TO_CHAR(bigint, format), TO_CHAR(numeric, format), TO_NUMBER(string, format), TO_DATE(string, format), TO_TIMESTAMP(string, format).
- New PostgreSQL operators: DATE - DATE, DATE - INTEGER, DATE + INTEGER, STRING !~ PATTERN.
- Integer sequences and bit reversal support.
- It's now possible to generate a UUID v4 as a part of a table's primary key DEFAULT expression with the GENERATE_UUID or generate_uuid() functions.
Storage Transfer Service
- Event-driven transfers for serverless, real-time replication from S3 to GCS or within GCS are Generally Available.
- Support for Secret Manager integration with transfer jobs from S3 or Azure Storage is in Preview.
- Security vulnerability issue detected and fixed for the agent container. It requires an action for the agents created on or before February 17, 2023.
- The s3:GetBucketLocation permission on the source bucket is no longer needed for S3 transfers.
For the last two editions I've been trying to be more concise and make the list more digestible for you. Hopefully this series is going in the right direction!