What's new on the cloud for data engineers - part 2 (11.2020-01.2021)

It's time for the second update with the news on cloud data services. This time too, a lot of things happened!

The blog post is organized in 3 sections describing the new announcements of cloud data services between November 2020 and January 2021.

AWS

Amazon Managed Workflows for Apache Airflow

If you missed an orchestrator on AWS, this news should make you happy! Last year AWS announced a managed version of Apache Airflow. Its implementation looks very similar to Cloud Composer on GCP: you specify the DAGs location on S3, check your logs and metrics in CloudWatch, and control the access with IAM. Even though it's not yet available in all regions, the prospect of working with Open Source projects in fully managed environments looks promising!

AWS Athena

Only one piece of news here: the federated queries feature is generally available in the us-east-1, us-west-2, and us-east-2 regions.

AWS Aurora

Good news for SQL Server users. Aurora supports Babelfish, a translation layer that understands the SQL Server wire protocol and seamlessly executes SQL Server-specific queries and applications.

Besides, Aurora Serverless v2 is in preview. It is more scalable and cheaper thanks to fine-grained capacity increments that handle the extra load.

AWS Batch

AWS Batch service can be a good component to work on small data, and starting from December, you can do it in a serverless manner with the help of AWS Fargate! No more need to choose the EC2 type. Instead, you define the desired amount of vCPU and memory and let the service allocate the resources.

AWS Database Migration Service

Some performance and integration changes for DMS. If you use Change Data Capture to write data to Redshift, you can use a new family of ParallelApply* task settings that enables concurrent synchronization.

Regarding the integration changes, you can now use Amazon DocumentDB 4.0 as a new source or target. Also, Aurora PostgreSQL Serverless was added to the list of targets.

AWS DynamoDB

DynamoDB opens up to the data analysts' world with the release of PartiQL support. If you know SQL and do not want to code the DML operations, you can now use PartiQL for them, with the same price and performance guarantees.
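
To give an idea of what a PartiQL DML call could look like, here is a minimal sketch that only builds the statement and its parameters. The table and attribute names (Orders, order_id, delivery_state) are hypothetical. The commented-out call assumes the boto3 DynamoDB client and its execute_statement method, so check the exact request shape in the AWS documentation before relying on it.

```python
from typing import Any, Dict, List, Tuple


def to_attribute_value(value: Any) -> Dict[str, Any]:
    """Wrap a Python value in DynamoDB's typed attribute-value format."""
    if isinstance(value, bool):  # bool first: bool is a subclass of int
        return {"BOOL": value}
    if isinstance(value, (int, float)):
        return {"N": str(value)}
    if isinstance(value, str):
        return {"S": value}
    raise TypeError(f"unsupported type: {type(value).__name__}")


def build_update(table: str, key_name: str, key_value: Any,
                 attribute: str, new_value: Any) -> Tuple[str, List[Dict[str, Any]]]:
    """Build a parameterized PartiQL UPDATE; values travel as '?' parameters."""
    statement = f'UPDATE "{table}" SET {attribute} = ? WHERE {key_name} = ?'
    parameters = [to_attribute_value(new_value), to_attribute_value(key_value)]
    return statement, parameters


# Hypothetical table and attributes, for illustration only.
statement, parameters = build_update(
    "Orders", "order_id", "o-123", "delivery_state", "shipped")
print(statement)
print(parameters)

# Assumed usage with boto3 (not executed here):
# import boto3
# boto3.client("dynamodb").execute_statement(
#     Statement=statement, Parameters=parameters)
```

Passing the values as parameters instead of concatenating them into the statement keeps the quoting rules out of your code.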

Also, starting from November, you can stream all your DynamoDB operations directly to AWS Kinesis Data Streams and retrieve the same entries as you would with DynamoDB Streams. A warning, though: the records can arrive in a different order than the modification order, so it's recommended to use the ApproximateCreationDateTime of each record to identify the correct sequence and the duplicates.
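
As a rough illustration of this ordering advice, the sketch below reorders and deduplicates a batch of change records. The record shape is deliberately simplified: apart from ApproximateCreationDateTime, the field names (key, payload) are hypothetical, and treating "same key, same timestamp" as a duplicate is my assumption, not an official rule.

```python
from typing import Any, Dict, List


def order_and_deduplicate(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Sort change records by creation time and drop repeated ones.

    Assumption: each record is a simplified dict with a hypothetical "key"
    field (the item's primary key) and the ApproximateCreationDateTime
    timestamp. Real records carry more fields.
    """
    seen = set()
    unique = []
    for record in records:
        identity = (record["key"], record["ApproximateCreationDateTime"])
        if identity in seen:
            continue  # same item, same timestamp: treated as a duplicate
        seen.add(identity)
        unique.append(record)
    # sorted() is stable, so records sharing a timestamp keep arrival order
    return sorted(unique, key=lambda r: r["ApproximateCreationDateTime"])


batch = [
    {"key": "user#1", "ApproximateCreationDateTime": 1606780802, "payload": "v2"},
    {"key": "user#1", "ApproximateCreationDateTime": 1606780801, "payload": "v1"},
    {"key": "user#1", "ApproximateCreationDateTime": 1606780802, "payload": "v2"},
]
ordered = order_and_deduplicate(batch)
print([r["payload"] for r in ordered])
```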

There is also some news for disaster recovery scenarios. You can accelerate table restores by disabling secondary index creation at restore time. And thanks to Point-in-time recovery (PITR), you can restore the table to any second within the last 35 days!

To wrap up the DynamoDB updates, you can also specify your own encryption keys in AWS KMS for global tables. Keep in mind that each regional replica should have its own customer-managed key in this scenario.

AWS Elasticsearch Service

For Elasticsearch Service, an interesting feature was released in November. Thanks to the hot reload of dictionary files, you no longer need to reindex the data if you add a synonym or stop word; the service will update the files at search time. In addition to that, ES supports anomaly detection for high cardinality datasets. Thanks to ML models, this feature can identify entities with abnormal patterns. Please note that both features are part of Open Distro for Elasticsearch.

Also, the service was upgraded to Elasticsearch version 7.9.

AWS EMR

THE news for EMR is the ability to run Apache Spark on Elastic Kubernetes Service (EKS) instead of YARN. And it's not only reserved for the batch jobs. EMR Studio, an interactive notebook released in Preview in December, can also be run against an EKS-backed EMR cluster.

From the user experience side, your workloads have a chance to run faster and for a lower price thanks to the new M6g, C6g and R6g EC2 instances. According to AWS's benchmark, using them can improve the execution time by 15% and decrease costs by up to 30%.

In addition to the mentioned evolutions, EMR also gives you the possibility to use Apache Ranger as the authorization mechanism for data access through Hive Metastore and S3 via EMRFS. To use it, create an EMR with Apache Ranger and reference it in the EMR security configurations of the clusters requiring the authorization.

AWS EventBridge

It's another service I discovered while preparing this summary. In a nutshell, EventBridge is an event bus used to connect SaaS applications with AWS cloud services. Thanks to it, you can intercept data from Datadog, Shopify, or any other integrated partner and trigger an AWS Lambda in response to the event.

EventBridge recently introduced an interesting event replay feature. To use it, you configure an archive on the event bus with a retention period, define the reprocessing conditions (filter rules, start and end time), and finally trigger the replay.

AWS Glue

Probably the most "impacted" service was AWS Glue. After the streaming implementation covered in the previous update, it can now control the records schema thanks to the Schema Registry component! It's fully compatible with AWS streaming services but is reserved exclusively for Java applications and records serialized with Apache Avro.

For the batch part of AWS Glue, you can now use the workload partitioning feature to divide one big input into multiple parallel or sequential jobs. It should help you reduce the risk of memory errors on the driver (too many files listed) or executors (data skew), but should not be used for correlated input files. It can be a good way to, for example, horizontally partition the input dataset, even though there will still be the risk of small files because the workload partitioning relies on the file names and not their content.
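
The idea of splitting the input by file names can be sketched in a few lines. This is not Glue's implementation, only a minimal model of the principle under my own assumptions: the function name, the max_files parameter and the bucket path are all hypothetical.

```python
from typing import Iterator, List


def bounded_batches(file_names: List[str], max_files: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most max_files file names."""
    if max_files < 1:
        raise ValueError("max_files must be at least 1")
    ordered = sorted(file_names)  # split by name only, never by content
    for start in range(0, len(ordered), max_files):
        yield ordered[start:start + max_files]


# Hypothetical input location, for illustration only.
files = [f"s3://my-bucket/input/part-{i:04d}.json" for i in range(7)]
batches = list(bounded_batches(files, max_files=3))
print([len(b) for b in batches])
```

Each batch could then be handed to a separate job run. Since the split ignores the content, a batch of 3 tiny files and a batch of 3 huge files count the same, which is where the small files risk comes from.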

Apart from these 2 changes, AWS Glue now also offers a no-code solution for data preparation tasks. In November, the service got a new component called AWS Glue DataBrew that lets you prepare the input dataset by transforming and profiling the data. Once the transformation is defined in the AWS console, you can create a job that will write the transformed data to S3 in the same or a different format: Parquet, Avro, ORC, or more classical ones like JSON or CSV.

Finally, in preview, you can now use AWS Glue Elastic Views. This feature looks like a data virtualization tool that materializes the underlying data sources. It supports various AWS data stores (DynamoDB, RDS, Aurora, S3, Redshift) and lets you expose all of them as a single virtual table data source created with the help of SQL.

AWS IoT

Even though AWS IoT is not a pure data product, it offers some interesting data features like the watermarks added in November to the IoT Analytics service. You specify the allowed lateness, and whenever a record arrives later than that, a notification is sent to the CloudWatch Events service. To process these late records, you can then trigger an AWS Lambda function or send them to Kinesis Data Streams.

AWS Kinesis

For the AWS Kinesis family, 2 big announcements. The first concerns data retention. You can now store your events in Kinesis Data Streams for up to 1 year instead of 7 days previously. Starting from the 8th day, the change will cost you $0.023 per GB-month for storage and $0.0237 per GB for reading (Ireland region example).

The second concerns Kinesis Data Analytics for Apache Flink. AWS now exposes the Flink UI, so you can get a better idea of the details of your streaming job execution, like watermarks, backpressure, or checkpoints. It's enabled by default.
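
To put these prices into perspective, here is a back-of-the-envelope calculation. It only uses the two Ireland rates quoted above; the volumes are made up and the regular stream costs are deliberately left out.

```python
STORAGE_USD_PER_GB_MONTH = 0.023  # long-term storage, from the 8th day
RETRIEVAL_USD_PER_GB = 0.0237     # reading the long-term data


def long_term_retention_cost(stored_gb: float, months: float, read_gb: float) -> float:
    """Estimate the extra cost of keeping and re-reading old events."""
    storage = stored_gb * months * STORAGE_USD_PER_GB_MONTH
    retrieval = read_gb * RETRIEVAL_USD_PER_GB
    return round(storage + retrieval, 4)


# Hypothetical workload: 500 GB kept for 3 months, 100 GB read back once.
cost = long_term_retention_cost(stored_gb=500, months=3, read_gb=100)
print(cost)
```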

AWS Lake Formation

Some data governance improvements in this service. Row-level security and transactions are available in preview thanks to the new API for data lakes and governed tables on S3.

AWS Lambda

AWS Lambda was not in my previous summary, and it was a mistake since it can work pretty well in all serverless and stateless data processing scenarios. Sorry for that, I will do better this time! So what changed in the past 3 months? First, Lambda got a capacity upgrade to 10GB of memory and 6 vCPUs. Also, the billing granularity decreased from 100ms to 1ms. It's especially interesting for short functions whose duration will no longer be rounded up to the nearest 100ms.

At the same time, you can use AWS Lambda in response to the events generated in Aurora PostgreSQL (via stored procedures or UDFs), Amazon MQ and self-managed Apache Kafka.
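
The effect of the new billing granularity is easy to check with a bit of arithmetic. The sketch below assumes the duration is always rounded up to the next billing increment; the 23ms duration is an arbitrary example.

```python
import math


def billed_ms(duration_ms: float, granularity_ms: int) -> int:
    """Round a duration up to the next billing increment."""
    return math.ceil(duration_ms / granularity_ms) * granularity_ms


duration = 23  # hypothetical execution time of a short function, in ms
old_bill = billed_ms(duration, 100)
new_bill = billed_ms(duration, 1)
print(old_bill, new_bill)
```

For such a short function, the billed duration drops from 100ms to 23ms.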

Apart from the new data sources, you can use old ones in new modes. If you listen for SQS messages, you can now configure a MaximumBatchingWindowInSeconds and wait the specified period before invoking the Lambda function. It can be a good way to simulate processing time-based windows, up to 5 minutes.

BTW, regarding windows, they're natively supported for Kinesis Data Streams and DynamoDB Streams event sources. When you create a new Lambda function, you can now specify a tumbling window of up to 15 minutes per shard.
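
A windowed function could look like the sketch below, which sums an amount per window. The event fields (state, isFinalInvokeForWindow, window) and the returned state follow the tumbling window event format as I understand it from the Lambda documentation, so double-check them there. The amount field and the function name are hypothetical.

```python
import base64
import json
from typing import Any, Dict


def window_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Accumulate a running total across the invocations of one window."""
    state = dict(event.get("state") or {})
    total = state.get("total", 0)
    for record in event.get("Records", []):
        # Kinesis payloads arrive base64-encoded
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        total += payload["amount"]  # hypothetical field
    state["total"] = total
    if event.get("isFinalInvokeForWindow"):
        print(f"window {event.get('window')} closed with total={total}")
    return {"state": state}  # handed back on the next invocation


def encode(amount: int) -> Dict[str, Any]:
    """Build a fake Kinesis record for local testing."""
    data = base64.b64encode(json.dumps({"amount": amount}).encode()).decode()
    return {"kinesis": {"data": data}}


first = window_handler({"Records": [encode(5), encode(7)], "state": {}}, None)
second = window_handler(
    {"Records": [encode(3)], "state": first["state"],
     "isFinalInvokeForWindow": True,
     "window": {"start": "2020-12-01T00:00:00Z", "end": "2020-12-01T00:15:00Z"}},
    None,
)
```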

To finish, a fascinating new feature for checkpointing. By default, Lambda retries all records of a failed micro-batch. But if you set a BatchItemFailure in the Lambda's response indicating the first failed record, only the failed or not yet processed records will be retried, which reduces the deduplication effort.
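
Here is a minimal sketch of such a checkpointing handler for a Kinesis source. It assumes the response key is batchItemFailures with an itemIdentifier holding the sequence number of the first failed record, and that the event source mapping was configured to report batch item failures. The process function and its "id" rule are hypothetical business logic.

```python
import base64
import json
from typing import Any, Dict


def process(payload: Dict[str, Any]) -> None:
    """Hypothetical business logic: reject records without an "id"."""
    if "id" not in payload:
        raise ValueError("missing id")


def checkpoint_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Stop at the first failure and report where the retry should start."""
    for record in event["Records"]:
        try:
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            process(payload)
        except Exception:
            sequence_number = record["kinesis"]["sequenceNumber"]
            return {"batchItemFailures": [{"itemIdentifier": sequence_number}]}
    return {"batchItemFailures": []}  # empty list: the whole batch succeeded


def encode(sequence_number: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Build a fake Kinesis record for local testing."""
    data = base64.b64encode(json.dumps(payload).encode()).decode()
    return {"kinesis": {"sequenceNumber": sequence_number, "data": data}}


event = {"Records": [
    encode("1", {"id": "a"}),
    encode("2", {"name": "no id here"}),
    encode("3", {"id": "c"}),
]}
response = checkpoint_handler(event, None)
print(response)
```

In this example the first record is considered done, and the retry restarts from the record with sequence number 2.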

AWS Neptune

AWS shared their graph notebook Python library that you can use in a Jupyter notebook to work on TinkerPop graph data.

AWS Redshift

A lot of things changed in Redshift too. If you worked with GCP and BigQuery before, you certainly noticed that it was possible to do some ML tasks on top of this data warehouse. It's now possible with Redshift! More exactly, Redshift ML is a SageMaker-powered environment to create and train ML models from the data stored in the data warehouse.

Besides the ML aspect, Redshift also supports automatic refresh of materialized views. Whenever the underlying table(s) change and there are enough cluster resources, Redshift will automatically rebuild the materialized view. Another automatic feature released recently is the automatic table optimization for the distribution and sort keys. Redshift will analyze the query patterns and define the distribution and sort keys with the help of ML algorithms. After that, it will automatically apply them to the tables that were created without this information.

From the querying perspective, you can now use Open Source JDBC and Python connectors, introducing some performance improvements and integration with other AWS services like IAM. Besides the connectors, the federated query feature now supports the MySQL from RDS and Aurora services.

For the schemas, you can use new types like TIME and TIMETZ to store, respectively, time without and with timezone information. In addition to that, you can work directly on semi-structured formats with a binary type called SUPER, which provides better ingestion and querying performance than the scalar types.

Regarding the clusters, in December AWS released the Advanced Query Accelerator to accelerate analytical workloads. AQUA is a scalable cache layer built on top of Redshift Managed Storage. It's available in preview. Besides, you can now share your data across clusters with data sharing without involving any data movement. It also relies on Redshift Managed Storage and, exactly like AQUA, is available in preview.

Among other Redshift news, you will also find fine-grained access control for COPY and UNLOAD. This feature lets you authorize only some users or groups to perform these operations. In addition to that, you should be able to seamlessly migrate your cluster between Availability Zones and use the new, more performant RA3.xlplus nodes that enable independent scaling of compute and storage capacity.

AWS S3

Another service impacted by many changes, including ones we could classify as major, is S3. One such major evolution is strong consistency. Starting from December, any write is immediately visible to subsequent read requests, without sacrificing performance or availability.

In addition to this strong consistency, there were also some changes in the replication. The first of them introduced the two-way replication for objects metadata. Starting from November, the delete markers can be replicated between the synchronized buckets. Initially, the marker was put only in the source bucket. Regarding the data replication, you can now configure a multi-destination replication; i.e. replicate data from one to multiple buckets, not necessarily located in the same region. Finally, you can also get a better idea on the replication status with improved metrics and notifications for all replication rules.

What else happened recently for S3? The object ownership feature became generally available. To recall, by default any uploaded object is owned by the user who did the upload. With the feature, the ownership can be assigned to the bucket owner. Also, you should see a performance improvement for server-side encryption with S3 Bucket Keys, which reduce the number of calls to KMS. To finish this "misc" part, your data should be more secure thanks to 3 new threat detections in Amazon GuardDuty. Thanks to them, you can detect access to your data from IP addresses considered as malicious actors on the internet.

Azure

Azure CosmosDB

The biggest announcement for CosmosDB is the serverless support in Preview for all underlying APIs (core, MongoDB, ...). It looks very similar to the on-demand mode of AWS DynamoDB; i.e. you don't need to worry about the available throughput capacity, the service scales it for you. As a result, you don't pay for unused capacity; you pay only for what was really consumed.

From the client perspective, if you use the Java SDK, you can now use transactional batches. All operations defined inside a batch will be executed completely or not at all, exactly like any transaction in ACID-compatible data stores. This feature was previously available for the .NET client and, starting from November, it's also available for Java applications.

Azure Data Factory

The first interesting Data Factory feature is cached lookups, which let you access lookup data from a cache with the help of expressions.

In the category of connectors, you can now use Delta Lake and Common Data Model format as inline datasets (= sinks and sources in mapping data flows that do not require a dataset resource).

Azure Data Lake

Among Azure Data Lake news, there is the General Availability of the Premium Tier. It's an ideal solution for Big Data analytics workloads requiring consistently low latency and having a high number of transactions.

Azure Data Share

Data Share is the service that lets you share datasets with other people. Its December update integrated 2 new data sources: SQL Database (tables, views) and Synapse Analytics (tables).

Azure Event Hub

The Event Hub is finally Generally Available for Azure Stack Hub. It means that you can build a hybrid streaming architecture composed of one part running on-premises through Stack Hub, and another in the cloud, with the managed Event Hub service.

Azure Functions

You can now write Azure Functions even if the language is not natively supported. There is a single requirement, though. Your language must be able to handle HTTP primitives because the custom handlers feature runs lightweight web servers to process incoming events.

Azure HDInsight

Two pieces of news for HDInsight in this summary. The first one is the preview of the Private Link integration that you can use to create a fully isolated environment without public IP addresses.

The second feature accelerates HBase writes by using SSD-managed disks instead of Azure page blobs for Write Ahead Log (WAL).

Azure Purview

Azure Purview is a new service dedicated to data governance. Since it's composed of a catalog, crawler and data lineage management, it looks very similar to AWS Glue: the first 2 components are the same. The addition of data lineage is a differentiating factor since, as far as I know, there is no similar component available at the other cloud providers described here.

Azure Storage Account

Regarding Storage Account, in December, Azure announced a blob inventory feature. The feature is in public preview stage and helps to get a better overview of the underlying blob data by reporting the total data size, age or encryption status.

Another public preview feature is resource logs, which monitor individual requests against the service and send them to one of the log consolidation places like Log Analytics, Storage Account or Event Hub.

And to finish, we can now restore a deleted Storage Account from the portal, as long as it was not recreated and the restore happens within 14 days.

Azure Stream Analytics

For this streaming service, the first important change concerns the reference data max size. Before the December change, the max allowed size for the static dataset used to enrich your streaming processing was 300 MB. Now, the limit is 5GB!

In addition to that, you no longer need to deal with individual service connection strings for Blob Storage, ADLS Gen 2, Event Hub, Azure Synapse Analytics SQL pool and Storage Account. Instead, you can use Managed Identities.

Azure SQL Database

Regarding the managed relational database services, PostgreSQL can now stream decoded data changes for Change Data Capture. In other words, the consumers can get the operations in a readable format like JSON instead of PostgreSQL's binary one. This feature, called logical decoding, was previously in preview and has been Generally Available since December.

For the Hyperscale tier, Azure added support for the Bring Your Own Key (BYOK) feature, meaning that you can use your own encryption keys stored in the Key Vault for encrypting data at rest.

Finally, the MariaDB version got a start/stop feature. Thanks to it, you can stop the database when nobody uses it and start it later. Like BYOK, it's in preview.

GCP

BigQuery

Good news for data governance in BigQuery. The column-level security is generally available and the policy tags can be replicated across locations.

Also, but this time in preview, you can use a new BigNumeric type for high-precision computations. It has twice the precision of the standard Numeric type.

Regarding BigQuery side services, Data Transfer Service can now be integrated within a VPC Service Controls service perimeter.
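
To visualize what the extra precision means, the sketch below emulates both types with Python's decimal module. It assumes a scale of 9 decimal digits for Numeric and 38 for BigNumeric, as I understand them from the BigQuery documentation. The rounding mode is my choice for the illustration, not a statement about BigQuery's behavior.

```python
from decimal import Decimal, ROUND_HALF_UP, localcontext

NUMERIC_SCALE = 9      # assumed digits after the decimal point for Numeric
BIGNUMERIC_SCALE = 38  # assumed digits after the decimal point for BigNumeric


def fit_to_scale(value: str, scale: int) -> Decimal:
    """Round a decimal literal to the given number of fractional digits."""
    with localcontext() as ctx:
        ctx.prec = 80                 # enough working precision for both types
        ctx.rounding = ROUND_HALF_UP  # assumption, for illustration only
        return Decimal(value).quantize(Decimal(f"1e-{scale}"))


raw = "0.12345678901234567890123456789"
as_numeric = fit_to_scale(raw, NUMERIC_SCALE)
as_bignumeric = fit_to_scale(raw, BIGNUMERIC_SCALE)
print(as_numeric)
print(as_bignumeric == Decimal(raw))
```

The Numeric emulation keeps only the first 9 fractional digits, whereas the BigNumeric one stores the 29-digit value without any loss.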

Cloud BigTable

You can now better monitor BigTable health. Previously, the data points in disk load charts displayed the mean for the selected period. They now reflect the maximum, which helps to identify the peaks.
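
A tiny example shows why the maximum is more helpful here. The disk load samples below are made up and stand for the raw measurements aggregated into a single data point of the chart.

```python
from statistics import mean

# Hypothetical disk load samples (in %) falling into one chart data point.
disk_load = [20, 22, 21, 95, 23, 20]

average_point = mean(disk_load)
max_point = max(disk_load)
print(average_point, max_point)
```

The mean hides the short spike at 95% behind a harmless-looking 33.5%, while the maximum exposes it.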

Cloud Composer

Two interesting preview features for Cloud Composer. The first one is the possibility to use Customer-Managed Encryption Keys (CMEK). The second preview feature is the support for the Airflow Role-Based Access Control UI, which provides fine-grained access to the resources within Airflow like DAGs, tasks or connections.

Also, the support for VPC Service Controls is Generally Available. It can help to strengthen the security of the service by preventing data exposure or data copy to unauthorized GCP resources.

The service also announces that the DAG serialization feature will be enabled by default in the next version release for the new Composer environments.

Cloud Dataflow

Two important changes for GCP's data processing layer. The first is the GA of interactive notebooks executed with Apache Beam interactive runner.

The second evolution concerns a preview feature of custom containers used by the workers. It can be useful to preinstall some dependencies to reduce the starting time or integrate 3rd party software in the background processes.

Cloud Dataproc

From my perspective, Dataproc was the most upgraded service in the last 3 months. First, you can now define the max number of failures, per hour or overall, while submitting the job. From the same workflow control category, the workflow timeout feature became Generally Available. To recall, you can set how long the jobs in the cluster can execute. If they are not finished within this period, all running jobs are stopped and the workflow is ended.

For the Beta features, you can stop and restart the cluster when needed and also configure secure multi-tenancy clusters, where the workloads take the identities of the submitting users and map them to predefined service accounts.

Dataproc also got support for new balanced persistent disks, which are backed by SSD and balance performance and costs. Besides the persistent disks, Dataproc also supports shielded VMs that offer a verifiable integrity of the VMs.
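
The two failure limits can be modeled with a small sliding window counter. This is only a toy model to show how an hourly limit and an overall limit interact; it's not how Dataproc counts the failures internally, and the class and parameter names are hypothetical.

```python
from collections import deque
from typing import Deque


class RestartPolicy:
    """Allow a restart while both failure budgets are respected."""

    def __init__(self, max_failures_per_hour: int, max_failures_total: int) -> None:
        self.max_per_hour = max_failures_per_hour
        self.max_total = max_failures_total
        self.total = 0
        self.recent: Deque[float] = deque()  # failure times within the last hour

    def should_restart(self, failure_time_s: float) -> bool:
        self.total += 1
        self.recent.append(failure_time_s)
        # forget the failures older than one hour
        while failure_time_s - self.recent[0] >= 3600:
            self.recent.popleft()
        return len(self.recent) <= self.max_per_hour and self.total <= self.max_total


# Three failures within 20 minutes: the hourly limit stops the third restart.
bursty = RestartPolicy(max_failures_per_hour=2, max_failures_total=4)
bursty_decisions = [bursty.should_restart(t) for t in (0, 600, 1200)]

# Failures spread over several hours: the overall limit stops the fifth one.
spread = RestartPolicy(max_failures_per_hour=2, max_failures_total=4)
spread_decisions = [spread.should_restart(t) for t in (0, 4000, 8000, 12000, 16000)]
print(bursty_decisions, spread_decisions)
```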

Cloud Pub/Sub

Another GA announcement concerns subscriptions with filters. This feature lets you create a subscription and define the filter expression against message attributes. An important thing to keep in mind is that the attributes are part of the metadata and aren't included in the message payload itself.
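
The attributes-only aspect can be emulated locally. The sketch below mimics a simple equality filter, which in Pub/Sub would be written as attributes.region = "eu" if I recall the syntax correctly. The Message class and the matches function are hypothetical helpers, not part of any client library.

```python
from typing import Dict, NamedTuple


class Message(NamedTuple):
    data: bytes                 # the payload, never inspected by the filter
    attributes: Dict[str, str]  # the metadata the filter works on


def matches(message: Message, required: Dict[str, str]) -> bool:
    """Mimic an equality filter evaluated on the attributes only."""
    return all(message.attributes.get(k) == v for k, v in required.items())


messages = [
    Message(b'{"region": "eu"}', {"region": "us"}),
    Message(b'{"region": "us"}', {"region": "eu"}),
]
delivered = [m for m in messages if matches(m, {"region": "eu"})]
print(delivered)
```

Only the second message is delivered, even though its payload says "us". If you want to filter on a value, the producer has to copy it into the attributes.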

Cloud Spanner

Let's move now to Spanner. The first new feature is the support for the LOCK_SCANNED_RANGES hint, used to acquire an exclusive lock on the set of ranges scanned by the transaction, which is useful in scenarios with high write contention risk. Without the hint (and remember, it's only a hint, not an exclusive locking mechanism), multiple transactions could try to upgrade their shared locks to exclusive ones. And it could never happen since each shared lock would prevent the others from upgrading to an exclusive lock.
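
The lock upgrade problem is easier to see on a toy model. The class below is a simplified shared/exclusive lock on a single range, written only for this illustration. It has nothing to do with Spanner's real implementation, and the transaction names are made up.

```python
from typing import Optional, Set


class RangeLock:
    """Toy lock on one key range: many shared holders or one exclusive holder."""

    def __init__(self) -> None:
        self.shared: Set[str] = set()
        self.exclusive: Optional[str] = None

    def acquire_shared(self, txn: str) -> bool:
        if self.exclusive not in (None, txn):
            return False  # someone else holds the exclusive lock
        self.shared.add(txn)
        return True

    def acquire_exclusive(self, txn: str) -> bool:
        blocked_by_readers = bool(self.shared - {txn})
        blocked_by_writer = self.exclusive not in (None, txn)
        if blocked_by_readers or blocked_by_writer:
            return False  # the caller would have to wait
        self.shared.discard(txn)
        self.exclusive = txn
        return True


# Without the hint: both transactions read first, then try to upgrade.
plain = RangeLock()
plain.acquire_shared("t1")
plain.acquire_shared("t2")
upgrades = [plain.acquire_exclusive("t1"), plain.acquire_exclusive("t2")]

# With the hint: the first transaction takes the exclusive lock while reading.
hinted = RangeLock()
first = hinted.acquire_exclusive("t1")
second = hinted.acquire_shared("t2")
print(upgrades, first, second)
```

In the first scenario, neither upgrade can succeed because each transaction waits for the other one's shared lock. In the second one, t1 can proceed and t2 simply waits for it to finish.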

Still regarding the lock mechanism, starting from January you can get the information on the lock statistics from SPANNER_SYS.LOCK_STATS* tables.

Another system table family, SPANNER_SYS.QUERY_STATS*, got some new information about the failed, timed out and explicitly cancelled queries.

And finally, Spanner got 3 new multi-regional configurations: Los Angeles/Oregon/Salt Lake City, Northern Virginia/Oregon/Oklahoma and Netherlands/Frankfurt/Zurich. In each of them, the first 2 regions are read-write (quorum) regions whereas the last one is the witness region. To recall, a multi-regional configuration distributes data across multiple zones in different regions at the cost of a small increase of write latency because the quorum is now distributed across regions.

Cloud SQL

Among the news for the PostgreSQL distribution, the first one is the Generally Available support for IAM authentication. The PostgreSQL version also got support for the effective_cache_size flag (estimated memory available for caching) and the dblink, ip4r and prefix extensions.

Regarding common announcements, PostgreSQL, MySQL and SQL Server support retention settings for automated backups, which can be set between 1 and 365 days. In addition, but only in MySQL and PostgreSQL, you can configure the point-in-time recovery retention for a period shorter than 7 days. The second shared feature is the exposure of the database/memory/total_usage metric representing the total RAM usage in bytes.

In MySQL, you can also use parallel replication to reduce the replication lag; i.e. how far the replica falls behind the primary instance.

If I had to highlight the major announcements for the last 3 months, I would take:

Managed Airflow on AWS
Purview for the data governance on Azure
S3 strong consistency

And you, what's your top 3?
