What's new on the cloud for data engineers - part 1 (08-10.2020)

Cloud computing has been present in my life for 4 years and I have never found a good system to keep myself up to date. It's even more critical at this moment, when I'm trying to follow what happens on the 3 major providers (AWS, Azure, GCP). Since blogging helped me achieve that for Apache Spark, and by the way learn from you, I'm going to try the same solution for the cloud.

The post is organized into 3 sections. Each of them covers the latest data-service changes on AWS, Azure, and GCP, released between August and October 2020.

Data engineering news on AWS

Database Migration Service

To recall, the Database Migration Service (DMS) helps migrate data between databases (of the same or a different type). You can perform one-shot migrations as well as continuous ones with the Change Data Capture (CDC) pattern. One of the available target data stores is S3 and, starting from October, you can store the continuously arriving changes in timestamped "directories" based on the transaction commit dates.

EMR

On EMR's side, good news for FinOps people! You can now optimize your instance fleets and specify up to 15 different instance types. It's 3 times more than in the previous version of the service and a better way to diversify and save some money on Spot Instances.

And to improve the fault tolerance of the master nodes, you can now put them in different subnets. To use it, configure the cluster with a placement group (the --placement-group-configs InstanceRole=MASTER CLI option).
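
If you prefer the SDK over the CLI, here is a minimal boto3 sketch of that configuration. The cluster name, subnet, roles and instance types are placeholders, and I assume here the PlacementGroupConfigs parameter exposed by the RunJobFlow API for this feature:

# Hypothetical sketch: a multi-master EMR cluster with a placement group
# configured for the master instances.
import boto3

emr = boto3.client("emr", region_name="eu-west-1")

response = emr.run_job_flow(
    Name="ha-cluster-with-placement-group",
    ReleaseLabel="emr-5.31.0",
    ServiceRole="EMR_DefaultRole",          # assumed default roles
    JobFlowRole="EMR_EC2_DefaultRole",
    Instances={
        "InstanceGroups": [
            # 3 master nodes = the high-availability setup
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 3},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "Ec2SubnetId": "subnet-01234567",   # placeholder
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    # the equivalent of the --placement-group-configs InstanceRole=MASTER CLI option
    PlacementGroupConfigs=[{"InstanceRole": "MASTER", "PlacementStrategy": "SPREAD"}],
)
print(response["JobFlowId"])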

Glue

If you missed the news, AWS Glue is no longer tied to batch data sources. After implementing support for the Amazon Managed Streaming for Apache Kafka data source, October's release brought the possibility to read data from any Kafka broker, even self-managed ones! Another evolution related to streaming Glue jobs concerns schema detection. After the recent service upgrade, the job can detect schema changes and handle them for every record. In addition to that, the job will also update the schema definition in the Glue Data Catalog.

For batch jobs, September's update brought the possibility to define partition indexes in the Data Catalog. The feature looks very similar to the partition pruning concept in data stores, where the engine can reduce the volume of data to process by eliminating the partitions not matching the predicate. But don't be surprised, it doesn't work for all queries involving the partitions. For example, a query with an OR statement won't use the optimization. The full list of supported operations is available in the documentation.
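
To give you an idea of the setup, here is a hedged boto3 sketch. The database, table and key names are made up, and I assume here the create_partition_index call that manages these indexes:

# Adding a partition index on the year/month partition keys of an existing Glue table.
import boto3

glue = boto3.client("glue")

glue.create_partition_index(
    DatabaseName="sales_db",
    TableName="orders",
    PartitionIndex={
        "IndexName": "orders_by_date",
        # the index keys must be a subset of the table's partition keys
        "Keys": ["year", "month"],
    },
)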

Kendra

A small update on this relatively recent service introduced in December 2019. You can now manage it with CloudFormation. If, like me, you didn't know it before, it's worth checking the documentation to discover this ML-based search engine.

Kinesis Data Analytics

First and foremost, the Java branch of Kinesis Data Analytics was renamed from Amazon Kinesis Data Analytics for Java to Amazon Kinesis Data Analytics for Apache Flink. And regarding Apache Flink, its integration also got some extra improvements like Apache Beam support, bug fixes, Kinesis Data Firehose sink support, and improved monitoring for auto-scaled applications (Autoscaling status).

The recent changes should also improve the performance of the Apache Flink consumer thanks to HTTP/2 support and Enhanced Fan-Out (EFO) data retrieval. In a nutshell, HTTP/2 uses a binary data format that allows more compact and more easily compressible messages. This new version of the HTTP protocol also supports push mode, i.e. the server can push data to the client without necessarily waiting for the client to request it. EFO is an already existing feature of the Kinesis Data Streams service, released in 2018. It guarantees a reserved consuming capacity of 2MB/second/shard for every application consuming data from the Kinesis stream.
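
The Flink connector handles this for you, but to illustrate what EFO means on the Kinesis side, here is a small boto3 sketch (the stream ARN and consumer name are placeholders): the consumer gets registered against the stream and from then on benefits from its own 2MB/second/shard throughput.

# Registering an EFO consumer against an existing Kinesis Data Stream.
import boto3

kinesis = boto3.client("kinesis")

consumer = kinesis.register_stream_consumer(
    StreamARN="arn:aws:kinesis:eu-west-1:123456789012:stream/orders",  # placeholder ARN
    ConsumerName="flink-orders-app",
)
# the ARN is what the enhanced fan-out reader uses later on
print(consumer["Consumer"]["ConsumerARN"])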

Lake Formation

One year after the first tests, the fine-grained access control policies for Apache Spark EMR workloads are finally Generally Available. To recall, this feature lets you limit the access scope of the EMR cluster at the database, table, or column level. It also natively integrates with existing enterprise directories like Microsoft Active Directory to facilitate access management.

The second big feature is Lake Formation sharing. Previously, the data lake items stored in different accounts were isolated: an account couldn't access the data managed by another account. The separation of accounts within an organization has legitimate reasons (costs or security concerns, for example), but it shouldn't prevent data sharing. Fortunately, sharing is now possible thanks to a cross-account database sharing feature that considerably reduces data silos.

Redshift

Good news for approximate algorithm fans. Starting from October, you can use a new type called HLLSKETCH to define a column storing a HyperLogLog sketch, i.e. an approximation of the number of distinct values in the dataset.

Another piece of good news is for those of you who need to store a lot of tables in the cluster. The previous limit of 20K tables was increased to 100K, so there is no longer a need to artificially split the tables across different clusters or storage backends.

September's release notes also announced the availability of the Data API. Thanks to it, you can, for example, execute a SQL query from the AWS CLI or an AWS SDK without managing database connections.
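
Below is a hedged sketch combining both announcements: a boto3 call to the Data API that runs an HLL-based aggregation. The cluster, database, user and table names are placeholders, and I assume here the HLL_CREATE_SKETCH and HLL_CARDINALITY functions shipped with the new type:

# Running an approximate distinct count through the Redshift Data API.
import boto3

redshift_data = boto3.client("redshift-data")

sql = """
    SELECT page_url, HLL_CARDINALITY(HLL_CREATE_SKETCH(visitor_id)) AS approx_visitors
    FROM page_views
    GROUP BY page_url
"""

statement = redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",   # placeholder
    Database="dev",
    DbUser="awsuser",                        # or SecretArn=... for Secrets Manager auth
    Sql=sql,
)
# the Data API is asynchronous: poll describe_statement, then fetch the rows
print(redshift_data.describe_statement(Id=statement["Id"])["Status"])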

Finally, for Redshift, and more exactly for Redshift Spectrum, you can now query ACID-compatible (BTW, I'm still looking for a better name for them! If you have any, feel free to suggest it) file formats like Apache Hudi or Delta Lake.

S3

Two announcements for S3. The first of them is about bucket ownership. Thanks to this feature, an AWS account can assume ownership of all objects uploaded to its buckets, even those uploaded by other AWS accounts. It overrides the default behavior where an object is owned by the account that uploaded it.
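
A minimal boto3 sketch of the new setting (the bucket name is a placeholder):

# Enabling the "bucket owner preferred" object ownership on an existing bucket.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_ownership_controls(
    Bucket="my-shared-data-lake-bucket",
    OwnershipControls={
        # objects uploaded with the bucket-owner-full-control ACL become owned
        # by the bucket owner instead of the uploading account
        "Rules": [{"ObjectOwnership": "BucketOwnerPreferred"}]
    },
)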

In addition to that, the AWS Outposts service is now Generally Available. To recall, AWS Outposts is a managed service providing AWS infrastructure and services in other data centers, even the ones belonging to the customers (aka "run AWS services on-premises"). S3 on Outposts comes with a new storage class called S3 Outposts that provides fault-tolerance guarantees similar to the standard S3 classes, but in the AWS Outposts environment.

Data engineering news on Azure

Azure Blob Storage

If you are familiar with S3, you will notice that the Azure Blob Storage releases propose some similar features. The first of them is versioning. By the end of August, blob versioning was released as a GA feature. It adds an extra data protection layer: you can restore a previous version of a blob when it is modified or deleted.

The second additional data protection feature is soft delete for containers. Since October, any deleted container will be retained for the period you specify - of course, only if soft delete protection is enabled.
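
Here is a small sketch with the azure-storage-blob SDK (the connection string and container name are placeholders) that lists the versions retained for the blobs of a container:

# Listing blob versions kept by the new versioning feature.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection string>")  # placeholder
container = service.get_container_client("invoices")

for blob in container.list_blobs(include=["versions"]):
    # every modification or deletion leaves a previous version you can restore from
    print(blob.name, blob.version_id, blob.is_current_version)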

The next feature added to Blob Storage is an improvement to lifecycle management. You can now track the last access time of every blob and configure the lifecycle policy accordingly, for example by moving rarely accessed items to a cooler storage tier. The second change related to lifecycle management concerns append blobs. An append blob is a blob composed of multiple appendable blocks. Since September, lifecycle management supports append blob expiration.
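
To make the last access time part more concrete, here is what such a lifecycle rule could look like, expressed as a Python dict. I assume here the daysAfterLastAccessTimeGreaterThan condition documented with the feature, and the thresholds are arbitrary:

# A lifecycle rule body relying on the last access time tracking; you would attach
# it to the storage account's lifecycle management policy.
last_access_rule = {
    "enabled": True,
    "name": "cool-down-rarely-read-blobs",
    "type": "Lifecycle",
    "definition": {
        "filters": {"blobTypes": ["blockBlob"]},
        "actions": {
            "baseBlob": {
                # assumed condition name from the last access time tracking feature
                "tierToCool": {"daysAfterLastAccessTimeGreaterThan": 30},
                "delete": {"daysAfterLastAccessTimeGreaterThan": 365},
            }
        },
    },
}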

Finally, you can also replicate your objects to another Azure container, for example across accounts or regions.

Azure Data Explorer

Azure Data Explorer is a data analytics service that you can use to analyze batch (e.g. from Data Factory) and streaming (e.g. Event Hubs) data sources. You can later expose Data Explorer to your data visualization tools (Power BI, Redash, ...).

In the recent releases, the Kafka Connect Data Explorer connector evolved a lot. It now supports the schema registry, has enhanced configuration (retries, behavior on error), and a dead letter queue. The icing on the cake: the connector is now Confluent gold certified.

Almost one month after this release, Data Explorer got another change, this time related to the hardware running the cluster. Isolated compute, i.e. a compute environment reserved for a single customer, is now supported. In simpler terms, this environment ensures that the servers used by one’s cluster don’t run VMs from a different tenant.

Azure Data Lake

Big news for the Azure Data Lake service concerns immutability. With this new preview feature, you can write your data only once (Write Once Read Many, aka WORM). You can also add holds on the data to mark it as non-erasable and non-modifiable until the hold is removed. FYI, the same model exists on S3 (S3 Object Lock) and a quite similar one on GCS (bucket lock).

The second feature in preview is static website hosting. Since the beginning of October, you can use Data Lake to host static content accessible from the browser.

The last feature, also in preview BTW, comes from the August releases and is about ACL inheritance. If you make any permission change on a parent directory, you can now propagate it to all child items. Maybe it doesn't sound revolutionary, but this feature also supports error tracking (or checkpointing, if you prefer). It means that in case of a failure, the process will restart from the failure point and not from the parent directory.
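
A hedged sketch with the azure-storage-file-datalake SDK; the connection string, file system, directory and ACL entry are placeholders, and I assume here the update_access_control_recursive method shipped with this preview:

# Propagating an ACL change to every child file and directory.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient.from_connection_string("<connection string>")  # placeholder
directory = service.get_file_system_client("raw-data").get_directory_client("sales/2020")

# the returned result carries counters and a continuation token, so a failed run
# can be resumed from the failure point instead of the parent directory
result = directory.update_access_control_recursive(acl="user:analyst-object-id:r-x")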

Azure Databricks

If you follow the Apache Spark ecosystem (Open Source and commercial), in September you certainly heard about the Photon-powered Delta Engine. This C++-based engine can speed up Apache Spark and Delta Lake workloads by up to 20x. But what brought this change? The new engine comes to overcome hardware limitations due to unevenly distributed performance improvements between network, storage, and CPU. Whereas the former two improved their throughput by a factor of 10, the latter only changed marginally. The idea of the new Photon engine is to leverage modern hardware and take advantage of data-level and instruction-level parallelism.

The second Databricks feature is related to the Common Data Model, which is a standardized format to represent data on the Azure platform. Since September, you can use an Apache Spark connector to deal with this CDM abstraction natively, from Apache Spark's level. The connector is pre-installed on Azure Synapse Analytics and can also be installed on Databricks.

Azure HDInsight

In September, the auto-scaling feature for Interactive Query clusters became Generally Available. To recall, Interactive Query is an HDInsight cluster type with in-memory caching support to accelerate Hive queries.

Azure SQL

The first big feature in this category is distributed database transactions across different Azure SQL Managed Instances. The feature is still in preview, but the possibility to execute a transaction across different databases sounds promising for horizontally partitioned systems. I don't know about you, but when I discovered this news, I immediately thought about GCP Spanner's distributed transactions.

The second big announcement concerns general performance improvements of the Managed Instances. Business-critical instances benefit from improved data/log IOPS and better transaction log write throughput.

In addition to that, you can also set a smaller retention period for PITR (point-in-time restore) backups, at a minute granularity. Since August, you can set the retention period to 1 day, or even 0 days (no backup persisted) if the database was deleted.

Azure Synapse Analytics

To recall, Azure Synapse Analytics is a data analytics platform combining the best of 2 worlds, SQL and the Apache Spark runtime. The former is supposed to replace SQL Data Warehouse (SQL pool), whereas the latter is a customized Apache Spark version (Spark pool). It got a few updates in the past months.

The first of them is the support - in preview - for the MERGE operation that you certainly already know from SQL. Another feature, also related to the syntax, is the promotion of the COPY command to GA. This command lets you load data into Azure Synapse Analytics from a Storage Account, and the idea is quite similar to the COPY command offered by AWS Redshift and PostgreSQL.
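
To illustrate the COPY command, here is a hedged T-SQL sketch executed through pyodbc; the server, credentials, table and storage path are placeholders:

# Loading CSV files from a Storage Account into a dedicated SQL pool with COPY.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=my-workspace.sql.azuresynapse.net;DATABASE=sqlpool;UID=loader;PWD=..."  # placeholders
)

copy_statement = """
COPY INTO dbo.orders
FROM 'https://mystorageaccount.blob.core.windows.net/landing/orders/*.csv'
WITH (
    FILE_TYPE = 'CSV',
    FIRSTROW = 2
)
"""
conn.cursor().execute(copy_statement)
conn.commit()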

Regarding the Spark pool, it got a brand new implementation of the caching and shuffle components. It uses the best of modern hardware and OS features, but I'm really curious to know a bit more about it. If you have any hints, I will be happy to learn 🙏

Data engineering news on GCP

BigQuery

A lot of GA announcements for BigQuery. The first of them concerns dynamic SQL, i.e. string expressions that you can execute with the EXECUTE IMMEDIATE statement. In the same "new GA features" category, you will also find better SQL support with new functions (ASCII, LEFT, OCTET_LENGTH, ASSERT [!] ...) and statements. Thanks to the latter, you can now EXPORT DATA to GCS, CREATE EXTERNAL TABLE for data stored outside BigQuery (looks like Redshift Spectrum, doesn't it?), or even ALTER TABLE ADD COLUMN for non-required extra columns of a table.
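
A small sketch with the google-cloud-bigquery client to show both the dynamic SQL and the EXPORT DATA statement (the dataset, table and bucket names are made up):

# Dynamic SQL with EXECUTE IMMEDIATE, followed by an export of query results to GCS.
from google.cloud import bigquery

client = bigquery.Client()

dynamic_sql = """
    DECLARE table_name STRING DEFAULT 'orders_2020';
    EXECUTE IMMEDIATE FORMAT('SELECT COUNT(*) AS row_count FROM mydataset.%s', table_name);
"""
client.query(dynamic_sql).result()  # waits for the whole script to complete

export_sql = """
    EXPORT DATA OPTIONS(
      uri='gs://my-export-bucket/orders/*.csv', format='CSV', overwrite=true
    ) AS SELECT order_id, amount FROM mydataset.orders_2020
"""
client.query(export_sql).result()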

The second GA feature is table-level access control. Thanks to it, you can authorize a user to see only a subset of tables or views from a given dataset.

Apart from the access control, you can also use time-unit column-partitioned tables (GA). As you know, and as you saw with the new Glue feature, partitioning in Big Data helps to considerably reduce the amount of processed data. To recall, it's one of the recommended ways in BigQuery to save costs ($ and time). With this new feature, you can use any TIMESTAMP, DATE, or DATETIME column as your partitioning column, with an hourly, daily, monthly, or yearly granularity.
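
For example, an hourly partitioned table can now be created with a simple DDL statement (the dataset and column names are made up):

# Creating an hourly time-unit column-partitioned table.
from google.cloud import bigquery

client = bigquery.Client()

client.query("""
    CREATE TABLE mydataset.page_views (
      event_time TIMESTAMP,
      page_url STRING
    )
    PARTITION BY TIMESTAMP_TRUNC(event_time, HOUR)
""").result()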

Finally, some quotas also changed. The new minimum purchase for flat-rate pricing is 100 slots, and slots can then be purchased in 100-slot increments. In addition to that, the quota for BigQuery exports increased from 10TB to 50TB per day! And this limit is, of course, cumulative across the exports.

Cloud Composer

For the managed Apache Airflow service, there is a big change related to secrets management. You can now use the Secret Manager service to store connections and variables. If a connection or variable is missing from Secret Manager, Airflow will fall back and look for it in the default metadata store.
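
A hedged sketch with the google-cloud-secret-manager client, following the airflow-connections-<connection id> naming convention used by the backend; the project, connection id and URI are placeholders:

# Storing an Airflow connection as a Secret Manager secret.
from google.cloud import secretmanager

client = secretmanager.SecretManagerServiceClient()
parent = "projects/my-gcp-project"

secret = client.create_secret(
    request={
        "parent": parent,
        "secret_id": "airflow-connections-warehouse_postgres",
        "secret": {"replication": {"automatic": {}}},
    }
)
client.add_secret_version(
    request={
        "parent": secret.name,
        # Airflow expects the connection serialized as a URI
        "payload": {"data": b"postgresql://user:password@10.0.0.5:5432/warehouse"},
    }
)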

Cloud Data Fusion

To recall, Data Fusion is a managed data integration service. Sounds mysterious? In simple terms, you can build ETL/ELT pipelines from a graphical interface. At first glance, it looks quite similar to Cloud Composer, except for this UI aspect and a more no-code approach. Among the use cases covered by the service, you will find data wrangling (e.g. filtering a CSV stored on GCS), combining datasets (GCS + BigQuery), and writing processed data (a combined dataset to BigQuery). Everything is visual but, under the hood, it leverages Dataproc clusters to translate the graphical definition and execute it as code. The approach is a bit similar to the one you can find in Azure Data Factory.

Among the new features of this service, you will find auto-scaling support for the underlying Dataproc clusters and views support for BigQuery-based wrangling. And regarding the data wrangling tasks, the new changes improved schema support to more than 5K fields and more than 20 levels of nesting.

Cloud Dataflow

Dataflow brings 2 important changes. The first of them is the General Availability of Flex Templates. Classic templates have some limitations, like the ones related to the parametrization of the pipeline's DAG (e.g. you couldn't change the data source in the template). Flex Templates are an improved version of the templates addressing these points.

The second major evolution is the availability of network tags. You can attach them to the Compute Engine instances running your pipeline and use them for more fine-grained firewall rule definitions, for example, applying some rules only to the tagged instances.

Cloud Dataproc

The recent Dataproc changes are mostly related to the feature called optional components. Optional components are non-default Hadoop ecosystem components that you can very easily add to a Dataproc cluster at creation time. Recently, the Apache Flink and Docker optional components were promoted to General Availability.

Also, Apache Spark, Delta Lake, Apache Iceberg, and a few other data engineering frameworks and libraries were upgraded to newer versions. To use them, you'll have to change the image version of the Dataproc cluster.

If you use Dataproc in "workflow mode" (multiple jobs processed sequentially), you can now set a global timeout for the workflow. If it's exceeded, the whole workflow terminates. The timeout applies only to the "processing" part; any setup involving cluster creation is not included. One thing to notice though: the feature is in beta.

Cloud Firestore

Firestore is an improved version of Cloud Datastore and it's a serverless document-oriented database. Among the most recent changes, you will find support for not-equal operators (!= and not-in) that you can use to find the documents whose fields match such a negative predicate. Documents that don't contain the filtered field won't be returned.
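
A minimal sketch with the google-cloud-firestore client (the collection and field names are made up):

# Querying documents with the new != operator.
from google.cloud import firestore

db = firestore.Client()

# returns documents whose status field exists and is different from 'active';
# documents without the status field are not returned
inactive_users = db.collection("users").where("status", "!=", "active").stream()
for user in inactive_users:
    print(user.id, user.to_dict())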

Cloud Pub/Sub

For the streaming part of GCP, message ordering is one of the most important changes. The feature was launched in beta in August and 2 months later reached GA status. Starting from now, Pub/Sub can deliver the messages sharing the same ordering key in the order they were received by the service.
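
A small sketch with the google-cloud-pubsub publisher (the project and topic are placeholders); note that the subscription must also be created with message ordering enabled:

# Publishing messages that share an ordering key.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient(
    # ordering must be explicitly enabled on the publisher...
    publisher_options=pubsub_v1.types.PublisherOptions(enable_message_ordering=True)
)
topic = publisher.topic_path("my-gcp-project", "orders")

for event in (b"created", b"paid", b"shipped"):
    # ...and every message carries the key used to group the ordered deliveries
    publisher.publish(topic, event, ordering_key="order-1234")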

A similar life cycle applies to the Pub/Sub Lite service. Launched in beta in May 2020, it became GA in October. Consider Pub/Sub Lite as a cheaper, scaled-down version of Pub/Sub. The differences are related to the zones (it's a single-zone service), capacity (you provision it before using it), the pricing model (based on the provisioned capacity), storage, and retention period (unlimited).

Cloud Spanner

Cloud Spanner also got some interesting changes. The support for generated columns is now GA. A generated column is a column computed dynamically, i.e. from already existing columns. It's as if you were using SQL functions at query time, except that the outcome is materialized. A good thing is that if any of the columns composing the generated column changes, the expression used to compute it is reevaluated and the stored value is updated.

Also, CHECK constraints on the rows were promoted to a GA feature. A constraint is an expression using at least one of the table's non-generated columns. It ensures that the value you write is valid. Think about it like the assertions you can put in your Python or Java code to be sure that the arguments passed to a function or the results of intermediate functions are correct (fail-fast approach).
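
Both features are plain DDL; here is a hedged sketch applying them with the google-cloud-spanner client (the instance, database, table and column names are made up):

# Adding a stored generated column and a CHECK constraint to existing tables.
from google.cloud import spanner

database = spanner.Client().instance("my-instance").database("orders-db")

operation = database.update_ddl([
    # recomputed and re-stored whenever FirstName or LastName changes
    """ALTER TABLE Customers ADD COLUMN FullName STRING(MAX)
       AS (CONCAT(FirstName, ' ', LastName)) STORED""",
    # fail-fast validation on every write
    "ALTER TABLE Orders ADD CONSTRAINT amount_is_positive CHECK (Amount > 0)",
])
operation.result()  # wait for the schema change to complete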

To debug queries, you can now use the "oldest active queries" feature to find out which queries have been running the longest. It should help you understand the potential reasons for performance issues.

Cloud SQL

For the different managed RDBMS, you will find some interesting changes too. PostgreSQL now supports IAM authentication in addition to the classical username/password authentication. Still for PostgreSQL, you can now use the pgAudit extension to record and track the operations made on the database.

Another feature added to PostgreSQL, but also to SQL Server and MySQL, is the deny maintenance period. You can use it to define the time range during which planned upgrades of the service shouldn't be applied. This deny period can cover at most 90 consecutive days and can be set only once per year.

GCS

To wrap up, one big change for the GCS service: custom time, i.e. the Custom-Time metadata attribute that you can set on stored objects. You can later use it with the DaysSinceCustomTime condition in the objects' lifecycle management.
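
A small sketch with the google-cloud-storage client (the bucket and object names are made up):

# Setting the Custom-Time attribute that the DaysSinceCustomTime lifecycle condition uses.
from datetime import datetime, timezone
from google.cloud import storage

bucket = storage.Client().bucket("my-archive-bucket")
blob = bucket.blob("exports/2020-10-31/report.csv")

# for example, the business date of the export rather than the upload date
blob.custom_time = datetime(2020, 10, 31, tzinfo=timezone.utc)
blob.patch()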

And here we are. A lot of things have happened in the last 3 months. If it's way too much for you as a reader, feel free to share your thoughts in the comments. I will be happy to hear your feedback and improve the format of this "What's new on the cloud..." series!