Data wrangling on the cloud

Data is not perfect, and in each project, you'll probably need to do some cleaning to prepare it for business use cases. To make this task easier, cloud providers have dedicated data wrangling services, and they'll be the topic of this blog post.

New ebook 🔥

Learn 84 ways to solve common data engineering problems with cloud services.

👉 I want my Early Access edition

Data wrangling 101

But before going into the main topic, let's stop for a while and recall what the data wrangling is. You'll see the definition will help understand the mechanisms implemented in the cloud services.

In a nutshell, data wrangling is a data transformation process aiming to make the data more valuable for the downstream consumers which are often business-related. However, reducing data wrangling to an outcome is not correct because it's a more complex process often composed of the following steps:

  1. Discovery. It's the step that will impact the number of iterations of the process. The goal is to familiarize yourself with the dataset. It's the moment to identify the data issues (nulls, inconsistent formats, ...), as long as the patterns (skews). From the technical standpoint, you'll use here data profiling tools (e.g. Great Expectations) or an ad-hoc querying service with an extra component to visualize the data (e.g. a notebook).
  2. Mapping. Once you know the dataset, you can start working on it and define the outcome. In the mapping step you'll determine the shape of the output. For example, you might have a column with the delivery addresses that must be splitted into 2 or more columns/rows. In this step you can use a data processing framework or SQL query.
  3. Cleaning. After defining the shape, it's time to increase dataset value by removing inconsistencies, such as dates defined in different formats, or countries expressed with ISO codes and full names. Depending on the data wrangling target, which can be a particular data analysis or data promotion from bronze to silver era (Lakehouse pattern), you can also remove the outliers or empty cells/rows. You can use the same tools as for mapping here.
  4. Enriching. The dataset gets better and better. It's consistent but it probably lacks some business-meaningful information. Very often data producers don't include all the static data to the record. For example, they won't put all the customer details but rather only the customer id with eventually a few extra values. The goal of the enriching step is to augment the manipulated dataset by combining it with other data sources. Here too, you can use a data processing framework or SQL queries.
  5. Validating. You're going close to the end. After doing all the preparation work, it's time to validate it against the predefined business rules, such as maximal length of a column (e.g. country column using an ISO code must always be of 2 characters) or a pattern (e.g. zip code). If the data wrangling is a part of a regularly executed batch pipeline, the validating step can be a separate job included in the data pipeline. You'll probably use the same tools as previously (data processing framework or SQL query). It may discover new data issues and have to start the process again.
  6. Publishing. In this final step, the prepared data is made available for the others.

Cloud offerings

To compare data wrangling on the cloud, I took Glue Data Brew for AWS, Data Factory for Azure and Dataprep for GCP. Although Data Factory has also a Power Query data wrangling, for the sake of simplicity I decided to focus here only on the Mapping data flow transformation. And no text this time. You'll see the comparison in a table, hoping it'll provide both a quick summary and clear view of each offering:

AWS Glue Data Brew Azure Data Factory GCP Dataprep by Trifacta
Discovery Visual, has a Data profile job type that will display dataset profile including per-column view. Visual, Mapping data flow task shows data profile at the column level after clicking on Statistics. Visual, runs a job after defining the data source that will provide a visual representation of the data at the column level.
Mapping Supports creation of new columns, for example by applying a function to an existing column. It's also possible to create multiple columns from a single one with the SPLIT_ column structure recipes. Derived column, Flatten and Select are the main transformations that can be used to manipulate the dataset structure in the mapping data flows. Has a Structuring Task that will perform any necessary operations on the dataset format, such as unnesting a column to multiple rows, moving, or splitting a column.
Cleaning Has more than 250 native transformations to clean up the dataset. The mapping data flow has a quite flexible support for data cleaning. Although it has less built-in transformations, it provides a flexible way to define them by supporting custom expressions in the Parse transformation.. Implemented as a Cleanse Task with various native transformations, such as deduplicating rows, changing column type, or replacing values using patterns.
Enrichment Can join or union 2 datasets. The join supports various types: inner join, left join, right join, outer join, left excluding join (returns all rows from the left table without matches in the right), right excluding join, and outer excluding join (all records in the left and right tables that don't match). Mapping data flow supports joins, unions and lookup datasets. Among the supported join types, you'll find left join, right join, outer join, inner join, and cross join Supported via Enrichment Task. Although it contains a pretty standard set of features (joins, unions), some of them have advanced options. For example joins support fuzzy matching joins meaning that the operation can match columns having slightly different values. The task also supports lookup dataset.
Validation DataBrew validates the dataset with data quality rules executed from a Profile job. Mapping data flow can run in a debug mode. You can use this interactive mode to validate the transformations on the dataset. However, it doesn't provide a global validation because the debugging applies the transformation and displays only the outcome for a configurable subset of rows. To validate the data you can use data quality rules and for example check if the column contains only an allowed set of values, or whether it doesn't contain duplicates or nulls.
Publishing The data transformation job can be downloaded as JSON or YAML and later published to other environments with dataset names overriding etc. The deployment of Mapping data flow fits into Azure Data Factory mode with publishing a template and substituting the variables to match the resources on the deployed environment. Also integrates with a CI/CD pipeline. The transformation flow can be exported to GCS after validation, and imported by the CI/CD job for the deployment on other environments. The deployment involves calling the importPackage API method but also updateInputDataset to adapt the data sources to the deployed environment.

Data wrangling on the cloud is mostly visual but as you saw in the last steps of the table, this no-code mode integrates with the code-based deployment tools. Of course, those are only some examples of how to do the data wrangling. After all, it's a data processing job that could be built on a service such as Databricks that combines an ad-hoc querying service (notebook, data visualization libraries) with a data processing service (jobs). The job package could be deployed from a CI/CD pipeline and executed on different environments from an orchestrator (Cloud Composer, Data Factory,...).

If you liked it, you should read:

The comments are moderated. I publish them when I answer, so don't worry if you don't see yours immediately :)

📚 Newsletter Get new posts, recommended reading and other exclusive information every week. SPAM free - no 3rd party ads, only the information about waitingforcode!