Small data processing on the cloud

Believe it or not, data processing is not only about Big Data. Even though data is one of the most important assets of modern data-driven companies, small data still needs to be processed too. And to do that, you won't necessarily use the same tools as for bigger datasets.

This article goes through the different kinds of services you can use to process small volumes of data. Here, small data means a dataset that you can process on a single node. Do you think you can't use a managed Apache Spark service for it? The answer is just below, and it may surprise you!

Batch services

The first type of service for processing small data is a batch service like AWS Batch or Azure Batch. I didn't find any real batch-dedicated equivalent on GCP and didn't want to repeat here the services already covered in other sections.

Anyway, the idea behind a batch service is to create a job that will process the small data. The job definition is a Docker image executed on a preconfigured compute environment. The environment can either have only a vCPU/memory specification or be assigned to a specific machine type. And since we're working with containers here, there are no real limitations on the library or framework we can use. It can be custom data processing code built on a native language library or an Apache Spark job running in local mode.

In addition to these attributes, you can also specify the number of retries for a task and the allowed execution time. You will often be able to extend the execution to multiple nodes and therefore split your single big task into multiple ones assigned to different compute nodes. However, unlike with distributed data processing frameworks like Apache Spark, you will mostly have to divide the job manually, with the help of the cloud provider SDK.
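The manual job division mentioned above can be sketched as follows. This is only an illustration under assumptions: the file names and the `NODE_COUNT` variable are made up, and the round-robin assignment is one possible strategy among others. `AWS_BATCH_JOB_ARRAY_INDEX` is the variable AWS Batch sets for the children of an array job; other providers expose the task index differently.

```python
import os
from typing import List, Sequence


def slice_for_node(items: Sequence[str], node_index: int, node_count: int) -> List[str]:
    """Return the items assigned to one node, using round-robin assignment."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    if not 0 <= node_index < node_count:
        raise ValueError("node_index must be in [0, node_count)")
    return [item for position, item in enumerate(items)
            if position % node_count == node_index]


if __name__ == "__main__":
    # Hypothetical input listing; in practice it would come from object storage.
    input_files = [f"input/part-{i:03d}.csv" for i in range(10)]
    # The index is set by AWS Batch for array jobs; NODE_COUNT is an assumed
    # variable that you would have to pass to the container yourself.
    node_index = int(os.environ.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))
    node_count = int(os.environ.get("NODE_COUNT", "1"))
    for path in slice_for_node(input_files, node_index, node_count):
        print(f"node {node_index} processes {path}")
```

Every node runs the same image and only the index differs, so each input file is processed exactly once across the nodes.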

Function as a Service

The second category of services suited to small data use cases is serverless functions like AWS Lambda, Azure Functions or Google Cloud Functions. Even though they're very often integrated into an event-driven architecture, you can also use them to process data in batch mode.

Functions are pretty similar to batch services. To use them, you have to provide your code, either as a packaged version of your application (JAR for Java, ZIP for Python, ...) or as a custom Docker image. The compute specification is similar too. With functions, you can reserve vCPU and memory, but not a whole machine or a specific machine type. It's also important to note that the execution time of a function is capped by the provider, whereas the one for batch services is configurable.

How do you run the functions? They execute in response to a particular event but also on demand, with a manual trigger. You can then quite easily integrate them with your orchestration tool. The one point to keep in mind is that their execution is either synchronous or asynchronous. In the former mode, the connection between the client and the function stays open until the function terminates. In the asynchronous mode, the triggered function executes in the background and it's your responsibility to check whether it completed.
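To make the difference between the two modes more concrete, here is a minimal sketch for AWS Lambda. It assumes a client exposing the `invoke` call of the boto3 Lambda client (for example `boto3.client("lambda")`); the helper name `invoke_function` and its parameters are made up for the illustration.

```python
import json
from typing import Any, Dict, Optional


def invoke_function(client: Any, function_name: str, payload: Dict[str, Any],
                    wait_for_result: bool) -> Optional[Any]:
    """Trigger a function synchronously or asynchronously.

    `client` is assumed to behave like the boto3 Lambda client.
    """
    # "RequestResponse" keeps the connection open until the function ends,
    # "Event" only queues the invocation and returns immediately.
    invocation_type = "RequestResponse" if wait_for_result else "Event"
    response = client.invoke(
        FunctionName=function_name,
        InvocationType=invocation_type,
        Payload=json.dumps(payload).encode("utf-8"),
    )
    if not wait_for_result:
        # Nothing to read here: checking the completion is up to the caller,
        # e.g. through logs, a status table or an output file.
        return None
    # The payload is a stream containing the value returned by the function.
    return json.loads(response["Payload"].read())
```

An orchestrator would typically use the synchronous call for short functions and the asynchronous one, followed by a completion check, for longer ones.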


Container services

A drawback of the 2 presented approaches is the risk of vendor lock-in. Even if you use a Docker image - which, by the way, often has to extend the image supplied by the cloud provider - you still rely on the provider-specific compute environment definition. If you want to avoid that, you can implement your workload on top of one of the container services like AWS EKS, Azure Kubernetes Service or Google Kubernetes Engine.

The idea is very simple. Use the Docker image you want, write a manifest file of the Job kind, define the restartPolicy for failure management, the backoffLimit for the number of retries, and the requested and maximum allowed memory and CPU, and run it! As with the batch services, you can also transform a single-node job into a multi-node one.
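As a rough illustration of such a manifest, the sketch below builds it as a Python dictionary and prints it as JSON, a format `kubectl apply -f` accepts alongside YAML. The job name, the image and the resource values are placeholders chosen for the example, not recommendations.

```python
import json
from typing import Any, Dict


def build_job_manifest(name: str, image: str, retries: int) -> Dict[str, Any]:
    """Build a Kubernetes Job manifest for a single-container workload."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Number of retries before the Job is marked as failed.
            "backoffLimit": retries,
            "template": {
                "spec": {
                    # A Job accepts "Never" or "OnFailure" here.
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": name,
                        "image": image,
                        "resources": {
                            # Requested and maximum allowed resources
                            # (placeholder values).
                            "requests": {"cpu": "500m", "memory": "512Mi"},
                            "limits": {"cpu": "1", "memory": "1Gi"},
                        },
                    }],
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical job name and image.
    manifest = build_job_manifest(
        "small-data-job", "registry.example.com/small-data:latest", retries=3)
    print(json.dumps(manifest, indent=2))
```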

Degraded distributed mode

If, despite all this, you really want to stay with data processing frameworks, you can also use the cluster-based Big Data processing services! When you know that your processing will fit on a single machine, you can use a single-node cluster on AWS EMR, Azure Databricks or GCP Dataproc.

A single-node cluster is a cluster composed only of the master node, which also acts as the worker. It implies a weaker fault-tolerance guarantee since a node failure is equivalent to an application failure. Additionally, it may not support the same features as a multi-node deployment. For example, Dataproc doesn't support preemptible VMs in the single-node configuration.

Despite these small drawbacks, a single-node cluster can be a good first step into the world of data processing frameworks and a good way to prepare for bigger volumes of data in the future.

Ad-hoc serverless query services

And to finish, a category for SQL users. Ad-hoc serverless query services like AWS Athena or Azure Data Lake Analytics can be used here. Even GCP BigQuery could be a good candidate, but I didn't want to mix data warehouses with pure ad-hoc querying services.

So, the idea of using serverless ad-hoc query services is based on one important fact. These services often work in pay-as-you-go mode, where you pay for the volume of processed data. And if this volume is small - remember, we're talking here about small data - paying only for the data processed by each query can be much cheaper than storing the data in a relational database just to get a SQL querying capability. This is especially true when the query doesn't run very often but the data has to remain accessible at all times.
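The cost reasoning can be illustrated with a bit of arithmetic. All the figures below (price per terabyte, scanned volume, query frequency, database price) are made-up assumptions for the example; check the current pricing of your provider, including any minimum billed volume per query, before drawing conclusions.

```python
def pay_per_scan_monthly_cost(scanned_gb_per_query: float,
                              queries_per_month: int,
                              price_per_tb: float) -> float:
    """Monthly cost of a service billed on the volume of processed data."""
    scanned_tb = scanned_gb_per_query * queries_per_month / 1024
    return scanned_tb * price_per_tb


if __name__ == "__main__":
    # Illustrative assumptions, not real prices.
    serverless = pay_per_scan_monthly_cost(
        scanned_gb_per_query=2, queries_per_month=30, price_per_tb=5.0)
    always_on_database = 50.0  # hypothetical flat monthly price
    print(f"pay-per-scan: {serverless:.2f}, always-on database: {always_on_database:.2f}")
```

The always-on database costs the same whether you query it once a day or once a month, while the pay-per-scan cost shrinks with both the query frequency and the scanned volume.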

Concretely speaking, to use AWS Athena, you can run a CREATE TABLE AS SELECT (CTAS) statement that writes the result of the query to a location of your choice in S3. Azure Data Lake Analytics works in a similar way because it uses a U-SQL script with an OUTPUT instruction.
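For the Athena case, the statement could look like the one generated by the sketch below. The table names, the bucket and the helper `build_ctas_statement` are hypothetical; `external_location` and `format` are CTAS table properties supported by Athena. The helper relies on plain string formatting, so it should only receive trusted values.

```python
def build_ctas_statement(table: str, select_sql: str, s3_location: str,
                         data_format: str = "PARQUET") -> str:
    """Build an Athena CTAS statement writing the SELECT result to S3."""
    return (
        f"CREATE TABLE {table}\n"
        f"WITH (\n"
        f"  external_location = '{s3_location}',\n"
        f"  format = '{data_format}'\n"
        f")\n"
        f"AS {select_sql}"
    )


if __name__ == "__main__":
    # Hypothetical table names and bucket.
    print(build_ctas_statement(
        table="daily_orders_summary",
        select_sql="SELECT order_date, COUNT(*) AS orders FROM raw_orders GROUP BY order_date",
        s3_location="s3://my-example-bucket/summaries/daily_orders/",
    ))
```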

I hope you got the idea. Data engineering is mainly about Big Data since the volume of data increases very quickly. However, you may also be responsible for small data pipelines that don't necessarily require big clusters. And even if you do use big clusters, always check the resources you consume. Remember, you always pay for what you use, and switching to a small data offering for small data processing will often be more cost-effective.
