The volume of data migrating from on-premises to cloud environments is probably smaller than in previous years, since many organizations are already in the cloud. Still, it's interesting to look at the different methods of bringing the data there, and that's what I'll show you in this blog post.
Bad news: there is no single way to transfer data to the cloud. First, there are two types of data ingestion, online and offline. The difference? The offline mode works asynchronously: the users first copy the data from the on-premises environment to a physical device, then send this device to the cloud provider, who takes care of ingesting it into its data center. In the online mode, the process copies data synchronously over the network.
Moreover, you will find several implementations within these two modes. To choose the migration tool, you'll often need to consider these three points:
- Data volume. A small dataset will not have the same network requirements as a big one. It's then the first factor to analyze.
- Network bandwidth. An online copy for a dataset of several PBs over a very slow network will take ages to complete. If you don't have a way to improve the throughput, offline mode will probably be more efficient.
- Transfer frequency. If you must copy the data every hour, the offline mode won't work. But using the online mode will require sufficient bandwidth for the data volume to transfer. As you can see, all three factors are interconnected.
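The interplay of these three factors can be sketched as a rough decision helper. The thresholds and the 7-day shipping turnaround below are illustrative assumptions of mine, not vendor figures:

```python
# Rough decision sketch for the three factors above. The 7-day shipping
# turnaround is an illustrative assumption, not a vendor number.

def online_transfer_days(volume_tb: float, bandwidth_gbps: float) -> float:
    """Ideal time to push `volume_tb` terabytes over `bandwidth_gbps`, in days."""
    bits = volume_tb * 1e12 * 8               # decimal TB -> bits
    seconds = bits / (bandwidth_gbps * 1e9)   # Gbps -> bits per second
    return seconds / 86_400

def suggest_mode(volume_tb: float, bandwidth_gbps: float,
                 transfers_per_day: float,
                 shipping_turnaround_days: float = 7) -> str:
    """Pick 'online' or 'offline' from volume, bandwidth, and frequency."""
    if transfers_per_day >= 1:                # frequent copies rule out shipping disks
        return "online"
    online = online_transfer_days(volume_tb, bandwidth_gbps)
    return "online" if online <= shipping_turnaround_days else "offline"

print(suggest_mode(volume_tb=1, bandwidth_gbps=1, transfers_per_day=24))      # hourly sync -> online
print(suggest_mode(volume_tb=2000, bandwidth_gbps=0.1, transfers_per_day=0.01))  # 2 PB over 100 Mbps -> offline
```

Notice how the frequency requirement short-circuits the decision: no matter how large the dataset, an hourly copy can only happen online.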
Those are the generic points to consider in the data ingestion process. But what are the cloud services you can use?
Let's start with the offline ingestion. The cloud provider sends you a physical device, which can be small or really huge, and you're responsible for loading it with all the data to integrate into the cloud storage. In this category, you'll find the following services:
| Service | Capacity | Description |
|---|---|---|
| AWS Snowcone | 14 TB | Optimized for transferring large files, starting from 5 MB. Supports AWS KMS encryption keys. |
| AWS Snowball | 80 TB | The device can have extra compute capacity to perform local processing with Lambda functions or EC2-compatible instances. Data is encrypted with keys managed in AWS KMS. |
| AWS Snowmobile | 100 PB | AWS sends a 45-foot long shipping container that you can connect to your network with the help of a dedicated AWS engineer. All transferred data is encrypted with keys from AWS Key Management Service (KMS). Additionally, the container has GPS tracking, alarm monitoring, and 24/7 video surveillance. |
| Azure Data Box Disk | 35 TB | Uses a USB 3.0 connection to transfer the data and stores it encrypted. |
| Azure Data Box | 80 TB | Exposes 1-Gbps or 10-Gbps network interfaces and stores the data encrypted. |
| Azure Data Box Heavy | 800 TB | Transfers data with 40-Gbps network interfaces and encrypts it at rest. |
| Azure Import/Export | variable | In this mode, you can use your own SSD/HDD SATA II or SATA III disks. You must install the Azure Import/Export binary, copy the data, and ship the disks to Azure. This mode is supported for Azure Blobs and Azure Files. |
| GCP Transfer Appliance | 300 TB (encrypted) | Recommended for 10 TB or more, when the network transfer would take more than a week. Supports encryption with keys managed in the Cloud Key Management Service (KMS). |
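To illustrate how capacity drives the choice, here is a hypothetical helper that picks the smallest AWS device from the table above that fits a dataset (the capacities mirror the table; the function itself is my own sketch, not an AWS tool):

```python
# Hypothetical device picker; capacities taken from the table above.
AWS_OFFLINE_DEVICES = [
    ("AWS Snowcone", 14),        # usable capacity in TB
    ("AWS Snowball", 80),
    ("AWS Snowmobile", 100_000), # 100 PB
]

def pick_device(volume_tb: float) -> str:
    """Return the smallest device that can hold `volume_tb` terabytes."""
    for name, capacity_tb in AWS_OFFLINE_DEVICES:
        if volume_tb <= capacity_tb:
            return name
    raise ValueError("Dataset exceeds the largest single device")

print(pick_device(10))    # AWS Snowcone
print(pick_device(500))   # AWS Snowmobile
```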
The online ingestion works best for small datasets or high-bandwidth connections. If you need to transfer a big volume of data over a poor connection, you can ask your cloud provider for a dedicated high-throughput network: AWS provides it with Direct Connect, Azure with ExpressRoute, and GCP with Interconnect. The idea is simple: they provide a dedicated connection of up to 100 Gbps between your data center and the cloud provider, which you can leverage for faster data transfers, even of bigger datasets. How fast can it go? According to GCP's calculator, moving 1 TB of data over a 100 Gbps network takes only about 2 minutes!
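You can sanity-check that calculator figure with simple arithmetic. At full line rate the ideal time is about 80 seconds; real transfers add protocol and encryption overhead, which is why a calculator reporting roughly 2 minutes is plausible:

```python
# Ideal (overhead-free) transfer time: volume in decimal TB, bandwidth in Gbps.
def transfer_seconds(volume_tb: float, bandwidth_gbps: float) -> float:
    bits = volume_tb * 1e12 * 8              # decimal TB -> bits
    return bits / (bandwidth_gbps * 1e9)     # Gbps -> bits per second

print(transfer_seconds(1, 100))   # 80.0 seconds at full line rate
print(transfer_seconds(1, 1))     # 8000.0 seconds (~2.2 hours) on a 1 Gbps link
```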
But even if you don't have a dedicated fast connection to the cloud, you can still work in the online mode and use one of the available dedicated cloud data ingestion services:
| Service | Description |
|---|---|
| AWS DataSync | Synchronizes data between on-premises and cloud storage services, including S3, EFS, FSx, NFS, SMB, HDFS, and on-premises object stores. |
| AWS Storage Gateway | Offers 3 gateway types (file-based, volume-based, tape-based). It acts as a kind of proxy between on-premises appliances and the virtually unlimited cloud storage on S3. Applications can use one of the supported protocols (iSCSI, SMB, and NFS) to interact with S3 without needing to rewrite their I/O exchange part. |
| AWS Transfer Family | Supports SFTP, FTPS, and FTP-based data transfers to S3 and EFS. |
| AWS Database Migration Service | You can use it for any database supporting S3 as the target destination. The service performs either a full load or an incremental load from an on-premises database to the cloud. |
| GCP Storage Transfer Service | The service to exchange data between GCS and an on-premises environment. It comes with a native scheduling capability to support incremental copies. |
| Azure Data Factory | Supports moving data between a Storage Account and on-premises storage with a Copy activity. The operation also works with services of other cloud providers, such as BigQuery or RDS. |
| Azure Data Box Gateway | Unlike the other Data Box products, the Gateway is a virtual machine provisioned in your virtualized environment or hypervisor. It's located on your premises and supports the NFS and SMB protocols to transfer data continuously to a Storage Account (block blob, page blob, Files). |
In addition to the dedicated services, AWS, Azure, and GCP have command-line utilities for accelerated data transfer: S3DistCp on AWS, AzCopy on Azure, and gsutil on GCP.
Preparing this blog post almost satisfied my curiosity regarding data ingestion to the cloud. I still have one remaining topic to cover: API-based ingestion.