The volume of data left to migrate from on-premise environments to the cloud is probably smaller than in previous years, since many organizations are already on the cloud. Still, it's interesting to see the different methods for bringing data there, and that's what I'll show you in this blog post.
Types
Bad news: there is no single way to transfer data to the cloud. First, there are 2 types of data ingestion, online and offline. The difference? The offline mode works asynchronously. Users first copy the data from the on-premise environment to a physical device. Later, they ship this device to the cloud provider, who takes care of ingesting it into its data center. In the online mode, the process copies the data synchronously over the network.
Moreover, you will find several implementations of these 2 modes. To choose the migration tool, you'll often need to consider these 3 points:
- Data volume. A small dataset will not have the same network requirements as a big one. It's therefore the first factor to analyze.
- Network bandwidth. An online copy of a dataset of several PBs over a very slow network will take ages to complete. If you don't have a way to improve the throughput, the offline mode will probably be more efficient.
- Transfer frequency. If you must copy the data every hour, the offline mode won't work. But the online mode requires sufficient bandwidth for the data volume to transfer. As you can see, these 3 factors are interconnected.
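To make the interplay between these 3 factors more concrete, here is a minimal sketch of the reasoning in Python. The function names, the `efficiency` factor (the share of nominal bandwidth you really get), and the decision rule are assumptions made for illustration; they don't come from any cloud provider's tooling.

```python
def transfer_hours(volume_tb: float, bandwidth_gbps: float, efficiency: float = 0.8) -> float:
    """Estimate the hours needed to copy volume_tb terabytes over the network."""
    bits = volume_tb * 1e12 * 8                      # decimal terabytes to bits
    usable_bps = bandwidth_gbps * 1e9 * efficiency   # assumed usable share of the link
    return bits / usable_bps / 3600


def suggest_mode(volume_tb: float, bandwidth_gbps: float, max_hours: float,
                 recurring: bool = False, efficiency: float = 0.8) -> str:
    """Suggest a transfer mode given the time budget for one copy."""
    hours = transfer_hours(volume_tb, bandwidth_gbps, efficiency)
    if hours <= max_hours:
        return "online"
    # A recurring transfer cannot wait for a device to be shipped back and forth.
    return "online, with more bandwidth" if recurring else "offline"


print(round(transfer_hours(1, 1), 2))              # 2.78
print(suggest_mode(2000, 1, max_hours=168))        # offline
print(suggest_mode(1, 1, max_hours=1, recurring=True))  # online, with more bandwidth
```

With these assumptions, 1TB over a 1 Gbps link takes a bit less than 3 hours, while 2PB over the same link would take months, which is where shipping a device starts to make sense.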
Those are the generic points to consider in the data ingestion process. But what are the cloud services you can use?
Offline transfer
Let's start with the offline ingestion. The cloud provider sends you a physical device, which can be small or really huge, and you're responsible for loading it with all the data to ingest into the cloud storage. In this category, you'll find the following services:
Service | Storage capacity | Misc |
---|---|---|
AWS Snowcone | 14TB | Optimized for transferring large files, starting from 5MB. Supports AWS KMS encryption keys. |
AWS Snowball | 80TB | The device can have extra compute capacity to perform local processing with Lambda functions or EC2-compatible instances. Data is encrypted with keys managed in AWS KMS. |
AWS Snowmobile | 100PB | AWS sends a 45-foot long shipping container that you can connect to your network with the help of a dedicated AWS engineer. All transferred data is encrypted with keys from AWS Key Management Service (KMS). Additionally, the container has GPS tracking, alarm monitoring, and 24/7 video surveillance. |
Azure Data Box Disk | 35TB | Uses a USB 3.0 connection to transfer the data and stores the data encrypted. |
Azure Data Box | 80TB | Exposes 1-Gbps or 10-Gbps network interfaces and stores the data encrypted. |
Azure Data Box Heavy | 800TB | Transfers data with 40-Gbps network interfaces and encrypts it at rest. |
Azure Import/Export | variable | In this mode, you can use your SSD/HDD SATA II or SATA III disks. You must install Azure Import/Export binary, copy the data, and send the disk to Azure. This mode is supported for Azure Blobs and Azure Files. |
GCP Transfer Appliance | 300TB encrypted | Recommended for 10TB or more, when the network transfer would take more than a week. Supports encryption with the keys managed in the Cloud Key Management Service (KMS). |
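
The capacities above also tell you how many devices a migration would need. The snippet below is only a sketch: the capacities are copied from the table and not re-verified against the current provider documentation, and `devices_needed` is a hypothetical helper.

```python
import math

# Capacities in TB, copied from the table above (assumed still accurate).
CAPACITY_TB = {
    "AWS Snowcone": 14,
    "AWS Snowball": 80,
    "Azure Data Box Disk": 35,
    "Azure Data Box": 80,
    "Azure Data Box Heavy": 800,
    "GCP Transfer Appliance": 300,
}


def devices_needed(volume_tb: float, service: str) -> int:
    """Number of devices required to hold volume_tb terabytes."""
    return math.ceil(volume_tb / CAPACITY_TB[service])


for service in ("AWS Snowball", "GCP Transfer Appliance", "Azure Data Box Heavy"):
    print(service, devices_needed(500, service))
# AWS Snowball 7
# GCP Transfer Appliance 2
# Azure Data Box Heavy 1
```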
Online transfer
The online ingestion works best for small datasets or a high-bandwidth connection. If you need to transfer a big volume of data and have a poor connection, you can ask your cloud provider for a dedicated high-throughput network. AWS provides it with Direct Connect, Azure with ExpressRoute, and GCP with Cloud Interconnect. The idea is simple. They provide a dedicated connection of up to 100 Gbps between your data center and the cloud provider. You can then leverage it for a faster data transfer, even for bigger datasets. How fast can it go? According to GCP's calculator, moving 1TB of data over a 100 Gbps network takes only 2 minutes!
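As a rough sanity check on that figure, here is the back-of-the-envelope arithmetic, assuming decimal units (1TB = 10^12 bytes) and a fully utilized link, which a real transfer will not reach:

```latex
t = \frac{1\,\text{TB} \times 8\,\text{bit/byte}}{100\,\text{Gbps}}
  = \frac{8 \times 10^{12}\,\text{bit}}{10^{11}\,\text{bit/s}}
  = 80\,\text{s}
```

The theoretical floor is therefore about 80 seconds. The 2 minutes reported by the calculator presumably leave room for protocol overhead, although that interpretation is an assumption.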
But even if you don't have a dedicated fast connection to the cloud, you can still work in the online mode and use one of the available dedicated cloud data ingestion services:
Service | Description |
---|---|
AWS DataSync | The service synchronizes data between on-premise and cloud storage services, including S3, EFS, FSx, NFS, SMB, HDFS, and on-premise object stores. |
AWS Storage Gateway | The service has 3 gateway types (file-based, volume-based, tape-based). It's a kind of proxy between on-premise applications and the virtually unlimited cloud storage on S3. The applications can use one of the supported protocols (iSCSI, SMB, and NFS) to interact with S3 without having to rewrite their I/O layer. |
AWS Transfer Family | Supports SFTP, FTPS and FTP-based data transfers to S3 and EFS. |
AWS Database Migration Service (DMS) | You can use it for any supported source database, with S3 as the target destination. The service performs either a full load or an incremental load from an on-premise database to the cloud. |
GCP Storage Transfer Service | The service transfers data between GCS and an on-premise environment. It comes with a native scheduling capability to support incremental copies. |
Azure Data Factory | The service supports moving data between a Storage Account and on-premise storage with a COPY activity. The operation also works with the services of other cloud providers, such as BigQuery or RDS. |
Azure Data Box Gateway | Unlike other Data Box products, Gateway is a virtual machine provisioned in your virtualized environment or hypervisor. It's located on your premises and supports the NFS and SMB protocols to transfer the data continuously to a Storage Account (block blob, page blob, Files). |
In addition to the dedicated services, AWS, Azure, and GCP have command line utilities for accelerated data transfer. On AWS you can use S3DistCp (the S3-optimized variant of Hadoop's DistCp, available on EMR), on Azure AzCopy, and on GCP gsutil.
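To give an idea of what these utilities look like in practice, the sketch below only builds the command lines for a recursive upload and executes nothing. The `build_copy_command` helper, the paths, and the bucket name are hypothetical. For AWS, the sketch swaps in `aws s3 sync` from the AWS CLI, because DistCp runs as a Hadoop job and is not a standalone command.

```python
import shlex


def build_copy_command(provider: str, source_dir: str, destination: str) -> list[str]:
    """Return the argument list for a recursive upload; nothing is executed here."""
    if provider == "aws":
        # AWS CLI; destination looks like s3://bucket/prefix
        return ["aws", "s3", "sync", source_dir, destination]
    if provider == "azure":
        # AzCopy; destination is a Storage Account URL, typically with a SAS token
        return ["azcopy", "copy", source_dir, destination, "--recursive"]
    if provider == "gcp":
        # gsutil; -m runs operations in parallel, rsync -r recurses into directories
        return ["gsutil", "-m", "rsync", "-r", source_dir, destination]
    raise ValueError(f"Unknown provider: {provider}")


print(shlex.join(build_copy_command("gcp", "/data/export", "gs://example-bucket/raw")))
# gsutil -m rsync -r /data/export gs://example-bucket/raw
```

You could pass the returned list to `subprocess.run` once the corresponding CLI is installed and authenticated.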
Preparing this blog post almost satisfied my curiosity about data ingestion to the cloud. I still have one remaining topic to cover, which is API-based ingestion.