Migrate big data to the cloud using Tsunami UDP

Source: Internet
Author: User
Tags: file transfer protocol, secure copy, CloudFormation template, Signiant

When your data reaches the petabyte (PB) scale, moving such large volumes becomes time-consuming and labor-intensive, and this is one of the biggest challenges enterprises face when trying to take advantage of AWS's scale and elasticity for analysis workloads. This article introduces accelerated file transfer protocols and describes how to use Tsunami UDP to migrate large-scale data to the cloud; in this scheme, UDP handles the data transfer while TCP is responsible for connection control.

It is worth noting that, unlike purely TCP-based protocols such as SCP, FTP, or HTTP, these hybrid UDP/TCP protocols deliver better throughput: they make full use of the currently available bandwidth and are far less sensitive to network latency. These characteristics make them a good choice for long-distance data transfers, for example moving large files from a local site to the cloud or between AWS regional infrastructures. Under ideal conditions, a file transfer protocol accelerated with the hybrid UDP/TCP approach can be dozens or even hundreds of times faster than a traditional TCP protocol such as FTP.

This article describes how to use Tsunami UDP, a hybrid UDP/TCP accelerated file transfer scheme, to migrate large-scale data from Amazon EC2 to Amazon S3. (Other capable file transfer and workflow acceleration solutions include Aspera, ExpeDat, File Catalyst, Signiant, and Attunity; most of these products are available from the AWS Marketplace.)

AWS Public Datasets

AWS hosts a large number of public datasets that the community can use for free. In this article we will work with the Wikipedia Traffic Statistics V2 dataset, 650 GB in total, which contains hourly page-view statistics for every Wikipedia article over a 16-month period. This public dataset is stored as an EBS snapshot and can be attached to an Amazon EC2 instance; click here to view the EC2 documentation for details. We will move this 650 GB (already compressed) dataset from an Amazon EC2 instance running in the AWS Virginia region to an Amazon S3 bucket in the AWS Tokyo region. Once the data has been migrated to Amazon S3, you can use it for big data analysis in your own projects: Amazon Elastic MapReduce (EMR) and Amazon Redshift can both quickly import data from Amazon S3 for large-scale analysis. In this example our main task is to migrate a large dataset from an Amazon EC2 instance in one region to another region, but the same principle applies to moving data from an on-premises data center to AWS, or in the opposite direction.
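For reference, making a public dataset snapshot usable on an instance generally means creating a volume from the snapshot, attaching it, and mounting it. The following is a minimal sketch: the snapshot ID, volume ID, instance ID, availability zone, and device names are placeholders, and the real snapshot ID is listed on the dataset's page.

# Create a volume from the public dataset snapshot (IDs and zone are placeholders)
aws ec2 create-volume --snapshot-id snap-xxxxxxxx --availability-zone us-east-1a --volume-type gp2
# Attach it to your instance
aws ec2 attach-volume --volume-id vol-xxxxxxxx --instance-id i-xxxxxxxx --device /dev/sdf
# On the instance, mount the attached volume
sudo mkdir -p /mnt/wikistats
sudo mount /dev/xvdf /mnt/wikistats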

Tsunami UDP

Tsunami UDP is a free, open-source file transfer protocol with a command-line interface; it performs session control over TCP and data transfer over UDP. It is designed to use UDP for the bulk transfer instead of TCP's acknowledgement-based congestion control, which greatly improves network transfer efficiency and behaves better on links with packet loss or unstable latency. This approach significantly reduces the impact of network latency on data throughput; by contrast, a traditional pure TCP protocol comes nowhere near the same transfer rate under those conditions.

Tsunami UDP was originally created in 2002 by Mark Meiss and his colleagues at a lab at Indiana University. The version widely used today is a fork of the original code, released in 2009 and currently hosted on SourceForge. Tsunami UDP is popular with many AWS users mainly for the following reasons:

(1) fast speed;

(2) completely free;

(3) easy to use.

Set up the AWS public dataset on an Amazon EC2 instance

Before testing Tsunami UDP, we need a copy of the dataset for our own test. We have staged a copy of the data in advance in an Amazon S3 bucket in the ap-northeast-1 region.

1. Set up the Tsunami UDP server

(1) Launch an Amazon Linux instance in the ap-northeast-1 region (Tokyo). We want 10 Gbit networking and ample ephemeral (instance store) capacity to hold the dataset; an i2.8xlarge Amazon EC2 instance fits, and c3.4xlarge is a more cost-effective alternative. For more details about available Amazon EC2 instance types, click here to view the Amazon EC2 Instance Types page. For convenience, we have created a CloudFormation template that launches the Amazon EC2 instance, opens TCP port 22 and TCP/UDP port 46224 for SSH and Tsunami UDP access, sets up a local ephemeral volume stripe on the EC2 instance, and installs the Tsunami UDP application; you can launch it from https://console.aws.amazon.com/cloudformation/home?region=ap-northeast-1#cstack=sn%7Etsunamiudp|turl%7Ehttps://s3.amazonaws.com/aws-blog-examples/tsunami.template. If you run into problems, click here to view the AWS documentation on how to launch a CloudFormation stack.
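The CloudFormation template handles the ephemeral volume setup for you. If you launch an instance manually instead, striping the instance-store volumes into the /mnt/bigephemeral array used in the later steps might look roughly like the sketch below; the device names are assumptions and should be verified with lsblk on your instance.

# Stripe two instance-store volumes into a single RAID0 array (device names assumed)
sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/xvdb /dev/xvdc
sudo mkfs.ext4 /dev/md0
sudo mkdir -p /mnt/bigephemeral
sudo mount /dev/md0 /mnt/bigephemeral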

(2) Log in to the instance we just created via SSH:

ssh -i mykey.pem ec2-12-234-567-890.compute-1.amazonaws.com

(3) Configure the AWS CLI with your IAM credentials:

aws configure

(4) Copy the Wikipedia statistics to the ephemeral device:

aws s3 cp --region ap-northeast-1 --recursive s3://accel-file-tx-demo-tokyo/ /mnt/bigephemeral

Downloading these files takes a long time. If you do not plan to run Hadoop against the dataset afterwards and only want to test data throughput or other Tsunami UDP functionality, you can quickly create a temporary file on your Amazon Linux instance with the following command and use it in place of the actual 650 GB dataset:

fallocate -l 650G bigfile.img

The Tsunami UDP transfer mechanism reaches its maximum available transfer rate only after a short ramp-up, and when many small files are transferred the per-file coordination overhead hurts performance, so it is better to move a few large files rather than a large number of small ones. For example, on an i2.2xlarge instance, Tsunami UDP sustained roughly 650 Mbps when transferring a single 650 GB file, whereas transferring the dataset as its many ~50 MB page-count files sustained a considerably lower rate.

To maximize data transfer performance for the Wikipedia dataset, you can create a single large tar file with the following command:

tar cvf /mnt/bigephemeral/wikistats.tar /mnt/bigephemeral

(5) Once the file is ready for transfer, start the Tsunami UDP server on TCP/UDP port 46224 and serve all the files on the ephemeral RAID0 storage array:

tsunamid --port 46224 /mnt/bigephemeral/*

2. Set up the Tsunami UDP client

(1) Launch an Amazon Linux instance in us-east-1 (Virginia). To make sure this instance is the same type as the one launched earlier in the ap-northeast-1 facility, you can use the CloudFormation template mentioned above, this time launched in us-east-1.

(2) Log in to the newly created instance via SSH.

3. Data Transmission and Performance Testing

(1) Run the Tsunami UDP client, replacing [server] in the command below with the public IP address of the Tsunami UDP server, i.e. the Amazon EC2 instance we launched earlier in the AWS Tokyo facility:

tsunami connect [server] get *

(2) If you want to cap the transfer rate to avoid saturating the network link, use the "set rate" option. For example, the following command limits the transfer rate to 100 Mbps:

tsunami set rate 100M connect [server] get *

(3) Use the CloudWatch NetworkOut metric on the Tsunami UDP server and the NetworkIn metric on the Tsunami UDP client to obtain performance data.
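As a quick alternative to the console, you can pull the same metrics with the AWS CLI; the instance ID and time window below are placeholders:

aws cloudwatch get-metric-statistics --namespace AWS/EC2 --metric-name NetworkIn \
    --dimensions Name=InstanceId,Value=i-xxxxxxxx \
    --start-time 2014-07-01T00:00:00Z --end-time 2014-07-01T01:00:00Z \
    --period 300 --statistics Average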

During the long-distance transfer from Tokyo to Virginia, our i2.2xlarge instance received data at 651 Mbps, an excellent 81.4 MB per second. Considering the distance between the two locations, we believe this result will satisfy most users.

(4) To compare the results with a TCP-based protocol, try the same transfer using SCP (secure copy). Example:

scp -i yourkey.pem ec2-user@[server]:/mnt/bigephemeral/bigfile.img .

We used the same i2.2xlarge instance type for both server and client in the SCP test, and transferring the single large 650 GB file over SCP achieved only about 9 MB per second, roughly one tenth of what Tsunami UDP delivered. Even allowing for the overhead SCP introduces by encrypting the SSH connection, Tsunami UDP clearly brings a significant performance improvement.

Move a dataset to Amazon S3

Once the data has been transferred out of the ap-northeast-1 facility and onto the EC2 instance in us-east-1, we can migrate it further into Amazon S3. After that, you can import it into Amazon Redshift with the parallel COPY command, analyze it directly with Amazon EMR, or archive it for future use:

(1) Create a new Amazon S3 bucket in the AWS Tokyo facility.

(2) Copy the data from the us-east-1 Amazon EC2 instance to the bucket you just created:

aws s3 cp --recursive /mnt/bigephemeral s3://<your-new-bucket>/

Note: Recent versions of the AWS CLI automatically use multipart upload to optimize data throughput for transfers to Amazon S3.
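If you want to tune the multipart behavior, the AWS CLI exposes a few s3 configuration settings; the values below are purely illustrative, not recommendations from this article:

aws configure set default.s3.multipart_threshold 64MB
aws configure set default.s3.multipart_chunksize 64MB
aws configure set default.s3.max_concurrent_requests 20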

Note: If you packaged the Wikipedia Traffic Statistics V2 dataset into a tar file before transferring it with Tsunami UDP, you will need to unpack it before analyzing it further with Amazon EMR.
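Assuming the paths used in the earlier steps, unpacking the tar file on the instance would look like this:

mkdir -p /mnt/bigephemeral/extracted
tar xvf /mnt/bigephemeral/wikistats.tar -C /mnt/bigephemeral/extracted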

Use Amazon EMR for Dataset Analysis

Once the dataset is stored in Amazon S3, you can analyze it with Amazon EMR. The examples in this article are specific to the Wikipedia dataset; you can click here for an example of querying the statistics with Apache Spark on Amazon EMR.
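As a rough sketch of the EMR side, launching a small Spark-enabled cluster from the AWS CLI might look like the following; the cluster name, release label, instance type and count, and key name are illustrative assumptions, not values from this article:

aws emr create-cluster --name "wikistats-spark" \
    --release-label emr-5.30.0 \
    --applications Name=Spark \
    --instance-type m4.xlarge --instance-count 3 \
    --use-default-roles --ec2-attributes KeyName=mykey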

Summary:

Tsunami UDP provides a free, simple, and efficient way to move large amounts of data between AWS and local environments, or between AWS regions. Combined with the multipart upload mechanism of the AWS CLI, it becomes a convenient, no-cost way to move large datasets. Amazon S3 and Amazon Glacier not only provide durable, very low-cost object storage, but also let you analyze the data with AWS services such as Amazon EMR and Amazon Redshift, saving both time and money.

That said, Tsunami UDP has its limitations. It has no native encryption, transfers over only a single thread, and lacks an SDK or plug-ins, which makes automation difficult. Running multiple clients against the single-threaded server can interrupt transfers and trigger retries, which usually reduces overall throughput. Tsunami UDP also has no native Amazon S3 integration, so transfers must first land on an Amazon EC2 instance and then be re-sent to Amazon S3 with a tool such as the AWS CLI.

Finally, because the Tsunami UDP code base was last updated in 2010, there is no commercial support and no active open-source community around the product. In contrast, the ExpeDat S3 Gateway and Signiant SkyDrop file acceleration solutions address these drawbacks: both support encryption, offer native S3 integration, and provide additional features that make them attractive to commercial customers.

