Comparison of DataX and Sqoop, Two Big Data Synchronization Tools

Source: Internet
Author: User
Keywords: big data, comparison

DataX is a tool for high-speed data exchange between heterogeneous databases and file systems; it implements data exchange among arbitrary data-processing systems (RDBMS, HDFS, local file system) and was developed by Taobao's data platform department. Sqoop is a tool for transferring data between Hadoop and relational databases (such as MySQL, Oracle, and Postgres): it can import data from a relational database into HDFS, and it can also export HDFS data back into a relational database. Both are data synchronization tools for heterogeneous big-data environments, so what is the difference between the two? This article is from Dean's blog.

Ever since I first encountered DataX, I have wondered how exactly it differs from Sqoop. Yesterday I deployed both DataX and Sqoop, which gave me a deeper understanding of the two.

In principle the two are similar: both solve data exchange in heterogeneous environments, and both support exchanging data among Oracle, MySQL, HDFS, and Hive. Support for each database is implemented as a plug-in, so supporting a new data source type only requires developing a new plug-in.

But if you look at the architecture of both, you'll soon find the obvious difference.

DataX framework composition

[Figure: DataX framework composition]

Job: a single data synchronization job.

Splitter: the job-segmentation module, which decomposes one large task into many small tasks that can run concurrently.

Sub-job: a small data synchronization task produced by the segmentation.

Reader (Loader): the data-reading module, which runs the segmented sub-tasks and loads data from the source into DataX.

Storage: the buffer through which Reader and Writer exchange data.

Writer (Dumper): the data-writing module, which exports data from DataX to the destination.
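The component model above can be sketched as a small producer/consumer pipeline. This is purely illustrative — the function and variable names here are made up for the sketch and are not DataX's real API:

```python
import queue
import threading

def splitter(rows, n):
    """Job segmentation: cut one large job into up to n concurrent sub-jobs."""
    size = (len(rows) + n - 1) // n
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def run_job(source_rows, sink, n_subjobs=4):
    """Splitter -> concurrent Readers -> Storage -> Writer."""
    storage = queue.Queue()  # Storage: buffer between Reader and Writer
    sub_jobs = splitter(source_rows, n_subjobs)

    def reader(sub_job):  # Reader: load one sub-job's rows into DataX
        for row in sub_job:
            storage.put(row)

    threads = [threading.Thread(target=reader, args=(sj,)) for sj in sub_jobs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Writer: drain Storage into the destination
    while not storage.empty():
        sink.append(storage.get())

rows = list(range(10))
out = []
run_job(rows, out)
print(sorted(out))  # rows may interleave across sub-jobs, but all 10 arrive
```

Note that everything here runs in one process on one machine, which is exactly the architectural point made below.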

Sqoop framework composition

[Figure: Sqoop framework composition]

DataX performs data extraction and loading directly on the machine where DataX runs.

Sqoop, by contrast, works entirely inside the MapReduce computing framework: it generates a MapReduce job based on the input parameters and runs it within Hadoop.

In theory, using the MapReduce framework to import from multiple nodes concurrently should be more efficient than running several parallel imports from a single node. Testing an Oracle-to-HDFS job bears this out: with DataX, database connections appear only on the machine running DataX, whereas when Sqoop runs, all 4 TaskTrackers open a database connection. The machine launching Sqoop also opens a database connection of its own, presumably to read metadata about the source table (row counts and so on) in order to compute the partitions.
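The partitioning step mentioned above can be sketched as follows: the Sqoop client reads the min/max of a split column, then hands each map task one contiguous range. This is a simplified sketch of Sqoop-style split logic for an integer split column, not Sqoop's actual code; the function name is invented for illustration:

```python
def compute_splits(lo, hi, num_mappers):
    """Divide the inclusive key range [lo, hi] into num_mappers
    contiguous sub-ranges, one per map task."""
    span = hi - lo + 1
    step = span / num_mappers
    splits = []
    for i in range(num_mappers):
        start = lo + round(i * step)
        end = lo + round((i + 1) * step) - 1
        splits.append((start, end))
    splits[-1] = (splits[-1][0], hi)  # ensure the last range reaches hi
    return splits

# e.g. primary keys 1..1000 spread across 4 map tasks:
print(compute_splits(1, 1000, 4))
# [(1, 250), (251, 500), (501, 750), (751, 1000)]
```

Each map task then issues its own bounded query (e.g. `WHERE id >= 251 AND id <= 500`), which is why every TaskTracker opens its own database connection.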

Now that Sqoop is a top-level Apache project, I think I would choose Sqoop if I had to pick between DataX and Sqoop. Sqoop also has many third-party plug-ins. This morning I used the OraOop plug-in developed by Quest, and, just as Quest claims, the speed improved greatly; Quest's database expertise really is deeper than most.


In my test environment (an Oracle database with 700 MB of memory and weak I/O, on a gigabit network), using the Quest Sqoop plug-in with a parallelism of 4, the export speed to HDFS reached 5 MB/s, which I find very satisfying. That is nearly twice as fast as native Sqoop's 2.8 MB/s, and native Sqoop in turn is more than twice as fast as DataX's 760 KB/s.

One more point: Sqoop is invoked from the command line, which makes it easy to integrate with our existing scheduling and monitoring setup, whereas DataX uses XML configuration files, which is somewhat less convenient for development and operations.
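To make the scheduling point concrete: because a Sqoop job is just one command line, any scheduler can assemble it, launch it as a subprocess, and watch the exit code. A minimal sketch, assuming a made-up JDBC URL, user, and table (the flags themselves — `--connect`, `--username`, `--table`, `--target-dir`, `--num-mappers` — are standard Sqoop import options):

```python
def build_sqoop_import(jdbc_url, user, table, target_dir, mappers=4):
    """Assemble a `sqoop import` invocation as an argv list that a
    scheduler can hand straight to subprocess.run(cmd, check=True)."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", user,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(mappers),
    ]

# Hypothetical connection details, for illustration only:
cmd = build_sqoop_import(
    "jdbc:oracle:thin:@db-host:1521:orcl", "scott", "EMP", "/data/emp")
print(" ".join(cmd))
```

With DataX, by contrast, the scheduler must first generate or edit an XML job file on disk before each run, which is the operational friction mentioned above.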

Fig. 1. Sqoop with Quest Oracle Connector
