The Apache Sqoop (SQL-to-Hadoop) project is designed to support efficient big data exchange between an RDBMS and Hadoop. With Sqoop's help, users can easily import data from a relational database into Hadoop and its related systems (such as HBase and Hive), and can likewise extract data from the Hadoop system and export it to a relational database. Beyond these main functions, Sqoop also provides useful small tools, such as listing the tables in a database. In theory, Sqoop supports any database that implements the JDBC specification, such as DB2 and MySQL. Sqoop can import data from a DB2 database into HDFS and save it as one of several file types: plain delimited text, Avro binary, or SequenceFiles. This article uses the delimited text type throughout.
One of Sqoop's highlights is that it imports data from relational databases into HDFS through Hadoop MapReduce. Sqoop's architecture is very simple: it integrates with Hive, HBase, and Oozie, and moves data through MapReduce tasks, which provides concurrency and fault tolerance.
When importing, Sqoop requires a split-by parameter to be specified. Sqoop partitions the data according to the values of the split-by column and assigns the resulting regions to different map tasks; each map task processes the rows it fetches from the database one by one and writes them to HDFS. Different split-by column types are partitioned in different ways. For a simple int column, for example, Sqoop takes the maximum and minimum values of the split-by field and then divides that range into the number of regions given by num-mappers. For example, if select max(split-by), min(split-by) from table returns 1000 and 1 and num-mappers is 2, the range is divided into the two regions (1, 500) and (501, 1000), and two SQL statements are generated for the two maps: select XXX from table where split-by >= 1 and split-by < 501, and select XXX from table where split-by >= 501 and split-by <= 1000. Finally, each map executes its own SQL statement to carry out its share of the import.
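Below is a minimal sketch of the integer split computation described above, assuming an evenly divided range. It is illustrative only, not Sqoop's actual splitter, and the class and method names are made up:

```java
import java.util.ArrayList;
import java.util.List;

public class IntSplitSketch {

    /** Splits the inclusive range [min, max] into numMappers contiguous regions. */
    static List<long[]> split(long min, long max, int numMappers) {
        List<long[]> regions = new ArrayList<>();
        long size = (max - min + 1) / numMappers;
        long lo = min;
        for (int i = 0; i < numMappers; i++) {
            // The last region absorbs any remainder so the whole range is covered.
            long hi = (i == numMappers - 1) ? max : lo + size - 1;
            regions.add(new long[] {lo, hi});
            lo = hi + 1;
        }
        return regions;
    }

    public static void main(String[] args) {
        // min=1, max=1000, num-mappers=2 -> [1, 500] and [501, 1000], i.e.
        // WHERE split-by >= 1 AND split-by < 501, and
        // WHERE split-by >= 501 AND split-by <= 1000.
        for (long[] r : split(1, 1000, 2)) {
            System.out.println("[" + r[0] + ", " + r[1] + "]");
        }
    }
}
```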
Sqoop's general process
Sqoop reads the structure of the table whose data is to be imported, generates a run class (QueryResult by default), packages it into a jar, and submits it to Hadoop; it then sets up the job, which mainly means setting the various parameters covered in chapter six. From there, Hadoop executes the MapReduce job that carries out the Import command.
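As a concrete illustration of this submission step, here is a minimal sketch of launching an import programmatically. It assumes Sqoop 1.4.x on the classpath, and the JDBC URL, credentials, table, and target directory are placeholders:

```java
import org.apache.sqoop.Sqoop;

public class ImportDriverSketch {
    public static void main(String[] args) {
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://dbhost:3306/testdb",  // placeholder URL
            "--username", "dbuser",
            "--password", "secret",      // plain text: see the sqoop1 vs. sqoop2
                                         // security discussion further below
            "--table", "orders",
            "--split-by", "id",          // the split column discussed above
            "--num-mappers", "2",
            "--target-dir", "/user/hadoop/orders"
        };
        // Generates the record class, builds the jar, and submits the job.
        int exitCode = Sqoop.runTool(importArgs);
        System.exit(exitCode);
    }
}
```

The MapReduce job this launches proceeds through the following steps: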
1) First the data is split, i.e. the DataSplit is computed: DataDrivenDBInputFormat.getSplits(JobContext job).
2) After the range is split, the split boundaries are written out so that they can be read back later: DataDrivenDBInputFormat.write(DataOutput output); here the boundaries are lowerBoundQuery and upperBoundQuery.
3) Read back the range written in step 2): DataDrivenDBInputFormat.readFields(DataInput input).
4) Then create a RecordReader to read data from the database: DataDrivenDBInputFormat.createRecordReader(InputSplit split, TaskAttemptContext context).
5) Create the map: TextImportMapper.setup(Context context).
6) The RecordReader reads data row by row from the relational database, sets the map's key and value, and hands them to the map: DBRecordReader.nextKeyValue().
7) Run the map: TextImportMapper.map(LongWritable key, SqoopRecord val, Context context). The key finally emitted is the row of data produced by QueryResult, and the value is NullWritable.get() (see the simplified sketch below).
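To make steps 5) through 7) concrete, here is a simplified sketch of what a text-import mapper does. It is illustrative, not Sqoop's actual TextImportMapper; in particular, the real mapper receives a generated SqoopRecord (such as QueryResult), where this sketch substitutes a plain Text row:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each database row arrives as the map value, is rendered as delimited text,
// and is emitted as the output key with a NullWritable value, so the HDFS
// output file ends up with one delimited line per database row.
public class TextImportSketchMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable key, Text row, Context context)
            throws IOException, InterruptedException {
        outKey.set(row.toString());
        context.write(outKey, NullWritable.get());
    }
}
```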
Sqoop1 and Sqoop2 architecture changes
The two versions are completely incompatible; in terms of version numbers, 1.4.x is sqoop1 and 1.99.x is sqoop2. sqoop1 and sqoop2 are entirely different in both architecture and usage. Architecturally, sqoop1 uses only a sqoop client, while sqoop2 introduces a sqoop server and implements centralized management of connectors. Its access methods have also diversified: it can be reached through a REST API, a Java API, a Web UI, and the CLI console. Security has improved somewhat as well: with sqoop1 we often use scripts to import data from HDFS into MySQL, or MySQL data into HDFS, and the MySQL user name and password must appear explicitly inside the script, so security is not handled well. With sqoop2, if you connect through the CLI there is an interactive process in which the password you type is not displayed, and sqoop2 also introduces a role-based security mechanism. The figures below give a simple architecture comparison between sqoop1 and sqoop2:
Sqoop1 architecture diagram:
Sqoop2 architecture diagram:
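Since sqoop2 can be reached through a REST API in addition to the CLI and Web UI, clients no longer need credentials baked into local scripts. Here is a minimal sketch of querying the sqoop2 server over REST from Java; the host, the port (12000 is the usual server default), and the endpoint path are assumptions and should be adjusted to the actual deployment:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class Sqoop2RestSketch {
    public static void main(String[] args) throws Exception {
        // Assumed endpoint; check your sqoop2 server's documented REST paths.
        URL url = new URL("http://sqoop-server:12000/sqoop/version");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", "application/json");

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);  // JSON describing the server version
            }
        }
    }
}
```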
sqoop1 advantages: simple deployment.
sqoop1 disadvantages: the command line is error-prone; the data format is tightly coupled; not all data types are supported; the security mechanism is imperfect (for example, passwords can leak); installation requires root privileges; connectors must conform to the JDBC model.
sqoop2 advantages: interactive access via command line, Web UI, and REST API; centralized connector management, with everything installed on the sqoop server; an improved permission-management mechanism; standardized connectors that are responsible only for reading and writing data.
sqoop2 disadvantages: a somewhat more complex architecture, and more tedious deployment and configuration.