The Apache Sqoop (SQL-to-Hadoop) project is designed to support efficient big data exchange between an RDBMS and Hadoop. With Sqoop's help, users can easily import data from a relational database into Hadoop and its related systems (such as HBase and Hive), and can likewise extract data from the Hadoop system and export it to a relational database. Beyond these main functions, Sqoop also provides useful small tools, such as listing the tables in a database. In theory, Sqoop supports any database that implements the JDBC specification, such as DB2 and MySQL. Sqoop can import data from a DB2 database into HDFS and save it as one of several file types: plain delimited text, Avro binary, or SequenceFiles. This article uses the delimited text type throughout.
One of Sqoop's highlights is that it imports data from relational databases into HDFS through Hadoop MapReduce. Sqoop's architecture is very simple: it integrates with Hive, HBase, and Oozie, and moves data through MapReduce tasks, which provides concurrency and fault tolerance.
When importing, Sqoop requires a split-by parameter to be specified. Sqoop partitions the data according to the values of the split-by column and assigns the resulting regions to different map tasks; each map task processes the rows it fetches from the database one by one and writes them to HDFS. Different split-by column types are partitioned in different ways. For a simple int column, for example, Sqoop takes the maximum and minimum values of the split-by field and then divides that range into the number of regions given by num-mappers. For example, if select max(split-by), min(split-by) from table returns 1000 and 1 and num-mappers is 2, the range is divided into the two regions (1, 500) and (501, 1000), and two SQL statements are generated for the two maps: select XXX from table where split-by >= 1 and split-by < 501, and select XXX from table where split-by >= 501 and split-by <= 1000. Finally, each map executes its own SQL statement to carry out its share of the import.
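Below is a minimal sketch of the integer split computation described above, assuming an evenly divided range. It is illustrative only, not Sqoop's actual splitter, and the class and method names are made up:

```java
import java.util.ArrayList;
import java.util.List;

public class IntSplitSketch {

    /** Splits the inclusive range [min, max] into numMappers contiguous regions. */
    static List<long[]> split(long min, long max, int numMappers) {
        List<long[]> regions = new ArrayList<>();
        long size = (max - min + 1) / numMappers;
        long lo = min;
        for (int i = 0; i < numMappers; i++) {
            // The last region absorbs any remainder so the whole range is covered.
            long hi = (i == numMappers - 1) ? max : lo + size - 1;
            regions.add(new long[] {lo, hi});
            lo = hi + 1;
        }
        return regions;
    }

    public static void main(String[] args) {
        // min=1, max=1000, num-mappers=2 -> [1, 500] and [501, 1000], i.e.
        // WHERE split-by >= 1 AND split-by < 501, and
        // WHERE split-by >= 501 AND split-by <= 1000.
        for (long[] r : split(1, 1000, 2)) {
            System.out.println("[" + r[0] + ", " + r[1] + "]");
        }
    }
}
```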
Sqoop's general process
Sqoop reads the structure of the table whose data is to be imported, generates a run class (QueryResult by default), packages it into a jar, and submits it to Hadoop; it then sets up the job, which mainly means setting the various parameters covered in chapter six. From there, Hadoop executes the MapReduce job that carries out the Import command.
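As a concrete illustration of this submission step, here is a minimal sketch of launching an import programmatically. It assumes Sqoop 1.4.x on the classpath, and the JDBC URL, credentials, table, and target directory are placeholders:

```java
import org.apache.sqoop.Sqoop;

public class ImportDriverSketch {
    public static void main(String[] args) {
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://dbhost:3306/testdb",  // placeholder URL
            "--username", "dbuser",
            "--password", "secret",      // plain text: see the sqoop1 vs. sqoop2
                                         // security discussion further below
            "--table", "orders",
            "--split-by", "id",          // the split column discussed above
            "--num-mappers", "2",
            "--target-dir", "/user/hadoop/orders"
        };
        // Generates the record class, builds the jar, and submits the job.
        int exitCode = Sqoop.runTool(importArgs);
        System.exit(exitCode);
    }
}
```

The MapReduce job this launches proceeds through the following steps: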
1) First the data is split, i.e. the DataSplit is computed: DataDrivenDBInputFormat.getSplits(JobContext job).
2) After the range is split, the split boundaries are written out so that they can be read back later: DataDrivenDBInputFormat.write(DataOutput output); here the boundaries are lowerBoundQuery and upperBoundQuery.
3) Read back the range written in step 2): DataDrivenDBInputFormat.readFields(DataInput input).
4) Then create a RecordReader to read data from the database: DataDrivenDBInputFormat.createRecordReader(InputSplit split, TaskAttemptContext context).
5) Create the map: TextImportMapper.setup(Context context).
6) The RecordReader reads data row by row from the relational database, sets the map's key and value, and hands them to the map: DBRecordReader.nextKeyValue().
7) Run the map: TextImportMapper.map(LongWritable key, SqoopRecord val, Context context). The key finally emitted is the row of data produced by QueryResult, and the value is NullWritable.get() (see the simplified sketch below).
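To make steps 5) through 7) concrete, here is a simplified sketch of what a text-import mapper does. It is illustrative, not Sqoop's actual TextImportMapper; in particular, the real mapper receives a generated SqoopRecord (such as QueryResult), where this sketch substitutes a plain Text row:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each database row arrives as the map value, is rendered as delimited text,
// and is emitted as the output key with a NullWritable value, so the HDFS
// output file ends up with one delimited line per database row.
public class TextImportSketchMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable key, Text row, Context context)
            throws IOException, InterruptedException {
        outKey.set(row.toString());
        context.write(outKey, NullWritable.get());
    }
}
```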
Sqoop1 and Sqoop2 architecture changes
The two versions are completely incompatible; in terms of version numbers, 1.4.x is sqoop1 and 1.99.x is sqoop2. sqoop1 and sqoop2 are entirely different in both architecture and usage. Architecturally, sqoop1 uses only a sqoop client, while sqoop2 introduces a sqoop server and implements centralized management of connectors. Its access methods have also diversified: it can be reached through a REST API, a Java API, a Web UI, and the CLI console. Security has improved somewhat as well: with sqoop1 we often use scripts to import data from HDFS into MySQL, or MySQL data into HDFS, and the MySQL user name and password must appear explicitly inside the script, so security is not handled well. With sqoop2, if you connect through the CLI there is an interactive process in which the password you type is not displayed, and sqoop2 also introduces a role-based security mechanism. The figures below give a simple architecture comparison between sqoop1 and sqoop2:
Sqoop1 architecture diagram:
Sqoop2 architecture diagram:
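Since sqoop2 can be reached through a REST API in addition to the CLI and Web UI, clients no longer need credentials baked into local scripts. Here is a minimal sketch of querying the sqoop2 server over REST from Java; the host, the port (12000 is the usual server default), and the endpoint path are assumptions and should be adjusted to the actual deployment:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class Sqoop2RestSketch {
    public static void main(String[] args) throws Exception {
        // Assumed endpoint; check your sqoop2 server's documented REST paths.
        URL url = new URL("http://sqoop-server:12000/sqoop/version");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", "application/json");

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);  // JSON describing the server version
            }
        }
    }
}
```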
sqoop1 advantages: simple deployment.
sqoop1 disadvantages: the command line is error-prone; the data format is tightly coupled; not all data types are supported; the security mechanism is imperfect (for example, passwords can leak); installation requires root privileges; connectors must conform to the JDBC model.
sqoop2 advantages: interactive access via command line, Web UI, and REST API; centralized connector management, with everything installed on the sqoop server; an improved permission-management mechanism; standardized connectors that are responsible only for reading and writing data.
sqoop2 disadvantages: a somewhat more complex architecture, and more tedious deployment and configuration.