Detailed Sqoop architecture and installation deployment


First, what is Sqoop?

Sqoop is a bridge between traditional relational databases and Hadoop. It covers the following two directions:
1. Importing data from a relational database into Hadoop and its associated systems, such as Hive and HBase.
2. Extracting data from the Hadoop system and exporting it to a relational database.
Sqoop's core design idea is to use MapReduce to speed up data transfer, which means that its import and export functions are implemented as MapReduce jobs. Because this is a batch-oriented transfer mechanism, real-time import and export are difficult to achieve.

Second, why choose Sqoop?

Why do we choose Sqoop? The choice is usually based on three considerations:
1. It uses resources effectively and controllably: you can adjust the number of map tasks to control concurrency, and you can also configure settings such as the database access time.
2. It automatically maps and converts data types. The data we import is typed, and Sqoop converts it to the corresponding Hadoop types based on the database column types; users can also customize the mapping (see the sketch after this list).
3. It supports a variety of databases, such as MySQL, Oracle, and PostgreSQL.
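As a hedged sketch of the custom type mapping mentioned in point 2, the command below overrides Sqoop's automatic mapping with --map-column-java; the user table and the Long/String mappings are assumptions for illustration.

sqoop import --connect jdbc:mysql://db.ywendeng.net:3306/djtdb_hadoop --username sqoop --password sqoop --table user --map-column-java id=Long,username=String
Without this option, Sqoop derives the Java types from the database column types; the related --map-column-hive option plays the same role for Hive imports.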

Third, Sqoop architecture and common operations

The Sqoop architecture is very simple. It consists of three main parts: the Sqoop client, HDFS/HBase/Hive, and the database. Let's take a look at the architecture diagram of Sqoop.
  
After the user issues a command to Sqoop, the command is converted into a map-only MapReduce job. The map tasks access the metadata of the database, read the data from the database in parallel, and import it into Hadoop. Data can of course also be exported from Hadoop back into a traditional relational database. The core idea is that a map-only MapReduce job copies and transfers the data concurrently, which greatly improves efficiency; a minimal sketch follows.
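As a hedged sketch of this map-only parallel copy, the command below asks for four map tasks and tells Sqoop which column to split the table on; the choice of the id column and of four mappers is an assumption for illustration.

sqoop import --connect jdbc:mysql://db.ywendeng.net:3306/djtdb_hadoop --username sqoop --password sqoop --table user --split-by id --num-mappers 4
Sqoop queries the minimum and maximum of the split column, divides that range into four slices, and runs one map task per slice, so the slices are copied to HDFS in parallel.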

    • Combination of Sqoop and HDFS

      Below we combine Sqoop with HDFS and introduce importing data from and exporting data to a relational database.
      Sqoop Import
      Its function is to import data from a relational database into HDFS; the flow is shown in the diagram below.
        
      Let's analyze the Sqoop import process. First, the user enters a sqoop import command. Sqoop obtains the metadata from the relational database, such as the schema of the table, which fields it has, and what data types those fields are. It then converts the command into a map-only MapReduce job. The job contains many map tasks, each of which reads a slice of the data from the database, so that multiple map tasks copy the whole dataset to HDFS concurrently and quickly.
      Let's take a look at how to import data from the command line; the syntax is shown below.

sqoop import \
--connect jdbc:mysql://db.ywendeng.net:3306/djtdb_hadoop \
--username sqoop \
--password sqoop \
--table user \
--target-dir /junior/sqoop/ \
--where "sex='female'" \
--as-sequencefile \
--num-mappers <n> \
--null-string '\\N' \
--null-non-string '\\N'
--connect: specifies the JDBC URL.
--username/--password: the username and password of the MySQL database.
--table: the database table to read.
--target-dir: optional; the HDFS directory to import into. If it is not specified, the data is imported under /user into a directory named after the database table.
--where: optional; filters the rows to be imported from the database.
--as-sequencefile: optional; specifies the file format of the imported data. If no format is specified, the data defaults to text format.
--num-mappers: optional; the number of concurrent map tasks. This value should not be too large.
--null-string, --null-non-string: optional; the string (for example '\\N') that NULL database fields are converted to.

Next we introduce several special applications of Sqoop data import.
1. Incremental import: Sqoop does not need to re-import all previously imported data into HDFS each time; it only needs to append the newly added rows. Let's see how to import new data.

sqoop import \
--connect jdbc:mysql://db.ywendeng.net:3306/djtdb_hadoop \
--username sqoop \
--password sqoop \
--table user \
--incremental append \
--check-column id \
--last-value 999
--incremental append: import only the newly added (incremental) data.
--check-column id: use the primary key column id as the check column.
--last-value 999: import the new rows whose id is greater than 999.
Used together, these three options implement incremental data import.
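Besides append mode, Sqoop's incremental import also has a lastmodified mode keyed on a timestamp column rather than an auto-increment id. A hedged sketch follows; the last_update column name and the timestamp value are assumptions for illustration.

sqoop import \
--connect jdbc:mysql://db.ywendeng.net:3306/djtdb_hadoop \
--username sqoop \
--password sqoop \
--table user \
--incremental lastmodified \
--check-column last_update \
--last-value "2016-01-01 00:00:00"
Rows whose assumed last_update timestamp is newer than the given --last-value are imported; at the end of each run Sqoop prints the new --last-value to use for the next run.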

2. During Sqoop data import, typing the password directly on the command line is a clear security risk. We can avoid this risk in the following two ways.
1) -P: if -P is used at the end of the sqoop command line, the user is prompted for a password, and the entered password is not displayed, which provides some protection. The Sqoop command is not executed until the password is entered correctly.

sqoop import --connect jdbc:mysql://db.ywendeng.net:3306/djtdb_hadoop --username sqoop --table user -P

2) --password-file: specifies a file that holds the password, from which Sqoop reads it. We can make this file readable only by ourselves to prevent the password from leaking.

sqoop import --connect jdbc:mysql://db.ywendeng.net:3306/csdb_hadoop --username sqoop --table user --password-file my-sqoop-password
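The password file itself can be created as shown below; the local file my-sqoop-password is an assumption for illustration, and note that Sqoop resolves this path through the Hadoop FileSystem, so depending on your configuration it may need to be placed on HDFS or given a file:// prefix. The -n flag matters because a trailing newline would otherwise become part of the password.

echo -n "sqoop" > my-sqoop-password
chmod 400 my-sqoop-password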

Sqoop Export
Its function is to export data from HDFS into a relational database table; the flow is shown in the diagram below.


Let's analyze the Sqoop export process. First, the user enters a sqoop export command. Sqoop obtains the schema of the relational database table and establishes a mapping between the Hadoop fields and the database table fields. The command is then converted into a map-only MapReduce job, in which many map tasks read the data from HDFS in parallel and copy the whole dataset into the database.
Let's take a look at how to export data from the command line; the syntax is as follows.

sqoop export \
--connect jdbc:mysql://db.ywendeng.net:3306/csdb_hadoop \
--username sqoop \
--password sqoop \
--table user \
--export-dir user
--connect: specifies the JDBC URL.
--username/--password: the username and password of the MySQL database.
--table: the database table to export into.
--export-dir: the directory on HDFS where the data is stored.

Next we introduce several special applications of Sqoop data export.
1. By default, Sqoop export inserts rows into the database one at a time, so export efficiency is very low. We can use batch export to improve efficiency; the syntax is as follows.

sqoop export \
-Dsqoop.export.records.per.statement=10 \
--connect jdbc:mysql://db.ywendeng.net:3306/csdb_hadoop \
--username sqoop \
--password sqoop \
--table user \
--export-dir user \
--batch
-Dsqoop.export.records.per.statement: specifies that 10 records are written per insert statement.
--batch: enables batch export using JDBC batching.

2. In practice, the following problem can occur: if a map task fails during export, it is rescheduled and re-run on another node, and the rows already written by the failed attempt are written again, so duplicate data ends up in the database. Because a map task has no rollback policy, once it fails the data it has already written to the database cannot be taken back. Sqoop export provides a mechanism to guarantee atomicity: the --staging-table option specifies a staging table for the export. The export is then performed in two steps: first, the data is written into the staging table in the database, and if a map task fails during this step, the staging table is cleared and the data is re-exported; second, once all map tasks have succeeded, the data in the staging table is moved into the specified target table.

sqoop export --connect jdbc:mysql://db.ywendeng.net:3306/csdb_hadoop --username sqoop --password sqoop --table user --export-dir user --staging-table staging_user
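The staging table must already exist and have the same structure as the target table. If a previous failed run may have left rows in it, the --clear-staging-table option empties it before the new export starts; a hedged sketch, reusing the staging_user table from above:

sqoop export --connect jdbc:mysql://db.ywendeng.net:3306/csdb_hadoop --username sqoop --password sqoop --table user --export-dir user --staging-table staging_user --clear-staging-table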

3. During Sqoop export, if we want to update data that already exists in the database, we can take the following two approaches.
1) Use --update-key id to update existing rows.

sqoop export --connect jdbc:mysql://db.ywendeng.net:3306/csdb_hadoop --username sqoop --password sqoop --table user --export-dir user --update-key id

2) Use --update-key id together with --update-mode allowinsert: if a row already exists it is updated, and if it does not exist a new row is inserted.

sqoop export --connect jdbc:mysql://db.ywendeng.net:3306/csdb_hadoop --username sqoop --password sqoop --table user --export-dir user --update-key id --update-mode allowinsert

4. If the amount of data in HDFS is large and many fields are not needed, we can use --columns to specify which columns to export.

sqoop export --connect jdbc:mysql://db.ywendeng.net:3306/csdb_hadoop --username sqoop --password sqoop --table user --export-dir user --columns username,sex

5. When an exported field is missing or null, we use --input-null-string and --input-null-non-string to handle it.

sqoop export --connect jdbc:mysql://db.ywendeng.net:3306/csdb_hadoop --username sqoop --password sqoop --table user --export-dir user --input-null-string '\\N' --input-null-non-string '\\N'
    • Sqoop combined with other systems

      Sqoop can also be combined with systems such as Hive and HBase to import and export data. To do so, users need to add environment variables such as HBASE_HOME and HIVE_HOME in sqoop-env.sh, for example as sketched below.
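      A hedged sketch of the relevant lines in sqoop-env.sh; the installation paths are assumptions and should point at your own Hive and HBase directories.

export HIVE_HOME=/usr/java/hive-1.0.0
export HBASE_HOME=/usr/java/hbase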

      1. Combining Sqoop with Hive is relatively simple: use the --hive-import option.

sqoop import --connect jdbc:mysql://db.ywendeng.net:3306/csdb_hadoop --username sqoop --password sqoop --table user --hive-import
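By default the data lands in a Hive table named after the source table. If a different Hive table is wanted, or Sqoop should create it, the --hive-table and --create-hive-table options can be added; a hedged sketch, in which the table name user_hive is an assumption:

sqoop import --connect jdbc:mysql://db.ywendeng.net:3306/csdb_hadoop --username sqoop --password sqoop --table user --hive-import --hive-table user_hive --create-hive-table
Here --create-hive-table makes Sqoop create the Hive table and causes the job to fail if the table already exists.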

      2. Combining Sqoop with HBase is a little more involved: use --hbase-table to specify the HBase table name and --column-family to specify the column family name.

sqoop import --connect jdbc:mysql://db.ywendeng.net:3306/csdb_hadoop --username sqoop --password sqoop --table user --hbase-table user --column-family city
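By default Sqoop expects the HBase table to exist and uses the split column as the row key; the --hbase-row-key and --hbase-create-table options make this explicit. A hedged sketch, using the primary key id as the row key:

sqoop import --connect jdbc:mysql://db.ywendeng.net:3306/csdb_hadoop --username sqoop --password sqoop --table user --hbase-table user --column-family city --hbase-row-key id --hbase-create-table
Here --hbase-create-table tells Sqoop to create the HBase table and column family if they do not already exist.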
Fourth, Sqoop installation steps

Our Hadoop cluster runs Hadoop 2.2.0, so the Sqoop version must match it, otherwise problems will occur when using the Sqoop tool. Here we choose the sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz release. The installation of Sqoop is simple and is completed in the following steps.
1. First put the downloaded sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz into the /usr/java/ directory, then unpack the archive, rename the directory, and change its ownership.

[root@cs0 java]# tar zxvf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz   # unpack the archive
[root@cs0 java]# rm sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz         # remove the archive
[root@cs0 java]# mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha sqoop          # rename the installation directory
[root@cs0 java]# chown -R hadoop:hadoop sqoop                          # give the hadoop user ownership of sqoop

2. Switch to the sqoop/conf directory and execute the following commands.

[hadoop@cs0 java]$ cd sqoop/conf
[hadoop@cs0 conf]$ mv sqoop-env-template.sh sqoop-env.sh
Then open the file with the vi sqoop-env.sh command and add the following content.
# Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/usr/java/hadoop-2.2.0-x64
# Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/usr/java/hadoop-2.2.0-x64
# Set the path to where bin/hbase is available
#export HBASE_HOME=
# Set the path to where bin/hive is available
export HIVE_HOME=/usr/java/hive-1.0.0
# Set the path for where zookeeper config dir is
#export ZOOCFGDIR=
If the data to be transferred does not involve HBase or Hive, the HBASE_HOME and HIVE_HOME settings can be left out; and if the cluster has a separate ZooKeeper cluster, configure ZOOCFGDIR, otherwise it is not needed.

3. Copy the required jar packages to the sqoop/lib directory. Three core jar packages from the Hadoop 2.2.0 installation need to be copied: commons-cli-1.2.jar, hadoop-common-2.2.0.jar, and hadoop-mapreduce-client-core-2.2.0.jar. The database driver jar is also needed; since we are using a MySQL database, we copy mysql-connector-java-5.1.21.jar.

cp commons-cli-1.2.jar /usr/java/sqoop/lib
cp hadoop-common-2.2.0.jar /usr/java/sqoop/lib
cp hadoop-mapreduce-client-core-2.2.0.jar /usr/java/sqoop/lib
cp mysql-connector-java-5.1.21.jar /usr/java/sqoop/lib

4. Add environment variables.

[hadoop@cs0 java]$ vi ~/.bash_profile
export SQOOP_HOME=/usr/java/sqoop        # the Sqoop installation directory
export PATH=$PATH:$SQOOP_HOME/bin
After adding these variables, execute the following command so that they take effect.
[hadoop@cs0 java]$ source ~/.bash_profile

5. Test run

[hadoop@cs0 java]$ sqoop list-databases \
> --connect jdbc:mysql://db.ywendeng.net:3306/djtdb_hadoop \
> --username sqoop \
> --password sqoop
... INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
... WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
... INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
information_schema
If the sqoop command executes successfully and lists the databases, the installation was successful.
