First, what is Sqoop?
Sqoop is a bridge connecting traditional relational databases to Hadoop. It covers the following two directions:
1. Importing data from a relational database into Hadoop and its associated systems, such as Hive and HBase.
2. Extracting data from the Hadoop system and exporting it to a relational database.
Sqoop's core design idea is to use MapReduce to speed up data transfer: its import and export functions are implemented as MapReduce jobs. Data transfer is therefore a batch process, and real-time import and export are difficult to achieve with Sqoop.
Second, why choose Sqoop
Why do we choose Sqoop? The choice is usually based on three considerations:
1. It uses resources effectively and controllably: you can adjust the number of tasks to control concurrency, configure the times at which the database may be accessed, and so on.
2. It automatically maps and converts data types. The data we import is typed, and Sqoop converts it automatically based on the types in the database; users can also customize the mapping between them (see the sketch after this list).
3. It supports a variety of databases, such as MySQL, Oracle, and PostgreSQL.
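For example, the default mapping can be overridden on the command line. The sketch below is a minimal illustration, not from the original text: it assumes the user table used in the later examples and uses Sqoop's --map-column-java option to force the id column to Java's Integer type (--map-column-hive plays the same role for Hive imports).
sqoop import --connect jdbc:mysql://db.ywendeng.net:3306/djtdb_hadoop --username sqoop --password sqoop --table user --map-column-java id=Integer    // assumption: table user has an integer-compatible column id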
Third, Sqoop architecture and common operations
Sqoop's architecture is very simple. It mainly consists of three parts: the Sqoop client, HDFS/HBase/Hive, and the database. Let's take a look at Sqoop's architecture diagram.
After the user issues a command to Sqoop, the command is converted into a MapReduce job consisting only of map tasks. The map tasks access the database's metadata, read the data from the database in parallel, and import it into Hadoop. Of course, you can also export data from Hadoop into a traditional relational database. The core idea is to copy and transfer data concurrently through a map-only MapReduce job, which greatly improves efficiency.
Sqoop combined with HDFS
Below, using HDFS, we introduce how Sqoop imports data from and exports data to a relational database.
Sqoop Import
Its function is to import data from a relational database into HDFS, with the flowchart shown below.
Let's analyze the Sqoop import process. First, the user enters a sqoop import command. Sqoop then fetches the metadata from the relational database: the table's schema, which fields it has, what data types those fields are, and so on. After obtaining this information, it converts the command into a map-only MapReduce job. That job contains many map tasks, each of which reads a slice of the data from the database, so that multiple map tasks copy the data concurrently and the whole dataset is quickly copied to HDFS.
Let's take a look at how Sqoop imports data from the command line; the syntax is shown below.
sqoop import --connect jdbc:mysql://db.ywendeng.net:3306/djtdb_hadoop --username sqoop --password sqoop --table user \
--target-dir /junior/sqoop/ \    // optional; if no directory is specified, data is imported under /user by default
--where "sex='female'" \    // optional
--as-sequencefile \    // optional; if no format is specified, the default is text
--num-mappers <n> \    // optional; this value should not be too large
--null-string '\\n' \    // optional
--null-non-string '\\n'    // optional
--connect: specifies the JDBC URL.
--username/--password: the username and password of the MySQL database.
--table: the database table to read.
--target-dir: the HDFS directory to import the data into; if not specified, the file name defaults to the database table name.
--where: filters the rows to be imported from the database.
--as-sequencefile: specifies the format of the imported data.
--num-mappers: specifies the concurrency of the map tasks.
--null-string, --null-non-string: null fields in the database are written as '\\n', since null fields would otherwise take up a lot of space.
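How the table is sliced among the map tasks also affects concurrency. By default Sqoop splits on the table's primary key; the --split-by option selects the split column explicitly. A minimal sketch, not from the original text, assuming the same user table with a numeric id column:
sqoop import --connect jdbc:mysql://db.ywendeng.net:3306/djtdb_hadoop --username sqoop --password sqoop --table user --split-by id --num-mappers 4    // 4 map tasks, each reading one id range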
Here we introduce several special applications of Sqoop data import.
1. Sqoop does not need to re-import all previously transferred data into HDFS on each run; it only needs to add the new data to HDFS. Let's see how to import only the new data.
sqoop import --connect jdbc:mysql://db.ywendeng.net:3306/djtdb_hadoop --username sqoop --password sqoop --table user \
--incremental append \    // import only incremental data
--check-column id \    // use the primary key id as the check column
--last-value 999    // import new rows with id greater than 999
Used together, these three options implement incremental data import.
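Append mode is not the only incremental mode: Sqoop also supports --incremental lastmodified, which picks up rows whose timestamp column changed after the given value. A sketch, assuming a hypothetical last_update timestamp column on the user table:
sqoop import --connect jdbc:mysql://db.ywendeng.net:3306/djtdb_hadoop --username sqoop --password sqoop --table user --incremental lastmodified --check-column last_update --last-value "2015-01-01 00:00:00"    // import rows modified after this time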
2. During a Sqoop import, typing the password in clear text on the command line is a security risk. We can avoid this risk in the following two ways.
1) -P: appending -P to the sqoop command line prompts the user to enter a password, and the typed password is not displayed, which provides some protection. The Sqoop command is not executed until the password is entered correctly.
sqoop import --connect jdbc:mysql://db.ywendeng.net:3306/djtdb_hadoop --username sqoop --table user -P
2) --password-file: specifies a file that stores the password, from which Sqoop reads it. We can make this file readable only by ourselves to prevent the password from leaking.
sqoop import --connect jdbc:mysql://db.ywendeng.net:3306/csdb_hadoop --username sqoop --table user --password-file my-sqoop-password
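The password file itself can be prepared as sketched below (an illustration, assuming a local file; depending on the Hadoop configuration, Sqoop may resolve the path against HDFS instead). echo -n avoids appending a newline that could be treated as part of the password, and chmod 400 makes the file readable only by its owner.
echo -n "sqoop" > my-sqoop-password    // write the password without a trailing newline
chmod 400 my-sqoop-password    // owner-only read access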
Sqoop Export
Its function is to export data from HDFS into a relational database table, with the flowchart shown below.
Let's analyze the Sqoop export process. First, the user enters a sqoop export command. Sqoop fetches the relational database's schema and establishes a mapping between the Hadoop fields and the database table's fields. The command is then converted into a map-only MapReduce job, in which many map tasks read the data from HDFS in parallel and copy it all into the database.
Let's take a look at how Sqoop exports data from the command line; the syntax is as follows.
sqoop export --connect jdbc:mysql://db.ywendeng.net:3306/csdb_hadoop --username sqoop --password sqoop --table user --export-dir user
--connect: specifies the JDBC URL.
--username/--password: the username and password of the MySQL database.
--table: the database table to export into.
--export-dir: the directory on HDFS where the data is stored.
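If the files under --export-dir do not use Sqoop's default text layout, the export job must be told how to parse them. A minimal sketch, not from the original text, assuming the HDFS files use comma-separated fields:
sqoop export --connect jdbc:mysql://db.ywendeng.net:3306/csdb_hadoop --username sqoop --password sqoop --table user --export-dir user --input-fields-terminated-by ','    // parse input fields split by commas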
Here we introduce several special applications of Sqoop data export.
1. When Sqoop exports data to the database, rows are inserted one at a time by default, so export efficiency is very low. We can use Sqoop's batch export to improve efficiency; the syntax is as follows.
sqoop export -Dsqoop.export.records.per.statement=10 --connect jdbc:mysql://db.ywendeng.net:3306/csdb_hadoop --username sqoop --password sqoop --table user --export-dir user --batch
-Dsqoop.export.records.per.statement: groups 10 records into each INSERT statement. --batch: executes the statements in batches.
2. A practical problem can arise: if a map task fails during export, it is rescheduled and re-run on another node, and the data it had already written is written again, producing duplicate rows. Because a map task has no rollback policy, once it fails, the data already written to the database cannot be undone. Sqoop export provides a mechanism to guarantee atomicity: the --staging-table option specifies a temporary staging table. The export then happens in two steps. First, the data is written to the staging table in the database; if a map task fails during the export, the staging table's data is deleted and written again. Second, once all map tasks have succeeded, the data in the staging table is moved into the specified target table.
sqoop export --connect jdbc:mysql://db.ywendeng.net:3306/csdb_hadoop --username sqoop --password sqoop --table user --staging-table staging_user
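Note that the staging table must already exist with the same structure as the target table and should be empty when the job starts; the --clear-staging-table option empties it automatically. A sketch, assuming shell access to the MySQL host (the CREATE TABLE step is illustrative, not from the original text):
mysql -u sqoop -p -e "CREATE TABLE csdb_hadoop.staging_user LIKE csdb_hadoop.user;"    // staging table mirrors the target table
sqoop export --connect jdbc:mysql://db.ywendeng.net:3306/csdb_hadoop --username sqoop --password sqoop --table user --staging-table staging_user --clear-staging-table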
3. During a Sqoop export, if we want to update existing data, we can do so in the following two ways.
1) Use --update-key id to update existing rows.
sqoop export --connect jdbc:mysql://db.ywendeng.net:3306/csdb_hadoop --username sqoop --password sqoop --table user --update-key id
2) Use the --update-key id and --update-mode allowinsert options together: rows that already exist are updated, and rows that do not exist are inserted as new records.
sqoop export --connect jdbc:mysql://db.ywendeng.net:3306/csdb_hadoop --username sqoop --password sqoop --table user --update-key id --update-mode allowinsert
4. If the data in HDFS has many fields that are not needed, we can use --columns to export only the specified columns.
sqoop export --connect jdbc:mysql://db.ywendeng.net:3306/csdb_hadoop --username sqoop --password sqoop --table user --columns username,sex
5. When an exported field does not exist or is null, we handle it with --input-null-string and --input-null-non-string.
sqoop export --connect jdbc:mysql://db.ywendeng.net:3306/csdb_hadoop --username sqoop --password sqoop --table user --input-null-string '\\N' --input-null-non-string '\\N'
Sqoop combined with other systems
Sqoop can also be combined with systems such as Hive and HBase to import and export data; users need to add environment variables such as HBASE_HOME and HIVE_HOME in sqoop-env.sh.
1. Combining Sqoop with Hive is relatively simple: using the --hive-import option is enough.
sqoop import --connect jdbc:mysql://db.ywendeng.net:3306/csdb_hadoop --username sqoop --password sqoop --table user --hive-import
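A couple of companion options are often useful here, sketched below under the assumption of a hypothetical target table name: --hive-table names the Hive table to load into, and --create-hive-table makes Sqoop create it (failing if it already exists).
sqoop import --connect jdbc:mysql://db.ywendeng.net:3306/csdb_hadoop --username sqoop --password sqoop --table user --hive-import --hive-table user_hive --create-hive-table    // user_hive is a hypothetical table name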
2. Combining Sqoop with HBase is a little more involved: you need --hbase-table to specify the table name and --column-family to specify the column family.
sqoop import --connect jdbc:mysql://db.ywendeng.net:3306/csdb_hadoop --username sqoop --password sqoop --table user --hbase-table user --column-family city
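Two further HBase options are commonly needed, shown in this minimal sketch (not from the original text): --hbase-row-key chooses the column used as the HBase row key, and --hbase-create-table lets Sqoop create the target table if it does not exist.
sqoop import --connect jdbc:mysql://db.ywendeng.net:3306/csdb_hadoop --username sqoop --password sqoop --table user --hbase-table user --column-family city --hbase-row-key id --hbase-create-table    // assumption: id is the table's primary key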
Fourth, installation steps of Sqoop
Our Hadoop cluster runs Hadoop 2.2.0, so the Sqoop version should match it; otherwise problems will occur when using the Sqoop tools. Here we choose the sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz release. Installing Sqoop is simple and is done in the following steps.
1. First put the downloaded sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz into the /usr/java/ directory, then unpack the archive, rename the directory, and change its ownership.
[root@cs0 java]# tar zxvf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz    // unpack
[root@cs0 java]# rm sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz    // delete the archive
[root@cs0 java]# mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha sqoop    // rename the installation directory
[root@cs0 java]# chown -R hadoop:hadoop sqoop    // give the hadoop user ownership of sqoop
2. Switch to the sqoop/conf directory and execute the following commands.
[hadoop@cs0 java]$ cd sqoop/conf
[hadoop@cs0 conf]$ mv sqoop-env-template.sh sqoop-env.sh
Then open the file with the vi sqoop-env.sh command and add the following content.
#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/usr/java/hadoop-2.2.0-x64
#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/usr/java/hadoop-2.2.0-x64
#Set the path to where bin/hbase is available
#export HBASE_HOME=
#Set the path to where bin/hive is available
export HIVE_HOME=/usr/java/hive-1.0.0
#Set the path for where zookeeper config dir is
#export ZOOCFGDIR=
If your data transfers do not involve HBase or Hive, the HBase and Hive settings need not be added; and if the cluster has a separate ZooKeeper cluster, configure ZOOCFGDIR accordingly, otherwise leave it commented out.
3. Copy the required jar packages into the sqoop/lib directory. A Hadoop 2.2.0 installation needs three core jars imported: commons-cli-1.2.jar, hadoop-common-2.2.0.jar, and hadoop-mapreduce-client-core-2.2.0.jar. The database driver jar is also needed; since we are using a MySQL database here, we import mysql-connector-java-5.1.21.jar.
cp commons-cli-1.2.jar /usr/java/sqoop/lib
cp hadoop-common-2.2.0.jar /usr/java/sqoop/lib
cp hadoop-mapreduce-client-core-2.2.0.jar /usr/java/sqoop/lib
cp mysql-connector-java-5.1.21.jar /usr/java/sqoop/lib
4. Add environment variables.
[hadoop@cs0 java]$ vi ~/.bash_profile
export SQOOP_HOME=/usr/java/sqoop    // the Sqoop installation directory
export PATH=$PATH:$SQOOP_HOME/bin
After the environment variables have been added, execute the following command to make them take effect.
[hadoop@cs0 java]$ source ~/.bash_profile
5. Test run
[hadoop@cs0 java]$ sqoop list-databases \
> --connect jdbc:mysql://db.ywendeng.net:3306/djtdb_hadoop \
> --username sqoop \
> --password sqoop
... INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
... WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
... INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
information_schema
If the sqoop command executes successfully and lists the databases, the installation succeeded.