Combining Sqoop and HDFS
Below, we introduce how Sqoop imports data from a relational database into HDFS and exports data from HDFS back into a relational database.
Sqoop Import
Its function is to import data from a relational database into HDFS, with the flowchart shown below.
Let's analyze the Sqoop data import process. First, the user enters a Sqoop import command. Sqoop then retrieves metadata from the relational database, such as the schema of the table, which fields it contains, and the data type of each field. With this information, Sqoop converts the command into a map-only MapReduce job. The job contains many map tasks, and each map task reads a slice of the data from the database, so multiple map tasks copy the data concurrently and the whole dataset is quickly written to HDFS.
Let's take a look at how Sqoop imports data from the command line. The syntax is shown below.
sqoop import \
--connect jdbc:mysql://db.dajiangtai.net:3306/djtdb_hadoop \
--username sqoop \
--password sqoop \
--table user \
--target-dir /junior/sqoop/ \    // optional; if no directory is specified, the data is imported under /user by default
--where "sex='female'" \         // optional
--as-sequencefile \              // optional; if no format is specified, the data defaults to text format
--num-mappers <n> \              // optional; this value should not be too large
--null-string '\\n' \            // optional
--null-non-string '\\n'          // optional
--connect: Specifies the JDBC URL.
--username/--password: The username and password of the MySQL database.
--table: The database table to read.
--target-dir: Imports the data into the specified HDFS directory; if not specified, the directory name defaults to the database table name.
--where: Filters the data to be imported from the database.
--as-sequencefile: Specifies the file format for the imported data.
--num-mappers: Specifies the number of concurrent map tasks.
--null-string, --null-non-string: Specify the string (here '\\n') that NULL database fields are converted to when written to HDFS, since storing NULL fields directly would waste space.
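After the import completes, we can verify the result in HDFS. A minimal check, assuming the --target-dir used above and the default text output format (the part-m-* file names are produced by the map tasks):

hdfs dfs -ls /junior/sqoop
hdfs dfs -cat /junior/sqoop/part-m-00000 | head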
Next, we introduce several special applications of Sqoop data import.
1. Sqoop does not need to re-import all of the previous data into HDFS each time; it only needs to append the newly added data. Let's see how to import incremental data.
sqoop import \
--connect jdbc:mysql://db.dajiangtai.net:3306/djtdb_hadoop \
--username sqoop \
--password sqoop \
--table user \
--incremental append \    // import only the incremental data
--check-column id \       // use the primary key id as the criterion
--last-value 999          // import new records whose id is greater than 999
Combining the above three options achieves incremental data import.
2. During Sqoop data import, typing the password in plain text on the command line is a security risk. We can avoid this risk in either of the following two ways.
1) -P: Append -P to the end of the Sqoop command line, and the user is prompted to enter the password; the input is not echoed, which keeps it from being exposed. The Sqoop command does not run until the password is entered correctly.
sqoop import \
--connect jdbc:mysql://db.dajiangtai.net:3306/djtdb_hadoop \
--username sqoop \
--table user \
-P
2) --password-file: Specifies a file that stores the password, from which Sqoop reads it. We can make this file readable only by ourselves to prevent the password from leaking, as sketched below.
sqoop import \
--connect jdbc:mysql://db.dajiangtai.net:3306/djtdb_hadoop \
--username sqoop \
--table user \
--password-file my-sqoop-password
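A minimal sketch of preparing such a password file (the file name my-sqoop-password comes from the command above; the HDFS destination path is illustrative):

# write the password without a trailing newline, since Sqoop reads the entire file content as the password
echo -n "sqoop" > my-sqoop-password
# make the file readable only by its owner
chmod 400 my-sqoop-password
# optional: Sqoop can read the password file from the local filesystem or from HDFS
hdfs dfs -put my-sqoop-password /user/sqoop/my-sqoop-password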
Sqoop Export
Its function is to export data from HDFS into a relational database table, with the flowchart shown below.
Let's analyze the Sqoop data export process. First, the user enters a Sqoop export command. Sqoop retrieves the schema of the relational database table and establishes a mapping between the Hadoop fields and the database table fields. The command is then converted into a map-based MapReduce job; the many map tasks in the job read data from HDFS in parallel and copy the whole dataset into the database.
Let's take a look at how Sqoop exports data from the command line. The syntax is shown below.
sqoop export \
--connect jdbc:mysql://db.dajiangtai.net:3306/djtdb_hadoop \
--username sqoop \
--password sqoop \
--table user \
--export-dir user
--connect: Specifies the JDBC URL.
--username/--password: The username and password of the MySQL database.
--table: The database table to write to.
--export-dir: The HDFS directory where the data to be exported is stored.
Next, we introduce several special applications of Sqoop data export.
1. By default, Sqoop export inserts records into the database one at a time, which is very inefficient. We can use batch export to improve efficiency; the specific syntax is as follows.
sqoop export \
-Dsqoop.export.records.per.statement=10 \
--connect jdbc:mysql://db.dajiangtai.net:3306/djtdb_hadoop \
--username sqoop \
--password sqoop \
--table user \
--export-dir user \
--batch
-Dsqoop.export.records.per.statement: Specifies that 10 records are written per SQL statement. --batch: Uses JDBC batch mode for the inserts.
2. In practice, a problem can occur: if a map task fails during export, it is re-run on another node, and the data it had already written is exported again, producing duplicate records. Because map tasks have no rollback policy, data already written to the database cannot be undone once a task fails. Sqoop export therefore provides a mechanism to guarantee atomicity, using the --staging-table option to specify a temporary staging table. The export then proceeds in two steps. First, the data is written into the staging table in the database; if a map task fails during this step, the staging table's data is deleted and the export is retried. Second, once all map tasks have succeeded, the data in the staging table is moved into the specified target table, as sketched below.
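A minimal sketch of such an export, assuming a staging table named user_staging has already been created in the database with the same schema as user (the table name is illustrative; --clear-staging-table empties the staging table before the job starts):

sqoop export \
--connect jdbc:mysql://db.dajiangtai.net:3306/djtdb_hadoop \
--username sqoop \
--password sqoop \
--table user \
--staging-table user_staging \
--clear-staging-table \
--export-dir user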
3. During Sqoop export, if we want to update existing data, we can do so in either of the following two ways.
1) Use --update-key id to update existing records only (see the sketch after the command below).
2) Use --update-key id together with --update-mode allowinsert: if a record already exists it is updated, and if it does not exist a new record is inserted.
sqoop export \
--connect jdbc:mysql://db.dajiangtai.net:3306/djtdb_hadoop \
--username sqoop \
--password sqoop \
--table user \
--update-key id \
--update-mode allowinsert
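The command above corresponds to the second way. For the first way (update existing records only, without inserting new ones), a minimal sketch looks like this; updateonly is Sqoop's default update mode, so the last option can also be omitted:

sqoop export \
--connect jdbc:mysql://db.dajiangtai.net:3306/djtdb_hadoop \
--username sqoop \
--password sqoop \
--table user \
--update-key id \
--update-mode updateonly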
4. If the dataset in HDFS is large and contains many fields that are not needed, we can use --columns to specify which columns to insert.
sqoop export \
--connect jdbc:mysql://db.dajiangtai.net:3306/djtdb_hadoop \
--username sqoop \
--password sqoop \
--table user \
--columns username,sex
5. When an exported field is missing or null, we use --input-null-string and --input-null-non-string to handle it.
sqoop export \
--connect jdbc:mysql://db.dajiangtai.net:3306/djtdb_hadoop \
--username sqoop \
--password sqoop \
--table user \
--input-null-string '\\n' \
--input-null-non-string '\\n'
Sqoop Combined with Other Systems
Sqoop can also be combined with Hive, HBase, and other systems for data import and export. To do so, users need to set environment variables such as HBASE_HOME and HIVE_HOME in sqoop-env.sh, as sketched below.
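A minimal sketch of the relevant sqoop-env.sh entries, assuming Hadoop, Hive, and HBase are installed under /usr/local (adjust the paths to your own installation):

# sqoop-env.sh (in Sqoop's conf directory)
export HADOOP_COMMON_HOME=/usr/local/hadoop   # assumed Hadoop installation path
export HADOOP_MAPRED_HOME=/usr/local/hadoop   # assumed MapReduce installation path
export HIVE_HOME=/usr/local/hive              # assumed Hive installation path
export HBASE_HOME=/usr/local/hbase            # assumed HBase installation path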
1. Combining Sqoop with Hive is relatively simple: just use the --hive-import option.
sqoop import \
--connect jdbc:mysql://db.dajiangtai.net:3306/djtdb_hadoop \
--username sqoop \
--password sqoop \
--table user \
--hive-import
2. Combining Sqoop with HBase is a bit more involved: you need --hbase-table to specify the HBase table name and --column-family to specify the column family name.
sqoop import \
--connect jdbc:mysql://db.dajiangtai.net:3306/djtdb_hadoop \
--username sqoop \
--password sqoop \
--table user \
--hbase-table user \
--column-family city
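Note that the HBase table and column family generally need to exist before the import (alternatively, Sqoop's --hbase-create-table option can create them). A minimal sketch of creating them from the HBase shell, using the names from the command above:

echo "create 'user', 'city'" | hbase shell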
Specific Applications of Sqoop Combined with Various Systems