Apache Sqoop is a tool for migrating data between structured data stores, such as relational databases, and Hadoop. It takes advantage of the parallelism of MapReduce to speed up batch data transfer, and relies on MapReduce for fault tolerance.
Project Address: http://sqoop.apache.org/
So far two versions have evolved: Sqoop1 and Sqoop2.
The latest version of Sqoop1 is 1.4.5; the latest version of Sqoop2 is 1.99.3. The two are incompatible, and Sqoop2's functionality is not yet complete, so it is not yet suitable for deployment in a production environment.
Databases supported by Sqoop:

Database | Version | --direct support? | Connect string matches
---------|---------|-------------------|-----------------------
HSQLDB | 1.8.0+ | No | jdbc:hsqldb:*//
MySQL | 5.0+ | Yes | jdbc:mysql://
Oracle | 10.2.0+ | No | jdbc:oracle:*//
PostgreSQL | 8.3+ | Yes (import only) | jdbc:postgresql://
~/work$ sudo apt-get install sqoop
~/work$ sqoop help
usage: sqoop COMMAND [ARGS]
Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information
See 'sqoop help COMMAND' for information on a specific command.
import migrates a relational database table into HDFS:
~/work$ sqoop import --connect jdbc:mysql://192.168.81.176/hivemeta2db --username root --password passwd --table SDS
~/work$ hadoop fs -ls /user/guojian/sds
Found 5 items
-rw-r--r--   3 guojian cug_test     0 2014-09-11 16:04 /user/guojian/sds/_SUCCESS
-rw-r--r--   3 guojian cug_test   483 2014-09-11 16:03 /user/guojian/sds/part-m-00000.snappy
-rw-r--r--   3 guojian cug_test   504 2014-09-11 16:04 /user/guojian/sds/part-m-00001.snappy
-rw-r--r--   3 guojian cug_test  1001 2014-09-11 16:03 /user/guojian/sds/part-m-00002.snappy
-rw-r--r--   3 guojian cug_test   952 2014-09-11 16:03 /user/guojian/sds/part-m-00003.snappy
You can set the degree of parallelism with -m, i.e. the number of map tasks, which determines the number of output files.
The default target directory is /user/${user.name}/${tablename}; you can set a different directory on HDFS with --target-dir.
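For example, a sketch that combines the two options (the mapper count and target directory name are illustrative):
sqoop import --connect jdbc:mysql://192.168.81.176/hivemeta2db --username root --password passwd --table SDS -m 8 --target-dir /user/guojian/sds_parallel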
If you want to import all the tables of a database into HDFS, you can use the import-all-tables command:
sqoop import-all-tables --connect jdbc:mysql://192.168.81.176/hivemeta2db --username root --password passwd
If you only want specific columns, use --columns:
sqoop import --connect jdbc:mysql://192.168.81.176/hivemeta2db --username root --password passwd --table SDS --columns "sd_id,cd_id,location"
To store the imported data as SequenceFiles, with the generated class named com.ctrip.sds:
sqoop import --connect jdbc:mysql://192.168.81.176/hivemeta2db --username root --password passwd --table SDS --class-name com.ctrip.sds --as-sequencefile
You can specify delimiters when importing as text:
sqoop import --connect jdbc:mysql://192.168.81.176/hivemeta2db --username root --password passwd --table SDS --fields-terminated-by '\t' --lines-terminated-by '\n' --optionally-enclosed-by '\"'
You can specify filter criteria:
sqoop import --connect jdbc:mysql://192.168.81.176/hivemeta2db --username root --password passwd --table SDS --where "sd_id > 100"
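Although not covered above, the import tool also accepts a free-form query via --query instead of --table; a minimal sketch (the split column and target directory are illustrative, and the literal $CONDITIONS placeholder is required by Sqoop):
sqoop import --connect jdbc:mysql://192.168.81.176/hivemeta2db --username root --password passwd --query 'SELECT sd_id, cd_id, location FROM SDS WHERE $CONDITIONS' --split-by sd_id --target-dir /user/guojian/sds_query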
export is the reverse of import: it exports data from HDFS into a relational database:
sqoop export --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd --table SDS --export-dir /user/guojian/sds
In the example above, the SDS table must already be created in the sqoop database; otherwise the export operation fails immediately.
Since Sqoop performs the export with map tasks, and the map tasks are independent of each other, there is no notion of a transaction, so some map tasks may fail while others succeed. To work around this, Sqoop offers a compromise: specify an intermediate staging table, and move the data from the staging table into the result table only after all maps succeed.
This is done with --staging-table <staging-table-name>. The staging table structure also needs to be created in advance:
sqoop export --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd --table SDS --export-dir /user/guojian/sds --staging-table sds_tmp
Note that the staging table cannot be used together with --direct, --update-key, or --call (exporting to a stored procedure).
Validation of the behavior:
(1) The data is first written to the sds_tmp table; after the export succeeds, it is moved from sds_tmp into the SDS result table and sds_tmp is cleared.
(2) If a map task fails, the maps that succeeded have already written their data to the staging table; the export task fails and the staging table's data is preserved.
(3) If the staging table already contains data, the export operation fails immediately; you can use --clear-staging-table to clear the staging table before execution, as in the sketch below.
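A sketch of an export that clears the staging table before running (same connection, table, and staging table as above):
sqoop export --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd --table SDS --export-dir /user/guojian/sds --staging-table sds_tmp --clear-staging-table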
Export options:

Argument | Description
---------|------------
--direct | Use the mysqlimport tool to export to MySQL directly
--export-dir <dir> | HDFS source path of the data to export
-m,--num-mappers <n> | Export in parallel with n map tasks
--table <table-name> | Target table to export into
--call <stored-proc-name> | Call a stored procedure for each record
--update-key <col-name> | Column used as the update key, to update rows that already exist in the database
--update-mode <mode> | Update mode: updateonly (default) or allowinsert; the former only updates existing rows, the latter also inserts new rows
--input-null-string <null-string> | The string to be interpreted as null for string columns
--input-null-non-string <null-string> | The string to be interpreted as null for non-string columns
--staging-table <staging-table-name> | Specify the intermediate staging table
--clear-staging-table | Clear the staging table's data before the export runs
--batch | Use batch mode for underlying statement execution
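A sketch of an update-style export using the options in the table (the key column sd_id is illustrative; recall that --update-key cannot be combined with a staging table):
sqoop export --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd --table SDS --export-dir /user/guojian/sds --update-key sd_id --update-mode allowinsert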
create-hive-table imports a relational database table definition into a Hive table.

Parameters | Description
-----------|------------
--hive-home <dir> | Hive installation directory; overrides the default Hive directory
--hive-overwrite | Overwrite existing data in the Hive table
--create-hive-table | Fail the job if the target Hive table already exists
--table | Name of the source relational database table

sqoop create-hive-table --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd --table SDS --hive-table sds_bak
By default, sds_bak is created in Hive's default database.
This step depends on HCatalog; you need to install HCatalog first, otherwise the following error is reported:
Hive history file=/tmp/guojian/hive_job_log_cfbe2de9-a358-4130-945c-b97c0add649d_1628102887.txt
FAILED: ParseException line 1:44 mismatched input ')' expecting Identifier near '(' in column specification
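Relatedly, instead of only creating the table definition, the import tool can load the data into Hive in one step with --hive-import; a sketch (the Hive table name sds_hive is illustrative):
sqoop import --connect jdbc:mysql://192.168.81.176/hivemeta2db --username root --password passwd --table SDS --hive-import --hive-table sds_hive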
list-databases lists the databases available on a server:
sqoop list-databases --connect jdbc:mysql://192.168.81.176/ --username root --password passwd
list-tables lists the tables in a database:
sqoop list-tables --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd
codegen maps a relational database table to a Java source file, Java class, and related jar package:
sqoop codegen --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd --table SDS
Note: /tmp/sqoop-guojian/compile/d58f607b046a061593ba708ec5f3d608/SDS.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
~/work$ ll /tmp/sqoop-guojian/compile/d58f607b046a061593ba708ec5f3d608/
total 48
drwxrwxr-x 2 guojian guojian  4096 Sep 12 14:15 ./
drwxrwxr-x   guojian guojian  4096 Sep 12 14:15 ../
-rw-rw-r-- 1 guojian guojian 11978 Sep 12 14:15 SDS.class
-rw-rw-r-- 1 guojian guojian  4811 Sep 12 14:15 SDS.jar
-rw-rw-r-- 1 guojian guojian 17525 Sep 12 14:15 SDS.java
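A sketch of directing the generated source and compiled artifacts to specific directories with --outdir and --bindir (the paths are illustrative):
sqoop codegen --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd --table SDS --outdir /tmp/sqoop-codegen/src --bindir /tmp/sqoop-codegen/bin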
merge is a tool that merges two datasets, overwriting the old values for keys that appear in both.

Parameters | Description
-----------|------------
--class-name <class> | Record class to use during the merge, generated by the codegen tool
--jar-file <file> | Jar containing the record class, generated by the codegen tool
--merge-key <col> | Column name to use as the merge key
--new-data <path> | Path of the newer dataset
--onto <path> | Path of the older dataset
--target-dir <path> | Target output directory for the merged results
sqoop merge --new-data /user/guojian/sds --onto /user/guojian/sqoop --target-dir /user/guojian/sds_new --jar-file SDS.jar --class-name SDS --merge-key sd_id
Note that if --target-dir is set to a directory that already exists, Sqoop exits with an error.
eval lets users quickly run SQL statements against a database, for example to check that a SQL statement is correct before performing an import:
sqoop eval --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd --query "SELECT sd_id, cd_id, location FROM SDS LIMIT 10"
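eval is not limited to SELECT; a sketch of running a modifying statement with the -e shorthand (the inserted values are purely illustrative):
sqoop eval --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd -e "INSERT INTO SDS (sd_id, cd_id, location) VALUES (1001, 1, '/tmp/demo')"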
The job tool is used to create and manage saved Sqoop jobs.

Parameters | Description
-----------|------------
--create <job-id> | Create a new saved job
--delete <job-id> | Delete a saved job
--exec <job-id> | Execute a job saved with --create
--show <job-id> | Show the parameters of a saved job
--list | List all saved jobs
Example:
sqoop job --create myimportjob -- import --connect jdbc:mysql://192.168.81.176/hivemeta2db --username root --password passwd --table TBLS
~/work$ sqoop job --list
Available jobs:
  myimportjob
~/work$ sqoop job --show myimportjob
Job: myimportjob
Tool: import
Options:
----------------------------
verbose = false
db.connect.string = jdbc:mysql://192.168.81.176/hivemeta2db
codegen.output.delimiters.escape = 0
codegen.output.delimiters.enclose.required = false
codegen.input.delimiters.field = 0
hbase.create.table = false
db.require.password = true
hdfs.append.dir = false
db.table = TBLS
import.fetch.size = null
accumulo.create.table = false
codegen.input.delimiters.escape = 0
codegen.input.delimiters.enclose.required = false
db.username = root
codegen.output.delimiters.record = 10
import.max.inline.lob.size = 16777216
hbase.bulk.load.enabled = false
hcatalog.create.table = false
db.clear.staging.table = false
codegen.input.delimiters.record = 0
enable.compression = false
hive.overwrite.table = false
hive.import = false
codegen.input.delimiters.enclose = 0
accumulo.batch.size = 10240000
hive.drop.delims = false
codegen.output.delimiters.enclose = 0
hdfs.delete-target.dir = false
codegen.output.dir = .
codegen.auto.compile.dir = true
mapreduce.num.mappers = 4
accumulo.max.latency = 5000
import.direct.split.size = 0
codegen.output.delimiters.field = 44
export.new.update = UpdateOnly
incremental.mode = None
hdfs.file.format = TextFile
codegen.compile.dir = /tmp/sqoop-guojian/compile/bd9c7f7b651276067b3f7b341b7fa4cb
direct.import = false
hive.fail.table.exists = false
db.batch = false
Execution:
sqoop job --exec myimportjob
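A saved job that is no longer needed can be removed with the --delete option from the table above:
sqoop job --delete myimportjob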
The metastore tool configures shared metadata for Sqoop jobs, so that multiple users can define and execute saved jobs against the same metastore. By default the metadata is stored in ~/.sqoop.
Start it: sqoop metastore
Shut it down: sqoop metastore --shutdown
The storage location of the metastore files is set by sqoop.metastore.server.location in conf/sqoop-site.xml and points to a local directory.
The metastore can be accessed over TCP/IP; the port is configured with sqoop.metastore.server.port and defaults to 16000.
Clients can be pointed at the metastore either through sqoop.metastore.client.autoconnect.url or with the --meta-connect option, using a URL of the form jdbc:hsqldb:hsql://<server-name>:<port>/sqoop, for example jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop.
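A sketch of saving and running a job against the shared metastore with --meta-connect (the metastore host follows the example URL above; the job name mysharedjob is illustrative):
sqoop job --meta-connect jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop --create mysharedjob -- import --connect jdbc:mysql://192.168.81.176/hivemeta2db --username root --password passwd --table TBLS
sqoop job --meta-connect jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop --exec mysharedjob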
See http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html for more information.