Sqoop Introduction and Use


Apache Sqoop is a tool for migrating data between structured data stores, such as relational databases, and Hadoop. It leverages the parallelism of MapReduce to speed up bulk data transfer, and it also relies on MapReduce for fault tolerance.

Project Address: http://sqoop.apache.org/

So far, two major versions have evolved: Sqoop1 and Sqoop2.

The latest Sqoop1 release is 1.4.5, and the latest Sqoop2 release is 1.99.3. The two lines are incompatible, and Sqoop2 is still feature-incomplete and not yet suitable for deployment in a production environment.

Sqoop supported databases:

Database     Version    --direct support?    Connect string matches
HSQLDB       1.8.0+     No                   jdbc:hsqldb:*//
MySQL        5.0+       Yes                  jdbc:mysql://
Oracle       10.2.0+    No                   jdbc:oracle:*//
PostgreSQL   8.3+       Yes (import only)    jdbc:postgresql://

$ sudo apt-get install sqoop

$ sqoop help

usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.

import migrates data from a relational database into HDFS.

$ sqoop import --connect jdbc:mysql://192.168.81.176/hivemeta2db --username root --password passwd --table SDS

$ hadoop fs -ls /user/guojian/sds
Found 5 items
-rw-r--r--   3 guojian cug_test          0 2014-09-11 16:04 /user/guojian/sds/_SUCCESS
-rw-r--r--   3 guojian cug_test        483 2014-09-11 16:03 /user/guojian/sds/part-m-00000.snappy
-rw-r--r--   3 guojian cug_test        504 2014-09-11 16:04 /user/guojian/sds/part-m-00001.snappy
-rw-r--r--   3 guojian cug_test       1001 2014-09-11 16:03 /user/guojian/sds/part-m-00002.snappy
-rw-r--r--   3 guojian cug_test        952 2014-09-11 16:03 /user/guojian/sds/part-m-00003.snappy

You can set the degree of parallelism with -m, i.e. the number of map tasks, which also determines the number of output files.

The default target directory is /user/${user.name}/${tablename}; you can set a different HDFS target directory with --target-dir.
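For example, a sketch that combines both options (the target path here is only illustrative):

sqoop import --connect jdbc:mysql://192.168.81.176/hivemeta2db --username root --password passwd --table SDS -m 8 --target-dir /user/guojian/sds_import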

If you want to import all tables of a database into HDFS, use the import-all-tables command.

sqoop import-all-tables --connect jdbc:mysql://192.168.81.176/hivemeta2db --username root --password passwd

If you only need specific columns, use --columns:

sqoop import --connect jdbc:mysql://192.168.81.176/hivemeta2db --username root --password passwd --table SDS --columns "sd_id,cd_id,location"

To store the imported data as SequenceFiles and name the generated class com.ctrip.sds:

sqoop import --connect jdbc:mysql://192.168.81.176/hivemeta2db --username root --password passwd --table SDS --class-name com.ctrip.sds --as-sequencefile

You can specify a delimiter when importing text:

sqoop import --connect jdbc:mysql://192.168.81.176/hivemeta2db --username root --password passwd --table SDS --fields-terminated-by '\t' --lines-terminated-by '\n' --optionally-enclosed-by '\"'

You can specify the filter criteria:

sqoop import --connect jdbc:mysql://192.168.81.176/hivemeta2db --username root --password passwd --table SDS --where "sd_id > 100"


export is the reverse of import: it moves data from HDFS into a relational database.

sqoop export --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd --table SDS --export-dir /user/guojian/sds

In the example above, the SDS table must already exist in the sqoop database; otherwise the export fails immediately.

Because Sqoop performs the export with independent map tasks and there is no transaction spanning them, some map tasks may fail and leave partially exported data behind. To address this, Sqoop offers a compromise: you can specify an intermediate staging table, and only after all data has been written to the staging table is it moved into the result table.

The staging table is specified with --staging-table <staging-table-name>; its structure also needs to be created in advance:

sqoop export --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd --table SDS --export-dir /user/guojian/sds --staging-table sds_tmp

Note that the staging table cannot be used together with --direct, --update-key, or --call (exporting via a stored procedure).

Validation results:

(1) The data is first written to the sds_tmp table; once the export succeeds, the contents of sds_tmp are moved into the SDS result table and sds_tmp is emptied.

(2) If a map task fails, the map tasks that succeeded have already written their data to the staging table; the export job fails and the staging table's data is preserved.

(3) If the staging table already contains data, the export fails immediately; use --clear-staging-table to clear the intermediate table before the export runs.
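For example, a sketch of re-running the export with the staging table cleared first (same illustrative connection and paths as above):

sqoop export --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd --table SDS --export-dir /user/guojian/sds --staging-table sds_tmp --clear-staging-table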

Export options:

Argument                               Description
--direct                               Export to MySQL directly using the mysqlimport tool
--export-dir <dir>                     HDFS path of the data to export
-m,--num-mappers <n>                   Export in parallel with n map tasks
--table <table-name>                   Target table to export into
--call <stored-proc-name>              Call a stored procedure for each record
--update-key <col-name>                Column used to update rows that already exist in the database
--update-mode <mode>                   Update mode: updateonly (default) or allowinsert;
                                       the former only updates existing rows, the latter also inserts new rows
--input-null-string <null-string>      The string to be interpreted as null for string columns
--input-null-non-string <null-string>  The string to be interpreted as null for non-string columns
--staging-table <staging-table-name>   Specify an intermediate staging table
--clear-staging-table                  Clear the staging table before performing the export
--batch                                Use batch mode for underlying statement execution
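As an illustration of the update options, a sketch that updates existing rows by sd_id and inserts the rest (connection details reused from the earlier examples):

sqoop export --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd --table SDS --export-dir /user/guojian/sds --update-key sd_id --update-mode allowinsert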


create-hive-table imports a relational database table definition into a Hive table.

Argument             Description
--hive-home <dir>    Hive installation directory; overrides the default Hive directory
--hive-overwrite     Overwrite existing data in the Hive table
--create-hive-table  Fail the job if the target Hive table already exists
--table              Specify the source relational database table name
sqoop create-hive-table --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd --table SDS --hive-table sds_bak

By default, sds_bak is created in Hive's default database.
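create-hive-table only creates the table definition. If you also want to load the data into Hive in the same step, sqoop import supports --hive-import; a sketch, reusing the illustrative connection above:

sqoop import --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd --table SDS --hive-import --hive-table sds_bak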

This step depends on HCatalog; you need to install HCatalog first, otherwise the following error is reported:

Hive history file=/tmp/guojian/hive_job_log_cfbe2de9-a358-4130-945c-b97c0add649d_1628102887.txt
FAILED: ParseException line 1:44 mismatched input ')' expecting Identifier near '(' in column specification

list-databases lists the databases available on a server:

sqoop list-databases --connect jdbc:mysql://192.168.81.176/ --username root --password passwd

list-tables lists the tables in a database:

sqoop list-tables --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd

codegen maps a relational database table to a Java source file, compiled class, and related jar package:

sqoop codegen --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd --table SDS
Note: /tmp/sqoop-guojian/compile/d58f607b046a061593ba708ec5f3d608/SDS.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.

$ ll /tmp/sqoop-guojian/compile/d58f607b046a061593ba708ec5f3d608/
total 48
drwxrwxr-x  2 guojian guojian  4096 Sep 12 14:15 ./
drwxrwxr-x    guojian guojian  4096 Sep 12 14:15 ../
-rw-rw-r--  1 guojian guojian 11978 Sep 12 14:15 SDS.class
-rw-rw-r--  1 guojian guojian  4811 Sep 12 14:15 SDS.jar
-rw-rw-r--  1 guojian guojian 17525 Sep 12 14:15 SDS.java

merge is a tool that merges two datasets; where the same key appears in both, the newer values overwrite the older ones.

Argument              Description
--class-name <class>  Name of the record class to use during the merge (generated by codegen)
--jar-file <file>     Jar to use during the merge, generated by the codegen tool
--merge-key <col>     Column name to use as the merge key
--new-data <path>     Directory containing the newer dataset
--onto <path>         Directory containing the older dataset
--target-dir <path>   Target output directory
sqoop merge --new-data /user/guojian/sds --onto /user/guojian/sqoop --target-dir /user/guojian/sds_new --jar-file SDS.jar --class-name SDS --merge-key sd_id

Note that if --target-dir is set to a directory that already exists, Sqoop exits with an error.

eval lets you quickly run SQL statements against a database. This allows you to check that a SQL statement is correct before performing an import.

sqoop eval --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd --query "SELECT sd_id, cd_id, location FROM SDS LIMIT 10"
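eval is not limited to SELECT statements; it can also run statements that modify data. A sketch, with made-up values:

sqoop eval --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd --query "INSERT INTO SDS (sd_id, cd_id, location) VALUES (1001, 1, '/tmp/example')"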

The job tool is used to create and manage saved Sqoop jobs.

Argument           Description
--create <job-id>  Create a new saved job
--delete <job-id>  Delete a saved job
--exec <job-id>    Execute a job saved with --create
--show <job-id>    Show the parameters of a saved job
--list             List all saved jobs

Example:

sqoop job --create myimportjob -- import --connect jdbc:mysql://192.168.81.176/hivemeta2db --username root --password passwd --table TBLS

$ sqoop job --list
Available jobs:
  myimportjob

$ sqoop job --show myimportjob
Job: myimportjob
Tool: import
Options:
----------------------------
verbose = false
db.connect.string = jdbc:mysql://192.168.81.176/hivemeta2db
codegen.output.delimiters.escape = 0
codegen.output.delimiters.enclose.required = false
codegen.input.delimiters.field = 0
hbase.create.table = false
db.require.password = true
hdfs.append.dir = false
db.table = TBLS
import.fetch.size = null
accumulo.create.table = false
codegen.input.delimiters.escape = 0
codegen.input.delimiters.enclose.required = false
db.username = root
codegen.output.delimiters.record = 10
import.max.inline.lob.size = 16777216
hbase.bulk.load.enabled = false
hcatalog.create.table = false
db.clear.staging.table = false
codegen.input.delimiters.record = 0
enable.compression = false
hive.overwrite.table = false
hive.import = false
codegen.input.delimiters.enclose = 0
accumulo.batch.size = 10240000
hive.drop.delims = false
codegen.output.delimiters.enclose = 0
hdfs.delete-target.dir = false
codegen.output.dir = .
codegen.auto.compile.dir = true
mapreduce.num.mappers = 4
accumulo.max.latency = 5000
import.direct.split.size = 0
codegen.output.delimiters.field = 44
export.new.update = UpdateOnly
incremental.mode = None
hdfs.file.format = TextFile
codegen.compile.dir = /tmp/sqoop-guojian/compile/bd9c7f7b651276067b3f7b341b7fa4cb
direct.import = false
hive.fail.table.exists = false
db.batch = false

Execution:

sqoop job --exec myimportjob
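Arguments supplied after -- at execution time override the saved job's options; for example (a sketch, the alternate username is hypothetical):

sqoop job --exec myimportjob -- --username someuser -P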

The metastore holds shared metadata for Sqoop jobs, so that multiple users can define and execute Sqoop jobs against the same metastore. By default the metadata is stored under ~/.sqoop.

Start: sqoop metastore

Shut down: sqoop metastore --shutdown

The storage location of the metastore files is configured by sqoop.metastore.server.location in conf/sqoop-site.xml, which points to a local file.

The metastore can be accessed over TCP/IP; the port number is configured with sqoop.metastore.server.port and defaults to 16000.

Clients connect to the metastore either by setting sqoop.metastore.client.autoconnect.url or by passing --meta-connect, with a value of the form jdbc:hsqldb:hsql://<server-name>:<port>/sqoop, for example jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop.
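For example, a sketch of creating a saved job against the shared metastore (using the example server URL above):

sqoop job --meta-connect jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop --create myimportjob -- import --connect jdbc:mysql://192.168.81.176/hivemeta2db --username root --password passwd --table TBLS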

See http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html for more information.

