Sqoop Introduction and Use


Apache Sqoop is a tool for migrating data between structured data stores, such as relational databases, and Hadoop. It leverages the parallelism of MapReduce to speed up bulk data transfer, and it also relies on MapReduce for fault tolerance.

Project Address: http://sqoop.apache.org/

So far, two major versions have evolved: Sqoop1 and Sqoop2.

The latest Sqoop1 release is 1.4.5, and the latest Sqoop2 release is 1.99.3. The two lines are incompatible, and Sqoop2 is still feature-incomplete and not yet suitable for deployment in a production environment.

Sqoop supported databases:

Database     Version    --direct support?    Connect string matches
HSQLDB       1.8.0+     No                   jdbc:hsqldb:*//
MySQL        5.0+       Yes                  jdbc:mysql://
Oracle       10.2.0+    No                   jdbc:oracle:*//
PostgreSQL   8.3+       Yes (import only)    jdbc:postgresql://

$ sudo apt-get install sqoop

$ sqoop help

usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.

import migrates data from a relational database into HDFS.

$ sqoop import --connect jdbc:mysql://192.168.81.176/hivemeta2db --username root --password passwd --table SDS

$ hadoop fs -ls /user/guojian/sds
Found 5 items
-rw-r--r--   3 guojian cug_test          0 2014-09-11 16:04 /user/guojian/sds/_SUCCESS
-rw-r--r--   3 guojian cug_test        483 2014-09-11 16:03 /user/guojian/sds/part-m-00000.snappy
-rw-r--r--   3 guojian cug_test        504 2014-09-11 16:04 /user/guojian/sds/part-m-00001.snappy
-rw-r--r--   3 guojian cug_test       1001 2014-09-11 16:03 /user/guojian/sds/part-m-00002.snappy
-rw-r--r--   3 guojian cug_test        952 2014-09-11 16:03 /user/guojian/sds/part-m-00003.snappy

You can set the degree of parallelism with -m, i.e. the number of map tasks, which also determines the number of output files.

The default target directory is /user/${user.name}/${tablename}; you can set a different HDFS target directory with --target-dir.
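For example, a sketch that combines both options (the target path here is only illustrative):

sqoop import --connect jdbc:mysql://192.168.81.176/hivemeta2db --username root --password passwd --table SDS -m 8 --target-dir /user/guojian/sds_import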

If you want to import all tables of a database into HDFS, use the import-all-tables command.

sqoop import-all-tables --connect jdbc:mysql://192.168.81.176/hivemeta2db --username root --password passwd

If you only need specific columns, use --columns:

sqoop import --connect jdbc:mysql://192.168.81.176/hivemeta2db --username root --password passwd --table SDS --columns "sd_id,cd_id,location"

To store the imported data as SequenceFiles and name the generated class com.ctrip.sds:

sqoop import --connect jdbc:mysql://192.168.81.176/hivemeta2db --username root --password passwd --table SDS --class-name com.ctrip.sds --as-sequencefile

You can specify a delimiter when importing text:

sqoop import --connect jdbc:mysql://192.168.81.176/hivemeta2db --username root --password passwd --table SDS --fields-terminated-by '\t' --lines-terminated-by '\n' --optionally-enclosed-by '\"'

You can specify the filter criteria:

sqoop import --connect jdbc:mysql://192.168.81.176/hivemeta2db --username root --password passwd --table SDS --where "sd_id > 100"


export is the reverse of import: it moves data from HDFS into a relational database.

sqoop export --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd --table SDS --export-dir /user/guojian/sds

In the example above, the SDS table must already exist in the sqoop database; otherwise the export fails immediately.

Because Sqoop performs the export with independent map tasks and there is no transaction spanning them, some map tasks may fail and leave partially exported data behind. To address this, Sqoop offers a compromise: you can specify an intermediate staging table, and only after all data has been written to the staging table is it moved into the result table.

The staging table is specified with --staging-table <staging-table-name>; its structure also needs to be created in advance:

sqoop export --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd --table SDS --export-dir /user/guojian/sds --staging-table sds_tmp

Note that the staging table cannot be used together with --direct, --update-key, or --call (exporting via a stored procedure).

Validation results:

(1) The data is first written to the sds_tmp table; once the export succeeds, the contents of sds_tmp are moved into the SDS result table and sds_tmp is emptied.

(2) If a map task fails, the map tasks that succeeded have already written their data to the staging table; the export job fails and the staging table's data is preserved.

(3) If the staging table already contains data, the export fails immediately; use --clear-staging-table to clear the intermediate table before the export runs.
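For example, a sketch of re-running the export with the staging table cleared first (same illustrative connection and paths as above):

sqoop export --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd --table SDS --export-dir /user/guojian/sds --staging-table sds_tmp --clear-staging-table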

Export options:

Argument                               Description
--direct                               Export to MySQL directly using the mysqlimport tool
--export-dir <dir>                     HDFS path of the data to export
-m,--num-mappers <n>                   Export in parallel with n map tasks
--table <table-name>                   Target table to export into
--call <stored-proc-name>              Call a stored procedure for each record
--update-key <col-name>                Column used to update rows that already exist in the database
--update-mode <mode>                   Update mode: updateonly (default) or allowinsert;
                                       the former only updates existing rows, the latter also inserts new rows
--input-null-string <null-string>      The string to be interpreted as null for string columns
--input-null-non-string <null-string>  The string to be interpreted as null for non-string columns
--staging-table <staging-table-name>   Specify an intermediate staging table
--clear-staging-table                  Clear the staging table before performing the export
--batch                                Use batch mode for underlying statement execution
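As an illustration of the update options, a sketch that updates existing rows by sd_id and inserts the rest (connection details reused from the earlier examples):

sqoop export --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd --table SDS --export-dir /user/guojian/sds --update-key sd_id --update-mode allowinsert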


create-hive-table imports a relational database table definition into a Hive table.

Argument             Description
--hive-home <dir>    Hive installation directory; overrides the default Hive directory
--hive-overwrite     Overwrite existing data in the Hive table
--create-hive-table  Fail the job if the target Hive table already exists
--table              Specify the source relational database table name
sqoop create-hive-table --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd --table SDS --hive-table sds_bak

By default, sds_bak is created in Hive's default database.
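create-hive-table only creates the table definition. If you also want to load the data into Hive in the same step, sqoop import supports --hive-import; a sketch, reusing the illustrative connection above:

sqoop import --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd --table SDS --hive-import --hive-table sds_bak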

This step depends on HCatalog; you need to install HCatalog first, otherwise the following error is reported:

Hive history file=/tmp/guojian/hive_job_log_cfbe2de9-a358-4130-945c-b97c0add649d_1628102887.txt
FAILED: ParseException line 1:44 mismatched input ')' expecting Identifier near '(' in column specification

list-databases lists the databases available on a server:

sqoop list-databases --connect jdbc:mysql://192.168.81.176/ --username root --password passwd

list-tables lists the tables in a database:

sqoop list-tables --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd

codegen maps a relational database table to a Java source file, compiled class, and related jar package:

sqoop codegen --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd --table SDS
Note: /tmp/sqoop-guojian/compile/d58f607b046a061593ba708ec5f3d608/SDS.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.

$ ll /tmp/sqoop-guojian/compile/d58f607b046a061593ba708ec5f3d608/
total 48
drwxrwxr-x  2 guojian guojian  4096 Sep 12 14:15 ./
drwxrwxr-x    guojian guojian  4096 Sep 12 14:15 ../
-rw-rw-r--  1 guojian guojian 11978 Sep 12 14:15 SDS.class
-rw-rw-r--  1 guojian guojian  4811 Sep 12 14:15 SDS.jar
-rw-rw-r--  1 guojian guojian 17525 Sep 12 14:15 SDS.java

merge is a tool that merges two datasets; where the same key appears in both, the newer values overwrite the older ones.

Argument              Description
--class-name <class>  Name of the record class to use during the merge (generated by codegen)
--jar-file <file>     Jar to use during the merge, generated by the codegen tool
--merge-key <col>     Column name to use as the merge key
--new-data <path>     Directory containing the newer dataset
--onto <path>         Directory containing the older dataset
--target-dir <path>   Target output directory
sqoop merge --new-data /user/guojian/sds --onto /user/guojian/sqoop --target-dir /user/guojian/sds_new --jar-file SDS.jar --class-name SDS --merge-key sd_id

Note that if --target-dir is set to a directory that already exists, Sqoop exits with an error.

eval lets you quickly run SQL statements against a database. This allows you to check that a SQL statement is correct before performing an import.

sqoop eval --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd --query "SELECT sd_id, cd_id, location FROM SDS LIMIT 10"
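eval is not limited to SELECT statements; it can also run statements that modify data. A sketch, with made-up values:

sqoop eval --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd --query "INSERT INTO SDS (sd_id, cd_id, location) VALUES (1001, 1, '/tmp/example')"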

The job tool is used to create and manage saved Sqoop jobs.

Argument           Description
--create <job-id>  Create a new saved job
--delete <job-id>  Delete a saved job
--exec <job-id>    Execute a job saved with --create
--show <job-id>    Show the parameters of a saved job
--list             List all saved jobs

Example:

sqoop job --create myimportjob -- import --connect jdbc:mysql://192.168.81.176/hivemeta2db --username root --password passwd --table TBLS

$ sqoop job --list
Available jobs:
  myimportjob

$ sqoop job --show myimportjob
Job: myimportjob
Tool: import
Options:
----------------------------
verbose = false
db.connect.string = jdbc:mysql://192.168.81.176/hivemeta2db
codegen.output.delimiters.escape = 0
codegen.output.delimiters.enclose.required = false
codegen.input.delimiters.field = 0
hbase.create.table = false
db.require.password = true
hdfs.append.dir = false
db.table = TBLS
import.fetch.size = null
accumulo.create.table = false
codegen.input.delimiters.escape = 0
codegen.input.delimiters.enclose.required = false
db.username = root
codegen.output.delimiters.record = 10
import.max.inline.lob.size = 16777216
hbase.bulk.load.enabled = false
hcatalog.create.table = false
db.clear.staging.table = false
codegen.input.delimiters.record = 0
enable.compression = false
hive.overwrite.table = false
hive.import = false
codegen.input.delimiters.enclose = 0
accumulo.batch.size = 10240000
hive.drop.delims = false
codegen.output.delimiters.enclose = 0
hdfs.delete-target.dir = false
codegen.output.dir = .
codegen.auto.compile.dir = true
mapreduce.num.mappers = 4
accumulo.max.latency = 5000
import.direct.split.size = 0
codegen.output.delimiters.field = 44
export.new.update = UpdateOnly
incremental.mode = None
hdfs.file.format = TextFile
codegen.compile.dir = /tmp/sqoop-guojian/compile/bd9c7f7b651276067b3f7b341b7fa4cb
direct.import = false
hive.fail.table.exists = false
db.batch = false

Execution:

sqoop job --exec myimportjob
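Arguments supplied after -- at execution time override the saved job's options; for example (a sketch, the alternate username is hypothetical):

sqoop job --exec myimportjob -- --username someuser -P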

The metastore holds shared metadata for Sqoop jobs, so that multiple users can define and execute Sqoop jobs against the same metastore. By default the metadata is stored under ~/.sqoop.

Start: sqoop metastore

Shut down: sqoop metastore --shutdown

The storage location of the metastore files is configured by sqoop.metastore.server.location in conf/sqoop-site.xml, which points to a local file.

The metastore can be accessed over TCP/IP; the port number is configured with sqoop.metastore.server.port and defaults to 16000.

Clients connect to the metastore either by setting sqoop.metastore.client.autoconnect.url or by passing --meta-connect, with a value of the form jdbc:hsqldb:hsql://<server-name>:<port>/sqoop, for example jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop.
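For example, a sketch of creating a saved job against the shared metastore (using the example server URL above):

sqoop job --meta-connect jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop --create myimportjob -- import --connect jdbc:mysql://192.168.81.176/hivemeta2db --username root --password passwd --table TBLS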

See http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html for more information.

