Hadoop data transfer tool: Sqoop

Overview

Sqoop is a top-level Apache project used to transfer data between Hadoop and relational databases. With Sqoop, we can easily import data from a relational database into HDFS, or export data from HDFS back to a relational database.
The Sqoop architecture is very simple: it integrates with Hive, HBase, and Oozie, and moves data through MapReduce tasks, which provides concurrency and fault tolerance.
For more background on Sqoop, see the article "A New Generation of Data Transfer Tools for Hadoop: Sqoop 2" in the references.

Sqoop interacts with relational databases through JDBC, so in theory any database that supports JDBC can exchange data with HDFS via Sqoop.
However, only a few databases have been officially tested by Sqoop, as shown below:

Database     Version    --direct support?    Connect string matches
HSQLDB       1.8.0+     No                   jdbc:hsqldb:*//
MySQL        5.0+       Yes                  jdbc:mysql://
Oracle       10.2.0+    No                   jdbc:oracle:*//
PostgreSQL   8.3+       Yes (import only)    jdbc:postgresql://

Older versions may also be supported, but have not been tested.
For performance reasons, Sqoop also provides a fast, database-specific data access mechanism that bypasses JDBC; it is enabled with the --direct option.
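
For example, a direct-mode import might look like the following (a minimal sketch; the target directory /tmp/a_direct is illustrative, and for MySQL the --direct path requires the mysqldump tool to be available on the task nodes):

$ ./sqoop import --connect jdbc:mysql://127.0.0.1/test --username root --password 123456 \
    --table a --direct --target-dir /tmp/a_direct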

This article is based on Sqoop 1.4.3.
Install

For how to install and use Sqoop, refer to http://www.54chen.com/java-ee/sqoop-mysql-to-hive.html (tested and working).
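
In outline, a typical installation looks like this (a minimal sketch, assuming Hadoop is already installed and that the archive and JDBC driver file names match what you actually downloaded):

# Unpack the release and put its bin/ directory on the PATH:
$ tar -xzf sqoop-1.4.3.bin__hadoop-1.0.0.tar.gz
$ export SQOOP_HOME=$PWD/sqoop-1.4.3.bin__hadoop-1.0.0
$ export PATH=$SQOOP_HOME/bin:$PATH
# Sqoop reaches the database through JDBC, so copy the driver jar into its lib/:
$ cp mysql-connector-java-5.1.25-bin.jar $SQOOP_HOME/lib/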

Tools

Sqoop contains a series of tools. You can run sqoop help to view the related help:
$ ./sqoop help
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.

Use the list-tables tool to view tables:
$ ./sqoop list-tables --connect jdbc:mysql://127.0.0.1/test --username root --password 123456
a
t1

You can use codegen to generate code without executing a MapReduce job, as shown below:
$ ./sqoop codegen --connect jdbc:mysql://127.0.0.1/test --username root --password 123456 \
    --table a --class-name zxm_sqoop
......
13/03/21 21:02:01 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-work/compile/29864e3980ab5630b699e8e1e2145369/zxm_sqoop.jar
The generated code and Java package can be found under /tmp/sqoop-work/compile/29864e3980ab5630b699e8e1e2145369/.
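
The generated class and jar can be reused in a later import so Sqoop skips regenerating them (a sketch; --class-name and --jar-file are standard code-generation arguments, while the target directory /tmp/a_codegen is illustrative):

$ ./sqoop import --connect jdbc:mysql://127.0.0.1/test --username root --password 123456 \
    --table a --target-dir /tmp/a_codegen \
    --class-name zxm_sqoop \
    --jar-file /tmp/sqoop-work/compile/29864e3980ab5630b699e8e1e2145369/zxm_sqoop.jar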
Import

Sqoop data import has the following features:
1. Supports text files (--as-textfile), Avro (--as-avrodatafile), and SequenceFiles (--as-sequencefile). RCFile is not supported currently. The default is text.
2. Supports appending data, specified with --append.
3. Supports column selection (--columns) and row selection (--where) on the table named by --table.
4. Supports free-form queries, for example reading the result of a multi-table join 'SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id)'; this cannot be used together with --table (see the sketches after this list).
5. Supports customizing the number of map tasks (-m).
6. Supports compression (--compress).
7. Supports importing from a relational database directly into Hive (--hive-import) and HBase (--hbase-table).
Importing into Hive consists of three steps: 1) import the data to HDFS; 2) create the Hive table; 3) load the data into the table with LOAD DATA INPATH.
Importing into HBase consists of two parts: 1) import the data to HDFS; 2) call HBase put to write the data to the table row by row.
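
The following sketches illustrate features 4 and 7 (the Hive table name, HBase table name, column family cf, and target directory below are illustrative, not from the original tests):

# Free-form query import; the $CONDITIONS token is required so Sqoop can split the query across mappers:
$ ./sqoop import --connect jdbc:mysql://127.0.0.1/test --username root --password 123456 \
    --query 'SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS' \
    --split-by a.id --target-dir /tmp/a_join
# Import a table straight into Hive:
$ ./sqoop import --connect jdbc:mysql://127.0.0.1/test --username root --password 123456 \
    --table a --hive-import --hive-table a
# Import a table into HBase, using key1 as the row key:
$ ./sqoop import --connect jdbc:mysql://127.0.0.1/test --username root --password 123456 \
    --table a --hbase-table a --column-family cf --hbase-row-key key1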
Example:
MySQL table content:
mysql> select * from a;
+------+--------+
| key1 | value1 |
+------+--------+
|    1 | a1     |
|    2 | a2     |
|    3 | a3     |
|    4 | a4     |
|    5 | a5     |
|    6 | a6     |
|    7 | a7     |
|    8 | a8     |
|    9 | a9     |
+------+--------+

Create an options file a.conf with the following content (in a Sqoop options file, each option and each value goes on its own line):

import
--append
-m
3
--connect
jdbc:mysql://127.0.0.1/test
--username
root
--password
123456
--table
a
--target-dir
/tmp/a
--columns
key1
--where
'key1>3'

Run:
$ ./sqoop --options-file a.conf

View the output:
$ hadoop fs -ls /tmp/a/
Found 3 items
-rw-r--r--   1 work supergroup          4 2013-03-21 /tmp/a/part-m-00000
-rw-r--r--   1 work supergroup          4 2013-03-21 /tmp/a/part-m-00001
-rw-r--r--   1 work supergroup          4 2013-03-21 /tmp/a/part-m-00002
=> the 3 files correspond to the 3 mappers
$ hadoop fs -cat /tmp/a/*
4
5
6
7
8
9
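
Instead of a hand-maintained --where filter, appending new rows can also be driven by Sqoop's incremental mode (a sketch; --last-value 9 assumes everything up to key1=9 has already been imported):

$ ./sqoop import --connect jdbc:mysql://127.0.0.1/test --username root --password 123456 \
    --table a --target-dir /tmp/a \
    --incremental append --check-column key1 --last-value 9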
Export

Sqoop export writes files on HDFS back to a relational database. It works by reading and parsing the data according to the user-specified delimiter (field separator: --fields-terminated-by), and then turning the records into INSERT/UPDATE statements against the target table.

It has the following features:
1. Data can be exported to a table, or each record can be used to call a stored procedure.
2. Both insert and update modes are supported (see the sketch after this list).
3. Concurrency is controlled with -m.
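
A sketch of update mode, reusing the table b from the example below (--update-key names the column used to match existing rows; --update-mode allowinsert additionally inserts rows whose key does not exist yet):

$ ./sqoop export --connect jdbc:mysql://127.0.0.1/test --username root --password 123456 \
    --table b --export-dir /tmp/b \
    --update-key key1 --update-mode allowinsert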
Example:
$ hadoop fs -cat /tmp/b/*
1,a
2,b
3,c

$ ./sqoop export --connect jdbc:mysql://127.0.0.1/test --table b --username root --password 123456 --export-dir /tmp/b

mysql> select * from b;
+------+--------+
| key1 | value1 |
+------+--------+
|    1 | a      |
|    2 | b      |
|    3 | c      |
+------+--------+

Besides the tools mentioned above, Sqoop provides some other interesting tools, such as sqoop job; if you are interested, you can explore them yourself.
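
For example, a saved job can be created, listed, and executed like this (a sketch; the job name myjob and the target directory are illustrative):

$ ./sqoop job --create myjob -- import --connect jdbc:mysql://127.0.0.1/test \
    --username root --password 123456 --table a --target-dir /tmp/a_job
$ ./sqoop job --list
$ ./sqoop job --exec myjob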

Others:

1. By using MapReduce, Sqoop provides good concurrency and fault tolerance, and can serve as a synchronization tool between heterogeneous databases.
2. Although Sqoop supports Hive and HBase, the support is incomplete; in some scenarios, post-processing after the transfer is unavoidable.
3. For large data transfers, consider --direct.

Reference:

Apache Sqoop
Sqoop User Guide
Apache Sqoop: A Data Transfer Tool for Hadoop
A New Generation of Data Transfer Tools for Hadoop: Sqoop 2
Use Sqoop
Comparison between DataX and Sqoop in Heterogeneous Big Data Environments
