Official Sqoop website:
http://sqoop.apache.org/
*) Sqoop Introduction
Sqoop transfers data between Hadoop and relational databases. With Sqoop, we can easily import data from a relational database into HDFS, or export data from HDFS back into a relational database.
Reference link:
http://blog.csdn.net/yfkiss/article/details/8700480
*) A Simple Example
Objective: import all rows of an Oracle table into the HDFS file system.
1) Create the following table in Oracle:
create table tb_message
(
  id int primary key,
  msgid int,
  message varchar2(256)
);
2) Insert the following data:
insert into tb_message (id, msgid, message) values (1, 1001, 'message 1');
insert into tb_message (id, msgid, message) values (2, 1002, 'message 2');
insert into tb_message (id, msgid, message) values (3, 1003, null);
3) Verify the data:
select * from tb_message;

        ID      MSGID MESSAGE
---------- ---------- ------------------------
         1       1001 message 1
         2       1002 message 2
         3       1003
4) Create the target directory:
sudo -u hdfs hdfs dfs -mkdir -p /sqoop-test/data
5) Import with the sqoop import command:
sqoop import --connect jdbc:oracle:thin:@<oracle_server_ip>:<oracle_server_port>:<SID> --username <username> --password <password> --table <table_name> --columns <col,col,col…> --target-dir <hdfs_dir>
sudo -u hdfs sqoop import --connect jdbc:oracle:thin:@172.16.1.108:1521:XE --username system --password tiger --table tb_message --columns ID,MSGID,MESSAGE --target-dir /sqoop-test/data/message -m 1
6) Verify the HDFS file:
sudo -u hdfs hdfs dfs -cat /sqoop-test/data/message/part-m-00000
1,1001,message 1
2,1002,message 2
3,1003,null
As this simple example shows, the whole workflow is easy to follow.
*) Sqoop Command-Line Parameters
import
Imports data from an RDBMS into HDFS.
export
Exports data from HDFS into an RDBMS.
--append
Appends imported data to an existing dataset in HDFS.
--connect
JDBC connection string. Typical MySQL and Oracle connection strings look like:
MySQL: jdbc:mysql://<server_ip>:<server_port>/<dbname>
Oracle: jdbc:oracle:thin:@<server_ip>:<server_port>:<SID>
--username
Database login username.
--password
Database login password. You can use -P instead to enter the password from the console.
--table
Name of the database table to import/export.
--columns
Comma-separated list of columns to transfer, e.g. ID,MSGID.
--target-dir
Target HDFS directory for the import.
--num-mappers
Number of map tasks to run; abbreviated as -m.
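To see how these parameters fit together, here is a minimal sketch of a round trip. The host, database, and user names are made up for illustration:

# Hypothetical import: MySQL table -> HDFS directory
sqoop import \
  --connect jdbc:mysql://192.168.1.10:3306/testdb \
  --username sqoop_user -P \
  --table tb_message \
  --columns ID,MSGID,MESSAGE \
  --target-dir /sqoop-test/data/message \
  --num-mappers 1

# Hypothetical export: HDFS directory -> MySQL table
# (--export-dir names the HDFS source; the target table must already exist)
sqoop export \
  --connect jdbc:mysql://192.168.1.10:3306/testdb \
  --username sqoop_user -P \
  --table tb_message \
  --export-dir /sqoop-test/data/message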
These are the most common command-line parameters. For the full list, see the official user guide:
http://sqoop.apache.org/docs/1.4.4/SqoopUserGuide.html
*) Performance Analysis
Consider two approaches:
RDBMS -> local file -> HDFS
RDBMS -> Sqoop -> HDFS
Performance comparison:
#) Sqoop transfers data over JDBC, so it is not necessarily faster than the RDBMS's native import/export tools; in practice it is often slower.
#) Sqoop improves the reliability of data import/export.
#) Sqoop can also import/export data to Hive/HBase, as the sketch below shows.
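For example, importing straight into Hive takes only a couple of extra flags. A minimal sketch, reusing the Oracle connection from the example above (the Hive table name is illustrative):

# --hive-import loads the data into the Hive warehouse after the
# HDFS import completes, creating the Hive table if it does not exist
sqoop import \
  --connect jdbc:oracle:thin:@172.16.1.108:1521:XE \
  --username system --password tiger \
  --table tb_message \
  --hive-import \
  --hive-table tb_message \
  -m 1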
Optimizations:
Parallel import
1) Increase --num-mappers to launch multiple map tasks and speed up the transfer.
However, this fails for a table without a primary key, with an error like:
14/06/25 17:47:33 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdfs/compile/a1525a92a1a74988ed98d367452e7fa8/tb_nopk_table.jar
14/06/25 17:47:33 ERROR tool.ImportTool: Error during import: No primary key could be found for table tb_nopk_table. Please specify one with --split-by or perform a sequential import with '-m 1'.
Without a primary key, Sqoop cannot partition the rows among map tasks, so it refuses multiple mappers and only allows '-m 1'; the import/export then runs serially.
2) Specify --split-by and choose a column suitable for splitting.
--split-by enables parallel import/export of tables without a primary key; it is used together with --num-mappers, as in the sketch below.
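A sketch of such a parallel import, assuming a hypothetical primary-key-less table tb_nopk_table whose MSGID column is spread evenly over its range:

# Sqoop queries min(MSGID) and max(MSGID), then hands each of the
# 4 map tasks one sub-range of that interval
sqoop import \
  --connect jdbc:oracle:thin:@172.16.1.108:1521:XE \
  --username system --password tiger \
  --table tb_nopk_table \
  --split-by MSGID \
  --num-mappers 4 \
  --target-dir /sqoop-test/data/nopk

Note that a skewed --split-by column produces unbalanced map tasks, so pick a column whose values are distributed evenly.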
3) Split the job into multiple sqoop commands manually.
Choose a column suitable for splitting, use --where to isolate the ranges, and run several sqoop commands at the same time.
For example:
sqoop import ... --table tb_message --where "ID > 0 AND ID <= 1000"
sqoop import ... --table tb_message --where "ID > 1000 AND ID <= 2000"
sqoop import ... --table tb_message --where "ID > 2000 AND ID <= 3000"
See: http://www.cnblogs.com/gpcuster/archive/2011/03/01/1968027.html
*) Other Sqoop Questions
1. Data consistency
If an export is interrupted partway through, the target table is left partially populated; this partially-complete export makes the data inconsistent.
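One mitigation Sqoop offers for exports is a staging table: rows are first written to an intermediate table and moved into the target in a single transaction only after all map tasks succeed. A sketch, assuming a hypothetical staging table tb_message_stage with the same schema as the target:

# --clear-staging-table empties the staging table before the export starts
sqoop export \
  --connect jdbc:oracle:thin:@172.16.1.108:1521:XE \
  --username system --password tiger \
  --table tb_message \
  --staging-table tb_message_stage \
  --clear-staging-table \
  --export-dir /sqoop-test/data/message

Staging tables are not supported by every connector (for example, they cannot be combined with direct mode), so check the user guide for your database.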
2. How are NULL values handled?
By default, when no relevant parameters are given, a NULL field value is written to the file as the string "null". For special handling, specify the --null-string and --null-non-string parameters:
--null-string <null-string>         The string to be written for a null value for string columns
--null-non-string <null-string>     The string to be written for a null value for non-string columns
During import, specify:
--null-string '\N'
--null-non-string '\N'
During export, specify:
--input-null-string '\N'
--input-null-non-string '\N'
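Put together, a round trip that preserves NULLs might look like the following sketch ('\N' is just a conventional marker, matching Hive's default; depending on how your setup parses the arguments you may need to escape the backslash, e.g. '\\N'):

# Import: write NULL column values as the literal string \N
sqoop import ... --table tb_message --null-string '\N' --null-non-string '\N'
# Export: parse the literal string \N back into NULL
sqoop export ... --table tb_message --input-null-string '\N' --input-null-non-string '\N'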