Import data from a database into HDFS using Sqoop (parallel import, incremental import)


Basic use

Take the following shell script as an example:

#Oracle connection string, containing the Oracle host address, port number, and SID
CONNECTURL=jdbc:oracle:thin:@20.135.60.21:1521:dwrac2
#Username to connect with
ORACLENAME=kkaa
#Password to connect with
ORACLEPASSWORD=kkaa123
#Name of the table to import from Oracle
oracleTableName=tt
#Names of the columns to import from that table
columns=area_id,team_name
#HDFS path where the imported Oracle data will be stored
hdfsPath=apps/as/hive/$oracleTableName

#Run the import: load the data from Oracle into HDFS
sqoop import --append --connect $CONNECTURL --username $ORACLENAME --password $ORACLEPASSWORD --target-dir $hdfsPath --num-mappers 1 --table $oracleTableName --columns $columns --fields-terminated-by '\001'

After this script runs, the import is complete.
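To confirm that the data landed, you can list and inspect the target directory (a quick check, assuming the hadoop command-line client is available; part-m-* are the usual names of the map-task output files):

hadoop fs -ls $hdfsPath
hadoop fs -cat $hdfsPath/part-m-00000 | head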

Next, users can create their own Hive external table, setting the external table's location to the HDFS path that holds the imported Oracle data.

Note: The data imported into HDFS by this script is plain text, so when creating the Hive external table you do not need to declare the file format as RCFile; the default TextFile is used. The field delimiter is '\001'. If you import data from the same table more than once, each run appends its data to the HDFS directory.
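For reference, a minimal sketch of such an external table for the tt import above (the table name, column types, and HDFS user in the location are illustrative; the relative hdfsPath from the script resolves under your HDFS home directory):

CREATE EXTERNAL TABLE tt (
  area_id   STRING,  -- use the Hive equivalents of the real Oracle column types
  team_name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE
LOCATION '/user/<your_hdfs_user>/apps/as/hive/tt';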

Parallel import

Suppose you need to import data from Oracle into HDFS with the following Sqoop command:

sqoop import --append --connect $CONNECTURL --username $ORACLENAME --password $ORACLEPASSWORD --target-dir $hdfsPath -m 1 --table $oracleTableName --columns $columns --fields-terminated-by '\001' --where "data_desc='2011-02-26'"

Note the "-m" parameter in this command: it controls how many parallel map tasks are used. Its value here is 1, which means the parallel import feature is not enabled.

Now we can turn on parallel import simply by increasing the value of the "-m" parameter, as in the following command:

sqoop import --append --connect $CONNECTURL --username $ORACLENAME --password $ORACLEPASSWORD --target-dir $hdfsPath -m 4 --table $oracleTableName --columns $columns --fields-terminated-by '\001' --where "data_desc='2011-02-26'"

In general, Sqoop will then start 4 map tasks that carry out the import in parallel.

However, if a table imported from Oracle does not have a primary key, the following error message appears:

ERROR tool.ImportTool: Error during import: No primary key could be found for table creater_user.popt_cas_redirect_his. Please specify one with --split-by or perform a sequential import with '-m 1'.

In this case, to make better use of Sqoop's parallel import feature, we need to understand how Sqoop implements parallel import.

Suppose the primary key of the Oracle table to be imported in parallel is ID and the degree of parallelism is 4. Sqoop first executes a query like the following:

SELECT MIN(id), MAX(id) FROM table [WHERE ..., if a WHERE clause was specified];

This query obtains the minimum and maximum values of the split field (id); assume they are 1 and 1000, respectively.

Sqoop then splits the query according to the degree of parallelism. In the example above, the parallel import is split into the following 4 SQL statements, which are executed simultaneously:

SELECT * FROM table WHERE id >= 0 AND id < 250;

SELECT * FROM table WHERE id >= 250 AND id < 500;

SELECT * FROM table WHERE id >= 500 AND id < 750;

SELECT * FROM table WHERE id >= 750 AND id <= 1000;

Note that this split field needs to be an integer.
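As a rough illustration of how those boundaries come about (a minimal sketch with round numbers, not Sqoop's actual splitter code): the range between the minimum and maximum is divided into as many evenly sized intervals as there are mappers.

#Illustration only: derive the split ranges from min, max, and the mapper count
min=0; max=1000; mappers=4
step=$(( (max - min) / mappers ))
for i in $(seq 0 $(( mappers - 1 ))); do
  lo=$(( min + i * step ))
  hi=$(( min + (i + 1) * step ))
  echo "mapper $i: SELECT * FROM table WHERE id >= $lo AND id < $hi"
done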

As the example above shows, when importing a table that has no primary key, we need to manually pick a suitable split field and choose an appropriate degree of parallelism.

Here is a practical example to illustrate this:

We want to import creater_user.popt_cas_redirect_his from Oracle.

This table does not have a primary key, so we need to manually select a suitable split field.

First look at what fields this table has:
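One way to list the columns, assuming the account can read Oracle's ALL_TAB_COLUMNS dictionary view, is a query such as:

SELECT column_name, data_type
FROM all_tab_columns
WHERE owner = 'CREATER_USER'
  AND table_name = 'POPT_CAS_REDIRECT_HIS';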

I then guessed that the ds_name field might be a suitable split field, and ran the following SQL to verify the idea:

SELECT MIN(ds_name), MAX(ds_name) FROM creater_user.popt_cas_redirect_his WHERE data_desc='2011-02-26';

The result was not ideal: the MIN and MAX values were equal, so this field is not suitable as a split field.

Next, test another field, clientip:
SELECT MIN(clientip), MAX(clientip) FROM creater_user.popt_cas_redirect_his WHERE data_desc='2011-02-26';

This time the result looked good, so we use the clientip field as the split field.

Therefore, we use the following command to import in parallel:

sqoop import --append --connect $CONNECTURL --username $ORACLENAME --password $ORACLEPASSWORD --target-dir $hdfsPath -m 4 --split-by clientip --table $oracleTableName --columns $columns --fields-terminated-by '\001' --where "data_desc='2011-02-26'"

Running this command took 20 minutes 35 seconds and imported 33,222,896 rows.

Also, if this automatic split does not fit your needs well enough, you can run several Sqoop commands at the same time and specify your own split rule in the --where clause. For example:

sqoop import --append --connect $CONNECTURL --username $ORACLENAME --password $ORACLEPASSWORD --target-dir $hdfsPath -m 1 --table $oracleTableName --columns $columns --fields-terminated-by '\001' --where "data_desc='2011-02-26' and logtime < '10:00:00'"

sqoop import --append --connect $CONNECTURL --username $ORACLENAME --password $ORACLEPASSWORD --target-dir $hdfsPath -m 1 --table $oracleTableName --columns $columns --fields-terminated-by '\001' --where "data_desc='2011-02-26' and logtime >= '10:00:00'"

In this way we also achieve the effect of a parallel import.

Incremental Import

Sqoop supports two incremental import modes.
The first is append mode, which works by specifying a monotonically increasing check column, for example:
--incremental append --check-column num_iid --last-value 0

A check column of varchar type can also be used in this mode (for example, an ID column that stores an incrementing number as a varchar):

--incremental append --check-column ID --last-value 8
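Combined with the connection settings from the script above, a complete append-mode command might look like the following sketch (the check column num_iid and the starting --last-value are illustrative):

sqoop import --connect $CONNECTURL --username $ORACLENAME --password $ORACLEPASSWORD --target-dir $hdfsPath --table $oracleTableName --columns $columns --fields-terminated-by '\001' -m 1 --incremental append --check-column num_iid --last-value 0

At the end of the run, Sqoop logs the value to pass as --last-value on the next incremental import.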
The other mode is based on a timestamp, for example:
--incremental lastmodified --check-column created --last-value '2012-02-01 11:00:00'
This imports only rows whose created value is later than '2012-02-01 11:00:00'.
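Along the same lines, a lastmodified-mode sketch (again reusing the variables from the script above; the created column is illustrative):

sqoop import --connect $CONNECTURL --username $ORACLENAME --password $ORACLEPASSWORD --target-dir $hdfsPath --table $oracleTableName --columns $columns --fields-terminated-by '\001' -m 1 --incremental lastmodified --check-column created --last-value '2012-02-01 11:00:00'

Note that in lastmodified mode, if the target directory already exists, recent Sqoop versions also require either --append or --merge-key.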
