Some thoughts on extracting data from JDBC to HDFS with Sqoop 1.99.6

Source: Internet
Author: User
Tags: sqoop

I have recently been using Sqoop 1.99.6 for data extraction and ran into quite a few problems along the way. I am recording them here for easy review and reference later.

1. First, the configuration: you need to add the Hadoop/HDFS lib directories to common.loader in catalina.properties

common.loader=${catalina.base}/lib,${catalina.base}/lib/*.jar,${catalina.home}/lib,${catalina.home}/lib/*.jar,${catalina.home}/../lib/*.jar,/usr/lib/hadoop/*.jar,/usr/lib/hadoop/lib/*.jar,/usr/lib/hadoop-hdfs/*.jar,/usr/lib/hadoop-hdfs/lib/*.jar,/usr/lib/hadoop-mapreduce/*.jar,/usr/lib/hadoop-mapreduce/lib/*.jar,/usr/lib/hadoop-yarn/*.jar,/usr/lib/hadoop-yarn/lib/*.jar,/usr/lib/hive/lib/*.jar

Also configure the logs directory and the metadata DB directory in sqoop.properties.
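For reference, the two entries I touched in sqoop.properties look roughly like this (key names as I recall them from the default 1.99.6 file; the paths are examples to replace with your own):

org.apache.sqoop.log4j.appender.file.File=/var/log/sqoop2/sqoop.log
org.apache.sqoop.repository.jdbc.url=jdbc:derby:/var/lib/sqoop2/repository/db;create=true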

2. Start Sqoop after the configuration is done (Hadoop should be started first).
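Concretely, with the scripts shipped in 1.99.6 this is something like the following (12000 is the default server port; adjust if you changed it):

sqoop2-server start
sqoop2-shell
sqoop:000> set server --host localhost --port 12000 --webapp sqoop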

3. To create a Sqoop job you first need to create links; in 1.99.6 you need both a FROM link and a TO link.

create link -cid 1
create link -cid 3

These two commands create a JDBC link and an HDFS link, respectively (the connector IDs can be listed with show connector).

The JDBC link requires several parameters; the mandatory ones are the connection string, the database user, and the database password.

The HDFS link needs the following parameter: the HDFS URL, e.g. hdfs://namenode:9000/

After you have set these parameters, you are ready to connect.
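For reference, a link-creation session looks roughly like this; the prompts are paraphrased and all names and values are purely illustrative:

sqoop:000> create link -cid 1
Creating link for connector with id 1
Please fill following values to create new link object
Name: mysql-link
JDBC Driver Class: com.mysql.jdbc.Driver
JDBC Connection String: jdbc:mysql://dbhost:3306/mydb
Username: dbuser
Password: ********
New link was successfully created with validation status OK and persistent id 1

sqoop:000> create link -cid 3
Creating link for connector with id 3
Name: hdfs-link
HDFS URL: hdfs://namenode:9000/
New link was successfully created with validation status OK and persistent id 2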

I have tested both MySQL and Oracle connections.

For Oracle, the connection string can use the SID form: jdbc:oracle:thin:@IP:PORT:ORCL

After the links are built, you can build the job.

create job -f 1 -t 2
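The -f and -t arguments take the persistent IDs of the FROM and TO links, which you can check with show link first (output abbreviated, IDs illustrative):

sqoop:000> show link
+----+------------+--------------+---------+
| Id | Name       | Connector Id | Enabled |
+----+------------+--------------+---------+
| 1  | mysql-link | 1            | true    |
| 2  | hdfs-link  | 3            | true    |
+----+------------+--------------+---------+
sqoop:000> create job -f 1 -t 2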

The parameters to be filled in are:

Name: (optional)

From database configuration:

Schema name:
Table name:
Table SQL statement:
Table column names:
...

To HDFS configuration:

...
Output format:
Compression format:
...
Output directory:
...

A few of these parameters deserve closer attention, especially the ones on the database (FROM) side.

(Schema name, Table name) and Table SQL statement cannot be set at the same time; this restriction prevents problems caused by the two referring to different table names.

Table column names specifies which columns to extract, in the basic format col1,col2,col3. Unfortunately this field has an absurd length limit: the string cannot exceed 50 characters, so it only fits a handful of columns. By digging through the source code I found that Table SQL statement effectively exists to work around this problem: the source gives the SQL field a size of 2000, which is plenty of room. But every SQL statement I tried at first kept failing, which led to much grumbling at Sqoop...

Finally I discovered that the basic form is this: SELECT id,col1,col2,col3 FROM SCHEMA.TABLE WHERE ${CONDITIONS}

The ${CONDITIONS} token is just a placeholder with no effect of its own; the code replaces it with 1=1.

This works around the column-name length limit.
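Putting it together, the FROM side of the job configuration can be filled in like this (values illustrative; Schema name, Table name and Table column names stay empty because the SQL statement takes their place, and Partition column name is explained below):

Schema name:
Table name:
Table SQL statement: SELECT id,col1,col2,col3 FROM SCHEMA.TABLE WHERE ${CONDITIONS}
Table column names:
Partition column name: id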

One more thing to explain.

Partition column name can generally be set to id; it exists to enable parallel processing. Sqoop splits the table by fetching the maximum and minimum values of the partition column and dividing the data into chunks over that range (which can easily make the tasks unbalanced).

To address this imbalance, the Boundary query parameter can be used to control how the data is chunked, though I have not studied it.
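For completeness, a boundary query is conventionally a statement that returns the minimum and maximum values to be used for splitting, something along these lines (untested on my side, just the conventional shape):

SELECT MIN(id), MAX(id) FROM SCHEMA.TABLE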

One last issue: Sqoop currently uses "," rather than "\t" as the delimiter between columns, and this cannot be changed from outside; it is hard-coded. If you want a different delimiter, you have to modify the source and recompile.

The class in question is org.apache.sqoop.connector.hdfs.HdfsConstants.
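As a rough sketch, the change amounts to editing one constant in that class and rebuilding; the field name below is hypothetical (locate the actual hard-coded separator in your checkout):

// In org/apache/sqoop/connector/hdfs/HdfsConstants.java:
// hypothetical field name -- find the hard-coded ',' in your source tree
public static final char SEPARATOR = '\t'; // was: ','

After the edit, rebuild the HDFS connector module (e.g. with mvn package) and redeploy the server so the new delimiter takes effect.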

That is as far as my Sqoop notes go for now.
