Resolving the inconsistency between data imported into Hive with Sqoop and the data in the original database


Sqoop is an open source tool used primarily to transfer data between Hadoop (Hive) and traditional relational databases (MySQL, PostgreSQL, Oracle, etc.). It can import data from a relational database into HDFS, or export data from HDFS back into a relational database.

1. Issue background

Sqoop was used to import a table from an Oracle database, call it student, into HDFS. A Hive external table was then created whose location points to the HDFS directory where the data was saved. Finally, counting rows in Hive under a specific condition gave a result that did not match the result in Oracle.

1.1 Import the data into HDFS under the /user/hadoop/student path

sqoop import --connect "jdbc:oracle:thin:@//localhost:1521/student" --password "***" --username "***" --query "SELECT * FROM student WHERE name='Zhangsan' AND class_id='003' AND \$CONDITIONS" --target-dir "/user/hadoop/student" --verbose -m 1

At this point, /user/hadoop/student on HDFS holds the table data imported from Oracle.
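The article does not show the Hive DDL for the external table. A minimal sketch of what it would look like, assuming hypothetical columns id, name and class_id (the real column list comes from the Oracle table) and a comma delimiter matching Sqoop's default output:

CREATE EXTERNAL TABLE student (
  id STRING,
  name STRING,
  class_id STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/student';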

How is the table data stored on HDFS? Pay attention to this, because it is exactly what leads to the inconsistent result.

Let's take a look at how the data is stored on HDFS. Running hadoop fs -cat /user/hadoop/student/part-m-00000 shows that the fields are separated by ',', which is Sqoop's default. If a field value itself contains ',', the data is split incorrectly when it is loaded into Hive, because the Hive table is also delimited by ','.
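As a hypothetical illustration, suppose one row has the name value 'Zhang, San'. Sqoop writes the line as

001,Zhang, San,003

and Hive, splitting on ',', parses four fields instead of three, so class_id ends up holding ' San' rather than '003' and the row no longer matches the where clause.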

2. Analyzing the problem

Comparing the result of select count(*) from student in Hive with the result of select count(*) from student in Oracle, the row counts are the same, so no data was lost during loading. Why, then, are the results inconsistent under a specific condition? Running select count(*) from student where class_id='003' returns fewer rows in Hive than in Oracle.

The cause is that when Hive splits the data on commas, several rows have field values that themselves contain commas, so fields and values get shifted and the query cannot return the correct result.

It is recommended to use '\001' as the field delimiter when Sqoop imports the data, i.e. the --fields-terminated-by <char> parameter. Reference: http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_large_objects

The final, optimized Sqoop statement is:

sqoop import --connect "jdbc:oracle:thin:@//localhost:1521/student" --password "***" --username "***" --query "SELECT * FROM student WHERE name='Zhangsan' AND class_id='003' AND \$CONDITIONS" --target-dir "/user/hadoop/student" --fields-terminated-by "\001" --verbose -m 1
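The Hive external table definition must use the same delimiter, otherwise Hive cannot split the fields correctly. A minimal sketch of the matching DDL, reusing the hypothetical columns from above:

CREATE EXTERNAL TABLE student (
  id STRING,
  name STRING,
  class_id STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
LOCATION '/user/hadoop/student';

Since '\001' (Ctrl-A) is also Hive's default field delimiter, the ROW FORMAT clause could be omitted, but stating it explicitly makes the intent clear.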

