Sqoop is an open source tool used mainly to move data between Hadoop (Hive) and traditional relational databases (MySQL, PostgreSQL, etc.). Data can be transferred from a relational database (such as MySQL, Oracle, or Postgres) into HDFS, or data in HDFS can be exported into a relational database.
1. Issue background
Use Sqoop to import a table from an Oracle database, assumed here to be named student, into HDFS, then create a Hive external table whose LOCATION points to the HDFS path where the data was just saved. Finally, when counting rows in Hive under a specific condition, the result is inconsistent with the result in Oracle.
1.1 Import the data into HDFS under the /user/hadoop/student path
sqoop import --connect "jdbc:oracle:thin:@//localhost:1521/student" --password "***" --username "***" --query "SELECT * FROM student WHERE name='Zhangsan' AND class_id='003' AND \$CONDITIONS" --target-dir "/user/hadoop/student" --verbose -m 1
At this point, /user/hadoop/student on HDFS holds the table data imported from Oracle.
How is the table data stored on HDFS? Pay attention to this point; it is what leads to the inconsistent results at the end.
Let's take a look at how the data is stored on HDFS. Running hadoop fs -cat /user/hadoop/student/part-m-00000 shows that fields are separated by ',', which is Sqoop's default delimiter. If a field value itself contains ',', the data will be split incorrectly when it is read into Hive, because Hive also splits fields on ','.
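For illustration, here is a made-up sample of what part-m-00000 might contain, assuming hypothetical columns (id, name, address, class_id); the address value in the first row contains an embedded comma:
1,Zhangsan,Beijing, Haidian,003
2,Lisi,Shanghai,003
With a comma delimiter, Hive splits the first line into five fields instead of four: id=1, name=Zhangsan, address=Beijing, class_id= Haidian, and the trailing 003 is discarded because the table has only four columns.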
2. Analyzing the problem
Comparing the result of select count(*) from student in Hive with the result of select count(*) from student in Oracle, the row counts are the same, which shows that no data was lost during loading. So why are the results inconsistent under certain conditions, with fewer rows in Hive than in Oracle? That is, running select count(*) from student where class_id='003' returns a smaller count in Hive.
The cause: Hive splits the data on commas, and several field values themselves contain commas, so fields and values become misaligned and the correct result cannot be obtained.
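Continuing the hypothetical sample above, the mis-split row no longer matches the filter, so the conditional count in Hive comes up short:
select count(*) from student where class_id = '003';
-- Oracle sees class_id = '003' in both rows and returns 2.
-- Hive parsed the first row's class_id as ' Haidian', so it returns 1.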
We recommend using '\001' (Ctrl-A, which is also Hive's default field delimiter) as the separator when importing data with Sqoop, i.e. the --fields-terminated-by <char> parameter. Reference: http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_large_objects
The final, optimized Sqoop statement is:
sqoop import --connect "jdbc:oracle:thin:@//localhost:1521/student" --password "***" --username "***" --query "SELECT * FROM student WHERE name='Zhangsan' AND class_id='003' AND \$CONDITIONS" --target-dir "/user/hadoop/student" --fields-terminated-by "\001" --verbose -m 1
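For completeness, the Hive external table must declare the same delimiter so that it splits rows on '\001' rather than ','. A minimal sketch, again assuming the hypothetical columns used above:
create external table student (
  id int,
  name string,
  address string,
  class_id string
)
row format delimited fields terminated by '\001'
location '/user/hadoop/student';
Since '\001' is Hive's default delimiter, the fields terminated by clause could be omitted, but declaring it explicitly keeps the table definition in sync with the Sqoop import.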
With this change, the inconsistency between the data Sqoop ultimately imports into Hive and the data in the original database is resolved.