Import data from MySQL to Hive using Sqoop

Source: Internet
Author: User
Tags: sqoop


Objective


This article is mainly a summary of the pitfalls I ran into when importing data from MySQL to Hive with Sqoop.



Environment:


    • System: CentOS 6.5
    • Hadoop: Apache 2.7.3
    • MySQL: 5.1.73
    • JDK: 1.8
    • Sqoop: 1.4.7


Hadoop runs in pseudo-distributed mode.





One, the import command used


For my test I mainly followed an existing article, Sqoop: Import Data from MySQL to Hive.



Following the method in that article, I created the table in MySQL, populated it with data, and then entered the command below, adjusted to my own setup:


sqoop import --connect jdbc:mysql://localhost:3306/test --username root -P --split-by id --columns id,name --table customer --fields-terminated-by "," --hive-import --create-hive-table --hive-table sqoop_workspace.customers
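For context, the customer table on the MySQL side can be prepared with something like the sketch below. This is only an assumed schema for illustration (the exact table from the referenced article is not reproduced here); note that id is shown as a text column on purpose, since that is what triggers the split-by error described in the next section.

# A minimal sketch, assuming a local MySQL instance and the "test" database
# used in the import command above; the schema and data are illustrative only.
mysql -u root -p test <<'SQL'
CREATE TABLE IF NOT EXISTS customer (
    id   VARCHAR(32) NOT NULL,   -- a text column; splitting on it causes problem 1 below
    name VARCHAR(100),
    PRIMARY KEY (id)
);
INSERT INTO customer (id, name) VALUES
    ('c001', 'Alice'),
    ('c002', 'Bob'),
    ('c003', 'Carol');
SQL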

 


Then the whack-a-mole with errors began.





Two, problems encountered and solutions

1. Splitting on a text column


Error message: "Generating splits for a textual index column allowed only in case of"-dorg.apache.sqoop.splitter.allow_text_splitte R=true "Property passed as a parameter".



The problem is that the column passed to "--split-by" (id) is a text column, so the option "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" has to be added to the command:




" -dorg.apache.sqoop.splitter.allow_text_splitter=true " --connect jdbc:mysql://localhost:3306/test--username root-p--split-by ID--columns id,name--table customer "," --hive-import--create-hive-table--hive-table sqoop_workspace.customers




2. The Hadoop JobHistory server is not running


Error message: "Error tool. Importtool:import Failed:java.io.IOException:java.net.ConnectException:Call from harry.com/192.168.0.210 to 0.0.0.0:10020 failed on connection exception: ... ".



The main reason is that after Sqoop runs the MapReduce job, it needs to record the job information through the Hadoop JobHistory server, which stores it in a specified HDFS directory. The JobHistory server is not started by default, so it has to be configured and then started manually.



Workaround: add the following configuration to mapred-site.xml:






<property>
    <name>mapreduce.jobhistory.address</name>
    <value>0.0.0.0:10020</value>
</property>

<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>0.0.0.0:19888</value>
</property>

<property>
    <name>mapreduce.jobhistory.done-dir</name>
    <value>/history/done</value>
</property>

<property>
    <name>mapreduce.jobhistory.intermediate-done-dir</name>
    <value>/history/done_intermediate</value>
</property>


Restart HDFS and YARN:






stop-dfs.sh
stop-yarn.sh
start-dfs.sh
start-yarn.sh


To start the history server:




$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
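To confirm that the daemon actually came up (assuming a standard pseudo-distributed setup), jps should now list a JobHistoryServer process, and the web UI should answer on port 19888 as configured above:

# Verify the history server is running.
jps
# Expected to include: NameNode, DataNode, ResourceManager, NodeManager, JobHistoryServer
curl -s http://localhost:19888/ > /dev/null && echo "JobHistory web UI is up"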


If you need to stop later, you can use the command:




$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh stop historyserver


Then execute the command again.





3. Error connecting to the metastore database


Error message: "caused By:javax.jdo.JDOFatalDataStoreException:Unable to open a test connection to the given database. JDBC URL ... ".



This is mainly because I left Hive's metastore on the default embedded Derby database while a Hive CLI was open in another session. One drawback of embedded Derby is that it only supports a single connection, so an error occurs when multiple clients access Hive at the same time.



Workaround: exit the other Hive CLI session and re-run the import.
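If it is not obvious which session is holding the metastore, one way to check (assuming the default embedded Derby metastore, which lives in a metastore_db directory under whatever directory the Hive CLI or Sqoop job was launched from) is to look for Derby's lock files:

# Run from the directory where the Hive CLI / Sqoop job was started.
# While another embedded-Derby session is open, lock files such as db.lck are present.
ls metastore_db/*.lck
# Quit the other Hive CLI ("quit;"), confirm the locks are gone, then re-run the import.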





4. Database not created in Hive


Error message: "Error QL. Driver:FAILED:SemanticException [Error 10072]: Database does not exist:sqoop_workspace ... ", this is obvious, just create one directly in hive.





5. Other warnings and errors


Some other warnings and errors do not actually block the import, such as the following warning:



"WARN HDFs. Dfsclient:caught exception java.lang.InterruptedException ... "is actually Hadoop's own bug, specifically the bug in HDFs 9794:



When a DFSStripedOutputStream is closed, its streamer threads are not closed if flushing the data back to the data/parity blocks fails. The same problem exists in DFSOutputStream#closeImpl, which always force-closes its threads and can therefore cause an InterruptedException.



Warnings like these can generally be ignored.





Three, references


1. Sqoop import: "-Dorg.apache.sqoop.splitter.allow_text_splitter=true"



2. Hadoop JobHistory



3. "Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep" workaround



4. WARN hdfs.DFSClient: Caught exception java.lang.InterruptedException


