Sqoop: Fault Tolerance

Source: Internet
Author: User
Tags: hadoop ecosystem sqoop

Sqoop's own fault tolerance relies on Hadoop. Here we focus on how Sqoop handles transfer task failures, and specifically on how to solve the data consistency problem that a failed transfer task can cause.

Consider a transfer task that moves data from A to B. If the task fails, the states of A and B should be the same as they were before the transfer started.

Sqoop generates a MapReduce job for each transfer job. A job consists of multiple map tasks that exchange data with the external database in parallel. There are many reasons why some of these tasks may fail, such as:
1. Database constraint violations
2. Lost database connections
3. A mismatch between the number of transmitted columns and the number of columns in the table, caused by delimiters or similar issues
4. Hardware problems on the Hadoop machines

The failure of any single task can cause the entire transfer job to fail, which in turn can lead to data consistency problems.

A transfer job is executed in parallel by multiple tasks, and each task is its own transaction. When one task fails, its transaction rolls back, but the transactions of the other tasks do not. This leads to a serious dirty-data problem: some of the data is imported and some is missing. What should we do?

For Sqoop import jobs, this problem does not exist, because Hadoop's cleanup task removes the partial output of a failed job.

For Sqoop export jobs, Sqoop provides a "staging table" (intermediate table) solution.
The data is first written to the staging table. Only after all of it has been written successfully is it moved from the staging table to the target table, in a single transaction.
--staging-table <staging-table-name>: the staging (intermediate) table
--clear-staging-table: clear the staging table before the task starts

E.g.:
./sqoop export --connect jdbc:mysql://127.0.0.1/test --table employee --staging-table employee_tmp --clear-staging-table --username root --password 123456 --export-dir hdfs://localhost:9000/user/hive/warehouse/employee
During the transfer, the data is stored in employee_tmp; at the end, the data in employee_tmp is moved to employee.
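
Conceptually, that final move behaves roughly like the following SQL executed in one transaction. This is an illustrative sketch of the idea, not the exact statements Sqoop issues:

    -- Sketch only: move staged rows into the target table atomically,
    -- then empty the staging table for the next run.
    START TRANSACTION;
    INSERT INTO employee SELECT * FROM employee_tmp;
    DELETE FROM employee_tmp;
    COMMIT;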

The staging table is a good idea, but it introduces a problem: to import one copy of data into the database, you must also create a "companion table" (the staging table).
If the transfer tool is to be generalized, this companion-table step needs to be integrated into the tool itself; if table creation is left as a separate manual step, the DBAs will push back strongly.
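
The companion table must have the same structure as the target table. A minimal sketch on MySQL, assuming the employee table from the example above:

    -- Sketch: create a staging table with the same schema as the target table
    CREATE TABLE employee_tmp LIKE employee;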

Summary:
For a transfer tool or platform, the failure of a transfer task is not the scary part; the scary part is how to handle the resulting "dirty data":
1. Temporary table: cache the data in a temporary table, then move it to the target table in a single transaction.
2. Custom rollback: after a task fails, run user-defined statements or methods to clean up the data.
3. Idempotent transfer tasks: if a task fails and leaves dirty data, simply rerun it once the problem is fixed; for example, Hive writes can use INSERT OVERWRITE (see the sketch below).
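
As an illustration of the idempotence approach, a minimal HiveQL sketch, assuming a hypothetical source table employee_src and a target Hive table employee:

    -- Rerunning this statement is idempotent: it replaces the table's contents
    -- instead of appending, so dirty data from a failed attempt is overwritten.
    INSERT OVERWRITE TABLE employee
    SELECT id, name, salary FROM employee_src;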
