Troubleshooting a Flume anomaly on TBDS

Copyright notice: This is an original article by Wang Liang. If you reprint it, please cite the source:
Article original link: https://www.qcloud.com/community/article/214

Source: Tengyun https://www.qcloud.com/community

Symptom

After running for a long time, the machines hosting the Flume cluster ran out of disk space. The Flume log directory turned out to be the cause.

Specific problem

The large Flume log file shows that a MySQL-related sink keeps throwing an exception and printing a large volume of log output.

Analysis process

The exception in the log is:
com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: No operations allowed after statement closed
Taken literally, it means that the MySQL statement (and its connection) had already been closed while a transaction commit was still being attempted, so an exception was thrown. Why the exception keeps being thrown, however, needs deeper analysis.

Configuration analysis

Since the exception is thrown by Flume and is related to MySQL, the scope narrows to finding which Flume component writes to MySQL. (The Flume configuration is typically located in /etc/flume/conf/agent/flume.conf.)


According to the configuration, there is only one piece of MySQL-related logic: read the HiveServer log, filter out the SQL statements (in the metadata collec* filter), and have the sink store the results in the MySQL table hive_run_sqlinfo.

Flume agent logic analysis

The sink above calls the class com.tencent.tbds.flume.sink.MysqlSinkForMetadata, which is a custom class. We located the class's jar on the referenced path and decompiled it. The basic logic and comments are as follows:

Sink initialization phase

Sink Loop Execution Phase



Sink shutdown phase

The shutdown phase simply checks whether the connection exists.

Possible causes

From the sink's logic, the sink status becomes BACKOFF only when the connection is null; in every other case the status is READY. The connection state is not checked before or after a transaction is committed to MySQL, and even if the SQL throws an exception, the sink status is not modified. As a result, once a commit throws an exception, the sink loop runs again, commits again, and throws again. This is the root cause of the continuous exceptions.

So when was the connection actually closed? There are two possible reasons: (1) the sink had no interaction with MySQL for a long time, exceeding the connection's automatic shutdown timeout, or (2) MySQL shut down abnormally.
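
The behavior described above can be reproduced in a small simulation. This is only a sketch: the decompiled code is not shown in this article, so every name here (FaultySinkSketch, FakeConnection, process, simulate) is hypothetical, and FakeConnection is a stand-in for a JDBC connection rather than the real MySQL driver. It assumes that the Flume runner calls the sink again right away whenever the returned status is READY.

```java
import java.sql.SQLException;

// Hypothetical stand-in for the decompiled sink; not the real MysqlSinkForMetadata code.
class FaultySinkSketch {
    enum Status { READY, BACKOFF }

    // Stand-in for a JDBC connection whose commit fails once the server has closed it.
    static class FakeConnection {
        boolean closed = false;

        void commit() throws SQLException {
            if (closed) {
                throw new SQLException("No operations allowed after statement closed");
            }
        }
    }

    FakeConnection conn = new FakeConnection();
    int loggedErrors = 0;

    // Mirrors the described logic: BACKOFF only when the connection object is null.
    Status process() {
        if (conn == null) {
            return Status.BACKOFF;
        }
        try {
            conn.commit();
        } catch (SQLException e) {
            loggedErrors++;      // the exception is logged, but the status is left unchanged
        }
        return Status.READY;     // READY makes the runner call process() again immediately
    }

    // Counts the errors logged over a number of runner iterations after MySQL drops the connection.
    static int simulate(int iterations) {
        FaultySinkSketch sink = new FaultySinkSketch();
        sink.conn.closed = true; // the object still exists, so the null check never triggers
        for (int i = 0; i < iterations; i++) {
            if (sink.process() == Status.BACKOFF) {
                break;
            }
        }
        return sink.loggedErrors;
    }

    public static void main(String[] args) {
        System.out.println("errors logged in 1000 iterations: " + simulate(1000));
    }
}
```

Every iteration logs one error and none of them backs off, which is consistent with the log growing until the disk is full.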

Issue confirmation

Did the sink go a long time without interacting with MySQL?
The MySQL timeout configuration was queried:

It is set to the default of 28,800 seconds, or 8 hours.
The HiveServer logs were then checked to count the number of SQL executions per hour:

The counts show that the sink was not disconnected from MySQL because of a long period without interaction.
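
An hourly count like the one described can be produced with a few lines of code. The sketch below rests on assumptions that do not come from the article: each log line is assumed to start with a "yyyy-MM-dd HH:mm:ss" timestamp, the marker string that identifies a SQL execution is passed in by the caller, and the sample lines in main are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; the log layout is an assumption, not the actual HiveServer format.
class HourlySqlCount {
    // Groups matching lines by the first 13 characters of the line, i.e. "yyyy-MM-dd HH".
    static Map<String, Integer> countPerHour(List<String> lines, String marker) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by hour
        for (String line : lines) {
            if (line.length() < 13 || !line.contains(marker)) {
                continue; // skip lines that are too short or are not SQL executions
            }
            counts.merge(line.substring(0, 13), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines, for illustration only.
        List<String> sample = Arrays.asList(
            "2017-01-01 10:05:00 INFO Executing command: select 1",
            "2017-01-01 10:40:12 INFO Executing command: select 2",
            "2017-01-01 11:00:01 INFO Executing command: select 3");
        System.out.println(countPerHour(sample, "Executing command"));
    }
}
```

If no hour-long gap in the counts comes close to the 8-hour timeout, an idle timeout can be ruled out.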

Was the service disconnected manually?
The MySQL start time was queried:

The time of the Flume exceptions is as follows (taken from the timestamps in the content of the transactions whose commits failed):

The times match.

Conclusion
A MySQL service exception interrupted the connection while Flume was committing transactions. Flume did not handle the exception, so it kept retrying the commit in an endless loop, and in this state Flume could not work properly.

Reproducing the problem

Based on the inference above, the exception can be verified as follows:

Generate HiveServer logs

Run several Hive SQL statements in Hue.

Manually force MySQL to shut down


Manually restart the MySQL instance that Flume writes to.

Check Flume's behavior


Flume enters an infinite loop of throwing exceptions, so the verification succeeds.

Summary

The main cause here is the chain reaction triggered by the MySQL service exception. As a stopgap, the sink code could catch the transaction commit exception and set the sink's next status to BACKOFF, so that continuous log output no longer fills the machine's disk and affects other services (to be verified).
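
A minimal sketch of that stopgap follows. It is not the actual patch, and all names (BackoffSinkSketch, FakeConnection, serverUp) are hypothetical. Beyond the suggestion above, it also discards the dead connection and reconnects on a later call, which is an assumption about how recovery might be handled.

```java
import java.sql.SQLException;

// Hypothetical sketch of the proposed fix; not the real MysqlSinkForMetadata code.
class BackoffSinkSketch {
    enum Status { READY, BACKOFF }

    // Stand-in for a JDBC connection whose commit fails once the server has closed it.
    static class FakeConnection {
        boolean closed = false;

        void commit() throws SQLException {
            if (closed) {
                throw new SQLException("No operations allowed after statement closed");
            }
        }
    }

    FakeConnection conn = new FakeConnection();
    boolean serverUp = true;   // simulated MySQL availability
    int loggedErrors = 0;

    Status process() {
        if (conn == null) {
            if (!serverUp) {
                return Status.BACKOFF;   // cannot reconnect yet, so wait quietly
            }
            conn = new FakeConnection(); // reconnect (an addition beyond the article's suggestion)
        }
        try {
            conn.commit();
            return Status.READY;
        } catch (SQLException e) {
            loggedErrors++;              // log once for the failed commit
            conn = null;                 // drop the dead connection
            return Status.BACKOFF;       // ask the runner to wait before the next call
        }
    }

    public static void main(String[] args) {
        BackoffSinkSketch sink = new BackoffSinkSketch();
        sink.conn.closed = true;         // MySQL went away
        sink.serverUp = false;
        for (int i = 0; i < 1000; i++) {
            sink.process();
        }
        System.out.println("errors logged while MySQL is down: " + sink.loggedErrors);
        sink.serverUp = true;            // MySQL is back
        System.out.println("status after MySQL returns: " + sink.process());
    }
}
```

In this simulation only the first failed commit is logged while MySQL is down, and the sink returns to READY once a new connection can be made.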
