Flume-Kafka-Storm Log Processing Experience

Transferred from: http://www.aboutyun.com/thread-9216-1-1.html

Several difficulties in using Storm for transactional real-time computing requirements: http://blog.sina.com.cn/s/blog_6ff05a2c0101ficp.html

This is about recent log processing — note, log processing. Stream computation over financial data such as exchange market quotes cannot be handled this "roughly": that kind of work must also guarantee data integrity and accuracy. Below is a small summary from practice, offered as a reference for others doing log analysis; you are also welcome to share the situations you have run into:

One
Flume delivers real-time data to Kafka faster than a single spout can keep up with, so the Storm spout cannot consume from Kafka at the required rate. The delay is mainly caused by the HBase computation performed after the data is emitted into the stream (this part has since been optimized with in-memory computation). Looking at the characteristics of the tuples: each log line becomes a very small tuple, and there are a huge number of them. With the current spout, many tuples accumulate in the stream, and the timeout automatically calls back the fail() function (although in practice this does not affect the results).

Some Storm characteristics, for reference: http://www.aboutyun.com/thread-8527-1-1.html
(1) Storm's single-pipeline processing capacity is roughly 20,000 tuples/s (with each tuple about 1000 bytes).
(2) Storm's own processing latency is at the millisecond level; JVM GC generally has limited impact on system performance, but when memory is tight, GC becomes a bottleneck for the system.
In practice we found that with too many tuples, and because every Kafka message requires a new String(), GC exceptions get thrown.
Given the situations and phenomena above, I think the tuple structure can be optimized: package multiple logs into one tuple and emit them together for processing (see the sketch below).
In general, though, emitting one log per tuple is already efficient enough.
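
As a rough illustration of the batching idea, here is a minimal sketch of a helper that buffers several log lines and emits them as one tuple. It assumes the org.apache.storm package names (older releases use backtype.storm); the batch size, field layout, and class name are made up.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.tuple.Values;

// Sketch: pack multiple small log lines into one tuple before emitting,
// so the stream carries fewer, larger tuples instead of many tiny ones.
public class BatchingEmitter {
    private static final int BATCH_SIZE = 100;            // hypothetical batch size
    private final List<String> buffer = new ArrayList<>();
    private final SpoutOutputCollector collector;

    public BatchingEmitter(SpoutOutputCollector collector) {
        this.collector = collector;
    }

    // Called from the spout's nextTuple() with each raw log line.
    public void add(String logLine) {
        buffer.add(logLine);
        if (buffer.size() >= BATCH_SIZE) {
            // Emit the whole batch as a single tuple (unanchored, no message id).
            collector.emit(new Values(new ArrayList<>(buffer)));
            buffer.clear();
        }
    }
}
```

The receiving bolt then iterates over the list carried by each tuple instead of handling one log line per tuple.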

Two
For the data fetched by the KafkaSpout: as far as my business is concerned, I do not need to pay much attention to data integrity, so throughout the stream I avoid using ack and fail. That is, once the spout has fetched and emitted the data, it no longer cares whether the data was processed correctly, timed out, and so on.
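
In Storm, skipping ack/fail concretely means the spout emits without a message id and the bolts emit without anchoring to the input tuple, so the acker never tracks anything. A minimal bolt sketch under that assumption (org.apache.storm package names; the computation and field names are placeholders):

```java
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Sketch: a bolt that emits unanchored tuples, so downstream failures or
// timeouts never propagate back to the spout as fail() callbacks.
public class FireAndForgetBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        String line = input.getString(0);
        // Emit WITHOUT anchoring to 'input': no ack/fail tracking downstream.
        collector.emit(new Values(line.length()));   // hypothetical computation
        collector.ack(input);                        // ack right away regardless
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("len"));
    }
}
```

On the spout side, calling collector.emit(values) without a msgId argument has the same effect: Storm will never invoke ack() or fail() for that tuple.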

Three
There was one misunderstanding. After throttling the spout's fetch rate, the number of fails became very small; but while back-filling data for a period, the spout fetched several thousand records at once, and one bolt interacts with HBase very frequently, so data accumulated and lagged in the stream. The UI showed a large number of fails, and at first I thought processing had actually failed; later, comparing against the data, I found the computed results contained few errors, so I suspect the fails came from the timeout calling back the fail() function.
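
If those timeout-driven fails are a nuisance, the usual knobs are the tuple timeout and the cap on in-flight tuples, both set on the topology configuration at submit time. A small sketch; the numbers are arbitrary:

```java
import org.apache.storm.Config;

public class TopologyTuning {
    public static Config tunedConfig() {
        // Sketch: topology-level settings that reduce spurious timeout fails
        // when a slow HBase bolt lets tuples pile up in the stream.
        Config conf = new Config();
        conf.setMessageTimeoutSecs(120);  // default tuple timeout is 30s
        conf.setMaxSpoutPending(5000);    // cap on un-acked tuples per spout task
        return conf;
    }
}
```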

Four
On persisting to HBase: although HBase's efficiency is quite good, I found that for some business logic, using HBase alone still introduces a fairly large delay. So the frequently used data tables can be synchronized into memory and organized as maps or other structures for the computation. The key point is to keep them synchronized back to HBase; otherwise, when Storm or a worker dies, the results will be wrong.
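
A minimal sketch of that pattern, assuming the HBase 1.x+ client API and hypothetical table and column names: hot counters live in a HashMap for the computation, and a periodic flush writes them back to HBase so a restarted worker does not lose or miscount state.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: compute against an in-memory map, and flush it back to HBase
// periodically so a crashed worker can rebuild its state instead of miscounting.
public class CountCache {
    private final Map<String, Long> counts = new HashMap<>();
    private final Connection connection;

    public CountCache(Connection connection) {
        this.connection = connection;
    }

    public void increment(String key) {
        counts.merge(key, 1L, Long::sum);   // all hot-path updates stay in memory
    }

    // Called on a timer or a Storm tick tuple to push in-memory state to HBase.
    public void flush() throws IOException {
        try (Table table = connection.getTable(TableName.valueOf("log_counts"))) {
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                Put put = new Put(Bytes.toBytes(e.getKey()));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"),
                              Bytes.toBytes(e.getValue()));
                table.put(put);
            }
        }
    }
}
```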

Five
Some bugs you may run into:
(1) ZooKeeper cluster outage. This error is rare, but it happened to me and took Storm down with it; since my data backend is HBase, all of the computation failed. So it is better to have a monitoring system that can watch ZooKeeper, HBase, Storm, and the other base platform components, to avoid wasting time hunting for the fault (see the small probe sketch below).
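
As one small building block of such monitoring, ZooKeeper answers the standard "ruok" four-letter command on its client port with "imok" (newer ZooKeeper versions require it to be whitelisted via 4lw.commands.whitelist). A minimal probe sketch with placeholder host and port:

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Sketch: probe a ZooKeeper node with the standard "ruok" four-letter word;
// a healthy server replies "imok" and then closes the connection.
public class ZkHealthCheck {
    public static boolean isAlive(String host, int port) {
        try (Socket socket = new Socket(host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write("ruok".getBytes(StandardCharsets.US_ASCII));
            out.flush();
            InputStream in = socket.getInputStream();
            byte[] reply = new byte[4];
            int read = in.read(reply);
            return read == 4 && "imok".equals(new String(reply, StandardCharsets.US_ASCII));
        } catch (Exception e) {
            return false;   // unreachable or refused: treat the node as down and alert
        }
    }

    public static void main(String[] args) {
        System.out.println(isAlive("zk1.example.com", 2181));  // placeholder host
    }
}
```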

(2) A thread in the KafkaSpout that keeps fetching data from Kafka and then emits after parsing with new String() may throw java.lang.StringIndexOutOfBoundsException: String index out of range: 2. This bug does not always appear, but it happened to me; my plan is to emit the raw byte[] directly as the tuple and do the parsing in the bolt.
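
A sketch of that plan: the spout emits the raw byte[] payload, and the String construction and parsing move into a bolt, where a bad record only costs that one tuple. The charset, field names, and error handling are assumptions; package names are org.apache.storm:

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Sketch: the spout emits the raw Kafka payload as byte[]; this bolt turns it
// into a String and parses it, so a malformed record only affects one tuple.
public class ParseBytesBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        byte[] payload = input.getBinary(0);
        try {
            // Build the String here in the bolt; further field parsing would follow.
            String line = new String(payload, StandardCharsets.UTF_8);  // assumed charset
            collector.emit(new Values(line));
        } catch (RuntimeException e) {
            // A malformed record only costs this one tuple, not the spout thread.
        }
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }
}
```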

(3) The hateful INFO logs
Because logging was configured at the INFO level, Storm's emit and ack INFO logs were enormous; on my side they came to roughly 1 GB per hour, and together with the request logs on the Kafka consumer side they maxed out the disk several times and brought the server down. Take this seriously. My current approach is to change the level from INFO to WARN; I don't know if there is a better way.
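
One programmatic alternative, assuming the worker runs with logback on the classpath (as older Storm releases do), is to raise the root level from code, for example in a bolt's prepare(); a small sketch:

```java
import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import org.slf4j.LoggerFactory;

public class QuietLogs {
    // Sketch: raise the root logger to WARN at runtime, which silences
    // Storm's per-tuple emit/ack INFO output (assumes logback is the binding).
    public static void quiet() {
        Logger root = (Logger) LoggerFactory.getLogger(org.slf4j.Logger.ROOT_LOGGER_NAME);
        root.setLevel(Level.WARN);
    }
}
```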

(4) Open-source KafkaSpouts
There are several open-source KafkaSpouts on git, but some have constraints on the required environment, which needs attention. If the application is simple and undemanding, like mine, you can perfectly well develop your own spout from a plain Kafka consumer instance (a minimal sketch follows).
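
A minimal sketch of such a hand-rolled spout, assuming the modern org.apache.kafka.clients.consumer.KafkaConsumer API and org.apache.storm package names; the broker list, group id, and topic are placeholders, and everything is emitted unanchored:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

// Sketch: a hand-rolled spout wrapping a plain Kafka consumer, for cases
// where integrity guarantees (ack/fail, offset replay) are not required.
public class SimpleKafkaSpout extends BaseRichSpout {
    private transient KafkaConsumer<byte[], byte[]> consumer;
    private transient SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092");   // placeholder broker list
        props.put("group.id", "log-topology");           // placeholder group id
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("log-topic"));  // placeholder topic
    }

    @Override
    public void nextTuple() {
        // Poll a small batch and emit each record unanchored (no message id),
        // passing the raw bytes straight through to the parsing bolt.
        for (ConsumerRecord<byte[], byte[]> record : consumer.poll(Duration.ofMillis(200))) {
            collector.emit(new Values(record.value()));
        }
    }

    @Override
    public void close() {
        consumer.close();
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("bytes"));
    }
}
```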
