Kafka repeat consumption and lost data research

Source: Internet
Author: User


Causes of Kafka repeated consumption

Root cause: the data has been consumed, but the offset was not committed.

Cause 1: the consumer thread is forcibly killed, so the data is consumed but the offset is never committed.

Cause 2: offsets are set to auto-commit, and on shutdown consumer.unsubscribe() is called before consumer.close(). Some offsets may then go uncommitted, and the next restart will consume those records again. For example:

try {
    consumer.unsubscribe();
} catch (Exception e) {
}

try {
    consumer.close();
} catch (Exception e) {
}

The above code can leave some offsets uncommitted, and those records will be consumed again at the next startup.
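One way to avoid this is to commit the processed offsets synchronously before unsubscribing and closing. The sketch below illustrates only the call ordering; RecordingConsumer is a hypothetical stand-in that records which methods were called, not part of the Kafka API (the real client is org.apache.kafka.clients.consumer.KafkaConsumer, whose commitSync(), unsubscribe(), and close() methods have the same names).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Kafka consumer: records the order of lifecycle calls.
class RecordingConsumer {
    final List<String> calls = new ArrayList<>();
    void commitSync()  { calls.add("commitSync"); }
    void unsubscribe() { calls.add("unsubscribe"); }
    void close()       { calls.add("close"); }
}

public class GracefulShutdown {
    // Commit offsets of already-processed records first, so a restart
    // does not re-deliver records that were already handled.
    static List<String> shutdown(RecordingConsumer consumer) {
        try {
            consumer.commitSync();   // flush processed offsets before tearing down
        } catch (Exception e) {
            // log and continue; shutdown must still release resources
        }
        try {
            consumer.unsubscribe();
        } catch (Exception e) {
        }
        try {
            consumer.close();
        } catch (Exception e) {
        }
        return consumer.calls;
    }

    public static void main(String[] args) {
        System.out.println(shutdown(new RecordingConsumer()));
        // prints [commitSync, unsubscribe, close]
    }
}
```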


Cause of Kafka consumer data loss

Guess: offsets are set to auto-commit on a timer. When the offset is committed automatically, the corresponding data may still be sitting in memory, unprocessed. If the thread is killed at that moment, the offset has already been committed but the data has not been processed, so the records held in memory are lost.
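A common way to close this window is to disable auto-commit and only commit after processing succeeds. Below is a minimal sketch of the relevant consumer settings, built with java.util.Properties; the property names enable.auto.commit and auto.offset.reset are standard Kafka consumer configuration keys.

```java
import java.util.Properties;

public class ManualCommitConfig {
    // Consumer settings: turn off timed auto-commit so offsets only advance
    // when the application commits them explicitly, after processing.
    static Properties consumerProps() {
        Properties props = new Properties();
        props.put("enable.auto.commit", "false"); // commit manually after processing
        props.put("auto.offset.reset", "earliest"); // where to start with no committed offset
        return props;
    }

    public static void main(String[] args) {
        System.out.println(consumerProps().getProperty("enable.auto.commit"));
        // prints false
    }
}
```

With this configuration, a kill between fetch and processing replays the records on restart instead of silently dropping them; the trade-off is repeated consumption rather than loss.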



Scheme for recording and recovering offsets

In theory, if the offset is recorded, the next consumer in the group can continue consuming from the recorded offset position.

Offset recording scheme:

On each consumption, update the current offset of each topic+partition in memory:

Map<key, value>, where key = topic + '-' + partition and value = offset

When the consumer thread is closed, write the offset data from the map above to a file (a distributed cluster may need to record it in Redis instead).

On the next consumer startup, read back the last offset by using the current topic+partition as the key to look up the offset in the map loaded from that file.

Then use the consumer.seek() method to position the consumer at the last recorded offset.
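The steps above can be sketched as a small store class. This is a minimal file-based illustration using java.util.Properties for persistence; the class name OffsetStore and its methods are made up for this sketch, and a multi-server cluster would replace the file with a shared store such as Redis, as noted above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Sketch of the scheme above: keep topic+partition offsets in a map,
// flush them to a local file on shutdown, and reload them on startup.
public class OffsetStore {
    private final Map<String, Long> offsets = new HashMap<>();

    static String key(String topic, int partition) {
        return topic + "-" + partition;            // key = topic + '-' + partition
    }

    // Called after each record is consumed.
    public void update(String topic, int partition, long offset) {
        offsets.put(key(topic, partition), offset);
    }

    public Long get(String topic, int partition) {
        return offsets.get(key(topic, partition)); // null if never recorded
    }

    // Called when the consumer thread is closed.
    public void save(Path file) throws IOException {
        Properties props = new Properties();
        offsets.forEach((k, v) -> props.setProperty(k, Long.toString(v)));
        try (OutputStream out = Files.newOutputStream(file)) {
            props.store(out, "consumer offsets");
        }
    }

    // Called on the next startup, before calling seek().
    public void load(Path file) throws IOException {
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(file)) {
            props.load(in);
        }
        props.forEach((k, v) -> offsets.put((String) k, Long.parseLong((String) v)));
    }
}
```

After load(), the application would call consumer.seek() per partition with the recovered offset before polling.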

Description

1. For a program on a single server the scheme is relatively simple: write the offsets directly to a local file. For a multi-server cluster, the offsets must be recorded in one shared place, and concurrent access needs to be handled.

If the online program runs as a cluster of multiple servers, can a single server carry the offset store? It should be able to, though consumption would be somewhat slower; the impact is small.


2. How to ensure the accuracy of the offsets consumed

To ensure that the data a consumer reads follows on exactly from the data the previous consumer consumed:

When consuming, record the offset of the first record fetched and compare it with the offset recorded by the last consumer. If they match, continue consuming; if they differ, stop consuming and investigate the cause.
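The check above reduces to a single comparison. A minimal sketch, assuming the consumer was positioned with seek() at the last recorded offset, so the first record fetched should carry exactly that offset (the class and method names here are made up for illustration):

```java
public class OffsetCheck {
    // Compare the offset of the first record fetched after restart with the
    // offset recorded at the previous shutdown. Returns true when it is safe
    // to continue consuming.
    static boolean safeToContinue(long firstFetchedOffset, long lastRecordedOffset) {
        // After seek(lastRecordedOffset), the first record should carry exactly
        // that offset; anything else means records were skipped or replayed.
        return firstFetchedOffset == lastRecordedOffset;
    }

    public static void main(String[] args) {
        System.out.println(safeToContinue(100, 100)); // true: resume consumption
        System.out.println(safeToContinue(105, 100)); // false: stop and check the cause
    }
}
```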





