Causes of repeated consumption in Kafka
Root cause: the data has already been consumed, but its offset has not been committed.
Cause 1: The consumer thread is forcibly killed, so the offsets of data that has already been consumed are never committed.
Cause 2: Offsets are set to auto commit and the consumer is being shut down. If consumer.unsubscribe() is called before close(), some offsets may not be committed, and that data will be consumed again after the next restart. For example:
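try {
    // Called before close(): this is the step that can leave offsets uncommitted
    consumer.unsubscribe();
} catch (Exception e) {
    // exception ignored
}
try {
    consumer.close();
} catch (Exception e) {
    // exception ignored
}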
With the code above, some offsets are not committed, so the corresponding data is consumed again at the next startup.
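One possible way to avoid this is to skip unsubscribe(), commit explicitly, and always close. The following is only a minimal sketch of that ordering. CommittingConsumer is a hypothetical stand-in for the two KafkaConsumer methods used here (commitSync() and close()), so that the example compiles without the client library, and ConsumerShutdown is a hypothetical helper name.

```java
// Hypothetical stand-in for the two KafkaConsumer methods this sketch uses.
interface CommittingConsumer {
    void commitSync();
    void close();
}

// Hypothetical helper, not part of the Kafka client library.
class ConsumerShutdown {
    // Commits first, then closes. unsubscribe() is deliberately not called.
    static void shutdown(CommittingConsumer consumer) {
        try {
            // Commit the current position while the assignment is still held
            consumer.commitSync();
        } catch (RuntimeException e) {
            // A failed final commit means some data may be consumed again
            System.err.println("final offset commit failed: " + e);
        } finally {
            // Close even if the commit failed
            consumer.close();
        }
    }
}
```

This assumes that every record returned by the last poll() has already been processed before shutdown() runs. For Cause 1, the consumer thread can be stopped with consumer.wakeup() instead of being killed, so that it reaches a shutdown path like this one.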
Cause of data loss in the Kafka consumer
Hypothesis: offsets are set to be committed automatically on a timer. When a timed auto commit fires, some fetched data is still in memory and has not been processed yet. If the thread is killed at that moment, the offsets have already been committed but the data was never processed, so that in-memory data is lost.
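If this hypothesis holds, one common countermeasure is to turn off auto commit and commit only after processing has finished. Below is a minimal sketch of the configuration side only. ManualCommitConfig and its parameters are hypothetical names, the String deserializers are an assumption about the message format, and the property keys are standard Kafka consumer settings.

```java
import java.util.Properties;

// Hypothetical helper that builds consumer settings with auto commit disabled.
class ManualCommitConfig {
    static Properties consumerProperties(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("group.id", groupId);
        // Offsets are no longer committed on a timer behind the application's back
        props.setProperty("enable.auto.commit", "false");
        // Assumption: keys and values are plain strings
        props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        return props;
    }
}
```

With these settings the consuming loop would poll, process every record in the batch, and only then call consumer.commitSync(). If the thread dies between processing and commit, the result is repeated consumption instead of lost data.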
Scheme for recording and restoring offsets
In theory, if offsets are recorded, the next consumer in the group can resume consuming from the recorded offset position.
Offset recording scheme:
Each time a record is consumed, update the offset for its topic+partition in memory.
Map<key, value>, key = topic + '-' + partition, value = offset
When the consumer thread is shut down, write the offsets from the map above to a file (a distributed cluster may need to record them in Redis instead).
The next time the consumer starts, read the previously saved offsets and look up the offset in that map, using the current topic+partition as the key.
Then call consumer.seek() to move the position to the last recorded offset.
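The steps above can be sketched for the single-server, local-file case as follows. OffsetStore and its method names are hypothetical. One assumption is made loudly here: the stored value is the last consumed offset plus one, so that seeking to it resumes at the first unprocessed record rather than re-reading the last one.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of the Kafka client library.
class OffsetStore {
    // key = topic + "-" + partition, value = next offset to read
    private final Map<String, Long> offsets = new ConcurrentHashMap<>();

    static String key(String topic, int partition) {
        return topic + "-" + partition;
    }

    // Call after a record has been fully processed.
    void recordConsumed(String topic, int partition, long consumedOffset) {
        // Assumption: keep consumedOffset + 1 and never move backwards
        offsets.merge(key(topic, partition), consumedOffset + 1, Long::max);
    }

    // Returns the stored resume position, or -1 if nothing was recorded.
    long resumePosition(String topic, int partition) {
        return offsets.getOrDefault(key(topic, partition), -1L);
    }

    // Writes the in-memory map to a local file when the consumer shuts down.
    void saveTo(Path file) {
        Properties props = new Properties();
        offsets.forEach((k, v) -> props.setProperty(k, Long.toString(v)));
        try (OutputStream out = Files.newOutputStream(file)) {
            props.store(out, "consumer offsets");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the file written by saveTo; a missing file yields an empty store.
    static OffsetStore loadFrom(Path file) {
        OffsetStore store = new OffsetStore();
        if (!Files.exists(file)) {
            return store;
        }
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(file)) {
            props.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (String k : props.stringPropertyNames()) {
            store.offsets.put(k, Long.parseLong(props.getProperty(k)));
        }
        return store;
    }
}
```

On startup, for each assigned partition whose resumePosition is not -1, the consumer would call consumer.seek(new TopicPartition(topic, partition), position) before the first poll. For a cluster, the file would be replaced by a shared store such as Redis, which this sketch does not cover.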
Notes
1. For a single server the scheme is fairly simple: writing the offsets directly to a local file is enough. For a cluster of multiple servers, the offsets must be recorded in one shared place, and some extra handling is needed.
If the production program runs as a cluster of multiple servers, could it be run on a single server instead? It should be possible; consumption would be a little slower, but the impact would be small.
2. How to ensure that consumption resumes from the correct offset
To make sure the data a consumer reads follows on from what the previous consumer read, record the offset of the first record fetched when the consumer starts consuming and compare it with the offset recorded by the previous consumer. If they are the same, continue consuming. If they differ, stop consuming and investigate the cause.
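The comparison can be sketched as a small check. OffsetContinuityCheck and the result names are hypothetical, and expectedOffset is assumed to be the position the consumer was told to resume from. A gap is not always a loss: compacted topics and transactional topics can legitimately skip offsets, so a GAP result is a reason to investigate rather than proof of lost data.

```java
// Hypothetical helper for comparing the first fetched offset with the recorded one.
class OffsetContinuityCheck {
    enum Result {
        CONTINUE, // first record is exactly where the previous consumer stopped
        GAP,      // first record is ahead of the recorded position: data may have been skipped
        REPLAY    // first record is behind the recorded position: data is being consumed again
    }

    static Result check(long expectedOffset, long firstFetchedOffset) {
        if (firstFetchedOffset == expectedOffset) {
            return Result.CONTINUE;
        }
        return firstFetchedOffset > expectedOffset ? Result.GAP : Result.REPLAY;
    }
}
```

A consumer would run this once per partition on the first batch after startup and stop consuming on anything other than CONTINUE.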
Research on repeated consumption and data loss in Kafka