First of all, this is my own article, but it also draws on posts by experts around the web together with my own summary. Corrections from the experts are welcome! Let's make progress together.
1. Where Kafka's data exchange happens. Kafka is designed to complete data exchange in memory as far as possible, whether the exchange is with an external system or with the operating system internally. If production by the producer and consumption by the consumer are coordinated well, data exchange with zero disk I/O is possible in principle, but in practice this is almost impossible to achieve.

2. Where Kafka caches data. In Kafka, the broker accepts requests from producers and consumers and persists messages to the local disk.
3. How Kafka loses data.

3.1 At the broker: data that has been written to disk is not lost unless the disk itself is broken.

3.2 Data in memory that has not yet been flushed: if the broker restarts before the flush, that data is lost. Flushing is an internal Kafka mechanism: Kafka completes its data exchange in memory first and persists data to disk with O(1) time complexity, caching records in memory and then flushing them in batches. The flush interval can be configured via log.flush.interval.messages and log.flush.interval.ms. Since version 0.8.0, however, durability is guaranteed through the replica mechanism instead; the price is more resources, especially disk, and Kafka currently supports gzip and snappy compression to mitigate that. Whether to use replicas therefore depends on the balance between reliability and resource cost. Replication (replica) is a built-in mechanism of Kafka.
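As a rough illustration of those knobs, the sketch below creates a topic with a replication factor of 2 and overrides the flush settings at the topic level (flush.messages / flush.ms, the per-topic counterparts of log.flush.interval.messages / log.flush.interval.ms). It uses the AdminClient API of the newer Java client rather than the 0.8-era tooling, where these values were set in server.properties; the broker address and topic name are placeholders.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 2: durability comes from replicas,
            // at the cost of extra disk and network usage.
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 2);

            // Topic-level overrides of the broker flush settings: force an fsync
            // every 10000 messages or every 1000 ms, whichever comes first.
            // Normally Kafka relies on the OS page cache and on replication
            // rather than on frequent fsyncs.
            topic.configs(Map.of(
                    "flush.messages", "10000",
                    "flush.ms", "1000"));

            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```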
3.3 Producer to broker: the producer pushes data to the broker. With request.required.acks set to 1, a send that is lost is re-sent, so the probability of loss is very small. A minimal producer sketch follows.
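This is a minimal sketch of such a producer, assuming the newer Java client where the acks setting plays the role of the old request.required.acks; the broker address and topic name are placeholders.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class AckedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // acks=1: the partition leader must confirm the write before the send
        // succeeds; combined with retries this makes loss unlikely, though a
        // leader crash before replication can still drop the message.
        props.put("acks", "1");
        props.put("retries", "3");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("demo-topic", "key-1", "hello"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            // The client retries internally; log anything that
                            // still fails after retries are exhausted.
                            exception.printStackTrace();
                        }
                    });
            producer.flush();
        }
    }
}
```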
3.4 Broker to consumer: the Kafka consumer offers two interfaces. One is the high-level interface; when Kafka is combined with other frameworks, for example storm-kafka-0.8-plus (or Spark plugins), the management of partitions and offsets is already encapsulated, and by default offsets are committed automatically at regular intervals, which may cause data loss. The other is the low-level interface: when Kafka is used with storm-kafka-0.8-plus (or other Spark plugins) in this way, the application itself manages the correspondence between spout threads and partitions, and the consumed offset on each partition (periodically written to ZK). The offset is only updated in ZK after it has been acked by Storm, that is, after the message has been processed successfully, so data loss is essentially prevented. Even if a spout thread crashes, on restart it can read the corresponding offset back from ZK. A sketch of the same commit-after-processing idea is shown below.
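The storm-kafka spout itself is framework specific, but the same principle of committing the offset only after successful processing can be sketched with the plain Java consumer by disabling auto-commit and calling commitSync once the records have been handled. This illustrates the pattern, not the storm-kafka implementation; broker address, group id and topic are placeholders.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class CommitAfterProcessingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", "demo-group");              // placeholder group id
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        // Disable the periodic auto-commit that can acknowledge records
        // before they have actually been processed.
        props.put("enable.auto.commit", "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("demo-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // do the real work first
                }
                // Commit only after successful processing; a crash before this
                // point means the records are re-read (at-least-once), not lost.
                consumer.commitSync();
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("partition=%d offset=%d value=%s%n",
                record.partition(), record.offset(), record.value());
    }
}
```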
4. Kafka data duplication. Kafka is designed around at-least-once semantics, so data may be duplicated. Kafka uses a time-based SLA (service level agreement): a message is kept for a certain time (usually 7 days) and then deleted. Duplicates generally have to be handled on the consumer side. When log.cleanup.policy=delete, this periodic deletion mechanism is used.

5. If the above still cannot prevent duplicate consumption, we can deduplicate with other tools, such as Redis: store a key for each Kafka record in Redis and deduplicate there (a minimal sketch is given at the end of this article). Alternatively, deduplicate using an encoded form of the record (Kafka has APIs for encoding).

6. However, using Redis or encoding creates new problems: what do we do if the connection to Kafka has a problem, or the data is missing from Redis? That is a problem without a clean solution.
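One way to sketch the Redis-based deduplication from point 5, assuming the Jedis client and a made-up message-id scheme (for example topic-partition-offset, or an encoded digest of the record): a SET ... NX with an expiry tells the consumer whether it has already seen a record.

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class RedisDeduplicator {
    private final Jedis jedis;

    public RedisDeduplicator(Jedis jedis) {
        this.jedis = jedis;
    }

    /**
     * Returns true if this message id has not been seen before.
     * The key expires after 7 days, matching the usual Kafka retention,
     * so Redis does not grow without bound.
     */
    public boolean firstTimeSeen(String messageId) {
        // SET key value NX EX 604800: succeeds only if the key is absent.
        String result = jedis.set("dedup:" + messageId, "1",
                SetParams.setParams().nx().ex(7 * 24 * 3600));
        return "OK".equals(result);
    }

    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) { // placeholder Redis address
            RedisDeduplicator dedup = new RedisDeduplicator(jedis);
            String messageId = "demo-topic-0-42"; // e.g. topic-partition-offset
            if (dedup.firstTimeSeen(messageId)) {
                System.out.println("process the record");
            } else {
                System.out.println("duplicate, skip it");
            }
        }
    }
}
```

The 7-day expiry mirrors the usual retention period, so the dedup keys age out together with the messages. If Redis is unreachable, the consumer has to choose between blocking and risking duplicates, which is exactly the trade-off point 6 warns about.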