Dear friends, I have recently studied Kafka and have read in many places that Kafka may lose messages.
I really don't know in what scenarios a log system can tolerate message loss.
For example, in a real-time log analysis system, the log information I see may be incomplete. If abnormal logs are not displayed, how can the problem be located?
We can also see that the crash of a node in a distributed Kafka cluster may lead to the loss of the messages on that node (comparisons between Kafka and RabbitMQ mention that RabbitMQ does not have this problem).
If Kafka is so unreliable, why are so many companies using it?
Reply content:
I'm not familiar with how RabbitMQ works internally.
Messages can be lost in Kafka mainly at two stages.
Flushing messages to disk
Messages can be flushed to disk either synchronously or asynchronously, and synchronous flushing is significantly more reliable than asynchronous flushing. In some scenarios, however, performance is pursued at the expense of reliability, so asynchronous flushing is used instead; see the sketch below.
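As a rough illustration of where this is tuned, these are standard Kafka broker settings from server.properties (the values here are only illustrative, not a recommendation):

    # server.properties (illustrative values)
    # Force an fsync after every message: maximum reliability,
    # at a significant throughput cost.
    log.flush.interval.messages=1
    # Or flush on a timer instead:
    # log.flush.interval.ms=1000
    # By default both are effectively unset: Kafka leaves flushing
    # to the OS page cache and relies on replication for durability.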
Maintaining messages once they are stored
This is not about persistence itself. Oracle and MySQL have been storing data for so long that their disaster-recovery tools are very complete and form an ecosystem (if something goes wrong, you can find people to solve it). Very few people understand Kafka's storage internals, and there are very few tools for it!
There is also the disk as a storage medium. Without RAID, a single damaged disk can mean lost data; with RAID, the cost goes up. If you instead keep multiple replicas, network synchronization latency causes momentary inconsistency between the copies.
Conclusion: requiring Kafka to lose no data at all is achievable (short of a major disaster, for example the data center being bombed by an atomic bomb, or a low-level RAID failure during synchronization). The cost is some performance; the settings usually combined for this are sketched below.
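For reference, a minimal sketch of the "no loss" combination, using standard Kafka configuration keys (the values are illustrative):

    # Broker/topic side: keep multiple in-sync copies
    default.replication.factor=3
    min.insync.replicas=2
    unclean.leader.election.enable=false

    # Producer side: a write succeeds only after all
    # in-sync replicas have acknowledged it
    acks=all
    retries=2147483647

Every extra acknowledged copy costs latency and throughput, which is exactly the trade-off described above.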
Therefore, Kafka is generally used in scenarios where losing a small amount of data is acceptable but overall throughput is very large, such as log collection, or statistical analysis of data (a few hundred lost records do not affect a sample space of hundreds of millions).
Kafka can also be used to synchronize data between two reliable stores, such as MySQL (write) -> MySQL (read). Because the MySQL write side guarantees that the data can be replayed, recovery speed and reliability can be ensured even when Kafka fails.
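A minimal Java sketch of the consuming side of such a pipeline, using the standard kafka-clients consumer API; the topic name and the writeToMysql() helper are hypothetical. Committing offsets only after the MySQL write succeeds gives at-least-once delivery, so a Kafka or consumer failure leads to replay rather than loss:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class MysqlSyncConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // illustrative address
            props.put("group.id", "mysql-sync");              // hypothetical group id
            props.put("enable.auto.commit", "false");         // commit manually, after the DB write
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("binlog-events")); // hypothetical topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        writeToMysql(record.value()); // hypothetical: apply the change to the read-side MySQL
                    }
                    // Commit only after the writes succeed: on a crash we re-read
                    // the uncommitted batch (at-least-once) instead of losing it.
                    consumer.commitSync();
                }
            }
        }

        private static void writeToMysql(String value) {
            // placeholder for an idempotent upsert into the read replica
        }
    }

Because replayed records may be applied twice, the MySQL write should be idempotent (for example an upsert keyed on the record's primary key).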