Kafka is designed for distributed environments. The log file can essentially be understood as a message database; if it all lives in one place, availability inevitably suffers: one machine goes down and everything is unavailable. If, instead, a full copy is replicated to every machine, the data is excessively redundant, and since each machine's disk is finite, the total volume of messages is still capped by a single disk's size no matter how many machines there are. Hence the concept of the partition.
Kafka computes a hash over each message and uses it to choose a partition. In this way, one log file is split into multiple parts, as shown in the partition read/write log diagram above. Even on a single broker this is easy to try: if we create a new topic with --replication-factor 1 --partitions 2, then in the log directory we will see a test-0 directory and a test-1 directory. Those are the two partitions.
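The partition choice can be sketched as follows. This is a simplified illustration, not Kafka's actual partitioner (the real default hashes the message key with murmur2); the function name and use of Python's built-in `hash()` are stand-ins to show the idea.

```python
# Simplified sketch of hash-based partitioning (assumption: Kafka's real
# default partitioner uses murmur2 on the key bytes; hash() is a stand-in).
def choose_partition(key: str, num_partitions: int) -> int:
    # Map the key to one of the partitions.
    return hash(key) % num_partitions

# With --partitions 2, every message lands in partition 0 or 1,
# i.e. the test-0 or test-1 log directory.
p = choose_partition("order-42", 2)
assert p in (0, 1)

# Messages with the same key always map to the same partition,
# which is what preserves per-key ordering.
assert choose_partition("order-42", 2) == choose_partition("order-42", 2)
```

Because the mapping is deterministic per key, all messages sharing a key stay in one partition and keep their relative order.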
You may think this makes little difference. The point becomes clear once there are multiple brokers. Here is a picture (the original is in the reference link):
This topic has 4 partitions and a replication factor of 2: all messages are spread across the 4 partitions for storage, and for high availability each partition is kept in 2 redundant copies. The assignment algorithm then distributes the resulting 8 partition replicas across the broker cluster.
The result is that each broker stores less than the full data set, yet every piece of data is redundant, so one machine going down does not affect availability. Say Broker1 fails: the remaining three brokers together still hold a full set of the partition data, so the cluster keeps working. If another broker then fails, the data may become incomplete. Of course, you can configure more redundancy; with a replication factor of 4, every machine holds complete copies of partitions 0 through 3 and can survive several failures. It is a trade-off between storage footprint and high availability.
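The replica placement described above can be sketched with a simple round-robin layout. This is an assumption-laden simplification, not Kafka's exact assignment algorithm; the broker names are hypothetical.

```python
# Sketch (assumption: simplified round-robin, not Kafka's real algorithm):
# spread each partition's replicas across brokers so that no two replicas
# of the same partition land on the same broker.
def assign_replicas(num_partitions, replication_factor, brokers):
    assignment = {}
    for p in range(num_partitions):
        # Start each partition on a different broker, then take the next
        # broker(s) in order for the remaining replicas.
        assignment[p] = [brokers[(p + r) % len(brokers)]
                         for r in range(replication_factor)]
    return assignment

brokers = ["broker0", "broker1", "broker2", "broker3"]
layout = assign_replicas(4, 2, brokers)   # 4 partitions x 2 replicas = 8 copies

# No broker holds two replicas of the same partition.
assert all(len(set(r)) == 2 for r in layout.values())

# If broker1 dies, every partition still has at least one surviving replica,
# so the full data set remains available.
survivors = {p: [b for b in r if b != "broker1"] for p, r in layout.items()}
assert all(len(r) >= 1 for r in survivors.values())
```

The two assertions at the end mirror the argument in the text: redundancy costs storage, but a single broker failure leaves a complete copy of every partition somewhere in the cluster.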
As for downtime: ZooKeeper will elect a new partition leader to continue providing service. That is the subject of the next article.
Each consumer process belongs to a consumer group.
To be precise, each message is delivered to only one process within each consumer group.
Therefore, a consumer group makes many processes or multiple machines appear logically as a single consumer. The consumer group is a very powerful concept that can be used to support the semantics of either a queue or a topic as in JMS.
To support queue semantics, we put all the consumers in a single consumer group, in which case each message is delivered to exactly one consumer.
To support topic semantics, each consumer is given its own consumer group, so every consumer receives every message.
In practice, a more common scenario is to define several consumer groups, each consisting of a cluster of consumer machines acting as one logical whole. With big data, Kafka has an added advantage: for a given topic, no matter how many consumers subscribe to it, the messages are stored only once.
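The queue-versus-topic semantics above can be sketched with a toy delivery function. This is a hypothetical helper for illustration, not the Kafka client API; the group and consumer names are made up.

```python
# Sketch of consumer-group delivery semantics (hypothetical, not Kafka's API):
# within one group, each message goes to exactly one member; different groups
# each receive their own copy of every message.
def deliver(message_id: int, groups: dict) -> list:
    recipients = []
    for members in groups.values():
        # Pick one member per group (round-robin by message id for the sketch).
        recipients.append(members[message_id % len(members)])
    return recipients

# Queue semantics: all consumers in a single group -> one recipient per message.
queue = {"workers": ["c1", "c2", "c3"]}
assert len(deliver(7, queue)) == 1

# Topic semantics: one group per consumer -> every consumer gets the message.
topic = {"g1": ["c1"], "g2": ["c2"], "g3": ["c3"]}
assert sorted(deliver(7, topic)) == ["c1", "c2", "c3"]
```

Either way, the message itself is stored once in the topic's partitions; only the delivery bookkeeping (each group's offset) differs per group.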