Kafka topic offset requirements
Brief: during development we often need to modify the offset that a consumer instance has recorded for a Kafka topic. How do we do it, and why does it work? It is actually quite simple once you look at it from the other direction: if I were implementing a Kafka consumer myself, how would I let my consumer code control where it consumes a topic from? How would I make different consumer groups each receive the same messages of the same topic, while different consumers inside one group consume different messages of that topic? How would I build such a framework?
Here I experiment with Storm's KafkaSpout as the consumer. KafkaSpout uses Kafka's low-level (simple) API, so the structure of the data it stores in ZooKeeper differs from what the Kafka Java client's high-level API stores in ZooKeeper. For the storage structure used by the Kafka Java client's high-level API in ZooKeeper, see this article in the Apache Kafka series: the storage structure of Kafka in ZooKeeper.
Discuss with author: http://www.cnblogs.com/intsmaze/p/6212913.html
Available for website development and Java development work.
Sina Weibo: intsmaze Liu Yang Ge
Create a Kafka topic named intsmazeX with 3 partitions.
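The cluster in this article would most likely have used the kafka-topics.sh tool for this step; as a hedged alternative, here is a sketch with the Java AdminClient (available only in newer Kafka client libraries; the broker address and the replication factor are assumptions, since the article does not state them):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "hadoop001.icccuat.com:6667"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions as in the article; replication factor 1 is an assumption
            admin.createTopics(Collections.singletonList(new NewTopic("intsmazeX", 3, (short) 1))).all().get();
        }
    }
}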
Use KafkaSpout to create a consumer instance for this topic, specifying /kafka-offset as the path where its metadata is stored in ZooKeeper and onetest as the instance id. Start Storm and observe the following output:
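For reference, a minimal sketch of the spout configuration described here, using storm-kafka's SpoutConfig. The ZooKeeper connect string and the parallelism are assumptions, and package names vary between Storm versions (older releases use backtype.storm and storm.kafka, newer ones org.apache.storm):

import storm.kafka.BrokerHosts;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.TopologyBuilder;

public class OffsetDemoTopology {
    public static void main(String[] args) {
        // Broker metadata is read from this ZooKeeper ensemble (assumed address).
        BrokerHosts hosts = new ZkHosts("hadoop001.icccuat.com:2181");

        // zkRoot "/kafka-offset" and id "onetest" decide where the spout commits
        // its offsets: /kafka-offset/onetest/partition_N
        SpoutConfig spoutConfig = new SpoutConfig(hosts, "intsmazeX", "/kafka-offset", "onetest");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        // Offsets are committed to ZooKeeper periodically; 30 seconds matches the article.
        spoutConfig.stateUpdateIntervalMs = 30000;

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 1);
        // ... add bolts and submit the topology as usual
    }
}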
INFO storm.kafka.ZkCoordinator - Task [1/1] Refreshing partition manager connections
INFO storm.kafka.DynamicBrokersReader - Read partition info from zookeeper: GlobalPartitionInformation{partitionMap={0=hadoop002.icccuat.com:6667, 1=hadoop003.icccuat.com:6667, 2=hadoop001.icccuat.com:6667}}
INFO storm.kafka.KafkaUtils - Task [1/1] assigned [Partition{host=hadoop002.icccuat.com:6667, partition=0}, Partition{host=hadoop003.icccuat.com:6667, partition=1}, Partition{host=hadoop001.icccuat.com:6667, partition=2}]
INFO storm.kafka.ZkCoordinator - Task [1/1] Deleted partition managers: []
INFO storm.kafka.ZkCoordinator - Task [1/1] New partition managers: [Partition{host=hadoop002.icccuat.com:6667, partition=0}, Partition{host=hadoop003.icccuat.com:6667, partition=1}, Partition{host=hadoop001.icccuat.com:6667, partition=2}]
INFO storm.kafka.PartitionManager - Read partition information from: /kafka-offset/onetest/partition_0 --> null  // the spout first reads this ZooKeeper path to check whether consumption state for the partition has been stored
INFO storm.kafka.PartitionManager - No partition information found, using configuration to determine offset  // no stored state, so the start offset is obtained directly from the Kafka broker according to the configuration
INFO storm.kafka.PartitionManager - Last commit offset from zookeeper: 0
INFO storm.kafka.PartitionManager - Commit offset 0 is more than 9223372036854775807 behind, resetting to startOffsetTime=-2
INFO storm.kafka.PartitionManager - Starting Kafka hadoop002.icccuat.com:0 from offset 0
INFO storm.kafka.PartitionManager - Read partition information from: /kafka-offset/onetest/partition_1 --> null
INFO storm.kafka.PartitionManager - No partition information found, using configuration to determine offset
INFO storm.kafka.PartitionManager - Last commit offset from zookeeper: 0
INFO storm.kafka.PartitionManager - Commit offset 0 is more than 9223372036854775807 behind, resetting to startOffsetTime=-2
INFO storm.kafka.PartitionManager - Starting Kafka hadoop003.icccuat.com:1 from offset 0
INFO storm.kafka.PartitionManager - Read partition information from: /kafka-offset/onetest/partition_2 --> null
INFO storm.kafka.PartitionManager - No partition information found, using configuration to determine offset
INFO storm.kafka.PartitionManager - Last commit offset from zookeeper: 0
INFO storm.kafka.PartitionManager - Commit offset 0 is more than 9223372036854775807 behind, resetting to startOffsetTime=-2
INFO storm.kafka.PartitionManager - Starting Kafka hadoop001.icccuat.com:2 from offset 0
At this point the onetest directory has not yet been created under /kafka-offset in ZooKeeper, because the intsmazeX topic has no data, so nothing has been consumed or committed.
We now use a Kafka producer to send three messages to the topic and then view the information in the corresponding ZooKeeper directory:
{"topology":{"id":"34e94ae4-a0a0-41e9-a360-d0ab648fe196","name":"intsmaze-20161222-143121"},"offset":1,"partition":1,"broker":{"host":"hadoop003.icccuat.com","port":6667},"topic":"intsmazeX"}{"topology":{"id":"34e94ae4-a0a0-41e9-a360-d0ab648fe196","name":"intsmaze-20161222-143121"},"offset":1,"partition":2,"broker":{"host":"hadoop001.icccuat.com","port":6667},"topic":"intsmazeX"}{"topology":{"id":"34e94ae4-a0a0-41e9-a360-d0ab648fe196","name":"intsmaze-20161222-143121"},"offset":1,"partition":0,"broker":{"host":"hadoop002.icccuat.com","port":6667},"topic":"intsmazeX"}
After 30 seconds (KafkaSpout is configured here to commit consumption offsets to ZooKeeper every 30 seconds), we can see that the offset this instance has recorded for each partition is 1.
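For reference, a minimal sketch of the kind of producer used to push the three test messages. The broker list and serializers are assumptions; the article only says that three messages were produced:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TestProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed broker list; adjust to your cluster.
        props.put("bootstrap.servers", "hadoop001.icccuat.com:6667,hadoop002.icccuat.com:6667,hadoop003.icccuat.com:6667");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Three keyless messages, which the default partitioner spreads across the partitions.
            for (int i = 0; i < 3; i++) {
                producer.send(new ProducerRecord<>("intsmazeX", "test-message-" + i));
            }
        }
    }
}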
Kill the topology. Now produce 6 more records to the intsmazeX topic; the maximum offset of each partition of the topic on the broker is now 3.
Then modify the offset recorded for each partition under /kafka-offset/onetest/ to 3.
Now redeploy the topology and observe that it does not consume the six messages that were just produced. Send three more messages, and the topology immediately consumes those three.
Kill the topology again. At this point the consumer instance's committed offset for each partition is 4. Now modify the offset to 6 and start the topology; the maximum offset of each partition of the topic on the broker is only 4, not 6, so let's see what happens when the committed consumption offset is greater than the partition's current offset on the broker.
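One way to make that change, sketched with the plain ZooKeeper Java client; the ZooKeeper address is an assumption, and editing the znodes with zkCli.sh works just as well. The sketch rewrites only the offset field of the JSON that KafkaSpout stored:

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.ZooKeeper;

public class ResetSpoutOffset {
    public static void main(String[] args) throws Exception {
        // Assumed ZooKeeper address; no watcher logic is needed for a one-off update.
        ZooKeeper zk = new ZooKeeper("hadoop001.icccuat.com:2181", 30000, event -> { });
        try {
            for (int partition = 0; partition < 3; partition++) {
                String path = "/kafka-offset/onetest/partition_" + partition;
                // Read the JSON written by KafkaSpout and rewrite only its offset field.
                byte[] data = zk.getData(path, false, null);
                String json = new String(data, StandardCharsets.UTF_8)
                        .replaceAll("\"offset\":\\d+", "\"offset\":3");
                zk.setData(path, json.getBytes(StandardCharsets.UTF_8), -1); // -1: ignore znode version
            }
        } finally {
            zk.close();
        }
    }
}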
WARN storm.kafka.KafkaUtils - Got fetch request with offset out of range: [6]; retrying with default start offset time from configuration. configured start offset time: [-2]
WARN storm.kafka.PartitionManager - Using new offset: 4
WARN storm.kafka.KafkaUtils - Got fetch request with offset out of range: [6]; retrying with default start offset time from configuration. configured start offset time: [-2]
WARN storm.kafka.PartitionManager - Using new offset: 4
WARN storm.kafka.KafkaUtils - Got fetch request with offset out of range: [6]; retrying with default start offset time from configuration. configured start offset time: [-2]
WARN storm.kafka.PartitionManager - Using new offset: 4
As we can see, the consumer's per-partition offset record is automatically synchronized to the current maximum offset of each partition: KafkaSpout first tries to fetch from offset 6, finds that it cannot, and then obtains the actual offset of the topic's partition from the broker.
{"topology":{"id":"818ab9cc-d56f-454f-88b2-06dd830d54c1","name":"intsmaze-20161222-150006"},"offset":4,"partition":0,"broker":{"host":"hadoop002.icccuat.com","port":6667},"topic":"intsmazeX"}....
Set the offset to 7000. After the topology is started, the offset is updated to the maximum offset of each partition.
Now deploy another topology to consume the topic, this time with the instance id set to twotest. When this topology starts, we find that it does not consume the message data produced before it started. This is because when the new instance starts with no committed state, the offset it obtains can only be the current maximum offset of each partition of the topic (partition offsets only ever increase and partition data is deleted periodically, so the current starting offset of a partition cannot simply be assumed).
Refreshing partition manager connections
Read partition info from zookeeper: GlobalPartitionInformation{partitionMap={0=hadoop002.icccuat.com:6667, 1=hadoop003.icccuat.com:6667, 2=hadoop001.icccuat.com:6667}}
assigned [Partition{host=hadoop002.icccuat.com:6667, partition=0}, Partition{host=hadoop003.icccuat.com:6667, partition=1}, Partition{host=hadoop001.icccuat.com:6667, partition=2}]
Deleted partition managers: []
New partition managers: [Partition{host=hadoop002.icccuat.com:6667, partition=0}, Partition{host=hadoop003.icccuat.com:6667, partition=1}, Partition{host=hadoop001.icccuat.com:6667, partition=2}]
Read partition information from: /kafka-offset/twotest/partition_0 --> null
No partition information found, using configuration to determine offset
Starting Kafka hadoop002.icccuat.com:0 from offset 7
Read partition information from: /kafka-offset/twotest/partition_1 --> null
No partition information found, using configuration to determine offset
Starting Kafka hadoop003.icccuat.com:1 from offset 7
Read partition information from: /kafka-offset/twotest/partition_2 --> null
No partition information found, using configuration to determine offset
Starting Kafka hadoop001.icccuat.com:2 from offset 7
Finished refreshing
Refreshing partition manager connections
Read partition info from zookeeper: GlobalPartitionInformation{partitionMap={0=hadoop002.icccuat.com:6667, 1=hadoop003.icccuat.com:6667, 2=hadoop001.icccuat.com:6667}}
assigned [Partition{host=hadoop002.icccuat.com:6667, partition=0}, Partition{host=hadoop003.icccuat.com:6667, partition=1}, Partition{host=hadoop001.icccuat.com:6667, partition=2}]
Deleted partition managers: []
New partition managers: []
Finished refreshing
(the same empty refresh cycle then repeats periodically with no changes)
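As the log says, with no committed state the spout "uses configuration to determine offset". In storm-kafka the relevant knob is the startOffsetTime field; a hedged fragment extending the SpoutConfig sketch above (the constants come from kafka.api.OffsetRequest, and the ZooKeeper address is again an assumption):

import storm.kafka.SpoutConfig;
import storm.kafka.ZkHosts;

public class StartOffsetExample {
    public static SpoutConfig configure() {
        // Same zkRoot as before, new instance id "twotest".
        SpoutConfig cfg = new SpoutConfig(new ZkHosts("hadoop001.icccuat.com:2181"),
                "intsmazeX", "/kafka-offset", "twotest");
        // With no committed state in ZooKeeper, the spout falls back to this setting:
        cfg.startOffsetTime = kafka.api.OffsetRequest.EarliestTime(); // -2: oldest data still on the broker
        // cfg.startOffsetTime = kafka.api.OffsetRequest.LatestTime(); // -1: only data produced from now on
        return cfg;
    }
}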
Send three more messages and then view the instance directory:
{"topology":{"id":"3d6a5f80-357f-4591-8e5c-b3d4d2403dfe","name":"demo-20161222-152236"},"offset":8,"partition":0,"broker":{"host":"hadoop002.icccuat.com","port":6667},"topic":"intsmazeX"}
Now start another topology whose spout instance id is also twotest:
[INFO] Task [1/2] Refreshing partition manager connections
[INFO] Task [2/2] Refreshing partition manager connections
[INFO] Read partition info from zookeeper: GlobalPartitionInformation{partitionMap={0=hadoop002.icccuat.com:6667, 1=hadoop003.icccuat.com:6667, 2=hadoop001.icccuat.com:6667}}
[INFO] Read partition info from zookeeper: GlobalPartitionInformation{partitionMap={0=hadoop002.icccuat.com:6667, 1=hadoop003.icccuat.com:6667, 2=hadoop001.icccuat.com:6667}}
[INFO] Task [1/2] assigned [Partition{host=hadoop002.icccuat.com:6667, partition=0}, Partition{host=hadoop001.icccuat.com:6667, partition=2}]
[INFO] Task [2/2] assigned [Partition{host=hadoop003.icccuat.com:6667, partition=1}]
[INFO] Task [1/2] Deleted partition managers: []
[INFO] Task [2/2] Deleted partition managers: []
[INFO] Task [1/2] New partition managers: [Partition{host=hadoop002.icccuat.com:6667, partition=0}, Partition{host=hadoop001.icccuat.com:6667, partition=2}]
[INFO] Task [2/2] New partition managers: [Partition{host=hadoop003.icccuat.com:6667, partition=1}]
[INFO] Read partition information from: /kafka-offset/twotest/partition_0 --> {"topic":"intsmazeX","partition":0,"topology":{"id":"3d6a5f80-357f-4591-8e5c-b3d4d2403dfe","name":"demo-20161222-152236"},"broker":{"port":6667,"host":"hadoop002.icccuat.com"},"offset":8}
[INFO] Read partition information from: /kafka-offset/twotest/partition_1 --> {"topic":"intsmazeX","partition":1,"topology":{"id":"3d6a5f80-357f-4591-8e5c-b3d4d2403dfe","name":"demo-20161222-152236"},"broker":{"port":6667,"host":"hadoop003.icccuat.com"},"offset":8}
[INFO] Read last commit offset from zookeeper: 8; old topology_id: 3d6a5f80-357f-4591-8e5c-b3d4d2403dfe - new topology_id: 348af8da-994a-4cdb-a629-e4bf107348af
[INFO] Read last commit offset from zookeeper: 8; old topology_id: 3d6a5f80-357f-4591-8e5c-b3d4d2403dfe - new topology_id: 348af8da-994a-4cdb-a629-e4bf107348af
[INFO] Starting Kafka hadoop002.icccuat.com:0 from offset 8
[INFO] Starting Kafka hadoop003.icccuat.com:1 from offset 8
[INFO] Task [2/2] Finished refreshing
[INFO] Read partition information from: /kafka-offset/twotest/partition_2 --> {"topic":"intsmazeX","partition":2,"topology":{"id":"3d6a5f80-357f-4591-8e5c-b3d4d2403dfe","name":"demo-20161222-152236"},"broker":{"port":6667,"host":"hadoop001.icccuat.com"},"offset":8}
[INFO] Read last commit offset from zookeeper: 8; old topology_id: 3d6a5f80-357f-4591-8e5c-b3d4d2403dfe - new topology_id: 348af8da-994a-4cdb-a629-e4bf107348af
[INFO] Starting Kafka hadoop001.icccuat.com:2 from offset 8
[INFO] Task [1/2] Finished refreshing
[INFO] Task [2/2] Refreshing partition manager connections
[INFO] Read partition info from zookeeper: GlobalPartitionInformation{partitionMap={0=hadoop002.icccuat.com:6667, 1=hadoop003.icccuat.com:6667, 2=hadoop001.icccuat.com:6667}}
[INFO] Task [2/2] assigned [Partition{host=hadoop003.icccuat.com:6667, partition=1}]
[INFO] Task [1/2] Refreshing partition manager connections
[INFO] Task [2/2] Deleted partition managers: []
[INFO] Task [2/2] New partition managers: []
{"topology":{"id":"3d6a5f80-357f-4591-8e5c-b3d4d2403dfe","name":"demo-20161222-152236"},"offset":8,"partition":1,"broker":{"host":"hadoop003.icccuat.com","port":6667},"topic":"intsmazeX"}
Then send messages again. We can see that both topologies run, because the two topologies share the same set of metadata (the same instance directory in ZooKeeper).
Note some pitfalls in this process:
1. When using KafkaSpout we must specify the ZooKeeper path where the consumer offsets are stored (here /kafka-offset) and the instance id of the spout (here onetest). KafkaSpout differs from the Kafka client code in that it has no consumer-group concept; it is only that the data is stored in a different place, and different instance ids effectively behave like different consumer groups.
2. To modify a KafkaSpout instance's offset, we must first kill every topology that uses this id. We hit a big pitfall here in our project: KafkaSpouts with the same id are the same instance, i.e. they share the same directory. If we do not kill those topologies but merely deactivate them, then after we modify the offset in ZooKeeper and set the topologies back to active, the modification turns out to have no effect and the offset reverts to its previous value. The topology was never killed, so its running code still holds the offset of its current consumption position and keeps committing it periodically.
3. Pay attention to the wait time when killing the topology, because by default the topology commits its offset information to ZooKeeper every 30 seconds. There are two ways to change an offset: modify the offset in ZooKeeper before deploying the topology, or simply delete the corresponding instance directory in ZooKeeper, in which case the new deployment starts from the latest offset (a deletion sketch follows below).
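For the second option in pitfall 3 (deleting the instance directory), a sketch with the ZooKeeper Java client; the address is an assumption, and rmr /kafka-offset/onetest in zkCli.sh achieves the same thing:

import org.apache.zookeeper.ZooKeeper;

public class DeleteSpoutState {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("hadoop001.icccuat.com:2181", 30000, event -> { });
        try {
            String root = "/kafka-offset/onetest";
            // Delete the per-partition children first, then the instance node itself.
            for (String child : zk.getChildren(root, false)) {
                zk.delete(root + "/" + child, -1);
            }
            zk.delete(root, -1);
        } finally {
            zk.close();
        }
    }
}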
The following are my thoughts, from when I was learning Kafka, on how data consumption is divided between a Kafka consumer and a consumer group. A framework greatly boosts our productivity, but as thinking programmers we should also look at a framework from the other side, not only at the features it offers. If we only ever use a framework's features, we will always feel the framework is impressive without understanding how it is implemented internally.
Suppose we wanted to implement Kafka ourselves:
First, a consumer group is created by the client, which stores the consumer group's name in ZooKeeper.
Second, when a consumer is created, the consumer's name is saved under the folder of its consumer group in ZooKeeper.
Third, once a consumer exists, we of course need to decide which messages it may consume. This should be controlled by the Kafka broker: it constantly watches the number of consumers in all consumer groups in ZooKeeper, and when it finds that a consumer has been added or removed, it knows a rebalance is needed. At that point it should compute the assigned partition numbers and the offsets of those partitions and write them into each consumer's file.
Fourth, how does the broker know the partition layout of each topic? When the broker creates a topic, the number of partitions and replicas is specified, and a topic folder is generated in ZooKeeper; each file in the folder represents a partition, and each file's content records the location of the partition and of its replicas. The consumption offset of the partition should not be recorded here, because the offset at which a partition has been consumed differs from one consumer group to another.
Fifth, I would guess that the offset recorded in the consumer's file is the consumed offset, so that when partitions are reassigned the offset is carried over too; otherwise the data consumed before the rebalance would have to be consumed again. But there is a problem: when a consumer is removed, its consumption offsets are removed with it, so when the partitions it used to consume are assigned to others, those consumers cannot know where to start consuming. Looking at Kafka's ZooKeeper storage structure, we find that the consumers folder contains one folder per consumer group. Under each consumer group folder there are three folders: one stores every consumer in the group, one file per consumer; another stores one folder per topic that the group can consume; under each topic folder are that topic's partitions, and each partition file records the offset at which the group has consumed it. This ensures that when a consumer is added or removed, the offsets of the consumed partitions are still there, so after a rebalance the reassigned partitions are not consumed again from the beginning but continue from where consumption left off. But then how do we know which consumer consumes which partition, if the partition assignment is not stored in the consumer's own file? That actually seems acceptable: once a consumer is removed, it does not matter that its assignment is lost, because the broker watches the number of consumers and rebalances as soon as it changes. (The only issue I can think of: if some consumer in the running system consumes no data and we remove it, then by listening only for consumer changes we cannot tell which partitions stopped being consumed when it was removed, even though those partitions would not actually need to be consumed again.) Let's change our thinking; the conjecture above is wrong. Can a consumer in a consumer group consume only one partition of one topic? In fact a partition of a topic can correspond to only one consumer within a consumer group, while a consumer group can consume multiple topics; that should be acceptable.
Can a consumer in a consumer group consume one partition of each of several topics? A consumer group can consume multiple topics, but is a single consumer limited to one partition of one topic? After testing, I found that one consumer can indeed consume multiple topics. So how does one consumer consume one partition of several topics? There is one more folder left: it also contains a folder per topic, and under each topic folder there is a file per partition; I would have that file record the name of the consumer that consumes the partition.
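To make the group semantics above concrete, here is a hedged sketch using the modern Kafka Java consumer (not the ZooKeeper-based high-level consumer discussed above; broker address and topic are assumptions). Start two copies with the same group.id and they split the partitions of intsmazeX between them; start a copy with a different group.id and it receives its own full copy of every message:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupDemoConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "hadoop001.icccuat.com:6667");  // assumed broker address
        props.put("group.id", args.length > 0 ? args[0] : "group-a");   // pass the group name as an argument
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("intsmazeX"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("group=%s partition=%d offset=%d value=%s%n",
                            props.get("group.id"), r.partition(), r.offset(), r.value());
                }
            }
        }
    }
}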