Modifying Kafka topic offsets for various needs


  Summary: During development we often need to manually change the offset at which a consumer instance consumes a given Kafka topic. How exactly do we change it, and why does it work? It is actually easy once you look at it from another angle: if I were implementing a Kafka consumer myself, how would my consumer code control consumption of a topic? Different consumer groups can each consume the same message of a topic, while different consumers within one group consume different messages of that topic. If you had to implement that framework, how would you do it?

  Here I demonstrate with Storm's KafkaSpout as the consumer. KafkaSpout uses Kafka's low-level (simple) API, so the structure it stores in Zookeeper differs from the structure used by the Kafka Java client's high-level API. For how the high-level API lays out its data in Zookeeper, see the article "Apache Kafka series: storage structure in Zookeeper".

Discuss with the author at the original post: http://www.cnblogs.com/intsmaze/p/6212913.html

Available for website development and Java development work.

Sina Weibo: intsmaze劉洋洋哥


 

  Create a Kafka topic named intsmazeX with 3 partitions.

  Use KafkaSpout to create a consumer instance for this topic, storing its metadata in Zookeeper under the path /kafka-offset and using the instance id onetest. A minimal configuration sketch is shown below, followed by the output Storm logs when the topology starts.
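The sketch uses the old storm-kafka API (the pre-1.0 storm.kafka / backtype.storm packages that produced the log lines below); the Zookeeper connect string is a placeholder for the test cluster, and the 30-second commit interval matches the value referred to later in this article:

import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.TopologyBuilder;
import storm.kafka.BrokerHosts;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

public class IntsmazeKafkaSpoutTopology {
    public static void main(String[] args) {
        // Zookeeper ensemble used by the Kafka brokers (placeholder address).
        BrokerHosts hosts = new ZkHosts("hadoop001.icccuat.com:2181");
        // Arguments: broker hosts, topic, zkRoot (where the spout stores offsets), instance id.
        SpoutConfig spoutConfig = new SpoutConfig(hosts, "intsmazeX", "/kafka-offset", "onetest");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
        spoutConfig.stateUpdateIntervalMs = 30000; // commit offsets to Zookeeper every 30 seconds
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 1);
        // ... attach bolts and submit with StormSubmitter or LocalCluster as usual ...
    }
}

When the topology starts, Storm logs the following: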

INFO storm.kafka.ZkCoordinator - Task [1/1] Refreshing partition manager connections
INFO storm.kafka.DynamicBrokersReader - Read partition info from zookeeper: GlobalPartitionInformation{partitionMap={0=hadoop002.icccuat.com:6667, 1=hadoop003.icccuat.com:6667, 2=hadoop001.icccuat.com:6667}}
INFO storm.kafka.KafkaUtils - Task [1/1] assigned [Partition{host=hadoop002.icccuat.com:6667, partition=0}, Partition{host=hadoop003.icccuat.com:6667, partition=1}, Partition{host=hadoop001.icccuat.com:6667, partition=2}]
INFO storm.kafka.ZkCoordinator - Task [1/1] Deleted partition managers: []
INFO storm.kafka.ZkCoordinator - Task [1/1] New partition managers: [Partition{host=hadoop002.icccuat.com:6667, partition=0}, Partition{host=hadoop003.icccuat.com:6667, partition=1}, Partition{host=hadoop001.icccuat.com:6667, partition=2}]
INFO storm.kafka.PartitionManager - Read partition information from: /kafka-offset/onetest/partition_0  --> null  // here the spout reads this Zookeeper directory to check whether consumption state for this partition has been saved
INFO storm.kafka.PartitionManager - No partition information found, using configuration to determine offset  // no partition information, so the spout goes straight to the Kafka broker to obtain the partition's maximum offset
INFO storm.kafka.PartitionManager - Last commit offset from zookeeper: 0
INFO storm.kafka.PartitionManager - Commit offset 0 is more than 9223372036854775807 behind, resetting to startOffsetTime=-2
INFO storm.kafka.PartitionManager - Starting Kafka hadoop002.icccuat.com:0 from offset 0
INFO storm.kafka.PartitionManager - Read partition information from: /kafka-offset/onetest/partition_1  --> null
INFO storm.kafka.PartitionManager - No partition information found, using configuration to determine offset
INFO storm.kafka.PartitionManager - Last commit offset from zookeeper: 0
INFO storm.kafka.PartitionManager - Commit offset 0 is more than 9223372036854775807 behind, resetting to startOffsetTime=-2
INFO storm.kafka.PartitionManager - Starting Kafka hadoop003.icccuat.com:1 from offset 0
INFO storm.kafka.PartitionManager - Read partition information from: /kafka-offset/onetest/partition_2  --> null
INFO storm.kafka.PartitionManager - No partition information found, using configuration to determine offset
INFO storm.kafka.PartitionManager - Last commit offset from zookeeper: 0
INFO storm.kafka.PartitionManager - Commit offset 0 is more than 9223372036854775807 behind, resetting to startOffsetTime=-2
INFO storm.kafka.PartitionManager - Starting Kafka hadoop001.icccuat.com:2 from offset 0
No directory named onetest has been created under /kafka-offset in Zookeeper at this point, because no data has been produced to intsmazeX yet.
We then use a Kafka producer to send 3 messages (a producer sketch follows) and check the information under the corresponding Zookeeper directory.
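A minimal producer sketch using the standard Kafka Java client; the broker addresses are placeholders matching the hosts that appear in the logs:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class IntsmazeTestProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Broker list of the test cluster (placeholder addresses).
        props.put("bootstrap.servers", "hadoop001.icccuat.com:6667,hadoop002.icccuat.com:6667,hadoop003.icccuat.com:6667");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        for (int i = 0; i < 3; i++) {
            producer.send(new ProducerRecord<String, String>("intsmazeX", "test-message-" + i));
        }
        producer.close();
    }
}

The partition znodes under /kafka-offset/onetest then look like this: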

{"topology":{"id":"34e94ae4-a0a0-41e9-a360-d0ab648fe196","name":"intsmaze-20161222-143121"},"offset":1,"partition":1,"broker":{"host":"hadoop003.icccuat.com","port":6667},"topic":"intsmazeX"}{"topology":{"id":"34e94ae4-a0a0-41e9-a360-d0ab648fe196","name":"intsmaze-20161222-143121"},"offset":1,"partition":2,"broker":{"host":"hadoop001.icccuat.com","port":6667},"topic":"intsmazeX"}{"topology":{"id":"34e94ae4-a0a0-41e9-a360-d0ab648fe196","name":"intsmaze-20161222-143121"},"offset":1,"partition":0,"broker":{"host":"hadoop002.icccuat.com","port":6667},"topic":"intsmazeX"}

After 30 seconds (the KafkaSpout is configured to commit its consumption offsets to Zookeeper every 30 seconds), you can see that the instance has recorded an offset of 1 for each partition.

Kill the topology and produce 6 more messages to intsmazeX; the maximum offset of each partition of the topic on the broker is now 3.

Then change the offset to 3 for each partition under /kafka-offset/onetest/.
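Any Zookeeper client will do for this edit (zkCli.sh, for example). Below is a minimal sketch with the plain Apache ZooKeeper Java client, assuming the spout metadata lives on the same ensemble as above (placeholder address) and naively rewriting the offset field of the JSON shown earlier:

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.ZooKeeper;

public class SetSpoutOffsets {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("hadoop001.icccuat.com:2181", 30000, event -> { });
        for (int partition = 0; partition < 3; partition++) {
            String path = "/kafka-offset/onetest/partition_" + partition;
            // Read the JSON the spout stored for this partition.
            String json = new String(zk.getData(path, false, null), StandardCharsets.UTF_8);
            // Naive in-place edit of the "offset" field; a JSON library would be safer.
            String updated = json.replaceAll("\"offset\":\\d+", "\"offset\":3");
            zk.setData(path, updated.getBytes(StandardCharsets.UTF_8), -1); // -1 = any version
        }
        zk.close();
    }
}

Kill the topology before making this edit (see the pitfalls at the end of this article), otherwise the running spout will overwrite it on its next commit.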

Now redeploy the topology: it does not consume the 6 messages just produced. Send 3 more messages and the topology consumes those 3 immediately.

Kill the topology again; the consumer instance's recorded offset for each partition is now 4. Then change the offset to 6 and start the topology. The maximum offset of each partition on the broker is 4, not 6, so let us see what happens when the consumer's offset for a partition is greater than the partition's current maximum offset.

WARN storm.kafka.KafkaUtils - Got fetch request with offset out of range: [6]; retrying with default start offset time from configuration. configured start offset time: [-2]
WARN storm.kafka.PartitionManager - Using new offset: 4
WARN storm.kafka.KafkaUtils - Got fetch request with offset out of range: [6]; retrying with default start offset time from configuration. configured start offset time: [-2]
WARN storm.kafka.PartitionManager - Using new offset: 4
WARN storm.kafka.KafkaUtils - Got fetch request with offset out of range: [6]; retrying with default start offset time from configuration. configured start offset time: [-2]
WARN storm.kafka.PartitionManager - Using new offset: 4
As you can see, the consumer's recorded partition offsets are automatically synchronized to each partition's current maximum offset: KafkaSpout first tries to fetch with offset 6, gets nothing, and then obtains the partition's maximum offset from the broker.
{"topology":{"id":"818ab9cc-d56f-454f-88b2-06dd830d54c1","name":"intsmaze-20161222-150006"},"offset":4,"partition":0,"broker":{"host":"hadoop002.icccuat.com","port":6667},"topic":"intsmazeX"}....
Setting the offset to 7000 behaves the same way: once the topology starts, it is updated to each partition's maximum offset.
    Now deploy a new topology consuming this topic, with the instance id twotest. When it starts we find that it does not consume the messages produced before it started. This is because, at startup, the topology has to obtain an offset, and that offset can only be the current maximum offset of each partition (partition offsets grow monotonically and partition data is deleted periodically, so the earliest offset currently available in a partition cannot be known in advance). Its startup log:
Refreshing partition manager connections
Read partition info from zookeeper: GlobalPartitionInformation{partitionMap={0=hadoop002.icccuat.com:6667, 1=hadoop003.icccuat.com:6667, 2=hadoop001.icccuat.com:6667}}
assigned [Partition{host=hadoop002.icccuat.com:6667, partition=0}, Partition{host=hadoop003.icccuat.com:6667, partition=1}, Partition{host=hadoop001.icccuat.com:6667, partition=2}]
Deleted partition managers: []
New partition managers: [Partition{host=hadoop002.icccuat.com:6667, partition=0}, Partition{host=hadoop003.icccuat.com:6667, partition=1}, Partition{host=hadoop001.icccuat.com:6667, partition=2}]
Read partition information from: /kafka-offset/twotest/partition_0  --> null
No partition information found, using configuration to determine offset
Starting Kafka hadoop002.icccuat.com:0 from offset 7
Read partition information from: /kafka-offset/twotest/partition_1  --> null
No partition information found, using configuration to determine offset
Starting Kafka hadoop003.icccuat.com:1 from offset 7
Read partition information from: /kafka-offset/twotest/partition_2  --> null
No partition information found, using configuration to determine offset
Starting Kafka hadoop001.icccuat.com:2 from offset 7
Finished refreshing
Refreshing partition manager connections
Read partition info from zookeeper: GlobalPartitionInformation{partitionMap={0=hadoop002.icccuat.com:6667, 1=hadoop003.icccuat.com:6667, 2=hadoop001.icccuat.com:6667}}
assigned [Partition{host=hadoop002.icccuat.com:6667, partition=0}, Partition{host=hadoop003.icccuat.com:6667, partition=1}, Partition{host=hadoop001.icccuat.com:6667, partition=2}]
Deleted partition managers: []
New partition managers: []
Finished refreshing
Send three messages and the instance's directory then looks like this:
{"topology":{"id":"3d6a5f80-357f-4591-8e5c-b3d4d2403dfe","name":"demo-20161222-152236"},"offset":8,"partition":0,"broker":{"host":"hadoop002.icccuat.com","port":6667},"topic":"intsmazeX"}
Start another topology with the same instance id, twotest:
[INFO] Task [1/2] Refreshing partition manager connections
[INFO] Task [2/2] Refreshing partition manager connections
[INFO] Read partition info from zookeeper: GlobalPartitionInformation{partitionMap={0=hadoop002.icccuat.com:6667, 1=hadoop003.icccuat.com:6667, 2=hadoop001.icccuat.com:6667}}
[INFO] Read partition info from zookeeper: GlobalPartitionInformation{partitionMap={0=hadoop002.icccuat.com:6667, 1=hadoop003.icccuat.com:6667, 2=hadoop001.icccuat.com:6667}}
[INFO] Task [1/2] assigned [Partition{host=hadoop002.icccuat.com:6667, partition=0}, Partition{host=hadoop001.icccuat.com:6667, partition=2}]
[INFO] Task [2/2] assigned [Partition{host=hadoop003.icccuat.com:6667, partition=1}]
[INFO] Task [1/2] Deleted partition managers: []
[INFO] Task [2/2] Deleted partition managers: []
[INFO] Task [1/2] New partition managers: [Partition{host=hadoop002.icccuat.com:6667, partition=0}, Partition{host=hadoop001.icccuat.com:6667, partition=2}]
[INFO] Task [2/2] New partition managers: [Partition{host=hadoop003.icccuat.com:6667, partition=1}]
[INFO] Read partition information from: /kafka-offset/twotest/partition_0  --> {"topic":"intsmazeX","partition":0,"topology":{"id":"3d6a5f80-357f-4591-8e5c-b3d4d2403dfe","name":"demo-20161222-152236"},"broker":{"port":6667,"host":"hadoop002.icccuat.com"},"offset":8}
[INFO] Read partition information from: /kafka-offset/twotest/partition_1  --> {"topic":"intsmazeX","partition":1,"topology":{"id":"3d6a5f80-357f-4591-8e5c-b3d4d2403dfe","name":"demo-20161222-152236"},"broker":{"port":6667,"host":"hadoop003.icccuat.com"},"offset":8}
[INFO] Read last commit offset from zookeeper: 8; old topology_id: 3d6a5f80-357f-4591-8e5c-b3d4d2403dfe - new topology_id: 348af8da-994a-4cdb-a629-e4bf107348af
[INFO] Read last commit offset from zookeeper: 8; old topology_id: 3d6a5f80-357f-4591-8e5c-b3d4d2403dfe - new topology_id: 348af8da-994a-4cdb-a629-e4bf107348af
[INFO] Starting Kafka hadoop002.icccuat.com:0 from offset 8
[INFO] Starting Kafka hadoop003.icccuat.com:1 from offset 8
[INFO] Task [2/2] Finished refreshing
[INFO] Read partition information from: /kafka-offset/twotest/partition_2  --> {"topic":"intsmazeX","partition":2,"topology":{"id":"3d6a5f80-357f-4591-8e5c-b3d4d2403dfe","name":"demo-20161222-152236"},"broker":{"port":6667,"host":"hadoop001.icccuat.com"},"offset":8}
[INFO] Read last commit offset from zookeeper: 8; old topology_id: 3d6a5f80-357f-4591-8e5c-b3d4d2403dfe - new topology_id: 348af8da-994a-4cdb-a629-e4bf107348af
[INFO] Starting Kafka hadoop001.icccuat.com:2 from offset 8
[INFO] Task [1/2] Finished refreshing
[INFO] Task [2/2] Refreshing partition manager connections
[INFO] Read partition info from zookeeper: GlobalPartitionInformation{partitionMap={0=hadoop002.icccuat.com:6667, 1=hadoop003.icccuat.com:6667, 2=hadoop001.icccuat.com:6667}}
[INFO] Task [2/2] assigned [Partition{host=hadoop003.icccuat.com:6667, partition=1}]
[INFO] Task [1/2] Refreshing partition manager connections
[INFO] Task [2/2] Deleted partition managers: []
[INFO] Task [2/2] New partition managers: []
{"topology":{"id":"3d6a5f80-357f-4591-8e5c-b3d4d2403dfe","name":"demo-20161222-152236"},"offset":8,"partition":1,"broker":{"host":"hadoop003.icccuat.com","port":6667},"topic":"intsmazeX"}
Then send messages: both topologies run and process them, because the two topologies share the same metadata.
A few pitfalls to watch out for in this process:

1. When using KafkaSpout, you must specify the Zookeeper path under which this consumer stores its offsets (here /kafka-offset) and the instance id (here onetest). KafkaSpout differs from the Kafka client code: it has no explicit notion of a consumer group; more precisely, the data is just stored differently, and different instance ids play the role of different consumer groups.

2. When modifying the offsets of a KafkaSpout instance, you must shut down the topology that uses that id. We hit a big pitfall in a project: several KafkaSpouts were deployed with the same id, i.e. they shared the same directory. If you do not take those topologies offline but merely deactivate them, then after editing the offsets in Zookeeper and reactivating the topologies, the change has no effect and the offset reverts to its previous value. That is because the topology was never killed: its running process keeps the current consumption offset in memory and commits it periodically.

3. When killing a topology, set a wait time, because by default the topology commits its offsets to Zookeeper every 30 seconds.

There are two ways to modify the offsets before deploying a topology: either edit the offsets in Zookeeper, or simply delete the corresponding instance's directory in Zookeeper (see the sketch below); after redeployment the topology then starts from the latest offsets.
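A minimal sketch of deleting an instance's directory with the plain ZooKeeper Java client (ensemble address and instance id are placeholders; Curator's deletingChildrenIfNeeded() or zkCli.sh's rmr achieve the same thing):

import org.apache.zookeeper.ZooKeeper;

public class DeleteSpoutInstanceDir {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("hadoop001.icccuat.com:2181", 30000, event -> { });
        String root = "/kafka-offset/onetest";
        // Delete the partition znodes first, then the instance directory itself.
        for (String child : zk.getChildren(root, false)) {
            zk.delete(root + "/" + child, -1); // -1 = any version
        }
        zk.delete(root, -1);
        zk.close();
    }
}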
 
  What follows is what I thought about when first learning Kafka: if I were writing Kafka myself, how would I handle the bookkeeping between consumers and consumer groups when consuming data? Frameworks greatly boost our productivity, but as thoughtful programmers we should look at a framework from another angle, not just "this framework has feature X, so we use feature X"; otherwise we only ever marvel at how powerful it is without understanding how it is implemented.

If I were to implement Kafka's consumer functionality myself:
First, a consumer group is created by the client, which stores the group name in Zookeeper.

Second, when a consumer is created, it stores its own name under its group's folder in Zookeeper.

Third, once a consumer exists, something must decide which messages of a topic it may consume. This should be controlled by the Kafka broker: it keeps watching the consumers under every consumer group in Zookeeper, and when a consumer is added or removed it knows a rebalance is needed; it then computes the assignment and writes into each consumer's node the partition numbers it may consume and the offset of each partition.

Fourth, how does the broker know each topic's partition layout? When a topic is created, its partition and replica counts are specified, and a topic folder is created in Zookeeper; each file under it represents a partition, and its content records the partition's location, replica locations and so on. The consumed offset of a partition should not be recorded there, because each consumer group consumes the partition at a different offset.

Fifth, my first guess was that each consumer's node records the offsets it has consumed, and that when partitions are reassigned the offsets must be transferred as well, otherwise the new owner would re-consume data that was already consumed. But this has a problem: when a consumer is removed, its offsets are deleted with it, its partitions are handed to someone else, and that someone has no idea where to start consuming.

Looking at Kafka's actual Zookeeper storage structure (sketched below), we find: there is a consumers folder, under which each folder is one consumer group. A consumer-group folder contains three folders. One stores every consumer in the group, each consumer being one file. Another holds, for each topic the group may consume, a topic folder; under each topic folder are that topic's partitions, and each partition file records the offset consumed by this group. This guarantees that when consumers are added or removed, the offsets of the partitions they were consuming are still there, so after a rebalance the new assignee does not re-consume the partition but picks up from where it was already consumed.

But how do we know which consumer consumes which partition? Storing the partition numbers in the consumer's file would also seem to work, because it does not matter if that mapping is lost when a consumer is removed: the broker watches the set of consumers and rebalances whenever it changes. (The drawback I can think of: if a consumer in the system is consuming no data and we remove it, we only observe that the set of consumers changed, not whether any partition actually stopped being consumed, so a rebalance is still triggered even though it is unnecessary.) So let us try a different approach.

The guess above was wrong: within a consumer group, each message of a topic is consumed by only one consumer; in other words, a topic partition maps to exactly one consumer within a group. Conversely, a consumer group should be able to consume multiple topics, so a consumer in a group can consume one partition of each of several topics. Is it that one group can consume multiple topics, or that one consumer is limited to a single partition of a single topic? From my tests, a single consumer consuming multiple topics does work. What about one consumer consuming one partition of each of several topics? Finally, there is a third folder: under it there is again one folder per topic, and under each topic folder one file per partition; I would have each partition file record the name of the consumer that consumes it.
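For concreteness, this is roughly the layout the old Zookeeper-based high-level consumer used for the three folders described above; the path names are quoted from memory of that era and are indicative rather than authoritative:

/consumers/<group.id>/ids/<consumer.id>             - the consumers currently registered in the group
/consumers/<group.id>/owners/<topic>/<partition>    - the consumer.id that currently owns this partition
/consumers/<group.id>/offsets/<topic>/<partition>   - the last offset this group committed for the partition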
 
