Hadoop Operations: JobTracker Stops Serving for No Apparent Reason

Last updated: 2018-12-04 Source: Internet

Uploader: User


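This afternoon a colleague submitted a query through hive, and it threw an execution error.

I opened the jobtracker admin page and found that the number of running jobs was zero, while the tasktracker heartbeats were normal. This made me suspect that the jobtracker had stopped serving (the cluster rarely has zero running jobs), so I manually submitted a mapred job as a test. It failed with the following output: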

12/07/03 18:07:22 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
12/07/03 18:07:22 INFO hdfs.DFSClient: Abandoning block blk_-1772232086636991458_5671628
12/07/03 18:07:28 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
12/07/03 18:07:28 INFO hdfs.DFSClient: Abandoning block blk_-2108024038073283869_5671629
12/07/03 18:07:34 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 192.168.1.25:50010
12/07/03 18:07:34 INFO hdfs.DFSClient: Abandoning block blk_-6674020380591432013_5671629
12/07/03 18:07:40 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 192.168.1.26:50010
12/07/03 18:07:40 INFO hdfs.DFSClient: Abandoning block blk_-3788726859662311832_5671629
12/07/03 18:07:46 WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3002)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2255)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2446)
12/07/03 18:07:46 WARN hdfs.DFSClient: Error Recovery for block blk_-3788726859662311832_5671629 bad datanode[2] nodes == null
12/07/03 18:07:46 WARN hdfs.DFSClient: Could not get block locations. Source file "/tmp/hadoop-hadoop/mapred/staging/hadoop/.staging/job_201206270914_17301/job.jar" - Aborting...
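When a client log like the one above is long, it can help to tally which datanodes are named as firstBadLink before opening their logs. The following is a minimal Python sketch, not part of the original troubleshooting; the function name count_bad_links and the embedded sample text are illustrative assumptions.

```python
import re
from collections import Counter

# Matches the datanode address reported in "Bad connect ack" lines.
BAD_LINK_RE = re.compile(r"firstBadLink as (\d+\.\d+\.\d+\.\d+:\d+)")


def count_bad_links(log_text):
    """Return a Counter of datanode addresses named as firstBadLink."""
    return Counter(BAD_LINK_RE.findall(log_text))


# Two lines copied from the client output above (sample input only).
sample_log = (
    "12/07/03 18:07:34 INFO hdfs.DFSClient: Exception in createBlockOutputStream "
    "java.io.IOException: Bad connect ack with firstBadLink as 192.168.1.25:50010\n"
    "12/07/03 18:07:40 INFO hdfs.DFSClient: Exception in createBlockOutputStream "
    "java.io.IOException: Bad connect ack with firstBadLink as 192.168.1.26:50010\n"
)

bad_links = count_bad_links(sample_log)
for address, hits in sorted(bad_links.items()):
    print(address, hits)
```

The addresses it reports are the datanodes whose logs are worth checking first.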

The namenode log shows that block blk_-2108024038073283869_5671629 belongs to the job jar of the job submitted to the jobtracker:

2012-07-03 18:07:27,316 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/hadoop-hadoop/mapred/staging/hadoop/.staging/job_201206270914_17301/job.jar. blk_-2108024038073283869_5671629
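Looking up which file an abandoned block belongs to can be scripted. The sketch below assumes the namenode log uses the allocateBlock line format shown above; the helper name block_to_path is hypothetical and the regular expression is derived only from that one sample line.

```python
import re

# Captures the file path and block id from "NameSystem.allocateBlock" lines.
# The pattern is an assumption based on the single sample line in this article.
ALLOCATE_RE = re.compile(r"NameSystem\.allocateBlock: (\S+)\. (blk_-?\d+_\d+)")


def block_to_path(namenode_log):
    """Map each allocated block id to the HDFS path it was allocated for."""
    return {m.group(2): m.group(1) for m in ALLOCATE_RE.finditer(namenode_log)}


# The namenode log line quoted above (sample input only).
sample_log = (
    "2012-07-03 18:07:27,316 INFO org.apache.hadoop.hdfs.StateChange: "
    "BLOCK* NameSystem.allocateBlock: "
    "/tmp/hadoop-hadoop/mapred/staging/hadoop/.staging/job_201206270914_17301/job.jar. "
    "blk_-2108024038073283869_5671629\n"
)

blocks = block_to_path(sample_log)
print(blocks.get("blk_-2108024038073283869_5671629"))
```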

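I then checked the logs on the corresponding datanodes and found no record of this block. That exposed the problem: the jobtracker had requested storage from the namenode for the mapred job's files, and the namenode had correctly allocated it (a list of datanodes), but the jobtracker then hit an error when contacting the datanodes. Yet the datanodes were working normally at the time (a data-loading workload was running on them). So what caused the jobtracker's write to the datanodes to fail?

Looking more closely at the datanode logs from the time of the failure, I found this entry: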

2012-07-03 18:07:10,274 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.1.25:50010, storageID=DS-841642307-50010-1324273874581, infoPort=50075, ipcPort=50020):DataXceiver
java.io.IOException: xceiverCount 257 exceeds the limit of concurrent xcievers 256
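An entry like this is easy to miss among normal datanode output, so a small scan for it can speed up the search. The following Python sketch assumes the message text is exactly as shown above; find_xceiver_overflows is a hypothetical helper name.

```python
import re

# Matches the datanode error text quoted above and captures both numbers.
XCEIVER_RE = re.compile(
    r"xceiverCount (\d+) exceeds the limit of concurrent xcievers (\d+)"
)


def find_xceiver_overflows(log_text):
    """Return a list of (observed_count, configured_limit) pairs."""
    return [(int(count), int(limit)) for count, limit in XCEIVER_RE.findall(log_text)]


# The exception line from the datanode log above (sample input only).
sample_log = (
    "java.io.IOException: xceiverCount 257 exceeds the limit of "
    "concurrent xcievers 256\n"
)

overflows = find_xceiver_overflows(sample_log)
print(overflows)  # prints [(257, 256)]
```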

A Baidu search for the error message "xceiverCount 257 exceeds the limit of concurrent xcievers 256" showed that the error comes from this configuration item:

<property>
        <name>dfs.datanode.max.xcievers</name>
        <value>256</value>
</property>

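For a datanode, dfs.datanode.max.xcievers plays a role similar to the file handle limit on Linux: once the number of connections on a datanode exceeds the configured value, the datanode refuses further connections.

With the cause identified, the fix is simply to find a suitable time to update the configuration on every datanode in the cluster and raise dfs.datanode.max.xcievers.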
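Editing the property by hand on every datanode is error-prone, so the change can be scripted. The sketch below rewrites the value in a configuration string with Python's standard xml.etree.ElementTree. The helper name set_property is hypothetical, and 4096 is only an illustrative value; the article does not say what the limit was raised to.

```python
import xml.etree.ElementTree as ET


def set_property(config_xml, name, value):
    """Return config_xml with the named Hadoop property set to value."""
    root = ET.fromstring(config_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            value_el = prop.find("value")
            if value_el is None:
                # Property exists without a <value>; add one.
                value_el = ET.SubElement(prop, "value")
            value_el.text = str(value)
            break
    else:
        # Property not present yet; append a new <property> element.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Sample configuration holding the setting from this article.
sample_config = (
    "<configuration>"
    "<property>"
    "<name>dfs.datanode.max.xcievers</name>"
    "<value>256</value>"
    "</property>"
    "</configuration>"
)

# 4096 is an assumed example value, not taken from the article.
updated = set_property(sample_config, "dfs.datanode.max.xcievers", 4096)
print(updated)
```

In practice the input would be the contents of each datanode's HDFS configuration file, and the datanodes would most likely need a restart for the new limit to take effect.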

This article was originally written in Chinese and published on aliyun.com.
