This afternoon a colleague got an execution error when submitting a query through Hive.
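So I opened the JobTracker admin page and found that the number of running jobs was zero, while the TaskTracker heartbeats were normal. This was unusual (the cluster very rarely has zero running jobs) and made me suspect that the JobTracker had stopped serving requests, so I manually submitted a MapReduce job as a test. It failed with the following output: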
12/07/03 18:07:22 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
12/07/03 18:07:22 INFO hdfs.DFSClient: Abandoning block blk_-1772232086636991458_5671628
12/07/03 18:07:28 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
12/07/03 18:07:28 INFO hdfs.DFSClient: Abandoning block blk_-2108024038073283869_5671629
12/07/03 18:07:34 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 192.168.1.25:50010
12/07/03 18:07:34 INFO hdfs.DFSClient: Abandoning block blk_-6674020380591432013_5671629
12/07/03 18:07:40 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 192.168.1.26:50010
12/07/03 18:07:40 INFO hdfs.DFSClient: Abandoning block blk_-3788726859662311832_5671629
12/07/03 18:07:46 WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3002)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2255)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2446)
12/07/03 18:07:46 WARN hdfs.DFSClient: Error Recovery for block blk_-3788726859662311832_5671629 bad datanode[2] nodes == null
12/07/03 18:07:46 WARN hdfs.DFSClient: Could not get block locations. Source file "/tmp/hadoop-hadoop/mapred/staging/hadoop/.staging/job_201206270914_17301/job.jar" - Aborting...
The namenode log shows that block blk_-2108024038073283869_5671629 was allocated for the jar file (job.jar) of the job being submitted:
2012-07-03 18:07:27,316 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/hadoop-hadoop/mapred/staging/hadoop/.staging/job_201206270914_17301/job.jar. blk_-2108024038073283869_5671629
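I then checked the logs on the corresponding datanodes and found no record of that block. That narrowed the problem down: the job submission had asked the namenode for storage for the job files, and the namenode had correctly allocated it (a list of datanodes), but the write failed as soon as those datanodes were contacted. Yet the datanodes themselves were working normally at the time (data-loading jobs were running on them). So what made the write to the datanodes fail?
I went back and read the datanode logs from the time of the failure more carefully, and found this entry: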
2012-07-03 18:07:10,274 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.1.25:50010, storageID=DS-841642307-50010-1324273874581, infoPort=50075, ipcPort=50020):DataXceiver
java.io.IOException: xceiverCount 257 exceeds the limit of concurrent xcievers 256
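To see how often a datanode runs into this limit, its log can be scanned for the message above. The following is a minimal sketch, not part of the original investigation; the function names `parse_xceiver_error` and `count_xceiver_errors` are made up for illustration, and the only thing assumed about the log is the message text shown in the entry above.

```python
import re
from typing import Iterable, Optional, Tuple

# Pattern copied from the datanode error message quoted above.
# Note that "xcievers" is misspelled in the Hadoop message itself.
XCEIVER_ERR = re.compile(
    r"xceiverCount (\d+) exceeds the limit of concurrent xcievers (\d+)"
)


def parse_xceiver_error(line: str) -> Optional[Tuple[int, int]]:
    """Return (observed_count, configured_limit) if the line reports the error, else None."""
    match = XCEIVER_ERR.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def count_xceiver_errors(lines: Iterable[str]) -> int:
    """Count how many log lines report that the xceiver limit was exceeded."""
    return sum(1 for line in lines if parse_xceiver_error(line) is not None)
```

Passing an open log file to `count_xceiver_errors` (a file object iterates line by line) gives a rough idea of whether the limit is hit occasionally or constantly.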
I searched Baidu for the meaning of the error message "xceiverCount 257 exceeds the limit of concurrent xcievers 256" and found that it is mainly caused by this configuration item:
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>256</value>
</property>
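For a datanode, dfs.datanode.max.xcievers plays much the same role as the file-handle limit on Linux: once the number of connections on the datanode exceeds the configured value, the datanode refuses further connections.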
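So the problem is found. The fix is to pick a suitable maintenance window, change the configuration on every datanode in the cluster to raise dfs.datanode.max.xcievers, and restart the datanodes so that the new value takes effect.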
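The change itself is a one-line edit to the property in each datanode's hdfs-site.xml. As a rough sketch of how it could be scripted, the helper below rewrites a property inside a Hadoop configuration document. The function name `set_property` and the value 4096 are assumptions for illustration only; pick a value that suits the cluster. On Hadoop 2.x and later the property has been renamed to dfs.datanode.max.transfer.threads.

```python
import xml.etree.ElementTree as ET


def set_property(config_xml: str, name: str, value: str) -> str:
    """Return config_xml with property `name` set to `value`, adding it if absent.

    Assumes the usual Hadoop layout: a <configuration> root element holding
    <property><name>..</name><value>..</value></property> children.
    """
    root = ET.fromstring(config_xml)
    for prop in root.findall("property"):
        if (prop.findtext("name") or "").strip() == name:
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            break
    else:
        # Property not present yet: append a new <property> element.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Illustrative usage; 4096 is an assumed example value, not a recommendation.
example = (
    "<configuration><property>"
    "<name>dfs.datanode.max.xcievers</name><value>256</value>"
    "</property></configuration>"
)
updated = set_property(example, "dfs.datanode.max.xcievers", "4096")
```

The default ElementTree parser drops XML comments and processing instructions, so for a hand-maintained hdfs-site.xml it may be simpler to edit the value by hand and use a script like this only to verify the result.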