Scenario: a Spark job fails with an error
1. Error message:
17/05/09 14:30:58 WARN scheduler.TaskSetManager: Lost task 28162.1 in stage 0.0 (TID 30490, 127.0.0.1): java.io.IOException: Cannot obtain block length for LocatedBlock{BP-203532773-dfsfdf-1476004795661:blk_1080431162_6762963; getBlockSize()=411; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[127.0.0.1:1004,DS-e9905a06-4607-4113-b717-709a087b8b96,DISK], DatanodeInfoWithStorage[127.0.0.1:1004,DS-a5046b43-4416-45d9-8ff6-44891bcdf3b8,DISK], DatanodeInfoWithStorage[127.0.0.1:1004,DS-f6b04bbe-9555-4ac8-b06a-3317eb229511,DISK]]}
2. Reference for the fix:
https://community.hortonworks.com/questions/37412/cannot-obtain-block-length-for-locatedblock.html
3. Check the files
Output of the check command; note the open-file figures in parentheses:
hdfs fsck /user/admin/data/cdn/20170509 -locations -blocks -files
Status: HEALTHY
Total size: 2115443944 B (Total open files size: 7684855 B)
Total dirs: 1
Total files: 67353
Total symlinks: 0 (Files currently being written: 367)
Total blocks (validated): 67339 (avg. block size 31414 B) (Total open file blocks (not validated): 357)
Minimally replicated blocks: 67339 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 6
Number of racks: 1
Finding: 367 files are still open for writing, with 357 open-file blocks that fsck did not validate.
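The open-file figures can also be pulled out of the summary with a small filter instead of reading it by eye. This is a minimal sketch, assuming the fsck summary is piped in on standard input and uses the same wording as the output above; the function name `open_file_counts` is hypothetical.

```shell
#!/bin/sh
# Print only the open-file figures from an `hdfs fsck` summary read on stdin.
# open_file_counts is a hypothetical helper name, not part of Hadoop.
# In a basic regular expression the parentheses below are literal characters.
open_file_counts() {
    grep -o -e 'Files currently being written: [0-9]*' \
            -e 'Total open file blocks (not validated): [0-9]*'
}

# Usage (assumes the hdfs client is on PATH):
#   hdfs fsck /user/admin/data/cdn/20170509 -locations -blocks -files | open_file_counts
```

If the filter prints nothing, the summary contained no open-file figures, which is the state to aim for before rerunning the job.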
4. List the problem files
hdfs fsck /user/admin/data/cdn/20170509 -openforwrite
Total size: 2123128799 B
Total dirs: 1
Total files: 67720
Total symlinks: 0
Total blocks (validated): 67696 (avg. block size 31362 B)
************************
CORRUPT FILES: 253
MISSING BLOCKS: 253
MISSING SIZE: 7473074 B
************************
Minimally replicated blocks: 67443 (99.626274 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 2.9887881
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 6
Number of racks: 1
FSCK ended at Wed May 10 10:01:56 CST 2017 in 1357 milliseconds
The filesystem under path '/user/admin/data/cdn/20170509' is CORRUPT
(1) Find the problem files (tmp.txt is presumably the fsck output saved to a file)
cat tmp.txt |tr '/' '\n' |grep ngaahcs-acc |tr ':' ' '|awk '{print $1}' |sort |uniq |grep -v "2017112318"
(2) The preferred fix: delete the .tmp files
hdfs dfs -rm -r /user/admin/data/cdn/20170509/*.tmp
This did not solve the problem, however.
(3) After deleting the .tmp files, run the check again
hdfs fsck /user/admin/data/cdn/20170509 -openforwrite
Or find those files this way
[root@eeeee spark]# hdfs fsck /user/admin/data/cdn/20170509 -openforwrite |grep "/user/admin/data/cdn//20170509"
Connecting to namenode via http://rrrrrr:50070
/user/admin/data/cdn//20170509/ngaahcs-access.log..201705090002.1494259322790.gz 250 bytes, 1 block(s), OPENFORWRITE:
/user/admin/data/cdn//20170509/ngaahcs-access.log..201705090002.1494259322790.gz: MISSING 1 blocks of total size 250 B.......
/user/admin/data/cdn//20170509/ngaahcs-access.log.705090000.1494259200039.gz 1222 bytes, 1 block(s), OPENFORWRITE:
/user/admin/data/cdn//20170509/ngaahcs-access.log.l4.201705090000.1494259200039.gz: MISSING 1 blocks of total size 1222
/user/admin/data/cdn//20170509/ngaahcs-access.log.C2-3l4.201705090245.1494269103909.gz 211 bytes, 1 block(s), OPENFORWRITE:
/user/admin/data/cdn//20170509/ngaahcs-access.log.CTSX2-3l4.201705090750.1494287404133.gz 1504 bytes, 1 block(s), OPENFORWRITE:
/user/admin/data/cdn//20170509/ngaahcs-access.log.CT-3l4.201705090820.1494289204450.gz 308 bytes, 1 block(s), OPENFORWRITE:
/user/admin/data/cdn//20170509/ngaahcs-access.log.C2-3l4.201705091545.1494315903839.gz 437 bytes, 1 block(s), OPENFORWRITE:
/user/admin/data/cdn//20170509/ngaahcs-access.log.SX3-3l3.201705090002.1494259321230.gz 1075 bytes, 1 block(s), OPENFORWRITE:
/user/admin/data/cdn//20170509/ngaahcs-access.log.CX3-3l4.201705090001.1494259260581.gz 521 bytes, 1 block(s), OPENFORWRITE:
/user/admin/data/cdn//20170509/ngaahcs-access.log.CT-X3-3l4.201705090001.1494259260581.gz: MISSING 1 blocks of total size
/user/admin/data/cdn//20170509/ngaahcs-access.log.CT-SX3-3l4.201705090002.1494259320807.gz 729 bytes, 1 block(s), OPENFORWRITE:
/user/admin/data/cdn//20170509/ngaahcs-access.log.CT-GX-GD-SX4-3l4.201705090001.1494259260236.gz 1138 bytes, 1 block(s), OPENFORWRITE:
/user/admin/data/cdn//20170509/ngaahcs-access.log.CT-3l4.201705090001.1494259260236.gz: MISSING 1 blocks of total size 1138 B.........................
/user/admin/data/cdn//20170509/ngaahcs-access.log.CTX9-3n3.201705090001.1494259260495.gz 2379 bytes, 1 block(s), OPENFORWRITE:
/user/admin/data/cdn//20170509/ngaahcs-access.log.CXq-3k1.201705090002.1494259320204.gz: MISSING 1 blocks of total size 10153
/user/admin/data/cdn//20170509/ngaahcs-access.log.CTXq-3k2.201705090001.1494259260772.gz 539 bytes, 1 block(s), OPENFORWRITE:
/user/admin/data/cdn//20170509/ngaahcs-access.log.CT-GXq-3n1.201705090002.1494259320328.gz 1278 bytes, 1 block(s), OPENFORWRITE:
/user/admin/data/cdn//20170509/ngaahcs-access.log.CT-G-3n2.201705090001.1494259260696.gz 2183 bytes, 1 block(s), OPENFORWRITE:
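The listing above mixes OPENFORWRITE entries, MISSING entries, and progress dots, so a filter that keeps only the paths is handy for the next steps. This is a rough sketch, assuming the fsck output arrives on standard input in the format shown above and that no path contains a space; `list_open_for_write` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the path of every file that `hdfs fsck ... -openforwrite` flags as
# OPENFORWRITE, one per line, de-duplicated. Reads fsck output on stdin.
# list_open_for_write is a hypothetical helper name; paths containing
# spaces are not handled.
list_open_for_write() {
    # Keep only "<path> <n> bytes, <n> block(s), OPENFORWRITE" fragments,
    # even when fsck has run several entries together on one line.
    grep -o '/[^ ]* [0-9][0-9]* bytes, [0-9][0-9]* block(s), OPENFORWRITE' |
        # Drop everything after the path.
        sed 's/ [0-9][0-9]* bytes, .*$//' |
        sort -u
}

# Usage (assumes the hdfs client is on PATH; open_files.txt is an arbitrary name):
#   hdfs fsck /user/admin/data/cdn/20170509 -openforwrite | list_open_for_write > open_files.txt
```

The resulting file has one path per line, so it can be reviewed by hand before anything is deleted or recovered.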
If the files are not important, delete them
hdfs dfs -rm -r /user/admin/data/cdn/meitu/20170509/ngaahcs-access.log.CT.201705090002.1494259322790.gz
hdfs dfs -rm -r /user/admin/data/cdn/meitu/20170509/ngaahcs-access.log.C.201705090002.1494259322790.gz
hdfs dfs -rm -r /user/admin/data/cdn/meitu/20170509/ngaahcs-access.log.CT-.201705090000.1494259200039.gz
hdfs dfs -rm -r /user/admin/data/cdn/meitu/20170509/ngaahcs-access.log.CT-.201705090245.1494269103909.gz
hdfs dfs -rm -r /user/admin/data/cdn/meitu/20170509/ngaahcs-access.log.CT-Gl4.201705090750.1494287404133.gz
hdfs dfs -rm -r /user/admin/data/cdn/meitu/20170509/ngaahcs-access.log.CT-G3l4.201705090820.1494289204450.gz
Check again
hdfs fsck /user/admin/data/cdn/20170509 -openforwrite
Total size: 2115004402 B
Total dirs: 1
Total files: 67337
Total symlinks: 0
Total blocks (validated): 67337 (avg. block size 31409 B)
Minimally replicated blocks: 67337 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 6
Number of racks: 1
FSCK ended at Wed May 10 10:16:52 CST 2017 in 1329 milliseconds
The filesystem under path '/user/admin/data/cdn//20170509' is HEALTHY
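The verdict line can be checked in a script before the job is rerun. Note that the earlier run without -openforwrite reported HEALTHY while files were still open, so the check only means something when fsck is run with -openforwrite. This is a minimal sketch under that assumption; `fsck_is_healthy` is a hypothetical helper name.

```shell
#!/bin/sh
# Exit with status 0 only if the fsck output on stdin contains the
# "The filesystem under path '...' is HEALTHY" verdict line.
# fsck_is_healthy is a hypothetical helper name, not part of Hadoop.
fsck_is_healthy() {
    grep -q "^The filesystem under path '.*' is HEALTHY"
}

# Usage (assumes the hdfs client is on PATH):
#   if hdfs fsck /user/admin/data/cdn/20170509 -openforwrite | fsck_is_healthy; then
#       echo "safe to rerun the Spark job"
#   fi
```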
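Then rerun the Spark job
Note: this is not a final fix, so the root cause still needs to be found.
If the files are important, they need to be repaired.
Check the status of each file one at a time and recover it.
Take this file as an example: /user/admin/data/cdn//20170508/ngaahcs-access.log.3k3.201705081700.1494234003128.gz
Run the recovery command: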
hdfs debug recoverLease -path <path-of-the-file> -retries <retry times>
hdfs debug recoverLease -path /user/admin/data/cdn//20170508/ngaahcs-access.log.C00.1494234003128.gz -retries 10
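With hundreds of open files, running the command by hand for each one is tedious. The loop below is a sketch of one way to batch it, assuming a list of paths (one per line) arrives on standard input. `recover_leases`, `HDFS_BIN`, and `RETRIES` are hypothetical names introduced here for illustration; `HDFS_BIN` exists only so a dry run can substitute `echo` for the real client.

```shell
#!/bin/sh
# Run `hdfs debug recoverLease` for every path read from stdin, one per line.
# recover_leases, HDFS_BIN and RETRIES are hypothetical names for this sketch.
#   HDFS_BIN  command to invoke (default: hdfs); set to echo for a dry run
#   RETRIES   value passed to -retries (default: 10, as in the example above)
recover_leases() {
    while IFS= read -r path; do
        [ -n "$path" ] || continue      # skip blank lines
        # </dev/null keeps the client from consuming the remaining path list.
        "${HDFS_BIN:-hdfs}" debug recoverLease -path "$path" \
            -retries "${RETRIES:-10}" </dev/null
    done
}

# Dry run first, then the real thing (open_files.txt is an arbitrary file name):
#   HDFS_BIN=echo; recover_leases < open_files.txt
#   HDFS_BIN=hdfs; recover_leases < open_files.txt
```

Printing the commands in a dry run first makes it easy to confirm the path list is what was intended before any lease is touched.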
Hadoop command reference:
https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#fsck