I. Summary
Recently the team running our Storm data-cleaning jobs reported that HDFS sporadically throws anomalies that prevent data from being written, and some Spark jobs writing data to HDFS at large scale see the client report various "All datanodes ... are bad" errors while the server side reports assorted timeouts. What is worth noting is that when these problems occur, the load on every DataNode is not high!
II. Fault Analysis
First, when we saw the various timeouts in the HDFS client, such as waits while reading packets, our first reaction was to ask why the DataNodes that make up the write pipeline were not receiving packets from their upstream peers. A common suggestion at this point is to increase the client's timeout, but that clearly cannot work here (the load on every node is very low). In addition, when the "All datanodes ... are bad" error appears, two explanations usually come to mind first: one, all DataNodes are genuinely unable to provide service; two, the DFSClient established a connection with the DataXceiver thread on the HDFS server but transmitted no packets for a long time, so the server's protection mechanism automatically closed the connection.
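As an aside, the "just raise the timeout" suggestion usually means bumping the DFSClient socket timeouts. Here is a minimal sketch of that tuning, assuming a Hadoop 2.x client (the two configuration keys are standard Hadoop keys; the class name and the values are mine). It can only mask symptoms in this case, because the DataNodes are blocked, not overloaded:

```java
// Minimal sketch: the usual knee-jerk tuning of DFSClient timeouts.
// Raising these values cannot fix the problem described in this post.
import org.apache.hadoop.conf.Configuration;

public class ClientTimeoutTuning {
    public static Configuration withLongerTimeouts() {
        Configuration conf = new Configuration();
        // Socket read timeout used by the DFSClient towards DataNodes (ms).
        conf.setInt("dfs.client.socket-timeout", 120 * 1000);
        // Socket write timeout towards DataNodes (ms).
        conf.setInt("dfs.datanode.socket.write.timeout", 16 * 60 * 1000);
        return conf;
    }
}
```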
For now "all datanode bad ..." This kind of problem, I basically can rule out the second kind of situation. Then look down, in the platform monitoring system to observe the Datanode thread dump information and heartbeat information, found the problem:
Reproduce the anomaly and observe the thread dump and heartbeat of all Datanode:
650) this.width=650; "src=" Http://s3.51cto.com/wyfs02/M01/71/5F/wKiom1XMZNGDioEzAAmWDEtDJz4179.jpg "title=" Cccccc.png "alt=" Wkiom1xmzngdioezaamwdetdjz4179.jpg "/>
This is horrifying: the heartbeat interval reaches 30 seconds!
Digging further, we reproduced the problem and used the jstack -l command to inspect the DataNode thread dumps, which showed the system calling FsDatasetImpl's createTemporary and checkDirs methods at high density:
650) this.width=650; "src=" http://s3.51cto.com/wyfs02/M02/71/5F/wKiom1XMZoziQeBQAAMLKL1KXPk061.jpg "title=" a1.png "alt=" Wkiom1xmzoziqebqaamlkl1kxpk061.jpg "/>
Because these high-frequency calls go through methods guarded by the coarse-grained FsDatasetImpl object lock, the heartbeat-sending thread and the DataXceiver threads end up blocked (they synchronize on the same coarse-grained FsDatasetImpl lock). Looking at the thread dump, the DataXceiver threads with which the DataNode handles requests are blocked:
650) this.width=650; "src=" http://s3.51cto.com/wyfs02/M02/71/5F/wKiom1XMZ5LCjqWIAAPPipnp2_s292.jpg "title=" a2.png "alt=" Wkiom1xmz5lcjqwiaappipnp2_s292.jpg "/>
The heartbeat-sending thread is blocked as well:
[Figure: thread dump showing the blocked heartbeat thread (http://s3.51cto.com/wyfs02/M00/71/5F/wKiom1XMZ9OCnJQjAAI35Qm7RS0582.jpg)]
As for the heartbeat-sending thread being blocked: from the source code, this is mainly because the DataNode has to gather the node's resource usage before sending a heartbeat to the NameNode, which it does through methods such as getDfsUsed, getCapacity, getAvailable and getBlockPoolUsed (see the FsDatasetImpl code):
[Figure: FsDatasetImpl code for the heartbeat resource methods (http://s3.51cto.com/wyfs02/M02/71/5F/wKiom1XMaA6iJckuAAHs34MsM7c330.jpg)]
These methods all fall within the scope of the FsDatasetImpl object lock, which is why the heartbeat thread is blocked. Look at the getDfsUsed source in particular:
[Figure: getDfsUsed source (http://s3.51cto.com/wyfs02/M00/71/5F/wKiom1XMaETAypZyAAEovFmXess737.jpg)]
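The resulting heartbeat delay can be reproduced outside of Hadoop with a self-contained toy program (entirely my own code, no Hadoop classes): one thread holds an object monitor while doing slow "disk" work, the way checkDirs or a slow createTemporary does, and the heartbeat-style reader must wait on the same monitor just to read a counter.

```java
// Toy demo of the heartbeat stall: the second thread only wants a statistic,
// but it queues behind whoever is holding the shared "dataset" monitor.
public class HeartbeatStallDemo {
    private static final Object datasetLock = new Object();

    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread(() -> {
            synchronized (datasetLock) {   // models the FsDatasetImpl lock
                sleep(10_000);             // models slow disk I/O under the lock
            }
        });
        writer.start();
        Thread.sleep(100);                 // let the writer grab the lock first

        long start = System.currentTimeMillis();
        synchronized (datasetLock) {       // models getDfsUsed()/getCapacity()
            long waited = System.currentTimeMillis() - start;
            System.out.println("heartbeat stats delayed by " + waited + " ms");
        }
        writer.join();
    }

    private static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```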
From the analysis above, we can basically work out the cause of the failure. When multiple batches of files are written to HDFS simultaneously at large scale, the DataNode thread dumps show a large number of DataXceiver threads and the heartbeat-sending thread blocked, with abnormal heartbeat intervals sometimes reaching tens of seconds. Because so many DataXceiver threads are blocked, each DFSClient's DataStreamer thread (which sends packets to the DataNodes) and ResponseProcessor thread (which receives acks from the DataNodes in the pipeline) get no service, and the DataNode's BlockReceiver threads stop working. This causes timeouts on the client, or, when the DFSClient writes packets to HDFS, none of the DataNodes in the pipeline can respond to the client's requests. The system then triggers pipeline fault tolerance, but every DataNode is unable to provide service because its DataXceiver threads are massively blocked, so in the end the client reports "All datanodes ... are bad" and the server side reports timeouts.
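For reference, the trigger on our side was nothing more exotic than many concurrent file writes. A rough, hypothetical driver showing the shape of such a workload (paths, file counts, sizes and thread counts are invented; only the Hadoop FileSystem calls are standard):

```java
// Hypothetical bulk-write driver; on Hadoop 2.5/2.6 runs like this produced
// the client timeouts and "All datanodes ... are bad" errors described above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BulkWriteDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);        // client for the default HDFS
        ExecutorService pool = Executors.newFixedThreadPool(64);
        byte[] chunk = new byte[1 << 20];            // 1 MiB payload per write call

        for (int i = 0; i < 10_000; i++) {
            final int fileNo = i;
            pool.submit(() -> {
                Path path = new Path("/tmp/bulk/file-" + fileNo);  // invented path
                try (FSDataOutputStream out = fs.create(path)) {
                    for (int j = 0; j < 128; j++) {
                        out.write(chunk);  // DataStreamer pushes packets down the pipeline
                    }
                } catch (Exception e) {
                    e.printStackTrace();   // where the write failures surface
                }
            });
        }
        pool.shutdown();
    }
}
```

The same kind of bulk-write run is what we later used to verify the fix on 2.7.1.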
In other words, this is a big bug in HDFS.
This is a bug in Hadoop 2.6: the code uses a very coarse-grained object lock (FsDatasetImpl), which leads to abnormal lock contention under large-scale write operations. The bug exists in both the 2.5 and 2.6 releases (our new cluster uses 2.6) and has been fixed in 2.6.1 and 2.7.0. The official patch information is at:
https://issues.apache.org/jira/browse/HDFS-7489
https://issues.apache.org/jira/browse/HDFS-7999
In fact, the specific fix is to break this coarse-grained object lock into multiple finer-grained locks and to decouple the thread with which the DataNode sends heartbeats to the NameNode from the lock in question.
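For illustration only (this is not the actual HDFS-7489/HDFS-7999 patch code), the direction of the fix looks roughly like this: heartbeat statistics come from counters that need no dataset-wide lock, slow disk checks get their own mutex, and the dataset lock is held only for short in-memory updates.

```java
// Illustrative sketch of finer-grained locking, not the real HDFS patches.
import java.util.concurrent.atomic.AtomicLong;

class FinerLockedDataset {
    private final AtomicLong dfsUsed = new AtomicLong();
    private final Object checkDirsMutex = new Object();
    private final Object datasetLock = new Object();

    // Heartbeat path: lock-free read, never blocked by writers or disk checks.
    long getDfsUsed() {
        return dfsUsed.get();
    }

    // Write path: holds the dataset lock only for the short in-memory update.
    void createTemporary(String blockId) {
        synchronized (datasetLock) {
            // register the replica in in-memory structures (fast)
        }
        // slow on-disk file creation happens outside the dataset lock
        dfsUsed.addAndGet(1L);
    }

    // Disk checking serialized by its own mutex, not the dataset lock.
    void checkDirs() {
        synchronized (checkDirsMutex) {
            // walk data directories; may be slow, but blocks no one else
        }
    }
}
```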
To further confirm that this is a bug in Hadoop 2.6, I upgraded the test cluster to 2.7.1 (a version containing the fix) and, while writing multiple batches of files at large scale, compared the blocked state of the DataNode heartbeat and DataXceiver threads and the heartbeat interval against the behavior on 2.6. Here is how Hadoop 2.7.1 performed:
[Figure: DataNode monitoring after the upgrade to Hadoop 2.7.1 (http://s3.51cto.com/wyfs02/M00/71/5F/wKiom1XMbLzg47GhAASCy-xlOBM716.jpg)]
After the test cluster was upgraded to Hadoop 2.7.1 and multiple batches of files were written to HDFS, the client no longer reported timeouts or the "All datanodes ... are bad" exception, and the server side no longer reported timeout exceptions either. Comparing this with the Hadoop 2.6 charts shown above also confirms that the bug is resolved in 2.7.1.
III. Fault Handling
The impact of this failure on our existing business was roughly the following:
A. It affects the data that Storm writes to HDFS at the moment the anomaly occurs.
B. If a job happens to be submitted exactly when this HDFS anomaly is triggered, the job's attached files cannot be uploaded to HDFS, which ultimately causes the job submission to fail.
C. If the HDFS anomaly lasts longer, it may trigger the fault tolerance of MR jobs more than 3 times and cause those jobs to fail.
Specific handling: perform a smooth rolling upgrade to Hadoop 2.7.1 without stopping the cluster.
The specific upgrade steps follow http://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html.
This article is from the "Ant" blog; please be sure to keep the source: http://zengzhaozheng.blog.51cto.com/8219051/1684432