Preface
I recently solved a slow-disk problem at work, and I found the whole process of discovery, analysis and resolution interesting and worthwhile. Disk monitoring in Hadoop today is still far from complete; on most DataNodes it is fair to call it a blind spot. On reflection, it is reasonable that Hadoop itself does not do this kind of monitoring: a problem like this is basically a hardware problem, so arguably it should not be monitored at the software level and there is no great need for it. But then we thought: if monitoring at the software level can help uncover hardware problems on a machine, that is still a good thing, because at least the problem gets found. Let's get started.
Slow Disk
I use the term "slow disk" to describe this phenomenon. A more precise English term would be slow-write disk: a disk on which write operations are very slow. Write operations here mainly include creating files, creating directories and writing files. A slow disk can be understood as a disk whose writes take far longer than the average. We ran into exactly this recently: on the normal disks, creating a test directory took only about a tenth or even a hundredth of a second, but I was surprised to find one disk on which it took about 5 minutes. Stranger still, the problem was intermittent and at times did not occur at all. Once a slow disk appears, it can seriously drag down the node, turning it into a slow node in the cluster and ultimately affecting the whole cluster. So the question is: since slow disks matter so much, how do we pinpoint which disk on which machine is the problem, given that there are many nodes and many disks on each node?
Discovering Slow Disks
Here are a few ways to find them:
1. Missed heartbeats. Generally, a slow disk affects the heartbeats between the DataNode and the NameNode, and the missed-heartbeat value becomes very large.
2. The traditional approach: monitor DataNode write operations through Ganglia. Compare the nodes and look for the few whose write times are unusually long.
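The node comparison in the second method can also be done in code. The following is a minimal sketch, assuming each node's average write latency has already been exported (for example from Ganglia) into a map. The class SlowNodeFinder, its method names and the factor of 5 are hypothetical and chosen only for illustration; they are not part of Hadoop or Ganglia.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: flags nodes whose average write latency is far above the median. */
final class SlowNodeFinder {

  /** Median of the given values; the list must not be empty. */
  static double median(List<Double> values) {
    List<Double> sorted = new ArrayList<>(values);
    Collections.sort(sorted);
    int n = sorted.size();
    if (n % 2 == 1) {
      return sorted.get(n / 2);
    }
    return (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
  }

  /** Returns the hosts whose latency exceeds factor * median, sorted by host name. */
  static List<String> findSuspects(Map<String, Double> avgWriteMsByHost, double factor) {
    List<String> suspects = new ArrayList<>();
    if (avgWriteMsByHost.isEmpty()) {
      return suspects;
    }
    double limit = factor * median(new ArrayList<>(avgWriteMsByHost.values()));
    // TreeMap gives a stable, alphabetical iteration order.
    for (Map.Entry<String, Double> e : new TreeMap<>(avgWriteMsByHost).entrySet()) {
      if (e.getValue() > limit) {
        suspects.add(e.getKey());
      }
    }
    return suspects;
  }

  public static void main(String[] args) {
    // Made-up sample values, in milliseconds.
    Map<String, Double> sample = new TreeMap<>();
    sample.put("dn1", 12.0);
    sample.put("dn2", 15.0);
    sample.put("dn3", 480.0);
    sample.put("dn4", 11.0);
    System.out.println(findSuspects(sample, 5.0));
  }
}
```

The median is used rather than the mean so that a single very slow node does not raise the limit it is measured against.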
Of course, the methods above only identify a node suspected of having a slow disk. Assuming the abnormal node has been found, the next step is to find the slow disk on it. This is not as complicated as it sounds; the simplest way is to write a script that runs the following on every disk:
time mkdir test
rm -rf test
You can then see which disk takes the longest. Of course, if you would rather use a dedicated Linux tool for checking disk read and write performance, that is even better.
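The same per-disk check can be written as a small program. This is a minimal sketch, assuming you pass one directory per disk on the command line; the class DiskMkdirProbe and its method names are hypothetical. Like the time mkdir command, it measures only the directory creation and removes the directory afterwards.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical probe: times the creation of a test directory on each disk. */
final class DiskMkdirProbe {

  /** Creates a uniquely named subdirectory under dir, times it, then removes it. */
  static long timeMkdirMillis(Path dir) throws IOException {
    long start = System.nanoTime();
    Path probe = Files.createTempDirectory(dir, "slow-disk-probe");
    long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
    Files.delete(probe); // clean up; the removal is not part of the measurement
    return elapsedMs;
  }

  /** Probes every directory in order; a failed probe is recorded as -1. */
  static Map<Path, Long> probeAll(Iterable<Path> dirs) {
    Map<Path, Long> result = new LinkedHashMap<>();
    for (Path dir : dirs) {
      try {
        result.put(dir, timeMkdirMillis(dir));
      } catch (IOException e) {
        result.put(dir, -1L);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // One argument per disk directory to probe.
    List<Path> dirs = new ArrayList<>();
    for (String arg : args) {
      dirs.add(Paths.get(arg));
    }
    probeAll(dirs).forEach((dir, ms) -> System.out.println(dir + "\t" + ms + " ms"));
  }
}
```

Because the slowness described earlier was intermittent, a single run may miss it; running the probe several times is likely to be more informative.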
Slow Disk Monitoring
The methods above fall short in usability and accuracy, especially when it comes to finding the slow disk itself. The most authoritative approach is to monitor every disk's write operations at the Hadoop level, which is certainly the most accurate, so we need to add custom metrics code. Below is a brief introduction to how we made this change. First, one principle to understand: the chain through which a DataNode writes to disk is
DataNode --> FsDatasetImpl --> volume list
The volume list holds each disk directory set in the configuration file. The class corresponding to each disk is FsVolumeImpl, which contains many methods for creating files.
The files created are eventually written to the disk that the object represents, so this object is what we want to monitor. So how do we start? As said at the beginning of this article, the Hadoop community has no additional monitoring for FsVolume, so we need to define a new metrics class, called FsVolumeMetrics, with the following metrics:
@Metrics(about = "FsVolume metrics", context = "dfs")
public class FsVolumeMetrics {
  static final Log LOG = LogFactory.getLog(FsVolumeMetrics.class);
  // One metrics object per volume, keyed by the registered name.
  private static final Map<String, FsVolumeMetrics> REGISTRY =
      Maps.newHashMap();

  int getTmpInputStreamsCounter;
  int createTmpFileCounter;
  int createRbwFileCounter;
  int getTmpInputStreamsTimeoutCounter;
  int createTmpFileTimeoutCounter;
  int createRbwFileTimeoutCounter;
  MetricsRegistry registry = null;

  // Each MutableRate tracks the number of operations and their average time.
  @Metric MutableRate getTmpInputStreamsOp;
  @Metric MutableRate createTmpFileOp;
  @Metric MutableRate createRbwFileOp;
  @Metric MutableRate getTmpInputStreamsTimeout;
  @Metric MutableRate createTmpFileTimeout;
  @Metric MutableRate createRbwFileTimeout;

  private FsVolumeMetrics(FsVolumeImpl volume) {
    this.createRbwFileCounter = 0;
    this.createTmpFileCounter = 0;
    this.getTmpInputStreamsCounter = 0;
    this.createRbwFileTimeoutCounter = 0;
    this.createTmpFileTimeoutCounter = 0;
    this.getTmpInputStreamsTimeoutCounter = 0;
    String name = "FsVolume:" + volume.getBasePath();
    LOG.info("Register FsVolume metric for path: " + name);
    registry = new MetricsRegistry(name);
  }

  // Returns the metrics object for the volume, registering it on first use.
  static FsVolumeMetrics create(FsVolumeImpl volume) {
    String n = "FsVolume:" + volume.getBasePath();
    LOG.info("Create FsVolume metric for path: " + n);
    synchronized (REGISTRY) {
      FsVolumeMetrics m = REGISTRY.get(n);
      if (m == null) {
        m = new FsVolumeMetrics(volume);
        DefaultMetricsSystem.instance().register(n, null, m);
        REGISTRY.put(n, m);
      }
      return m;
    }
  }

  public void addGetTmpInputStreamsOp(long time) {
    getTmpInputStreamsCounter++;
    getTmpInputStreamsOp.add(time);
  }

  public void addGetTmpInputStreamsTimeout(long time) {
    getTmpInputStreamsTimeoutCounter++;
    getTmpInputStreamsTimeout.add(time);
  }

  public void addCreateTmpFileOp(long time) {
    createTmpFileCounter++;
    createTmpFileOp.add(time);
  }

  public void addCreateTmpFileTimeout(long time) {
    createTmpFileTimeoutCounter++;
    createTmpFileTimeout.add(time);
  }

  public void addCreateRbwFileOp(long time) {
    createRbwFileCounter++;
    createRbwFileOp.add(time);
  }

  public void addCreateRbwFileTimeout(long time) {
    createRbwFileTimeoutCounter++;
    createRbwFileTimeout.add(time);
  }
}
Because each volume (disk) needs its own monitoring, the registered name has to include the path to tell the volumes apart. The advantage of MutableRate is that it tracks both the number of operations and their average time. We also count separately the write operations that time out, so we need to define a new configuration key for the write-disk timeout threshold, like this:
public static final String DFS_WRITE_VOLUME_THRESHOLD_TIME_MS =
    "dfs.write.volume.threshold.time.ms";
public static final long DFS_WRITE_VOLUME_THRESHOLD_TIME_MS_DEFAULT = 300;
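To make the idea behind these metrics concrete without any Hadoop dependency, here is a minimal sketch of a per-volume statistics object. The class VolumeWriteStats and its methods are hypothetical stand-ins, not the Hadoop MutableRate API. It shows the three things the metrics class relies on: one object per volume path, an operation count with an average time, and a separate count of operations that exceeded the threshold.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a per-volume metrics object; not the Hadoop metrics API. */
final class VolumeWriteStats {
  // One stats object per volume, keyed by a name that includes the path.
  private static final Map<String, VolumeWriteStats> REGISTRY = new HashMap<>();

  private final long thresholdMs;
  private long ops;      // number of recorded write operations
  private long totalMs;  // sum of their durations
  private long slowOps;  // operations that took longer than thresholdMs

  private VolumeWriteStats(long thresholdMs) {
    this.thresholdMs = thresholdMs;
  }

  /** Returns the stats object for a volume path, creating it on first use. */
  static VolumeWriteStats forVolume(String basePath, long thresholdMs) {
    synchronized (REGISTRY) {
      String name = "FsVolume:" + basePath;
      VolumeWriteStats stats = REGISTRY.get(name);
      if (stats == null) {
        stats = new VolumeWriteStats(thresholdMs);
        REGISTRY.put(name, stats);
      }
      return stats;
    }
  }

  /** Records one write operation and counts it as slow if it exceeded the threshold. */
  synchronized void record(long durationMs) {
    ops++;
    totalMs += durationMs;
    if (durationMs > thresholdMs) {
      slowOps++;
    }
  }

  synchronized long ops() { return ops; }
  synchronized long slowOps() { return slowOps; }
  synchronized double avgMs() { return ops == 0 ? 0.0 : (double) totalMs / ops; }

  public static void main(String[] args) {
    // 300 ms mirrors the default threshold defined above; the path is made up.
    VolumeWriteStats stats = forVolume("/data1", 300L);
    stats.record(20L);
    stats.record(450L);
    System.out.println(stats.ops() + " ops, " + stats.slowOps() + " slow, avg "
        + stats.avgMs() + " ms");
  }
}
```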
Next comes the code that registers this metrics class. Note that it is registered in FsVolumeImpl:
FsVolumeImpl(FsDatasetImpl dataset, String storageID, File currentDir,
    Configuration conf, StorageType storageType) throws IOException {
  this.dataset = dataset;
  this.storageID = storageID;
  this.reserved = conf.getLong(
      DFSConfigKeys.DFS_DATANODE_DU_RESERVED_KEY,
      DFSConfigKeys.DFS_DATANODE_DU_RESERVED_DEFAULT);
  this.reservedForRbw = new AtomicLong(0L);
  ....
  metric = FsVolumeMetrics.create(this);
}
This way, each disk has its own monitoring object. Next comes the monitoring of the write methods. Only one of the monitored methods is shown here; the others are similar and are not listed. The full code is linked at the end of the article.
@Override // FsDatasetSpi
public synchronized ReplicaHandler createRbw(StorageType storageType,
    ExtendedBlock b, boolean allowLazyPersist) throws IOException {
  ReplicaInfo replicaInfo = volumeMap.get(b.getBlockPoolId(),
      b.getBlockId());
  ....
  }
  FsVolumeImpl v = (FsVolumeImpl) ref.getVolume();
  // create an rbw file to hold block in the designated volume
  File f;
  try {
    long startTime = Time.monotonicNow();
    f = v.createRbwFile(b.getBlockPoolId(), b.getLocalBlock());
    long duration = Time.monotonicNow() - startTime;
    if (duration > volumeThresholdTime) {
      LOG.warn("Slow create RbwFile to volume=" + v.getBasePath() + " took "
          + duration + "ms (threshold=" + volumeThresholdTime + "ms)");
      v.metric.addCreateRbwFileTimeout(duration);
    }
    v.metric.addCreateRbwFileOp(duration);
  } catch (IOException e) {
    IOUtils.cleanup(null, ref);
    throw e;
  }
  .....
}
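The timing pattern used in createRbw (read a monotonic clock, run the operation, compare the duration with the threshold, record the result) can be isolated into a small helper. This is a minimal sketch under my own assumptions: the class TimedDiskOp, its parameters and the injected clock are hypothetical and are not part of the patch. The clock is passed in so that the logic can be exercised without a real slow disk.

```java
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical helper that times an operation and reports slow calls. */
final class TimedDiskOp {

  /**
   * Runs op and measures it with clockMs. The duration always goes to onEveryOp,
   * and additionally to onSlowOp when it exceeds thresholdMs. If op throws,
   * nothing is recorded, as in the createRbw code above.
   */
  static <T> T run(Supplier<T> op, long thresholdMs, LongSupplier clockMs,
                   LongConsumer onEveryOp, LongConsumer onSlowOp) {
    long start = clockMs.getAsLong();
    T result = op.get();
    long duration = clockMs.getAsLong() - start;
    if (duration > thresholdMs) {
      onSlowOp.accept(duration);
    }
    onEveryOp.accept(duration);
    return result;
  }

  public static void main(String[] args) {
    // System.nanoTime() is monotonic, so wall-clock adjustments do not affect it.
    LongSupplier clock = () -> System.nanoTime() / 1_000_000L;
    String value = run(() -> "done", 300L, clock,
        d -> System.out.println("op took " + d + " ms"),
        d -> System.out.println("slow op: " + d + " ms"));
    System.out.println(value);
  }
}
```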
The createRbw change above is made in FsDatasetImpl, because that is where the method is invoked. That is all the monitoring logic; it is not complicated. Since this is a metrics class, it can be displayed as graphs in the Ganglia interface, and the result looks like this:
Because my configured data.dir is /home/data/data/hadoop/dfs/data, the graph titles shown above are that long. This is the final effect we wanted, and I hope it is useful to you. I have made this feature into a patch and submitted it to the open source community as HDFS-9510; anyone who wants to use it can git apply it themselves.
Slow Disk Resolution
Once a slow disk has been found, how do you deal with it? The simplest way is to take it offline immediately so that no more data is written to it, and then contact the operations department or handle it within your own team. But as said before, a hardware problem like a slow disk is more safely left to the professionals.
Related links:
Issue link: https://issues.apache.org/jira/browse/HDFS-9510
GitHub patch link: https://github.com/linyiqun/open-source-patch/blob/master/hdfs/HDFS-9510/HDFS-9510.002.patch