Hadoop Cluster Measurement


Is the cluster set up correctly? The best way to answer this question is empirically: run some jobs and confirm that you get the expected results. Benchmarks make good tests, as they also give you numbers that you can compare with other clusters as a sanity check on whether your new cluster is performing roughly as expected, and you can tune a cluster using benchmark results to squeeze the best performance out of it. This is often done with monitoring systems in place, so you can see how resources are being used across the cluster.

To get the best results, you should run benchmarks on a cluster that is not being used by others. In practice, this means just before it is put into service and users start relying on it. Once users have periodically scheduled jobs on a cluster, it is generally impossible to find a time when the cluster is not being used (unless you arrange downtime with users), so you should run benchmarks to your satisfaction before this happens.

Experience has shown that most hardware failures for new systems are hard drive failures. By running I/O-intensive benchmarks, such as the ones described next, you can "burn in" the cluster before it goes live.

Hadoop benchmarks

Hadoop comes with several benchmarks that you can run very easily with minimal setup cost. The benchmarks are packaged in the test JAR file, and you can get a list of them, with descriptions, by invoking the JAR file with no arguments:

 
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar

Most of the benchmarks show usage instructions when invoked with no arguments. For example:

 
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO
TestDFSIO.0.0.4
Usage: TestDFSIO -read | -write | -clean [-nrFiles N] [-fileSize MB] [-resFile resultFileName] [-bufferSize Bytes]

Benchmarking HDFS with TestDFSIO

TestDFSIO tests the I/O performance of HDFS. It does this by using a MapReduce job as a convenient way to read or write files in parallel. Each file is read or written in a separate map task, and the output of the map is used for collecting statistics relating to the file just processed. The statistics are accumulated in the reduce to produce a summary.
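The summary that the reduce produces (an average IO rate and its standard deviation) is ordinary descriptive statistics over the per-file rates. As an illustrative sketch only, not TestDFSIO's actual code, the following awk pipeline aggregates a handful of made-up per-file MB/sec figures in the same spirit:

```shell
# Hypothetical per-file IO rates (MB/sec), one per map task;
# these numbers are made up for illustration.
printf '%s\n' 7.1 8.3 7.9 6.8 8.4 |
awk '{ sum += $1; sumsq += $1 * $1; n++ }
     END {
       mean = sum / n
       # population standard deviation; TestDFSIO'\''s exact formula may differ
       printf "Average IO rate MB/sec: %.4f\n", mean
       printf "IO rate std deviation:  %.4f\n", sqrt(sumsq / n - mean * mean)
     }'
# → Average IO rate MB/sec: 7.7000
# → IO rate std deviation:  0.6419
```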

The following command writes 10 files of 1,000 MB each:

 
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000

 

At the end of the run, the results are written to the console and also recorded in a local file (which is appended to, so you can rerun the benchmark without losing old results):

% cat TestDFSIO_results.log
----- TestDFSIO ----- : write
           Date & time: Sun Apr 12 07:14:09 EDT 2009
       Number of files: 10
Total MBytes processed: 10000
     Throughput MB/sec: 7.796340865378244
Average IO rate MB/sec: 7.8862199783325195
 IO rate std deviation: 0.9101254683525547
    Test exec time sec: 163.387
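Because the results file is appended to, it doubles as a history of runs. As a sketch (the label casing can vary between Hadoop versions, hence the case-insensitive match), you can extract the throughput figure from every recorded run for quick comparison; a sample log in the format above is created first so the snippet is self-contained:

```shell
# Create a sample results file in the format shown above
# (the real file is TestDFSIO_results.log in the working directory).
cat > TestDFSIO_results.log <<'EOF'
----- TestDFSIO ----- : write
           Date & time: Sun Apr 12 07:14:09 EDT 2009
       Number of files: 10
Total MBytes processed: 10000
     Throughput MB/sec: 7.796340865378244
Average IO rate MB/sec: 7.8862199783325195
 IO rate std deviation: 0.9101254683525547
    Test exec time sec: 163.387
EOF

# Pull out the per-task throughput figure from every appended run,
# so successive benchmark runs can be compared at a glance.
grep -i 'throughput' TestDFSIO_results.log |
awk -F': ' '{ printf "%.2f MB/sec\n", $2 }'
# → 7.80 MB/sec
```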

 

The files are written under the /benchmarks/TestDFSIO directory by default (this can be changed by setting the test.build.data system property), in a directory called io_data.

To run a read benchmark, use the -read argument. Note that these files must already exist (having been written by TestDFSIO -write):

% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000

 

Here are the results for a real run:

 
----- TestDFSIO ----- : read
           Date & time: Sun Apr 12 07:24:28 EDT 2009
       Number of files: 10
Total MBytes processed: 10000
     Throughput MB/sec: 80.25553361904304
Average IO rate MB/sec: 98.6801528930664
 IO rate std deviation: 36.63507598174921
    Test exec time sec: 47.624
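Note that the throughput figure reported is per map task, not cluster-wide. A rough aggregate figure for a run can be estimated by dividing the total MBytes processed by the test execution time; for the read run above:

```shell
# Back-of-the-envelope aggregate read throughput for the run above:
# total MBytes processed divided by test execution time.
# This estimate ignores task start-up and scheduling overlap.
awk 'BEGIN { printf "%.1f MB/sec aggregate\n", 10000 / 47.624 }'
# → 210.0 MB/sec aggregate
```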

 

When you've finished benchmarking, you can delete all the generated files from HDFS using the -clean argument:

 
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -clean

Benchmarking MapReduce with sort

Hadoop comes with a MapReduce program that does a partial sort of its input. It is very useful for benchmarking the whole MapReduce system, as the full input dataset is transferred through the shuffle. The three steps are: generate some random data, perform the sort, then validate the results.

First we generate some random data using RandomWriter. It runs a MapReduce job with 10 maps per node, and each map generates (approximately) 10 GB of random binary data, with keys and values of various sizes. You can change these values if you like by setting the properties test.randomwriter.maps_per_host and test.randomwrite.bytes_per_map. There are also settings for the size ranges of the keys and values; see RandomWriter for details.
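The total volume RandomWriter generates is therefore nodes × maps per node × bytes per map. For example, on a hypothetical 10-node cluster with the defaults:

```shell
# Hypothetical cluster size; adjust for your own cluster.
nodes=10
maps_per_node=10   # default: test.randomwriter.maps_per_host
gb_per_map=10      # default: test.randomwrite.bytes_per_map (~10 GB)

echo "RandomWriter will generate about $(( nodes * maps_per_node * gb_per_map )) GB"
# → RandomWriter will generate about 1000 GB
```

So the default settings produce about a terabyte on a 10-node cluster, which is worth knowing before you kick off the job.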

Here's how to invoke RandomWriter (found in the examples JAR file, not the test one) to write its output to a directory called random-data:

 
% hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar randomwriter random-data

 

Next we can run the Sort program:

 
% hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar sort random-data sorted-data

 

The overall execution time of the sort is the metric we are interested in, but it's instructive to watch the job's progress via the web UI (http://jobtracker-host:50030/), where you can get a feel for how long each phase of the job takes.

As a final sanity check, we validate that the data in sorted-data is, in fact, correctly sorted:

 
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar testmapredsort -sortInput random-data \
  -sortOutput sorted-data

 

This command runs the SortValidator program, which performs a series of checks on the unsorted and sorted data to determine whether the sort is accurate. It reports the outcome to the console at the end of its run:

 
SUCCESS! Validated the MapReduce framework's 'sort' successfully.

Other benchmarks

There are many more Hadoop benchmarks, but the following are widely used:

    • MRBench (invoked with mrbench) runs a small job a number of times. It acts as a good counterpoint to sort, as it checks whether small job runs are responsive.

    • NNBench (invoked with nnbench) is useful for load-testing namenode hardware.

    • Gridmix is a suite of benchmarks designed to model a realistic cluster workload by mimicking a variety of data-access patterns seen in practice. See src/benchmarks/gridmix2 in the distribution for further details. [63]

User jobs

For tuning, it is best to include a few jobs that are representative of the jobs that your users run, so your cluster is tuned for these and not just for the standard benchmarks. If this is your first Hadoop cluster and you don't have any user jobs yet, then Gridmix is a good substitute.

When running your own jobs as benchmarks, you should select a dataset for your user jobs that you use each time you run the benchmarks, to allow comparisons between runs. When you set up a new cluster, or upgrade a cluster, you will be able to use the same dataset to compare the performance with previous runs.
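One lightweight way to make those comparisons is to append each run's elapsed time to a log and compute the change between the two most recent runs. A sketch with hypothetical numbers (the log file name and figures are made up for illustration):

```shell
# Hypothetical benchmark history: one line per run, "date seconds".
cat > sort_benchmark.log <<'EOF'
2009-04-12 163.4
2009-05-02 171.2
EOF

# Percent change in elapsed time between the two most recent runs;
# a large positive change flags a regression worth investigating.
tail -n 2 sort_benchmark.log |
awk '{ t[NR] = $2 }
     END { printf "elapsed time changed by %+.1f%%\n", 100 * (t[2] - t[1]) / t[1] }'
# → elapsed time changed by +4.8%
```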

 

From http://answers.oreilly.com/topic/460-how-to-benchmark-a-hadoop-cluster/
