Hadoop Benchmark Test


I. Test Conditions

The benchmark test should be started immediately after the cluster is fully installed and configured. During the benchmark test, no other tasks should be run in the cluster.

II. Test Objectives

1. Hard disk failure: the most common fault in a new system. You can flush out such failures by running a high-intensity I/O benchmark, such as TestDFSIO.

2. MapReduce performance

III. Test Methods

1. Benchmark HDFS with TestDFSIO

Run the write test first and the read test second, since the read test reads the files created by the write test.

Write test:

Use 10 map tasks to write 10 files of 1,000 MB each (matching the -fileSize 1000 argument below).

hadoop jar $HADOOP_HOME/hadoop-test-*.jar TestDFSIO -write -nrFiles 10 -fileSize 1000

At the end of the run, the results are written to the console and appended to TestDFSIO_results.log in the local current directory.

Benchmark data is written to the /benchmarks/TestDFSIO directory on HDFS by default.
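To inspect the outcome, you can, for example, dump the local results log and list the benchmark files on HDFS:

cat TestDFSIO_results.log
hadoop fs -ls /benchmarks/TestDFSIO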

Read test:

hadoop jar $HADOOP_HOME/hadoop-test-*.jar TestDFSIO -read -nrFiles 10 -fileSize 1000

Clear test data:

hadoop jar $HADOOP_HOME/hadoop-test-*.jar TestDFSIO -clean

2. Test MapReduce with the sort example

Hadoop comes with a sort example program. It is useful for testing the whole MapReduce system, because the entire input dataset is shuffled and transferred to the reducers. There are three steps in total: generate some random data, run the sort, and then validate the result.
First, use randomwriter to generate the random data. It runs a MapReduce job with 10 map tasks per node, and each map generates roughly 1 GB of random binary data (about 10 GB per node), with keys and values of varying lengths.

hadoop jar hadoop-examples-0.20.2-cdh3u1.jar randomwriter random-data
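The remaining two steps are the sort itself and the validation. A sketch of both, assuming the same examples/test jars and the random-data directory produced above (jar names may differ in your distribution):

hadoop jar hadoop-examples-0.20.2-cdh3u1.jar sort random-data sorted-data
hadoop jar $HADOOP_HOME/hadoop-test-*.jar testmapredsort -sortInput random-data -sortOutput sorted-data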

3. TeraSort Benchmark Test

Sorting 1 TB of data is commonly used to measure the data processing capability of a distributed data processing framework. TeraSort is a sort job that ships with Hadoop. In 2008, Hadoop won first place in the 1 TB sort benchmark, finishing in 209 seconds.

First, run teragen to generate the input data.

Write 1,000,000 rows of 100 bytes each, in the following format:

(10-byte key) (10-byte row id) (78-byte filler) \r\n

The key is 10 random printable characters; for example, a row might start with a key such as .t^#\|v$2\ followed by the row id and filler.

hadoop jar hadoop-examples-0.20.2-cdh3u1.jar teragen 1000000 terasort/1000000-input

View the generated data:

hadoop fs -ls /user/hadoop/terasort/1000000-input
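To spot-check the row format described above, you can also print the first few hundred bytes of one of the generated files (the part-file name below is just an illustration and may differ):

hadoop fs -cat terasort/1000000-input/part-00000 | head -c 300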

The general form of the sort command is:

hadoop jar hadoop-*-examples.jar terasort in-dir out-dir

Run the sort:

hadoop jar hadoop-examples-0.20.2-cdh3u1.jar terasort terasort/1000000-input terasort/1000000-output

View the sorted output:

hadoop fs -ls terasort/1000000-output
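The same examples jar also provides teravalidate, which checks that the output is globally sorted. A sketch of that optional final step, assuming the directories used above (the report directory name is arbitrary):

hadoop jar hadoop-examples-0.20.2-cdh3u1.jar teravalidate terasort/1000000-output terasort/1000000-validate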

4. GridMix Benchmark Test

Hadoop GridMix is a benchmark suite for Hadoop systems. It provides the functional modules needed to evaluate a large-scale data processing system, including generating data, generating and submitting jobs, and recording job completion times. This section describes the design and usage of the GridMix2 benchmark (located under src/benchmarks in the source tree) in Hadoop 0.20.2.

Job type

GridMix evaluates Hadoop performance by simulating the actual load on a Hadoop cluster. Based on user-defined parameters, it generates a large amount of data and a batch of jobs, submits the jobs together (as a batch), and finally records how long they take to run. To simulate as many workloads as possible, GridMix ships with several representative jobs, including streamSort, javaSort, webdataScan, combiner (a job that only compresses its results), monsterQuery, and webdataSort. They can be divided into the following types:

(1) Three-stage map/reduce job

Input: 500 GB compressed (equivalent to 2 TB uncompressed) SequenceFile

(key, value) = (5 words, 100 words)

Stage 1: the map retains 10% of the data, the reduce retains 40%.

Stage 2: the map retains 10%, the reduce retains 40%; its input is the output of stage 1.

Stage 3: the map retains 10%, the reduce retains 40%; its input is the output of stage 2.

Motivation: many workloads are pipelines of map/reduce jobs, including Pig workloads.

Corresponding job: monsterQuery

(2) Large-scale sort with variable key and value lengths

Input: 500 GB compressed (equivalent to 2 TB uncompressed) SequenceFile

(key, value) = (5-10 words, 100-10000 words)

Compute: the map retains 100% of the data, the reduce retains 100%.

Motivation: processing large-scale compressed data is very common.

Corresponding job: webdataSort

(3) Filtering

Input: 500 GB compressed (equivalent to 2 TB uncompressed) SequenceFile

(key, value) = (5-10 words, 100-10000 words)

Compute: the map retains 0.2% of the data, the reduce retains 5%.

Motivation: filtering large datasets is common.

Corresponding job: webdataScan

(4) API text sort (sorting by calling the map/reduce APIs directly)

Input: 500 GB uncompressed text

(key, value) = (1-10 words, 0-200 words)

Compute: the map retains 100% of the data, the reduce retains 100%.

Motivation: map/reduce jobs that call the library APIs directly for sorting are common.

Corresponding jobs: streamSort and javaSort. streamSort uses the shell command cat as both mapper and reducer (so it is not really a sort, just a row-by-row pass over the data), while javaSort sorts using the Java APIs.

A benchmark load generated by GridMix contains jobs of different types, each processing a different data volume. The number of jobs and the data volumes are configured in an XML file; GridMix constructs the jobs from this configuration, submits them to the cluster, and monitors their execution until all jobs are complete.

Usage

(1) Compile

Enter "ant" in/home/hadoop/hadoop_install/src/benchmarks/gridmix2 to generate the gridmix. jar file in the build directory and copy it to the gridmix directory.

(2) Configure Environment Variables

Modify the following variable values in the gridmix-env-2 script:

HADOOP_HOME: the Hadoop installation path

HADOOP_VERSION: the Hadoop version, for example hadoop-0.20.2

HADOOP_CONF_DIR: the conf path, for example ${HADOOP_HOME}/conf

USE_REAL_DATA: whether to use the large (2 TB) dataset. If set to false, the default data volume is 2 GB; you can adjust it in generateGridmix2data.sh as needed.
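A sketch of what the edited section of gridmix-env-2 might look like; the paths below are illustrative assumptions, not required values:

export HADOOP_HOME=/home/hadoop/hadoop_install/hadoop-0.20.2
export HADOOP_VERSION=hadoop-0.20.2
export HADOOP_CONF_DIR=${HADOOP_HOME}/conf
export USE_REAL_DATA=false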

(3) Configure job information

GridMix provides a default gridmix_conf.xml, which can be modified as needed: the job types and counts, the data volume each job processes, the number of reduce tasks, and whether to compress the results. You can configure multiple jobs of the same type with different numbers of reduces. For example:

<property>
  <name>javaSort.smallJobs.numOfJobs</name>
  <value>8,2</value>
  <description></description>
</property>

<property>
  <name>javaSort.smallJobs.numOfReduces</name>
  <value>15,70</value>
  <description></description>
</property>

The preceding example configures 10 small javaSort jobs: eight of them use 15 reduce tasks each, and the other two use 70 reduce tasks each.

In GridMix, each job type comes in three sizes: small, medium, and large. A small job has only three map tasks and processes just the files part-00000, part-00001, and part-00002. The number of tasks in a medium job depends on the total data volume: it processes the files whose names match {part-000*0, part-000*1, part-000*2}; for example, with 10 input files part-00000 through part-00009, only the first three are processed. A large job processes all of the data.

(4) Generate data

Use the generateGridmix2data.sh script to generate the data; you can configure the data volume as needed. GridMix assumes a data compression ratio of about 4x.
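For example, from the gridmix2 directory (this assumes the variables above have already been configured):

./generateGridmix2data.sh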

(5) Run

First, make sure the Hadoop cluster is running, then execute ./rungridmix_2. The script creates start.out to record the start time of the run and, when the jobs finish, creates end.out with the end time.
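A sketch of launching the run and checking the recorded timestamps afterwards:

./rungridmix_2
cat start.out end.out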

Summary

Hadoop GridMix has two versions; this article covers the second, GridMix2. GridMix2 is easy to extend: you can add other jobs and simulate batch submission. However, it cannot simulate scenarios in which jobs are submitted at random times (for example, arrivals following a Poisson distribution).
