Exploring the existing Hadoop testing framework


Background of Hadoop's existing testing framework

Since the first day we started using Hadoop, we have never stopped developing Hadoop features and fixing Hadoop bugs. This development model has lasted for several years, but one recurring observation is that the bugs we fix and the features we develop have never been verified by a standardized, unified, efficient, well-managed, and clearly documented testing process. A common pattern is that after a feature is developed or a bug is fixed, the change is verified against a manually simulated environment with ad hoc tests; once those tests pass, the change is packaged and released into the baseline version. The disadvantages of this mode are:

1. Without a unified testing framework and specification, every change is handled in an improvised way: a test environment is built or simulated just for that change, and once the test is finished the environment is torn down and no documentation is produced, so regression testing cannot be performed for subsequent releases.

2. The simulated test environment depends entirely on the judgment of the tester, so the same modification may be tested in different ways, with different results, at different points in time.

3. For performance-related changes there is no corresponding performance testing tool, so nobody can accurately state the performance characteristics of a cluster that has been running for a long time. Future improvements and optimizations therefore cannot be quantified, and their effects are difficult to evaluate.

4. Developers and testers are usually the same person.

In fact, each version of Hadoop ships with test code for many of its features and improvements. When a version is released or a patch is integrated, the full set of test cases is regressed so that the modification does not affect other modules or functions. Knowledge of Hadoop testing is therefore just as important as an understanding of the Hadoop code and of subsequent development.

The following sections briefly introduce the testing frameworks and structures in Hadoop, and then give some examples showing how to write test code for Hadoop kernel development.

Directory structure of versions earlier than 0.21

In Hadoop versions earlier than 0.21 (0.20 is used as the example here; other versions may differ), all test-related code is stored under ${HADOOP_HOME}/src/test. Within this directory, the test code for the different modules is separated into different subdirectories, and the test tree uses the same package structure as the corresponding Hadoop source code. For example, the source code of org.apache.hadoop.hdfs.server.namenode is in src/hdfs/org/apache/hadoop/hdfs/server/namenode, and its test cases are in src/test/org/apache/hadoop/hdfs/server/namenode. The same applies to the other modules.
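For instance, the NameNode code and its tests mirror each other like this:

src/hdfs/org/apache/hadoop/hdfs/server/namenode/   (source code)
src/test/org/apache/hadoop/hdfs/server/namenode/   (test cases)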

Test Case Structure

Taking HDFS as an example, tests that do not require a cluster environment look the same as ordinary unit test code: program-level verification and assertions, no different from general test case code.

MiniDFSCluster

If you need to simulate an HDFS cluster environment but do not have a real cluster, the Hadoop test code provides the MiniDFSCluster class, which offers a local, single-process HDFS cluster to simulate the real thing. During initialization, this class uses the constructor parameters to set the key configuration values of the cluster environment, such as dfs.name.dir, dfs.data.dir, fs.checkpoint.dir, and fs.default.name (the last is set to hdfs://localhost:port, equivalent to a single-machine distributed HDFS environment), and it starts as many DataNodes as the constructor's DataNode count requests, so that a realistic distributed environment is built. If a functional test needs specific NameNode or DataNode configuration parameters, you only need to set them on the Configuration object passed to the constructor.

Example

Here we use the TestDFSRename case as an example:

  1. Build the MiniDFSCluster environment, so the HDFS mini-cluster is initialized in the setUp() of the test case (see the sketch below).
  2. Obtain the DistributedFileSystem instance.
  3. Write your own testRename() method.
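A minimal sketch of these three steps, written against the 0.20 API (the real TestDFSRename in the Hadoop source covers many more rename scenarios; the class name and paths below are illustrative only):

package org.apache.hadoop.hdfs;

import junit.framework.TestCase;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TestDFSRenameSketch extends TestCase {
  private MiniDFSCluster cluster;
  private FileSystem fs;

  protected void setUp() throws Exception {
    Configuration conf = new Configuration();          // per-test settings go on this conf object
    cluster = new MiniDFSCluster(conf, 2, true, null); // 1. two DataNodes, format the FS, default racks
    fs = cluster.getFileSystem();                      // 2. the DistributedFileSystem of the mini-cluster
  }

  protected void tearDown() throws Exception {
    if (fs != null) fs.close();
    if (cluster != null) cluster.shutdown();
  }

  public void testRename() throws Exception {          // 3. the actual test logic
    Path src = new Path("/user/test/src");
    Path dst = new Path("/user/test/dst");
    assertTrue(fs.mkdirs(src));
    assertTrue(fs.rename(src, dst));
    assertFalse(fs.exists(src));
    assertTrue(fs.exists(dst));
  }
}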
Running a test case in the IDE for debugging

Generally, a unit-level test case like this can be run and debugged directly in the IDE of the development environment.

Regression

When you want to release a version or build a new Hadoop version, you can run a regression test as part of the build: run all the related test cases and check whether some change breaks the logic of other modules. In this case, ant can be used to run the corresponding target to execute all the test cases, as shown below:
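For example, to regress all the core test cases from ${HADOOP_HOME}:

ant test-core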

In this way, all the cases under the test-core target are regressed. When all of them pass, it is at least assured that, under the expected conditions, there is no problem with the current code version. The logs of every case are recorded under ${HADOOP_HOME}/build/test.

There you can find the error log of a failing test case, see why the case failed, and then track down other problems caused by the code modification.

As the number of test cases grows, a complete regression can take a long time, possibly several hours. If you want to run a single test case through ant, you can run:

ant -Dtestcase=${casename} test-core

Here, casename is the name of the test case, for example TestDFSRename.

Some Test Tools

We often need to run performance tests against HDFS or MapReduce, such as RPC performance, DFS I/O read/write performance, DFS throughput, NameNode benchmarks, MapReduce sort performance, and so on. The Hadoop release already provides many such tools, packaged into a jar for direct use. The following is the list of tools bundled with 0.20.2:

DFSCIOTest: Distributed I/O benchmark of libhdfs.
DistributedFSCheck: Distributed checkup of the file system consistency.
MRReliabilityTest: A program that tests the reliability of the MR framework by injecting faults/failures.
TestDFSIO: Distributed I/O benchmark.
dfsthroughput: Measures HDFS throughput.
filebench: Benchmark for SequenceFile(Input|Output)Format (block and record compressed, and uncompressed) and Text(Input|Output)Format (compressed and uncompressed).
loadgen: Generic map/reduce load generator.
mapredtest: A map/reduce test check.
minicluster: Single-process HDFS and MR cluster.
nnbench: A benchmark that stresses the NameNode.
testbigmapoutput: A map/reduce program that works on a very big non-splittable file and does identity map/reduce.
testfilesystem: A test for FileSystem read/write.
testrpc: A test for RPC.
testsequencefile: A test for flat files of binary key/value pairs.
threadedmapbench: A map/reduce benchmark that compares the performance of maps with multiple spills to maps with one spill.

When each tool is run on its own, it prints detailed help information to the command line; following those prompts, you can run stress and performance tests against the modules you are interested in, and each tool reports a statistical result. To add a custom stress-testing tool, write the stress-testing program and register it with org.apache.hadoop.test.AllTestDriver. The code of each benchmark tool listed above can also be found by starting from AllTestDriver.
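For example, the tools are launched through the bundled test jar (the jar name below is the 0.20.2 one; the TestDFSIO options shown are typical and should be checked against the tool's own help output):

hadoop jar hadoop-0.20.2-test.jar
hadoop jar hadoop-0.20.2-test.jar TestDFSIO -write -nrFiles 10 -fileSize 100

Registering a custom tool follows the same pattern AllTestDriver uses internally, roughly as follows (MyStressTest is a hypothetical placeholder class):

ProgramDriver pgd = new ProgramDriver();   // org.apache.hadoop.util.ProgramDriver
pgd.addClass("mystresstest", MyStressTest.class, "A custom stress test (illustrative)");
pgd.driver(argv);                          // dispatch to the tool named on the command line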

Summary

As you can see, the testing support in the Hadoop release is already very rich, with plenty of classes and tools for simulating a cluster environment. These classes and tools are very useful for developers: writing new test code and adding new test cases is convenient, and it is an effective way to avoid introducing new problems when modifying the code.

However, apart from some of the test tools, most test cases run in a simulated environment, and there is no test framework for the real cluster environment. The reason for this gap is that many test cases need per-case settings for the daemon processes of the Hadoop cluster, which means stopping, reconfiguring, and restarting the cluster; in 0.20 and earlier there was no Java API for reconfiguring and restarting a real cluster from within a test case, so external manual or script intervention was required. Once manual or script intervention is required, many tests can no longer be automated. For this reason, a new large-scale automated test framework was introduced (HADOOP-6332).

Version 0.21 and later

Starting with 0.21, a new test framework is included in the Hadoop release: the large-scale automated test framework, called Herriot. It differs from the earlier test framework in that the tests developed on top of it run at the system level, against a real cluster environment.

The main feature of the Herriot testing framework is that, through the APIs Herriot provides for the HDFS and MR systems, a test can directly start, stop, and restart a real Hadoop cluster, which guarantees that each case runs in a completely fresh cluster execution environment. In this way, fully automated testing of a real cluster environment can be done from Java test cases alone, without additional manual steps or external scripts.

Directory structure

Herriot is built on the JUnit 4 framework and uses some key JUnit fixtures, such as @Before and @After. For test developers, writing Herriot tests is therefore just JUnit test case programming, so anyone familiar with JUnit test development will have no trouble using the Herriot framework.

In the new test framework, the test code is placed in:

 
src/
  test/
    system/
      test/
        [org.apache.hadoop.hdfs|org.apache.hadoop.mapred]

The framework code itself is located in org.apache.hadoop.test.system, while the HDFS- and MR-related Herriot test code is located in org.apache.hadoop.hdfs.test.system and org.apache.hadoop.mapreduce.test.system respectively.

Example

Here we use a real case from the Herriot system, src/test/system/test/org/apache/hadoop/mapred/TestCluster.java, as the example.

In this case, everything starts from @BeforeClass. This before method creates a cluster proxy instance (here, a MapReduce cluster); the proxy lets the program talk directly to the MapReduce daemon processes (the JT and the TTs). The second line of the program creates all the MapReduce daemon proxies and makes these daemon processes available to the test program through the Herriot library API. Herriot ensures that the test environment is completely clean and that all internal daemon state has been reset. In addition, the logs of all daemon processes are collected; these logs make it much easier for developers and testers to locate problems. @BeforeClass ensures that only one cluster proxy instance is in service for all the test cases, to avoid conflicts.
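A minimal sketch of this structure, based on the TestCluster example described above (the MRCluster proxy class lives in org.apache.hadoop.mapreduce.test.system; method names may vary slightly between Herriot versions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.test.system.MRCluster;
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;

public class TestClusterSketch {
  private static MRCluster cluster;

  @BeforeClass
  public static void before() throws Exception {
    cluster = MRCluster.createCluster(new Configuration()); // cluster proxy for the real MR cluster
    cluster.setUp();   // create the JT/TT daemon proxies and verify/reset daemon state
  }

  @AfterClass
  public static void after() throws Exception {
    cluster.tearDown(); // release the proxies once all cases in the class have run
  }

  @Test
  public void testSomething() throws Exception {
    // test logic against the real cluster goes here, driven through the proxies
  }
}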

  • In the test, submitting a job to the cluster is also very simple, as shown below:

The new JT API call submitAndVerifyJob(Configuration conf) checks whether the submitted job has completed successfully. It also tracks the details of the job (for example, how many map and reduce tasks were run), monitors the job's progress and success, and performs the corresponding cleanup. If anything goes wrong at any point, the test framework throws an exception.
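In sketch form, assuming the cluster proxy from the previous example (getJTClient() is the MRCluster accessor for the JobTracker-side proxy; treat getConf() as an assumed accessor here):

Configuration conf = cluster.getConf();          // configuration of the running cluster (assumed accessor)
cluster.getJTClient().submitAndVerifyJob(conf);  // submit, track, verify and clean up the job as described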

The following code modifies the configuration of the cluster and restarts it, and then restarts it again with the previous configuration.
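A hedged sketch of that flow (the restart methods and the property used below follow the Herriot system-test examples as best recalled; treat the exact names and signatures as assumptions to verify against your Herriot version):

// push one extra property into mapred-site.xml and restart every daemon with it
Hashtable<String, Long> props = new Hashtable<String, Long>();   // java.util.Hashtable
props.put("mapred.tasktracker.taskmemorymanager.monitoring-interval", 2000L);
cluster.restartClusterWithNewConfig(props, "mapred-site.xml");
// ... run the checks that need the new setting ...
cluster.restart();   // restart again with the original configuration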

Test case execution environment

Before executing a test case, the client must have:

  • Access to an existing Hadoop cluster that supports Herriot

  • The corresponding Hadoop configuration file directory (usually under $HADOOP_CONF_DIR)

The client running the test cases does not need the Hadoop binary package; Herriot tests run directly from the source tree with the following command:

ant test-system -Dhadoop.conf.dir.deployed=${HADOOP_CONF_DIR}
 

In this way, test-system runs all the system test cases. If you only want to run a single test case, just add the option -Dtestcase=testname to the command, for example:
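ant test-system -Dhadoop.conf.dir.deployed=${HADOOP_CONF_DIR} -Dtestcase=TestCluster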

After the tests are executed, the execution results and logs can be found in the build-fi/system/test directory.

Generally, the test client is deployed on the gateway machine of the cluster, but the tests can also be run from any other machine that has access to the cluster, such as a laptop.
