Introduction to the Hadoop JobHistory Server

Hadoop ships with a history server (the JobHistory Server). On it you can look up the records of MapReduce jobs that have run, for example how many map and reduce tasks a job used, when it was submitted, when it started, and when it finished. By default the history server is not started; it can be started with the following command:

$ sbin/mr-jobhistory-daemon.sh start historyserver

The history server's web UI is then available on port 19888 of the machine it runs on, where you can browse the state of jobs that have run. The history server can be started on a single machine, and its addresses are controlled mainly by the following parameters:

<property>
  <name>mapreduce.jobhistory.address</name>
  <value>0.0.0.0:10020</value>
</property>

<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>0.0.0.0:19888</value>
</property>

The parameters above go in the mapred-site.xml file. The default values of mapreduce.jobhistory.address and mapreduce.jobhistory.webapp.address are 0.0.0.0:10020 and 0.0.0.0:19888 respectively; both take the form host:port and can be changed as needed. After changing them, restart the JobHistory server, and the job history will then be viewable on the host configured in mapreduce.jobhistory.webapp.address.
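
Besides the web UI, the history server exposes a REST API on the same webapp port. As a quick sanity check that the daemon is up, something like the following Python sketch can be used; it assumes the server runs on localhost with the default port 19888 and uses the /ws/v1/history/info endpoint from the MapReduce HistoryServer REST API, which you should verify against your Hadoop version:

# Minimal sketch: confirm the JobHistory server answers on its webapp port.
# Assumption: the server runs on localhost with the default port 19888.
import json
import urllib.request

url = "http://localhost:19888/ws/v1/history/info"
with urllib.request.urlopen(url, timeout=5) as resp:
    info = json.load(resp)

# The response carries the Hadoop version and the daemon's start time.
print(json.dumps(info, indent=2))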

Many people ask where this historical data is kept: it is stored in HDFS. The following configuration controls the HDFS directories in which historical job records are stored:

<property>
  <name>mapreduce.jobhistory.done-dir</name>
  <value>${yarn.app.mapreduce.am.staging-dir}/history/done</value>
</property>

<property>
  <name>mapreduce.jobhistory.intermediate-done-dir</name>
  <value>${yarn.app.mapreduce.am.staging-dir}/history/done_intermediate</value>
</property>

<property>
  <name>yarn.app.mapreduce.am.staging-dir</name>
  <value>/tmp/hadoop-yarn/staging</value>
</property>

The values above are the defaults and can be overridden in mapred-site.xml. Among them, mapreduce.jobhistory.done-dir points to the directory that holds records of jobs that have finished running, while mapreduce.jobhistory.intermediate-done-dir holds records of jobs that are still running. Let us look into the directory configured by mapreduce.jobhistory.done-dir to see what is stored in it:

[wyp@master /home/wyp/hadoop]# bin/hadoop fs -ls /jobs/done/
Found 2 items
drwxrwx---   - wyp supergroup          0 2013-12-03 23:36 /jobs/done/2013
drwxrwx---   - wyp supergroup          0 2014-02-01 00:02 /jobs/done/2014

[wyp@master /home/wyp/hadoop]# bin/hadoop fs -ls /jobs/done/2014/02/16
Found 27 items
drwxrwx---   - wyp supergroup          0 2014-02-16 /jobs/done/2014/02/16/001216
drwxrwx---   - wyp supergroup          0 2014-02-16 /jobs/done/2014/02/16/001217
drwxrwx---   - wyp supergroup          0 2014-02-16 /jobs/done/2014/02/16/001218
drwxrwx---   - wyp supergroup          0 2014-02-16 /jobs/done/2014/02/16/001219
drwxrwx---   - wyp supergroup          0 /jobs/done/2014/02/16/001220

[wyp@master hadoop]# bin/hadoop fs -ls /jobs/done/2014/02/16/001216
Found 1318 items
-rwxrwx---   3 wyp supergroup   45541335 2014-02-16 00:11 /jobs/done/2014/02/16/001216/job_1388830974669_1216161-1392478837250-wyp-insert+overwrite+table+qt_city_query_ana...e%28Stage-1392480689141-5894-33-SUCCEEDED-wyp.jhist
-rwxrwx---   3 wyp supergroup     193572 2014-02-16 00:11 /jobs/done/2014/02/16/001216/job_1388830974669_12161_conf.xml
-rwxrwx---   3 wyp supergroup   45594759 2014-02-16 00:11 /jobs/done/2014/02/16/001216/job_1388830974669_1216162-1392478837250-wyp-insert+overwrite+table+qt_city_query_ana...e%28Stage-1392480694818-5894-33-SUCCEEDED-wyp.jhist
-rwxrwx---   3 wyp supergroup     193572 2014-02-16 00:11 /jobs/done/2014/02/16/001216/job_1388830974669_1216162_conf.xml

From the listing above we can draw the following points:

(1) Historical job records are stored under a directory in HDFS;

(2) Because there may be a great many historical job records, they are laid out in subdirectories by year, month, and day, which makes them easier to manage and search;

(3) Each job's history is stored in two files, one with the suffix .jhist and one with the suffix .xml. The .jhist file stores detailed information about the job as a series of events, for example:

{
    "type": "JOB_INITED",
    "event": {
        "org.apache.hadoop.mapreduce.jobhistory.JobInited": {
            "jobid": "job_1388830974669_1215999",
            "launchTime": 1392477383583,
            "totalMaps": 1,
            "totalReduces": 1,
            "jobStatus": "INITED",
            "uberized": false
        }
    }
}

This is the event recording the initialization of a Hadoop job. As you can see, the data in the .jhist file is in JSON format, and the meaning of each record is distinguished by its type field. Hadoop defines the following event types in total (a small parsing sketch follows the list):

"JOB_SUBMITTED ",
"JOB_INITED ",
"JOB_FINISHED ",
"JOB_PRIORITY_CHANGED ",
"JOB_STATUS_CHANGED ",
"JOB_FAILED ",
"JOB_KILLED ",
"JOB_ERROR ",
"JOB_INFO_CHANGED ",
"TASK_STARTED ",
"TASK_FINISHED ",
"TASK_FAILED ",
"TASK_UPDATED ",
"NORMALIZED_RESOURCE ",
"MAP_ATTEMPT_STARTED ",
"MAP_ATTEMPT_FINISHED ",
"MAP_ATTEMPT_FAILED ",
"MAP_ATTEMPT_KILLED ",
"REDUCE_ATTEMPT_STARTED ",
"Performance_attempt_finished ",
"REDUCE_ATTEMPT_FAILED ",
"REDUCE_ATTEMPT_KILLED ",
"SETUP_ATTEMPT_STARTED ",
"SETUP_ATTEMPT_FINISHED ",
"SETUP_ATTEMPT_FAILED ",
"SETUP_ATTEMPT_KILLED ",
"CLEANUP_ATTEMPT_STARTED ",
"CLEANUP_ATTEMPT_FINISHED ",
"CLEANUP_ATTEMPT_FAILED ",
"CLEANUP_ATTEMPT_KILLED ",
"AM_STARTED"

The .xml file records the complete parameter configuration the corresponding job ran with; it is worth opening one to see what is inside (a small reading sketch follows point (4)).

(4) The history of each job is stored in its own files rather than being mixed with other jobs.
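
To illustrate what can be done with the *_conf.xml file mentioned in point (3), here is a hedged sketch that reads one parameter out of a local copy; it relies on the standard property/name/value layout of Hadoop configuration XML, and the file name below is just an example taken from the listing above:

# Minimal sketch: pull one parameter out of a job's _conf.xml copy.
# Assumption: the file was copied out of HDFS first; the file name is only
# an example taken from the done-dir listing above.
import xml.etree.ElementTree as ET

tree = ET.parse("job_1388830974669_1216162_conf.xml")
for prop in tree.getroot().findall("property"):
    if prop.findtext("name") == "mapreduce.job.reduces":
        print(prop.findtext("value"))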

The directory configured by mapreduce.jobhistory.intermediate-done-dir holds the records of Hadoop jobs that are currently running; if you are interested, you can browse it in the same way.

If the data shown on the history server's web UI is not enough, you can analyze the files under the mapreduce.jobhistory.done-dir directory yourself and extract whatever interests you: for example, how many map tasks ran on a given day, which job took the longest, or how many MapReduce jobs and tasks each user ran in total. This is a good way to monitor a Hadoop cluster, and the results can inform how resources are allocated to each user; a rough sketch of such an analysis follows.
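
The sketch below lists the done-dir recursively and aggregates per-user job and map-task counts from the .jhist file names. It assumes the hadoop command is on the PATH, that the done-dir is /jobs/done as in the listing above, and that each file name ends with ...-finishTime-numMaps-numReduces-STATUS-queue.jhist, which matches the listing shown earlier but should be verified on your Hadoop version:

# Minimal sketch: per-user job and map-task counts from the done-dir listing.
# Assumptions: `hadoop` is on PATH, the done-dir is /jobs/done as in the
# listing above, and each .jhist file name ends with
#   ...-finishTime-numMaps-numReduces-STATUS-queue.jhist
# (verify the name layout against your Hadoop version before relying on it).
import subprocess
from collections import Counter

DONE_DIR = "/jobs/done"

def list_jhist_paths(done_dir):
    out = subprocess.run(
        ["hadoop", "fs", "-ls", "-R", done_dir],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.split()[-1] for line in out.splitlines()
            if line.rstrip().endswith(".jhist")]

jobs_per_user = Counter()
maps_per_user = Counter()
for path in list_jhist_paths(DONE_DIR):
    fields = path.rsplit("/", 1)[-1][:-len(".jhist")].split("-")
    if len(fields) < 8:
        continue  # unexpected file name layout, skip it
    user = fields[2]       # jobid-submitTime-user-...
    num_maps = fields[-4]  # counted from the end, since job names may contain '-'
    jobs_per_user[user] += 1
    if num_maps.isdigit():
        maps_per_user[user] += int(num_maps)

print("jobs per user:", dict(jobs_per_user))
print("map tasks per user:", dict(maps_per_user))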

Careful readers may notice that the history server's web UI displays at most 20000 historical jobs. That limit is controlled by the following parameter; change it as needed and restart the JobHistory server for it to take effect:

<property>
  <name>mapreduce.jobhistory.joblist.cache.size</name>
  <value>20000</value>
</property>

