Introduction to the Hadoop JobHistory Server
Hadoop ships with a history server, on which you can view the records of MapReduce jobs that have run: for example, how many map tasks and how many reduce tasks were used, the job submission time, the job start time, and the job completion time. By default, the Hadoop history server is not started; you can start it with the following command:
$ sbin/mr-jobhistory-daemon.sh start historyserver
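To confirm the daemon actually came up, a quick check (a minimal sketch, assuming a default installation) is:
# The history server runs in its own JVM, named JobHistoryServer
$ jps | grep JobHistoryServer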
Once it is running, you can open the history server's web UI on port 19888 of the corresponding machine and view the status of jobs that have run. The history server can be started on a single machine and is configured mainly through the following parameters:
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>0.0.0.0:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>0.0.0.0:19888</value>
</property>
The above parameters are configured in the mapred-site.xml file. mapreduce.jobhistory.address and mapreduce.jobhistory.webapp.address default to 0.0.0.0:10020 and 0.0.0.0:19888 respectively; both take the form host:port and can be changed as needed. After changing them, restart the Hadoop jobhistory server, and you can then view the history of Hadoop jobs on the host configured in mapreduce.jobhistory.webapp.address.
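Besides the web UI, the history server also exposes a REST API on the webapp address. As an example (the path below follows the Hadoop 2.x HistoryServer REST API; replace historyserver-host with the host from mapreduce.jobhistory.webapp.address):
# List the finished MapReduce jobs known to the history server, as JSON
$ curl -s http://historyserver-host:19888/ws/v1/history/mapreduce/jobs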
Many people will ask where the history data is stored. The answer is HDFS. The following configuration controls which HDFS directories hold the job history records:
<property>
  <name>mapreduce.jobhistory.done-dir</name>
  <value>${yarn.app.mapreduce.am.staging-dir}/history/done</value>
</property>
<property>
  <name>mapreduce.jobhistory.intermediate-done-dir</name>
  <value>${yarn.app.mapreduce.am.staging-dir}/history/done_intermediate</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.staging-dir</name>
  <value>/tmp/hadoop-yarn/staging</value>
</property>
The values above are the defaults; we can modify them in the mapred-site.xml file. mapreduce.jobhistory.done-dir is the directory for records of Hadoop jobs that have finished running, while mapreduce.jobhistory.intermediate-done-dir holds records for jobs that are still running. With the default staging directory, done-dir resolves to /tmp/hadoop-yarn/staging/history/done. Let's look inside the directory configured as mapreduce.jobhistory.done-dir to see what is stored there:
[wyp@master /home/wyp/hadoop]# bin/hadoop fs -ls /jobs/done/
Found 2 items
drwxrwx---   - wyp supergroup          0 2013-12-03 23:36 /jobs/done/2013
drwxrwx---   - wyp supergroup          0 2014-02-01 00:02 /jobs/done/2014
[wyp@master /home/wyp/hadoop]# bin/hadoop fs -ls /jobs/done/2014/02/16
Found 27 items
drwxrwx---   - wyp supergroup          0 2014-02-16 /jobs/done/2014/02/16/001216
drwxrwx---   - wyp supergroup          0 2014-02-16 /jobs/done/2014/02/16/001217
drwxrwx---   - wyp supergroup          0 2014-02-16 /jobs/done/2014/02/16/001218
drwxrwx---   - wyp supergroup          0 2014-02-16 /jobs/done/2014/02/16/001219
drwxrwx---   - wyp supergroup          0 2014-02-16 /jobs/done/2014/02/16/001220
[wyp@master hadoop]# bin/hadoop fs -ls /jobs/done/2014/02/16/001216
Found 1318 items
-rwxrwx---   3 wyp supergroup   45541335 2014-02-16 00:11 /jobs/done/2014/02/16/001216/job_1388830974669_1216161-1392478837250-wyp-insert+overwrite+table+qt_city_query_ana...e%28Stage-1392480689141-5894-33-SUCCEEDED-wyp.jhist
-rwxrwx---   3 wyp supergroup     193572 2014-02-16 00:11 /jobs/done/2014/02/16/001216/job_1388830974669_12161_conf.xml
-rwxrwx---   3 wyp supergroup   45594759 2014-02-16 00:11 /jobs/done/2014/02/16/001216/job_1388830974669_1216162-1392478837250-wyp-insert+overwrite+table+qt_city_query_ana...e%28Stage-1392480694818-5894-33-SUCCEEDED-wyp.jhist
-rwxrwx---   3 wyp supergroup     193572 2014-02-16 00:11 /jobs/done/2014/02/16/001216/job_1388830974669_1216162_conf.xml
From the output above, we can observe the following:
(1) history job records are stored in HDFS;
(2) because there may be many history job records, they are organized into year/month/day subdirectories, which makes them easy to manage and search;
(3) each Hadoop job leaves two history files, one with the suffix *.jhist and one with the suffix *.xml. The *.jhist file stores the detailed information about the job, for example:
{
  "type": "JOB_INITED",
  "event": {
    "org.apache.hadoop.mapreduce.jobhistory.JobInited": {
      "jobid": "job_1388830974669_1215999",
      "launchTime": 1392477383583,
      "totalMaps": 1,
      "totalReduces": 1,
      "jobStatus": "INITED",
      "uberized": false
    }
  }
}
This record describes the initialization of a Hadoop job. As you can see, all data in the *.jhist file is in JSON format, and the meaning of each record is distinguished by its type field. Hadoop defines the following types in total:
"JOB_SUBMITTED ",
"JOB_INITED ",
"JOB_FINISHED ",
"JOB_PRIORITY_CHANGED ",
"JOB_STATUS_CHANGED ",
"JOB_FAILED ",
"JOB_KILLED ",
"JOB_ERROR ",
"JOB_INFO_CHANGED ",
"TASK_STARTED ",
"TASK_FINISHED ",
"TASK_FAILED ",
"TASK_UPDATED ",
"NORMALIZED_RESOURCE ",
"MAP_ATTEMPT_STARTED ",
"MAP_ATTEMPT_FINISHED ",
"MAP_ATTEMPT_FAILED ",
"MAP_ATTEMPT_KILLED ",
"REDUCE_ATTEMPT_STARTED ",
"Performance_attempt_finished ",
"REDUCE_ATTEMPT_FAILED ",
"REDUCE_ATTEMPT_KILLED ",
"SETUP_ATTEMPT_STARTED ",
"SETUP_ATTEMPT_FINISHED ",
"SETUP_ATTEMPT_FAILED ",
"SETUP_ATTEMPT_KILLED ",
"CLEANUP_ATTEMPT_STARTED ",
"CLEANUP_ATTEMPT_FINISHED ",
"CLEANUP_ATTEMPT_FAILED ",
"CLEANUP_ATTEMPT_KILLED ",
"AM_STARTED"
The *.xml file records the complete parameter configuration the corresponding job ran with, and is worth a look.
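For example, you can read a job's recorded configuration straight out of HDFS (using one of the conf files from the listing above):
# Print the first lines of the job's configuration snapshot
$ bin/hadoop fs -cat /jobs/done/2014/02/16/001216/job_1388830974669_12161_conf.xml | head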
(4) each job's history records are stored in files of their own.
The directory configured in mapreduce.jobhistory.intermediate-done-dir stores records for Hadoop jobs that are currently running; if you are interested, have a look inside it as well.
If the data shown on the Hadoop history server's web UI does not satisfy you, you can analyze the files under the mapreduce.jobhistory.done-dir directory yourself to extract the information you are interested in: for example, how many map tasks ran on a given day, how long the longest job took, or how many MapReduce jobs each user ran in total. This is a good way to monitor a Hadoop cluster, and we can decide how to allocate resources to a user based on such information.
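As a minimal sketch of such an analysis (job_XXXX.jhist below is a placeholder; substitute a real file name from a listing like the one above), pull one history file out of HDFS and tally its event types with standard shell tools:
# Copy one history file to the local disk (placeholder file name)
$ bin/hadoop fs -get /jobs/done/2014/02/16/001216/job_XXXX.jhist .
# Count how many events of each type the job produced
$ grep -o '"type": *"[A-Z_]*"' job_XXXX.jhist | sort | uniq -c | sort -rn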
Careful readers may notice that the web UI of the Hadoop history server displays at most 20,000 job records. This limit is configurable with the following parameter; restart the jobhistory server after changing it.
<property>
  <name>mapreduce.jobhistory.joblist.cache.size</name>
  <value>20000</value>
</property>
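For example, after raising the value, restart the daemon with the same script used to start it:
$ sbin/mr-jobhistory-daemon.sh stop historyserver
$ sbin/mr-jobhistory-daemon.sh start historyserver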