Distributed Systems: A Detailed Tutorial on the Hadoop Configuration File Loading Sequence


Before using Hadoop we need to configure several files: hadoop-env.sh, core-site.xml, hdfs-site.xml, and mapred-site.xml. So when does Hadoop actually read these files?

Hadoop is typically started with start-all.sh, so what does this script do?

The code is as follows:

# Start all Hadoop daemons. Run this on master node.
# Note: all Hadoop daemons are started from the master node.

bin=`dirname "$0"`
bin=`cd "$bin"; pwd`          # bin=$HADOOP_HOME/bin

if [ -e "$bin/../libexec/hadoop-config.sh" ]; then
  . "$bin"/../libexec/hadoop-config.sh
else
  . "$bin/hadoop-config.sh"
fi

# start dfs daemons
"$bin"/start-dfs.sh --config $HADOOP_CONF_DIR

# start mapred daemons
"$bin"/start-mapred.sh --config $HADOOP_CONF_DIR




Load hadoop-env.sh

The script first locates Hadoop's bin directory; since the Hadoop environment variables are configured, $HADOOP_HOME/bin could be used directly instead. The next step is to execute hadoop-config.sh, which may be in either the $HADOOP_HOME/libexec directory or the $HADOOP_HOME/bin directory; in the Hadoop version I use it is in $HADOOP_HOME/libexec. The hadoop-config.sh file contains the following lines of script:

hadoop-config.sh

The code is as follows:

if [ -f "${HADOOP_CONF_DIR}/hadoop-env.sh" ]; then
  . "${HADOOP_CONF_DIR}/hadoop-env.sh"
fi



This tests whether $HADOOP_HOME/conf/hadoop-env.sh exists as a regular file; if it does, the line . "${HADOOP_CONF_DIR}/hadoop-env.sh" sources the hadoop-env.sh script. At this point the environment variables we configured in hadoop-env.sh, such as JAVA_HOME, take effect. In fact, I feel this setting is not strictly necessary. Why? Since we install Hadoop on Linux, Java will certainly be installed as well, and a JAVA_HOME configured in /etc/profile takes effect in every shell process.

Load core-*.xml and hdfs-*.xml files

After the hadoop-config.sh command has been executed, start-all.sh runs $HADOOP_HOME/bin/start-dfs.sh. The purpose of this script is to start the three HDFS-related processes: namenode, datanode, and secondarynamenode.

start-dfs.sh

The code is as follows:

# Run this on master node.

usage="Usage: start-dfs.sh [-upgrade|-rollback]"

bin=`dirname "$0"`
bin=`cd "$bin"; pwd`

if [ -e "$bin/../libexec/hadoop-config.sh" ]; then
  . "$bin"/../libexec/hadoop-config.sh
else
  . "$bin/hadoop-config.sh"
fi

# get arguments
if [ $# -ge 1 ]; then
  nameStartOpt=$1
  shift
  case $nameStartOpt in
    (-upgrade)
      ;;
    (-rollback)
      dataStartOpt=$nameStartOpt
      ;;
    (*)
      echo $usage
      exit 1
      ;;
  esac
fi

# start dfs daemons
# start namenode after datanodes, to minimize time namenode is up w/o data
# note: datanodes will log connection errors until namenode starts
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode $nameStartOpt
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start datanode $dataStartOpt
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR --hosts masters start secondarynamenode


Take a closer look and you will notice that hadoop-config.sh is executed in start-dfs.sh as well. The reason is that we do not always use start-all.sh to start all of Hadoop's processes; sometimes we only need HDFS and not MapReduce, in which case we only run start-dfs.sh. The variables defined in hadoop-config.sh are also needed by the file-system-related processes, so hadoop-config.sh, and with it hadoop-env.sh, must be executed before namenode, datanode, and secondarynamenode are started.

Now look at the last three lines of the script, which start namenode, datanode, and secondarynamenode. When Hadoop is started there are five processes, three of which are namenode, datanode, and secondarynamenode; since each of them can be started as a process, the corresponding class must have a main method, which the source code confirms. That is not the point here, though; the point is to see how each class loads the configuration files. Whether it is namenode, datanode, or secondarynamenode, each of them loads the core-*.xml and hdfs-*.xml files at startup. Take the class org.apache.hadoop.hdfs.server.namenode.NameNode as an example; the other two classes, org.apache.hadoop.hdfs.server.datanode.DataNode and org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode, are similar.
org.apache.hadoop.hdfs.server.namenode.NameNode

The code is as follows:

public class NameNode implements ClientProtocol, DatanodeProtocol,
                                 NamenodeProtocol, FSConstants,
                                 RefreshAuthorizationPolicyProtocol,
                                 RefreshUserMappingsProtocol {
  static {
    Configuration.addDefaultResource("hdfs-default.xml");
    Configuration.addDefaultResource("hdfs-site.xml");
  }
  ...
}



Look at the contents of the static block: it is encouraging to see hdfs-default.xml and hdfs-site.xml there. A static block executes when the class is loaded into the JVM, as part of class initialization, not object initialization. Before the statement Configuration.addDefaultResource("hdfs-default.xml") can run, the Configuration class itself must be loaded into the JVM, so its own static block runs first; the sketch below illustrates this ordering with hypothetical classes. After that, look at the static block of org.apache.hadoop.conf.Configuration.
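As a quick illustration of that loading order, here is a minimal sketch of my own (Registry and Daemon are hypothetical classes, not part of Hadoop): referencing Registry from Daemon's static block forces Registry to be initialized first, just as NameNode's static block forces the Configuration class to be initialized before addDefaultResource can run.

// Hypothetical example, not Hadoop source code.
class Registry {
    static {
        System.out.println("Registry static block runs first");
    }

    static void addDefaultResource(String name) {
        System.out.println("registered " + name);
    }
}

public class Daemon {
    static {
        // Referencing Registry here triggers Registry's class initialization
        // before these calls execute -- the same ordering as NameNode and
        // Configuration.
        Registry.addDefaultResource("hdfs-default.xml");
        Registry.addDefaultResource("hdfs-site.xml");
    }

    public static void main(String[] args) {
        System.out.println("Daemon.main runs last");
    }
}

Running Daemon prints the Registry message first, then the two "registered" lines, and only then the message from main.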
org.apache.hadoop.conf.Configuration

The code is as follows:

static {
  // print deprecation warning if hadoop-site.xml is found in classpath
  ClassLoader cL = Thread.currentThread().getContextClassLoader();
  if (cL == null) {
    cL = Configuration.class.getClassLoader();
  }
  if (cL.getResource("hadoop-site.xml") != null) {
    LOG.warn("DEPRECATED: hadoop-site.xml found in the classpath. " +
        "Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, "
        + "mapred-site.xml and hdfs-site.xml to override properties of " +
        "core-default.xml, mapred-default.xml and hdfs-default.xml " +
        "respectively");
  }
  addDefaultResource("core-default.xml");
  addDefaultResource("core-site.xml");
}

The Configuration class loads the two files core-default.xml and core-site.xml when it is initialized. So namenode loads the core-*.xml and hdfs-*.xml files at startup, and the core-*.xml files are loaded by the Configuration class.
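To see the effect of this loading order in a client program, here is a minimal sketch of my own (assuming a Hadoop 1.x classpath where core-default.xml and core-site.xml are both visible; fs.default.name is only an example key). Resources added later override those added earlier, so a value set in core-site.xml wins over the one shipped in core-default.xml.

import org.apache.hadoop.conf.Configuration;

// A sketch, not Hadoop source: reads one core property from a fresh Configuration.
public class CoreConfDemo {
    public static void main(String[] args) {
        // The Configuration static block shown above has already registered
        // core-default.xml and core-site.xml as default resources.
        Configuration conf = new Configuration();

        // core-default.xml ships fs.default.name as file:///; if core-site.xml
        // overrides it (for example with hdfs://master:9000), the site value
        // is returned here because it was added later.
        System.out.println("fs.default.name = " + conf.get("fs.default.name"));
    }
}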
 
Load core-*.xml and mapred-*.xml files
 
After start-dfs.sh, start-all.sh executes start-mapred.sh, which is similar to start-dfs.sh.

The code is as follows:

start-mapred.sh

# Start Hadoop map reduce daemons. Run this on master node.

bin=`dirname "$0"`
bin=`cd "$bin"; pwd`

if [ -e "$bin/../libexec/hadoop-config.sh" ]; then
  . "$bin"/../libexec/hadoop-config.sh
else
  . "$bin/hadoop-config.sh"
fi

# start mapred daemons
# start jobtracker first to minimize connection errors at startup
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start jobtracker
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start tasktracker



This script also executes hadoop-config.sh, and therefore hadoop-env.sh as well, consistent with start-dfs.sh. The last two lines start the jobtracker and tasktracker processes, which correspond to the two classes org.apache.hadoop.mapred.JobTracker and org.apache.hadoop.mapred.TaskTracker.
 
Take org.apache.hadoop.mapred.JobTracker as an example; org.apache.hadoop.mapred.TaskTracker is similar.
org.apache.hadoop.mapred.JobTracker

The code is as follows:

public class JobTracker implements MRConstants, InterTrackerProtocol,
    JobSubmissionProtocol, TaskTrackerManager, RefreshUserMappingsProtocol,
    RefreshAuthorizationPolicyProtocol, AdminOperationsProtocol,
    JobTrackerMXBean {

  static {
    Configuration.addDefaultResource("mapred-default.xml");
    Configuration.addDefaultResource("mapred-site.xml");
  }
  ...
}




OK, with the explanation above, this should now be clear: the core-*.xml and mapred-*.xml files are loaded at jobtracker startup, and loading core-*.xml is again handled by the Configuration class.
 
Summary:

When all of the Hadoop processes are started with start-all.sh, the configuration files are loaded in the following order:

HDFS: hadoop-env.sh --> core-default.xml --> core-site.xml --> hdfs-default.xml --> hdfs-site.xml
MapReduce: hadoop-env.sh --> core-default.xml --> core-site.xml --> mapred-default.xml --> mapred-site.xml

Note that the *-default.xml system files are always loaded first, and core-*.xml is loaded by all five Hadoop processes, which makes core-*.xml a common base shared by all of them.
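A quick way to check this shared base is to print a Configuration: its toString() lists the resources in the order they were added (again a sketch of my own under the Hadoop 1.x assumption; the exact output format can differ between versions).

import org.apache.hadoop.conf.Configuration;

// A sketch, not Hadoop source: lists which resources a plain client-side
// Configuration has loaded.
public class ListResourcesDemo {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Expected to print something like:
        //   Configuration: core-default.xml, core-site.xml
        // hdfs-*.xml and mapred-*.xml only show up after the HDFS or MapReduce
        // classes that register them (NameNode, JobTracker, etc.) have loaded.
        System.out.println(conf);
    }
}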

The configuration files are loaded at process startup, which also means that if you modify a Hadoop configuration file, whether a system file or an administrator-defined one, you must restart the relevant processes for the change to take effect.
