Distributed Systems: A Detailed Tutorial on the Hadoop Configuration File Loading Sequence


Before using Hadoop we need to configure several files: hadoop-env.sh, core-site.xml, hdfs-site.xml, and mapred-site.xml. So when does Hadoop actually read these files?

Hadoop is typically started with start-all.sh, so what does this script do?

The code is as follows:

# Start all Hadoop daemons. Run this on master node.
# Note: all Hadoop daemons are started from the master node.

bin=`dirname "$0"`
bin=`cd "$bin"; pwd`          # bin=$HADOOP_HOME/bin

if [ -e "$bin/../libexec/hadoop-config.sh" ]; then
  . "$bin"/../libexec/hadoop-config.sh
else
  . "$bin/hadoop-config.sh"
fi

# start dfs daemons
"$bin"/start-dfs.sh --config $HADOOP_CONF_DIR

# start mapred daemons
"$bin"/start-mapred.sh --config $HADOOP_CONF_DIR




Load hadoop-env.sh

The script first locates Hadoop's bin directory; since the Hadoop environment variables are configured, $HADOOP_HOME/bin could be used directly instead. The next step is to execute hadoop-config.sh, which may be in either the $HADOOP_HOME/libexec directory or the $HADOOP_HOME/bin directory; in the Hadoop version I use it is in $HADOOP_HOME/libexec. The hadoop-config.sh file contains the following lines of script:

hadoop-config.sh

The code is as follows:

if [ -f "${HADOOP_CONF_DIR}/hadoop-env.sh" ]; then
  . "${HADOOP_CONF_DIR}/hadoop-env.sh"
fi



This tests whether $HADOOP_HOME/conf/hadoop-env.sh exists as a regular file; if it does, the line . "${HADOOP_CONF_DIR}/hadoop-env.sh" sources the hadoop-env.sh script. At this point the environment variables we configured in hadoop-env.sh, such as JAVA_HOME, take effect. In fact, I feel this setting is not strictly necessary. Why? Since we install Hadoop on Linux, Java will certainly be installed as well, and a JAVA_HOME configured in /etc/profile takes effect in every shell process.

Load core-*.xml and hdfs-*.xml files

After the hadoop-config.sh command has been executed, start-all.sh runs $HADOOP_HOME/bin/start-dfs.sh. The purpose of this script is to start the three HDFS-related processes: namenode, datanode, and secondarynamenode.

start-dfs.sh

The code is as follows:

# Run this on master node.

usage="Usage: start-dfs.sh [-upgrade|-rollback]"

bin=`dirname "$0"`
bin=`cd "$bin"; pwd`

if [ -e "$bin/../libexec/hadoop-config.sh" ]; then
  . "$bin"/../libexec/hadoop-config.sh
else
  . "$bin/hadoop-config.sh"
fi

# get arguments
if [ $# -ge 1 ]; then
  nameStartOpt=$1
  shift
  case $nameStartOpt in
    (-upgrade)
      ;;
    (-rollback)
      dataStartOpt=$nameStartOpt
      ;;
    (*)
      echo $usage
      exit 1
      ;;
  esac
fi

# start dfs daemons
# start namenode after datanodes, to minimize time namenode is up w/o data
# note: datanodes will log connection errors until namenode starts
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode $nameStartOpt
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start datanode $dataStartOpt
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR --hosts masters start secondarynamenode


Take a closer look and you will notice that hadoop-config.sh is executed in start-dfs.sh as well. The reason is that we do not always use start-all.sh to start all of Hadoop's processes; sometimes we only need HDFS and not MapReduce, in which case we only run start-dfs.sh. The variables defined in hadoop-config.sh are also needed by the file-system-related processes, so hadoop-config.sh, and with it hadoop-env.sh, must be executed before namenode, datanode, and secondarynamenode are started.

Now look at the last three lines of the script, which start namenode, datanode, and secondarynamenode. When Hadoop is started there are five processes, three of which are namenode, datanode, and secondarynamenode; since each of them can be started as a process, the corresponding class must have a main method, which the source code confirms. That is not the point here, though; the point is to see how each class loads the configuration files. Whether it is namenode, datanode, or secondarynamenode, each of them loads the core-*.xml and hdfs-*.xml files at startup. Take the class org.apache.hadoop.hdfs.server.namenode.NameNode as an example; the other two classes, org.apache.hadoop.hdfs.server.datanode.DataNode and org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode, are similar.
org.apache.hadoop.hdfs.server.namenode.NameNode

The code is as follows:

public class NameNode implements ClientProtocol, DatanodeProtocol,
                                 NamenodeProtocol, FSConstants,
                                 RefreshAuthorizationPolicyProtocol,
                                 RefreshUserMappingsProtocol {
  static {
    Configuration.addDefaultResource("hdfs-default.xml");
    Configuration.addDefaultResource("hdfs-site.xml");
  }
  ...
}



Look at the contents of the static block: it is encouraging to see hdfs-default.xml and hdfs-site.xml there. A static block executes when the class is loaded into the JVM, as part of class initialization, not object initialization. Before the statement Configuration.addDefaultResource("hdfs-default.xml") can run, the Configuration class itself must be loaded into the JVM, so its own static block runs first; the sketch below illustrates this ordering with hypothetical classes. After that, look at the static block of org.apache.hadoop.conf.Configuration.
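As a quick illustration of that loading order, here is a minimal sketch of my own (Registry and Daemon are hypothetical classes, not part of Hadoop): referencing Registry from Daemon's static block forces Registry to be initialized first, just as NameNode's static block forces the Configuration class to be initialized before addDefaultResource can run.

// Hypothetical example, not Hadoop source code.
class Registry {
    static {
        System.out.println("Registry static block runs first");
    }

    static void addDefaultResource(String name) {
        System.out.println("registered " + name);
    }
}

public class Daemon {
    static {
        // Referencing Registry here triggers Registry's class initialization
        // before these calls execute -- the same ordering as NameNode and
        // Configuration.
        Registry.addDefaultResource("hdfs-default.xml");
        Registry.addDefaultResource("hdfs-site.xml");
    }

    public static void main(String[] args) {
        System.out.println("Daemon.main runs last");
    }
}

Running Daemon prints the Registry message first, then the two "registered" lines, and only then the message from main.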
org.apache.hadoop.conf.Configuration

The code is as follows:

static {
  // print deprecation warning if hadoop-site.xml is found in classpath
  ClassLoader cL = Thread.currentThread().getContextClassLoader();
  if (cL == null) {
    cL = Configuration.class.getClassLoader();
  }
  if (cL.getResource("hadoop-site.xml") != null) {
    LOG.warn("DEPRECATED: hadoop-site.xml found in the classpath. " +
        "Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, "
        + "mapred-site.xml and hdfs-site.xml to override properties of " +
        "core-default.xml, mapred-default.xml and hdfs-default.xml " +
        "respectively");
  }
  addDefaultResource("core-default.xml");
  addDefaultResource("core-site.xml");
}

The Configuration class loads the two files core-default.xml and core-site.xml when it is initialized. So namenode loads the core-*.xml and hdfs-*.xml files at startup, and the core-*.xml files are loaded by the Configuration class.
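To see the effect of this loading order in a client program, here is a minimal sketch of my own (assuming a Hadoop 1.x classpath where core-default.xml and core-site.xml are both visible; fs.default.name is only an example key). Resources added later override those added earlier, so a value set in core-site.xml wins over the one shipped in core-default.xml.

import org.apache.hadoop.conf.Configuration;

// A sketch, not Hadoop source: reads one core property from a fresh Configuration.
public class CoreConfDemo {
    public static void main(String[] args) {
        // The Configuration static block shown above has already registered
        // core-default.xml and core-site.xml as default resources.
        Configuration conf = new Configuration();

        // core-default.xml ships fs.default.name as file:///; if core-site.xml
        // overrides it (for example with hdfs://master:9000), the site value
        // is returned here because it was added later.
        System.out.println("fs.default.name = " + conf.get("fs.default.name"));
    }
}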
 
Load core-*.xml and mapred-*.xml files
 
After start-dfs.sh, start-all.sh executes start-mapred.sh, which is similar to start-dfs.sh.

The code is as follows:

start-mapred.sh

# Start Hadoop map reduce daemons. Run this on master node.

bin=`dirname "$0"`
bin=`cd "$bin"; pwd`

if [ -e "$bin/../libexec/hadoop-config.sh" ]; then
  . "$bin"/../libexec/hadoop-config.sh
else
  . "$bin/hadoop-config.sh"
fi

# start mapred daemons
# start jobtracker first to minimize connection errors at startup
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start jobtracker
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start tasktracker



This script also executes hadoop-config.sh, and therefore hadoop-env.sh as well, consistent with start-dfs.sh. The last two lines start the jobtracker and tasktracker processes, which correspond to the two classes org.apache.hadoop.mapred.JobTracker and org.apache.hadoop.mapred.TaskTracker.
 
Take org.apache.hadoop.mapred.JobTracker as an example; org.apache.hadoop.mapred.TaskTracker is similar.
org.apache.hadoop.mapred.JobTracker

The code is as follows:

public class JobTracker implements MRConstants, InterTrackerProtocol,
    JobSubmissionProtocol, TaskTrackerManager, RefreshUserMappingsProtocol,
    RefreshAuthorizationPolicyProtocol, AdminOperationsProtocol,
    JobTrackerMXBean {

  static {
    Configuration.addDefaultResource("mapred-default.xml");
    Configuration.addDefaultResource("mapred-site.xml");
  }
  ...
}




OK, with the explanation above, this should now be clear: the core-*.xml and mapred-*.xml files are loaded at jobtracker startup, and loading core-*.xml is again handled by the Configuration class.
 
Summary:

When all of the Hadoop processes are started with start-all.sh, the configuration files are loaded in the following order:

HDFS: hadoop-env.sh --> core-default.xml --> core-site.xml --> hdfs-default.xml --> hdfs-site.xml
MapReduce: hadoop-env.sh --> core-default.xml --> core-site.xml --> mapred-default.xml --> mapred-site.xml

Note that the *-default.xml system files are always loaded first, and core-*.xml is loaded by all five Hadoop processes, which makes core-*.xml a common base shared by all of them.
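A quick way to check this shared base is to print a Configuration: its toString() lists the resources in the order they were added (again a sketch of my own under the Hadoop 1.x assumption; the exact output format can differ between versions).

import org.apache.hadoop.conf.Configuration;

// A sketch, not Hadoop source: lists which resources a plain client-side
// Configuration has loaded.
public class ListResourcesDemo {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Expected to print something like:
        //   Configuration: core-default.xml, core-site.xml
        // hdfs-*.xml and mapred-*.xml only show up after the HDFS or MapReduce
        // classes that register them (NameNode, JobTracker, etc.) have loaded.
        System.out.println(conf);
    }
}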

The configuration files are loaded at process startup, which also means that if you modify a Hadoop configuration file, whether a system file or an administrator-defined one, you must restart the relevant processes for the change to take effect.
