Hadoop configuration file loading sequence

After using Hadoop for a while, I went back and read the source code. Seen from the source, things take on a different flavor, and you can confirm that it really does work the way you assumed.

Before using Hadoop, we need to configure a few files: hadoop-env.sh, core-site.xml, hdfs-site.xml, and mapred-site.xml. When does Hadoop actually read these files?

Generally, Hadoop is started with start-all.sh. So what does this script do?

start-all.sh
# Start all hadoop daemons.  Run this on master node.

bin=`dirname "$0"`
bin=`cd "$bin"; pwd`
# bin=$HADOOP_HOME/bin

if [ -e "$bin/../libexec/hadoop-config.sh" ]; then
  . "$bin"/../libexec/hadoop-config.sh
else
  . "$bin/hadoop-config.sh"
fi

# start dfs daemons
"$bin"/start-dfs.sh --config $HADOOP_CONF_DIR

# start mapred daemons
"$bin"/start-mapred.sh --config $HADOOP_CONF_DIR

 

Load hadoop-env.sh

The script first locates Hadoop's bin directory (if the HADOOP_HOME environment variable is configured, $HADOOP_HOME/bin can be used directly). The next step is to execute hadoop-config.sh, which may live under the $HADOOP_HOME/libexec directory or under the $HADOOP_HOME/bin directory; in my Hadoop version it is under $HADOOP_HOME/libexec. hadoop-config.sh contains the following lines:
if [ -f "${HADOOP_CONF_DIR}/hadoop-env.sh" ]; then
  . "${HADOOP_CONF_DIR}/hadoop-env.sh"
fi

After the test confirms that $HADOOP_HOME/conf/hadoop-env.sh is a regular file, it is sourced via . "${HADOOP_CONF_DIR}/hadoop-env.sh". At this point the JAVA_HOME we configured in hadoop-env.sh takes effect. In fact, I feel this setting could be left out entirely. Why? Because we must install Java before installing Hadoop on Linux, JAVA_HOME is usually configured at that point, and environment variables set in /etc/profile take effect in any shell process.
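As a quick check of that claim, here is a tiny standalone snippet of my own (not part of Hadoop): it prints what a JVM process actually sees. If JAVA_HOME is exported in /etc/profile, it shows up here even without touching hadoop-env.sh.

// JavaHomeCheck.java - illustrative standalone check, not part of Hadoop itself
public class JavaHomeCheck {
    public static void main(String[] args) {
        // The environment variable inherited from the shell (/etc/profile, hadoop-env.sh, ...)
        System.out.println("JAVA_HOME env var : " + System.getenv("JAVA_HOME"));
        // The JDK/JRE this JVM was actually launched from
        System.out.println("java.home property: " + System.getProperty("java.home"));
    }
}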

Load the core-*.xml and hdfs-*.xml files

After hadoop-config.sh has run, start-all.sh executes $HADOOP_HOME/bin/start-dfs.sh. This script starts the three HDFS-related processes: namenode, datanode, and secondarynamenode.

start-dfs.sh
# Run this on master node.

usage="Usage: start-dfs.sh [-upgrade|-rollback]"

bin=`dirname "$0"`
bin=`cd "$bin"; pwd`

if [ -e "$bin/../libexec/hadoop-config.sh" ]; then
  . "$bin"/../libexec/hadoop-config.sh
else
  . "$bin/hadoop-config.sh"
fi

# get arguments
if [ $# -ge 1 ]; then
  nameStartOpt=$1
  shift
  case $nameStartOpt in
    (-upgrade)
      ;;
    (-rollback)
      dataStartOpt=$nameStartOpt
      ;;
    (*)
      echo $usage
      exit 1
      ;;
  esac
fi

# start dfs daemons
# start namenode after datanodes, to minimize time namenode is up w/o data
# note: datanodes will log connection errors until namenode starts
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode $nameStartOpt
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start datanode $dataStartOpt
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR --hosts masters start secondarynamenode

A closer look shows that start-dfs.sh also executes hadoop-config.sh. This is because we do not always use start-all.sh to start every Hadoop process; sometimes we only need HDFS and not MapReduce, and in that case start-dfs.sh is run on its own. The variables defined in hadoop-config.sh are needed by the file-system processes as well, so hadoop-config.sh (and with it hadoop-env.sh) must be executed again before namenode, datanode, and secondarynamenode are started. Now look at the last three lines of the script, which start namenode, datanode, and secondarynamenode. A running Hadoop cluster has five daemon processes in total, and these are three of them. Since each runs as its own process, the corresponding class must have a main method; you can verify that in the source code, but it is not the point here. The point is to see how these classes load the configuration files. Whether it is the namenode, the datanode, or the secondarynamenode, each loads core-*.xml and hdfs-*.xml at startup. The class org.apache.hadoop.hdfs.server.namenode.NameNode serves as the example; the other two, org.apache.hadoop.hdfs.server.datanode.DataNode and org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode, are similar.

org.apache.hadoop.hdfs.server.namenode.NameNode
public class NameNode implements ClientProtocol, DatanodeProtocol,
                                 NamenodeProtocol, FSConstants,
                                 RefreshAuthorizationPolicyProtocol,
                                 RefreshUserMappingsProtocol {
  static {
    Configuration.addDefaultResource("hdfs-default.xml");
    Configuration.addDefaultResource("hdfs-site.xml");
  }
  ...
}

Look at the contents of the static block: seeing hdfs-default.xml and hdfs-site.xml here is exactly what we were hoping for. The key point is that a static block runs when the JVM initializes the class (class initialization, not object construction). And before Configuration.addDefaultResource("hdfs-default.xml") can execute, the Configuration class itself has to be loaded and initialized, so let us look at what the static block in org.apache.hadoop.conf.Configuration does.

org.apache.hadoop.conf.Configuration
static {
  // print deprecation warning if hadoop-site.xml is found in classpath
  ClassLoader cL = Thread.currentThread().getContextClassLoader();
  if (cL == null) {
    cL = Configuration.class.getClassLoader();
  }
  if (cL.getResource("hadoop-site.xml") != null) {
    LOG.warn("DEPRECATED: hadoop-site.xml found in the classpath. " +
             "Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, " +
             "mapred-site.xml and hdfs-site.xml to override properties of " +
             "core-default.xml, mapred-default.xml and hdfs-default.xml " +
             "respectively");
  }
  addDefaultResource("core-default.xml");
  addDefaultResource("core-site.xml");
}

So the Configuration class registers core-default.xml and core-site.xml when it is initialized. As a result, core-*.xml and hdfs-*.xml are all loaded when the namenode starts, with the core-*.xml part handled by the Configuration class itself.
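Why core-*.xml always ends up ahead of hdfs-*.xml follows from the JVM's class-initialization order: the first reference to Configuration inside NameNode's static block triggers Configuration's own static block before NameNode's continues. The following standalone sketch (my own simplified stand-ins, no Hadoop dependency) reproduces that ordering:

import java.util.ArrayList;
import java.util.List;

// Stand-in for org.apache.hadoop.conf.Configuration
class Conf {
    static final List<String> RESOURCES = new ArrayList<String>();
    static {
        // runs when the Conf class is initialized by the JVM
        RESOURCES.add("core-default.xml");
        RESOURCES.add("core-site.xml");
    }
    static void addDefaultResource(String name) {
        RESOURCES.add(name);
    }
}

// Stand-in for org.apache.hadoop.hdfs.server.namenode.NameNode
class FakeNameNode {
    static {
        // The first reference to Conf forces Conf's static block to run
        // *before* these two calls are executed.
        Conf.addDefaultResource("hdfs-default.xml");
        Conf.addDefaultResource("hdfs-site.xml");
    }
}

public class StaticOrderDemo {
    public static void main(String[] args) {
        new FakeNameNode();  // initializes FakeNameNode, which initializes Conf first
        // Prints: [core-default.xml, core-site.xml, hdfs-default.xml, hdfs-site.xml]
        System.out.println(Conf.RESOURCES);
    }
}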

Load the core-*.xml and mapred-*.xml files

After start-dfs.sh has finished, start-all.sh executes start-mapred.sh, which works much like start-dfs.sh.
 
start-mapred.sh
# Start hadoop map reduce daemons.  Run this on master node.

bin=`dirname "$0"`
bin=`cd "$bin"; pwd`

if [ -e "$bin/../libexec/hadoop-config.sh" ]; then
  . "$bin"/../libexec/hadoop-config.sh
else
  . "$bin/hadoop-config.sh"
fi

# start mapred daemons
# start jobtracker first to minimize connection errors at startup
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start jobtracker
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start tasktracker

This script also executes hadoop-config.sh and, through it, hadoop-env.sh, just like start-dfs.sh. The last two lines start the jobtracker and tasktracker processes, which correspond to the classes org.apache.hadoop.mapred.JobTracker and org.apache.hadoop.mapred.TaskTracker.

Take org.apache.hadoop.mapred.JobTracker as the example; org.apache.hadoop.mapred.TaskTracker is similar.
public class JobTracker implements MRConstants, InterTrackerProtocol,
                                   JobSubmissionProtocol, TaskTrackerManager,
                                   RefreshUserMappingsProtocol,
                                   RefreshAuthorizationPolicyProtocol,
                                   AdminOperationsProtocol, JobTrackerMXBean {
  static {
    Configuration.addDefaultResource("mapred-default.xml");
    Configuration.addDefaultResource("mapred-site.xml");
  }
  ...
}
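By the same pattern, mapred-default.xml and mapred-site.xml are only registered once a MapReduce class such as JobTracker is initialized; a process that only touches HDFS never loads them. Here is a hedged sketch of my own that makes this visible (it assumes a Hadoop 1.x jar and $HADOOP_CONF_DIR are on the classpath; Configuration.toString() lists the registered resources):

import org.apache.hadoop.conf.Configuration;

public class MapredResourceDemo {
    public static void main(String[] args) throws Exception {
        // At this point only the Configuration class has been initialized.
        System.out.println(new Configuration());  // lists core-default.xml, core-site.xml

        // Force JobTracker's class initialization; its static block registers
        // mapred-default.xml and mapred-site.xml as default resources.
        Class.forName("org.apache.hadoop.mapred.JobTracker");

        // New Configuration objects now list the mapred files as well.
        System.out.println(new Configuration());
    }
}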

 

OK, with the above walkthrough the picture is clear: JobTracker loads core-*.xml and mapred-*.xml at startup, and the core-*.xml part is again handled by Configuration. To summarize, when all Hadoop processes are started with start-all.sh, the configuration files are loaded in the following order:

HDFS: hadoop-env.sh --> core-default.xml --> core-site.xml --> hdfs-default.xml --> hdfs-site.xml
MapReduce: hadoop-env.sh --> core-default.xml --> core-site.xml --> mapred-default.xml --> mapred-site.xml

Note that the core-*.xml files are always loaded first, by every one of the five Hadoop daemons, which shows that core-*.xml is the common base configuration shared by all of them. The configuration files are read when a process starts, which also means that after you modify a Hadoop configuration file, whether a default (system) file or a site (administrator) file, the affected processes must be restarted for the change to take effect.
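One consequence of loading at process start is worth illustrating: once a Configuration has read its resources, later edits to the XML files on disk are not picked up by the running process. A hedged sketch (again assuming the Hadoop 1.x jar and $HADOOP_CONF_DIR on the classpath) of the behaviour that makes the restart necessary:

import org.apache.hadoop.conf.Configuration;

public class ReloadDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The first get() parses core-default.xml / core-site.xml and caches the properties.
        System.out.println("before edit : " + conf.get("hadoop.tmp.dir"));

        // Edit core-site.xml on disk during this pause; the running process
        // keeps returning the cached value.
        Thread.sleep(30000);
        System.out.println("after edit  : " + conf.get("hadoop.tmp.dir"));

        // Only an explicit reload (or, for the daemons, a restart) rereads the files.
        conf.reloadConfiguration();
        System.out.println("after reload: " + conf.get("hadoop.tmp.dir"));
    }
}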
