Hadoop configuration file load order

Source: Internet
Author: User
Tags: deprecated

After being away from Hadoop for a while, I came back to look at the source code and found it no longer felt familiar. Reviewing the old to learn the new really is true.

  We need to configure a few files before using Hadoop: hadoop-env.sh, core-site.xml, hdfs-site.xml, and mapred-site.xml. So when does Hadoop actually use these files?

  Most of the time Hadoop is started with start-all.sh, so what does this script do?

start-all.sh
# Start all Hadoop daemons.  Run this on master node.
# Note: all of Hadoop's daemons are started from the master node.

bin=`dirname "$0"`
bin=`cd "$bin"; pwd`          # bin = $HADOOP_HOME/bin

if [ -e "$bin/../libexec/hadoop-config.sh" ]; then
  . "$bin"/../libexec/hadoop-config.sh
else
  . "$bin/hadoop-config.sh"
fi

# start dfs daemons
"$bin"/start-dfs.sh --config $HADOOP_CONF_DIR

# start mapred daemons
"$bin"/start-mapred.sh --config $HADOOP_CONF_DIR

Load hadoop-env.sh   The script first locates Hadoop's bin directory; if the HADOOP_HOME environment variable is configured, this can be written directly as $HADOOP_HOME/bin. The next step is to execute hadoop-config.sh, which may live in either the $HADOOP_HOME/libexec directory or the $HADOOP_HOME/bin directory; in the Hadoop version I am using it is in $HADOOP_HOME/libexec. The hadoop-config.sh file contains these few lines:
hadoop-config.sh
if " ${hadoop_conf_dir}/hadoop-env.sh "  Then  "${hadoop_conf_dir}/hadoop-env.sh"fi

After testing that $HADOOP_HOME/conf/hadoop-env.sh is a regular file, the script sources it with . "${HADOOP_CONF_DIR}/hadoop-env.sh", which executes hadoop-env.sh. This is the point at which the JAVA_HOME we configured in hadoop-env.sh takes effect. Frankly, I feel this setting could be left out entirely. Why? Installing Hadoop on Linux means Java is already installed and JAVA_HOME has already been configured, and an environment variable set in /etc/profile takes effect in any shell process.
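To illustrate that last point, here is a minimal, hypothetical Java sketch (EnvCheck is not part of Hadoop): a JVM launched from a shell that exported JAVA_HOME, for example via /etc/profile, sees the variable in its process environment.

// Hypothetical sketch, not Hadoop code: child processes inherit exported
// environment variables, so a JVM can read JAVA_HOME set in /etc/profile.
public class EnvCheck {
    public static void main(String[] args) {
        System.out.println("JAVA_HOME = " + System.getenv("JAVA_HOME"));
    }
}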

  Load the core-*.xml and hdfs-*.xml files   After hadoop-config.sh has been executed, $HADOOP_HOME/bin/start-dfs.sh is executed. The purpose of this script is to start the three HDFS-related processes: namenode, datanode, and secondarynamenode.
start-dfs.sh
# Run this on master node.

usage="Usage: start-dfs.sh [-upgrade|-rollback]"

bin=`dirname "$0"`
bin=`cd "$bin"; pwd`

if [ -e "$bin/../libexec/hadoop-config.sh" ]; then
  . "$bin"/../libexec/hadoop-config.sh
else
  . "$bin/hadoop-config.sh"
fi

# get arguments
if [ $# -ge 1 ]; then
  nameStartOpt=$1
  shift
  case $nameStartOpt in
    (-upgrade)
      ;;
    (-rollback)
      dataStartOpt=$nameStartOpt
      ;;
    (*)
      echo $usage
      exit 1
      ;;
  esac
fi

# start dfs daemons
# start namenode after datanodes, to minimize time namenode is up w/o data
# note: datanodes will log connection errors until namenode starts
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode $nameStartOpt
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start datanode $dataStartOpt
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR --hosts masters start secondarynamenode

Look closely and you will notice that hadoop-config.sh is executed in start-dfs.sh as well. That is because we do not always use start-all.sh to launch every Hadoop process; sometimes we only need HDFS and not MapReduce, in which case we run start-dfs.sh directly. The variables defined in hadoop-config.sh are also needed by the file-system-related processes, so hadoop-config.sh must be executed (and with it hadoop-env.sh) before namenode, datanode, and secondarynamenode are started. Now look at the last three lines of the script: they are the commands that start namenode, datanode, and secondarynamenode. Starting Hadoop brings up five processes in total, three of which are namenode, datanode, and secondarynamenode. Since these processes can be started, the corresponding classes must have a main method, which the source code confirms, but that is not the point. The point is how each class loads its configuration files. Whether it is NameNode, DataNode, or SecondaryNameNode, each loads the core-*.xml and hdfs-*.xml files at startup. Take org.apache.hadoop.hdfs.server.namenode.NameNode as the example; the other two classes, org.apache.hadoop.hdfs.server.datanode.DataNode and org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode, are similar.

org.apache.hadoop.hdfs.server.namenode.NameNode
public class NameNode implements ClientProtocol, DatanodeProtocol,
                                 NamenodeProtocol, FSConstants,
                                 RefreshAuthorizationPolicyProtocol,
                                 RefreshUserMappingsProtocol {
  static {
    Configuration.addDefaultResource("hdfs-default.xml");
    Configuration.addDefaultResource("hdfs-site.xml");
  }
  ...
}

Look at the contents of the static block and things get interesting: there are hdfs-default.xml and hdfs-site.xml. The key point is that a static block executes when the class is loaded into the JVM, during class initialization, not during object initialization. And before the line Configuration.addDefaultResource("hdfs-default.xml") can execute, the Configuration class itself must be loaded into the JVM.
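To make that class-loading timing concrete, here is a minimal standalone sketch (plain Java with made-up class names, nothing Hadoop-specific): the static block runs exactly once, the first time the class is used, before any constructor.

// Hypothetical demo of static-initializer timing, not Hadoop code.
class ConfigHolder {
    static { System.out.println("ConfigHolder loaded: static block runs once"); }
    ConfigHolder() { System.out.println("ConfigHolder instance created"); }
}

public class StaticInitDemo {
    public static void main(String[] args) {
        System.out.println("before first use of ConfigHolder");
        new ConfigHolder();  // class is loaded here: static block prints first, then the constructor
        new ConfigHolder();  // static block does not run again
    }
}

With that timing in mind, look at what the static block in the org.apache.hadoop.conf.Configuration class does.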

org.apache.hadoop.conf.Configuration
static {
  // print deprecation warning if hadoop-site.xml is found in classpath
  ClassLoader cl = Thread.currentThread().getContextClassLoader();
  if (cl == null) {
    cl = Configuration.class.getClassLoader();
  }
  if (cl.getResource("hadoop-site.xml") != null) {
    LOG.warn("DEPRECATED: hadoop-site.xml found in the classpath. " +
        "Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, " +
        "mapred-site.xml and hdfs-site.xml to override properties of " +
        "core-default.xml, mapred-default.xml and hdfs-default.xml " +
        "respectively");
  }
  addDefaultResource("core-default.xml");
  addDefaultResource("core-site.xml");
}

So the Configuration class loads core-default.xml and core-site.xml when the class is initialized. This is how NameNode loads both core-*.xml and hdfs-*.xml at startup: the core-*.xml pair is loaded by the Configuration class, and the hdfs-*.xml pair by NameNode's own static block.
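As a rough illustration (a sketch that assumes a Hadoop 1.x-style classpath with the default resources on it, not code taken from Hadoop itself), once the hdfs resources have been registered the way NameNode's static block registers them, any new Configuration() sees all four files, with each *-site.xml overriding its *-default.xml:

// Hypothetical sketch mimicking what NameNode's static block achieves.
// Assumes core-default.xml / hdfs-default.xml (and optionally the site files)
// are on the classpath, as they are inside a Hadoop installation.
import org.apache.hadoop.conf.Configuration;

public class LoadOrderDemo {
    public static void main(String[] args) {
        // Configuration's own static block has already registered
        // core-default.xml and core-site.xml; add the hdfs pair like NameNode does.
        Configuration.addDefaultResource("hdfs-default.xml");
        Configuration.addDefaultResource("hdfs-site.xml");

        Configuration conf = new Configuration();
        // dfs.replication is defined in hdfs-default.xml; if hdfs-site.xml also
        // defines it, the later-loaded site value wins.
        System.out.println("dfs.replication = " + conf.get("dfs.replication"));
        // fs.default.name comes from core-default.xml unless core-site.xml overrides it.
        System.out.println("fs.default.name = " + conf.get("fs.default.name"));
    }
}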

  Load the core-*.xml and mapred-*.xml files   After start-dfs.sh has been executed, start-mapred.sh is executed; this script works much the same way as start-dfs.sh.
 
start-mapred.sh
# Start Hadoop map reduce daemons.  Run this on master node.

bin=`dirname "$0"`
bin=`cd "$bin"; pwd`

if [ -e "$bin/../libexec/hadoop-config.sh" ]; then
  . "$bin"/../libexec/hadoop-config.sh
else
  . "$bin/hadoop-config.sh"
fi

# start mapred daemons
# start jobtracker first to minimize connection errors at startup
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start jobtracker
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start tasktracker

This script also executes hadoop-config.sh, and hadoop-env.sh is executed along with it, just as in start-dfs.sh. The last two lines of code start the jobtracker and tasktracker processes, which correspond to the classes org.apache.hadoop.mapred.JobTracker and org.apache.hadoop.mapred.TaskTracker.

Take org.apache.hadoop.mapred.JobTracker as the example; org.apache.hadoop.mapred.TaskTracker is similar.
org.apache.hadoop.mapred.JobTracker
public class JobTracker implements ... {
  static {
    Configuration.addDefaultResource("mapred-default.xml");
    Configuration.addDefaultResource("mapred-site.xml");
  }
  ...
}

OK, with the above explanation, it should now be clear. JobTracker loads core-*.xml and mapred-*.xml at startup, and again the core-*.xml pair is loaded by the Configuration class.

To summarize: when start-all.sh is used to start all of Hadoop's processes, the configuration files are loaded in this order:

    HDFS:      hadoop-env.sh -> core-default.xml -> core-site.xml -> hdfs-default.xml -> hdfs-site.xml
    MapReduce: hadoop-env.sh -> core-default.xml -> core-site.xml -> mapred-default.xml -> mapred-site.xml

Note that the core-*.xml system files are always loaded first, and all five Hadoop processes load them; in other words, core-*.xml is a common base shared by all of them. Configuration files are loaded at process startup, which also means that if you modify a Hadoop configuration file, whether a system (default) file or an administrator (site) file, the process must be restarted for the change to take effect.
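As a quick check of this order on the MapReduce side (again only a hedged sketch mirroring JobTracker's static block, not Hadoop source), printing a Configuration lists the resources it has loaded, and they should appear in the order summarized above:

// Hypothetical sketch mirroring JobTracker's static block; assumes the
// mapred-default.xml / mapred-site.xml resources are on the classpath.
import org.apache.hadoop.conf.Configuration;

public class MapredLoadOrderCheck {
    public static void main(String[] args) {
        Configuration.addDefaultResource("mapred-default.xml");
        Configuration.addDefaultResource("mapred-site.xml");

        Configuration conf = new Configuration();
        // Configuration.toString() lists the loaded resources, which should read
        // roughly: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml
        System.out.println(conf);
        // mapred.job.tracker is defined in mapred-default.xml and is typically
        // overridden in mapred-site.xml.
        System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker"));
    }
}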
