Before using Hadoop we need to configure several files: hadoop-env.sh, core-site.xml, hdfs-site.xml, and mapred-site.xml. So when does Hadoop actually read these files?
Hadoop is typically started with start-all.sh, so what does this script do?
The code is as follows:

# Start all hadoop daemons.  Run this on master node.
# Note: all of Hadoop's daemons are started from the master node.

bin=`dirname "$0"`
bin=`cd "$bin"; pwd`    # bin = $HADOOP_HOME/bin

if [ -e "$bin/../libexec/hadoop-config.sh" ]; then
  . "$bin"/../libexec/hadoop-config.sh
else
  . "$bin/hadoop-config.sh"
fi

# start dfs daemons
"$bin"/start-dfs.sh --config $HADOOP_CONF_DIR

# start mapred daemons
"$bin"/start-mapred.sh --config $HADOOP_CONF_DIR
Load hadoop-env.sh
The script first determines Hadoop's bin directory (since the Hadoop environment variables are already configured, it could just as well use $HADOOP_HOME/bin directly). The next step is to source hadoop-config.sh, which may live in either the $HADOOP_HOME/libexec directory or the $HADOOP_HOME/bin directory; in the Hadoop version I use it is in $HADOOP_HOME/libexec. Among other things, hadoop-config.sh contains these few lines:
hadoop-config.sh
The code is as follows:

if [ -f "${HADOOP_CONF_DIR}/hadoop-env.sh" ]; then
  . "${HADOOP_CONF_DIR}/hadoop-env.sh"
fi
The test checks whether ${HADOOP_CONF_DIR}/hadoop-env.sh exists as a regular file; if it does, . "${HADOOP_CONF_DIR}/hadoop-env.sh" sources the script. At this point the environment variables we configured in hadoop-env.sh, such as JAVA_HOME, take effect. Frankly, I feel this setting is not strictly necessary. Why? Because we install Hadoop on Linux, where Java must already be installed, so the JAVA_HOME configured in /etc/profile already takes effect in every shell process.
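As a quick sanity check of that claim, here is a minimal stand-alone snippet (a hypothetical helper, not part of the Hadoop source tree) that prints the JAVA_HOME an ordinary JVM process sees; if it is exported in /etc/profile, it is already visible without any help from hadoop-env.sh:

// EnvCheck.java - hypothetical helper, not part of Hadoop.
public class EnvCheck {
    public static void main(String[] args) {
        // JAVA_HOME as inherited from the shell environment (/etc/profile or hadoop-env.sh).
        System.out.println("JAVA_HOME env var : " + System.getenv("JAVA_HOME"));
        // The home directory of the JVM that is actually running this code.
        System.out.println("java.home property: " + System.getProperty("java.home"));
    }
}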
Load core-*.xml and hdfs-*.xml files
After hadoop-config.sh has run, start-all.sh executes $HADOOP_HOME/bin/start-dfs.sh. The purpose of this script is to start the three HDFS-related processes: namenode, datanode, and secondarynamenode.
start-dfs.sh
The code is as follows:

# Run this on master node.

usage="Usage: start-dfs.sh [-upgrade|-rollback]"

bin=`dirname "$0"`
bin=`cd "$bin"; pwd`

if [ -e "$bin/../libexec/hadoop-config.sh" ]; then
  . "$bin"/../libexec/hadoop-config.sh
else
  . "$bin/hadoop-config.sh"
fi

# get arguments
if [ $# -ge 1 ]; then
  nameStartOpt=$1
  shift
  case $nameStartOpt in
    (-upgrade)
      ;;
    (-rollback)
      dataStartOpt=$nameStartOpt
      ;;
    (*)
      echo $usage
      exit 1
      ;;
  esac
fi

# start dfs daemons
# start namenode after datanodes, to minimize time namenode is up w/o data
# note: datanodes will log connection errors until namenode starts
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode $nameStartOpt
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start datanode $dataStartOpt
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR --hosts masters start secondarynamenode
Take a closer look and you will notice that hadoop-config.sh is executed inside start-dfs.sh as well. The reason is that we do not always use start-all.sh to start all of Hadoop's processes; sometimes we only need HDFS and not MapReduce, and then we run start-dfs.sh on its own. The variables defined in hadoop-config.sh are needed by the file-system-related processes too, so hadoop-config.sh (and, through it, hadoop-env.sh) must be executed before namenode, datanode, and secondarynamenode are started. Now look at the last three lines of the script: they are what actually start namenode, datanode, and secondarynamenode. When Hadoop is fully started there are five processes, and these are three of them. The point here is to see how the corresponding classes load the configuration files: whether it is NameNode, DataNode, or SecondaryNameNode, they all load the core-*.xml and hdfs-*.xml files at startup. We will take org.apache.hadoop.hdfs.server.namenode.NameNode as the example; the other two classes, org.apache.hadoop.hdfs.server.datanode.DataNode and org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode, are similar. (Incidentally, since each of these classes can be started as a process, each must have a main method; you can verify this in the source code, or with the quick reflective check sketched below.)
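For reference, here is the reflective check mentioned above. It is a hypothetical stand-alone sketch that assumes the Hadoop 1.x jars are on the classpath; Class.forName is called with initialize=false so the static blocks discussed next are not triggered:

// MainMethodCheck.java - hypothetical sketch; assumes the Hadoop 1.x jars are on the classpath.
import java.lang.reflect.Method;

public class MainMethodCheck {
    public static void main(String[] args) throws Exception {
        String[] daemons = {
            "org.apache.hadoop.hdfs.server.namenode.NameNode",
            "org.apache.hadoop.hdfs.server.datanode.DataNode",
            "org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode"
        };
        for (String name : daemons) {
            // initialize = false: inspect the class without running its static block.
            Class<?> clazz = Class.forName(name, false, MainMethodCheck.class.getClassLoader());
            Method main = clazz.getMethod("main", String[].class);
            System.out.println(name + " -> " + main);   // each daemon class exposes a main(String[]) entry point
        }
    }
}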
org.apache.hadoop.hdfs.server.namenode.NameNode
The code is as follows:

public class NameNode implements ClientProtocol, DatanodeProtocol,
                                 NamenodeProtocol, FSConstants,
                                 RefreshAuthorizationPolicyProtocol,
                                 RefreshUserMappingsProtocol {
  static {
    Configuration.addDefaultResource("hdfs-default.xml");
    Configuration.addDefaultResource("hdfs-site.xml");
  }
  ...
}
Look at the contents of the static block: there they are, hdfs-default.xml and hdfs-site.xml, which is exactly what we were hoping to see. A static block executes when the class is loaded into the JVM and initialized, not when an object is instantiated. Before Configuration.addDefaultResource("hdfs-default.xml") can run, the Configuration class itself must first be loaded into the JVM, so let's look at the static block of org.apache.hadoop.conf.Configuration.
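To make the timing of static blocks concrete, here is a minimal stand-alone sketch (hypothetical names, independent of Hadoop) showing that a static block runs once, when the class is first initialized, while a constructor runs on every object creation:

// StaticInitDemo.java - hypothetical example, independent of Hadoop.
public class StaticInitDemo {
    static {
        // Runs exactly once, when the JVM loads and initializes this class.
        System.out.println("static block: class initialized");
    }

    public StaticInitDemo() {
        // Runs every time an instance is created.
        System.out.println("constructor: object created");
    }

    public static void main(String[] args) {
        // The static block has already run before the first line of main.
        System.out.println("main starts");
        new StaticInitDemo();   // prints the constructor message
        new StaticInitDemo();   // constructor again; the static block is not repeated
    }
}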
org.apache.hadoop.conf.Configuration
The code is as follows:

static {
  // print deprecation warning if hadoop-site.xml is found in classpath
  ClassLoader cl = Thread.currentThread().getContextClassLoader();
  if (cl == null) {
    cl = Configuration.class.getClassLoader();
  }
  if (cl.getResource("hadoop-site.xml") != null) {
    LOG.warn("DEPRECATED: hadoop-site.xml found in the classpath. " +
        "Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, " +
        "mapred-site.xml and hdfs-site.xml to override properties of " +
        "core-default.xml, mapred-default.xml and hdfs-default.xml " +
        "respectively");
  }
  addDefaultResource("core-default.xml");
  addDefaultResource("core-site.xml");
}
The Configuration class loads the two files core-default.xml and core-site.xml when it is initialized. So when NameNode starts, it ends up loading core-*.xml and hdfs-*.xml, with the core-*.xml part handled by the Configuration class.
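If you want to confirm which resources a Configuration actually picked up, a small test program helps. This is a sketch under the assumption that the Hadoop jars and the conf directory (so core-site.xml can be found) are on the classpath; Configuration's toString() lists the loaded resources, and get() shows the effective value of a property:

// ConfResourceCheck.java - hypothetical verification sketch; assumes the Hadoop jars
// and $HADOOP_CONF_DIR are on the classpath so core-site.xml can be located.
import org.apache.hadoop.conf.Configuration;

public class ConfResourceCheck {
    public static void main(String[] args) {
        // Creating a Configuration triggers the static block shown above,
        // so core-default.xml and core-site.xml are registered as default resources.
        Configuration conf = new Configuration();
        System.out.println(conf);                          // lists the resources that were loaded
        System.out.println(conf.get("fs.default.name"));   // effective value, typically set in core-site.xml
    }
}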
Load core-*.xml and mapred-*.xml files
After start-dfs.sh, start-all.sh executes start-mapred.sh, which is similar to start-dfs.sh.
start-mapred.sh

The code is as follows:

# Start hadoop map reduce daemons.  Run this on master node.

bin=`dirname "$0"`
bin=`cd "$bin"; pwd`

if [ -e "$bin/../libexec/hadoop-config.sh" ]; then
  . "$bin"/../libexec/hadoop-config.sh
else
  . "$bin/hadoop-config.sh"
fi

# start mapred daemons
# start jobtracker first to minimize connection errors at startup
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start jobtracker
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start tasktracker
This script also executes hadoop-config.sh, and therefore hadoop-env.sh as well, consistent with start-dfs.sh. Its last two lines start the jobtracker and tasktracker processes, which correspond to the two classes org.apache.hadoop.mapred.JobTracker and org.apache.hadoop.mapred.TaskTracker.
Take org.apache.hadoop.mapred.JobTracker as an example; org.apache.hadoop.mapred.TaskTracker is similar.
org.apache.hadoop.mapred.JobTracker
The code is as follows:

public class JobTracker implements MRConstants, InterTrackerProtocol,
    JobSubmissionProtocol, TaskTrackerManager, RefreshUserMappingsProtocol,
    RefreshAuthorizationPolicyProtocol, AdminOperationsProtocol,
    JobTrackerMXBean {
  static {
    Configuration.addDefaultResource("mapred-default.xml");
    Configuration.addDefaultResource("mapred-site.xml");
  }
  ...
}
OK, with the explanation above this is now clear: JobTracker loads the core-*.xml and mapred-*.xml files at startup, with the core-*.xml part again handled by the Configuration class.
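Note that Configuration.addDefaultResource() is a static method: once JobTracker's static block has called it, the registered files become default resources for Configuration instances created in that JVM. The following hypothetical sketch mimics that behaviour, assuming mapred-default.xml and mapred-site.xml can be found on the classpath:

// DefaultResourceDemo.java - hypothetical sketch; assumes the Hadoop jars and the
// mapred-*.xml files are on the classpath.
import org.apache.hadoop.conf.Configuration;

public class DefaultResourceDemo {
    public static void main(String[] args) {
        // Mimic the JobTracker static block: register mapred-*.xml as default resources.
        Configuration.addDefaultResource("mapred-default.xml");
        Configuration.addDefaultResource("mapred-site.xml");

        // A Configuration created afterwards loads those files as well.
        Configuration conf = new Configuration();
        System.out.println(conf);                              // now also lists mapred-*.xml
        System.out.println(conf.get("mapred.job.tracker"));    // value from mapred-site.xml, if set there
    }
}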
To summarize:
When you start all of the Hadoop processes with start-all.sh, the configuration files are loaded in the following order:
HDFS: hadoop-env.sh --> core-default.xml --> core-site.xml --> hdfs-default.xml --> hdfs-site.xml
MapReduce: hadoop-env.sh --> core-default.xml --> core-site.xml --> mapred-default.xml --> mapred-site.xml

Note one point: the *-default.xml system files are always loaded before the corresponding *-site.xml files, and core-*.xml is loaded by all five Hadoop processes, which means core-*.xml is the common base shared by all of them. Also, the configuration files are loaded at process startup, which confirms that if you modify Hadoop's configuration, whether a system file (*-default.xml) or an administrator file (*-site.xml), you need to restart the processes for the change to take effect.
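As a final illustration of the "later resource overrides earlier resource" rule that this ordering relies on, here is a hypothetical sketch using two made-up XML files instead of the real Hadoop configuration files; the resource added last wins, just as *-site.xml overrides *-default.xml:

// OverrideOrderDemo.java - hypothetical sketch of the "last resource wins" rule.
// my-default.xml and my-site.xml are made-up files that both define the key "some.key".
import org.apache.hadoop.conf.Configuration;

public class OverrideOrderDemo {
    public static void main(String[] args) {
        Configuration conf = new Configuration(false);   // start empty, skip Hadoop's default resources
        conf.addResource("my-default.xml");              // suppose it sets some.key = default-value
        conf.addResource("my-site.xml");                 // suppose it sets some.key = site-value
        // The resource added later overrides the earlier one:
        System.out.println(conf.get("some.key"));        // prints "site-value" if both files are on the classpath
    }
}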