Hadoop Configuration Analysis

Source: Internet
Author: User

To learn about the Hadoop Common module, it is best to start from the simplest and most basic parts, so I chose the conf configuration module. Its overall class structure is very simple.


As long as a class implements the Configurable interface, it generally indicates that it can carry a Configuration and perform the corresponding configuration operations; the configuration logic itself, however, is concentrated in the Configuration class. This class defines many member variables:

/** List of configuration resources. */
private ArrayList<Object> resources = new ArrayList<Object>();

/**
 * List of configuration parameters marked final.
 * The finalParameters set retains the immutable parameters marked with final.
 */
private Set<String> finalParameters = new HashSet<String>();

/** Whether to load the default resources. */
private boolean loadDefaults = true;

/** Registry of Configuration objects. */
private static final WeakHashMap<Configuration, Object> REGISTRY =
    new WeakHashMap<Configuration, Object>();

/**
 * List of default resources. Resources are loaded in the order of the list
 * entries.
 */
private static final CopyOnWriteArrayList<String> defaultResources =
    new CopyOnWriteArrayList<String>();

The above is only part of the field list; these fields mainly hold resource data. One more variable is critical:
         

// Attributes from the resource configuration files are loaded into this field
private Properties properties;

All attribute values are stored in a java.util.Properties object for convenient direct access. Properties is actually a Hashtable. We will walk through the whole process in the order in which a Configuration is loaded. First, the static initialization block is executed:
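As a quick standalone illustration (plain JDK only, no Hadoop classes; PropsDemo is a made-up name), this is the Properties behavior Configuration relies on:

```java
import java.util.Properties;

public class PropsDemo {
    // Properties extends Hashtable<Object,Object>; setProperty/getProperty are
    // the String-typed convenience methods that Configuration uses internally.
    public static String demo() {
        Properties props = new Properties();
        props.setProperty("dfs.name.dir", "/var/local/hadoop/hdfs/name");
        return props.getProperty("dfs.name.dir");
    }
}
```

Note that getProperty returns null for an unknown key instead of throwing, which is why Configuration's get path must handle null values.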

static {
  // print deprecation warning if hadoop-site.xml is found in classpath
  ClassLoader cL = Thread.currentThread().getContextClassLoader();
  if (cL == null) {
    cL = Configuration.class.getClassLoader();
  }
  if (cL.getResource("hadoop-site.xml") != null) {
    LOG.warn("DEPRECATED: hadoop-site.xml found in the classpath. " +
        "Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, " +
        "mapred-site.xml and hdfs-site.xml to override properties of " +
        "core-default.xml, mapred-default.xml and hdfs-default.xml " +
        "respectively");
  }
  // load the default resources during initialization; core-site.xml holds the
  // user's property definitions. If the same property appears in both files,
  // the latter's value overrides the former's.
  addDefaultResource("core-default.xml");
  addDefaultResource("core-site.xml");
}
If you know the initialization order of Java classes, you will know that static initialization blocks run before any constructor. So after the block above runs, control reaches addDefaultResource():

/**
 * Add a default resource. Resources are loaded in the order of the resources
 * added.
 * @param name file name. File should be present in the classpath.
 */
public static synchronized void addDefaultResource(String name) {
  if (!defaultResources.contains(name)) {
    defaultResources.add(name);
    // traverse the registered Configuration objects and reload their resources
    for (Configuration conf : REGISTRY.keySet()) {
      if (conf.loadDefaults) {
        conf.reloadConfiguration();
      }
    }
  }
}
The resource name is added to the default-resource list, and every registered Configuration is told to reload, because the default resource list has changed. Note that every Configuration instance is added to the REGISTRY collection during construction; REGISTRY is a static variable, so it is shared globally. Now move the focus to reloadConfiguration():
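This registry-and-reload idea can be stripped down to a small sketch (MiniConf is a hypothetical class, not Hadoop code; the real reloadConfiguration() clears cached state, which a boolean flag stands in for here):

```java
import java.util.List;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

public class MiniConf {
    // Weak keys: an instance that becomes unreachable elsewhere can still be GC'd.
    private static final Map<MiniConf, Object> REGISTRY = new WeakHashMap<>();
    private static final List<String> DEFAULT_RESOURCES = new CopyOnWriteArrayList<>();

    public boolean reloaded = false;

    public MiniConf() {
        synchronized (MiniConf.class) {
            REGISTRY.put(this, null);   // every instance registers itself
        }
    }

    public void reloadConfiguration() {
        reloaded = true;                // stand-in for "clear cached properties"
    }

    public static synchronized void addDefaultResource(String name) {
        if (!DEFAULT_RESOURCES.contains(name)) {
            DEFAULT_RESOURCES.add(name);
            // the shared default list changed, so every live instance is invalidated
            for (MiniConf c : REGISTRY.keySet()) {
                c.reloadConfiguration();
            }
        }
    }
}
```

The WeakHashMap keeps the registry from pinning Configuration objects in memory: once user code drops its last reference to an instance, the registry entry can be garbage-collected.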

/**
 * Reload configuration from previously added resources.
 *
 * This method will clear all the configuration read from the added
 * resources, and final parameters. This will make the resources to
 * be read again before accessing the values. Values that are added
 * via set methods will overlay values read from the resources.
 */
public synchronized void reloadConfiguration() {
  // reloading simply clears the cached properties, which triggers a re-read later
  properties = null;
  finalParameters.clear();  // clear site-limits
}
The operation is very simple: it only clears state. You may wonder why the new resources are not loaded immediately. This is in fact a deliberate design choice by the authors; the answer comes later. At this point the static initialization is complete, and the constructor executes next:

/** A new configuration. */
public Configuration() {
  // by default, the default resources are loaded
  this(true);
}
It then calls the overloaded constructor:

/**
 * A new configuration where the behavior of reading from the default
 * resources can be turned off.
 *
 * If the parameter {@code loadDefaults} is false, the new instance
 * will not load resources from the default files.
 * @param loadDefaults specifies whether to load from the default files
 */
public Configuration(boolean loadDefaults) {
  this.loadDefaults = loadDefaults;
  if (LOG.isDebugEnabled()) {
    LOG.debug(StringUtils.stringifyException(new IOException("config()")));
  }
  synchronized (Configuration.class) {
    // the newly created Configuration object is added to the REGISTRY
    REGISTRY.put(this, null);
  }
  this.storeResource = false;
}
The key point is that the newly initialized Configuration is added to the global REGISTRY.

The code analyzed above is only the preliminary setup. To understand how the key set/get methods that operate on attributes are implemented, you must first understand the format of a Hadoop configuration file, for example the HDFS configuration file hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/var/local/hadoop/hdfs/name</value>
    <description>Determines where on the local filesystem the DFS name node
      should store the name table.  If this is a comma-delimited list
      of directories then the name table is replicated in all of the
      directories, for redundancy.</description>
    <final>true</final>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/var/local/hadoop/hdfs/data</value>
    <description>Determines where on the local filesystem an DFS data node
      should store its blocks.  If this is a comma-delimited
      list of directories, then data will be stored in all named
      directories, typically on different devices.
      Directories that do not exist are ignored.</description>
    <final>true</final>
  </property>
  .......
</configuration>
The node hierarchy is not complex. The key point is that each property node holds the name, the value, and a description of the attribute. The final tag determines whether the attribute can be overridden: if it is true, it cannot be changed, similar to the final keyword in Java. With the configuration file structure understood, we can continue. For example, to set an attribute, the small set method is:

/**
 * Set the value of the name property.
 * The property key-value pair is saved in the Properties object.
 *
 * @param name property name.
 * @param value property value.
 */
public void set(String name, String value) {
  getOverlay().setProperty(name, value);
  getProps().setProperty(name, value);
}
setProperty here is the standard JDK Properties API, so the key is the getProps() method and how the attributes in the file are loaded into the Properties variable:
/**
 * A lazy loading policy is used here.
 */
private synchronized Properties getProps() {
  if (properties == null) {
    properties = new Properties();
    // fetch the property data from the resources again
    loadResources(properties, resources, quietmode);
    if (overlay != null) {
      properties.putAll(overlay);
      if (storeResource) {
        for (Map.Entry<Object, Object> item : overlay.entrySet()) {
          updatingResource.put((String) item.getKey(), "Unknown");
        }
      }
    }
  }
  return properties;
}
After seeing the null check above, you can understand why the reload operation was just a simple clear: this is the lazy loading policy, and the actual loading happens only when the properties are really needed. It is similar to lazy initialization in the singleton pattern. The real work therefore happens in loadResource(), shown below.
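As an aside, the lazy-loading idiom used by getProps() can be sketched in isolation (LazyConf is a hypothetical class; the counter stands in for the actual loadResources() call):

```java
import java.util.Properties;

public class LazyConf {
    private Properties properties;   // null means "not loaded yet"
    private int loadCount = 0;

    // reloadConfiguration() only discards the cache; nothing is parsed here
    public synchronized void reloadConfiguration() {
        properties = null;
    }

    // the real loading work happens on first access after a (re)load
    public synchronized Properties getProps() {
        if (properties == null) {
            properties = new Properties();
            loadCount++;                              // stand-in for loadResources(...)
            properties.setProperty("demo.key", "demo.value");
        }
        return properties;
    }

    public int getLoadCount() {
        return loadCount;
    }
}
```

Repeated access costs nothing extra, and several back-to-back reloads collapse into a single re-read on the next access, which is exactly why clearing is cheap.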

private void loadResource(Properties properties, Object name, boolean quiet) {
  try {
    // obtain a DOM parser from the factory for parsing the xml file
    DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
    // ignore all comments inside the xml file
    docBuilderFactory.setIgnoringComments(true);
    // allow namespaces in the xml file
    docBuilderFactory.setNamespaceAware(true);
    try {
      docBuilderFactory.setXIncludeAware(true);
    } catch (UnsupportedOperationException e) {
      LOG.error("Failed to set setXIncludeAware(true) for parser "
          + docBuilderFactory + ":" + e, e);
    }
    DocumentBuilder builder = docBuilderFactory.newDocumentBuilder();
    .....
    if (root == null) {
      // the root node of the document is obtained first
      root = doc.getDocumentElement();
    }
    if (!"configuration".equals(root.getTagName()))
      LOG.fatal("bad conf file: top-level element not <configuration>");
    NodeList props = root.getChildNodes();
    for (int i = 0; i < props.getLength(); i++) {
      Node propNode = props.item(i);
      if (!(propNode instanceof Element))
        continue;
      Element prop = (Element) propNode;
      if ("configuration".equals(prop.getTagName())) {
        // if the child node is a nested configuration, recurse into loadResource()
        loadResource(properties, prop, quiet);
        continue;
      }
      if (!"property".equals(prop.getTagName()))
        LOG.warn("bad conf file: element not <property>");
      NodeList fields = prop.getChildNodes();
      String attr = null;
      String value = null;
      boolean finalParameter = false;
      for (int j = 0; j < fields.getLength(); j++) {
        Node fieldNode = fields.item(j);
        if (!(fieldNode instanceof Element))
          continue;
        // a property's child nodes come in three kinds: name, value, final
        Element field = (Element) fieldNode;
        if ("name".equals(field.getTagName()) && field.hasChildNodes())
          attr = ((Text) field.getFirstChild()).getData().trim();
        if ("value".equals(field.getTagName()) && field.hasChildNodes())
          value = ((Text) field.getFirstChild()).getData();
        if ("final".equals(field.getTagName()) && field.hasChildNodes())
          // a final parameter must be recorded in the finalParameters set below
          finalParameter = "true".equals(((Text) field.getFirstChild()).getData());
      }
      // Ignore this parameter if it has already been marked as 'final'
      if (attr != null) {
        if (value != null) {
          if (!finalParameters.contains(attr)) {
            // this is the step that puts the value into properties
            properties.setProperty(attr, value);
            if (storeResource) {
              updatingResource.put(attr, name.toString());
            }
          } else if (!value.equals(properties.getProperty(attr))) {
            LOG.warn(name + ": an attempt to override final parameter: "
                + attr + "; Ignoring.");
          }
        }
        if (finalParameter) {
          finalParameters.add(attr);
        }
      }
    }
Compared with the actual configuration file shown above, this is not hard to understand: it is a simple DOM parse of the xml file, with some extra processing, for example the final parameters require additional bookkeeping. After loading, the attributes are in the Properties object and the goal is achieved.
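The final-parameter rule implemented in loadResource() can be condensed into a small sketch (FinalParamDemo is a hypothetical class, not Hadoop code):

```java
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;

public class FinalParamDemo {
    private final Properties props = new Properties();
    private final Set<String> finalParams = new HashSet<>();

    // Mirrors loadResource(): a key already marked final is never overwritten,
    // and a key loaded with <final>true</final> is locked for all later loads.
    public void load(String name, String value, boolean isFinal) {
        if (!finalParams.contains(name)) {
            props.setProperty(name, value);
        }
        if (isFinal) {
            finalParams.add(name);
        }
    }

    public String get(String name) {
        return props.getProperty(name);
    }
}
```

In other words, the first resource to declare a property final wins, and any later resource attempting to override it is ignored.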

Now for the get operation. It also has a special design: it is not simply getProps().getProperty(name), because sometimes such a lookup cannot return the truly desired value. Consider the following structure:

<property>
  <name>dfs.secondary.namenode.kerberos.principal</name>
  <value>hdfs/_HOST@${local.realm}</value>
  <description>Kerberos principal name for the secondary NameNode.</description>
</property>
You might fetch the value of dfs.secondary.namenode.kerberos.principal directly and get hdfs/_HOST@${local.realm}, but that is obviously not the value we need, because ${local.realm} is a placeholder referring to another configured value, and often to a system property. This tells us that the value lookup must substitute such variables.

/**
 * Get the value of the name property, null if
 * no such property exists.
 *
 * Values are processed for variable expansion
 * before being returned.
 *
 * @param name the property name.
 * @return the value of the name property,
 *         or null if no such property exists.
 */
public String get(String name) {
  return substituteVars(getProps().getProperty(name));
}
Therefore, Hadoop performs a variable substitution step after retrieving the raw value, using a regular expression:

// The pattern to match is \$\{[^\}\$\u0020]+\}. Many backslashes are needed
// because $, { and } are reserved characters in regular expressions.
// '\$\{' matches the leading "${" and the final '\}' matches the closing "}",
// giving the ${....} target structure; in the middle, [^\}\$\u0020] matches
// any character except '}', '$' and space, and the '+' quantifier requires
// at least one such character between the braces.
private static Pattern varPat = Pattern.compile("\\$\\{[^\\}\\$\u0020]+\\}");
private static int MAX_SUBST = 20;

private String substituteVars(String expr) {
  // if the input value is null, return null directly
  if (expr == null) {
    return null;
  }
  Matcher match = varPat.matcher("");
  String eval = expr;
  // avoid an endless substitution loop: at most MAX_SUBST (20) replacements
  for (int s = 0; s < MAX_SUBST; s++) {
    ...
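To see the pattern and the bounded substitution loop work end to end, here is a self-contained sketch (VarExpandDemo is hypothetical; it resolves variables from a plain Map rather than from the Configuration's own properties and system properties, as the real substituteVars() does):

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VarExpandDemo {
    // Same shape as Hadoop's pattern: "${", then one or more characters that
    // are not '}', '$' or space (\u0020), then "}".
    static final Pattern VAR_PAT = Pattern.compile("\\$\\{[^\\}\\$\u0020]+\\}");
    static final int MAX_SUBST = 20;

    public static String substitute(String expr, Map<String, String> vars) {
        if (expr == null) {
            return null;
        }
        String eval = expr;
        // cap the rounds so mutually referencing values cannot loop forever
        for (int s = 0; s < MAX_SUBST; s++) {
            Matcher match = VAR_PAT.matcher(eval);
            if (!match.find()) {
                return eval;                           // nothing left to expand
            }
            String var = match.group();                // e.g. "${local.realm}"
            String key = var.substring(2, var.length() - 1);
            String val = vars.get(key);
            if (val == null) {
                return eval;                           // unknown variable: leave as-is
            }
            eval = eval.substring(0, match.start()) + val + eval.substring(match.end());
        }
        return eval;
    }
}
```

With {"local.realm" -> "EXAMPLE.COM"}, substituting "hdfs/_HOST@${local.realm}" yields "hdfs/_HOST@EXAMPLE.COM"; because each round may expose a new ${...}, the MAX_SUBST cap is what guarantees termination.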
The key difficulty is the construction of the pattern for ${...}; a regular expression like this is hard to come up with without searching the Internet for it. The other piece of special handling avoids an endless loop when a replacement itself produces another ${...}, hence the cap on the number of substitution rounds. That completes the get operation. Finally, here are the two process analysis diagrams I made for Configuration under different circumstances:
           


The implementation of the Configuration code is short and refined; principles like these are worth borrowing when developing large systems.
