Hadoop Configuration class analysis
Configuration is a common class shared by Hadoop's five components, so it lives in core as org.apache.hadoop.conf.Configuration. This class holds the configuration information of a job. Any configuration a job needs must be passed through Configuration, because Configuration is how information is shared among multiple mapper tasks and multiple reducer tasks.
Class Diagram
Note: Configuration implements the Iterable and Writable interfaces. Iterable lets callers iterate over all name-value pairs that the Configuration object has loaded into memory. Writable provides the serialization the Hadoop framework requires, so that the in-memory name-value pairs can be serialized to disk. The implementations of these two interfaces are straightforward and are not covered here.

Next, we analyze in detail how Configuration works: how configuration files are loaded, how configuration values are read and set, and what to watch out for in use. The study of any class starts with its constructors; even a singleton or a static factory cannot produce objects without one. Configuration has three constructors.
public Configuration() {
  this(true);
}

/** A new configuration where the behavior of reading from the default
 * resources can be turned off.
 *
 * If the parameter {@code loadDefaults} is false, the new instance
 * will not load resources from the default files.
 * @param loadDefaults specifies whether to load from the default files
 */
public Configuration(boolean loadDefaults) {
  this.loadDefaults = loadDefaults;
  updatingResource = new HashMap<String, String>();
  synchronized (Configuration.class) {
    REGISTRY.put(this, null);
  }
}

/**
 * A new configuration with the same settings cloned from another.
 *
 * @param other the configuration from which to clone settings.
 */
@SuppressWarnings("unchecked")
public Configuration(Configuration other) {
  this.resources = (ArrayList) other.resources.clone();
  synchronized (other) {
    if (other.properties != null) {
      this.properties = (Properties) other.properties.clone();
    }
    if (other.overlay != null) {
      this.overlay = (Properties) other.overlay.clone();
    }
    this.updatingResource = new HashMap<String, String>(other.updatingResource);
  }
  this.finalParameters = new HashSet<String>(other.finalParameters);
  synchronized (Configuration.class) {
    REGISTRY.put(this, null);
  }
}
1. Configuration()
2. Configuration(boolean loadDefaults)
3. Configuration(Configuration other)

The first two constructors use the typical telescoping constructor pattern: the no-argument constructor produces a Configuration object that loads the default configuration files, while the parameter of Configuration(boolean loadDefaults) controls whether the constructed object loads the default configuration files. If I were designing it, I would avoid this and use two static factory methods whose names identify the kind of object they return, for example getConfigurationWithDefault() and getConfiguration(), which would be simpler for developers to use.

When loadDefaults is false, the Configuration object does not load into memory the configuration files registered through addDefaultResource(String resource). Configuration files added through addResource(...) are still loaded. How is this implemented? The constructor stores the flag with this.loadDefaults = loadDefaults. After the Configuration object is constructed, values are read by calling a getType(String name, Type defaultValue) method, where Type stands for int, long, boolean, and so on. Take getInt(String name, int defaultValue) as an example:
public int getInt(String name, int defaultValue) {
  String valueString = get(name);
  if (valueString == null)
    return defaultValue;
  try {
    String hexString = getHexDigits(valueString);
    if (hexString != null) {
      return Integer.parseInt(hexString, 16);
    }
    return Integer.parseInt(valueString);
  } catch (NumberFormatException e) {
    return defaultValue;
  }
}
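The fallback behavior of getInt can be tried in isolation. The following is a minimal sketch, not Hadoop's code: IntLookupSketch and hexDigits are hypothetical names, and the hex helper reflects an assumption about what getHexDigits does (strip a 0x or 0X prefix and keep a leading minus sign).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Configuration; only the integer lookup path is modeled.
class IntLookupSketch {
    private final Map<String, String> values = new HashMap<>();

    void set(String name, String value) {
        values.put(name, value);
    }

    // Assumed helper behavior: return the digits after a "0x"/"0X" prefix
    // (keeping a leading minus sign), or null if the value is not hex.
    static String hexDigits(String value) {
        boolean negative = value.startsWith("-");
        String body = negative ? value.substring(1) : value;
        if (body.startsWith("0x") || body.startsWith("0X")) {
            String digits = body.substring(2);
            return negative ? "-" + digits : digits;
        }
        return null;
    }

    // Same shape as the getInt shown above: a missing or malformed value
    // silently falls back to the caller's default.
    int getInt(String name, int defaultValue) {
        String valueString = values.get(name);
        if (valueString == null) {
            return defaultValue;
        }
        try {
            String hex = hexDigits(valueString);
            if (hex != null) {
                return Integer.parseInt(hex, 16);
            }
            return Integer.parseInt(valueString);
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }
}
```

One consequence of this design is that a typo in a numeric setting is never reported: the caller simply receives the default value.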
The first line of getInt, String valueString = get(name);, is the key. get(String name) looks the name up in the Properties object returned by getProps(), so getProps() is the method to examine:
private synchronized Properties getProps() {
  if (properties == null) {
    properties = new Properties();
    loadResources(properties, resources, quietmode);
    if (overlay != null) {
      properties.putAll(overlay);
      for (Map.Entry<Object,Object> item : overlay.entrySet()) {
        updatingResource.put((String) item.getKey(), UNKNOWN_RESOURCE);
      }
    }
  }
  return properties;
}
A few points are worth noting. Every getType call follows the path getInt --> get --> getProps, but getProps deserves special attention. The first time a getType method is called, properties is null, so loadResources(properties, resources, quietmode) is executed. If properties is not null, the loading code is skipped. The loadResources(properties, resources, quietmode) method is shown next:
private void loadResources(Properties properties, ArrayList resources, boolean quiet) {
  if (loadDefaults) {
    for (String resource : defaultResources) {
      loadResource(properties, resource, quiet);
    }

    // support the hadoop-site.xml as a deprecated case
    if (getResource("hadoop-site.xml") != null) {
      loadResource(properties, "hadoop-site.xml", quiet);
    }
  }

  for (Object resource : resources) {
    loadResource(properties, resource, quiet);
  }
}
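The loading order in loadResources can be modeled with plain maps. This is a minimal sketch under stated assumptions: each resource is an in-memory map rather than an XML file, LoadOrderSketch and its members are hypothetical names, and the handling of final parameters is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;

// Hypothetical model: each "resource" is an in-memory map instead of an XML file.
class LoadOrderSketch {
    // Shared by all instances, like the static list of default resources.
    static final List<Map<String, String>> DEFAULT_RESOURCES = new ArrayList<>();

    private final boolean loadDefaults;
    private final List<Map<String, String>> resources = new ArrayList<>();

    LoadOrderSketch(boolean loadDefaults) {
        this.loadDefaults = loadDefaults;
    }

    void addResource(Map<String, String> resource) {
        resources.add(resource);
    }

    // Defaults are applied first (only if requested), then the instance's own
    // resources, so a later resource overwrites an earlier one for the same name.
    Properties load() {
        Properties props = new Properties();
        if (loadDefaults) {
            for (Map<String, String> r : DEFAULT_RESOURCES) {
                props.putAll(r);
            }
        }
        for (Map<String, String> r : resources) {
            props.putAll(r);
        }
        return props;
    }
}
```

An instance built with loadDefaults set to false sees only what was added through addResource, while an instance built with true sees the defaults with its own resources layered on top.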
Here loadDefaults, the flag set in the constructor to control default configuration file loading, finally appears. The resources in defaultResources are loaded only when loadDefaults is true, whereas the configuration files stored in resources are always loaded. There are two containers that store configuration files, defaultResources and resources:
/**
 * List of configuration resources.
 */
private ArrayList<Object> resources = new ArrayList<Object>();

/**
 * List of default Resources. Resources are loaded in the order of the list
 * entries
 */
private static final CopyOnWriteArrayList<String> defaultResources =
  new CopyOnWriteArrayList<String>();
One is the list of configuration resources and the other is the list of default resources. How can we tell which resources are default resources? The Configuration class has several methods for loading configuration files: addDefaultResource(String name), addResource(String resource) and its overloads, and addResourceObject(Object resource). Because the addResource(...) family is implemented by calling addResourceObject, the difference comes down to addDefaultResource(String name) versus addResourceObject(Object resource).

addDefaultResource(String name)
public static synchronized void addDefaultResource(String name) {
  if (!defaultResources.contains(name)) {
    defaultResources.add(name);
    for (Configuration conf : REGISTRY.keySet()) {
      if (conf.loadDefaults) {
        conf.reloadConfiguration();
      }
    }
  }
}
addResourceObject(Object resource)
private synchronized void addResourceObject(Object resource) {
  resources.add(resource);                      // add to resources
  reloadConfiguration();
}
The difference is now clear. addDefaultResource(String name) adds the configuration file name to the defaultResources container with defaultResources.add(name), while addResourceObject(Object resource) adds the configuration file to the resources container with resources.add(resource). So default configuration files are those registered through addDefaultResource(String name) and stored in defaultResources; configuration files stored in resources are not treated as default configuration files. Both implementations call reloadConfiguration(), which is worth a closer look.

reloadConfiguration()
/**
 * Reload configuration from previously added resources.
 *
 * This method will clear all the configuration read from the added
 * resources, and final parameters. This will make the resources to
 * be read again before accessing the values. Values that are added
 * via set methods will overlay values read from the resources.
 */
public synchronized void reloadConfiguration() {
  properties = null;                            // trigger reload
  finalParameters.clear();                      // clear site-limits
}
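The cost of this invalidation is easier to see with a counter. The following is a minimal sketch of the lazy-load-and-invalidate cycle, assuming a hypothetical class LazyReloadSketch whose loadResources stands in for parsing every configuration file again.

```java
import java.util.Properties;

// Hypothetical model of the lazy-load-and-invalidate cycle.
class LazyReloadSketch {
    private Properties properties;   // null means "not loaded yet"
    private int loadCount = 0;       // how many times the resources were parsed

    // Stand-in for parsing every configuration file again.
    private Properties loadResources() {
        loadCount++;
        Properties p = new Properties();
        p.setProperty("example.key", "from-file");
        return p;
    }

    synchronized String get(String name) {
        if (properties == null) {
            properties = loadResources();
        }
        return properties.getProperty(name);
    }

    // Mirrors reloadConfiguration(): drop the cache, load again on next access.
    synchronized void reloadConfiguration() {
        properties = null;
    }

    synchronized int getLoadCount() {
        return loadCount;
    }
}
```

Repeated get calls reuse the cached object, and a full reload happens only on the first access after reloadConfiguration().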
reloadConfiguration() sets properties = null and calls finalParameters.clear(), which clears the name-value pairs held in memory. The next call to a getType method must therefore reload the configuration files into memory. For this reason, avoid calling addDefaultResource(String resource) and addResourceObject(Object resource) while a job is running, because they cause the configuration files to be reloaded.

The finalParameters field needs some explanation. It is a Set that records the parameters marked final. A name-value pair marked final cannot be overwritten by a configuration file loaded later, yet the program can still change it with set(String name, String value). It is odd that an administrator cannot override such a value in a later configuration file while a user can override it in code.

As for the third constructor, its parameter and implementation make it clear that it produces a configuration object with the same settings as the one passed in.

This clarifies how loading of the configuration files is controlled when a Configuration object is constructed, and how getType works. The following sequence diagram shows the path from the constructor to a getType call.

Now consider the setType methods. setType(String name, Type value) calls set(String name, String value) internally, the same relationship that getType(String name, Type defaultValue) has with get(String name). This raises a question. As described above, addDefaultResource(...) and addResourceObject(...) clear the name-value pairs in memory, and the pairs from the configuration files can be reloaded, so those are not lost. The values set through setType(), however, are not written to any configuration file; they exist only in memory.
public void set(String name, String value) {
  getOverlay().setProperty(name, value);
  getProps().setProperty(name, value);
  this.updatingResource.put(name, UNKNOWN_RESOURCE);
}
getProps() returns the Properties object that stores all the name-value pairs. A pair set through set(String name, String value) is placed only in the memory of the properties object and is not written to a file. So when addDefaultResource(...) or addResourceObject(...) sets properties to null, are the pairs set through set(String name, String value) lost? The key is the call getOverlay().setProperty(name, value) in set(String name, String value). getOverlay() returns the overlay field, which is also of type Properties. So set(String name, String value) adds the pair to two Properties objects: every pair set this way is held in both the overlay field and the properties field. Now look back at the getProps() method:
private synchronized Properties getProps() {
  if (properties == null) {
    properties = new Properties();
    loadResources(properties, resources, quietmode);
    if (overlay != null) {
      properties.putAll(overlay);
      for (Map.Entry<Object,Object> item : overlay.entrySet()) {
        updatingResource.put((String) item.getKey(), UNKNOWN_RESOURCE);
      }
    }
  }
  return properties;
}
When properties is null, getProps() not only loads the name-value pairs from the configuration files but also checks whether overlay is null; if it is not, the pairs in overlay are copied into properties. This does not conflict with reloadConfiguration(), because reloadConfiguration() sets only properties to null and leaves overlay alone. The purpose of overlay is to keep an in-memory backup of the pairs set by the user. The pairs configured by the system and the administrator are backed by the configuration files, while the pairs set later by the user are backed by overlay, so properties loses no information during the lifetime of the Configuration object.

Both setType and getType methods can trigger loadResources() to load name-value pairs into the properties object. Once properties holds the pairs from the configuration files, however, further setType or getType calls do not trigger loadResources() again unless addDefaultResource(...) or addResourceObject(...) is called.

Summary:
1. Do not use addDefaultResource(...) or addResourceObject(...) to load resources while a job is running, because this causes the properties object to be rebuilt. Use setType(...) instead.
2. Configuration is used throughout MapReduce. The JobTracker and TaskTracker processes use a Configuration object at startup, and HDFS uses Configuration objects as well, so it is important to understand how Configuration works.
3. Configuration can be used to share information between MapReduce tasks. The shared information is configured in the job. Once a map or reduce task of the job has started, its configuration object is completely independent, so shared information has to be set in the job.
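The interplay between set, overlay, and reloadConfiguration() analyzed earlier can be reproduced in a few lines. This is a minimal sketch, not Hadoop's implementation: OverlaySketch is a hypothetical class, and its fileValues map plays the role of the parsed configuration files.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical model: "fileValues" plays the role of the parsed XML resources.
class OverlaySketch {
    private final Map<String, String> fileValues = new HashMap<>();
    private Properties properties;                       // cache, may be dropped
    private final Properties overlay = new Properties(); // values from set()

    OverlaySketch() {
        fileValues.put("example.from.file", "file-value");
    }

    private synchronized Properties getProps() {
        if (properties == null) {
            properties = new Properties();
            properties.putAll(fileValues);   // reload from the "files"
            properties.putAll(overlay);      // then reapply programmatic values
        }
        return properties;
    }

    // Writes to both objects, like the set method shown earlier.
    synchronized void set(String name, String value) {
        overlay.setProperty(name, value);
        getProps().setProperty(name, value);
    }

    synchronized String get(String name) {
        return getProps().getProperty(name);
    }

    synchronized void reloadConfiguration() {
        properties = null;   // overlay is deliberately left untouched
    }
}
```

After reloadConfiguration(), values that came from the files are read again, and values written with set are restored from overlay, including any that replaced a file value.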