Hadoop source code analysis: the Configuration class

Recently I decided to take a closer look at the Hadoop source code; before this I had only understood the basic architecture and how to use it. I am currently working on a system, and I think a lot can be learned from the scalability of MapReduce. However, when version 0.1 of our system came out we found that our configuration handling was a mess, so I studied Hadoop's Configuration class. I really think Hadoop's configuration is worth learning from, and I learned a lot. The following is a list of the properties of the Configuration class:

- LOG: the log object.
- quietMode: whether loading should be silent. If true (the default), some messages produced while configuration resources are being loaded are not written to the log.
- resources: a list of objects that identify the resources (names, paths, URLs, and so on) holding configuration information.
- finalParameters: the set of properties whose values are declared final.
- loadDefaults: whether the default resources should be loaded.
- REGISTRY: a WeakHashMap used to keep track of multiple Configuration objects. Weak hashing automatically clears entries whose keys are no longer in ordinary use.
- defaultResources: a CopyOnWriteArrayList of strings storing the names or paths of the default configuration resources.
- The static initialization block { ... } registers the default configuration resources.
- properties: stores all the configuration information held by the Configuration object. Its type is Properties, the key/value property set provided by Java, which simplifies storing and manipulating key/value configuration parameters.
- overlay: the overriding properties, holding values that take precedence over those loaded from resources.
- classLoader: the class loader that provides the context when object instances are constructed from the configured parameters.
- varPat: a regular expression object used to expand values that contain variable references. For example, if a path variable is set to ${home}/data, the value is split into ${home} and /data according to certain rules, and ${home} is resolved to the home directory on the system.
- MAX_SUBST: the maximum depth to which values containing variable references are expanded; references nested deeper than this are not resolved.
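For orientation, here is a simplified sketch of how these fields might be declared. It loosely follows the 0.20-era Configuration source, but the names, generics, and the variable pattern are approximations rather than a verbatim copy of Hadoop's code:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;
import java.util.WeakHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.regex.Pattern;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// Simplified sketch only: the real class also implements Writable and
// Iterable<Map.Entry<String, String>>, which are omitted here.
public class Configuration {

  private static final Log LOG = LogFactory.getLog(Configuration.class);

  private boolean quietmode = true;        // suppress some logging while resources are loaded
  private ArrayList<Object> resources = new ArrayList<Object>(); // String, Path, URL or InputStream resources
  private Set<String> finalParameters = new HashSet<String>();   // keys declared <final>true</final>
  private boolean loadDefaults = true;     // whether to load the default resources

  // every live Configuration instance, so newly registered default resources
  // can be pushed to all of them; WeakHashMap lets unused instances be GC'd
  private static final WeakHashMap<Configuration, Object> REGISTRY =
      new WeakHashMap<Configuration, Object>();

  // names of the default resources, e.g. "core-default.xml", "core-site.xml"
  private static final CopyOnWriteArrayList<String> defaultResources =
      new CopyOnWriteArrayList<String>();

  private Properties properties;           // the merged key/value view of all resources
  private Properties overlay;              // values set programmatically, overriding loaded ones
  private ClassLoader classLoader = Thread.currentThread().getContextClassLoader();

  private static final Pattern varPat = Pattern.compile("\\$\\{[^\\}\\$ ]+\\}"); // matches ${var}
  private static final int MAX_SUBST = 20; // maximum nesting depth when expanding ${var}
}
```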

The following describes each method. By access control, the methods of the Configuration class fall into two categories, private and public; the private methods mainly serve as helpers for the public ones.

Three constructor methods:

The first is the no-argument constructor, which loads the default resources. The second takes a flag indicating whether to load the default settings; the default is true. The third uses an existing Configuration object to construct a new Configuration object.
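A minimal usage sketch of the three constructors (the property key queried at the end is only for illustration):

```java
import org.apache.hadoop.conf.Configuration;

public class ConfigurationConstructors {
  public static void main(String[] args) {
    // 1) no-arg constructor: loads the default resources
    //    (e.g. core-default.xml and core-site.xml in this era of Hadoop)
    Configuration withDefaults = new Configuration();

    // 2) boolean constructor: skip the default resources and start empty
    Configuration empty = new Configuration(false);

    // 3) copy constructor: clone the settings of an existing Configuration
    Configuration copy = new Configuration(withDefaults);

    System.out.println("fs.default.name = " + withDefaults.get("fs.default.name"));
  }
}
```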

The following describes the methods for adding configuration resources:

These methods add default or specified configuration resources from various sources, while reloadConfiguration() clears all of the configuration information that has already been loaded so that it can be reloaded. This can be used to override a value, or to override a previous configuration resource with a new one.
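A small sketch of how resources are typically added and reloaded; the file names my-app-site.xml and override-site.xml and the key my.app.setting are made-up examples, not part of Hadoop:

```java
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class AddResourceExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Add a resource by classpath name, by URL, or by filesystem Path.
    conf.addResource("my-app-site.xml");

    URL url = AddResourceExample.class.getResource("/my-app-site.xml");
    if (url != null) {
      conf.addResource(url);
    }
    conf.addResource(new Path("/etc/myapp/override-site.xml"));

    // Resources added later override earlier ones (unless a value is marked final).
    System.out.println(conf.get("my.app.setting", "default-value"));

    // reloadConfiguration() clears the loaded properties so that the
    // resource list is re-read the next time a value is requested.
    conf.reloadConfiguration();
  }
}
```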

The following describes the set and get methods and the other methods for accessing configuration information:

The substituteVars(String) method works together with the regular expression object mentioned above to expand parameter values that contain variable references. The set and get methods that follow are mainly used to read and write the various parameter values. The main mechanism is that getProps() calls loadResources(Properties, ArrayList, boolean), which in turn calls loadResource(Properties, Object, boolean) to load the configuration information from each configuration resource. In set(String, String) and get(String), getProps() is called to obtain the Properties object of the current Configuration; if that object is empty, loadResources(Properties, ArrayList, boolean) is called to load the configuration information. The remaining get and set methods all end up calling get(String) and set(String, String).

The last few methods are size(), which returns the number of configuration entries; clear(), which clears the configuration information; IntegerRanges, an inner class describing ranges of integers; and iterator(), which iterates over the configuration entries. Finally, readFields(DataInput) and write(DataOutput) exist because the Configuration class implements the Writable interface. This allows a Configuration to be distributed across the cluster so that the configuration information on all nodes of the same job is identical.
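A hedged sketch that ties these pieces together: set()/get() with variable expansion, a typed accessor, and the Writable-based serialization. The keys data.dir and my.app.retries are invented for the example:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;

public class GetSetAndSerialize {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration(false); // skip defaults for a small demo

    // set()/get() and ${var} expansion: ${user.home} is resolved from the
    // Java system properties or from other configuration values.
    conf.set("data.dir", "${user.home}/data");
    System.out.println(conf.get("data.dir"));      // e.g. /home/alice/data

    // Typed accessors are thin wrappers around get(String)/set(String, String).
    conf.setInt("my.app.retries", 3);
    System.out.println(conf.getInt("my.app.retries", 1));

    // Because Configuration implements Writable, it can be serialized and
    // shipped to every node of a job, keeping the configuration consistent.
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    conf.write(new DataOutputStream(bytes));

    Configuration restored = new Configuration(false);
    restored.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
    System.out.println(restored.get("data.dir"));
  }
}
```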

This is the core class for Hadoop's configuration information. From it we can see how a large distributed system can provide a sound configuration mechanism that keeps the system usable while satisfying the need for flexibility. I have summarized the following points:

1. Provide a variety of set and get methods for reading and writing configuration parameter values, as the large number of set and get methods here and their implementation logic show.

2. Choose data structures that handle garbage collection and thread synchronization sensibly, such as the WeakHashMap and CopyOnWriteArrayList used here, as well as the synchronization on the Configuration class.

3. The configuration of a distributed system must be serializable, so that consistent configuration information can be maintained across the cluster and the configuration can be transferred over streams.

4. The configuration of a distributed system should be divided into at least three layers. The first layer is the default, globally static configuration. The second layer consists of parameters that can be customized for each job, which can be done with the set methods of Configuration. The third layer customizes configuration parameters for a job's process group through the command line. The three layers have different scopes: the whole system, one job program, and one job's running process group, and each layer can override the parameter values of the layer before it (see the sketch after this list).
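As an illustration of the three layers, here is a hypothetical Tool built on ToolRunner. The key my.app.mode is invented, and the guard in run() is just one way to let a command-line -D option take precedence over the programmatic per-job default:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Layer 1: default/site XML resources. Layer 2: per-job conf.set() calls.
// Layer 3: -D options on the command line, applied by GenericOptionsParser
// (via ToolRunner) before run() is invoked.
public class LayeredConfigTool extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf(); // defaults plus any -D overrides

    // Layer 2: programmatic per-job value, only if the command line did not set one.
    if (conf.get("my.app.mode") == null) {
      conf.set("my.app.mode", "batch");
    }

    System.out.println("my.app.mode = " + conf.get("my.app.mode"));
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // Run e.g. with: hadoop LayeredConfigTool -D my.app.mode=interactive
    System.exit(ToolRunner.run(new Configuration(), new LayeredConfigTool(), args));
  }
}
```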

In my opinion Hadoop has too many configuration parameters, and setting them by parameter name through the set method is inconvenient, although it does provide flexibility, since users can define their own configuration parameters. An enumeration class describing the corresponding parameters could also be provided, which would be more convenient to use.
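For example, the enumeration idea might look roughly like this; the enum and its keys are hypothetical, not part of Hadoop, and only illustrate how typed keys could wrap Configuration's set()/get():

```java
import org.apache.hadoop.conf.Configuration;

// Hypothetical enum of application configuration keys with default values.
public enum MyAppConfigKey {
  RETRIES("my.app.retries", "3"),
  DATA_DIR("my.app.data.dir", "${user.home}/data");

  private final String key;
  private final String defaultValue;

  MyAppConfigKey(String key, String defaultValue) {
    this.key = key;
    this.defaultValue = defaultValue;
  }

  public String get(Configuration conf) {
    return conf.get(key, defaultValue);
  }

  public void set(Configuration conf, String value) {
    conf.set(key, value);
  }
}
```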

From http://blog.csdn.net/dahaifeiyu/article/details/6655652
