Hadoop Source Code Analysis: FileSystem


After you create a new Configuration object, the first call to Configuration.get() to fetch a configuration key-value pair checks whether the properties field of the Configuration object is null; if it is, the default configuration files on the classpath are loaded (see the Configuration article in this Hadoop source-analysis series). So once you have a Configuration object, you can use it to create a FileSystem object.

The Hadoop Abstract File System
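As a hedged analog of that lazy loading (this is not Hadoop's Configuration; the class and keys below are hypothetical), the sketch loads a default table only when the properties field is first needed:

```java
import java.util.Properties;

// Hypothetical analog of Configuration's lazy loading: the defaults are
// read only when the properties table is first accessed.
public class LazyConf {
    private Properties properties;   // null until the first get()

    private Properties props() {
        if (properties == null) {            // load defaults on demand,
            properties = new Properties();   // like loading core-default.xml
            properties.setProperty("fs.default.name", "file:///");
        }
        return properties;
    }

    public String get(String key, String defaultValue) {
        return props().getProperty(key, defaultValue);
    }

    public static void main(String[] args) {
        LazyConf conf = new LazyConf();
        System.out.println(conf.get("fs.default.name", "hdfs://fallback"));  // file:///
    }
}
```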

To provide a unified interface over different file systems, Hadoop defines an abstract file system, of which the Hadoop Distributed File System (HDFS) is just one concrete implementation. The abstract file system interface is mainly defined by the abstract class org.apache.hadoop.fs.FileSystem, whose inheritance hierarchy is shown below:

(figure: inheritance hierarchy of org.apache.hadoop.fs.FileSystem)
As the figure shows, the Hadoop release contains different FileSystem subclasses to meet different data-access requirements. Most typically, data is accessed on HDFS, but data on other file systems, such as Amazon's S3, can also be accessed. In addition, users can implement their own FileSystem subclass for a particular network storage service to suit their specific needs.

The Hadoop abstract file system gives users a unified interface for accessing different file systems, and most of that interface lives in the abstract class FileSystem. Its methods fall into two groups: one handles file- and directory-related housekeeping, the other reads and writes file data. The former covers operations such as creating files, creating directories, and deleting files or directories; the latter covers reading and writing file data. Most of these operations resemble Java's own file API: FileSystem.mkdirs(Path f, FsPermission permission) creates a directory in the file system represented by the FileSystem object, just as java.io.File.mkdirs() does; FileSystem.delete(Path f) deletes a file or directory, just as java.io.File.delete() does. The methods that subclasses must implement are declared abstract in FileSystem; they are organized as follows:

    /** Get the file system URI */
    public abstract URI getUri();

    /** Open a file and return an input stream */
    public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException;

    /** Create a file and return an output stream */
    public abstract FSDataOutputStream create(Path f, FsPermission permission,
        boolean overwrite, int bufferSize, short replication, long blockSize,
        Progressable progress) throws IOException;

    /** Append data to an already existing file */
    public abstract FSDataOutputStream append(Path f, int bufferSize,
        Progressable progress) throws IOException;

    /** Rename a file or directory */
    public abstract boolean rename(Path src, Path dst) throws IOException;

    /** Delete a file or directory */
    public abstract boolean delete(Path f, boolean recursive) throws IOException;

    /** If f is a directory, list the entries under it and their attributes;
        if f is a file, return the attributes of that file */
    public abstract FileStatus[] listStatus(Path f) throws IOException;

    /** Set the current working directory */
    public abstract void setWorkingDirectory(Path new_dir);

    /** Get the current working directory */
    public abstract Path getWorkingDirectory();

    /** Create a directory */
    public abstract boolean mkdirs(Path f, FsPermission permission) throws IOException;

    /** Get the attributes of a file or directory */
    public abstract FileStatus getFileStatus(Path f) throws IOException;
The methods above essentially cover what a concrete Hadoop file system must implement. On top of them, FileSystem also provides a number of convenience methods for callers, such as the overloads of listStatus().
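As a hedged illustration of how a concrete file system plugs into such an abstract contract (this is not Hadoop code; AbstractFs and InMemoryFs are hypothetical, heavily simplified stand-ins), the sketch below backs a few FileSystem-style operations with an in-memory set of paths:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified analog of FileSystem's abstract contract.
abstract class AbstractFs {
    abstract boolean mkdirs(String path);
    abstract boolean delete(String path);
    abstract boolean exists(String path);
}

// A toy in-memory implementation, standing in for concrete subclasses
// such as DistributedFileSystem that supply the abstract methods.
class InMemoryFs extends AbstractFs {
    private final Set<String> dirs = new HashSet<>();

    @Override boolean mkdirs(String path) { return dirs.add(path); }
    @Override boolean delete(String path) { return dirs.remove(path); }
    @Override boolean exists(String path) { return dirs.contains(path); }
}

public class FsSketch {
    public static void main(String[] args) {
        AbstractFs fs = new InMemoryFs();  // callers program to the abstract type
        System.out.println(fs.mkdirs("/tmp/data"));  // true
        System.out.println(fs.exists("/tmp/data"));  // true
        System.out.println(fs.delete("/tmp/data"));  // true
    }
}
```

The point of the design is the first line of main(): client code holds the abstract type, so swapping the concrete file system does not change the caller.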

Getting a FileSystem Object

The following uses DistributedFileSystem as the example to show how a client obtains a FileSystem object.

A client constructs a FileSystem object with the FileSystem.get() method, which has three overloads:

    /** Returns the configured filesystem implementation. */
    public static FileSystem get(Configuration conf) throws IOException {
      return get(getDefaultUri(conf), conf);
    }

    /** Returns the FileSystem for this URI's scheme and authority. The scheme
     *  of the URI determines a configuration property name,
     *  fs.<scheme>.class, whose value names the FileSystem class.
     *  The entire URI is passed to the FileSystem instance's initialize method. */
    public static FileSystem get(URI uri, Configuration conf) throws IOException {
      String scheme = uri.getScheme();        // get the URI scheme
      String authority = uri.getAuthority();  // the authority information

      // scheme and authority are both empty: return the default file system
      if (scheme == null && authority == null) {     // use default FS
        return get(conf);
      }

      // authority is null
      if (scheme != null && authority == null) {     // no authority
        URI defaultUri = getDefaultUri(conf);
        if (scheme.equals(defaultUri.getScheme())    // if scheme matches default
            && defaultUri.getAuthority() != null) {  // & default has authority
          return get(defaultUri, conf);              // return default
        }
      }

      String disableCacheName = String.format("fs.%s.impl.disable.cache", scheme);
      if (conf.getBoolean(disableCacheName, false)) { // whether to bypass the cache
        return createFileSystem(uri, conf);
      }

      return CACHE.get(uri, conf);
    }

    public static FileSystem get(final URI uri, final Configuration conf,
        final String user) throws IOException, InterruptedException {
      UserGroupInformation ugi;
      if (user == null) {
        ugi = UserGroupInformation.getCurrentUser();
      } else {
        ugi = UserGroupInformation.createRemoteUser(user);
      }
      return ugi.doAs(new PrivilegedExceptionAction<FileSystem>() {
        public FileSystem run() throws IOException {
          return get(uri, conf);
        }
      });
    }
All of them, however, end up calling the two-argument FileSystem.get() to obtain the FileSystem object.

The line of code that obtained the FileSystem object at the beginning of this Hadoop source-analysis series is:

    FileSystem in = FileSystem.get(conf);
where conf is a Configuration object. Stepping from this line into FileSystem.get(Configuration conf), you can see that the method calls getDefaultUri() to obtain the URI of the file system; the URI holds the scheme and authority of the file system, for example hdfs://localhost:9000. Where does this URI come from? It is read from the configuration files on the classpath: getDefaultUri() contains the call conf.get(FS_DEFAULT_NAME_KEY, "file:///"), and the core-site.xml on the classpath of my project has this entry:

	<property>
		<name>fs.default.name</name>
		<value>hdfs://localhost:9000</value>
	</property>
The constant FS_DEFAULT_NAME_KEY has the value fs.default.name, so conf.get(FS_DEFAULT_NAME_KEY, "file:///") returns hdfs://localhost:9000 here.
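The lookup-with-default semantics can be shown with a minimal self-contained sketch (using java.util.Properties in place of Hadoop's Configuration; the key is the one from core-site.xml above):

```java
import java.util.Properties;

public class DefaultUriSketch {
    public static void main(String[] args) {
        Properties conf = new Properties();

        // Without the core-site.xml entry, the default "file:///" is used:
        System.out.println(conf.getProperty("fs.default.name", "file:///"));
        // prints: file:///

        // With the core-site.xml entry shown above loaded:
        conf.setProperty("fs.default.name", "hdfs://localhost:9000");
        System.out.println(conf.getProperty("fs.default.name", "file:///"));
        // prints: hdfs://localhost:9000
    }
}
```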

Once the URI is constructed, execution enters FileSystem.get(final URI uri, final Configuration conf). This method first checks whether the scheme and authority of the URI are empty and, depending on the result, either recurses into the default file system or continues; the most important part it then executes is:

    String disableCacheName = String.format("fs.%s.impl.disable.cache", scheme);
    if (conf.getBoolean(disableCacheName, false)) { // whether to bypass the cache
      return createFileSystem(uri, conf);
    }

    return CACHE.get(uri, conf);
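The cache-disable switch above is just a per-scheme boolean property; a self-contained sketch of how that key is derived and tested (approximating Configuration.getBoolean with Boolean.parseBoolean over a plain Properties object):

```java
import java.util.Properties;

public class DisableCacheSketch {
    // Derive "fs.<scheme>.impl.disable.cache" and read it as a boolean,
    // defaulting to false (cache enabled) when the key is absent.
    static boolean cacheDisabled(Properties conf, String scheme) {
        String key = String.format("fs.%s.impl.disable.cache", scheme);
        return Boolean.parseBoolean(conf.getProperty(key, "false"));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(cacheDisabled(conf, "hdfs"));  // false: the cache is used
        conf.setProperty("fs.hdfs.impl.disable.cache", "true");
        System.out.println(cacheDisabled(conf, "hdfs"));  // true: bypass the cache
    }
}
```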
The constant CACHE holds the open, shareable file systems. It is an instance of the static inner class FileSystem.Cache of the FileSystem class, which internally stores the file systems in a map:

Private final Map<key, filesystem> Map = new Hashmap<key, filesystem> ();
The key of this map is of type FileSystem.Cache.Key, which has three member variables:

      /** the URI scheme */
      final String scheme;
      /** the authority portion of the URI */
      final String authority;
      /** the local user information under which the file system was opened;
          file systems opened by different local users are not shared */
      final UserGroupInformation ugi;
Since FileSystem.Cache holds shareable file systems, the key is what distinguishes different file system objects: two file system objects can be shared when the three member variables of their FileSystem.Cache.Key instances are equal. The class therefore overrides hashCode() and equals() to compare exactly these three fields. According to the book "Hadoop Technology Insider: In-depth Analysis of the Design and Implementation of Hadoop Common and HDFS Architecture", in Hadoop 1.0 the FileSystem.Cache.Key class also had a unique field: when the other three fields were equal but the user did not want to share the file system, this value (0 by default) could be set to force a distinct key. Why it was later removed I do not know; if any reader knows, please tell me. Thanks.
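The role of the key can be shown with a self-contained analog (hypothetical class; a plain user string stands in for UserGroupInformation): equals() and hashCode() are overridden over the three fields, so equal keys map to one shared file system instance.

```java
import java.util.Objects;

// Hypothetical analog of FileSystem.Cache.Key: equality over scheme,
// authority, and user decides whether a cached instance can be shared.
final class CacheKey {
    final String scheme;
    final String authority;
    final String user;   // stands in for UserGroupInformation

    CacheKey(String scheme, String authority, String user) {
        this.scheme = scheme;
        this.authority = authority;
        this.user = user;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof CacheKey)) return false;
        CacheKey k = (CacheKey) o;
        return Objects.equals(scheme, k.scheme)
            && Objects.equals(authority, k.authority)
            && Objects.equals(user, k.user);
    }

    @Override public int hashCode() {
        return Objects.hash(scheme, authority, user);
    }
}

public class KeySketch {
    public static void main(String[] args) {
        CacheKey a = new CacheKey("hdfs", "localhost:9000", "alice");
        CacheKey b = new CacheKey("hdfs", "localhost:9000", "alice");
        CacheKey c = new CacheKey("hdfs", "localhost:9000", "bob");
        System.out.println(a.equals(b));  // true: same user shares the instance
        System.out.println(a.equals(c));  // false: different users do not share
    }
}
```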

Back to the last line of FileSystem.get(final URI uri, final Configuration conf): return CACHE.get(uri, conf) calls the FileSystem.Cache.get() method to obtain the concrete file system object. Its code is as follows:

    FileSystem get(URI uri, Configuration conf) throws IOException {
      Key key = new Key(uri, conf);
      FileSystem fs = null;
      synchronized (this) {
        fs = map.get(key);
      }
      if (fs != null) {
        return fs;
      }

      fs = createFileSystem(uri, conf);
      synchronized (this) { // refetch the lock again
        FileSystem oldfs = map.get(key);
        if (oldfs != null) { // a file system is created while lock is releasing
          fs.close();        // close the new file system
          return oldfs;      // return the old file system
        }

        // now insert the new file system into the map
        if (map.isEmpty() && !clientFinalizer.isAlive()) {
          Runtime.getRuntime().addShutdownHook(clientFinalizer);
        }
        fs.key = key;
        map.put(key, fs);
        return fs;
      }
    }
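The check-create-recheck pattern in Cache.get() can be reproduced in a self-contained sketch (a plain HashMap plus synchronized blocks; the shutdown-hook bookkeeping and close() of the losing instance are omitted):

```java
import java.util.HashMap;
import java.util.Map;

public class CacheSketch {
    interface Fs { }   // stands in for FileSystem

    private final Map<String, Fs> map = new HashMap<>();

    Fs get(String key) {
        Fs fs;
        synchronized (this) {            // first look-up under the lock
            fs = map.get(key);
        }
        if (fs != null) {
            return fs;
        }
        fs = new Fs() { };               // create outside the lock (may be slow)
        synchronized (this) {            // re-fetch the lock and re-check
            Fs oldFs = map.get(key);
            if (oldFs != null) {         // another thread won the race
                return oldFs;            // discard ours, return the old one
            }
            map.put(key, fs);
            return fs;
        }
    }

    public static void main(String[] args) {
        CacheSketch cache = new CacheSketch();
        Object first = cache.get("hdfs://localhost:9000");
        Object second = cache.get("hdfs://localhost:9000");
        System.out.println(first == second);  // true: the instance is shared
    }
}
```

Creating the file system outside the lock keeps a slow construction from blocking every other caller; the second synchronized block then resolves the race in favor of whichever thread inserted first.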
In this method, the cache first checks whether the map already holds the requested file system object; if so, it is returned directly, and otherwise a new one is created. Because FileSystem.Cache is held in a static field, multiple threads may access it at the same time, so the Cache methods must synchronize their reads and writes of the map. The method is fairly simple; its core is

      fs = createFileSystem(uri, conf);
This statement creates the concrete file system object. createFileSystem() is a private method of FileSystem with the following code:

    private static FileSystem createFileSystem(URI uri, Configuration conf)
        throws IOException {
      Class<?> clazz = conf.getClass("fs." + uri.getScheme() + ".impl", null);
      LOG.debug("Creating filesystem for " + uri);
      if (clazz == null) {
        throw new IOException("No FileSystem for scheme: " + uri.getScheme());
      }
      FileSystem fs = (FileSystem) ReflectionUtils.newInstance(clazz, conf);
      fs.initialize(uri, conf);
      return fs;
    }
The implementation first obtains the class corresponding to the URI scheme from the configuration. For example, in core-default.xml the property (key) fs.hdfs.impl has the value org.apache.hadoop.hdfs.DistributedFileSystem; the corresponding XML is:

<property>
  <name>fs.hdfs.impl</name>
  <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
  <description>The FileSystem for hdfs: uris.</description>
</property>

So for an hdfs:// URI, the clazz in createFileSystem() is the Class object of org.apache.hadoop.hdfs.DistributedFileSystem. Reflection is then used to create the DistributedFileSystem object fs, and fs.initialize(uri, conf) is executed to initialize it. DistributedFileSystem is the implementation class of the Hadoop distributed file system: it implements the Hadoop file system interface and handles HDFS file and directory operations.
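The scheme-to-class lookup plus reflective construction can be sketched in a self-contained form (a plain Properties and Class.forName stand in for Configuration.getClass and ReflectionUtils.newInstance; the Fs interface and LocalFs class are hypothetical):

```java
import java.util.Properties;

public class FactorySketch {
    // Minimal stand-ins for the FileSystem contract.
    public interface Fs { void initialize(String uri); }
    public static class LocalFs implements Fs {
        public String uri;
        public void initialize(String uri) { this.uri = uri; }
    }

    // Mirrors createFileSystem(): look up "fs.<scheme>.impl", load the
    // class, instantiate it reflectively, then initialize it with the URI.
    static Fs createFileSystem(String uri, Properties conf) throws Exception {
        String scheme = uri.substring(0, uri.indexOf(':'));
        String className = conf.getProperty("fs." + scheme + ".impl");
        if (className == null) {
            throw new java.io.IOException("No FileSystem for scheme: " + scheme);
        }
        Fs fs = (Fs) Class.forName(className)
                         .getDeclaredConstructor().newInstance();
        fs.initialize(uri);   // mirrors fs.initialize(uri, conf)
        return fs;
    }

    public static void main(String[] args) throws Exception {
        Properties conf = new Properties();
        conf.setProperty("fs.file.impl", LocalFs.class.getName());
        Fs fs = createFileSystem("file:///tmp", conf);
        System.out.println(fs.getClass().getSimpleName());  // LocalFs
    }
}
```

Registering implementations by property key is what lets users plug in their own file system: adding a fs.&lt;scheme&gt;.impl entry is enough, with no change to the factory.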

After DistributedFileSystem.initialize() has executed, the FileSystem object has been created successfully.

Reference:

Hadoop Technology Insider: In-depth Analysis of the Design and Implementation of Hadoop Common and HDFS Architecture
