Hadoop File System

Source: Internet
Author: User
Tags: hadoop, fs


HDFS is the most commonly used distributed file system when processing big data with the Hadoop framework. However, HDFS is not the only file system in Hadoop: others such as HFTP, HSFTP, and HAR are also integrated into Hadoop so that data stored in different systems can be processed. In fact, Hadoop provides a comprehensive, pluggable file-system abstraction.

Next, let's take a look at the structure of the file-system class hierarchy.

The above UML diagram is admittedly redundant, but in order to clearly show the members of the fs package I have tried to list them all. In Hadoop's file-system hierarchy, FileSystem is the parent class of all the other members. It is an abstract class whose purpose is to define a standard for accessing a file system; the other components in the Hadoop framework interact with a file system only through the contract that the FileSystem class provides. Therefore, if you define your own file system, you must follow this standard (admittedly, I have never tried this myself). In my opinion, programming against abstractions means programming against such standards. As for the subclasses of FileSystem, whether they merely implement the abstract methods or also override concrete methods of the parent class is, I think, not very important; what matters is understanding this design idea.

When using a file system, you generally call FileSystem.get(Configuration conf) or FileSystem.get(URI uri, Configuration conf) to obtain a file-system instance. The two methods are closely related. Let's take a look at the source code.

public static FileSystem get(URI uri, Configuration conf) throws IOException {
  String scheme = uri.getScheme();
  String authority = uri.getAuthority();

  if (scheme == null) {                          // no scheme: use default FS
    return get(conf);
  }

  if (authority == null) {                       // no authority
    URI defaultUri = getDefaultUri(conf);
    if (scheme.equals(defaultUri.getScheme())    // if scheme matches default
        && defaultUri.getAuthority() != null) {  // & default has authority
      return get(defaultUri, conf);              // return default
    }
  }

  String disableCacheName = String.format("fs.%s.impl.disable.cache", scheme);
  if (conf.getBoolean(disableCacheName, false)) {
    return createFileSystem(uri, conf);
  }
  return CACHE.get(uri, conf);
}

/** Returns the configured filesystem implementation. */
public static FileSystem get(Configuration conf) throws IOException {
  return get(getDefaultUri(conf), conf);
}

public static URI getDefaultUri(Configuration conf) {
  return URI.create(fixName(conf.get(FS_DEFAULT_NAME_KEY, "file:///")));
}
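The branching at the top of get(URI, Configuration) can be illustrated with a minimal self-contained sketch. Note that SchemeResolver and the map-based "configuration" below are hypothetical stand-ins for illustration, not Hadoop classes:

```java
import java.net.URI;
import java.util.Map;

// Hypothetical stand-in for the scheme-resolution logic in FileSystem.get(URI, conf).
public class SchemeResolver {

    // Mimics the fs.default.name lookup: falls back to "file:///" when unset.
    public static URI defaultUri(Map<String, String> conf) {
        return URI.create(conf.getOrDefault("fs.default.name", "file:///"));
    }

    // Returns the URI whose scheme ultimately decides which FileSystem is created.
    public static URI resolve(URI uri, Map<String, String> conf) {
        if (uri.getScheme() == null) {           // no scheme: use the default FS
            return defaultUri(conf);
        }
        if (uri.getAuthority() == null) {        // no authority
            URI def = defaultUri(conf);
            if (uri.getScheme().equals(def.getScheme())
                    && def.getAuthority() != null) {
                return def;                      // borrow the default file system
            }
        }
        return uri;
    }

    public static void main(String[] args) {
        Map<String, String> conf = Map.of("fs.default.name", "hdfs://namenode:8020");
        System.out.println(resolve(URI.create("/user/data"), conf));      // hdfs://namenode:8020
        System.out.println(resolve(URI.create("hdfs:/user/data"), conf)); // hdfs://namenode:8020
        System.out.println(resolve(URI.create("s3://bucket/key"), conf)); // s3://bucket/key
    }
}
```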



When you call FileSystem.get(Configuration conf), it actually delegates to FileSystem.get(URI uri, Configuration conf). Obtaining a file-system instance via FileSystem.get(Configuration conf) reads the value of the key fs.default.name from the configuration file core-site.xml; if this configuration does not exist, the default file:/// (the local file system) is used. Because FileSystem.get(conf) reads the configuration file, it is very convenient to swap in a different file system later. Of course, FileSystem.get(uri, conf) can also be used to change the file system later, simply by passing a different URI when running the program. So the remaining question lies with FileSystem.get(uri, conf): all the power is in its hands. Let's see how this method obtains a concrete file-system instance. After verifying that the URI is complete, it either calls createFileSystem(uri, conf) directly (when caching is disabled) or returns the instance from CACHE.get(uri, conf), which also calls createFileSystem(uri, conf) internally, so everything comes down to this one method. Look at the source code of createFileSystem(uri, conf) and you will understand what is going on.

private static FileSystem createFileSystem(URI uri, Configuration conf)
    throws IOException {
  Class<?> clazz = conf.getClass("fs." + uri.getScheme() + ".impl", null);
  LOG.debug("Creating filesystem for " + uri);
  if (clazz == null) {
    throw new IOException("No FileSystem for scheme: " + uri.getScheme());
  }
  FileSystem fs = (FileSystem) ReflectionUtils.newInstance(clazz, conf);
  fs.initialize(uri, conf);
  return fs;
}

This makes it clear: Class<?> clazz = conf.getClass("fs." + uri.getScheme() + ".impl", null); reads the value of fs.[scheme].impl to obtain the Class object of the FileSystem implementation. Once you have the Class object, the rest is easy: all the details of the class are, in effect, exposed to the programmer, and reflection is used to create an instance of the file system from it. Below is the relevant configuration snippet from core-default.xml.

<property>
  <name>fs.file.impl</name>
  <value>org.apache.hadoop.fs.LocalFileSystem</value>
  <description>The FileSystem for file: uris.</description>
</property>
<property>
  <name>fs.hdfs.impl</name>
  <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
  <description>The FileSystem for hdfs: uris.</description>
</property>
<property>
  <name>fs.s3.impl</name>
  <value>org.apache.hadoop.fs.s3.S3FileSystem</value>
  <description>The FileSystem for s3: uris.</description>
</property>
<property>
  <name>fs.s3n.impl</name>
  <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
  <description>The FileSystem for s3n: (Native S3) uris.</description>
</property>
<property>
  <name>fs.kfs.impl</name>
  <value>org.apache.hadoop.fs.kfs.KosmosFileSystem</value>
  <description>The FileSystem for kfs: uris.</description>
</property>
<property>
  <name>fs.hftp.impl</name>
  <value>org.apache.hadoop.hdfs.HftpFileSystem</value>
</property>
<property>
  <name>fs.hsftp.impl</name>
  <value>org.apache.hadoop.hdfs.HsftpFileSystem</value>
</property>
<property>
  <name>fs.webhdfs.impl</name>
  <value>org.apache.hadoop.hdfs.web.WebHdfsFileSystem</value>
</property>
<property>
  <name>fs.ftp.impl</name>
  <value>org.apache.hadoop.fs.ftp.FTPFileSystem</value>
  <description>The FileSystem for ftp: uris.</description>
</property>
<property>
  <name>fs.ramfs.impl</name>
  <value>org.apache.hadoop.fs.InMemoryFileSystem</value>
  <description>The FileSystem for ramfs: uris.</description>
</property>
<property>
  <name>fs.har.impl</name>
  <value>org.apache.hadoop.fs.HarFileSystem</value>
  <description>The filesystem for Hadoop archives.</description>
</property>
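The scheme-to-class lookup above can be mimicked with a small self-contained sketch. The FsFactory class, its Fs interface, and the two toy implementations are hypothetical illustrations of the mechanism, not Hadoop code:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the "fs.[scheme].impl -> class name -> reflection" lookup
// performed by createFileSystem(uri, conf).
public class FsFactory {
    public interface Fs { String name(); }
    public static class LocalFs implements Fs { public String name() { return "local"; } }
    public static class DistributedFs implements Fs { public String name() { return "hdfs"; } }

    // Mirrors the fs.[scheme].impl entries in core-default.xml.
    public static final Map<String, String> CONF = new HashMap<>();
    static {
        CONF.put("fs.file.impl", LocalFs.class.getName());
        CONF.put("fs.hdfs.impl", DistributedFs.class.getName());
    }

    public static Fs create(String scheme) throws Exception {
        String className = CONF.get("fs." + scheme + ".impl");
        if (className == null) {
            // Same failure mode as Hadoop: an unknown scheme is an error.
            throw new java.io.IOException("No FileSystem for scheme: " + scheme);
        }
        Class<?> clazz = Class.forName(className);                 // load the impl class
        return (Fs) clazz.getDeclaredConstructor().newInstance();  // reflective instantiation
    }

    public static void main(String[] args) throws Exception {
        System.out.println(create("hdfs").name()); // hdfs
        System.out.println(create("file").name()); // local
    }
}
```

Because the factory is driven entirely by configuration, registering a new file system only requires adding another fs.[scheme].impl entry; no caller code changes.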

In the preceding class diagram, three classes are marked. The following explains which file systems support checksum verification. What is a checksum? When a file is written, a checksum is computed over the data and stored alongside it; when the file is read, the checksum is recomputed and compared with the stored value, so any discrepancy reveals that the data has been altered (to put it bluntly). The purpose is to ensure data integrity: in Hadoop, data integrity is guaranteed through this verification at read/write time. In Hadoop's fs package, not all file systems support checksums. LocalFileSystem and RawLocalFileSystem illustrate the difference.
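The verify-on-read principle can be shown with a minimal standalone example. This sketch only illustrates the idea behind ChecksumFileSystem; Hadoop's real implementation stores CRC32 checksums of file blocks in hidden .crc side files:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Minimal illustration of checksum-based integrity checking.
public class ChecksumDemo {

    public static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    // On read, recompute the checksum and compare it with the stored value.
    public static boolean verify(byte[] data, long stored) {
        return checksum(data) == stored;
    }

    public static void main(String[] args) {
        byte[] written = "hello hdfs".getBytes(StandardCharsets.UTF_8);
        long stored = checksum(written);             // computed at write time

        System.out.println(verify(written, stored)); // true: data intact

        written[0] ^= 1;                             // simulate a corrupted bit
        System.out.println(verify(written, stored)); // false: corruption detected
    }
}
```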

LocalFileSystem is a file system that supports checksum verification.

public LocalFileSystem() {
  this(new RawLocalFileSystem());
}

public FileSystem getRaw() {
  return rfs;
}

public LocalFileSystem(FileSystem rawLocalFileSystem) {
  super(rawLocalFileSystem);
  rfs = rawLocalFileSystem;
}

The above are the constructors of LocalFileSystem. From them we can see that LocalFileSystem is backed by a RawLocalFileSystem (which does not support checksums). This is the decorator pattern from the design-pattern catalogue.

Note that LocalFileSystem does not extend FileSystem directly. According to Hadoop: The Definitive Guide, any file system in Hadoop that inherits from ChecksumFileSystem gains checksum capability. Perhaps surprisingly, the delegation machinery is actually implemented in FilterFileSystem. Looking at the source code, we can see the decorator pattern at work again: FilterFileSystem will wrap any class that extends FileSystem. The constructors of ChecksumFileSystem and FilterFileSystem are shown below.


public ChecksumFileSystem(FileSystem fs) {
  super(fs);
}


protected FileSystem fs;

/*
 * so that extending classes can define it
 */
public FilterFileSystem() {
}

public FilterFileSystem(FileSystem fs) {
  this.fs = fs;
  this.statistics = fs.statistics;
}

From the preceding three constructors we can see that the RawLocalFileSystem instance is stored in the protected field FileSystem fs of FilterFileSystem; subsequent calls on the wrapper are delegated to this field, so the wrapped file system's methods are ultimately what gets invoked. Thus any file system in Hadoop that inherits ChecksumFileSystem gains checksum capability; InMemoryFileSystem, for example, also supports checksums. What about file systems in Hadoop that do not inherit ChecksumFileSystem? From the constructors of ChecksumFileSystem and FilterFileSystem, we can see that checksum support can be added by wrapping any class that extends FileSystem. For example, to add checksum capability to a DistributedFileSystem instance dfs, use new ChecksumFileSystem(dfs). I call the ChecksumFileSystem class a file-system "coat": with this coat on, a file system becomes stronger.
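The "coat" idea can be sketched in a few lines of self-contained Java. SimpleFs, RawFs, and ChecksumFs below are hypothetical toy classes standing in for FileSystem, RawLocalFileSystem, and ChecksumFileSystem; the point is the wrapping structure, not real checksum arithmetic:

```java
// Hypothetical sketch of the decorator ("coat") pattern described above.
public class DecoratorDemo {
    public interface SimpleFs { String read(String path); }

    // No checksum support, like RawLocalFileSystem.
    public static class RawFs implements SimpleFs {
        public String read(String path) { return "data@" + path; }
    }

    // The "coat": wraps ANY SimpleFs, like ChecksumFileSystem wraps any FileSystem.
    public static class ChecksumFs implements SimpleFs {
        private final SimpleFs inner;               // held like FilterFileSystem's fs field
        public ChecksumFs(SimpleFs inner) { this.inner = inner; }
        public String read(String path) {
            String data = inner.read(path);         // delegate to the wrapped fs
            return data + " [verified]";            // verification layered on top
        }
    }

    public static void main(String[] args) {
        SimpleFs fs = new ChecksumFs(new RawFs());  // put the coat on
        System.out.println(fs.read("/tmp/x"));      // data@/tmp/x [verified]
    }
}
```

Because ChecksumFs only depends on the SimpleFs interface, the same wrapper adds verification to any implementation, which mirrors how new ChecksumFileSystem(dfs) upgrades a DistributedFileSystem.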

The above is a summary of my Hadoop learning so far. There are surely some mistakes in it, and I hope you can help point them out.
