Hadoop-0.20.0 source code analysis (03)

In the Hadoop framework source code, the org.apache.hadoop.fs package holds everything related to Hadoop's file systems: it establishes the file system model, defines the file system abstraction, and implements the basic operations on files stored in a file system.

In this package, the class inheritance relationships are as follows:

java.lang.Object
    org.apache.hadoop.fs.BlockLocation (implements org.apache.hadoop.io.Writable)
    org.apache.hadoop.conf.Configured (implements org.apache.hadoop.conf.Configurable)
        org.apache.hadoop.fs.FileSystem (implements java.io.Closeable)
            org.apache.hadoop.fs.FilterFileSystem
                org.apache.hadoop.fs.ChecksumFileSystem
                    org.apache.hadoop.fs.InMemoryFileSystem
                    org.apache.hadoop.fs.LocalFileSystem
                org.apache.hadoop.fs.HarFileSystem
            org.apache.hadoop.fs.RawLocalFileSystem
        org.apache.hadoop.fs.FsShell (implements org.apache.hadoop.util.Tool)
        org.apache.hadoop.fs.Trash
    org.apache.hadoop.fs.ContentSummary (implements org.apache.hadoop.io.Writable)
    org.apache.hadoop.fs.FileChecksum (implements org.apache.hadoop.io.Writable)
        org.apache.hadoop.fs.MD5MD5CRC32FileChecksum
    org.apache.hadoop.fs.FileStatus (implements java.lang.Comparable, org.apache.hadoop.io.Writable)
    org.apache.hadoop.fs.FileSystem.Statistics
    org.apache.hadoop.fs.FileUtil
    org.apache.hadoop.fs.FileUtil.HardLink
    org.apache.hadoop.fs.FsUrlStreamHandlerFactory (implements java.net.URLStreamHandlerFactory)
    java.io.InputStream (implements java.io.Closeable)
        java.io.FilterInputStream
            java.io.BufferedInputStream
                org.apache.hadoop.fs.BufferedFSInputStream (implements org.apache.hadoop.fs.PositionedReadable, org.apache.hadoop.fs.Seekable)
            java.io.DataInputStream (implements java.io.DataInput)
                org.apache.hadoop.fs.FSDataInputStream (implements org.apache.hadoop.fs.PositionedReadable, org.apache.hadoop.fs.Seekable)
        org.apache.hadoop.fs.FSInputStream (implements org.apache.hadoop.fs.PositionedReadable, org.apache.hadoop.fs.Seekable)
            org.apache.hadoop.fs.FSInputChecker
    org.apache.hadoop.fs.LocalDirAllocator
    java.io.OutputStream (implements java.io.Closeable, java.io.Flushable)
        java.io.FilterOutputStream
            java.io.DataOutputStream (implements java.io.DataOutput)
                org.apache.hadoop.fs.FSDataOutputStream (implements org.apache.hadoop.fs.Syncable)
        org.apache.hadoop.fs.FSOutputSummer
    org.apache.hadoop.fs.Path (implements java.lang.Comparable)
    org.apache.hadoop.util.Shell
        org.apache.hadoop.fs.DF
        org.apache.hadoop.fs.DU
    java.lang.Throwable (implements java.io.Serializable)
        java.lang.Error
            org.apache.hadoop.fs.FSError
        java.lang.Exception
            java.io.IOException
                org.apache.hadoop.fs.ChecksumException

First, let's read and analyze the source code of FileSystem, the top-level abstract class of the file system.

The FileSystem abstract class extends the configuration base class org.apache.hadoop.conf.Configured and implements the java.io.Closeable interface. From this we can see that FileSystem is an abstract definition of a file system, and that it is configurable: in other words, you can describe a file system by specifying configuration items in a configuration file. The central configuration class is org.apache.hadoop.conf.Configuration; the methods defined in org.apache.hadoop.conf.Configured exist to set and get an org.apache.hadoop.conf.Configuration instance, satisfying the org.apache.hadoop.conf.Configurable contract.
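For example, here is a minimal sketch of how a client typically obtains the file system described by a Configuration; the namenode address is a made-up example:

Configuration conf = new Configuration();
// fs.default.name is the key named by FS_DEFAULT_NAME_KEY below;
// the URI value here is hypothetical.
conf.set("fs.default.name", "hdfs://namenode:9000");
FileSystem fs = FileSystem.get(conf); // returns the FileSystem the configuration describes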

The FileSystem abstract class defines the basic characteristics and basic operations of a file system. Start with the class's field definitions, which describe the static characteristics of a file system. The class defines the following fields:

private static final String FS_DEFAULT_NAME_KEY = "fs.default.name";

/** The file system cache. */
private static final Cache CACHE = new Cache();

/** This file system's (this) key in the cache. */
private Cache.Key key;

/** Map recording the statistics per FileSystem class. */
private static final Map<Class<? extends FileSystem>, Statistics>
    statisticsTable =
    new IdentityHashMap<Class<? extends FileSystem>, Statistics>();

/**
 * The statistics for this file system (this).
 */
protected Statistics statistics;

/**
 * Files to delete when the file system is closed or the JVM exits.
 * The Set<Path> holds the paths of the cached files, kept in sorted order.
 */
private Set<Path> deleteOnExit = new TreeSet<Path>();

The cache definition already hints at how file systems implemented in the Hadoop framework work: a file system can manage and cache instances of multiple related file systems. Such a file system coordinates storage work and provides a convenient storage foundation for Hadoop's MapReduce parallel computing framework.

File System Cache

The FileSystem abstract class defines a cache used to hold file system objects. That is, multiple file system objects may exist at once, so in addition to managing its own content, each file system may manage a set of cached FileSystem instances; this depends on how the concrete file system is implemented.

Of course, it may also be that in a distributed environment one file system manages both remote and local file system instances.

To quickly look up a cached file system object, Hadoop stores file system objects as key-value pairs in a HashMap, namely the map field defined by the org.apache.hadoop.fs.FileSystem.Cache class:

private final Map<Key, FileSystem> map = new HashMap<Key, FileSystem>();

Here, org.apache.hadoop.fs.FileSystem.Cache.Key is a static inner class of org.apache.hadoop.fs.FileSystem.Cache and serves as the key of the cache's map. The content of a Key is a URI's information plus a user name; the Key class has the following fields:

final String scheme;
final String authority;
final String username;

The values cached by the org.apache.hadoop.fs.FileSystem.Cache map are subclasses of the FileSystem abstract class. So, given valid URI information and a user name, you can quickly fetch a file system object from the cache and, through it, reach the file information of that file system. The Cache class provides three basic operations:

/** Fetches a FileSystem instance from the cache for the given URI and
 *  configuration; access to the cache must be synchronized. */
synchronized FileSystem get(URI uri, Configuration conf) throws IOException;

/** Removes the FileSystem instance stored under the given Key from the
 *  cache; access to the cache must be synchronized. */
synchronized void remove(Key key, FileSystem fs);

/** Iterates over the cache map and closes all cached FileSystem
 *  instances; access to the cache must be synchronized. */
synchronized void closeAll() throws IOException;
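To make the lookup concrete, here is a minimal sketch of the cache-miss path of get, assuming Key has a (URI, Configuration) constructor and a helper createFileSystem that instantiates and initializes the concrete class (the real implementation also registers a shutdown hook, which is omitted here):

synchronized FileSystem get(URI uri, Configuration conf) throws IOException {
  Key key = new Key(uri, conf);      // scheme + authority + user name
  FileSystem fs = map.get(key);
  if (fs == null) {                  // cache miss: build and remember the instance
    fs = createFileSystem(uri, conf);
    fs.key = key;
    map.put(key, fs);
  }
  return fs;
}

One visible consequence: two FileSystem.get calls whose URIs share the same scheme, authority and user resolve to the same cached instance, because the path component does not participate in the key.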

File System Statistics

The statisticsTable mentioned above is an IdentityHashMap<Class<? extends FileSystem>, Statistics>: the key is a class derived from FileSystem, and the value is that class's Statistics instance. To compute safely in a parallel environment, the Statistics class uses atomic variables from the java.util.concurrent.atomic package, which guarantee thread-safe atomic reads and writes while preserving parallel performance. Its counters are:

private AtomicLong bytesRead = new AtomicLong();
private AtomicLong bytesWritten = new AtomicLong();

Here, bytesRead accumulates the number of bytes read: each read atomically adds its byte count to the running total. Likewise, bytesWritten accumulates bytes written using atomic operations.
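A sketch of what the increment side looks like with AtomicLong; the method names follow the class's naming style, but treat the bodies as illustrative:

/** Atomically add newBytes to the running total of bytes read. */
public void incrementBytesRead(long newBytes) {
  bytesRead.getAndAdd(newBytes);
}

/** Atomically add newBytes to the running total of bytes written. */
public void incrementBytesWritten(long newBytes) {
  bytesWritten.getAndAdd(newBytes);
}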

The other statistics field, protected Statistics statistics, is the Statistics instance of the current (this) file system. It is set after the file system instance is constructed, when the initialize method is called:

public void initialize(URI name, Configuration conf) throws IOException {
  statistics = getStatistics(name.getScheme(), getClass());
}

The initialize method then calls getStatistics to obtain an initialized Statistics instance. Inside that method, a newly instantiated Statistics is added to the statistics cache statisticsTable, so that the statistics of the corresponding file system class can later be fetched quickly for a given URI scheme.
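A sketch of that lookup-or-create logic, using the field names introduced above:

public static synchronized Statistics getStatistics(String scheme,
    Class<? extends FileSystem> cls) {
  Statistics result = statisticsTable.get(cls);
  if (result == null) {               // first use of this FileSystem class
    result = new Statistics(scheme);
    statisticsTable.put(cls, result); // cache for subsequent lookups
  }
  return result;
}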

To make working with file system statistics convenient, the FileSystem class provides several helper methods. Only the declarations are listed here:

public static synchronized Map<String, Statistics> getStatistics();
public static synchronized List<Statistics> getAllStatistics();
public static synchronized Statistics getStatistics(String scheme,
    Class<? extends FileSystem> cls);
public static synchronized void clearStatistics();
public static synchronized void printStatistics() throws IOException;

These methods all read the file system statistics out of statisticsTable.
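For instance, a client could dump the collected counters like this (a hypothetical usage; Statistics's toString reports the byte counts):

for (FileSystem.Statistics stats : FileSystem.getAllStatistics()) {
  System.out.println(stats);   // e.g. "1024 bytes read and 512 bytes written"
}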

File Cache

The field Set<Path> deleteOnExit is a file cache that collects the paths of files scheduled for deletion: when the file system is closed or the JVM exits, every file in the cache must be deleted. The deletion itself happens in the processDeleteOnExit method, shown below:

/**
 * Delete all files that were marked for delete-on-exit; access to
 * deleteOnExit must be synchronized.
 */
protected void processDeleteOnExit() {
  synchronized (deleteOnExit) {
    for (Iterator<Path> iter = deleteOnExit.iterator(); iter.hasNext();) {
      Path path = iter.next();
      try {
        delete(path, true); // recursively delete the path, its subdirectories and files
      } catch (IOException e) {
        LOG.info("Ignoring failure to deleteOnExit for path " + path);
      }
      iter.remove();
    }
  }
}

So, for a file to be removed when the file system is closed or the JVM exits, its path must first be added to the deleteOnExit cache; processDeleteOnExit then deletes the cached files at close or exit time. Adding a file to this cache is implemented in the deleteOnExit method.
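A sketch of the marking side, consistent with the description above:

/** Mark path f for deletion when this file system is closed or the JVM exits. */
public boolean deleteOnExit(Path f) throws IOException {
  if (!exists(f)) {
    return false;              // nothing to mark
  }
  synchronized (deleteOnExit) {
    deleteOnExit.add(f);       // remembered until close() / JVM exit
  }
  return true;
}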

File System Abstraction

Next, let's look at the "abstract" part of the FileSystem abstract class: which file system operations it declares abstractly. This tells us which basic operations must be provided in order to implement a file system. As shown below, the FileSystem abstract class defines 12 abstract methods:

/** Returns the URI that uniquely identifies this file system. */
public abstract URI getUri();

/**
 * Opens an FSDataInputStream for the file at the given path f.
 * @param f the file to open
 * @param bufferSize the buffer size
 */
public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException;

/**
 * Opens an FSDataOutputStream for writing.
 * @param f the file to be written
 * @param permission the file permission
 * @param overwrite whether to overwrite an existing file
 * @param bufferSize the buffer size
 * @param replication the number of block replicas for the file
 * @param blockSize the block size
 * @param progress used to report progress back to the Hadoop framework
 * @throws IOException
 */
public abstract FSDataOutputStream create(Path f,
    FsPermission permission,
    boolean overwrite,
    int bufferSize,
    short replication,
    long blockSize,
    Progressable progress) throws IOException;

/**
 * Appends to an existing file.
 * @param f the file to append to
 * @param bufferSize the buffer size
 * @param progress used to report progress
 * @throws IOException
 */
public abstract FSDataOutputStream append(Path f, int bufferSize,
    Progressable progress) throws IOException;

/** Renames the file src to dst. */
public abstract boolean rename(Path src, Path dst) throws IOException;

/** Deletes a file. */
public abstract boolean delete(Path f) throws IOException;

/** Deletes a file, recursively if requested. */
public abstract boolean delete(Path f, boolean recursive) throws IOException;

/** If f is a directory, lists the files in that directory. */
public abstract FileStatus[] listStatus(Path f) throws IOException;

/** Sets the current working directory of the file system. */
public abstract void setWorkingDirectory(Path new_dir);

/** Gets the current working directory of the file system. */
public abstract Path getWorkingDirectory();

/** Creates the directory f. */
public abstract boolean mkdirs(Path f, FsPermission permission) throws IOException;

/** Gets the file status (FileStatus) for f. */
public abstract FileStatus getFileStatus(Path f) throws IOException;

These abstract methods are the basic operations any file system should possess. A subclass of the FileSystem abstract class can be designed according to different needs, and the implementation details of some operations will vary with the characteristics of the concrete file system, so you can flexibly design the file system you need.
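To see most of these operations in one place, here is a small self-contained walk-through against the default file system (local unless configured otherwise); the class name and path are hypothetical:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsSmokeTest {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);            // default file system
    Path file = new Path("/tmp/fs-smoke-test.txt");  // hypothetical path

    FSDataOutputStream out = fs.create(file);        // create
    out.writeUTF("hello");                           // write via the DataOutputStream API
    out.close();

    FSDataInputStream in = fs.open(file);            // open
    System.out.println(in.readUTF());                // read back: "hello"
    in.close();

    System.out.println(fs.getFileStatus(file).getLen()); // getFileStatus
    fs.delete(file, false);                          // delete (non-recursive)
  }
}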


File Operations

The FileSystem class also contains many file-related operations, including file creation, reading and writing, renaming, copying, and deletion.

File creation covers both directory creation and non-directory file creation. The directory creation methods are:

public boolean mkdirs(Path f) throws IOException {
  return mkdirs(f, FsPermission.getDefault());
}

public abstract boolean mkdirs(Path f, FsPermission permission) throws IOException;

The FileSystem abstract class leaves the details of directory creation to its subclasses.

In addition, there is a static helper for creating a directory on another file system instance:

public static boolean mkdirs(FileSystem fs, Path dir, FsPermission permission) throws IOException {
  boolean result = fs.mkdirs(dir);   // create the directory with default permissions
  fs.setPermission(dir, permission); // then set the requested permission on dir
  return result;
}

This method shows that, from the current file system (this), a directory can be created with the specified permission in another file system fs; when fs is remote, this amounts to a remote directory creation operation.
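A hypothetical usage, with a made-up namenode address and directory:

FileSystem remote = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);
FileSystem.mkdirs(remote, new Path("/user/demo"),
    new FsPermission((short) 0755));   // create with mode 755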

Creating a non-directory file chiefly means opening a file for reading or writing and returning its stream object, through which the file can be read, written, or appended. For file creation, 10 overloaded methods are built on a single abstract create method:

public abstract FSDataOutputStream create(Path f,
    FsPermission permission,
    boolean overwrite,
    int bufferSize,
    short replication,
    long blockSize,
    Progressable progress) throws IOException;

There is also a special static create method, shown below:

public static FSDataOutputStream create(FileSystem fs,
    Path file, FsPermission permission) throws IOException {
  FSDataOutputStream out = fs.create(file); // create the file with default permissions and get its output stream
  fs.setPermission(file, permission);       // then set the requested permission on the file
  return out;
}

Its parameters show that, from the current file system (this), a file can be created with the specified permission in another file system fs; when fs is remote, this is a remote file creation. As long as the permissions satisfy what the remote file system fs requires for creation, and the necessary communication conditions are met, distributed file operations can be performed.

Two overloaded open methods open an existing file and return its stream object. There is also a createNewFile method whose internal implementation likewise calls create.
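A sketch of createNewFile consistent with that description: fail if the file already exists, otherwise create it empty through create and immediately close the stream (the buffer-size lookup mirrors the usual io.file.buffer.size default):

public boolean createNewFile(Path f) throws IOException {
  if (exists(f)) {
    return false;   // the file is already there
  }
  create(f, false, getConf().getInt("io.file.buffer.size", 4096)).close();
  return true;
}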

File appending is implemented through three overloaded append methods; when an append succeeds, the stream object org.apache.hadoop.fs.FSDataOutputStream is returned.

File renaming is defined by the abstract method rename(Path, Path).

File deletion is defined by the delete methods.

Copying local files is implemented mainly through two sets of overloaded methods. The copyFromLocalFile overloads copy the source file to the destination and keep the source (a copy operation); the moveFromLocalFile overloads copy the source file to the destination and then delete the source, i.e. a file move (cut) operation.
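Hypothetical usage of the two alternatives (the paths are made up; fs is assumed to be in scope):

// Copy: the local source is kept.
fs.copyFromLocalFile(new Path("/tmp/report.txt"), new Path("/user/demo/report.txt"));
// Move: the local source is deleted after the copy.
fs.moveFromLocalFile(new Path("/tmp/report.txt"), new Path("/user/demo/report.txt"));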

Files, Blocks, and Replicas

Background on files and blocks, and the meaning and configuration of the related parameters, can be learned from Hadoop's architecture design.

For blocks, FileSystem defines the following two methods:

/**
 * Gets the block size of file f.
 */
public long getBlockSize(Path f) throws IOException {
  return getFileStatus(f).getBlockSize();
}

/**
 * Gets the default block size.
 */
public long getDefaultBlockSize() {
  // default to 32 MB
  return getConf().getLong("fs.local.block.size", 32 * 1024 * 1024);
}
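Since the default comes from the configuration, it can be overridden before the file system is initialized; a hypothetical example:

Configuration conf = new Configuration();
conf.setLong("fs.local.block.size", 64 * 1024 * 1024); // raise the default to 64 MB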

To guarantee the reliability and availability of the Hadoop distributed file system, redundant replica storage and pipelined replication are used, so file replicas must be configurable. The following methods operate on replica parameters:

/**
 * Sets the replication factor of the file src to replication.
 */
public boolean setReplication(Path src, short replication) throws IOException {
  return true;
}

/**
 * Gets the replication factor of the file src.
 */
@Deprecated
public short getReplication(Path src) throws IOException {
  return getFileStatus(src).getReplication();
}

/**
 * Gets the default number of replicas of a file, i.e. the default
 * replication factor.
 */
public short getDefaultReplication() { return 1; }
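Hypothetical usage (the path is made up); note that concrete file systems override these base-class stubs with real behavior:

fs.setReplication(new Path("/user/demo/report.txt"), (short) 3); // request 3 replicas
short r = fs.getFileStatus(new Path("/user/demo/report.txt")).getReplication();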

A set of overloaded listStatus methods returns file status information, represented by org.apache.hadoop.fs.FileStatus objects. FileStatus implements the org.apache.hadoop.io.Writable interface and is therefore serializable. It mainly holds the following information about a file:

private Path path;               // file path
private long length;             // file length
private boolean isdir;           // whether this is a directory
private short block_replication; // block replication factor
private long blocksize;          // block size
private long modification_time;  // modification time
private long access_time;        // access time
private FsPermission permission; // permissions in the owning file system
private String owner;            // file owner
private String group;            // owning group

Blocks are the basic unit of a file: a given file has a list of blocks. The getFileBlockLocations method returns, for a portion of a file, the list of hosts holding the corresponding blocks, the offset within the file, and other information, as follows:

/**
 * Returns a BlockLocation[] holding the list of host names, the offset
 * within the file, and size information.
 */
public BlockLocation[] getFileBlockLocations(FileStatus file,
    long start, long len) throws IOException {
  if (file == null) {
    return null;
  }

  if ((start < 0) || (len < 0)) {
    throw new IllegalArgumentException("Invalid start or len parameter");
  }

  if (file.getLen() < start) {
    return new BlockLocation[0];
  }
  String[] name = { "localhost:50010" };
  String[] host = { "localhost" };
  return new BlockLocation[] {
      new BlockLocation(name, host, 0, file.getLen()) };
}
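Hypothetical usage (assumes fs and the path are in scope, inside a method declared to throw IOException, since the BlockLocation getters may throw):

FileStatus st = fs.getFileStatus(new Path("/user/demo/report.txt"));
BlockLocation[] blocks = fs.getFileBlockLocations(st, 0, st.getLen());
for (BlockLocation b : blocks) {
  System.out.println("offset " + b.getOffset() + ", length " + b.getLength()
      + ", hosts " + java.util.Arrays.toString(b.getHosts()));
}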

Here the org.apache.hadoop.fs.BlockLocation class carries the block information of a specified file. It implements the org.apache.hadoop.io.Writable interface, so it is serializable, and holds the following information:

private String[] hosts;         // hostnames of the datanodes
private String[] names;         // hostname:portNumber of the datanodes
private String[] topologyPaths; // full path names in the network topology
private long offset;            // offset of the block within the file
private long length;            // length of the block

In addition, the FileSystem class defines the globStatus method, which filters file paths in the file system with a given PathFilter and returns an array FileStatus[] of status objects for the paths that match.
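Hypothetical usage, matching part files under date-stamped log directories:

FileStatus[] parts = fs.globStatus(new Path("/logs/2009-*/part-*"));
for (FileStatus s : parts) {
  System.out.println(s.getPath());
}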
