In this section we will delve into Hadoop's FileSystem class, an important interface for interacting with Hadoop file systems. Although we focus on the HDFS implementation, it is worth writing code against the FileSystem abstraction so that it stays portable across the different file system subclasses. This is very useful; for example, you can run the same code against your local file system for testing.
Reading data using Hadoop URLs
One of the simplest ways to read a file from a Hadoop file system is to use a java.net.URL object to open a stream and read the data from it. The typical usage looks like this:
InputStream in = null;
try {
    in = new URL("hdfs://host/path").openStream();
    // process in
} finally {
    IOUtils.closeStream(in);
}
A little more work is needed to make Java recognize URLs that begin with the hdfs scheme: the URL class must be given an FsUrlStreamHandlerFactory via its setURLStreamHandlerFactory() method. This method can be called only once per JVM, so it is usually executed in a static block (see below). If some other part of your program, such as a third-party component whose source you cannot modify, has already called this method, then you cannot read data from Hadoop through a URL (we describe an alternative approach in the next section).
public class URLCat {
    static {
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
In the example above we used two static methods of Hadoop's IOUtils class:
1) IOUtils.copyBytes(), where in is the copy source, System.out is the copy destination (that is, copy to standard output), 4096 is the buffer size used for the copy, and false means that neither the source nor the destination is closed when the copy completes (System.out does not need to be closed, and in is closed in the finally block).
2) IOUtils.closeStream(), which closes a stream.
Here is a sample run:
% hadoop URLCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
Reading data using FileSystem
As we said in the previous section, sometimes it is impossible to set a URLStreamHandlerFactory for your application, and in that case the FileSystem API comes in handy.
A file in a Hadoop file system is represented by a Hadoop Path object (not a java.io.File object, whose semantics are too closely tied to the local file system). You can think of a Path as a Hadoop file system URI, such as hdfs://localhost/user/tom/quangle.txt in the previous example.
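As a small illustrative sketch (the file locations are hypothetical), a Path can be built either from a full URI or from a plain path that is resolved against the default file system:

import org.apache.hadoop.fs.Path;

public class PathExample {
    public static void main(String[] args) {
        // A Path can wrap a full file system URI...
        Path absolute = new Path("hdfs://localhost/user/tom/quangle.txt");
        // ...or a plain path, which is resolved against the default file system.
        Path relative = new Path("/user/tom/quangle.txt");
        System.out.println(absolute + " and " + relative);
    }
}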
FileSystem is a general file system API, so the first step in using it is to retrieve an instance for the file system we want to use, HDFS in this case. The static factory methods for getting a FileSystem instance are listed below:
public static FileSystem get(Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf, String user) throws IOException
A Configuration object encapsulates the client's or server's configuration, which is set using name-value pairs read from configuration files such as conf/core-site.xml. The three methods work as follows:
1) The first method returns the default file system (specified by fs.default.name in conf/core-site.xml; if nothing is set there, the local file system is returned).
2) The second method uses the given URI's scheme to determine which file system to return (for example, the URI hdfs://localhost/user/tom/quangle.txt from the earlier test begins with the hdfs scheme, so an HDFS file system is returned); if the URI specifies no scheme, the default file system is returned.
3) The third method returns the file system in the same way as (2), but also specifies the user on whose behalf the file system is accessed, which is important in security-sensitive settings (see the sketch after the getLocal() signature below).
Sometimes you might want to use a local file system, and you can use another handy method:
public static LocalFileSystem getLocal(Configuration conf) throws IOException
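As a hedged sketch of how these factory methods might be called (the URIs and user name here are made up for illustration):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;

public class GetFileSystems {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // 1) The default file system, as given by fs.default.name.
        FileSystem defaultFs = FileSystem.get(conf);

        // 2) The file system identified by the URI's scheme and authority.
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://localhost/"), conf);

        // 3) As (2), but acting on behalf of the named user.
        FileSystem asTom = FileSystem.get(URI.create("hdfs://localhost/"), conf, "tom");

        // Convenience method for the local file system.
        LocalFileSystem local = FileSystem.getLocal(conf);

        System.out.println(defaultFs.getUri());
        System.out.println(hdfs.getUri());
        System.out.println(asTom.getUri());
        System.out.println(local.getUri());
    }
}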
Once we have a FileSystem instance, we can call its open() method to get an input stream for a given file (the first form uses a default buffer size of 4 KB):
public FSDataInputStream open(Path f) throws IOException
public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException
By combining the above, we get the following code:
public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
FSDataInputStream
Unlike the URL's openStream() method, which returns an InputStream, FileSystem's open() method returns an FSDataInputStream (inheritance chain: java.io.InputStream -> java.io.FilterInputStream -> java.io.DataInputStream -> org.apache.hadoop.fs.FSDataInputStream). Because FSDataInputStream implements the Closeable, DataInput, PositionedReadable, and Seekable interfaces, you can read data from any position in the stream.
The seek() and getPos() methods of the Seekable interface let us jump to an arbitrary position in the stream and query the current position:
public interface Seekable {
    void seek(long pos) throws IOException;
    long getPos() throws IOException;
}
If you call seek() with a position beyond the end of the file, an IOException is thrown.
Unlike java.io.InputStream's skip() method, which takes a relative offset, seek() takes an absolute position in the file. The code below reads the input file twice by using seek():
public class FileSystemDoubleCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
            in.seek(0); // go back to the start of the file
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
The results of the operation are as follows:
% hadoop FileSystemDoubleCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
FSDataInputStream also implements the PositionedReadable interface, which lets you read a given number of bytes from a given position in the stream:
public interface PositionedReadable {
    public int read(long position, byte[] buffer, int offset, int length) throws IOException;
    public void readFully(long position, byte[] buffer, int offset, int length) throws IOException;
    public void readFully(long position, byte[] buffer) throws IOException;
}
Description: the read() method reads up to length bytes from the given position in the file into the buffer, starting at the given offset. The return value is the number of bytes actually read; callers should check it, because it may be less than length (for example, if the end of the file was reached).
None of these methods changes the stream's current offset, and they are thread-safe, so they provide a convenient way to access another part of the file, such as metadata, while the main body of the file is being read.
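To make the positioned-read calls concrete, here is a minimal sketch in the style of FileSystemCat above (the 16-byte buffer size is arbitrary, and the file URI comes from the command line):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class PositionedReadCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(uri));
            byte[] buffer = new byte[16];
            // Read up to buffer.length bytes starting at byte 0 of the file;
            // a positioned read leaves the stream's own offset untouched.
            int n = in.read(0L, buffer, 0, buffer.length);
            System.out.println("read " + n + " bytes, stream offset is still " + in.getPos());
        } finally {
            IOUtils.closeStream(in);
        }
    }
}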
Finally, bear in mind that calling seek() is a relatively expensive operation and should be done sparingly. Structure your programs around streaming access to the data rather than performing a large number of seeks.
Write Data
The FileSystem class has a number of methods for creating a file. The simplest is create(Path f), which takes a Path object for the file to be created and returns an output stream to write to:
public FSDataOutputStream create(Path f) throws IOException
There are overloaded versions of this method that let you specify whether to overwrite an existing file, the replication factor of the file, the buffer size to use when writing, the block size of the file, file permissions, and so on.
The create() method creates any nonexistent parent directories of the file being written. This is convenient, but it may not be what you want, since you can unintentionally create directories that should not exist. If you want the operation to fail when a parent directory is missing, call the exists() method first to check whether the parent directory is present, and only then call create().
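A minimal sketch of that check (fs is a FileSystem instance obtained as above; the path is hypothetical):

// Fail early if the parent directory is missing, instead of letting
// create() silently build the whole directory tree.
Path file = new Path("/dir/subdir/file");
if (!fs.exists(file.getParent())) {
    throw new FileNotFoundException("Parent directory does not exist: " + file.getParent());
}
FSDataOutputStream out = fs.create(file);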
The create() method also has an overload that takes a callback interface, Progressable, so your application can be notified of the progress of the data being written:
package org.apache.hadoop.util;

public interface Progressable {
    public void progress();
}
As an alternative to creating a new file, we can append data to an existing file using the append() method:
public FSDataOutputStream append(Path f) throws IOException
With append, applications can write to files whose final size is unbounded (such as a log file, where you do not know in advance how much will be written). The append operation is optional for Hadoop file systems; for example, HDFS implements it, but S3 does not.
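A minimal sketch of appending, assuming the file already exists and the underlying file system supports append (fs as in the earlier examples; the path is hypothetical):

Path logFile = new Path("/logs/app.log");
FSDataOutputStream out = fs.append(logFile);
out.write("one more log line\n".getBytes("UTF-8"));
out.close();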
The following example shows how to copy a file from the local file system to HDFS. Hadoop calls the progress() method each time roughly 64 KB of data is written to the datanode pipeline, and each call prints a period:
public class FileCopyWithProgress {
    public static void main(String[] args) throws Exception {
        String localSrc = args[0];
        String dst = args[1];
        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst), new Progressable() {
            public void progress() {
                System.out.print(".");
            }
        });
        IOUtils.copyBytes(in, out, 4096, true);
    }
}
The following is a demonstration of the use of this example:
% hadoop FileCopyWithProgress input/docs/1400-8.txt hdfs://localhost/user/tom/1400-8.txt
...............
Note: at present, none of the Hadoop file systems other than HDFS calls progress() during writes, but it is worth knowing that progress reporting is very important in MapReduce programs.
FSDataOutputStream
The create() method of FileSystem returns an FSDataOutputStream, which, like FSDataInputStream, has a method for querying the current position in the file. However, there is no equivalent of FSDataInputStream's seek() method, because Hadoop does not allow writes at arbitrary positions in a file; we can only append to the end of a file:
package org.apache.hadoop.fs;

public class FSDataOutputStream extends DataOutputStream implements Syncable {
    public long getPos() throws IOException {
        // implementation elided
    }
    // implementation elided
}
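A small sketch of how getPos() reflects the number of bytes written so far (fs as in the earlier examples; the path is hypothetical):

FSDataOutputStream out = fs.create(new Path("/dir/pos-demo"));
out.write("hello".getBytes("UTF-8"));
System.out.println(out.getPos());   // 5: writes only ever move the position forward
out.write(" world".getBytes("UTF-8"));
System.out.println(out.getPos());   // 11
out.close();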
Querying file system metadata: FileStatus
A typical feature of any file system is the ability to navigate its directory structure and retrieve information about the files and directories it stores. Hadoop's FileStatus class encapsulates this metadata for files and directories (file length, block size, replication, modification time, owner, permissions, and so on). FileSystem's getFileStatus() method provides a way of getting a FileStatus object for a given file or directory, as the following test shows:
public class ShowFileStatusTest {
    private MiniDFSCluster cluster; // use an in-process HDFS cluster for testing (this class is deprecated as of Hadoop 1.0.4)
    private FileSystem fs;

    @Before
    public void setUp() throws IOException {
        Configuration conf = new Configuration();
        if (System.getProperty("test.build.data") == null) {
            System.setProperty("test.build.data", "/tmp");
        }
        cluster = new MiniDFSCluster(conf, 1, true, null);
        fs = cluster.getFileSystem();
        OutputStream out = fs.create(new Path("/dir/file"));
        out.write("content".getBytes("UTF-8"));
        out.close();
    }

    @After
    public void tearDown() throws IOException {
        if (fs != null) { fs.close(); }
        if (cluster != null) { cluster.shutdown(); }
    }

    @Test(expected = FileNotFoundException.class)
    public void throwsFileNotFoundForNonExistentFile() throws IOException {
        fs.getFileStatus(new Path("no-such-file"));
    }

    @Test
    public void fileStatusForFile() throws IOException {
        Path file = new Path("/dir/file");
        FileStatus stat = fs.getFileStatus(file);
        assertThat(stat.getPath().toUri().getPath(), is("/dir/file"));
        assertThat(stat.isDir(), is(false));
        assertThat(stat.getLen(), is(7L));
        assertThat(stat.getModificationTime(), lessThanOrEqualTo(System.currentTimeMillis()));
        assertThat(stat.getReplication(), is((short) 1));
        assertThat(stat.getBlockSize(), is(1024L));
        assertThat(stat.getOwner(), is("tom"));
        assertThat(stat.getGroup(), is("supergroup"));
        assertThat(stat.getPermission().toString(), is("rw-r--r--"));
    }

    @Test
    public void fileStatusForDirectory() throws IOException {
        Path dir = new Path("/dir");
        FileStatus stat = fs.getFileStatus(dir);
        assertThat(stat.getPath().toUri().getPath(), is("/dir"));
        assertThat(stat.isDir(), is(true));
        assertThat(stat.getLen(), is(0L));
        assertThat(stat.getModificationTime(), lessThanOrEqualTo(System.currentTimeMillis()));
        assertThat(stat.getReplication(), is((short) 0));
        assertThat(stat.getBlockSize(), is(0L));
        assertThat(stat.getOwner(), is("tom"));
        assertThat(stat.getGroup(), is("supergroup"));
        assertThat(stat.getPermission().toString(), is("rwxr-xr-x"));
    }
}
Listing files
In addition to getting information about a single file or directory, you will often want to list the contents of a directory. That is what FileSystem's listStatus() methods are for:
public FileStatus[] listStatus(Path f) throws IOException
public FileStatus[] listStatus(Path f, PathFilter filter) throws IOException
public FileStatus[] listStatus(Path[] files) throws IOException
public FileStatus[] listStatus(Path[] files, PathFilter filter) throws IOException
When the argument is a file, the method returns the FileStatus object for that file; when the argument is a directory, it returns zero or more FileStatus objects, one for each file or directory contained in it.
The overloaded forms allow you to supply a PathFilter to further restrict which files and directories are matched.
Below we use the listStatus() method to get the metadata for the paths given as arguments (there may be several), store it in a FileStatus array, convert that array to a Path array with the stat2Paths() method, and finally print the file names:
public class ListStatus {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path[] paths = new Path[args.length];
        for (int i = 0; i < paths.length; i++) {
            paths[i] = new Path(args[i]);
        }
        FileStatus[] status = fs.listStatus(paths);
        Path[] listedPaths = FileUtil.stat2Paths(status);
        for (Path p : listedPaths) {
            System.out.println(p);
        }
    }
}
The results of the operation are as follows:
% hadoop ListStatus hdfs://localhost/ hdfs://localhost/user/tom
hdfs://localhost/user
hdfs://localhost/user/tom/books
hdfs://localhost/user/tom/quangle.txt
File patterns
It is common to process a series of files in a single operation. For example, a MapReduce job for log processing might need to analyze a month's worth of log files. Rather than enumerating every file and directory explicitly, we can use wildcard characters to match multiple files, an operation known as globbing. Hadoop provides two methods for processing globs:
public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException
The globStatus() method returns an array of FileStatus objects for the files that match the given pattern, sorted by path. An optional PathFilter can be supplied to restrict the matches further. The glob characters Hadoop supports are the same as those supported by the Unix shell.
Suppose the log files are stored in a directory hierarchy organized by date, with one directory per day, for example /2007/12/30 and /2007/12/31. The glob and filter examples later in this section refer to this layout.
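As a small sketch of globbing against such a layout (only the two December directories mentioned in this section come from the text; everything else is assumed, and fs and conf are obtained as in the earlier examples):

// Matches all day directories under 2007, e.g. /2007/12/30 and /2007/12/31.
FileStatus[] days = fs.globStatus(new Path("/2007/*/*"));
for (Path p : FileUtil.stat2Paths(days)) {
    System.out.println(p);
}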
PathFilter
File patterns are not always powerful enough to describe the set of files you want; for example, it is not generally possible to exclude a particular file using a glob. For this reason, FileSystem's listStatus() and globStatus() methods take an optional PathFilter parameter, which gives you finer-grained, programmatic control over matching:
package org.apache.hadoop.fs;

public interface PathFilter {
    boolean accept(Path path);
}
PathFilter plays the same role as java.io.FileFilter, except that it operates on Path objects rather than File objects. Let's use a PathFilter to exclude paths that match a given regular expression:
public class RegexExcludePathFilter implements PathFilter {
    private final String regex;

    public RegexExcludePathFilter(String regex) {
        this.regex = regex;
    }

    public boolean accept(Path path) {
        return !path.toString().matches(regex);
    }
}
RegexExcludePathFilter lets through only the paths that do not match the given regular expression (see the accept() method). We use a file pattern to pick out an initial set of files, and then use the filter to refine the results:
fs.globStatus(new Path("/2007/*/*"), new RegexExcludePathFilter("^.*/2007/12/31$"))
This expands to the single directory /2007/12/30.
Note: filters can only act on a path name; they cannot use a file's properties (such as modification time or owner) as the basis for filtering. Nevertheless, they provide matching power that neither file patterns nor regular expressions alone can offer.
Delete data
Use the delete() method on FileSystem to permanently remove files or directories:
public boolean delete(Path f, boolean recursive) throws IOException
If f is a file or an empty directory, the value of recursive is ignored. A non-empty directory is deleted, along with its contents, only when recursive is true (otherwise an IOException is thrown).
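A minimal sketch (fs as in the earlier examples; the paths are hypothetical):

// Deleting a single file: the recursive flag is ignored.
fs.delete(new Path("/dir/file"), false);
// Deleting a non-empty directory and everything beneath it.
fs.delete(new Path("/dir"), true);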
Source: http://www.cnblogs.com/beanmoon/archive/2012/12/11/2813235.html