5) The Java Interface
A) Reading Data from a Hadoop URL
b) Although we focus mainly on the HDFS implementation, DistributedFileSystem, in general you should strive to write your code against the FileSystem abstract class, to retain portability across filesystems.
c) One of the simplest ways to read a file from a Hadoop filesystem is by using a java.net.URL object to open a stream to read the data from. The general idiom is:
InputStream in = null;
try {
  in = new URL("hdfs://host/path").openStream();
  // process in
} finally {
  IOUtils.closeStream(in);
}
There's a little bit more work required to make Java recognize Hadoop's hdfs URL scheme. This is achieved by calling the setURLStreamHandlerFactory() method on URL with an instance of FsUrlStreamHandlerFactory. This method can be called only once per JVM, so it is typically executed in a static block.
d) Example 3-1. Displaying files from a Hadoop filesystem on standard output using a URLStreamHandler.
public class URLCat {
  static {
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
  }

  public static void main(String[] args) throws Exception {
    InputStream in = null;
    try {
      in = new URL(args[0]).openStream();
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}

Run:
% hadoop URLCat hdfs://localhost/user/tom/quangle.txt
E) We make use of the handy IOUtils class that comes with Hadoop for closing the stream in the finally clause, and also for copying bytes between the input stream and the output stream (System.out, in this case). The last two arguments to the copyBytes() method are the buffer size used for copying and whether to close the streams when the copy is complete. Here we close the input stream ourselves; System.out doesn't need to be closed.
f) Reading Data Using the FileSystem API
G) FileSystem is a general filesystem API, so the first step is to retrieve an instance for the filesystem we want to use, HDFS in this case. There are several static factory methods for getting a FileSystem instance:
public static FileSystem get(Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf, String user) throws IOException
h) A Configuration object encapsulates a client or server's configuration, which is set using configuration files read from the classpath, such as etc/hadoop/core-site.xml. The first method returns the default filesystem (as specified in core-site.xml, or the default local filesystem if not specified there). The second uses the given URI's scheme and authority to determine the filesystem to use, falling back to the default filesystem if no scheme is specified in the given URI. The third retrieves the filesystem as the given user, which is important in the context of security.
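As a minimal sketch (not one of the book's numbered examples; the URI and username are hypothetical), the three factory methods might be used like this, with the Configuration picked up from the classpath:

Configuration conf = new Configuration(); // reads core-site.xml and friends from the classpath
// 1. The default filesystem, as specified in core-site.xml
FileSystem defaultFs = FileSystem.get(conf);
// 2. The filesystem identified by the URI's scheme and authority (HDFS here)
FileSystem hdfs = FileSystem.get(URI.create("hdfs://localhost/"), conf);
// 3. The same filesystem, accessed as a particular user
FileSystem asTom = FileSystem.get(URI.create("hdfs://localhost/"), conf, "tom");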
j) The open() method on FileSystem actually returns an FSDataInputStream rather than a standard java.io class. This class is a specialization of java.io.DataInputStream with support for random access, so you can read from any part of the stream:
package org.apache.hadoop.fs;

public class FSDataInputStream extends DataInputStream
    implements Seekable, PositionedReadable {
  // implementation elided
}
k) The Seekable interface permits seeking to a position in the file and provides a query method for the current offset from the start of the file (getPos()):
public interface Seekable {
  void seek(long pos) throws IOException;
  long getPos() throws IOException;
}
l) Example 3-3. Displaying files from a Hadoop filesystem on standard output twice, by using seek().
public class FileSystemDoubleCat {
  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    FSDataInputStream in = null;
    try {
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
      in.seek(0); // go back to the start of the file
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}

Run:
% hadoop FileSystemDoubleCat hdfs://localhost/user/tom/quangle.txt
m) Bear in mind that calling seek() is a relatively expensive operation and should be done sparingly. You should structure your application access patterns to rely on streaming data (by using MapReduce, for example) rather than performing a large number of seeks.
N) Writing Data
O) The FileSystem class has a number of methods for creating a file. The simplest is the method that takes a Path object for the file to be created and returns an output stream to write to:
public FSDataOutputStream create(Path f) throws IOException
P) There's also an overloaded method for passing a callback interface, Progressable, so your application can be notified of the progress of the data being written to the datanodes:
package org.apache.hadoop.util;

public interface Progressable {
  public void progress();
}
Q) As an alternative to creating a new file, you can append to an existing file using the append() method (there are also some other overloaded versions):
public FSDataOutputStream append(Path f) throws IOException
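A minimal sketch (not from the book; the path is hypothetical, and append() works only on filesystems that support it, such as HDFS) of adding a line to the end of an existing file:

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create("hdfs://localhost/"), conf);
// append() throws an exception if the filesystem does not support appends
FSDataOutputStream out = fs.append(new Path("/user/tom/log.txt"));
try {
  out.writeBytes("another line\n"); // written at the current end of the file
} finally {
  out.close();
}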
R) Example 3-4. Copying a local file to a Hadoop filesystem.
public class FileCopyWithProgress {
  public static void main(String[] args) throws Exception {
    String localSrc = args[0];
    String dst = args[1];
    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);
    OutputStream out = fs.create(new Path(dst), new Progressable() {
      public void progress() {
        System.out.print(".");
      }
    });
    IOUtils.copyBytes(in, out, 4096, true);
  }
}
s) The create() method on FileSystem returns an FSDataOutputStream, which, like FSDataInputStream, has a method for querying the current position in the file:
package org.apache.hadoop.fs;

public class FSDataOutputStream extends DataOutputStream implements Syncable {
  public long getPos() throws IOException {
    // implementation elided
  }
  // implementation elided
}
However, unlike FSDataInputStream, FSDataOutputStream does not permit seeking. This is because HDFS allows only sequential writes to an open file, or appends to an already written file. In other words, there is no support for writing anywhere other than the end of the file, so there is no value in being able to seek while writing.
T) FileSystem provides a method to create a directory:
public boolean mkdirs(Path f) throws IOException
Often, you don't need to explicitly create a directory, because writing a file by calling create() will automatically create any parent directories.
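For illustration, a short sketch (the directory path is an assumption) showing that mkdirs() creates all missing parent directories in one call and returns true on success:

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create("hdfs://localhost/"), conf);
// creates /user/tom/archive/2024 along with any missing parents
boolean created = fs.mkdirs(new Path("/user/tom/archive/2024"));
System.out.println("created: " + created);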
u) Querying the Filesystem
V) An important feature of any filesystem is the ability to navigate its directory structure and retrieve information about the files and directories that it stores. The FileStatus class encapsulates filesystem metadata for files and directories, including file length, block size, replication, modification time, ownership, and permission information.
W) The getFileStatus() method on FileSystem provides a way of getting a FileStatus object for a single file or directory.
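A brief sketch (the path is hypothetical) of reading back some of the metadata fields listed above from a FileStatus object:

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create("hdfs://localhost/"), conf);
FileStatus stat = fs.getFileStatus(new Path("/user/tom/quangle.txt"));
System.out.println("length: " + stat.getLen());
System.out.println("block size: " + stat.getBlockSize());
System.out.println("replication: " + stat.getReplication());
System.out.println("modification time: " + stat.getModificationTime());
System.out.println("owner: " + stat.getOwner());
System.out.println("permissions: " + stat.getPermission());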
x) Finding information on a single file or directory is useful, but you also often need to be able to list the contents of a directory. That's what FileSystem's listStatus() methods are for:
public FileStatus[] listStatus(Path f) throws IOException
public FileStatus[] listStatus(Path f, PathFilter filter) throws IOException
public FileStatus[] listStatus(Path[] files) throws IOException
public FileStatus[] listStatus(Path[] files, PathFilter filter) throws IOException
When the argument is a file, the simplest variant returns an array of FileStatus objects of length 1. When the argument is a directory, it returns zero or more FileStatus objects representing the files and directories contained in the directory.
Y) Example 3-6. Showing the file statuses for a collection of paths in a Hadoop filesystem.
public class ListStatus {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(args[0]), conf);
    Path[] paths = new Path[args.length];
    for (int i = 0; i < paths.length; i++) {
      paths[i] = new Path(args[i]);
    }
    FileStatus[] status = fs.listStatus(paths);
    Path[] listedPaths = FileUtil.stat2Paths(status);
    for (Path p : listedPaths) {
      System.out.println(p);
    }
  }
}
Z) Rather than having to enumerate each file and directory to specify the input, it is convenient to use wildcard characters to match multiple files with a single expression, an operation that is known as globbing. Hadoop provides two FileSystem methods for processing globs:
public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException
Hadoop supports the same set of glob characters as the Unix bash shell.
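For example, a small sketch (the date-partitioned directory layout is assumed, not taken from this section) that uses a glob to match every day in one month:

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create("hdfs://localhost/"), conf);
// matches paths such as /2007/12/30 and /2007/12/31
FileStatus[] status = fs.globStatus(new Path("/2007/12/*"));
for (Path p : FileUtil.stat2Paths(status)) {
  System.out.println(p);
}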