Hadoop HDFS (3): Accessing HDFS from Java

Source: Internet
Author: User

Now let's take a closer look at Hadoop's FileSystem class, which is used to interact with Hadoop file systems. Although we mainly target HDFS here, our code should be written only against the abstract FileSystem class so that it can work with any Hadoop file system: when writing test code we can run against the local file system, and switch to HDFS at deployment time through configuration alone, with no code changes. Newer Hadoop releases also introduced a file-system interface called FileContext; a single FileContext instance can handle a variety of file systems, and its interface is clearer and more unified.
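To give a feel for it, here is a minimal sketch of reading a file through FileContext. This is only an illustration under stated assumptions: the path is hypothetical and error handling is kept to a minimum.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileContextCat {
    public static void main(String[] args) throws Exception {
        // FileContext resolves the default file system from the configuration (core-site.xml)
        FileContext fc = FileContext.getFileContext(new Configuration());
        FSDataInputStream in = null;
        try {
            // hypothetical path, for illustration only
            in = fc.open(new Path("/user/norris/weatherdata.txt"));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}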
Reading files in HDFS via a Hadoop URL

Before discussing the Java API, let's look at a way to read file data through a Hadoop URL. Strictly speaking this is not Hadoop's Java interface; it uses java.net to send a request to HDFS and then reads the returned stream, much as you would open a stream from an HTTP URL:
InputStream in = null;
try {
    in = new URL("hdfs://host/path").openStream();
    // operate on the input stream "in" here to read the file contents
} finally {
    IOUtils.closeStream(in);
}
This approach has one small problem: by default the Java virtual machine does not know the hdfs:// protocol. To make it aware, you must call URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()) to register a URLStreamHandlerFactory that understands HDFS. However, this method can only take effect once for the whole JVM, so if another part of the program, particularly a third-party framework, has already set the factory, there is a conflict. That is the main limitation of accessing HDFS this way. The complete code is as follows:
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {
    static {
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
In this program we first set the JVM's URLStreamHandlerFactory, then open a stream through the URL and read it to get the contents of the file. IOUtils.copyBytes() copies the contents to the standard output stream, so the console shows the file, giving us functionality similar to the Linux cat command; finally the input stream is closed. IOUtils is a utility class provided by Hadoop; the last two parameters of copyBytes are the buffer size and whether to close the stream when done. We close the stream manually afterwards, so we pass false.

Put the program into a jar package, place it on the Hadoop machine and execute:

$hadoop jar urlcat.jar hdfs://localhost/user/norris/weatherdata.txt

This displays the contents of the weatherdata.txt file on HDFS (from the previous section) on the console. I seem to have forgotten to describe the hadoop jar command in an earlier post; in short, you package the program into a jar, optionally specify a main class, and execute:

$hadoop jar xxx.jar

Or, if you do not build a jar, you can copy the class files up and run hadoop XxxClass. In fact the hadoop command simply starts a JVM, just like running java XxxClass or java -jar xxx.jar, except that it automatically adds the required Hadoop class libraries to the classpath before starting, which saves you the hassle of setting environment variables yourself. Note that if you put your classes in the /home/norris/data/hadoop/bin/ directory, you need to add the following line at the end of $HADOOP_INSTALL/etc/hadoop/hadoop-env.sh so that your class location becomes one of the places Hadoop searches for classes:

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/home/norris/data/hadoop/bin/
Reading files in HDFS with the FileSystem class (org.apache.hadoop.fs.FileSystem)

There is a class called Path (org.apache.hadoop.fs.Path); an instance of this class represents a file (or directory) in HDFS. It can be thought of as the counterpart of java.io.File; it is not named File only because that name sounds like a local file system class, so Path was chosen instead. You can also think of a Path as a URI on HDFS, such as hdfs://localhost/user/norris/weatherdata.txt. The FileSystem class is an abstract class representing any file system that can run on Hadoop. Although we are discussing HDFS here, FileSystem can just as well represent, say, the local file system, so we should write our programs against FileSystem rather than org.apache.hadoop.hdfs.DistributedFileSystem, to make porting to other file systems easy.
The first step is to get a FileSystem instance; the following APIs return one:
public static FileSystem get(Configuration conf) throws IOException;
public static FileSystem get(URI uri, Configuration conf) throws IOException;
public static FileSystem get(final URI uri, final Configuration conf, final String user)
        throws IOException, InterruptedException;
Here, Configuration (org.apache.hadoop.conf.Configuration) is the Hadoop configuration. If you construct it with the default constructor, it reads core-site.xml; the earlier "Configuring Hadoop" section (http://blog.csdn.net/norriszhang/article/details/38659321) describes this configuration and sets the default file system to HDFS, so from it Hadoop knows to return a DistributedFileSystem (org.apache.hadoop.hdfs.DistributedFileSystem) instance. The URI is the path of a file stored in HDFS. The String user parameter will be discussed in the "Security" section.

Even when working with HDFS you may want to access the local file system at the same time, which is easily done through the following API:
public static LocalFileSystem getLocal(Configuration conf) throws IOException;
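For example, here is a minimal sketch showing that the same abstract FileSystem type can refer either to the file system configured in core-site.xml (HDFS in our setup) or to the local file system; the printed URIs in the comments are only what one would expect under those assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;

public class LocalVsHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The default file system from core-site.xml (HDFS in our setup)
        FileSystem fs = FileSystem.get(conf);

        // The local file system, regardless of what core-site.xml says
        LocalFileSystem local = FileSystem.getLocal(conf);

        // Both can be used through the same abstract FileSystem type,
        // so code written against FileSystem runs unchanged on either.
        System.out.println(fs.getUri());     // e.g. hdfs://localhost
        System.out.println(local.getUri());  // file:///
    }
}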

An input stream can be obtained by calling the open method on the FileSystem instance:
public FSDataInputStream open(Path f) throws IOException;
public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException;
Here int bufferSize is the buffer size; if the parameter is omitted, the default is 4 KB. The following is a complete program that uses FileSystem to read the contents of a file in HDFS:
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        // Here you can see that the instance obtained is DistributedFileSystem,
        // because core-site.xml configures HDFS as the default file system.
        System.out.println(fs.getClass().getName());
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
Package it into a jar and run it on Hadoop:

$hadoop jar filesystemcat.jar hdfs://localhost/user/norris/weatherdata.txt

The contents of the weatherdata.txt file are printed to the console.
Now look at the API. The return value of fs.open is an FSDataInputStream, not a stream from the java.io package. This stream extends java.io.DataInputStream and implements Seekable (org.apache.hadoop.fs.Seekable), which means its seek() method can be called to read from an arbitrary position in the file. For example, adding these two lines to the program above:
in.seek(0);
IOUtils.copyBytes(in, System.out, 4096, false);
prints the contents of the file twice. In addition, FSDataInputStream implements the PositionedReadable (org.apache.hadoop.fs.PositionedReadable) interface; the three methods defined in that interface allow the file to be read at any position:
public int read(long position, byte[] buffer, int offset, int length) throws IOException;
public void readFully(long position, byte[] buffer, int offset, int length) throws IOException;
public void readFully(long position, byte[] buffer) throws IOException;
These methods are thread-safe, but FSDataInputStream itself is not. More precisely, it is safe for different FSDataInputStream instances to read a file in different threads, but it is not safe for one FSDataInputStream instance to be called from multiple threads; each thread should therefore create its own FSDataInputStream instance. Finally, seek() is a fairly expensive operation and should be called as rarely as possible; your program should be designed to read the file as a stream whenever possible.
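To close out the reading API, here is a small sketch of a positioned read; it is only an illustration, the path is hypothetical and the file is assumed to be at least 116 bytes long.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PositionedReadDemo {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://localhost/user/norris/weatherdata.txt"; // hypothetical path
        FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(uri));
            byte[] buf = new byte[16];
            // Read 16 bytes starting at offset 100; the stream's own position is not moved.
            in.readFully(100, buf);
            System.out.println(new String(buf, "UTF-8"));
        } finally {
            if (in != null) {
                in.close();
            }
        }
    }
}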
Writing files to HDFS with the FileSystem class

The FileSystem class provides a whole family of APIs for creating files; the simplest is:
public FSDataOutputStream create(Path f) throws IOException;
This creates the file represented by the Path and returns an output stream. The method has many overloads that let you set whether to overwrite an existing file, the replication factor, the write buffer size, the block size, permissions, and so on. If the Path's parent directory (or even its grandparent) does not exist, those directories are created automatically. That is convenient in many cases, but sometimes it is not what we want; if the file really should not be created when its parent directory is missing, you should check whether the parent directory exists first (a small sketch of this check follows the Progressable interface below). The original API had a createNonRecursive method that fails if the parent directory does not exist, but that group of methods has been deprecated and is no longer recommended. One important overload of create is:
public FSDataOutputStream create(Path f, Progressable progress) throws IOException;
The progress() method of the Progressable interface is called back as data is written:
public interface Progressable {
    public void progress();
}
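And here is the parent-directory check promised above, as a minimal sketch under stated assumptions: the path is hypothetical, and the point is simply to refuse to create the file when its parent directory is missing instead of letting create() silently build the whole directory chain.

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithParentCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost/"), new Configuration());
        Path dst = new Path("/user/norris/reports/2014/summary.txt"); // hypothetical path
        // Fail early if the parent directory is missing, rather than letting create() make it.
        if (!fs.exists(dst.getParent())) {
            throw new IOException("parent directory does not exist: " + dst.getParent());
        }
        FSDataOutputStream out = fs.create(dst);
        out.write("hello".getBytes("UTF-8"));
        out.close();
    }
}

Note that there is still a small race window between the exists() check and the create() call; this is only a sketch of the idea.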
If, rather than creating a new file, you want to append content to an existing one, you can call:
public FSDataOutputStream append(Path f) throws IOException;
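A minimal sketch of appending, assuming the target file already exists on HDFS (the path is hypothetical):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendDemo {
    public static void main(String[] args) throws Exception {
        // hypothetical path; the file must already exist for append() to succeed
        String uri = "hdfs://localhost/user/norris/log.txt";
        FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
        FSDataOutputStream out = fs.append(new Path(uri));
        out.write("one more line\n".getBytes("UTF-8"));
        out.close();
    }
}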
append() is an optional operation: not every Hadoop FileSystem implements it. HDFS does, for example, while the S3 file system does not. Also, the HDFS implementation only became stable in releases after 1.x; earlier implementations were problematic. The following program uploads a local file to HDFS and prints a dot (.) each time progress() is invoked.
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

public class FileCopyWithProgress {
    public static void main(String[] args) throws Exception {
        String localSrc = args[0];
        String dst = args[1];

        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);

        OutputStream out = fs.create(new Path(dst), new Progressable() {
            @Override
            public void progress() {
                System.out.print(".");
                // try {
                //     Thread.sleep(1000); // the sleep duration is garbled in the source; 1000 ms assumed
                // } catch (Exception e) {
                //     e.printStackTrace();
                // }
            }
        });

        IOUtils.copyBytes(in, out, 4096, true);
        System.out.println();
        System.out.println("end.");
    }
}
Under what conditions is progress() called? The API does not say, so you cannot assume it will be called at any particular moment. In my tests it was called roughly once per 64 KB written, but the results were not entirely consistent, especially for small files, where it is even less predictable. If you uncomment the Thread.sleep() call in the program above, you will see that the main program waits while progress() runs. If you print the thread IDs, every call to progress() comes from the same thread, which is not the main thread; yet the main thread's "end." is not printed until all the dots have been printed. The file I copied is 4,743,680 bytes, about 4.5 MB, and 73 dots were printed, which averages out to one dot per 64 KB. At present only HDFS calls back progress() while writing to a file; no other FileSystem implements the callback. The callback becomes useful later when writing MapReduce programs.
The FSDataOutputStream class returned by FileSystem's create method has a method:
public long getPos() throws IOException;
It returns the current write position in the file, i.e. the write offset. Unlike FSDataInputStream, however, FSDataOutputStream has no seek method, because HDFS only allows sequential writes: an open file can be appended to at its tail but cannot be written at arbitrary positions, so there is no need to provide seek.
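A small sketch of getPos() in action (the path is hypothetical); the position simply tracks the number of bytes written so far:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GetPosDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost/"), new Configuration());
        FSDataOutputStream out = fs.create(new Path("/user/norris/pos-demo.txt")); // hypothetical path
        out.write("hello".getBytes("UTF-8"));
        System.out.println(out.getPos()); // expected: 5, the number of bytes written so far
        out.write(" world".getBytes("UTF-8"));
        System.out.println(out.getPos()); // expected: 11
        out.close();
    }
}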
Creating a directory

FileSystem's method for creating a directory is public boolean mkdirs(Path f) throws IOException;. Like java.io.File.mkdirs, it creates the directory together with any missing parent directories. We rarely need to create directories explicitly, because the directories a file needs are created by default when the file is created.
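A minimal sketch (the path is hypothetical):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MkdirsDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost/"), new Configuration());
        // Creates /user/norris/a/b/c and any missing parents in one call (hypothetical path).
        boolean ok = fs.mkdirs(new Path("/user/norris/a/b/c"));
        System.out.println("created: " + ok);
    }
}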
Querying file metadata: FileStatus (org.apache.hadoop.fs.FileStatus)

The getFileStatus() method of the FileSystem class returns a FileStatus instance containing metadata about a path (file or directory): file length, block size, replication factor, last modification time, owner, permissions, and so on. Here is a simple program; the code comes first, with the explanation afterwards.
import static org.junit.Assert.*;
import static org.hamcrest.CoreMatchers.*;

import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class ShowFileStatusTest {
    private static final String SYSPROP_KEY = "test.build.data";

    /** MiniDFSCluster (in hadoop-hdfs-2.4.1-tests.jar) is an in-process HDFS cluster dedicated to testing. */
    private MiniDFSCluster cluster;
    private FileSystem fs;

    @Before
    public void setUp() throws IOException {
        Configuration conf = new Configuration();
        String sysprop = System.getProperty(SYSPROP_KEY);
        if (sysprop == null) {
            System.setProperty(SYSPROP_KEY, "/tmp");
        }
        cluster = new MiniDFSCluster(conf, 1, true, null);
        fs = cluster.getFileSystem();
        OutputStream out = fs.create(new Path("/dir/file"));
        out.write("content".getBytes("UTF-8"));
        out.close();
    }

    @After
    public void tearDown() throws IOException {
        if (fs != null) {
            fs.close();
        }
        if (cluster != null) {
            cluster.shutdown();
        }
    }

    @Test(expected = FileNotFoundException.class)
    public void throwsFileNotFoundForNonExistentFile() throws IOException {
        fs.getFileStatus(new Path("no-such-file"));
    }

    @Test
    public void fileStatusForFile() throws IOException {
        Path file = new Path("/dir/file");
        FileStatus stat = fs.getFileStatus(file);
        assertThat(stat.getPath().toUri().getPath(), is("/dir/file"));
        assertThat(stat.isDirectory(), is(false));
        assertThat(stat.getLen(), is(7L));
        assertTrue(stat.getModificationTime() <= System.currentTimeMillis());
        assertThat(stat.getReplication(), is((short) 1));
        assertThat(stat.getBlockSize(), is(64 * 1024 * 1024L));
        assertThat(stat.getOwner(), is("norris"));
        assertThat(stat.getGroup(), is("supergroup"));
        assertThat(stat.getPermission().toString(), is("rw-r--r--"));
    }

    @Test
    public void fileStatusForDirectory() throws IOException {
        Path dir = new Path("/dir");
        FileStatus stat = fs.getFileStatus(dir);
        assertThat(stat.getPath().toUri().getPath(), is("/dir"));
        assertThat(stat.isDirectory(), is(true));
        assertThat(stat.getLen(), is(0L));
        assertTrue(stat.getModificationTime() <= System.currentTimeMillis());
        assertThat(stat.getReplication(), is((short) 0));
        assertThat(stat.getBlockSize(), is(0L));
        assertThat(stat.getOwner(), is("norris"));
        assertThat(stat.getGroup(), is("supergroup"));
        assertThat(stat.getPermission().toString(), is("rwxr-xr-x"));
    }
}
Compiling the program requires hadoop-hdfs-2.4.1-tests.jar and hadoop-hdfs-2.4.1.jar, both of which can be found in the Hadoop distribution. The MiniDFSCluster class lives in hadoop-hdfs-2.4.1-tests.jar; it is an in-process HDFS cluster dedicated to testing. The rest of the code should be clear from the names, but the JUnit assertThat call confused me for a long time: I could not work out which class its matchers come from, and every example online simply used them as if writing them were enough, yet my code would not compile. Only later did I learn that is() is a static method of the org.hamcrest.CoreMatchers class and must be statically imported. There is also a lessThanOrEqualTo matcher that I could not find anywhere (it lives in the separate hamcrest-library Matchers class, not in CoreMatchers), so I used assertTrue instead.
The program needs to be launched with JUnit, since it is a test case. Put junit.jar and org.hamcrest.core_1.3.0.v201303031735.jar on the server and modify hadoop-env.sh, extending the line we added earlier:

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/home/norris/data/hadoop/bin/

by appending the two jars, i.e.:

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/home/norris/data/hadoop/bin/:/home/norris/data/hadoop/lib/junit.jar:/home/norris/data/hadoop/lib/org.hamcrest.core_1.3.0.v201303031735.jar
Instead of running our class directly, we run JUnit and let it test our class:

$hadoop org.junit.runner.JUnitCore ShowFileStatusTest

The run shows that three tests were executed: two succeeded and one failed. The failing line is

assertThat(stat.getBlockSize(), is(64 * 1024 * 1024L));

The expected value is 67108864 (64 MB), while the actual value is 134217728 (128 MB). In other words, the default HDFS block size is 128 MB. I am not sure whether this is because newer Hadoop versions changed the default block size or because I recompiled Hadoop on a 64-bit system. If you set the block size to 64 MB in hdfs-site.xml:

<property>
  <name>dfs.block.size</name>
  <value>67108864</value>
</property>

all three tests succeed.
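As an alternative sketch (not part of the original test), the same setting can be applied programmatically to the Configuration passed to MiniDFSCluster in setUp(), so the test does not depend on hdfs-site.xml:

@Before
public void setUp() throws IOException {
    Configuration conf = new Configuration();
    // Force a 64 MB block size for the in-process test cluster.
    // "dfs.block.size" is the legacy property name; newer releases also accept "dfs.blocksize".
    conf.setLong("dfs.block.size", 64 * 1024 * 1024L);
    String sysprop = System.getProperty(SYSPROP_KEY);
    if (sysprop == null) {
        System.setProperty(SYSPROP_KEY, "/tmp");
    }
    cluster = new MiniDFSCluster(conf, 1, true, null);
    fs = cluster.getFileSystem();
    OutputStream out = fs.create(new Path("/dir/file"));
    out.write("content".getBytes("UTF-8"));
    out.close();
}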
