Hadoop In-depth Study (2): Accessing HDFS from Java


When reprinting, please credit the source: http://blog.csdn.net/lastsweetop/article/details/9001467

All of the source code is on GitHub: https://github.com/lastsweetop/styhadoop

Reading data with a Hadoop URL

A simple way to read HDFS data is to open a stream through java.net.URL. Before doing so, URL.setURLStreamHandlerFactory must be set to an FsUrlStreamHandlerFactory (the factory that knows how to handle hdfs:// URLs). This method can only be called once per JVM, so it is invoked in a static block. IOUtils.copyBytes is then called to copy the HDFS data stream to the standard output stream System.out. The first two parameters are easy to understand (one input, one output), the third is the buffer size, and the fourth specifies whether to close the streams once the copy completes. We set it to false so that System.out is not closed, and we close the input stream manually.

package com.sweetop.styhadoop;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

import java.io.InputStream;
import java.net.URL;

public class URLCat {

    static {
        // may only be called once per JVM, hence the static block
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}

Reading data with the FileSystem API

First instantiate a FileSystem object by calling the FileSystem class's static get method, passing in a java.net.URI and a Configuration. The FileSystem can then open an input stream from a Path object; the rest is the same as the previous example.
package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.InputStream;
import java.net.URI;

public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}

FSDataInputStream

The object returned by FileSystem's open method is an FSDataInputStream, which implements the Seekable interface:
public interface Seekable {
    void seek(long pos) throws java.io.IOException;
    long getPos() throws java.io.IOException;
    boolean seekToNewSource(long targetPos) throws java.io.IOException;
}
The seek method can jump to any position in the file. In the following example we jump back to the start of the file and read it through a second time.
public class FileSystemDoubleCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
            in.seek(0);
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
FSDataInputStream also implements the PositionedReadable interface:
public interface PositionedReadable {
    int read(long position, byte[] buffer, int offset, int length) throws java.io.IOException;
    void readFully(long position, byte[] buffer, int offset, int length) throws java.io.IOException;
    void readFully(long position, byte[] buffer) throws java.io.IOException;
}
These methods read length bytes (the fourth argument) from an arbitrary position in the file (the first argument) into the given array (the second argument) starting at the given offset (the third argument), without moving the stream's current position. A small sketch follows.
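As a rough illustration (this sketch is not in the original post, and the class name PositionedReadCat is made up here), readFully can pull a fixed number of bytes from the start of a file without disturbing the stream position:

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.net.URI;

public class PositionedReadCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(uri));
            // read 16 bytes from position 0 into buffer[0..16) without
            // changing the stream's current position (assumes the file is
            // at least 16 bytes long, otherwise readFully throws EOFException)
            byte[] buffer = new byte[16];
            in.readFully(0, buffer, 0, buffer.length);
            System.out.println(new String(buffer, "UTF-8"));
        } finally {
            IOUtils.closeStream(in);
        }
    }
}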
You can experiment with the other overloads yourself.

Write data

There are many ways to create a file through the FileSystem class; the simplest is:

create(Path f) throws IOException

It also has many overloaded versions that let you specify whether to overwrite an existing file, the replication factor, the write buffer size, the block size, the file permissions, and so on. You can also pass in a callback interface:
public interface Progressable {
    void progress();
}
Like an ordinary file system, HDFS also supports appending to files; writing logs is the most common use:

append(Path f) throws IOException

Not every Hadoop file system supports append: HDFS does, but S3 does not. A small sketch of appending is shown below.
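As a rough sketch (not in the original post; the class name AppendToFile is made up, and it assumes the target file already exists on a file system that supports append):

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.OutputStream;
import java.net.URI;

public class AppendToFile {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        // append returns an output stream positioned at the end of the
        // existing file; it throws an exception on file systems that do
        // not support append
        OutputStream out = fs.append(new Path(uri));
        out.write("one more log line\n".getBytes("UTF-8"));
        IOUtils.closeStream(out);
    }
}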
The next example copies a local file to HDFS, printing a dot each time Hadoop reports progress:
package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

public class FileCopyWithProgress {
    public static void main(String[] args) throws Exception {
        String localSrc = args[0];
        String dst = args[1];

        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst), new Progressable() {
            @Override
            public void progress() {
                System.out.print(".");
            }
        });
        IOUtils.copyBytes(in, out, 4096, true);
    }
}
Directory

Creating a directory:

mkdirs(Path f) throws IOException

The mkdirs method automatically creates any parent directories that do not yet exist; a small sketch follows.
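As a rough illustration (not in the original post; the class name MakeDirs is made up):

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class MakeDirs {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        // creates the directory and every missing parent, like mkdir -p;
        // returns true if the directory exists afterwards
        boolean created = fs.mkdirs(new Path(uri));
        System.out.println("created = " + created);
    }
}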
Retrieve

Retrieving directories and viewing information about directories and files is an essential feature of any file system, and HDFS is no exception, though it has a few points of its own:
FileStatus

FileStatus encapsulates the metadata of HDFS files and directories, including file length, block size, replication factor, modification time, owner, permissions, and so on. FileSystem's getFileStatus method returns this information:
package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.net.URI;

public class ShowFileStatus {
    public static void main(String[] args) throws IOException {
        Path path = new Path(args[0]);
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(args[0]), conf);
        FileStatus status = fs.getFileStatus(path);
        System.out.println("path = " + status.getPath());
        System.out.println("owner = " + status.getOwner());
        System.out.println("block size = " + status.getBlockSize());
        System.out.println("permission = " + status.getPermission());
        System.out.println("replication = " + status.getReplication());
    }
}

Listing files

Sometimes you need to find a set of files that meet certain criteria. The example below does this with FileSystem's listStatus method, which returns the matching FileStatus objects; listStatus has several overloads, can take multiple paths, and can also filter with a PathFilter, which we discuss below. Another important method is FileUtil.stat2Paths, which converts an array of FileStatus objects into an array of Path objects and is very handy.

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.net.URI;

public class ListStatus {
    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        Path[] paths = new Path[args.length];
        for (int i = 0; i < paths.length; i++) {
            paths[i] = new Path(args[i]);
        }
        FileStatus[] status = fs.listStatus(paths);
        Path[] listedPaths = FileUtil.stat2Paths(status);
        for (Path p : listedPaths) {
            System.out.println(p);
        }
    }
}
PathFilter

Next, the PathFilter interface. It has only one method to implement, accept, which returns true for paths that should be kept. Here we implement a filter that excludes paths matching a regular expression; it is put to work in the example that follows.
package com.sweetop.styhadoop;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class RegexExludePathFilter implements PathFilter {

    private final String regex;

    public RegexExludePathFilter(String regex) {
        this.regex = regex;
    }

    @Override
    public boolean accept(Path path) {
        return !path.toString().matches(regex);
    }
}

File patterns

When you need many files at once, listing paths one by one is inconvenient. Hadoop provides wildcard (glob) listing through FileSystem's globStatus method; globStatus also has an overload that takes a PathFilter, so the following example combines the two.
package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.net.URI;

public class GlobStatus {
    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        FileStatus[] status = fs.globStatus(new Path(uri), new RegexExludePathFilter("^.*/1901"));
        Path[] listedPaths = FileUtil.stat2Paths(status);
        for (Path p : listedPaths) {
            System.out.println(p);
        }
    }
}
Delete data

Deleting data is even simpler:
delete(Path f, boolean recursive) throws IOException
The first parameter is self-explanatory. The second indicates whether to delete recursively: it is ignored when the path is a file or an empty directory, but if the path is a non-empty directory and recursive is false, the delete throws an IOException. A small sketch follows.
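As a rough illustration (not in the original post; the class name DeletePath is made up):

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class DeletePath {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        // with recursive = true a non-empty directory and its contents are
        // removed; with false, deleting a non-empty directory throws IOException
        boolean deleted = fs.delete(new Path(uri), true);
        System.out.println("deleted = " + deleted);
    }
}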


Thanks to Tom White: this article draws mostly on his Hadoop: The Definitive Guide. The Chinese translation of that book is poor, so these notes were written from the English original and the official documentation, with some of my own understanding added. They are only reading notes.
