In-depth Hadoop Research (2): Accessing HDFS through Java


Please indicate the source when reprinting: http://blog.csdn.net/lastsweetop/article/details/9001467

All source code is on GitHub: https://github.com/lastsweetop/styhadoop

Reading data through a Hadoop URL: a simple way to read HDFS data is to open a stream from a java.net.URL. Before doing so, you must call URL.setURLStreamHandlerFactory and pass it an FsUrlStreamHandlerFactory (this factory supplies the stream handler that understands the hdfs:// scheme). The method can be called only once per JVM, so it must go in a static block. Then call IOUtils.copyBytes to copy the HDFS data stream to the standard output stream System.out. The first two parameters of copyBytes are easy to understand: one input stream and one output stream. The third is the buffer size, and the fourth specifies whether to close the streams once the copy is finished. We set it to false so that the standard output stream stays open, and close the input stream manually instead.
package com.sweetop.styhadoop;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

import java.io.InputStream;
import java.net.URL;

public class URLCat {
    static {
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}

To read data with the FileSystem API, first instantiate a FileSystem object using the static FileSystem.get method, which takes a java.net.URI and a Configuration. The FileSystem can then open a stream from a Path object, and the subsequent operations are the same as in the preceding example.

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.InputStream;
import java.net.URI;

public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
FSDataInputStream: the object returned by opening a stream through FileSystem is an FSDataInputStream, which implements the Seekable interface:
public interface Seekable {
    void seek(long pos) throws java.io.IOException;
    long getPos() throws java.io.IOException;
    boolean seekToNewSource(long targetPos) throws java.io.IOException;
}

The seek method can jump to any position in the file. In the following example we seek back to the beginning of the file and read it a second time.

public class FileSystemDoubleCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
            in.seek(0);
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}

FSDataInputStream also implements the PositionedReadable interface:

public interface PositionedReadable {
    int read(long position, byte[] buffer, int offset, int length) throws java.io.IOException;
    void readFully(long position, byte[] buffer, int offset, int length) throws java.io.IOException;
    void readFully(long position, byte[] buffer) throws java.io.IOException;
}

These methods read length bytes (the fourth parameter) starting from an arbitrary position in the file (the first parameter) into the buffer array (the second parameter) at the given offset (the third parameter), without moving the stream's current position. No full example is given here, so you can try it yourself; a minimal sketch follows.
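Here is a rough sketch of what a positioned read could look like (not from the original post: the class name PositionedReadCat, the offset 1024, and the 16-byte buffer are made up for illustration, and the target file is assumed to be at least that long):

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.net.URI;

public class PositionedReadCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(uri));
            byte[] buffer = new byte[16];
            // read 16 bytes starting at file offset 1024 into buffer[0..16),
            // without moving the stream's current position; throws EOFException
            // if the file is shorter than offset + length
            in.readFully(1024, buffer, 0, buffer.length);
            System.out.write(buffer);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}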


Writing data: the FileSystem class has many methods for creating files. The simplest one is

public FSDataOutputStream create(Path f) throws IOException

There are also many overloaded methods that let you specify whether to forcibly overwrite an existing file, the replication factor, the write buffer size, the block size, and the file permissions, as sketched below.
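For instance, a hedged sketch using one of those overloads might look like the following; the class name CreateWithOptions and the concrete values (overwrite flag, 4 KB buffer, replication 3, 64 MB block size) are illustrative assumptions, not taken from the original post:

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class CreateWithOptions {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        // overwrite = true, buffer size = 4096 bytes,
        // replication = 3, block size = 64 MB
        FSDataOutputStream out = fs.create(new Path(uri), true, 4096,
                (short) 3, 64L * 1024 * 1024);
        try {
            out.write("hello hdfs\n".getBytes("UTF-8"));
        } finally {
            out.close();
        }
    }
}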

You can also specify a callback interface:
public interface Progressable {
    void progress();
}

Like ordinary file systems, append operations are also supported.

public FSDataOutputStream append(Path f) throws IOException
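A minimal sketch of using append might look like this, assuming the target file already exists and the underlying filesystem supports appends; the class name AppendToFile and the appended text are made up:

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class AppendToFile {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
        // append() fails if the file does not exist or the
        // filesystem does not support appends
        FSDataOutputStream out = fs.append(new Path(uri));
        try {
            out.write("one more line\n".getBytes("UTF-8"));
        } finally {
            out.close();
        }
    }
}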
Not all Hadoop filesystems support append: HDFS does, but S3 does not. The following is an example of copying a local file to HDFS:
package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

public class FileCopyWithProgress {
    public static void main(String[] args) throws Exception {
        String localSrc = args[0];
        String dst = args[1];
        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst), new Progressable() {
            @Override
            public void progress() {
                System.out.print(".");
            }
        });
        IOUtils.copyBytes(in, out, 4096, true);
    }
}
To create a directory:
public boolean mkdirs(Path f) throws IOException
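As a quick hedged sketch (the class name MakeDirs is made up; the directory path comes from the command line):

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class MakeDirs {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
        // creates the directory and any missing parent directories
        boolean created = fs.mkdirs(new Path(uri));
        System.out.println("created = " + created);
    }
}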

The mkdirs method automatically creates all nonexistent parent directories. Retrieving directory and file information: viewing directory and file metadata is an indispensable feature of any file system, and HDFS is no exception. FileStatus encapsulates the metadata of HDFS files and directories, including the file length, block size, replication factor, modification time, owner, and permissions. A FileStatus object can be obtained through the getFileStatus method of FileSystem:

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.net.URI;

public class ShowFileStatus {
    public static void main(String[] args) throws IOException {
        Path path = new Path(args[0]);
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(args[0]), conf);
        FileStatus status = fs.getFileStatus(path);
        System.out.println("path = " + status.getPath());
        System.out.println("owner = " + status.getOwner());
        System.out.println("block size = " + status.getBlockSize());
        System.out.println("permission = " + status.getPermission());
        System.out.println("replication = " + status.getReplication());
    }
}

Listing files: sometimes you need to find a set of files that meet certain criteria, and the following example shows how. An array of matching FileStatus objects can be obtained through the listStatus method of FileSystem. listStatus has several overloads: you can pass in multiple paths, and you can filter with a PathFilter, which we discuss below. Another important method is FileUtil.stat2Paths, which converts an array of FileStatus objects into an array of Path objects; it is very convenient.

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.net.URI;

public class ListStatus {
    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path[] paths = new Path[args.length];
        for (int i = 0; i < paths.length; i++) {
            paths[i] = new Path(args[i]);
        }
        FileStatus[] status = fs.listStatus(paths);
        Path[] listedPaths = FileUtil.stat2Paths(status);
        for (Path p : listedPaths) {
            System.out.println(p);
        }
    }
}
PathFilter: now let's look at the PathFilter interface mentioned above. It requires implementing only one method, accept: a path is kept only when accept returns true, so returning false filters it out. The following example implements a regular-expression exclusion filter.
package com.sweetop.styhadoop;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class RegexExludePathFilter implements PathFilter {
    private final String regex;

    public RegexExludePathFilter(String regex) {
        this.regex = regex;
    }

    @Override
    public boolean accept(Path path) {
        return !path.toString().matches(regex);
    }
}

File patterns: when many files are needed, it is inconvenient to list the paths one by one, so HDFS supports wildcards for listing files through the globStatus method of FileSystem. globStatus also has overloads; by combining a glob pattern with a PathFilter, we have two ways of restricting the result, as in the following example.

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.net.URI;

public class GlobStatus {
    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FileStatus[] status = fs.globStatus(new Path(uri),
                new RegexExludePathFilter("^.*/1901"));
        Path[] listedPaths = FileUtil.stat2Paths(status);
        for (Path p : listedPaths) {
            System.out.println(p);
        }
    }
}
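For contrast, a hedged sketch of globbing without a PathFilter might look like this; the class name GlobOnly and the example pattern in the comment are assumptions, not part of the original post:

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class GlobOnly {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
        // e.g. pass a glob such as "/data/19*" to match /data/1901, /data/1902, ...
        FileStatus[] status = fs.globStatus(new Path(uri));
        for (Path p : FileUtil.stat2Paths(status)) {
            System.out.println(p);
        }
    }
}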
Data deletion is relatively simple:
public abstract boolean delete(Path f, boolean recursive) throws IOException
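A minimal hedged sketch (the class name DeletePath is made up; whether recursive is actually needed depends on the path, as explained below):

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class DeletePath {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
        // recursive = true is only required when deleting a non-empty directory
        boolean deleted = fs.delete(new Path(uri), true);
        System.out.println("deleted = " + deleted);
    }
}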

The first parameter is self-explanatory. The second indicates whether to recursively delete the contents of subdirectories. When the path is a file, or a directory that happens to be empty, the recursive flag can be ignored; but if the path is a non-empty directory and recursive is false, deleting it throws an IOException.

Thanks to Tom White: most of this article comes from his Definitive Guide. Since the Chinese translation is rather poor, I added my own understanding of the official documentation on top of the original English edition. Consider these reading notes.
