In-depth Hadoop Research (2): Accessing HDFS through Java


Please indicate the source when reprinting: http://blog.csdn.net/lastsweetop/article/details/9001467

All source code is on GitHub: https://github.com/lastsweetop/styhadoop

Reading data through a Hadoop URL: a simple way to read HDFS data is to open a stream from a java.net.URL. Before doing so, you must call URL.setURLStreamHandlerFactory and pass it an FsUrlStreamHandlerFactory (this factory supplies the stream handler that understands the hdfs:// scheme). The method can be called only once per JVM, so it must go in a static block. Then call IOUtils.copyBytes to copy the HDFS data stream to the standard output stream System.out. The first two parameters of copyBytes are easy to understand: one input stream and one output stream. The third is the buffer size, and the fourth specifies whether to close the streams once the copy is finished. We set it to false so that the standard output stream stays open, and close the input stream manually instead.
package com.sweetop.styhadoop;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

import java.io.InputStream;
import java.net.URL;

public class URLCat {
    static {
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}

To read data with the FileSystem API, first instantiate a FileSystem object using the static FileSystem.get method, which takes a java.net.URI and a Configuration. The FileSystem can then open a stream from a Path object, and the subsequent operations are the same as in the preceding example.

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.InputStream;
import java.net.URI;

public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
FSDataInputStream: the object returned by opening a stream through FileSystem is an FSDataInputStream, which implements the Seekable interface:
public interface Seekable {
    void seek(long pos) throws java.io.IOException;
    long getPos() throws java.io.IOException;
    boolean seekToNewSource(long targetPos) throws java.io.IOException;
}

The seek method can jump to any position in the file. In the following example we seek back to the beginning of the file and read it a second time.

public class FileSystemDoubleCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
            in.seek(0);
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}

FSDataInputStream also implements the PositionedReadable interface:

public interface PositionedReadable {
    int read(long position, byte[] buffer, int offset, int length) throws java.io.IOException;
    void readFully(long position, byte[] buffer, int offset, int length) throws java.io.IOException;
    void readFully(long position, byte[] buffer) throws java.io.IOException;
}

These methods read length bytes (the fourth parameter) starting from an arbitrary position in the file (the first parameter) into the buffer array (the second parameter) at the given offset (the third parameter), without moving the stream's current position. No full example is given here, so you can try it yourself; a minimal sketch follows.
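Here is a rough sketch of what a positioned read could look like (not from the original post: the class name PositionedReadCat, the offset 1024, and the 16-byte buffer are made up for illustration, and the target file is assumed to be at least that long):

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.net.URI;

public class PositionedReadCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(uri));
            byte[] buffer = new byte[16];
            // read 16 bytes starting at file offset 1024 into buffer[0..16),
            // without moving the stream's current position; throws EOFException
            // if the file is shorter than offset + length
            in.readFully(1024, buffer, 0, buffer.length);
            System.out.write(buffer);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}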


Writing data: the FileSystem class has many methods for creating files. The simplest one is

public FSDataOutputStream create(Path f) throws IOException

There are also many overloaded methods that let you specify whether to forcibly overwrite an existing file, the replication factor, the write buffer size, the block size, and the file permissions, as sketched below.
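For instance, a hedged sketch using one of those overloads might look like the following; the class name CreateWithOptions and the concrete values (overwrite flag, 4 KB buffer, replication 3, 64 MB block size) are illustrative assumptions, not taken from the original post:

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class CreateWithOptions {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        // overwrite = true, buffer size = 4096 bytes,
        // replication = 3, block size = 64 MB
        FSDataOutputStream out = fs.create(new Path(uri), true, 4096,
                (short) 3, 64L * 1024 * 1024);
        try {
            out.write("hello hdfs\n".getBytes("UTF-8"));
        } finally {
            out.close();
        }
    }
}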

You can also specify a callback interface:
public interface Progressable {
    void progress();
}

Like ordinary file systems, append operations are also supported.

public FSDataOutputStream append(Path f) throws IOException
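A minimal sketch of using append might look like this, assuming the target file already exists and the underlying filesystem supports appends; the class name AppendToFile and the appended text are made up:

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class AppendToFile {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
        // append() fails if the file does not exist or the
        // filesystem does not support appends
        FSDataOutputStream out = fs.append(new Path(uri));
        try {
            out.write("one more line\n".getBytes("UTF-8"));
        } finally {
            out.close();
        }
    }
}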
Not all Hadoop filesystems support append: HDFS does, but S3 does not. The following is an example of copying a local file to HDFS:
package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

public class FileCopyWithProgress {
    public static void main(String[] args) throws Exception {
        String localSrc = args[0];
        String dst = args[1];
        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst), new Progressable() {
            @Override
            public void progress() {
                System.out.print(".");
            }
        });
        IOUtils.copyBytes(in, out, 4096, true);
    }
}
To create a directory:
public boolean mkdirs(Path f) throws IOException
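As a quick hedged sketch (the class name MakeDirs is made up; the directory path comes from the command line):

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class MakeDirs {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
        // creates the directory and any missing parent directories
        boolean created = fs.mkdirs(new Path(uri));
        System.out.println("created = " + created);
    }
}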

The mkdirs method automatically creates all nonexistent parent directories. Retrieving directory and file information: viewing directory and file metadata is an indispensable feature of any file system, and HDFS is no exception. FileStatus encapsulates the metadata of HDFS files and directories, including the file length, block size, replication factor, modification time, owner, and permissions. A FileStatus object can be obtained through the getFileStatus method of FileSystem:

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.net.URI;

public class ShowFileStatus {
    public static void main(String[] args) throws IOException {
        Path path = new Path(args[0]);
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(args[0]), conf);
        FileStatus status = fs.getFileStatus(path);
        System.out.println("path = " + status.getPath());
        System.out.println("owner = " + status.getOwner());
        System.out.println("block size = " + status.getBlockSize());
        System.out.println("permission = " + status.getPermission());
        System.out.println("replication = " + status.getReplication());
    }
}

Listing files: sometimes you need to find a set of files that meet certain criteria, and the following example shows how. An array of matching FileStatus objects can be obtained through the listStatus method of FileSystem. listStatus has several overloads: you can pass in multiple paths, and you can filter with a PathFilter, which we discuss below. Another important method is FileUtil.stat2Paths, which converts an array of FileStatus objects into an array of Path objects; it is very convenient.

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.net.URI;

public class ListStatus {
    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path[] paths = new Path[args.length];
        for (int i = 0; i < paths.length; i++) {
            paths[i] = new Path(args[i]);
        }
        FileStatus[] status = fs.listStatus(paths);
        Path[] listedPaths = FileUtil.stat2Paths(status);
        for (Path p : listedPaths) {
            System.out.println(p);
        }
    }
}
PathFilter: now let's look at the PathFilter interface mentioned above. It requires implementing only one method, accept: a path is kept only when accept returns true, so returning false filters it out. The following example implements a regular-expression exclusion filter.
package com.sweetop.styhadoop;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class RegexExludePathFilter implements PathFilter {
    private final String regex;

    public RegexExludePathFilter(String regex) {
        this.regex = regex;
    }

    @Override
    public boolean accept(Path path) {
        return !path.toString().matches(regex);
    }
}

File patterns: when many files are needed, it is inconvenient to list the paths one by one, so HDFS supports wildcards for listing files through the globStatus method of FileSystem. globStatus also has overloads; by combining a glob pattern with a PathFilter, we have two ways of restricting the result, as in the following example.

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.net.URI;

public class GlobStatus {
    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FileStatus[] status = fs.globStatus(new Path(uri),
                new RegexExludePathFilter("^.*/1901"));
        Path[] listedPaths = FileUtil.stat2Paths(status);
        for (Path p : listedPaths) {
            System.out.println(p);
        }
    }
}
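For contrast, a hedged sketch of globbing without a PathFilter might look like this; the class name GlobOnly and the example pattern in the comment are assumptions, not part of the original post:

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class GlobOnly {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
        // e.g. pass a glob such as "/data/19*" to match /data/1901, /data/1902, ...
        FileStatus[] status = fs.globStatus(new Path(uri));
        for (Path p : FileUtil.stat2Paths(status)) {
            System.out.println(p);
        }
    }
}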
Data deletion is relatively simple:
public abstract boolean delete(Path f, boolean recursive) throws IOException
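A minimal hedged sketch (the class name DeletePath is made up; whether recursive is actually needed depends on the path, as explained below):

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class DeletePath {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
        // recursive = true is only required when deleting a non-empty directory
        boolean deleted = fs.delete(new Path(uri), true);
        System.out.println("deleted = " + deleted);
    }
}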

The first parameter is self-explanatory. The second indicates whether to recursively delete the contents of subdirectories. When the path is a file, or a directory that happens to be empty, the recursive flag can be ignored; but if the path is a non-empty directory and recursive is false, deleting it throws an IOException.

Thanks to Tom White: most of this article comes from his Definitive Guide. Since the Chinese translation is rather poor, I added my own understanding of the official documentation on top of the original English edition. Consider these reading notes.
