Hadoop in Depth (2): Accessing HDFS from Java

Last updated: 2018-07-26  Source: Internet

Uploaded by: User

Please credit the source when reposting: http://blog.csdn.net/lastsweetop/article/details/9001467

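All source code is on GitHub: https://github.com/lastsweetop/styhadoop

Reading data

Reading with a Hadoop URL

A fairly simple way to read HDFS data is to open a stream through java.net.URL. Before doing so, you must call URL.setURLStreamHandlerFactory with an FsUrlStreamHandlerFactory, the factory that lets java.net.URL resolve the hdfs protocol. This method can be called only once per JVM, so the call goes in a static block. IOUtils.copyBytes then copies the HDFS stream to standard output, System.out. The first two arguments of copyBytes are the input and the output, the third is the buffer size, and the fourth says whether to close the streams once copying finishes. Here it is set to false: standard output must stay open, and the input stream is closed manually.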

package com.sweetop.styhadoop;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

import java.io.InputStream;
import java.net.URL;

/**
 * Created with IntelliJ IDEA.
 * User: lastsweetop
 * Date: 13-5-31
 * Time: 10:16 AM
 */
public class URLCat {

    // The factory can be installed only once per JVM, hence the static block.
    static {
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            // false: do not close System.out when the copy finishes
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            // close the input stream ourselves
            IOUtils.closeStream(in);
        }
    }
}
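To make the four copyBytes arguments concrete, here is a minimal JDK-only sketch of a copy loop with the same shape. It is an illustrative stand-in, not Hadoop's IOUtils implementation; the class name CopyBytesSketch and its helper methods are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in, NOT org.apache.hadoop.io.IOUtils.
class CopyBytesSketch {

    // in: source, out: destination, bufferSize: bytes per read,
    // close: whether both streams are closed once the copy is done.
    static void copyBytes(InputStream in, OutputStream out, int bufferSize, boolean close) {
        byte[] buffer = new byte[bufferSize];
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (close) {
                closeQuietly(in);
                closeQuietly(out);
            }
        }
    }

    static void closeQuietly(Closeable c) {
        try {
            if (c != null) {
                c.close();
            }
        } catch (IOException ignored) {
            // best-effort close, mirroring the "close and ignore" idea
        }
    }

    // Copies a string through the loop with a tiny buffer and returns the result.
    static String roundTrip(String text) {
        InputStream in = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        copyBytes(in, out, 4, false);
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        // close = false keeps System.out usable afterwards
        copyBytes(in, System.out, 4096, false);
        System.out.println("System.out is still open");
    }
}
```

Passing true as the last argument here would close System.out as well, which is why the article's example passes false and closes only the input stream.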

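Reading data with the FileSystem API

First obtain a FileSystem instance through the static FileSystem.get method, passing a java.net.URI and a Configuration. The FileSystem can then open a stream from a Path object; after that, everything works as in the previous example.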

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.InputStream;
import java.net.URI;

/**
 * Created with IntelliJ IDEA.
 * User: lastsweetop
 * Date: 13-5-31
 * Time: 11:24 AM
 */
public class FileSystemCat {

    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        // The URI tells FileSystem.get which file system to return
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
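FileSystem.get uses the scheme and authority of the URI to pick the file system. The JDK-only sketch below shows how java.net.URI splits such a string. The NameNode address localhost:9000 and the file path are hypothetical values chosen for illustration, and HdfsUriParts is a made-up class name.

```java
import java.net.URI;

class HdfsUriParts {

    // Returns "scheme | authority | path" for the given URI string.
    static String describe(String uri) {
        URI u = URI.create(uri);
        return u.getScheme() + " | " + u.getAuthority() + " | " + u.getPath();
    }

    public static void main(String[] args) {
        // Hypothetical NameNode host/port and file path
        System.out.println(describe("hdfs://localhost:9000/user/demo/a.txt"));
    }
}
```

The scheme (hdfs) selects the file system type, the authority names the NameNode, and the path locates the file inside it.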

FSDataInputStream

The object returned when FileSystem opens a stream is an FSDataInputStream, which implements the Seekable interface:

public interface Seekable {
    void seek(long l) throws java.io.IOException;
    long getPos() throws java.io.IOException;
    boolean seekToNewSource(long l) throws java.io.IOException;
}

The seek method jumps to any position in the file. Here we jump back to the start of the file and read it a second time:

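import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.net.URI;

public class FileSystemDoubleCat {

    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
            // go back to the start of the file and print it again
            in.seek(0);
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}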
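The same read, seek to 0, read again pattern can be tried on a local file without a cluster. The sketch below is a JDK-only analogue that uses java.nio's SeekableByteChannel in place of FSDataInputStream; the class name SeekTwiceSketch and the temporary file are assumptions made for this example, not part of the Hadoop API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Local-file analogue of FileSystemDoubleCat; not Hadoop code.
class SeekTwiceSketch {

    // Writes content to a temp file, reads it, seeks to 0, reads it again.
    static String readTwice(String content) {
        try {
            Path tmp = Files.createTempFile("seek-sketch", ".txt");
            try {
                Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                try (SeekableByteChannel ch = Files.newByteChannel(tmp)) {
                    drain(ch, out);
                    ch.position(0); // plays the role of in.seek(0)
                    drain(ch, out);
                }
                return new String(out.toByteArray(), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads from the channel's current position to end of file.
    private static void drain(SeekableByteChannel ch, ByteArrayOutputStream out) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(4096);
        while (ch.read(buf) != -1) {
            buf.flip();
            out.write(buf.array(), 0, buf.limit());
            buf.clear();
        }
    }

    public static void main(String[] args) {
        System.out.println(readTwice("abc"));
    }
}
```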

FSDataInputStream also implements the PositionedReadable interface:

public interface PositionedReadable {
    int read(long l, byte[] bytes, int i, int i1) throws java.io.IOException;
    void readFully(long l, byte[] bytes, int i, int i1) throws java.io.IOException;
    void readFully(long l, byte[] bytes) throws java.io.IOException;
}

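These methods read from any position in the file (first argument) into a byte array (second argument), starting at the given offset in the array (third argument) and reading the given number of bytes (fourth argument).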
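For a feel of positioned reads without a cluster, the JDK offers a similar operation on local files: FileChannel.read(ByteBuffer, long) reads at an absolute position and leaves the channel's own position untouched. The sketch below is a local-file analogue under that assumption, not the Hadoop API; PositionedReadSketch and readAt are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Local-file analogue of a positioned read; not Hadoop code.
class PositionedReadSketch {

    // Reads up to `length` bytes starting at absolute `position` in a temp file.
    static String readAt(String content, long position, int length) {
        try {
            Path tmp = Files.createTempFile("pread-sketch", ".txt");
            try {
                Files.write(tmp, content.getBytes(StandardCharsets.US_ASCII));
                try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
                    ByteBuffer buf = ByteBuffer.allocate(length);
                    long pos = position;
                    while (buf.hasRemaining()) {
                        int n = ch.read(buf, pos); // does not move ch.position()
                        if (n < 0) {
                            break; // reached end of file
                        }
                        pos += n;
                    }
                    return new String(buf.array(), 0, buf.position(), StandardCharsets.US_ASCII);
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readAt("0123456789", 3, 4)); // prints 3456
    }
}
```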
An HDFS example is not given here; try it yourself.

Writing data

The FileSystem class has many methods for creating a file. The simplest is:

public FSDataOutputStream create(Path f) throws IOException

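It has many overloads that let you specify whether to forcibly overwrite an existing file, the file's replication factor, the write buffer size, the file's block size, the file permissions, and so on. You can also pass a callback interface: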

public interface Progressable {
    void progress();
}

As with an ordinary file system, append is also supported; it is most often used for writing logs:

public FSDataOutputStream append(Path f) throws IOException

Not every Hadoop file system supports append, however: HDFS does, S3 does not. Below is an example that copies a local file to HDFS:

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

/**
 * Created with IntelliJ IDEA.
 * User: lastsweetop
 * Date: 13-6-2
 * Time: 4:54 PM
 */
public class FileCopyWithProgress {

    public static void main(String[] args) throws Exception {
        String localSrc = args[0];
        String dst = args[1];
        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        // Print a dot each time Hadoop reports progress
        OutputStream out = fs.create(new Path(dst), new Progressable() {
            @Override
            public void progress() {
                System.out.print(".");
            }
        });
        // true: close both streams when the copy finishes
        IOUtils.copyBytes(in, out, 4096, true);
    }
}
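The callback idea can be sketched without Hadoop. In the JDK-only example below, the copy loop calls progress() after each buffer it writes. That cadence is an assumption made for illustration: with the real API, Hadoop decides when to invoke Progressable.progress(). ProgressCallback and CopyWithProgressSketch are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Illustrative only; not org.apache.hadoop.util.Progressable.
class CopyWithProgressSketch {

    // Hypothetical stand-in for Hadoop's Progressable
    interface ProgressCallback {
        void progress();
    }

    // Copies in to out and reports progress after every buffer written.
    // Returns how many times the callback fired.
    static int copy(InputStream in, OutputStream out, int bufferSize, ProgressCallback callback) {
        byte[] buffer = new byte[bufferSize];
        int calls = 0;
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                callback.progress();
                calls++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return calls;
    }

    // 10 bytes with a 4-byte buffer: reads of 4, 4 and 2 bytes.
    static int demo() {
        InputStream in = new ByteArrayInputStream(new byte[10]);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        return copy(in, out, 4, new ProgressCallback() {
            @Override
            public void progress() {
                System.out.print(".");
            }
        });
    }

    public static void main(String[] args) {
        int calls = demo();
        System.out.println();
        System.out.println("progress() called " + calls + " times");
    }
}
```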

FileSystem also provides a method for creating directories:

public boolean mkdirs(Path f) throws IOException

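mkdirs automatically creates any parent directories that do not exist.

Querying

Listing a directory and viewing information about files and directories are indispensable in any operating system, and HDFS is no exception, though it has a few peculiarities.

FileStatus

FileStatus encapsulates the metadata of HDFS files and directories, including file length, block size, replication, modification time, owner, and permissions. FileSystem's getFileStatus method returns this information: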

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.net.URI;

/**
 * Created with IntelliJ IDEA.
 * User: lastsweetop
 * Date: 13-6-2
 * Time: 8:58 PM
 */
public class ShowFileStatus {

    public static void main(String[] args) throws IOException {
        Path path = new Path(args[0]);
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(args[0]), conf);
        // Fetch the metadata of the file or directory and print a few fields
        FileStatus status = fs.getFileStatus(path);
        System.out.println("path = " + status.getPath());
        System.out.println("owner = " + status.getOwner());
        System.out.println("block size = " + status.getBlockSize());
        System.out.println("permission = " + status.getPermission());
        System.out.println("replication = " + status.getReplication());
    }
}

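Listing files

Sometimes you need to find a set of files that meet some criteria, and the example below can help. FileSystem's listStatus method returns an array of matching FileStatus objects. It has several overloads: you can pass multiple paths, and you can filter with a PathFilter, which is covered below. Another important method is FileUtil.stat2Paths, which converts an array of FileStatus objects into an array of Path objects; it is very convenient.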

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.net.URI;

/**
 * Created with IntelliJ IDEA.
 * User: lastsweetop
 * Date: 13-6-2
 * Time: 10:09 PM
 */
public class ListStatus {

    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        // Every command-line argument is a path to list
        Path[] paths = new Path[args.length];
        for (int i = 0; i < paths.length; i++) {
            paths[i] = new Path(args[i]);
        }
        FileStatus[] status = fs.listStatus(paths);
        // Convert the FileStatus array to a Path array for printing
        Path[] listedPaths = FileUtil.stat2Paths(status);
        for (Path p : listedPaths) {
            System.out.println(p);
        }
    }
}

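PathFilter

Continuing from above, the PathFilter interface requires implementing just one method, accept. It returns true when a path should be kept and false when it should be filtered out. Here we implement a regex-based exclusion filter, which is used in the next example: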

package com.sweetop.styhadoop;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

/**
 * Created with IntelliJ IDEA.
 * User: lastsweetop
 * Date: 13-6-3
 * Time: 2:49 PM
 */
public class RegexExludePathFilter implements PathFilter {

    private final String regex;

    public RegexExludePathFilter(String regex) {
        this.regex = regex;
    }

    @Override
    public boolean accept(Path path) {
        // Keep only paths that do NOT match the regex
        return !path.toString().matches(regex);
    }
}
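The direction of accept is easy to get backwards, so here is a small JDK-only sketch of the same exclusion logic applied to plain strings. SimplePathFilter is a hypothetical stand-in for Hadoop's PathFilter, and the /data/... paths are made-up sample values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only; operates on strings instead of Hadoop Path objects.
class AcceptSemanticsSketch {

    // Hypothetical stand-in for org.apache.hadoop.fs.PathFilter
    interface SimplePathFilter {
        boolean accept(String path);
    }

    // accept() returns false for paths that match the regex, so they are dropped.
    static SimplePathFilter regexExclude(final String regex) {
        return path -> !path.matches(regex);
    }

    // Keeps exactly the paths for which accept() returns true.
    static List<String> keepAccepted(List<String> paths, SimplePathFilter filter) {
        List<String> kept = new ArrayList<>();
        for (String p : paths) {
            if (filter.accept(p)) {
                kept.add(p);
            }
        }
        return kept;
    }

    static List<String> demo() {
        List<String> paths = Arrays.asList("/data/1901", "/data/1902", "/data/2000");
        return keepAccepted(paths, regexExclude("^.*/1901"));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [/data/1902, /data/2000]
    }
}
```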

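File patterns

When many files are needed, listing paths one by one is inconvenient. HDFS can list files by wildcard through FileSystem's globStatus method. globStatus also has an overload that takes a PathFilter, so let's combine the two: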

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.net.URI;

/**
 * Created with IntelliJ IDEA.
 * User: lastsweetop
 * Date: 13-6-3
 * Time: 2:37 PM
 */
public class GlobStatus {

    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        // args[0] is the glob pattern; the filter then drops paths ending in /1901
        FileStatus[] status = fs.globStatus(new Path(uri), new RegexExludePathFilter("^.*/1901"));
        Path[] listedPaths = FileUtil.stat2Paths(status);
        for (Path p : listedPaths) {
            System.out.println(p);
        }
    }
}
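The combination of a glob and an exclusion filter can be tried locally with the JDK's own glob matcher. This is only an analogue: java.nio glob syntax is similar to, but not identical with, the patterns Hadoop's globStatus accepts. The class name GlobThenFilterSketch and the /data/... paths are assumptions for this example.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// JDK-only analogue of globStatus(pattern, filter); not Hadoop code.
class GlobThenFilterSketch {

    // Keeps candidates that match the glob and do not match the exclusion regex.
    static List<String> select(List<String> candidates, String glob, String excludeRegex) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> selected = new ArrayList<>();
        for (String candidate : candidates) {
            boolean matchesGlob = matcher.matches(Paths.get(candidate));
            boolean excluded = candidate.matches(excludeRegex);
            if (matchesGlob && !excluded) {
                selected.add(candidate);
            }
        }
        return selected;
    }

    static List<String> demo() {
        List<String> candidates = Arrays.asList("/data/1901", "/data/1902", "/data/2000");
        return select(candidates, "/data/19*", "^.*/1901");
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Here the glob narrows the candidates to the two 19xx paths, and the regex filter then removes /data/1901, leaving only /data/1902.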

Deleting data

Deleting data is fairly simple:

public abstract boolean delete(Path f, boolean recursive) throws IOException

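The first argument is self-explanatory. The second says whether to recursively delete subdirectories and the files under a directory. It is ignored when the Path is a file or an empty directory, but if the Path is a non-empty directory and recursive is false, the delete throws an IOException.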
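The local file system behaves in a comparable way, which makes the rule easy to try without a cluster. The JDK-only sketch below is an analogue, not the Hadoop API: Files.delete plays the role of a non-recursive delete, and a hand-written walk plays the role of recursive = true. DeleteSemanticsSketch and its methods are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Local-file analogue of delete(path, recursive); not Hadoop code.
class DeleteSemanticsSketch {

    // Deletes children before parents, like a recursive delete.
    static void deleteRecursively(Path root) throws IOException {
        List<Path> deepestFirst;
        try (Stream<Path> walk = Files.walk(root)) {
            deepestFirst = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : deepestFirst) {
            Files.delete(p);
        }
    }

    // Returns true if the plain delete of a non-empty directory failed
    // and the recursive delete afterwards removed it.
    static boolean nonRecursiveDeleteFails() {
        try {
            Path dir = Files.createTempDirectory("delete-sketch");
            Files.createFile(dir.resolve("child.txt"));
            boolean failed;
            try {
                Files.delete(dir); // non-recursive attempt on a non-empty directory
                failed = false;
            } catch (IOException e) {
                failed = true;
            }
            if (Files.exists(dir)) {
                deleteRecursively(dir);
            }
            return failed && !Files.exists(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("non-recursive delete failed: " + nonRecursiveDeleteFails());
    }
}
```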

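Thanks to Tom White: most of this article comes from his Definitive Guide. Because the Chinese translation is so poor, I worked from the English original and some of the official documentation, adding my own understanding. Consider these reading notes, superfluous as they may be.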
