Hadoop Source Code Analysis: The Client's open, seek, and read Operations

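Although Hadoop does not offer POSIX-style file operations, the basic operations it does provide (open, create, delete, write, seek, and read) still let users work with files conveniently. Below is a typical piece of Hadoop code that opens a file and reads from it: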

    hdfs = hdfsPath.getFileSystem(conf);
    inFsData = hdfs.open(p);
    inFsData.seek(place);
    inFsData.readLong();

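hdfs is an instance of FileSystem, which is an abstract class. Depending on the URI (taken from the path, or from the default file system configured in conf when the path carries no scheme), the returned hdfs may be an instance of the local file system or of the distributed file system. For files stored in HDFS, the class that actually performs the file operations is DistributedFileSystem.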
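As a rough illustration of how the URI scheme drives that choice, the sketch below models the lookup with a plain map. It assumes the Hadoop 1.x convention of resolving the implementation class from a `fs.<scheme>.impl` configuration key. `FsImplLookup`, its map, and the example URIs are hypothetical stand-ins, not Hadoop code.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the scheme-based lookup; not a Hadoop class.
class FsImplLookup {
    // Assumed configuration entries following the "fs.<scheme>.impl" key convention.
    static final Map<String, String> CONF = new HashMap<>();
    static {
        CONF.put("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
        CONF.put("fs.file.impl", "org.apache.hadoop.fs.LocalFileSystem");
    }

    // Returns the configured implementation class name for the URI's scheme.
    static String resolve(String uri) {
        String scheme = URI.create(uri).getScheme();
        String impl = CONF.get("fs." + scheme + ".impl");
        if (impl == null) {
            throw new IllegalArgumentException("No FileSystem for scheme: " + scheme);
        }
        return impl;
    }

    public static void main(String[] args) {
        System.out.println(resolve("hdfs://namenode:9000/data/input.bin"));
        System.out.println(resolve("file:///tmp/input.bin"));
    }
}
```

With these assumed entries, an `hdfs://` URI resolves to DistributedFileSystem and a `file://` URI to the local file system, which is the distinction the paragraph above describes.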

Let's look at the open operation of DistributedFileSystem:

    public FSDataInputStream open(Path f, int bufferSize) throws IOException {
      statistics.incrementReadOps(1);
      return new DFSClient.DFSDataInputStream(
          dfs.open(getPathName(f), bufferSize, verifyChecksum, statistics));
    }

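As the code shows, open returns an FSDataInputStream. Internally it creates an object of DFSDataInputStream, an inner class of DFSClient, and the argument passed to that object's constructor is the return value of DFSClient's open function. Here is DFSClient's open function: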

    public DFSInputStream open(String src, int buffersize, boolean verifyChecksum,
                               FileSystem.Statistics stats
        ) throws IOException {
      checkOpen();
      //    Get block info from namenode
      return new DFSInputStream(src, buffersize, verifyChecksum);
    }

This open function returns a DFSInputStream object. Here is the DFSInputStream constructor:

    DFSInputStream(String src, int buffersize, boolean verifyChecksum
                   ) throws IOException {
      this.verifyChecksum = verifyChecksum;
      this.buffersize = buffersize;
      this.src = src;
      prefetchSize = conf.getLong("dfs.read.prefetch.size", prefetchSize);
      openInfo();
    }

Below is DFSInputStream's openInfo function, which is the core of the whole open sequence.

    synchronized void openInfo() throws IOException {
      LocatedBlocks newInfo = callGetBlockLocations(namenode, src, 0, prefetchSize);
      if (newInfo == null) {
        throw new FileNotFoundException("File does not exist: " + src);
      }

      // I think this check is not correct. A file could have been appended to
      // between two calls to openInfo().
      if (locatedBlocks != null && !locatedBlocks.isUnderConstruction() &&
          !newInfo.isUnderConstruction()) {
        Iterator<LocatedBlock> oldIter = locatedBlocks.getLocatedBlocks().iterator();
        Iterator<LocatedBlock> newIter = newInfo.getLocatedBlocks().iterator();
        while (oldIter.hasNext() && newIter.hasNext()) {
          if (! oldIter.next().getBlock().equals(newIter.next().getBlock())) {
            throw new IOException("Blocklist for " + src + " has changed!");
          }
        }
      }
      updateBlockInfo(newInfo);
      this.locatedBlocks = newInfo;
      this.currentNode = null;
    }

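Here callGetBlockLocations makes an RPC call to the namenode to fetch the locations of the blocks at the start of the file. Note that prefetchSize (configurable through dfs.read.prefetch.size) is a byte length rather than a block count: by default it is ten times the default block size, so the request normally covers the first 10 blocks. The locations of these blocks are cached in the stream. The updateBlockInfo call that follows takes the last block, compares the information held by its datanode with the information on the namenode, and, if the two disagree, goes with the datanode's information (the namenode and the datanodes can be out of sync).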
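To make the byte-range reading of the prefetch request concrete, here is a minimal sketch that counts how many blocks a request for the range [offset, offset + length) overlaps. It assumes a file laid out in fixed-size blocks and a 64 MB block size. `PrefetchRange` is a hypothetical helper, and the real namenode's block selection may differ in detail.

```java
// Hypothetical helper; models which blocks a byte-range request overlaps.
class PrefetchRange {
    // Number of blocks overlapped by the byte range [offset, offset + length),
    // clipped to the file length, for a file laid out in fixed-size blocks.
    static long blocksCovered(long fileLength, long blockSize, long offset, long length) {
        if (length <= 0 || offset >= fileLength) {
            return 0;
        }
        long end = Math.min(fileLength, offset + length); // exclusive end of the range
        long firstBlock = offset / blockSize;
        long lastBlock = (end - 1) / blockSize;
        return lastBlock - firstBlock + 1;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;  // assumed block size
        long prefetchSize = 10 * blockSize;  // assumed default: ten blocks' worth of bytes
        // A 25-block file: only the first ten blocks fall inside the request.
        System.out.println(blocksCovered(25 * blockSize, blockSize, 0, prefetchSize));
        // A 3-block file: the whole file is covered by one request.
        System.out.println(blocksCovered(3 * blockSize, blockSize, 0, prefetchSize));
    }
}
```

Under these assumptions the two calls print 10 and 3, which shows why a small file gets all of its block locations at open time while a large file needs further lookups later.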

The subsequent seek and read calls both operate on this stream. Here is DFSInputStream's seek function:

    public synchronized void seek(long targetPos) throws IOException {
      if (targetPos > getFileLength()) {
        throw new IOException("Cannot seek after EOF");
      }
      boolean done = false;
      if (pos <= targetPos && targetPos <= blockEnd) {
        //
        // If this seek is to a positive position in the current
        // block, and this piece of data might already be lying in
        // the TCP buffer, then just eat up the intervening data.
        //
        int diff = (int)(targetPos - pos);
        if (diff <= TCP_WINDOW_SIZE) {
          try {
            pos += blockReader.skip(diff);
            if (pos == targetPos) {
              done = true;
            }
          } catch (IOException e) {//make following read to retry
            LOG.debug("Exception while seek to " + targetPos + " from "
                      + currentBlock +" of " + src + " from " + currentNode +
                      ": " + StringUtils.stringifyException(e));
          }
        }
      }
      if (!done) {
        pos = targetPos;
        blockEnd = -1;
      }
    }
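The decision made by seek can be isolated in a small model: a short forward seek inside the current block is served by skipping bytes on the open connection, and anything else only records the new position and sets blockEnd to -1, leaving the block to be reopened later. The sketch below is a simplification under stated assumptions: `SeekModel` is hypothetical, the `TCP_WINDOW_SIZE` value is assumed, the skip is taken to succeed in full, and an unchecked exception replaces the checked IOException to keep it short.

```java
// Hypothetical model of the seek decision; not the Hadoop class itself.
class SeekModel {
    static final int TCP_WINDOW_SIZE = 128 * 1024; // assumed value for illustration

    long pos;               // current read position in the file
    long blockEnd;          // last byte offset of the current block, or -1 if none is open
    final long fileLength;  // total length of the file

    SeekModel(long fileLength, long pos, long blockEnd) {
        this.fileLength = fileLength;
        this.pos = pos;
        this.blockEnd = blockEnd;
    }

    // Returns true when the seek is served by skipping on the open block stream,
    // false when only the position is recorded and the block must be reopened.
    boolean seek(long targetPos) {
        if (targetPos > fileLength) {
            // The real method throws a checked IOException here.
            throw new IllegalArgumentException("Cannot seek after EOF");
        }
        if (pos <= targetPos && targetPos <= blockEnd) {
            long diff = targetPos - pos;
            if (diff <= TCP_WINDOW_SIZE) {
                pos += diff; // assumes the skip consumes all requested bytes
                return true;
            }
        }
        pos = targetPos;
        blockEnd = -1;
        return false;
    }

    public static void main(String[] args) {
        SeekModel s = new SeekModel(1_000_000, 100, 500_000);
        System.out.println(s.seek(200));      // short forward seek inside the block
        System.out.println(s.seek(400_000));  // forward seek beyond the window
        System.out.println(s.seek(50));       // backward seek
    }
}
```

In this model only the first call returns true. The other two leave blockEnd at -1, so any later seek also takes the slow path until a block is opened again.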
