"Original" HDFs introduction


I. Introduction to HDFS

1. HDFS Full Name

HDFS stands for Hadoop Distributed File System.

Hadoop has an abstract file system concept: it provides the abstract class org.apache.hadoop.fs.FileSystem, and HDFS is one implementation of this abstract class. Other implementations include:

File system    URI scheme    Java implementation (under org.apache.hadoop)
Local          file          fs.LocalFileSystem
HDFS           hdfs          hdfs.DistributedFileSystem
HFTP           hftp          hdfs.HftpFileSystem
HSFTP          hsftp         hdfs.HsftpFileSystem
HAR            har           fs.HarFileSystem
KFS            kfs           fs.kfs.KosmosFileSystem
FTP            ftp           fs.ftp.FTPFileSystem
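The URI scheme is what selects the implementation: FileSystem.get() looks at the scheme of the URI and returns the matching FileSystem subclass. A minimal sketch, assuming a placeholder NameNode address of namenode:9000 (not from the original text); it needs java.net.URI, org.apache.hadoop.conf.Configuration and org.apache.hadoop.fs.FileSystem:

// Obtain FileSystem instances by URI scheme; Hadoop returns the matching implementation.
public void showFileSystemImplementations() throws IOException {
    Configuration conf = new Configuration();

    // "file" scheme -> org.apache.hadoop.fs.LocalFileSystem
    FileSystem local = FileSystem.get(URI.create("file:///tmp"), conf);
    System.out.println(local.getClass().getName());

    // "hdfs" scheme -> org.apache.hadoop.hdfs.DistributedFileSystem
    // ("namenode:9000" is a placeholder host:port)
    FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);
    System.out.println(hdfs.getClass().getName());
}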

2. HDFS Features

(1) Designed for very large files and data sets stored across a cluster.

(2) Streaming data access: files are written once and then read many times.

(3) Runs on inexpensive commodity hardware and has a good fault-tolerance mechanism.

(4) Data access has a certain latency, because HDFS is optimized for high data throughput at the cost of higher latency.

(5) HDFS cannot efficiently store large numbers of small files, because the NameNode holds the metadata of every file in memory, which limits the total number of files.

(6) HDFS does not support multiple concurrent writers, nor does it support random writes.

II. HDFS Architecture

3. Architecture diagram

4. Introduction to Architecture

(1) HDFS consists of the Client, the NameNode, the DataNodes, and the SecondaryNameNode.

(2) The client provides a calling interface for the file system.

(3) The NameNode maintains the fsimage (the HDFS metadata image file) and the editlog (the HDFS change log), and keeps the mapping between each file and its data blocks in memory. This block mapping is not stored on disk; it is rebuilt every time HDFS starts.

(4) The SecondaryNameNode has two tasks:

- Periodically merge the fsimage and the editlog and send the merged result back to the NameNode.

- Provide a backup of the NameNode's metadata.

(5) A DataNode is usually installed on each machine, and the files it stores are divided into many blocks. The block is the smallest addressable unit in HDFS, with a typical size of 64 MB. Unlike an ordinary single-machine file system, a file smaller than one block does not occupy the whole block's space.

(6) The main reason for the large block size is to reduce addressing overhead, but the block size should not be set too large either: a map task usually processes one block of data, so if the block is too large the map task has too much data to handle and efficiency drops.
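The block size and block locations of a file can be inspected through the FileSystem API. A minimal sketch, assuming a placeholder path (not from the original text):

// Inspect the block size and block locations of an HDFS file.
// The path "/user/hadoop/data/write.txt" is a placeholder.
public void showBlocks(String fileUrl) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(fileUrl);

    FileStatus status = fs.getFileStatus(path);
    System.out.println("Block size: " + status.getBlockSize());

    // One BlockLocation per block, listing the DataNodes that hold its replicas.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
        System.out.println("offset=" + block.getOffset()
                + " length=" + block.getLength()
                + " hosts=" + java.util.Arrays.toString(block.getHosts()));
    }
    fs.close();
}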

(7) Each DataNode periodically reports the blocks it stores to the NameNode through heartbeats.

(8) Replica placement rules of HDFS

The default replication factor is 3: the first replica is stored on the local node (the machine where the client writes), the second on another node in the same rack, and the third on a node in a different rack.

On the one hand, this reduces the network traffic of write operations and improves write efficiency; on the other hand, the failure rate of an entire rack is far lower than that of a single node, so data reliability is not affected.
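The replication factor can also be adjusted per file through the FileSystem API. A minimal sketch, assuming a placeholder path and factor (not from the original text):

// Change the replication factor of an existing HDFS file.
// The path and the factor 3 are placeholders.
public void changeReplication(String fileUrl) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Returns true if the new replication factor was accepted.
    boolean changed = fs.setReplication(new Path(fileUrl), (short) 3);
    System.out.println("Replication changed: " + changed);
    fs.close();
}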

III. HDFS Read and Write Process

1. Data reading flowchart

2. Read Process description

(1) The HDFS client calls the open() method of the DistributedFileSystem class and, through an RPC request to the NameNode, obtains the locations of the requested file's blocks and the addresses of the nearest DataNodes.

(2) DistributedFileSystem returns an FSDataInputStream object to the client.

(3) The client calls read() on the FSDataInputStream, reading from the DataNodes in order of distance, nearest first.

(4) After all DataNodes have been read, the client calls close() on the FSDataInputStream.

(5) If reading fails, the client reads another replica of the data block and reports the problem to the NameNode.
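Putting these steps together, a minimal read sketch, assuming a placeholder path; IOUtils.copyBytes simply streams the file to standard output:

// Minimal read sequence: open -> read -> close.
// The path "/user/hadoop/data/write.txt" is a placeholder.
public void streamFile(String fileUrl) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FSDataInputStream in = null;
    try {
        // steps (1)-(2): open() contacts the NameNode and returns an FSDataInputStream
        in = fs.open(new Path(fileUrl));
        // step (3): read the data, streaming it to standard output
        IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
        // step (4): close the stream
        IOUtils.closeStream(in);
        fs.close();
    }
}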

3. File Write Flowchart

4. Write Process description

(1) The client calls the create() method of the DistributedFileSystem object, which calls the NameNode through RPC to create a new file in the namespace; at this point the file has no DataNodes associated with it yet.

(2) The create() method returns an FSDataOutputStream object that the client uses to write data.

(3) Before being written, the data is split into packets and placed in a "data queue".

(4) The NameNode allocates suitable DataNodes to store the replicas of each packet and returns a DataNode pipeline.

(5) The packets are written to each DataNode along the pipeline.

(6) After a DataNode stores a packet, it returns an acknowledgment. The packet waits in an acknowledgment queue and is removed from it only when every DataNode in the pipeline has acknowledged it.

(7) After all DataNodes in the pipeline have finished, the client calls close() on the stream to shut down the data flow.
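A minimal write sketch that follows these steps, assuming a placeholder path; the optional Progressable callback is invoked as data is written to the pipeline:

// Minimal write sequence: create -> write -> close.
// The path is a placeholder; content is whatever the caller supplies.
public void copyToHdfs(String fileUrl, byte[] content) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // steps (1)-(2): create the file in the namespace and get an FSDataOutputStream
    FSDataOutputStream out = fs.create(new Path(fileUrl), new Progressable() {
        public void progress() {
            System.out.print(".");   // called as data is written to the pipeline
        }
    });
    out.write(content);   // steps (3)-(6): data is packetized, queued, and pipelined
    out.close();          // step (7): close the data flow
    fs.close();
}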

IV. Web Interface of Hadoop

1. Interface Address

The web interface of HDFs can be accessed through http://NameNodeIP:50070.

V. The Java API for HDFS

1. Use URL to read data

// Read a file from HDFS using the URL interface.
// setURLStreamHandlerFactory can only be called once per JVM.
static {
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
}

public String getHdfsByUrl(String url) throws MalformedURLException, IOException {
    InputStream in = null;
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try {
        in = new URL(url).openStream();
        IOUtils.copyBytes(in, out, 4096, false);
    } finally {
        IOUtils.closeStream(in);
        IOUtils.closeStream(out);
    }
    return out.toString();
}

2. FileSystem API Read Data

// Read a file from HDFS with the FileSystem API.
// url: "/user/hadoop/data/write.txt"
public String readFile(String url) throws IOException {
    String fileContent = "";
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(url);
    if (fs.exists(path)) {
        FSDataInputStream is = fs.open(path);
        FileStatus status = fs.getFileStatus(path);
        byte[] buffer = new byte[(int) status.getLen()];
        is.readFully(0, buffer);
        is.close();
        fs.close();
        fileContent = new String(buffer);
    }
    return fileContent;
}

  

3. FileSystem API Create Directory

// Create an HDFS directory with the FileSystem API.
// dirPath: "/user/hadoop/data/20130709"
public void makeDir(String dirPath) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(dirPath);
    fs.mkdirs(path);   // creates the directory and any missing parents
    fs.close();
}

  

4. FileSystem API Write Data

// Write a file to HDFS with the FileSystem API.
// fileUrl: "/user/hadoop/data/write.txt"
public void writeFile(String fileUrl, String fileContent) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(fileUrl);
    FSDataOutputStream out = fs.create(path);
    out.writeUTF(fileContent);
    out.close();
    fs.close();
}

  

5. FileSystem API Delete File

// Delete a file from HDFS with the FileSystem API.
// fileUrl: "/user/hadoop/data/word.txt"
public void deleteFile(String fileUrl) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(fileUrl);
    fs.delete(path, true);   // true: delete recursively if the path is a directory
    fs.close();
}

  

6. Query meta-data

// Query file metadata with the FileSystem API.
public void showFileStatus(String fileUrl) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path(fileUrl);
    FileStatus stat = fs.getFileStatus(file);

    System.out.println("File path: " + stat.getPath());
    System.out.println("Is directory: " + stat.isDirectory());
    System.out.println("Is file: " + stat.isFile());
    System.out.println("Block size: " + stat.getBlockSize());
    System.out.println("File owner: " + stat.getOwner() + ":" + stat.getGroup());
    System.out.println("File permissions: " + stat.getPermission());
    System.out.println("File length: " + stat.getLen());
    System.out.println("Replication factor: " + stat.getReplication());
    System.out.println("Modification time: " + stat.getModificationTime());
    fs.close();
}
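For completeness, a hypothetical usage example that chains the helper methods above; the class name HdfsDemo, the paths, and the content string are placeholders, not part of the original article:

// Hypothetical driver that exercises the helper methods above.
public static void main(String[] args) throws IOException {
    HdfsDemo demo = new HdfsDemo();   // assumed class holding the methods above
    demo.makeDir("/user/hadoop/data/20130709");
    demo.writeFile("/user/hadoop/data/write.txt", "hello hdfs");
    // note: writeFile() uses writeUTF(), so the content is preceded by a 2-byte length prefix
    System.out.println(demo.readFile("/user/hadoop/data/write.txt"));
    demo.showFileStatus("/user/hadoop/data/write.txt");
    demo.deleteFile("/user/hadoop/data/write.txt");
}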

  

"Original" HDFs introduction
