Understanding HDFS


What do we need to consider in the design of HDFS (the Hadoop Distributed File System)?

The first consideration is how data is stored (the physical storage of data).

A DataNode process runs on each machine; it is responsible for storing the data.

HDFS splits a large file into blocks. The default block size differs between versions; Hadoop 1.x defaults to 64 MB.

Suppose we have an 80 MB file: it is split into a 64 MB block and a 16 MB block. The blocks are then stored in a distributed fashion. The 64 MB block may end up on one DataNode while the 16 MB block lands on another.
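The splitting rule above is easy to sketch in code. This is only an illustration of the arithmetic, not the real HDFS client (which splits the stream as it writes); the 64 MB figure is the Hadoop 1.x default mentioned above.

```python
BLOCK_SIZE = 64  # MB, the Hadoop 1.x default block size

def split_into_blocks(file_size_mb, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file would be split into."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks

print(split_into_blocks(80))  # [64, 16]
```

An 80 MB file becomes one full 64 MB block plus one 16 MB tail block, exactly as in the example above.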

The second consideration is the safety of the data (preventing data loss).

Distributed systems usually keep backups. For our 80 MB file, there may be several copies of the 64 MB block and several copies of the 16 MB block. The DataNodes can communicate with each other to replicate blocks, and they report to the NameNode how many replicas exist. Even if one DataNode goes bad, the data is still available from its backups.
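A minimal sketch of replica placement, assuming a replication factor of 3 and simple round-robin assignment. Real HDFS uses rack-aware placement, which is more sophisticated than this; the node names are made up for illustration.

```python
import itertools

def place_replicas(blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes, round-robin.
    (Illustration only; real HDFS placement is rack-aware.)"""
    rotation = itertools.cycle(range(len(datanodes)))
    placement = {}
    for block in blocks:
        start = next(rotation)
        placement[block] = [datanodes[(start + i) % len(datanodes)]
                            for i in range(replication)]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]
print(place_replicas(["blk_64M", "blk_16M"], nodes))
```

The key property is that each block's replicas land on distinct DataNodes, so losing any single node leaves at least two copies of every block.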

The third consideration is access to the data (communication between client and server).

How does a client communicate with the distributed system?

A running HDFS cluster has three kinds of processes: the DataNode, the NameNode, and the SecondaryNameNode.

    Client ---- Step 1 (RPC communication) ----> NameNode
       |
       +---- Step 2 (read/write streams) ----> DataNode1   DataNode2   DataNode3

To read a file, the client first communicates with the NameNode, which holds the metadata. The NameNode performs its checks and then tells the client which DataNodes to contact. The client reads and writes through streams, and closes the stream when it is finished.
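The two-step read path can be modeled with a toy NameNode and client. All class names, method names, paths, and block IDs here are illustrative stand-ins, not the real HDFS API.

```python
class NameNode:
    """Toy NameNode: holds only the metadata, never the file contents."""
    def __init__(self):
        # metadata: file path -> list of (block_id, datanode) pairs
        self.metadata = {"/logs/a.txt": [("blk_1", "dn1"), ("blk_2", "dn3")]}

    def get_block_locations(self, path):
        if path not in self.metadata:
            raise FileNotFoundError(path)
        return self.metadata[path]

class Client:
    def __init__(self, namenode):
        self.namenode = namenode

    def read(self, path):
        # Step 1: RPC to the NameNode for the block locations.
        locations = self.namenode.get_block_locations(path)
        # Step 2: stream each block from its DataNode (simulated here).
        return [f"data of {blk} from {dn}" for blk, dn in locations]

print(Client(NameNode()).read("/logs/a.txt"))
```

The point of the split is that the NameNode only answers the metadata question ("where are the blocks?"); the bulk data itself flows directly between the client and the DataNodes.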

The SecondaryNameNode is a cold backup: it keeps a copy of the NameNode's metadata. If the NameNode crashes, the entire distributed system is down, so there must be a remedy.

How does the SecondaryNameNode provide that remedy for the NameNode?

First of all, the NameNode keeps its metadata in memory, where it would be lost on a power failure, so it must also be saved to disk. This is done through a serialization mechanism that writes the in-memory state out to disk. The serialized image file is called the fsimage file.
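The memory-to-disk serialization idea can be shown in a few lines. Here Python's pickle stands in for Hadoop's own image format, and the metadata dictionary is a made-up example; the point is only the round trip: serialize to a file, then restore it after a restart.

```python
import os
import pickle
import tempfile

# In-memory metadata on the (toy) NameNode: path -> block list.
metadata = {"/logs/a.txt": ["blk_1", "blk_2"]}

# Serialize memory -> disk; this file plays the role of fsimage.
fsimage_path = os.path.join(tempfile.mkdtemp(), "fsimage")
with open(fsimage_path, "wb") as f:
    pickle.dump(metadata, f)

# After a restart: deserialize disk -> memory.
with open(fsimage_path, "rb") as f:
    restored = pickle.load(f)

print(restored == metadata)  # True
```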


    Client ----(send request)----> NameNode
           <---(check that the file exists and permissions suffice,
                then return the target DataNodes to the client)
       |
       |  write the data to the DataNodes
       v
    DataNode1   DataNode2   DataNode3


As the client writes data to the DataNodes, an edits file on the NameNode starts recording the operations: the file name, which DataNodes store it, the block sizes, and any failures are all logged. Immediately afterwards, the in-memory metadata on the NameNode also gains a description of the new data. At this point the write has been recorded, but it has not yet been synchronized into fsimage (fsimage is not updated in real time).

Suppose that a month ago we uploaded 2 files to HDFS. The SecondaryNameNode merges when a certain condition is met (the edits file reaching a certain size, or a certain amount of time passing); over a month it has merged many times, so everything is synchronized. The in-memory metadata now holds 2 descriptions, fsimage holds the same 2 descriptions (already synchronized), and the edits file is empty, because once metadata has been merged into fsimage the edits file is cleared.

Now the client uploads one more file to HDFS. The edits file gains 1 record, the in-memory metadata grows to 3 descriptions, but fsimage still holds only 2: the metadata and fsimage are out of sync. When the checkpoint condition is met, the SecondaryNameNode starts to work:

1. It downloads the fsimage and edits files from the NameNode over HTTP. The NameNode switches to a new edits.new file, so any reads or writes that happen during the checkpoint are recorded there.

2. The SecondaryNameNode merges fsimage and edits into a new fsimage and sends it back to the NameNode.

3. The NameNode discards the old fsimage and edits files and replaces edits with the edits.new file.

With that, the metadata is synchronized again.
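The merge step is essentially "replay the edits log onto the last fsimage". Here is a toy version of that checkpoint, with simplified stand-in structures (a dict for fsimage, a list of operations for edits); the real files have their own binary formats.

```python
def checkpoint(fsimage, edits):
    """Replay the edits log onto fsimage; return (new_fsimage, empty_edits)."""
    new_image = dict(fsimage)
    for op, path, blocks in edits:
        if op == "add":
            new_image[path] = blocks
        elif op == "delete":
            new_image.pop(path, None)
    # After a successful merge the edits log is emptied.
    return new_image, []

fsimage = {"/a.txt": ["blk_1"], "/b.txt": ["blk_2"]}   # 2 files, synced
edits = [("add", "/c.txt", ["blk_3"])]                 # 1 new upload logged

fsimage, edits = checkpoint(fsimage, edits)
print(len(fsimage), edits)  # 3 []
```

After the checkpoint, the new fsimage describes all 3 files and the edits log is empty, matching the synchronized state described above.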

HDFS is suited to write-once, read-many workloads; it does not support concurrent writes, and it is not appropriate for large numbers of small files.

This article is from the "Jane Answers Life" blog; please contact the author before reproducing it.

