Understanding HDFS


What do we need to consider in the design of HDFS (the Hadoop Distributed File System)?

The first consideration is how data is stored (the physical storage of data).

A DataNode process runs on each machine; it is responsible for storing the data.

HDFS splits a large file into blocks. The default block size differs between versions; Hadoop 1.x defaults to 64 MB.

Suppose we have an 80 MB file: it is split into a 64 MB block and a 16 MB block. The blocks are then stored in a distributed fashion. The 64 MB block may end up on one DataNode while the 16 MB block lands on another.
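The splitting rule above is easy to sketch in code. This is only an illustration of the arithmetic, not the real HDFS client (which splits the stream as it writes); the 64 MB figure is the Hadoop 1.x default mentioned above.

```python
BLOCK_SIZE = 64  # MB, the Hadoop 1.x default block size

def split_into_blocks(file_size_mb, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file would be split into."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks

print(split_into_blocks(80))  # [64, 16]
```

An 80 MB file becomes one full 64 MB block plus one 16 MB tail block, exactly as in the example above.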

The second consideration is the safety of the data (preventing data loss).

Distributed systems usually keep backups. For our 80 MB file, there may be several copies of the 64 MB block and several copies of the 16 MB block. The DataNodes can communicate with each other to replicate blocks, and they report to the NameNode how many replicas exist. Even if one DataNode goes bad, the data is still available from its backups.
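A minimal sketch of replica placement, assuming a replication factor of 3 and simple round-robin assignment. Real HDFS uses rack-aware placement, which is more sophisticated than this; the node names are made up for illustration.

```python
import itertools

def place_replicas(blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes, round-robin.
    (Illustration only; real HDFS placement is rack-aware.)"""
    rotation = itertools.cycle(range(len(datanodes)))
    placement = {}
    for block in blocks:
        start = next(rotation)
        placement[block] = [datanodes[(start + i) % len(datanodes)]
                            for i in range(replication)]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]
print(place_replicas(["blk_64M", "blk_16M"], nodes))
```

The key property is that each block's replicas land on distinct DataNodes, so losing any single node leaves at least two copies of every block.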

The third consideration is access to the data (communication between client and server).

How does a client communicate with the distributed system?

A running HDFS cluster has three kinds of processes: the DataNode, the NameNode, and the SecondaryNameNode.

    Client ---- Step 1 (RPC communication) ----> NameNode
       |
       +---- Step 2 (read/write streams) ----> DataNode1   DataNode2   DataNode3

To read a file, the client first communicates with the NameNode, which holds the metadata. The NameNode performs its checks and then tells the client which DataNodes to contact. The client reads and writes through streams, and closes the stream when it is finished.
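The two-step read path can be modeled with a toy NameNode and client. All class names, method names, paths, and block IDs here are illustrative stand-ins, not the real HDFS API.

```python
class NameNode:
    """Toy NameNode: holds only the metadata, never the file contents."""
    def __init__(self):
        # metadata: file path -> list of (block_id, datanode) pairs
        self.metadata = {"/logs/a.txt": [("blk_1", "dn1"), ("blk_2", "dn3")]}

    def get_block_locations(self, path):
        if path not in self.metadata:
            raise FileNotFoundError(path)
        return self.metadata[path]

class Client:
    def __init__(self, namenode):
        self.namenode = namenode

    def read(self, path):
        # Step 1: RPC to the NameNode for the block locations.
        locations = self.namenode.get_block_locations(path)
        # Step 2: stream each block from its DataNode (simulated here).
        return [f"data of {blk} from {dn}" for blk, dn in locations]

print(Client(NameNode()).read("/logs/a.txt"))
```

The point of the split is that the NameNode only answers the metadata question ("where are the blocks?"); the bulk data itself flows directly between the client and the DataNodes.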

The SecondaryNameNode is a cold backup: it keeps a copy of the NameNode's metadata. If the NameNode crashes, the entire distributed system is down, so there must be a remedy.

How does the SecondaryNameNode provide that remedy for the NameNode?

First of all, the NameNode keeps its metadata in memory, where it would be lost on a power failure, so it must also be saved to disk. This is done through a serialization mechanism that writes the in-memory state out to disk. The serialized image file is called the fsimage file.
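The memory-to-disk serialization idea can be shown in a few lines. Here Python's pickle stands in for Hadoop's own image format, and the metadata dictionary is a made-up example; the point is only the round trip: serialize to a file, then restore it after a restart.

```python
import os
import pickle
import tempfile

# In-memory metadata on the (toy) NameNode: path -> block list.
metadata = {"/logs/a.txt": ["blk_1", "blk_2"]}

# Serialize memory -> disk; this file plays the role of fsimage.
fsimage_path = os.path.join(tempfile.mkdtemp(), "fsimage")
with open(fsimage_path, "wb") as f:
    pickle.dump(metadata, f)

# After a restart: deserialize disk -> memory.
with open(fsimage_path, "rb") as f:
    restored = pickle.load(f)

print(restored == metadata)  # True
```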


    Client ----(send request)----> NameNode
           <---(check that the file exists and permissions suffice,
                then return the target DataNodes to the client)
       |
       |  write the data to the DataNodes
       v
    DataNode1   DataNode2   DataNode3


As the client writes data to the DataNodes, an edits file on the NameNode starts recording the operations: the file name, which DataNodes store it, the block sizes, and any failures are all logged. Immediately afterwards, the in-memory metadata on the NameNode also gains a description of the new data. At this point the write has been recorded, but it has not yet been synchronized into fsimage (fsimage is not updated in real time).

Suppose that a month ago we uploaded 2 files to HDFS. The SecondaryNameNode merges when a certain condition is met (the edits file reaching a certain size, or a certain amount of time passing); over a month it has merged many times, so everything is synchronized. The in-memory metadata now holds 2 descriptions, fsimage holds the same 2 descriptions (already synchronized), and the edits file is empty, because once metadata has been merged into fsimage the edits file is cleared.

Now the client uploads one more file to HDFS. The edits file gains 1 record, the in-memory metadata grows to 3 descriptions, but fsimage still holds only 2: the metadata and fsimage are out of sync. When the checkpoint condition is met, the SecondaryNameNode starts to work:

1. It downloads the fsimage and edits files from the NameNode over HTTP. The NameNode switches to a new edits.new file, so any reads or writes that happen during the checkpoint are recorded there.

2. The SecondaryNameNode merges fsimage and edits into a new fsimage and sends it back to the NameNode.

3. The NameNode discards the old fsimage and edits files and replaces edits with the edits.new file.

With that, the metadata is synchronized again.
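The merge step is essentially "replay the edits log onto the last fsimage". Here is a toy version of that checkpoint, with simplified stand-in structures (a dict for fsimage, a list of operations for edits); the real files have their own binary formats.

```python
def checkpoint(fsimage, edits):
    """Replay the edits log onto fsimage; return (new_fsimage, empty_edits)."""
    new_image = dict(fsimage)
    for op, path, blocks in edits:
        if op == "add":
            new_image[path] = blocks
        elif op == "delete":
            new_image.pop(path, None)
    # After a successful merge the edits log is emptied.
    return new_image, []

fsimage = {"/a.txt": ["blk_1"], "/b.txt": ["blk_2"]}   # 2 files, synced
edits = [("add", "/c.txt", ["blk_3"])]                 # 1 new upload logged

fsimage, edits = checkpoint(fsimage, edits)
print(len(fsimage), edits)  # 3 []
```

After the checkpoint, the new fsimage describes all 3 files and the edits log is empty, matching the synchronized state described above.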

HDFS is suited to write-once, read-many workloads; it does not support concurrent writes, and it is not appropriate for large numbers of small files.

This article is from the "Jane Answers Life" blog; please contact the author before reproducing it.

