Initial knowledge of the HDFS system of Hadoop

Source: Internet
Author: User

HDFS is a distributed file system that uses a master/slave architecture to manage large volumes of data. An HDFS cluster consists of a single NameNode and a number of DataNodes. The NameNode is a central server that manages the file system namespace and schedules work across the cluster, while the DataNodes are the worker nodes that store the data and carry out the actual read and write requests.

HDFS handles files in blocks as its basic unit. The default block size is 64 MB, which developers can configure as needed. The HDFS client divides a file into blocks and, guided by the NameNode's placement decisions, stores each block on appropriate DataNodes. By default, each block also has two backup replicas: one is stored on a node in the same rack as the original, and the other on a node in a different rack. This ensures the data is not lost if an entire rack goes down, while still allowing efficient reads.
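The block-splitting step can be sketched as follows. This is an illustrative simplification, not the actual HDFS client code; the `split_into_blocks` helper and its return format are assumptions made for this example.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default block size

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_index, block_length) pairs for a file of file_size bytes.

    Every block is full-sized except possibly the last one.
    """
    blocks = []
    offset = 0
    index = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((index, length))
        offset += length
        index += 1
    return blocks

# A 128 MB file splits into exactly two full 64 MB blocks.
print(split_into_blocks(128 * 1024 * 1024))
```

Note that a file smaller than the block size still occupies only its actual length on disk; the 64 MB figure is an upper bound per block, not a minimum allocation.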

NameNode: acts as the master; it manages and schedules the DataNodes and stores metadata such as block locations and the edit log.
DataNode: acts as the slave; it is responsible for storing file data in block form.

Write operations in HDFS:
Suppose a 128 MB file is to be stored in the cluster. There are two racks, and the block size is left at its default.

    1. The client first divides the file into two 64 MB blocks, Block1 and Block2.
    2. The client sends a write request to the NameNode.
    3. The NameNode queries for available DataNodes and returns their information to the client.
    4. The client writes Block1 to the DataNodes.
      The process of writing to the DataNodes is as follows:
    5. The 64 MB block is divided into 64 KB packets.
    6. The client sends Package1 to DataNode1 (hereinafter referred to as DN).
    7. DN1 receives Package1 and forwards it to DN2, while the client sends Package2 to DN1.
    8. DN2 receives Package1 and forwards it to DN3, while receiving Package2 from DN1.
    9. And so on, until Block1 has been fully written; a confirmation signal is returned to the client, and the same process is then repeated for Block2.
      Note: If a DN fails during the write process, that DN's write stream is closed and the remaining packets continue to be transmitted to the remaining DNs.
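The overlap in steps 6-8 is the key idea: while DN1 forwards one packet downstream, the client is already sending it the next. The sketch below models that schedule abstractly; the node names and the one-packet-per-step timing are simplifying assumptions, not the real HDFS transfer protocol.

```python
def pipeline_schedule(num_packets, nodes):
    """For each time step, report which packet (1-based) arrives at each
    node, or None if that node is idle at that step.

    Packet p reaches node i (0-based) at step p + i - 1, so all three
    DNs are busy with consecutive packets at the same time.
    """
    steps = []
    total_steps = num_packets + len(nodes) - 1
    for t in range(total_steps):
        row = {}
        for i, dn in enumerate(nodes):
            pkt = t - i + 1  # which packet arrives at this node now
            row[dn] = pkt if 1 <= pkt <= num_packets else None
        steps.append(row)
    return steps

# Three packets through a three-node pipeline: at step 3, DN1 is
# receiving packet 3 while DN2 handles packet 2 and DN3 handles packet 1.
for t, row in enumerate(pipeline_schedule(3, ["DN1", "DN2", "DN3"]), 1):
    print(t, row)
```

The payoff of pipelining is visible in the step count: n packets through k nodes take n + k - 1 steps instead of the n * k steps a store-and-forward scheme would need.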

Read operations in HDFS:
The configuration is the same as above.
1. The client sends a read request to the NameNode (hereinafter referred to as NN).
2. The NN returns a partial or complete block list for the file; for each block, the NN returns the addresses of the DataNodes holding its replicas.
3. The client selects the nearest DN and reads the block from it, closes the connection to that DN after reading the block's data, and then looks for the best DN holding the next block.
4. If the file has not been fully read once the block list is exhausted, the client asks the NN for the next batch of blocks.
Note: If an error occurs while reading from the current DN, the client notifies the NN, which returns the backup node information for that block so the client can read the block from a replica.
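The read path above, including the fallback to a replica on failure, can be sketched like this. The block-list shape and the `read_from_dn` callback are assumptions for illustration; a real client would also report the failed DN back to the NN, which is only noted in a comment here.

```python
def read_file(block_list, read_from_dn):
    """Read a file block by block, preferring the nearest replica.

    block_list: list of (block_id, [replica DNs, nearest first]), as
        returned by the NN in step 2 above.
    read_from_dn(dn, block_id): returns the block's bytes or raises
        IOError if that DN cannot serve the block.
    """
    data = []
    for block_id, replicas in block_list:
        for dn in replicas:  # nearest replica first
            try:
                data.append(read_from_dn(dn, block_id))
                break
            except IOError:
                continue  # real HDFS also notifies the NN; here, just try the next replica
        else:
            raise IOError("no readable replica for block %d" % block_id)
    return b"".join(data)

# Usage: DN1 fails for block 1, so the client falls back to the replica on DN2.
def fake_read(dn, block_id):
    if dn == "DN1" and block_id == 1:
        raise IOError("DN1 is down")
    return ("%s:%d;" % (dn, block_id)).encode()

print(read_file([(1, ["DN1", "DN2"]), (2, ["DN3"])], fake_read))
```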

Limitations of HDFS
1. Because the NN keeps the file system's metadata in memory, the number of files the system can hold is limited by the NN's memory. HDFS is therefore a poor fit when clients need to store a very large number of small files.
2. Because HDFS writes data as a stream, a file has a single writer at a time; concurrent writes are not supported, so HDFS is not suitable for multiple users writing to the same file.
3. HDFS is a storage system designed for large data sets and high throughput rather than fast individual operations, which leads to higher data-access latency; it is not recommended for users who need low-latency data access.
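The small-files limitation in point 1 is easy to quantify with a back-of-envelope estimate. The ~150 bytes of NameNode heap per file or block object is a commonly cited rule of thumb, not an exact figure, and the helper below is purely illustrative.

```python
BYTES_PER_OBJECT = 150  # rough NameNode heap cost per file or block object

def namenode_memory_mb(num_files, blocks_per_file=1):
    """Approximate NN heap (in MB) consumed by file and block metadata."""
    objects = num_files * (1 + blocks_per_file)  # one inode plus its blocks
    return objects * BYTES_PER_OBJECT / (1024 * 1024)

# The same ~10 TB of data as 10 million 1 MB files versus
# 10,000 files of ~1 GB (about 16 blocks each at 64 MB):
print(round(namenode_memory_mb(10_000_000)))                   # several GB of heap
print(round(namenode_memory_mb(10_000, blocks_per_file=16)))   # a few dozen MB
```

The two layouts hold roughly the same amount of data, yet the small-file layout costs the NameNode over a hundred times more memory, which is why HDFS favors a modest number of large files.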
