Initial knowledge of the HDFS system of Hadoop

Source: Internet
Author: User

HDFS is a distributed file system that uses a master/slave architecture to manage large volumes of data. An HDFS cluster consists of a single NameNode and a number of DataNodes. The NameNode is a central server that manages the file system namespace and schedules work across the cluster, while the DataNodes are the worker nodes that store the data and carry out the actual read and write requests.

HDFS handles files in blocks as its basic unit. The default block size is 64 MB, which developers can configure as needed. The HDFS client divides a file into blocks and, guided by the NameNode's placement decisions, stores each block on appropriate DataNodes. By default, each block also has two backup replicas: one is stored on a node in the same rack as the original, and the other on a node in a different rack. This ensures the data is not lost if an entire rack goes down, while still allowing efficient reads.
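The block-splitting step can be sketched as follows. This is an illustrative simplification, not the actual HDFS client code; the `split_into_blocks` helper and its return format are assumptions made for this example.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default block size

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_index, block_length) pairs for a file of file_size bytes.

    Every block is full-sized except possibly the last one.
    """
    blocks = []
    offset = 0
    index = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((index, length))
        offset += length
        index += 1
    return blocks

# A 128 MB file splits into exactly two full 64 MB blocks.
print(split_into_blocks(128 * 1024 * 1024))
```

Note that a file smaller than the block size still occupies only its actual length on disk; the 64 MB figure is an upper bound per block, not a minimum allocation.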

NameNode: acts as the master; it manages and schedules the DataNodes and stores metadata such as block locations and the edit log.
DataNode: acts as the slave; it is responsible for storing file data in block form.

Write operations in HDFS:
Suppose a 128 MB file is to be stored in the cluster. There are two racks, and the block size is left at its default.

    1. The client first divides the file into two 64 MB blocks, Block1 and Block2.
    2. The client sends a write request to the NameNode.
    3. The NameNode queries for available DataNodes and returns their information to the client.
    4. The client writes Block1 to the DataNodes.
      The process of writing to the DataNodes is as follows:
    5. The 64 MB block is divided into 64 KB packets.
    6. The client sends Package1 to DataNode1 (hereinafter referred to as DN).
    7. DN1 receives Package1 and forwards it to DN2, while the client sends Package2 to DN1.
    8. DN2 receives Package1 and forwards it to DN3, while receiving Package2 from DN1.
    9. And so on, until Block1 has been fully written; a confirmation signal is returned to the client, and the same process is then repeated for Block2.
      Note: If a DN fails during the write process, that DN's write stream is closed and the remaining packets continue to be transmitted to the remaining DNs.
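The overlap in steps 6-8 is the key idea: while DN1 forwards one packet downstream, the client is already sending it the next. The sketch below models that schedule abstractly; the node names and the one-packet-per-step timing are simplifying assumptions, not the real HDFS transfer protocol.

```python
def pipeline_schedule(num_packets, nodes):
    """For each time step, report which packet (1-based) arrives at each
    node, or None if that node is idle at that step.

    Packet p reaches node i (0-based) at step p + i - 1, so all three
    DNs are busy with consecutive packets at the same time.
    """
    steps = []
    total_steps = num_packets + len(nodes) - 1
    for t in range(total_steps):
        row = {}
        for i, dn in enumerate(nodes):
            pkt = t - i + 1  # which packet arrives at this node now
            row[dn] = pkt if 1 <= pkt <= num_packets else None
        steps.append(row)
    return steps

# Three packets through a three-node pipeline: at step 3, DN1 is
# receiving packet 3 while DN2 handles packet 2 and DN3 handles packet 1.
for t, row in enumerate(pipeline_schedule(3, ["DN1", "DN2", "DN3"]), 1):
    print(t, row)
```

The payoff of pipelining is visible in the step count: n packets through k nodes take n + k - 1 steps instead of the n * k steps a store-and-forward scheme would need.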

Read operations in HDFS:
The configuration is the same as above.
1. The client sends a read request to the NameNode (hereinafter referred to as NN).
2. The NN returns a partial or complete block list for the file; for each block, the NN returns the addresses of the DataNodes holding its replicas.
3. The client selects the nearest DN and reads the block from it, closes the connection to that DN after reading the block's data, and then looks for the best DN holding the next block.
4. If the file has not been fully read once the block list is exhausted, the client asks the NN for the next batch of blocks.
Note: If an error occurs while reading from the current DN, the client notifies the NN, which returns the backup node information for that block so the client can read the block from a replica.
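The read path above, including the fallback to a replica on failure, can be sketched like this. The block-list shape and the `read_from_dn` callback are assumptions for illustration; a real client would also report the failed DN back to the NN, which is only noted in a comment here.

```python
def read_file(block_list, read_from_dn):
    """Read a file block by block, preferring the nearest replica.

    block_list: list of (block_id, [replica DNs, nearest first]), as
        returned by the NN in step 2 above.
    read_from_dn(dn, block_id): returns the block's bytes or raises
        IOError if that DN cannot serve the block.
    """
    data = []
    for block_id, replicas in block_list:
        for dn in replicas:  # nearest replica first
            try:
                data.append(read_from_dn(dn, block_id))
                break
            except IOError:
                continue  # real HDFS also notifies the NN; here, just try the next replica
        else:
            raise IOError("no readable replica for block %d" % block_id)
    return b"".join(data)

# Usage: DN1 fails for block 1, so the client falls back to the replica on DN2.
def fake_read(dn, block_id):
    if dn == "DN1" and block_id == 1:
        raise IOError("DN1 is down")
    return ("%s:%d;" % (dn, block_id)).encode()

print(read_file([(1, ["DN1", "DN2"]), (2, ["DN3"])], fake_read))
```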

Limitations of HDFS
1. Because the NN keeps the file system's metadata in memory, the number of files the system can hold is limited by the NN's memory. HDFS is therefore a poor fit when clients need to store a very large number of small files.
2. Because HDFS writes data as a stream, a file has a single writer at a time; concurrent writes are not supported, so HDFS is not suitable for multiple users writing to the same file.
3. HDFS is a storage system designed for large data sets and high throughput rather than fast individual operations, which leads to higher data-access latency; it is not recommended for users who need low-latency data access.
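The small-files limitation in point 1 is easy to quantify with a back-of-envelope estimate. The ~150 bytes of NameNode heap per file or block object is a commonly cited rule of thumb, not an exact figure, and the helper below is purely illustrative.

```python
BYTES_PER_OBJECT = 150  # rough NameNode heap cost per file or block object

def namenode_memory_mb(num_files, blocks_per_file=1):
    """Approximate NN heap (in MB) consumed by file and block metadata."""
    objects = num_files * (1 + blocks_per_file)  # one inode plus its blocks
    return objects * BYTES_PER_OBJECT / (1024 * 1024)

# The same ~10 TB of data as 10 million 1 MB files versus
# 10,000 files of ~1 GB (about 16 blocks each at 64 MB):
print(round(namenode_memory_mb(10_000_000)))                   # several GB of heap
print(round(namenode_memory_mb(10_000, blocks_per_file=16)))   # a few dozen MB
```

The two layouts hold roughly the same amount of data, yet the small-file layout costs the NameNode over a hundred times more memory, which is why HDFS favors a modest number of large files.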
