HDFS read/write process


Data read process:

    1. The client contacts the NameNode, specifying the file it needs to read

    2. The client's identity is verified, either:

      1. By trusting the client, which simply declares its user name

      2. Through a strong authentication mechanism such as Kerberos

    3. The NameNode checks the file's owner and its access permissions. If the file exists and the user may read it, the NameNode returns the ID of the file's first block along with the list of DataNodes holding replicas of that block (the list is sorted by the distance between each DataNode and the client, computed from the Hadoop cluster's rack topology)

    4. Using the block ID and DataNode hostnames, the client connects to the most suitable DataNode and reads each data block in turn, until all blocks have been read or the client closes the file stream
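The read steps above can be sketched in a few lines of Python, using plain dictionaries to stand in for the NameNode's metadata and the DataNodes' disks. All names here (`get_block_locations`, `read_file`) are illustrative, not the real Hadoop client API.

```python
def get_block_locations(namenode_meta, path):
    # Step 3: the NameNode returns, per block, the DataNodes holding a
    # replica, sorted closest-first by rack-topology distance.
    if path not in namenode_meta:
        raise FileNotFoundError(path)
    return namenode_meta[path]

def read_file(namenode_meta, datanodes, path):
    # Step 4: read each block from the first (closest) DataNode listed.
    chunks = []
    for block_id, replica_names in get_block_locations(namenode_meta, path):
        closest = replica_names[0]
        chunks.append(datanodes[closest][block_id])
    return b"".join(chunks)
```

The key point the sketch captures is that the NameNode only hands out metadata; the actual bytes always flow directly between the client and the DataNodes.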


Exception conditions:

    1. A process or host fails while the client is reading data from a DataNode. The read does not stop: the HDFS client library automatically tries to read the data from another DataNode that holds a replica. Only if every replica is inaccessible does the read fail, and the client receives an exception

    2. The block location information returned by the NameNode has gone stale by the time the client tries to read from the DataNode. If other DataNodes hold replicas of the block, the client reads from them instead; otherwise the read fails
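Both exception cases reduce to the same client-side loop: try each replica in order and surface an error only when every copy fails. A minimal sketch, with illustrative data structures (a `None` entry stands for a crashed node, a missing block for stale location info):

```python
def read_block_with_failover(datanodes, block_id, replica_names):
    last_error = None
    for name in replica_names:
        node = datanodes.get(name)
        if node is None or block_id not in node:
            # Node is down (case 1) or its location info is stale (case 2);
            # fall through to the next replica.
            last_error = IOError(f"block {block_id} unavailable on {name}")
            continue
        return node[block_id]
    # All replicas failed: surface the error to the client.
    raise IOError(f"all replicas of block {block_id} failed") from last_error
```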


Data write process:

    1. The client calls the Hadoop FileSystem API to open a file for writing; if the user has sufficient privileges, the NameNode is asked to create the metadata entry for the file

    2. The client receives an "open file succeeded" response

    3. The client writes data to the stream; the data is automatically split into packets, which are buffered in an in-memory queue

    4. A separate client thread reads packets from the queue and asks the NameNode for a list of DataNodes on which to write the replicas of the data block

    5. The client connects directly to the first DataNode in the list; that DataNode connects to the second, and the second to the third, forming the replication pipeline for the block

    6. Each packet is streamed to the first DataNode's disk, then forwarded to the next DataNode in the pipeline, which writes it to its own disk, and so on

    7. Each DataNode in the replication pipeline acknowledges each packet once it has been successfully written to disk

    8. The client maintains a list of packets that have not yet been acknowledged; with each acknowledgement received, the client knows the data was successfully written by the DataNodes in the pipeline

    9. When the current block is full, the client asks the NameNode for the next set of DataNodes

    10. The client flushes all remaining packets, closes the data stream, and notifies the NameNode that the write is complete
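Steps 3 through 8 can be sketched as a single loop: split the data into packets, push each packet down the pipeline, and track which packets are still awaiting acknowledgement. The list-per-DataNode representation and the function name are illustrative only.

```python
from collections import deque

def write_block(data, pipeline, packet_size=4):
    # Step 3: split the stream into fixed-size packets.
    packets = [data[i:i + packet_size] for i in range(0, len(data), packet_size)]
    pending = deque()                     # step 8: unacknowledged packets
    for seq, packet in enumerate(packets):
        pending.append(seq)
        for disk in pipeline:             # steps 5-6: persist, then forward
            disk.append(packet)
        # Step 7: once the last node confirms the write, the ack travels
        # back up the pipeline and the packet leaves the pending list.
        pending.popleft()
    return len(packets), pending          # empty deque -> all packets confirmed
```

In the real pipeline the forwarding and acknowledgements are asynchronous, which is exactly why the pending queue in step 8 is needed; the sketch collapses that into a synchronous loop for clarity.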


Exception Condition:

If one of the DataNodes in the replication pipeline fails to write data to its disk, the pipeline is shut down immediately. Packets that have been sent but not yet acknowledged are pushed back onto the queue, so that the nodes downstream of the failed node do not miss any packets. On the remaining healthy DataNodes, the block being written is assigned a new ID, so that when the failed DataNode recovers, its partial copy of the block (now carrying a stale ID) is automatically discarded. A new replication pipeline consisting of the remaining nodes is then opened, and the write continues until the file is closed
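The recovery sequence above amounts to three bookkeeping actions on the client side. A hedged sketch, with illustrative names throughout (the real mechanism uses a generation stamp rather than a simple incremented ID):

```python
from collections import deque

def recover_pipeline(pipeline, failed_node, unacked, send_queue, block_id):
    # 1. Drop the failed node; the survivors form the new pipeline.
    healthy = [node for node in pipeline if node != failed_node]
    # 2. Roll unacknowledged packets back onto the front of the send
    #    queue, preserving their original order.
    for packet in reversed(unacked):
        send_queue.appendleft(packet)
    unacked.clear()
    # 3. Re-identify the in-flight block so the failed node's partial
    #    copy is recognisably stale and discarded when it recovers.
    new_block_id = block_id + 1
    return healthy, new_block_id
```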

This article is from the "Lucas" blog; please keep this source: http://4292565.blog.51cto.com/4282565/1672863

