Data read process:
The client contacts the NameNode, specifying the file it wants to read
The client's identity is verified in one of two ways:
By trusting the client, which supplies its own user name
Through a mandatory authentication mechanism such as Kerberos
If the file exists, the NameNode checks the file's owner and its access permissions to confirm that the user may read it. The NameNode then returns the ID of the file's first block, along with the list of DataNodes that hold replicas of that block (the list is sorted by the distance between each DataNode and the client, where distance is computed from the rack topology of the Hadoop cluster)
Using the block ID and DataNode hostnames, the client connects to the most suitable DataNode and reads the required data blocks, until all blocks have been read or the client closes the file stream
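The replica sorting mentioned above can be sketched as follows. This is a simplified model, not Hadoop's actual implementation: it assumes each node's position is given as a topology path like "/dc1/rack1/host1", and counts tree hops between nodes (0 for the same node, 2 within a rack, 4 across racks).

```python
# Minimal sketch of rack-aware replica sorting (illustrative, not Hadoop's code).

def distance(path_a: str, path_b: str) -> int:
    """Number of hops between two nodes in the rack topology tree."""
    a = path_a.strip("/").split("/")
    b = path_b.strip("/").split("/")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # Hops up from a to the common ancestor, plus hops down to b.
    return (len(a) - common) + (len(b) - common)

def sort_replicas(client: str, replicas: list[str]) -> list[str]:
    """Order replica locations from nearest to farthest from the client."""
    return sorted(replicas, key=lambda r: distance(client, r))

replicas = ["/dc1/rack2/dn3", "/dc1/rack1/dn1", "/dc2/rack1/dn7"]
print(sort_replicas("/dc1/rack1/client", replicas))
# nearest first: same rack, then same datacenter, then remote datacenter
```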
Exception conditions:
If a process or host fails while data is being read from a DataNode, the read does not stop: the HDFS client library automatically tries to read the data from another DataNode that holds a replica. If no replica is reachable, the read operation fails and the client receives an exception
The block location information returned by the NameNode may have become stale by the time the client tries to read from a DataNode. If other DataNodes hold replicas of the block, the client tries to read from them; otherwise the read operation fails
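The client-side failover described in both exception cases can be sketched like this: try each DataNode holding a replica in order, and fail only when every replica is exhausted. The function names (`read_block`, `fetch`) are illustrative, not a real HDFS API.

```python
# Hypothetical sketch of client-side read failover across replicas.

def read_block(block_id, datanodes, fetch):
    """fetch(node, block_id) returns the block bytes or raises IOError."""
    dead = []
    for node in datanodes:
        try:
            return fetch(node, block_id)
        except IOError:
            dead.append(node)   # remember the failed replica and try the next
    raise IOError(f"block {block_id}: all replicas unreachable: {dead}")

def flaky_fetch(node, block_id):
    """Stand-in for a network read; dn1 is pretend-dead."""
    if node == "dn1":
        raise IOError("dn1 down")
    return b"block-data"

print(read_block("blk_1", ["dn1", "dn2"], flaky_fetch))  # b'block-data'
```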
Data write process:
The client opens a file for writing through the Hadoop FileSystem API; if the user has sufficient privileges, the NameNode is asked to create the metadata for the file
The client receives an "open file success" response
The client writes data to the stream; the data is automatically split into packets, which are held in an in-memory queue
A separate client thread reads packets from the queue and asks the NameNode for a list of DataNodes on which to write the replicas of the data block
The client connects directly to the first DataNode in the list; the first DataNode connects to the second, and the second to the third, forming the replication pipeline for the block
Packets are streamed to the first DataNode, which writes them to its disk and forwards them to the next DataNode in the pipeline, which writes them to its disk, and so on
Each DataNode in the replication pipeline confirms that the packets it receives have been successfully written to disk
The client maintains a list of packets that have not yet been acknowledged; each acknowledgment it receives tells it that the packet was successfully written by a DataNode in the pipeline
When the block is full, the client asks the NameNode for the next set of DataNodes
The client writes all remaining packets, closes the data stream, and notifies the NameNode that the write operation is complete
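The steps above can be simulated in a few lines: data is split into packets, queued, streamed through a pipeline of DataNodes (modelled here as plain in-memory lists), and removed from the acknowledgment queue once every node has confirmed the write. This is a single-threaded toy model, not the real multi-threaded client; the packet size and names are assumptions.

```python
# Simplified single-threaded simulation of the HDFS write pipeline.
from collections import deque

PACKET_SIZE = 4  # toy value; real HDFS packets are ~64 KB

def write_file(data: bytes, pipeline: list) -> int:
    """Split data into packets and push each through the whole pipeline.

    Returns the number of packets acknowledged by all nodes."""
    data_queue = deque(data[i:i + PACKET_SIZE]
                       for i in range(0, len(data), PACKET_SIZE))
    ack_queue = deque()
    acked = 0
    while data_queue:
        pkt = data_queue.popleft()
        ack_queue.append(pkt)      # track the packet until it is confirmed
        for node in pipeline:
            node.append(pkt)       # stands in for "write packet to disk"
        ack_queue.popleft()        # every node confirmed: drop the ack entry
        acked += 1
    return acked

dn1, dn2, dn3 = [], [], []
n = write_file(b"hello hdfs!", [dn1, dn2, dn3])
print(n, dn1 == dn2 == dn3)  # 3 packets, identical replicas on all nodes
```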
Exception conditions:
If one of the DataNodes in the replication pipeline cannot write data to disk, the pipeline is closed immediately. Packets that have been sent but not yet acknowledged are rolled back into the queue, so that the DataNodes downstream of the failed node still receive them. On the remaining healthy DataNodes, the block being written is given a new ID, so that when the failed DataNode recovers, its stale copy of the block is automatically discarded. A new replication pipeline is then opened with the remaining nodes, and the write continues until the file is closed
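The recovery steps just described can be sketched as one function: roll unacknowledged packets back to the front of the data queue, drop the failed node, and assign the block a new identifier before resuming on the survivors. Everything here (the `recover` name, the `"_recovered"` suffix standing in for a fresh block ID) is illustrative, not the actual HDFS mechanism.

```python
# Hedged sketch of pipeline-failure recovery during an HDFS write.
from collections import deque

def recover(data_queue: deque, ack_queue: deque, pipeline: list,
            failed_node, block_id: str):
    """Return the surviving pipeline and a new block ID after a node fails."""
    # Roll unacknowledged packets back so downstream nodes still receive them.
    while ack_queue:
        data_queue.appendleft(ack_queue.pop())
    survivors = [n for n in pipeline if n != failed_node]
    new_block_id = block_id + "_recovered"  # placeholder for a fresh block ID
    return survivors, new_block_id

dq = deque([b"p3"])            # packets not yet sent
aq = deque([b"p1", b"p2"])     # sent but unacknowledged when dn2 failed
nodes, blk = recover(dq, aq, ["dn1", "dn2", "dn3"], "dn2", "blk_42")
print(list(dq), nodes, blk)
```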
This article is from the "Lucas" blog; please keep this source: http://4292565.blog.51cto.com/4282565/1672863
HDFS read/write process