HDFS append/hflush/read Design

Original source: https://issues.apache.org/jira/secure/attachment/12445209/appendDesign3.pdf

1. Design challenges

To support hflush, HDFS needs to make the last, still-open block of an unclosed file visible to all readers.

There are two challenges:

1. Read consistency. At any given time, different replicas of the last block may contain different bytes. How does HDFS provide a consistent view of the data to readers, and how does it preserve that consistency when errors occur?

2. Data durability. When an error occurs, recovery cannot simply discard the last block; at a minimum, the bytes that have been hflushed must survive recovery to preserve read consistency.

2. Replica/block states

In this document, a block stored on a datanode is called a replica, while the term block is reserved for a block as seen by the namenode, to distinguish the two.

2.1 Need for new states

On a pre-append/hflush datanode, a replica was either finalized or temporary. When a replica was first created, it was in the temporary state. When the client had no more data to write to the replica, it sent a close request and the temporary replica became finalized. When a datanode restarted, its temporary replicas were deleted. This was acceptable before append/hflush, because HDFS made no durability promise for data still under construction. With append/hflush supported, it is no longer acceptable: HDFS needs to provide stronger durability for blocks being written. Therefore, some replicas under construction must survive a datanode restart.

2.2 Replica states (datanode)

On a datanode, this design introduces a replica-being-written (rbw) state, along with several other states for handling errors. In a datanode's memory, a replica is in one of the following states:

Finalized: a finalized replica has all of its bytes; no new bytes will be written to it unless the file is reopened for append. Its data and metadata match exactly, and all finalized replicas of the same block ID have identical content. The generation stamp (GS) of a finalized replica is not necessarily constant, however: error recovery may bump it.

Rbw (replica being written): a replica that has just been created, or reopened for append, is in the rbw state, and data is being written to it. An rbw replica is always part of the last block of an unclosed file. Its data on disk may not yet match its metadata, and replicas of the same block ID on other datanodes may hold more or fewer bytes. Some (though not necessarily all) of the bytes in an rbw replica are visible to readers. If a failure occurs, the data in an rbw replica is preserved as far as possible.

Rwr (replica waiting to be recovered): if a datanode dies or restarts, all of its rbw replicas become rwr. An rwr replica no longer belongs to any pipeline, so it accepts no new bytes. It either becomes stale, or, if the client has also died, it participates in lease recovery.

Rur (replica under recovery): when a lease expires and triggers replica recovery, the replica enters the rur state. See the lease recovery section for details.

Temporary: a temporary replica is also under construction, much like an rbw replica, but its data is not visible to readers. If construction fails or the datanode restarts, temporary replicas are deleted.

On a datanode's disk, each data directory has three subdirectories: current holds finalized replicas, tmp holds temporary replicas, and rbw holds rbw, rwr, and rur replicas. A replica created on behalf of a DFS client goes into the rbw directory; a replica created for block replication or cluster balancing goes into the tmp directory. Once a replica is finalized, it is moved to the current directory. When a datanode restarts, replicas in the tmp directory are deleted, replicas in the rbw directory are loaded as rwr, and replicas in the current directory are loaded as finalized.
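As a rough illustration (the class and method names here are hypothetical, not the actual DataNode code), the restart rule above can be written as a small Java sketch:

    import java.util.*;

    // Illustrative sketch only: models how a datanode classifies replicas
    // on restart, based on which data subdirectory they were found in.
    public class ReplicaRestartSketch {
        enum ReplicaState { FINALIZED, RBW, RWR, RUR, TEMPORARY }

        // Decide the in-memory state of a replica found on disk at startup.
        // Returns null when the replica should be deleted.
        static ReplicaState loadFromDirectory(String subdir) {
            switch (subdir) {
                case "current": return ReplicaState.FINALIZED; // finalized replicas survive
                case "rbw":     return ReplicaState.RWR;       // rbw/rwr/rur reload as rwr
                case "tmp":     return null;                   // temporary replicas are deleted
                default: throw new IllegalArgumentException(subdir);
            }
        }

        public static void main(String[] args) {
            for (String d : List.of("current", "rbw", "tmp")) {
                System.out.println(d + " -> " + loadFromDirectory(d));
            }
        }
    }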

During a datanode upgrade, all replicas in the current and rbw directories must be preserved in the snapshot.

2.3 Block states (namenode)

The namenode likewise introduces new states for blocks. A block is in one of the following states:

UnderConstruction:

Once a block is created or reopened for append, it enters the under-construction state, and data is being written to it. It is always the last block of an unclosed file. Its length and GS are not yet finalized, and some (not all) of its data is visible to readers. An under-construction block tracks its write pipeline (i.e., the locations of the valid rbw replicas), and the locations of the rwr replicas in case the client dies.

UnderRecovery:

When the lease of a file expires while its last block is under construction, that block changes to the under-recovery state once block recovery starts.

Committed:

A committed block has had all of its bytes written and its GS finalized, but the namenode has not yet received a finalized replica whose GS/length match. No new data will be written to the block and its GS will not be bumped (unless the file is reopened for append). To serve read requests, a committed block still tracks the locations of its rbw replicas, and it must keep track of its GS and length. The last block of an unclosed file is committed when the client asks the namenode to add a new block or to close the file. A file cannot be closed while its last block is still committed; the addBlock and close operations must wait until the last block becomes complete.

Complete:

A complete block has a finalized GS and length, and the namenode has received a finalized replica matching that GS/length. A complete block keeps only the locations of its replicas. A file can be closed only when all of its blocks are complete.

Unlike replica states, block states are not persisted to disk. When the namenode restarts, the last block of every unclosed file is therefore treated as under construction, and all other blocks as complete.
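A minimal sketch of this restart rule (illustrative names, not the real NameNode code):

    // Illustrative sketch only: the rule for reconstructing block states when
    // the namenode restarts, since block states are never persisted.
    public class BlockStateRestartSketch {
        enum BlockState { UNDER_CONSTRUCTION, UNDER_RECOVERY, COMMITTED, COMPLETE }

        // isLastBlock: whether this is the last block of its file;
        // fileClosed: whether the file was closed before the restart.
        static BlockState stateAfterRestart(boolean isLastBlock, boolean fileClosed) {
            if (!fileClosed && isLastBlock) {
                return BlockState.UNDER_CONSTRUCTION; // last block of an unclosed file
            }
            return BlockState.COMPLETE;               // everything else loads as complete
        }

        public static void main(String[] args) {
            System.out.println(stateAfterRestart(true, false));  // UNDER_CONSTRUCTION
            System.out.println(stateAfterRestart(false, false)); // COMPLETE
            System.out.println(stateAfterRestart(true, true));   // COMPLETE
        }
    }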

The rest of this article covers replica and block states in more detail. The state transition diagrams for replicas and blocks are in the last section.

3. Write/hflush
3.1 Block construction pipeline

An HDFS file consists of multiple blocks, and each block is constructed through a write pipeline. Data is pushed through the pipeline in units of packets. When no error occurs, a block's construction goes through three stages. The original document illustrates this with a figure of a pipeline of three datanodes (DN) and a block of five packets, in which different line styles distinguish the data flow, the ack flow, and the control messages (setup/close).

From t0 to t1 is the pipeline setup stage; t1 to t2 is the data streaming stage, where t1 marks the sending of the first packet and t2 the receipt of the ack for the last packet; t2 to t3 is the close stage.

Stage 1: set up a pipeline

The client sends a Write_Block request down the pipeline. After the last datanode receives the request, an ack travels back up the pipeline to the client. The result of setup is that all the network connections required by the pipeline are established, and each datanode has created or opened a replica for writing.

Stage 2: data streaming

User data is first buffered at the client. Once a packet is filled, the data is pushed into the pipeline. The next packet can be pushed before the ack for the previous packet arrives; the number of outstanding packets is bounded by the client's outstanding-packet window size. If the user application explicitly calls hflush, the current packet is pushed into the pipeline without waiting for it to fill. Hflush is a synchronous operation: no further data is written until the flushed packet has been acknowledged.

Stage 3: close (finalize the block and shut down the pipeline)

After receiving acks for all packets, the client sends a close request. This ensures that recovery after a data streaming failure never has to handle the case where some replicas are already finalized while others hold no data at all.
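The client's view of the three stages can be sketched as below; sendSetup, sendPacket, awaitAck, and sendClose are hypothetical helpers standing in for the real DFS client internals, and the window size is arbitrary:

    import java.util.*;

    // Illustrative sketch only: the client-side life cycle of one block.
    public class BlockWriteSketch {
        static final int WINDOW = 4; // max outstanding (unacked) packets

        public static void main(String[] args) {
            Deque<Integer> outstanding = new ArrayDeque<>();
            sendSetup();                          // stage 1: establish the pipeline
            for (int seq = 0; seq < 5; seq++) {   // stage 2: stream 5 packets
                while (outstanding.size() >= WINDOW) {
                    awaitAck(outstanding.removeFirst());
                }
                sendPacket(seq);
                outstanding.addLast(seq);
            }
            while (!outstanding.isEmpty()) {      // drain remaining acks
                awaitAck(outstanding.removeFirst());
            }
            sendClose();                          // stage 3: finalize and tear down
        }

        static void sendSetup()         { System.out.println("setup pipeline"); }
        static void sendPacket(int seq) { System.out.println("send packet " + seq); }
        static void awaitAck(int seq)   { System.out.println("ack packet " + seq); }
        static void sendClose()         { System.out.println("close pipeline"); }
    }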

3.2 Packet handling at a datanode

For each packet, a datanode in the pipeline performs three tasks:

1. Stream data: (a) receive data from the upstream datanode or the client; (b) if there is a downstream datanode, push the data to it.

2. Write the data and its CRC to the local disk file.

3. Stream ack: (a) receive the ack from the downstream datanode; (b) send an ack to the upstream datanode or the client (the last datanode in the pipeline originates the ack).

Note that the numbering above does not imply the order in which the three tasks execute. Streaming the ack (3) must happen after streaming the data (1). Writing the data to disk (2), however, can in principle happen at any time after 1.a. This design chooses to write data to disk after 1.b and before receiving the next packet.

Each datanode starts two threads per pipeline. The data thread handles data streaming and disk writes: for each packet it executes 1.a, 1.b, and 2 in sequence, and once a packet is flushed to disk it can be dropped from the in-memory buffer. The ack thread handles ack streaming: for each packet it executes 3.a and 3.b in sequence. Because the data thread and the ack thread run in parallel, no ordering between 2 and 3 is guaranteed: an ack may be sent upstream before the packet has been flushed to disk.

This algorithm balances write performance, data durability, and simplicity:

1. Data durability is improved: data is written to disk eagerly, before the next packet is received, rather than being buffered until an ack arrives.

2. Downstream data streaming, upstream ack streaming, and disk writes all proceed in parallel.

3. Buffer management is simple, because each pipeline holds at most one packet in memory. (A sketch of the two-thread structure follows.)
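A minimal sketch of the two-thread structure; this is illustrative, not the actual datanode code:

    import java.util.concurrent.*;

    // Illustrative sketch only: the per-pipeline thread pair on one datanode.
    // The data thread performs 1.a (receive), 1.b (forward) and 2 (flush);
    // the ack thread performs 3.a (receive downstream ack) and 3.b (ack upstream).
    // NOTE: the queue below makes the ack wait for the flush purely to keep the
    // sketch simple; the design itself guarantees no ordering between 2 and 3.
    public class PacketThreadsSketch {
        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<Integer> flushed = new LinkedBlockingQueue<>();
            final int packets = 5;

            Thread dataThread = new Thread(() -> {
                for (int seq = 0; seq < packets; seq++) {
                    System.out.println("1.a receive packet " + seq);
                    System.out.println("1.b forward packet " + seq);
                    System.out.println("2   flush packet " + seq + " to disk");
                    flushed.add(seq); // the packet may now leave the memory buffer
                }
            });

            Thread ackThread = new Thread(() -> {
                try {
                    for (int n = 0; n < packets; n++) {
                        int seq = flushed.take();
                        System.out.println("3.a receive downstream ack " + seq);
                        System.out.println("3.b send upstream ack " + seq);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            dataThread.start();
            ackThread.start();
            dataThread.join();
            ackThread.join();
        }
    }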

3.3 Consistency support

When a client reads from an rbw replica, the datanode must not expose bytes that other datanodes in the pipeline may not yet have received.

Each rbw replica maintains two counters:

1. BA (bytes acknowledged): the number of bytes acknowledged by the downstream datanodes. The datanode makes these bytes visible to all readers. In the rest of this document we also call this the visible length of the replica.

2. BR (bytes received): the number of bytes received by this datanode, including both bytes written to the disk file and bytes still in the datanode's buffers.

Assume that every datanode in the pipeline starts with (BA, BR) = (a, a), and the client then pushes one packet of size b into the pipeline, with no further packets behind it:

1. After step 1.a, the datanode's (BA, BR) becomes (a, a + b).

2. After step 3.a, the datanode's (BA, BR) becomes (a + b, a + b).

3. After the ack is successfully delivered to the client, every datanode on the pipeline has (BA, BR) = (a + b, a + b).

In a pipeline of m datanodes DN0, DN1, ..., DNm-1, where DN0 is the first node of the pipeline and the one closest to the client, the following invariant holds at all times:

BA0 <= BA1 <= ... <= BAm-1 <= BRm-1 <= ... <= BR1 <= BR0
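These transitions and the invariant can be checked with a small sketch (illustrative only; run with java -ea so the asserts fire):

    // Illustrative sketch only: how (BA, BR) move as one packet of size b
    // traverses a 3-datanode pipeline.
    public class BaBrInvariantSketch {
        static long[] ba = {100, 100, 100}; // bytes acknowledged, DN0..DN2
        static long[] br = {100, 100, 100}; // bytes received, DN0..DN2

        public static void main(String[] args) {
            long b = 64 * 1024;             // packet size
            for (int i = 0; i <= 2; i++) {  // step 1.a runs downstream: DN0, DN1, DN2
                br[i] += b;
                checkInvariant();
            }
            for (int i = 2; i >= 0; i--) {  // acks run upstream: DN2, DN1, DN0
                ba[i] += b;                 // step 3.a (3.b at the last datanode)
                checkInvariant();
            }
        }

        // BA0 <= BA1 <= BA2 <= BR2 <= BR1 <= BR0 must hold at every moment.
        static void checkInvariant() {
            assert ba[0] <= ba[1] && ba[1] <= ba[2] : "BA must grow downstream";
            assert ba[2] <= br[2] : "max BA must not exceed min BR";
            assert br[2] <= br[1] && br[1] <= br[0] : "BR must shrink downstream";
            System.out.println(java.util.Arrays.toString(ba) + " / "
                    + java.util.Arrays.toString(br));
        }
    }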

4. Read

When reading from an unclosed file, the last block may still be under construction. Handling read consistency for such a block is a challenge: the algorithm must ensure that the result is consistent no matter which datanode's replica serves the read.

Algorithm 1:

  • When reading a block under construction, the DFS client first asks a datanode for the BA (visible length) of its replica of that block.
  • If the application tries to read bytes beyond this BA, the DFS client throws an EOFException.
  • A read request is sent to a datanode only when the requested position lies within the visible length of the last block. When a datanode receives a read request whose byte range lies within its replica's BR, it returns the requested bytes.
  • A read request is a triple (blk, off, len): blk contains the block ID and its GS, off is the offset within the block, and len is the number of bytes to read.
  • A datanode can serve the request when the GS of its replica is equal to or newer than the requested GS.
  • The sum of off and len must be no greater than the BA the client obtained.
  • Suppose the read request is sent to datanode i, whose replica state is (BAi, BRi):
    • If off + len <= BAi, DNi can safely send the len bytes starting at offset off to the DFS client.
    • If off + len > BAi, then the BA the client obtained came from some datanode j with off + len <= BAj, so BAj > BAi, which means DNi is upstream of DNj in the pipeline, i.e., closer to the writer client. Hence BRi >= BRj >= BAj >= off + len, so DNi must hold the bytes the DFS client wants, and it sends them.
    • In no case can off + len exceed BRi.
  • If DNi dies while serving the read, the DFS client switches to another datanode that holds a replica.
  • This algorithm is simple, but the file must be reopened to see new data: because the visible length is obtained before reading starts, the DFS client can never read past it (see the sketch below).
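A minimal sketch of Algorithm 1's client-side check, assuming the visible length (BA) has already been fetched; the class and method names are hypothetical:

    import java.io.EOFException;

    // Illustrative sketch only: Algorithm 1 fetches BA once, then refuses
    // reads that start past it and clips reads that extend past it.
    public class ReadAlgorithm1Sketch {
        final long visibleLength; // BA fetched from a datanode when the read began

        ReadAlgorithm1Sketch(long ba) { this.visibleLength = ba; }

        // Clip a read request against BA; reads starting beyond BA fail with
        // EOFException, as in the algorithm above.
        long readableBytes(long off, long len) throws EOFException {
            if (off >= visibleLength) {
                throw new EOFException("offset " + off
                        + " is beyond visible length " + visibleLength);
            }
            // Any datanode chosen to serve this has BRi >= BA >= off + result,
            // so it necessarily holds every byte the client may ask for.
            return Math.min(len, visibleLength - off);
        }

        public static void main(String[] args) throws EOFException {
            ReadAlgorithm1Sketch r = new ReadAlgorithm1Sketch(4096); // BA = 4096
            System.out.println(r.readableBytes(0, 1024));    // 1024
            System.out.println(r.readableBytes(4000, 1024)); // 96, clipped at BA
        }
    }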

Algorithm 2

  • This algorithm moves consistency control into the DFS client.
  • A read request is a triple (blk, off, len): blk contains the block ID and GS, off is the offset within the block, and len is the number of bytes to read.
  • A datanode can serve the request when the GS of its replica is equal to or newer than the requested GS.
  • If DNi's replica state is (BAi, BRi), DNi sends the client the bytes in the range [off, min(off + len, BRi)), together with its BAi.
  • The client buffers the received data, tracks the largest BA seen so far, and delivers to the application only the bytes within that BA.
  • If reading from DNi fails, the DFS client switches to another datanode.
  • Why is this algorithm consistent? Because the largest BA in the pipeline is never greater than the smallest BR.
  • This algorithm requires a change to the read protocol, and the DFS client becomes more complex because it enforces read consistency itself. In return, the file does not need to be reopened to read new data (see the sketch below).
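A minimal sketch of Algorithm 2's client-side bookkeeping; the class and its fields are hypothetical, standing in for the DFS client's buffer management:

    // Illustrative sketch only: Algorithm 2 lets datanodes return bytes up to
    // their BR along with their BA; the client releases to the application
    // only bytes below the largest BA it has seen.
    public class ReadAlgorithm2Sketch {
        long maxBaSeen = 0; // largest visible length reported by any datanode
        long buffered = 0;  // bytes buffered from datanode responses

        // One response: the datanode sent bytes up to 'br' and reported 'ba'.
        void onResponse(long ba, long br) {
            maxBaSeen = Math.max(maxBaSeen, ba);
            buffered = Math.max(buffered, br);
        }

        // Safe because max(BA) <= min(BR): every byte below maxBaSeen exists
        // on every replica, so switching datanodes stays consistent.
        long deliverableBytes() { return Math.min(buffered, maxBaSeen); }

        public static void main(String[] args) {
            ReadAlgorithm2Sketch c = new ReadAlgorithm2Sketch();
            c.onResponse(100, 160); // DN1: BA=100, sent bytes up to BR=160
            c.onResponse(120, 140); // DN2: BA=120, sent bytes up to BR=140
            System.out.println(c.deliverableBytes()); // 120
        }
    }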
5. Append
5.1 Append API support

1. The client sends an append request to the namenode.

2. The namenode checks that the file is closed. It then examines the file's last block. If that block is not full but has no replica, the append fails. Otherwise, the file becomes under construction: if the last block is full, the namenode allocates a new last block; if it is not full, the namenode changes the block's state to under construction and builds the pipeline from its finalized replicas. The namenode returns the block ID, GS, length, and replica locations; if the last block is not full, a new GS is returned as well.

3. The pipeline is set up; see the pipeline setup section for details.

4. If the end of the last block does not fall on a checksum chunk boundary, the partial chunk must be read back and its checksum recomputed, so that subsequent writes stay CRC-chunk aligned (see the sketch below).

5. Everything else proceeds as in a normal write.
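Step 4 is plain offset arithmetic. A small sketch, assuming HDFS's default 512 bytes per checksum chunk:

    // Illustrative sketch only: finding the partial CRC chunk that an append
    // must read back so checksums can be recomputed on chunk-aligned data.
    public class ChunkAlignSketch {
        static final int BYTES_PER_CHECKSUM = 512; // HDFS default chunk size

        public static void main(String[] args) {
            long blockLen = 1234;                         // current end of last block
            long chunkStart = blockLen - blockLen % BYTES_PER_CHECKSUM;
            long partial = blockLen - chunkStart;         // bytes to read back
            System.out.println("re-read " + partial + " bytes from offset " + chunkStart);
            // prints: re-read 210 bytes from offset 1024
        }
    }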

5.2 Durability support
  • Make sure that the number of replicas holding the pre-append data satisfies the file's replication factor.
  • Beyond this, durability of the pre-append data in a block under construction is not addressed by the current design.
6. Error Handling
6.1 Pipeline recovery

While a block is under construction, errors can occur during any of the three stages: stage 1, while the pipeline is being set up; stage 2, while data is streaming through the pipeline; and stage 3, while the pipeline is being closed. Pipeline recovery handles datanode failures in each of these stages.

6.1.1 Recovery from pipeline setup failure

If a datanode detects a failure during pipeline setup, it reports the failure to its upstream node or to the client, and closes its block file and all TCP/IP connections. Once the client detects the failure, its response depends on why the pipeline was being set up:

  • If the pipeline was being set up to create a new block, the client simply abandons the block, requests a new block from the namenode, and sets up a pipeline for the new block.
  • If the pipeline was being set up to append to a block, the client rebuilds the pipeline from the remaining datanodes and bumps the block's GS. See section 7 for details.

A special case of pipeline setup failure is an expired access token: a datanode rejects the setup because the client's access token is invalid. If setup fails because the access token expired, the DFS client should rebuild the pipeline with all the datanodes of the previous pipeline after obtaining a new access token. The current version (0.21) avoids this failure case by fetching a new access token, which is the approach this design adopts.

6.1.2 Recovery from data streaming failure
  • An error at a datanode can occur during any of steps 1.a, 1.b, 2, 3.a, or 3.b. Whenever it occurs, the datanode removes itself from the write pipeline: it closes all TCP/IP connections, and if the error occurred at step 3, it writes all buffered data to disk and closes the block file.
  • When the DFS client detects the failure, it stops sending data to the pipeline.
  • The DFS client rebuilds a write pipeline from the remaining datanodes; see section 7 for details. After the rebuild, every replica of the block gets a new GS.
  • The DFS client resends data starting from BAc (the client's acknowledged byte count) under the new GS. This can be further optimized to resend only the bytes starting from min(BRi) over the remaining datanodes.
  • When a datanode receives a packet whose bytes it already has, it simply forwards the packet downstream without writing it to disk again.

This recovery algorithm has a useful property: any byte that was visible to a reader, even one served by the most up-to-date datanode of the old pipeline, remains visible after pipeline recovery. This is because pipeline recovery never decreases a datanode's BA or BR.
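A small sketch of the resend offsets involved (hypothetical values; BAc and BRi as defined in section 3.3):

    import java.util.*;

    // Illustrative sketch only: after rebuilding the pipeline, the client
    // resends from its acknowledged offset BAc; the optimization resends
    // from min(BRi) over the surviving datanodes instead. Datanodes skip the
    // disk write for bytes they already hold, so BA and BR never decrease.
    public class PipelineRecoverySketch {
        public static void main(String[] args) {
            long baClient = 4096;              // bytes acked to the client (BAc)
            long[] brSurvivors = {6144, 5120}; // BR at the surviving datanodes

            long resendFrom = baClient;        // always safe
            long optimized = Arrays.stream(brSurvivors).min().getAsLong();

            // min(BRi) >= BAc by the pipeline invariant, so the optimized
            // offset never resends more than the safe one.
            System.out.println("safe resend offset:      " + resendFrom); // 4096
            System.out.println("optimized resend offset: " + optimized);  // 5120
        }
    }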

6.1.3 Recovery from a close failure

When the client detects a close failure, it rebuilds the pipeline with the remaining datanodes. Each datanode bumps the block's GS and finalizes its replica. After receiving the ack, the client tears down the network connections.

6.2 Datanode restart
  • When a datanode restarts, it loads every replica in its rbw directory into memory as a replica waiting to be recovered (rwr); each replica's length is set to the longest prefix of its data covered by a valid CRC.
  • A replica waiting to be recovered serves no read requests and joins no write pipeline.
  • An rwr replica eventually either becomes stale and is deleted by the namenode (the client is still alive and the pipeline has moved on without this datanode), or is finalized through lease recovery (the client has died).
6.3 Namenode restart
  • Block states are not persisted to disk, so when the namenode restarts, the state of each block must be reconstructed. The last block of every unclosed file becomes under construction, whatever its previous state; all other blocks become complete.
  • Each datanode's registration report covers its finalized, rbw, rwr, and rur replicas.
  • The namenode does not leave safemode until the number of complete and under-construction blocks reaches a predefined threshold. A block is counted as soon as one of its replicas has been reported.
6.4 Lease recovery

When a file's lease expires, the namenode must close the file on the client's behalf. Two problems arise: 1) concurrency control: what if lease recovery starts while the client is still alive, in the middle of pipeline setup, writing, closing, or recovery? And what if multiple lease recoveries run concurrently? 2) consistency: if the last block is under construction, all of its replicas must be rolled back to a consistent state, meaning the same on-disk length and the same new GS.

  1. The namenode renews the lease and changes the holder of the file's lease to the DFS itself, writing the change to the edit log. If the client is still alive, any of its write-related requests, such as asking for a new GS or closing the file, is now rejected because the client no longer holds the lease. This prevents the client from concurrently modifying the unclosed file while recovery proceeds.
  2. The namenode checks the states of the last two blocks of the file; all other blocks must be complete. The table below lists the possible state combinations and the action taken for each (the decision logic is also sketched after the table).

Second-to-last block | Last block        | Action
Complete             | Complete          | Close the file
Complete             | Committed         | Retry closing the file when the lease next expires; after a certain number of attempts, force the file closed
Committed            | Complete          | Retry closing the file when the lease next expires; after a certain number of attempts, force the file closed
Committed            | Committed         | Retry closing the file when the lease next expires; after a certain number of attempts, force the file closed
Complete             | Underconstruction | Start block recovery for the last block
Committed            | Underconstruction | Start block recovery for the last block
Complete             | Underrecovery     | Start a new block recovery for the last block; after a certain number of attempts, give up the recovery
Committed            | Underrecovery     | Start a new block recovery for the last block; after a certain number of attempts, give up the recovery
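The table can equally be read as a decision procedure; a sketch (illustrative names, not the real namenode code):

    // Illustrative sketch only: the lease-recovery decision table as code.
    public class LeaseRecoverySketch {
        enum S { COMPLETE, COMMITTED, UNDER_CONSTRUCTION, UNDER_RECOVERY }

        static String action(S secondToLast, S last) {
            switch (last) {
                case COMPLETE:
                    return secondToLast == S.COMPLETE
                        ? "close the file"
                        : "retry close at next lease expiration; force close after N tries";
                case COMMITTED:
                    return "retry close at next lease expiration; force close after N tries";
                case UNDER_CONSTRUCTION:
                    return "start block recovery for the last block";
                case UNDER_RECOVERY:
                    return "start a new block recovery; give up after N tries";
                default: throw new AssertionError();
            }
        }

        public static void main(String[] args) {
            System.out.println(action(S.COMPLETE, S.COMPLETE));
            System.out.println(action(S.COMMITTED, S.UNDER_CONSTRUCTION));
        }
    }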

6.5 Block recovery

  1. The namenode selects a primary datanode (PD) to coordinate the block recovery. The PD must be a datanode that holds a replica of the block; if no such datanode exists, the block recovery is abandoned.
  2. The namenode obtains a new GS, which will identify the block's generation after a successful recovery, and changes the state of the last block from under construction to under recovery.
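A minimal sketch of step 1's primary-datanode selection (illustrative; the real namenode may use additional criteria to pick the PD):

    import java.util.*;

    // Illustrative sketch only: step 1 of block recovery. The namenode picks
    // a primary datanode (PD) from those holding a replica of the block;
    // with no such datanode available, the recovery is abandoned.
    public class PrimarySelectionSketch {
        static Optional<String> choosePrimary(List<String> replicaHolders) {
            return replicaHolders.stream().findFirst(); // any holder will do here
        }

        public static void main(String[] args) {
            System.out.println(choosePrimary(List.of("dn1:50010", "dn2:50010")));
            // Optional[dn1:50010]
            System.out.println(choosePrimary(List.of()));
            // Optional.empty -> abandon the block recovery
        }
    }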
