Pre-analysis of the Hadoop 3.0 Erasure Coding Feature


Objective

HDFS will also support the Erasure Coding feature, which will be released in Hadoop 3.0, as the diagram below shows:

This new feature is implemented in HDFS-7285. Since the feature is still far from the release stage, the code behind it may well be modified further, so this is just a so-called pre-analysis, meant to help you get an idea of how the Hadoop community is currently doing this work. I had not been exposed to erasure coding technology before, so the process along the way was a bit bumpy, but I believe this article can still bring you something of value.

Encountering Hadoop 3.0 Erasure Coding

My first motivation for looking into Erasure Coding was pure curiosity. I usually hang around the HDFS module of the Hadoop community, and I kept seeing issue summaries containing the words Erasure Coding. These tasks are generally sub-tasks under HDFS-8031, such as the one shown below:

It turns out that this is the phase 1 follow-up work for erasure coding. I then looked up the meaning of erasure coding online, which gave me the idea of writing this article. Since erasure coding is a technology in its own right, before studying Hadoop 3.0 Erasure Coding it is well worth learning about erasure coding itself.

Erasure Coding

Erasure coding, EC for short, is a data protection technology. It originated in the communications industry as a coding-based fault-tolerance technique for data recovery. It adds newly computed parity data to the original data so that all the pieces become correlated; when errors occur within a certain range, the data can be recovered through erasure coding. Here is a simple illustration: first there are n original data blocks, and then m parity blocks are computed from them, as shown below:

The parity part consists of the parity blocks. A row of blocks forms a stripe, so each stripe is composed of n data blocks and m parity blocks. Both the original data blocks and the parity blocks can be recovered from the existing blocks, as follows:

    • If an error occurs in a parity block, it is regenerated by encoding the original data blocks
    • If an error occurs in an original data block, it is regenerated by decoding with the help of the parity blocks

The values of m and n are not fixed and can be adjusted as needed. Some readers may wonder what the principle behind this is. It is actually quite simple: treat the figure above as a matrix. Because the matrix operations involved are invertible, the data can be restored; a standard matrix multiplication diagram makes it easy to relate the two.

As for the mathematical derivation involved, interested readers can look up the material on their own.
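For those who just want the shape of the idea, here is a small sketch in matrix form (the notation below is my own summary, not taken from the figures in this article): the n data blocks are multiplied by an (n+m) x n generator matrix G to produce the n data blocks plus the m parity blocks.

    % Encoding: the generator matrix G (identity rows on top, parity rows P
    % below) turns n data blocks into n + m coded blocks.
    \[
    G \, d =
    \begin{pmatrix} I_n \\ P \end{pmatrix}
    \begin{pmatrix} d_1 \\ \vdots \\ d_n \end{pmatrix}
    =
    \begin{pmatrix} d_1 \\ \vdots \\ d_n \\ p_1 \\ \vdots \\ p_m \end{pmatrix}
    \]
    % Recovery: if at most m of the n + m blocks are lost, delete the
    % corresponding rows of G. The remaining n x n submatrix G' is invertible
    % by construction (e.g. for Reed-Solomon codes), so
    \[
    d = (G')^{-1} \, c_{\mathrm{surviving}}
    \]

This is also why n and m can be tuned freely: any code whose surviving submatrix stays invertible will do.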

Advantages and disadvantages of Erasure coding technology

As a data protection technology, erasure coding naturally has many advantages. The first problem it solves: current distributed systems and cloud platforms generally use replication to prevent data loss. Replication does solve the data loss problem, but it inevitably multiplies the storage space consumed, which is very costly. Applying EC technology solves this problem directly.
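To put a number on the saving (using RS(6,3), i.e. n = 6 data blocks and m = 3 parity blocks, purely as a typical example; nothing in this article fixes n and m):

    % 3-way replication: every byte of data B is stored three times
    \[ \text{overhead}_{3\times\text{replication}} = \frac{3B}{B} - 1 = 200\% \]
    % RS(6,3): each stripe stores 6 data blocks plus 3 parity blocks
    \[ \text{overhead}_{RS(6,3)} = \frac{(6+3) \cdot B/6}{B} - 1 = 50\% \]

RS(6,3) can still tolerate the loss of any three blocks in a stripe, yet it needs only a quarter of the extra storage that 3-way replication requires.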

Disadvantages

The advantages of EC technology are obvious, but using it also comes at a cost: whenever data needs to be recovered, it consumes two kinds of resources heavily:

    • Network bandwidth, because data recovery needs to read the other data blocks and parity blocks
    • CPU, because the encoding and decoding computations are expensive

In a word, recovery consumes both network bandwidth and CPU, so the cost is not small. For this reason, using this technology for online services may not feel stable enough, and the best choice is a cold-data cluster. There are two reasons supporting this choice:

    • A cold-data cluster usually holds a large amount of data that has not been accessed for a long time, and its volume is really large; using EC technology can greatly reduce the number of replicas
    • A cold-data cluster is basically stable and consumes few resources, so when data recovery does happen, it will not have a big impact on the cluster

For these two reasons, a cold-data cluster is indeed a good choice.

The implementation of Erasure coding technology in Hadoop

It took quite a bit of space to introduce EC technology, and by now you should have a rough understanding of it. Now for the focus of this article: the Hadoop Erasure Coding implementation. As we all know, Hadoop is a mature distributed system built on a 3-replica strategy, so this technology matters a great deal to Hadoop itself. Given that the implementation details of EC in Hadoop are rather complex, I will not analyze the code line by line, but rather get the idea across in broad strokes.

Evolution of the EC concept in Hadoop

By EC concepts I mean how data blocks, parity blocks, stripes, and so on are mapped into HDFS, because any implementation of EC technology has to agree on these concepts at the very least.

    • A data block or parity block in HDFS is just an ordinary block
    • The stripe concept requires each block to be split up: each block consists of several cells of the same size, and each stripe is then made up of one row of cells, which is equivalent to taking one row across all the data blocks and the parity blocks

The following diagram shows this visually:

This row-and-column structure looks a lot like the matrix mentioned earlier; the reason the stripe concept exists is that the matrix operations read the data row by row. OK, let's zoom in on the diagram above.
To map the three concepts above into HDFS, several logical units need to be designed; there are the following two logical concepts:

    • Block group: the blue box in the diagram, which logically represents an HDFS file
    • Cell: each block is logically divided into cells of a fixed size; because blocks can have different sizes, the number of cells may also differ from block to block

The internal blocks in the middle are the blocks that eventually store the data, i.e., what we normally call blocks in HDFS.
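As a quick illustration of how these concepts fit together, here is a toy calculation (entirely my own, not HDFS code) that maps a byte offset inside a block group to the stripe, the internal data block, and the offset inside the cell it lands in, assuming cells are filled row by row, i.e. round-robin across the data blocks as in the diagram:

    // Toy illustration (not HDFS code): cells are assumed to be laid out
    // round-robin across the data blocks, one row of cells per stripe.
    public class CellLayoutDemo {
      public static void main(String[] args) {
        int cellSize = 64 * 1024;     // hypothetical 64 KB cells
        int numDataBlocks = 6;        // hypothetical 6 data blocks (e.g. RS-6-3)
        long offsetInGroup = 500_000; // some byte offset inside the block group

        long cellIndex = offsetInGroup / cellSize;           // which cell overall
        long stripeIndex = cellIndex / numDataBlocks;        // which row (stripe)
        int blockIndex = (int) (cellIndex % numDataBlocks);  // which internal block
        long offsetInCell = offsetInGroup % cellSize;

        System.out.println("stripe " + stripeIndex + ", internal block "
            + blockIndex + ", offset in cell " + offsetInCell);
      }
    }

With these hypothetical numbers, offset 500000 falls into stripe 1, internal block 1, at offset 41248 inside its cell.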
The calculation logic for the size of a stripe in HDFS is as follows:

    // Size of each stripe (only counting data blocks)
    final int stripeSize = cellSize * numDataBlocks;

That is, the size of one row; for example, with 64 KB cells and 6 data blocks, each stripe would be 384 KB. The implementation logic for getting the length of an internal block is as follows:

    // Size of each stripe (only counting data blocks)
    final int stripeSize = cellSize * numDataBlocks;
    // If block group ends at stripe boundary, each internal block has an equal
    // share of the group
    final int lastStripeDataLen = (int) (dataSize % stripeSize);
    if (lastStripeDataLen == 0) {
      return dataSize / numDataBlocks;
    }
    final int numStripes = (int) ((dataSize - 1) / stripeSize + 1);
    return (numStripes - 1L) * cellSize
        + lastCellSize(lastStripeDataLen, cellSize, numDataBlocks, i);

If the data length of the last stripe is exactly 0, then every internal block has the same length and the method returns directly; otherwise the size of the last cell (lastCellSize) is added on.
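To see how this formula behaves with concrete numbers, here is a self-contained sketch that replays the same calculation. The lastCellSize helper below is my own simplified stand-in for the HDFS helper of the same name (it only handles data blocks), and the 64 KB cell size, 6 data blocks and 1,000,000-byte group size are just hypothetical values:

    public class InternalBlockLengthDemo {

      // length of internal data block i within a block group holding dataSize bytes
      static long internalBlockLength(long dataSize, int cellSize,
          int numDataBlocks, int i) {
        final int stripeSize = cellSize * numDataBlocks;
        final int lastStripeDataLen = (int) (dataSize % stripeSize);
        if (lastStripeDataLen == 0) {
          return dataSize / numDataBlocks;
        }
        final int numStripes = (int) ((dataSize - 1) / stripeSize + 1);
        return (numStripes - 1L) * cellSize
            + lastCellSize(lastStripeDataLen, cellSize, i);
      }

      // simplified stand-in for HDFS's lastCellSize: size of data block i's
      // cell in the (partial) last stripe
      static int lastCellSize(int lastStripeDataLen, int cellSize, int i) {
        int fullCells = lastStripeDataLen / cellSize;
        if (i < fullCells) {
          return cellSize;                     // this block's last cell is full
        } else if (i == fullCells) {
          return lastStripeDataLen % cellSize; // the partially filled cell
        }
        return 0;                              // blocks after it get no last cell
      }

      public static void main(String[] args) {
        int cellSize = 64 * 1024;   // hypothetical 64 KB cells
        int numDataBlocks = 6;      // hypothetical 6 data blocks (e.g. RS-6-3)
        long dataSize = 1_000_000L; // ~976 KB of file data in this block group
        long total = 0;
        for (int i = 0; i < numDataBlocks; i++) {
          long len = internalBlockLength(dataSize, cellSize, numDataBlocks, i);
          total += len;
          System.out.println("internal block " + i + " length = " + len);
        }
        System.out.println("sum of lengths = " + total); // 1000000
      }
    }

Running it, the first three blocks get 196608 bytes, the fourth gets 148032, the last two get 131072 each, and the six lengths add up to exactly the 1,000,000 bytes of group data.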

HDFS Erasure Coding Implementation

With the concepts above understood, we can start looking at the actual implementation of EC in HDFS. The main implementation steps live in the ErasureCodingWorker#ReconstructAndTransferBlock class. As its comments indicate, there are three major steps:

    • Step 1: read bufferSize data from minimum number of sources required by reconstruction.
    • Step 2: decode data for targets.
    • Step 3: transfer data to targets.

Now let's take a step-by-step look.

Step1

The official comment describes the first step roughly as follows: in each round, try to read bufferSize data from the minimum number of source nodes required for reconstruction, and if reading from a source fails, a read from a new source will be scheduled for the following rounds.

In other words, it first selects the required number of best nodes from the source nodes, and if bad or slow nodes show up among them, new ones are selected. The code is as follows:

    // read from the source DNs required for reconstruction; if a source DN
    // is corrupt or slow, a new source DN will be read from instead
    Map<ExtendedBlock, Set<DatanodeInfo>> corruptionMap = new HashMap<>();
    try {
      readMinimumStripedData4Reconstruction(success,
          toReconstruct, corruptionMap);
    } finally {
      // report corrupted blocks to NN
      reportCorruptedBlocks(corruptionMap);
    }

A corresponding StripedReader is then created for each source node to perform the remote read; the remote read uses the StripedReader's BlockReader and its buffer.

    private StripedReader addStripedReader(int i, long offsetInBlock) {
      final ExtendedBlock block = getBlock(blockGroup, liveIndices[i]);
      StripedReader reader = new StripedReader(liveIndices[i], block, sources[i]);
      stripedReaders.add(reader);

      BlockReader blockReader = newBlockReader(block, offsetInBlock, sources[i]);
      if (blockReader != null) {
        initChecksumAndBufferSizeIfNeeded(blockReader);
        reader.blockReader = blockReader;
      }
      reader.buffer = allocateBuffer(bufferSize);
      return reader;
    }

The following diagram illustrates this:

Because the first step contains quite a few sub-steps, I have drawn an execution sequence diagram for it:
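Before moving on, here is a heavily simplified, self-contained sketch of that round-based reading idea (all names, node labels and numbers below are mine, not HDFS code or APIs):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    // Toy sketch of step 1: each round, read one bufferSize chunk from the
    // minimum number of sources needed for reconstruction, and swap in a new
    // source whenever one turns out to be bad or slow.
    public class MinimumSourceReadSketch {

      // pretend remote read: sources whose name starts with "bad" always fail
      static byte[] readChunk(String node) throws IOException {
        if (node.startsWith("bad")) {
          throw new IOException("read from " + node + " failed");
        }
        return new byte[64 * 1024]; // one bufferSize chunk
      }

      public static void main(String[] args) throws IOException {
        List<String> sources = new ArrayList<>(
            List.of("dn1", "bad-dn2", "dn3", "dn4", "dn5"));
        int minRequired = 3;   // e.g. any 3 of the 5 blocks suffice for decoding
        int rounds = 4;        // pretend the block takes 4 chunks to reconstruct

        // start with the first minRequired sources as the "best" set
        List<String> chosen = new ArrayList<>(sources.subList(0, minRequired));
        List<String> spare = new ArrayList<>(sources.subList(minRequired, sources.size()));

        for (int round = 0; round < rounds; round++) {
          for (int i = 0; i < chosen.size(); i++) {
            try {
              readChunk(chosen.get(i));
            } catch (IOException badOrSlow) {
              // replace the failed source and retry the read from the new one
              chosen.set(i, spare.remove(0));
              readChunk(chosen.get(i));
            }
          }
        }
        System.out.println("read complete using sources: " + chosen);
      }
    }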

Step2

The second step is to encode or decode the data: step 1 read the data into the buffers, and step 2 is the actual computation. The official comment makes the key point here: if the blocks to be reconstructed are all parity blocks, encode is called; if any of them is a data block, decode is called. In other words, whether encoding or decoding is used depends on what is being restored, which is consistent with the principle described in the erasure coding section above. The relevant code is as follows:

    // step2: decode to reconstruct targets
    reconstructTargets(success, targetsStatus, toReconstruct);

    ...
    int[] erasedIndices = getErasedIndices(targetsStatus);
    ByteBuffer[] outputs = new ByteBuffer[erasedIndices.length];
    int m = 0;
    for (int i = 0; i < targetBuffers.length; i++) {
      if (targetsStatus[i]) {
        targetBuffers[i].limit(toReconstructLen);
        outputs[m++] = targetBuffers[i];
      }
    }
    decoder.decode(inputs, erasedIndices, outputs);
    ...

One small doubt I have here is that the code uses the decode operation directly; presumably, in this reconstruction scenario, it is always the decoding case.
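To tie the encode/decode distinction back to something runnable, here is the simplest possible toy example: a single XOR parity block, i.e. m = 1. HDFS itself uses Reed-Solomon raw coders; this only illustrates the principle, not the real coder API:

    import java.util.Arrays;

    // Simplest illustration of the encode/decode distinction using one XOR
    // parity block (m = 1). Not the HDFS coder API, just the principle.
    public class XorEcDemo {

      // "encode": regenerate the parity block from all data blocks
      static byte[] encode(byte[][] dataBlocks) {
        byte[] parity = new byte[dataBlocks[0].length];
        for (byte[] d : dataBlocks) {
          for (int i = 0; i < parity.length; i++) {
            parity[i] ^= d[i];
          }
        }
        return parity;
      }

      // "decode": rebuild one missing data block from the survivors + parity
      static byte[] decode(byte[][] survivingDataBlocks, byte[] parity) {
        byte[] lost = parity.clone();
        for (byte[] d : survivingDataBlocks) {
          for (int i = 0; i < lost.length; i++) {
            lost[i] ^= d[i];
          }
        }
        return lost;
      }

      public static void main(String[] args) {
        byte[][] data = { {1, 2, 3}, {4, 5, 6}, {7, 8, 9} };
        byte[] parity = encode(data);   // target is a parity block: encode
        // target is data block 1: decode from the two survivors plus parity
        byte[] rebuilt = decode(new byte[][] { data[0], data[2] }, parity);
        System.out.println(Arrays.equals(rebuilt, data[1])); // true
      }
    }

Rebuilding the lost parity block is the encode direction; rebuilding a lost data block from the survivors plus the parity is the decode direction.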

Step3

The third step is very simple: transfer the data, writing the reconstructed data in the buffers out to the target nodes. The write itself is also simple, a direct remote write, because this kind of write only involves a single node and does not need to build the pipeline used by normal writes. On that topic, you can refer to my earlier analysis of the DFSOutputStream pipeline write mechanism and the streamer thread leak problem.

    // step3: transfer data
    if (transferData2Targets(targetsStatus) == 0) {
      String error = "Transfer failed for all targets.";
      throw new IOException(error);
    }

OK, that is the main implementation of EC data recovery in Hadoop 3.0. A complete sequence diagram is shown below:

Points to be improved and optimized

Two improvement points are mentioned in the official comments and should be perfected in the future:

    • Local reads are not used at present; all the data is read remotely
    • The transfer of reconstructed data to the targets does not return an ACK confirmation for each packet, unlike the pipeline write, which has a robust acknowledgement mechanism

RELATED LINKS

While learning erasure coding I read a lot of material and also submitted an issue to the community, HDFS-9832, which you can take a look at. By the way, I started participating in the Hadoop community last September, and taking part in open source has made my understanding of Hadoop much deeper than before. Currently the fastest-moving Hadoop development line is the 3.0 branch, while the highest version Apache has released is 2.7.2, with 2.7.3, 2.8.0, 2.9.0 and so on still to come in between, so the arrival of Hadoop 3.0 will take some time; 3.0 also contains many new features and big changes to old ones. Most of the people active in the Hadoop community today are from Cloudera and Hortonworks in the North American time zone, along with Apache contributors, and I have also seen a few people from Huawei. I hope more of us can participate in open source, understand open source, and make our own contributions to the open source community.
1. Wikipedia, Erasure code: https://en.wikipedia.org/wiki/erasure_code
2. http://www.searchstorage.com.cn/whatis/word_6080.htm
3. http://blog.sina.com.cn/s/blog_3fe961ae0102vpxr.html
4. Issue link: https://issues.apache.org/jira/browse/HDFS-9832
5. GitHub patch link: https://github.com/linyiqun/open-source-patch/tree/master/hdfs/HDFS-9832
