Super Cool Algorithm: Fountain Codes

Source: Internet
Author: User
Tags: sha1, hash

Today's topic is fountain codes, also known as "rateless codes". A fountain code is a way to convert some data, such as a file, into an effectively unlimited number of encoded packets, such that the source data can be recovered from any subset of the encoded packets that is only slightly larger than the number of source packets. In other words, you create a "fountain" of encoded data, and as long as the receiver collects enough "drops", it can reconstruct the file, regardless of which ones it missed.

Fountain codes are remarkable because they let you transfer a file over a lossy connection (such as the Internet) without knowing the packet loss rate and without the receiver having to report which packets it is missing. This makes them useful in many scenarios, from sending a static file over a broadcast medium, such as on-demand TV, to distributing chunks of a file among peers in a multi-source parallel download system such as BitTorrent.

Fountain codes are surprisingly simple at their core. There are many variants, but in this article we cover only the simplest: the LT code, or Luby transform code. An encoded block is generated as follows:

    1. Randomly choose a number d between 1 and k, where k is the number of blocks in the file. We'll discuss how best to choose d in a later section.
    2. Select d blocks at random from the file and combine them. Here we combine the blocks with an XOR operation.
    3. Transmit the combined block, along with information about which source blocks it was built from.
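
The three steps above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the names `xor_blocks` and `encode_block` are hypothetical, all source blocks are assumed to have the same length, and the degree is drawn uniformly only to keep the sketch short (a real LT code draws it from a tuned degree distribution).

```python
import random

def xor_blocks(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode_block(blocks, rng):
    """Build one encoded block from a list of equal-length source blocks.

    Returns (indices, payload): the indices of the source blocks that were
    combined, and the XOR of those blocks.
    """
    k = len(blocks)
    d = rng.randint(1, k)                      # step 1: pick a degree (uniform: an assumption)
    indices = sorted(rng.sample(range(k), d))  # step 2: pick d distinct source blocks
    payload = bytes(len(blocks[0]))            # all-zero block, the identity for XOR
    for i in indices:
        payload = xor_blocks(payload, blocks[i])
    return indices, payload                    # step 3: both are sent to the receiver

blocks = [b"ab", b"cd", b"ef", b"gh"]
indices, payload = encode_block(blocks, random.Random(42))
```

Calling `encode_block` repeatedly with the same `rng` yields as many encoded blocks as needed, which is what makes the code rateless.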

Simple, isn't it? Much depends on how we choose the number of blocks to combine (the distribution of this number is called the degree distribution), which we will look at briefly later on. As the description above implies, some encoded blocks are built from just a single source block, while most are built from several.

Another point that may not be immediately obvious is that, although the receiver must know which source blocks went into each encoded block, we do not need to send that list explicitly. If the sender and receiver agree on a pseudo-random number generator (PRNG), we can seed the PRNG with a randomly chosen seed and use it to pick the degree and the set of source blocks. We then send only the seed along with the encoded block, and the receiver runs the same procedure to reconstruct the list of source blocks we used.
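
As a rough sketch of the seed trick, the hypothetical function `choose_blocks` below derives the degree and the block list from a seed alone, so both ends arrive at the same list. The uniform degree choice and the use of Python's `random.Random` as the shared PRNG are assumptions made for illustration.

```python
import random

def choose_blocks(seed: int, k: int):
    """Derive the degree and the source block indices from a seed.

    Sender and receiver must run exactly the same procedure with the same
    PRNG. The uniform degree choice stands in for a real degree distribution.
    """
    rng = random.Random(seed)
    d = rng.randint(1, k)
    return sorted(rng.sample(range(k), d))

k = 8
seed = 12345                              # transmitted alongside the encoded block
sender_view = choose_blocks(seed, k)      # what the sender combined
receiver_view = choose_blocks(seed, k)    # what the receiver reconstructs
```

Both calls return the same list, so only the seed has to travel over the wire. A real protocol would need a PRNG whose output is specified exactly on both ends; Python's documentation does not promise that `randint` and `sample` produce the same sequence across Python versions.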

Decoding is a little more involved, but not by much:

    1. Rebuild the list of source blocks that were used to generate the encoded block.
    2. For each source block in the list that has already been decoded, XOR it into the encoded block and remove it from the list.
    3. If at least two source blocks remain in the list, add the encoded block to a waiting area.
    4. If only one source block remains in the list, we have decoded another source block. Add it to the decoded file, then go through the waiting area and repeat the process for every encoded block that contains it.

Let's walk through a decoding example to make the process clearer. Suppose we receive five encoded blocks, each one byte long, and we know which source blocks each one was built from. We can represent this data as a graph in which each encoded block is linked to the source blocks it was built from.

In the graph, the nodes on the left are the encoded blocks we received, and the nodes on the right are the source blocks. The first encoded block, 0x48, was built from a single source block (the first one), so we already know that block's value. Following the arrows that point to the first source block backwards, we see that the second and third encoded blocks each depend only on the first source block and one other source block. Since we now know the first source block, we can XOR it out of both and recover two more source blocks.

Repeating the process, we now have enough information to decode the fourth encoded block, which depends on the second and third source blocks, both of which we now know. XORing them out of it gives us the fifth and final source block.

Finally, we can decode the last remaining source block, which gives us the rest of the message.
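
The walkthrough above can be replayed in a few lines of Python. Only the value 0x48 and the order of decoding come from the text; the remaining byte values (chosen here so the message reads `HELLO`) and the exact dependency lists in `deps` are assumptions made to produce a runnable example.

```python
# Hypothetical source blocks: only the first byte, 0x48 ("H"), is from the text.
source = b"HELLO"

def combine(indices):
    """XOR together the source bytes at the given indices."""
    value = 0
    for i in indices:
        value ^= source[i]
    return value

# Assumed dependency lists (0-based source indices) that match the walkthrough.
deps = [[0], [0, 1], [0, 2], [1, 2, 4], [3, 4]]
received = [(set(d), combine(d)) for d in deps]

decoded = {}
progress = True
while progress and len(decoded) < len(source):
    progress = False
    for unknown, value in received:
        remaining = [i for i in unknown if i not in decoded]
        if len(remaining) == 1:
            # XOR out everything already known; what is left is the new block.
            for i in unknown:
                if i in decoded:
                    value ^= decoded[i]
            decoded[remaining[0]] = value
            progress = True

message = bytes(decoded[i] for i in range(len(source)))
```

With these assumed inputs every block is recovered in a single pass, in the order first, second, third, fifth, fourth, and `message` equals `b"HELLO"`.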

Admittedly, this is a contrived example: we received exactly the blocks needed to decode the message, with none to spare, and in a very convenient order. Still, it demonstrates the principle well. It should be clear that the same algorithm applies directly to larger blocks and larger files.

I mentioned earlier that how we choose the number of source blocks combined into each encoded block, that is, the degree distribution, matters a great deal. Ideally, we want a few encoded blocks that contain just one source block, so that decoding can get started, while most encoded blocks depend on only a small number of other blocks. Such an ideal distribution exists and is called the ideal soliton distribution.
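
For reference, the ideal soliton distribution as it is usually defined in the LT code literature (the article itself does not give the formula) assigns probability 1/k to degree 1 and 1/(d(d-1)) to each degree d from 2 to k. A small sketch, with the hypothetical function name `ideal_soliton`:

```python
from fractions import Fraction

def ideal_soliton(k: int):
    """Return [rho(1), ..., rho(k)] for the ideal soliton distribution."""
    probs = [Fraction(1, k)]                                      # rho(1) = 1/k
    probs += [Fraction(1, d * (d - 1)) for d in range(2, k + 1)]  # rho(d) = 1/(d(d-1))
    return probs

rho = ideal_soliton(10)
```

Exact fractions are used so that it is easy to check the probabilities sum to 1. Note how degree 2 alone gets half of the probability mass, while degree 1 gets only 1/k.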

Unfortunately, the ideal soliton distribution is not so ideal in practice: random variation means that some source blocks may not be covered by any encoded block, or that decoding stalls when the decoder runs out of known blocks. A variant called the robust soliton distribution improves on this by generating more encoded blocks of very low degree, and also a few encoded blocks that combine all or nearly all of the source blocks, which helps decode the last few source blocks.
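
A sketch of the robust soliton distribution, following its usual definition in the LT code literature rather than anything given in this article: an extra term boosts the low degrees and adds a spike at a degree near k/R, and the result is renormalised. The function name `robust_soliton`, the default values of the tuning parameters `c` and `delta`, and the rounding of k/R to an integer are all assumptions.

```python
import math
import random

def robust_soliton(k: int, c: float = 0.1, delta: float = 0.5):
    """Return [mu(1), ..., mu(k)] for the robust soliton distribution.

    c and delta are tuning parameters; the defaults here are arbitrary.
    """
    rho = [1.0 / k] + [1.0 / (d * (d - 1)) for d in range(2, k + 1)]
    R = c * math.log(k / delta) * math.sqrt(k)
    if R <= delta:
        raise ValueError("k is too small for these values of c and delta")
    spike = min(k, max(1, round(k / R)))    # degree that receives the spike
    tau = [0.0] * k
    for d in range(1, spike):
        tau[d - 1] = R / (d * k)            # extra weight for low degrees
    tau[spike - 1] = R * math.log(R / delta) / k
    beta = sum(rho) + sum(tau)              # normalisation constant
    return [(r + t) / beta for r, t in zip(rho, tau)]

k = 1000
mu = robust_soliton(k)
# Draw one degree for an encoded block from the distribution.
degree = random.Random(1).choices(range(1, k + 1), weights=mu)[0]
```

With these parameters, degree 1 is far more likely than the 1/k it gets under the ideal soliton distribution, which is what keeps the decoder from stalling early.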

In short, that is how fountain codes, and more precisely LT codes, work. LT codes are the least efficient of the known fountain codes, but also the easiest to explain. If you want to dig deeper, I highly recommend reading this technical paper on fountain codes, as well as reading up on Raptor codes, which add only a little complexity over LT codes but significantly improve both transmission overhead and computational cost.

Before we wrap up, here is one more thought to ponder. Fountain codes might seem ideally suited to a system like BitTorrent: they would let seeders generate and distribute a virtually unlimited number of encoded blocks, more or less eliminating the "last block" problem on poorly seeded torrents, and ensuring that two randomly selected peers almost always have useful information to exchange with each other. But this approach faces one major problem: it is very difficult to verify data received from peers.

Protocols like BitTorrent use a secure hash function, such as SHA-1, and have a trusted party (the original uploader) send an authoritative list of hashes to all peers. Each peer can then validate the blocks it downloads by hashing them and comparing the result against the authoritative hash. With fountain codes, this is hard. There is no way to compute the SHA-1 hash of an encoded block given only the hashes of the individual source blocks. We can't trust hashes computed by our peers, because they can lie to us. We could wait until we have assembled the whole file and then try to work out, by a process of elimination, which encoded blocks were invalid, but that is difficult and unreliable, and the information may come too late. An alternative is to have the original publisher advertise a public key and sign every encoded block it generates. Then we can validate encoded blocks, but at a cost: only the original publisher can generate valid encoded blocks, and we lose many of the benefits of using fountain codes in the first place. It seems we are stuck.
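
To make the hashing problem concrete, the sketch below checks the most obvious shortcut, combining the published SHA-1 hashes the same way the blocks were combined, and shows that it does not produce the hash of the encoded block. The block contents and the helper name `xor_bytes` are made up for illustration, and the sketch only rules out this one naive combination.

```python
import hashlib

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

block_a = b"source block A"
block_b = b"source block B"

# What a peer would send: the XOR of the two source blocks.
encoded = xor_bytes(block_a, block_b)
hash_of_encoded = hashlib.sha1(encoded).digest()

# What a receiver could compute from the published per-block hashes alone.
xor_of_hashes = xor_bytes(hashlib.sha1(block_a).digest(),
                          hashlib.sha1(block_b).digest())

hashes_match = hash_of_encoded == xor_of_hashes
```

Here `hashes_match` is false: SHA-1 is designed so that the hash of a combination of inputs is unrelated to the hashes of the inputs, so a receiver holding only the per-block hashes has nothing to compare an encoded block against.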

There is another option, a very clever scheme called homomorphic hashing, although it comes with its own caveats and drawbacks. We'll discuss it in the next installment of this series.
