The Merkle tree was proposed by computer scientist Ralph Merkle in 1979 and is named after him. Despite its age, it has many interesting practical applications: Bitcoin wallet services use the Merkle tree mechanism to produce a "100% proof of reserves" (http://blog.csdn.net/lucky_greenegg/article/details/51155252), and the Git version control system, the ZFS file system, and BitTorrent (BT) peer-to-peer downloads all use Merkle trees for integrity checking, that is, to check whether data has been damaged.
Integrity Check
In fact, the simplest way to check integrity is to hash the entire data file. The most popular file verification methods online today are MD5 and SHA1; when Microsoft releases the Windows operating system or other software, it now uses CRC32 combined with SHA1, so that collisions almost never occur. The key property of a hash is that if the input changes even slightly, the resulting hash value becomes unrecognizable (Simhash is an exception): even a tiny modification yields a completely different hash value. If we download from a stable server, verifying with a single hash is acceptable.
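The "small change, completely different hash" property can be seen in a few lines. The following is only a minimal sketch using SHA1 from Python's standard hashlib module; the helper names sha1_hex and differing_hex_digits and the sample strings are chosen for illustration and do not come from any particular tool.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA1 digest of data as a 40-character hex string."""
    return hashlib.sha1(data).hexdigest()


def differing_hex_digits(a: str, b: str) -> int:
    """Count the positions at which two equal-length hex strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Two inputs that differ in a single character ("dog" vs "cog").
original = b"The quick brown fox jumps over the lazy dog"
modified = b"The quick brown fox jumps over the lazy cog"

h1 = sha1_hex(original)
h2 = sha1_hex(modified)

print(h1)
print(h2)
print(differing_hex_digits(h1, h2), "of", len(h1), "hex digits differ")
```

Verifying a download then comes down to comparing the digest computed locally with the digest published by the server.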
From a theoretical point of view, however, CRC alone is not fully reliable for verifying data integrity. Because the CRC polynomial has a linear structure, it is easy to alter the data in a way that produces a CRC collision. Suppose a string is transmitted together with its CRC checksum: if errors occur in bursts and the number of errors reaches a certain threshold, a collision can occur (the data is wrong but the CRC still matches).
Data chunking
When we transfer data over a peer-to-peer network, we download from many machines at the same time, and many of them must be regarded as unstable or untrustworthy, so a more ingenious approach is needed. In practice, a peer-to-peer network transfers a relatively large file by cutting it into small blocks of data. BitTorrent is a peer-to-peer file transfer protocol built on top of TCP/IP; it sits in the application layer of the TCP/IP architecture. Under the BitTorrent protocol, the publisher of a file generates a .torrent file, also called a seed file or simply a "seed", from the file to be published. A .torrent file is essentially a text file with two parts: tracker information and file information. The tracker information consists mainly of the address of the tracker server used during a BT download and the settings for that server. The file information is computed from the target file, and the result is encoded according to the Bencoding rules of the BitTorrent protocol. The main idea is to divide the file to be downloaded into virtual blocks of equal size, where the block size must be an integer power of 2 (the blocks are virtual, so no separate block files are created on disk), and to write the index information and hash verification code of each block into the .torrent file. The .torrent file is therefore the "index" of the file to be downloaded.
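The per-block hashing described above can be sketched as follows. This is an illustration only, not a .torrent generator: the function name piece_hashes, the 64 KiB block size, and the sample data are assumptions, and the sketch does not build the Bencoded file. It uses SHA1, which is the hash BitTorrent uses for its blocks.

```python
import hashlib
from typing import List


def piece_hashes(data: bytes, piece_length: int) -> List[bytes]:
    """Split data into virtual blocks and return the SHA1 digest of each.

    Nothing is written to disk; the blocks are just slices of the input.
    The last block may be shorter than piece_length.
    """
    # Enforce the power-of-two block size mentioned in the text.
    if piece_length <= 0 or piece_length & (piece_length - 1):
        raise ValueError("piece_length must be a positive power of two")
    return [
        hashlib.sha1(data[offset:offset + piece_length]).digest()
        for offset in range(0, len(data), piece_length)
    ]


# Hypothetical sample: 256 KiB of data cut into 64 KiB blocks.
sample = bytes(range(256)) * 1024
hashes = piece_hashes(sample, 64 * 1024)
print(len(hashes), "blocks,", len(hashes[0]), "bytes per hash")
```

The list of digests, in block order, is the kind of index information that the .torrent file carries for each block.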
To download the content, a downloader first obtains the corresponding .torrent file and then downloads with BT client software. The BT client first parses the .torrent file to get the tracker address and then connects to the tracker server. The tracker server responds to the request by providing the IP addresses of the other downloaders (including the publisher). The downloader then connects to those other downloaders; guided by the .torrent file, the peers tell each other which blocks they already have and exchange the blocks the other side lacks. No other server needs to take part, which spreads out the data traffic that would otherwise go over a single line and eases the burden on the server. Each time a downloader receives a block, it computes the block's hash verification code and compares it with the one in the .torrent file: if they match, the block is correct; if not, the block must be downloaded again. This rule addresses the accuracy of the downloaded content.
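The compare-and-re-download rule can be sketched as below. The names verify_block and blocks_to_redownload, and the dictionary of received blocks, are hypothetical and only illustrate the check; a real client also handles networking, peer selection, and storage.

```python
import hashlib
from typing import Dict, List


def verify_block(block: bytes, expected_sha1: bytes) -> bool:
    """Return True if the block's SHA1 digest matches the expected digest."""
    return hashlib.sha1(block).digest() == expected_sha1


def blocks_to_redownload(received: Dict[int, bytes],
                         expected: List[bytes]) -> List[int]:
    """Return the indexes of blocks that are missing or fail verification."""
    return [
        index
        for index, digest in enumerate(expected)
        if index not in received or not verify_block(received[index], digest)
    ]


# Expected digests, as they would be listed in the .torrent file.
expected = [hashlib.sha1(f"block-{i}".encode()).digest() for i in range(3)]

# Block 0 arrived intact, block 1 was corrupted, block 2 has not arrived yet.
received = {0: b"block-0", 1: b"block-1 (corrupted)"}

print(blocks_to_redownload(received, expected))  # [1, 2]
```

Only the blocks reported by blocks_to_redownload need to be fetched again; the intact block is kept.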
With an ordinary HTTP/FTP download, the file is published on only one or a few servers; when too many people download it, the server's bandwidth is easily saturated and downloads become very slow. BitTorrent downloads behave the opposite way: the more people download, the more bandwidth is available, the more seeds there are, and the faster the download becomes.
The advantage of chunking is that if a small block of data is damaged in transit, I only need to re-download that block rather than the entire file. This, of course, requires each block to have its own hash value. In a BT download, we first download a hash list before downloading the actual data. This raises a problem: with so many hashes, how do we guarantee that they are all correct? The answer is that we need a root hash. The hashes of all the small blocks are concatenated into one long string, that string is hashed once more, and the result is the root hash of the hash list. So if we can get a correct root hash from a trusted source (the torrent file from the server), we can use it to verify that the hash list is correct, and thus guarantee the correctness of each downloaded block. This is where the Merkle tree algorithm comes in.
Merkle Tree
Let's look at its structure first.
At the very bottom, as with a hash list, we divide the data into small blocks, each with its own corresponding hash. Going up, however, we do not compute the root hash directly. Instead, we concatenate two adjacent hashes into one string and hash that string, so that every pair of hashes produces one combined hash, the value of their parent node. If the number of hashes at the bottom level is odd, a single hash is left over at the end; it is hashed on its own, so it also gets a parent hash. Moving up, the same procedure yields a new, smaller level of hashes each time, and eventually an upside-down tree forms. At the root only one hash remains, and we call it the Merkle root.
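The construction just described can be sketched as follows. The text does not name a hash function, so SHA256 is an assumption here, as are the names h and merkle_root. The leftover hash on an odd level is hashed on its own, as described above; other systems handle the odd case differently (Bitcoin, for example, pairs the last hash with a copy of itself).

```python
import hashlib
from typing import List


def h(data: bytes) -> bytes:
    """Hash function used for every node (SHA256 is an assumption)."""
    return hashlib.sha256(data).digest()


def merkle_root(blocks: List[bytes]) -> bytes:
    """Compute the Merkle root of a non-empty list of data blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    # Bottom level: one hash per data block, as in a hash list.
    level = [h(block) for block in blocks]
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            # Two adjacent hashes are concatenated and hashed together;
            # a leftover single hash at the end is hashed on its own.
            next_level.append(h(b"".join(level[i:i + 2])))
        level = next_level
    return level[0]


blocks = [b"block-1", b"block-2", b"block-3"]
root = merkle_root(blocks)
print(root.hex())

# Changing any single block changes the root.
tampered = [b"block-1", b"block-X", b"block-3"]
print(merkle_root(tampered) != root)  # True
```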
Compared with a hash list, one obvious benefit of a Merkle tree is that a single branch can be taken out (as a small tree) to verify part of the data. In many use cases this brings a convenience and efficiency that a hash list cannot match.
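Verifying one block through a single branch can be sketched as below. With a hash list, checking one block against the root requires every hash in the list; with a Merkle tree, only the sibling hashes along the path to the root are needed. The function names, the proof format (a list of sibling hash plus position), and the use of SHA256 are assumptions made for this illustration. A leftover hash on an odd level is hashed on its own, which the proof records as a missing sibling.

```python
import hashlib
from typing import List, Optional, Tuple

# One proof step: (sibling hash or None, True if the sibling is on the left).
Step = Tuple[Optional[bytes], bool]


def h(data: bytes) -> bytes:
    """Hash function used for every node (SHA256 is an assumption)."""
    return hashlib.sha256(data).digest()


def build_levels(blocks: List[bytes]) -> List[List[bytes]]:
    """Return all tree levels, from the leaf hashes up to the root."""
    if not blocks:
        raise ValueError("at least one block is required")
    levels = [[h(block) for block in blocks]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(b"".join(prev[i:i + 2]))
                       for i in range(0, len(prev), 2)])
    return levels


def make_proof(levels: List[List[bytes]], index: int) -> List[Step]:
    """Collect the sibling hashes on the path from leaf `index` to the root."""
    proof = []
    for level in levels[:-1]:
        sibling_index = index ^ 1  # the other member of the pair
        sibling = level[sibling_index] if sibling_index < len(level) else None
        proof.append((sibling, sibling_index < index))
        index //= 2
    return proof


def verify_block(block: bytes, proof: List[Step], root: bytes) -> bool:
    """Recompute the path to the root using only the block and its proof."""
    node = h(block)
    for sibling, sibling_is_left in proof:
        if sibling is None:
            node = h(node)            # leftover hash, hashed on its own
        elif sibling_is_left:
            node = h(sibling + node)
        else:
            node = h(node + sibling)
    return node == root


blocks = [f"block-{i}".encode() for i in range(5)]
levels = build_levels(blocks)
root = levels[-1][0]

proof = make_proof(levels, 4)
print(verify_block(blocks[4], proof, root))    # True
print(verify_block(b"tampered", proof, root))  # False
```

For a tree with N leaves the proof contains about log2(N) steps, so only a small part of the tree has to be transferred to check one block.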
As shown above, the value of leaf node Node7 is V7 = Hash(F1), the hash of file F1, and the value of its parent node Node3 is V3 = Hash(V7, V8), the hash of the values of its children Node7 and Node8. This is how the hierarchical computation is expressed. The value of the root node is in effect a unique fingerprint of the values of all the leaf nodes.
Suppose file 5 on machine A differs from file 5 on machine B. How do we find the differing file using the Merkle tree information of the two machines? The comparison proceeds as follows:
1. Compare V0 on both machines; if it differs, retrieve its children Node1 and Node2.
2. V1 is the same, V2 is different. Retrieve Node2's children Node5 and Node6.
3. V5 is different, V6 is the same. Retrieve Node5's children Node11 and Node12.
4. V11 is different, V12 is the same. Node11 is a leaf node, so get its directory information.
5. The comparison is complete.
The theoretical complexity of the procedure above is O(log N). The actual cost is somewhat higher, because at each node whose value differs, both of its children must be compared.
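The comparison steps above can be sketched in code. The sketch assumes eight files per machine and stores the tree in an array in which Node i has children Node 2i+1 and Node 2i+2, matching the numbering used above (Node0 is the root, Node7 to Node14 are the leaves). The names build_tree and differing_leaves, the file contents, and the use of SHA256 are assumptions for illustration.

```python
import hashlib
from typing import List


def h(data: bytes) -> bytes:
    """Hash function used for every node (SHA256 is an assumption)."""
    return hashlib.sha256(data).digest()


def build_tree(files: List[bytes]) -> List[bytes]:
    """Return node values V0..V(2n-2); the leaves occupy the last n slots."""
    n = len(files)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two number of files")
    values = [b""] * (2 * n - 1)
    for k, content in enumerate(files):
        values[n - 1 + k] = h(content)                 # leaf: hash of the file
    for i in range(n - 2, -1, -1):
        values[i] = h(values[2 * i + 1] + values[2 * i + 2])  # parent node
    return values


def differing_leaves(a: List[bytes], b: List[bytes]) -> List[int]:
    """Walk both trees from the root, descending only where values differ."""
    if len(a) != len(b):
        raise ValueError("trees must have the same shape")
    result = []
    stack = [0]                                        # start at Node0
    while stack:
        i = stack.pop()
        if a[i] == b[i]:
            continue                                   # identical subtree
        left = 2 * i + 1
        if left >= len(a):
            result.append(i)                           # a differing leaf
        else:
            stack.extend([left + 1, left])             # visit left child first
    return result


files_a = [f"contents of F{k}".encode() for k in range(1, 9)]
files_b = list(files_a)
files_b[4] = b"modified contents of F5"                # file 5 differs on B

tree_a = build_tree(files_a)
tree_b = build_tree(files_b)

leaves = differing_leaves(tree_a, tree_b)
print(leaves)                                          # [11]
print([leaf - (len(files_a) - 1) + 1 for leaf in leaves])  # [5]
```

In this run only Node0, Node1, Node2, Node5, Node6, Node11 and Node12 are compared, which mirrors the listed steps: two children are examined at every level where a difference was found.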