/* I have recently been looking at Ethereum, and one of its important concepts is the Merkle tree. I had never heard of it before, so I looked up some material to learn about it. Since I have only been studying it for a short time, my understanding is not very deep; if anything here is wrong, I hope the experts will point it out. */
Merkle Tree Concept
A Merkle tree, often called a hash tree, is, as the name implies, a tree that stores hash values. Each leaf of a Merkle tree is the hash of a data block (for example, a file or a collection of files). Each non-leaf node is the hash of the concatenation of its child nodes' values. [1]
1. Hash
A hash function maps data of arbitrary length to data of fixed length [2]. For a data integrity check, for example, the simplest approach is to hash the whole data set to get a fixed-length hash value and publish that value online. After downloading the data, a user hashes it again and compares the result with the published hash value. If the two are equal, the downloaded data is not corrupted. This works because even a slight change in the input changes the hash result beyond recognition, and because it is hard to infer anything about the original input from the hash value. [3]
If you download from a stable server, a single hash is adequate. But if the data source is unstable, any corruption forces the whole file to be downloaded again, which is very inefficient.
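The integrity check described in this section can be sketched in a few lines of Python. This is only a minimal illustration that assumes SHA-256 from the standard `hashlib` module; the helper name `sha256_hex` and the sample data are made up for the example.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()

# The publisher computes the hash once and publishes it.
published = sha256_hex(b"hello world")

# The downloader hashes what it received and compares.
downloaded_ok = b"hello world"
downloaded_bad = b"hello worle"   # a single character differs

print(sha256_hex(downloaded_ok) == published)    # True
print(sha256_hex(downloaded_bad) == published)   # False
```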
2. Hash List
In a peer-to-peer network, data is downloaded from many machines at the same time, and many of those machines must be regarded as unstable or untrustworthy. To verify the integrity of the data, a better approach is to split a large file into small blocks (for example, blocks of 2 KB). The advantage is that if a block is damaged in transit, only that block has to be downloaded again, not the entire file.
How do we make sure a small block is not damaged? We simply hash each block. In a BT download, a hash list is downloaded before the actual data. That raises the next question: how do we make sure the hash list itself is correct? The answer is to concatenate the hash values of all the blocks and hash that long string once more, which gives the root hash of the hash list (top hash or root hash). When downloading, we first obtain the correct root hash from a trusted source, use it to verify the hash list, and then use the hash list to verify the data blocks.
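The hash list scheme above might look like the following sketch. It assumes SHA-256 and 2 KB blocks; the function names (`split_blocks`, `hash_list`, `root_hash`, `find_bad_blocks`) are hypothetical and not taken from any BT client.

```python
import hashlib

BLOCK_SIZE = 2048  # 2 KB blocks, as in the example above

def split_blocks(data: bytes, size: int = BLOCK_SIZE):
    """Split data into consecutive blocks of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def hash_list(blocks):
    """One hash per block."""
    return [hashlib.sha256(b).digest() for b in blocks]

def root_hash(hashes):
    """Hash of the concatenation of all block hashes."""
    return hashlib.sha256(b"".join(hashes)).digest()

def find_bad_blocks(blocks, verified_hashes):
    """Indices of blocks whose hash does not match the verified hash list."""
    return [i for i, b in enumerate(blocks)
            if hashlib.sha256(b).digest() != verified_hashes[i]]

# Publisher side: build the hash list and its root hash.
data = bytes(range(256)) * 40          # 10240 bytes, i.e. 5 blocks
blocks = split_blocks(data)
hashes = hash_list(blocks)
trusted_root = root_hash(hashes)       # obtained from a trusted source

# Downloader side: check the hash list first, then the blocks.
received = list(blocks)
received[3] = b"\x00" * BLOCK_SIZE     # block 3 was damaged in transit
print(root_hash(hashes) == trusted_root)   # True: the hash list is intact
print(find_bad_blocks(received, hashes))   # [3]: only this block is re-downloaded
```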
3. Merkle Tree
The Merkle tree can be seen as a generalization of the hash list (a hash list can be seen as a special Merkle tree, namely a multi-way Merkle tree of height 2).
At the very bottom, as with the hash list, we split the data into small blocks, each with its own hash. Going up, however, we do not compute the root hash directly. Instead, we concatenate every two adjacent hashes into one string and hash that string, so that each pair of hashes produces one new hash on the level above (a "sub-hash"). If the number of hashes on a level is odd, the last hash is left without a partner; it is hashed on its own to produce its sub-hash. Repeating this level by level yields fewer and fewer hashes, forming an upside-down tree, until only a single root hash remains, which we call the Merkle root [3].
Before downloading a file over a P2P network, the Merkle root of the file is first obtained from a trusted source. Once the root is available, the Merkle tree itself can be fetched from other, untrusted sources and checked against the trusted root. If the Merkle tree is corrupted or fake, another copy is fetched from another source until one matches the trusted root.
The main difference between a Merkle tree and a hash list is that a single branch of a Merkle tree can be downloaded and verified immediately. Because the file is split into small blocks, a corrupted block can simply be downloaded again. For a very large file, both the Merkle tree and the hash list become large, but with a Merkle tree one branch can be downloaded at a time and verified right away, and once the branch passes verification the corresponding data can be downloaded. A hash list can only be verified after the entire list has been downloaded.
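To make the idea of verifying a single branch concrete, here is a small sketch. It follows the construction described above (an unpaired hash is hashed on its own) and assumes SHA-256; the names `build_levels`, `make_branch` and `verify_branch` are hypothetical. Real systems differ in details such as how an odd node is handled.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_levels(blocks):
    """Return all levels of the tree, leaves first, root level last."""
    level = [h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        # level[i:i + 2] holds two hashes, or one if the last hash is unpaired
        level = [h(b"".join(level[i:i + 2])) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def make_branch(levels, index):
    """Collect what is needed to recompute the root from leaf `index`."""
    branch = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling < len(level):
            # (sibling hash, True if the sibling is the left-hand node)
            branch.append((level[sibling], sibling < index))
        else:
            branch.append((None, False))   # unpaired node on this level
        index //= 2
    return branch

def verify_branch(block, branch, trusted_root):
    """Recompute the root from one block and its branch."""
    node = h(block)
    for sibling, sibling_is_left in branch:
        if sibling is None:
            node = h(node)
        elif sibling_is_left:
            node = h(sibling + node)
        else:
            node = h(node + sibling)
    return node == trusted_root

blocks = [b"block%d" % i for i in range(9)]
levels = build_levels(blocks)
root = levels[-1][0]                       # assumed to come from a trusted source
branch = make_branch(levels, 4)
print(verify_branch(blocks[4], branch, root))     # True
print(verify_branch(b"tampered", branch, root))   # False
```

The verifier only needs one hash per level rather than the whole tree, which is why a branch can be checked as soon as it arrives.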
Features of the Merkle tree
- A Merkle tree is a tree, usually a binary tree but possibly a multi-way tree; whatever its branching factor, it has all the properties of a tree structure.
- The value of a leaf node of the Merkle tree is a unit of data from the data set, or the hash of that unit of data.
- The value of a non-leaf node is computed with the hash algorithm from the values of its child nodes, so it depends on every leaf value below it. [4] [5]
In general, a cryptographic hash function such as SHA-2 is used to compute the hashes (MD5 has also been used, but it is no longer considered collision-resistant). If you only need to guard against accidental corruption rather than deliberate tampering, a less secure but more efficient checksum such as CRC can be used instead.
Second-preimage attack: the Merkle root does not indicate the depth of the tree, which enables a second-preimage attack in which an attacker creates a different document that has the same Merkle root. A simple fix is defined in Certificate Transparency: when computing the hash of a leaf node, prepend 0x00 to the data being hashed; when computing the hash of an internal node, prepend 0x01. Other implementations limit the depth of the hash tree by putting a depth prefix in front of each hash value, so that the prefix decreases at every step and an extracted hash chain is considered valid only if the prefix is still positive when the leaf is reached.
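The second-preimage problem and the prefix fix can be demonstrated with a short sketch. It uses SHA-256 and a simple pairwise tree with a power-of-two number of blocks; it only illustrates the 0x00/0x01 prefix idea and is not the full Certificate Transparency tree algorithm. The function names are hypothetical.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def root_plain(blocks):
    """Merkle root without any leaf/internal distinction."""
    level = [h(b) for b in blocks]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def root_prefixed(blocks):
    """Merkle root with 0x00 prepended for leaves and 0x01 for internal nodes."""
    level = [h(b"\x00" + b) for b in blocks]
    while len(level) > 1:
        level = [h(b"\x01" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

real = [b"a", b"b", c := b"c", b"d"]
# The attacker presents the two internal nodes of the real tree as if they
# were data blocks of a shorter, different document.
forged = [h(b"a") + h(b"b"), h(c) + h(b"d")]

print(root_plain(real) == root_plain(forged))        # True: same root, different data
print(root_prefixed(real) == root_prefixed(forged))  # False: the prefixes break the attack
```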
Operation of the Merkle tree
1. Creating a Merkle Tree
Suppose there are 9 data blocks at the bottom level.
Step 1: (red line) Hash each data block: node0i = hash(data0i), i = 1, 2, ..., 9
Step 2: (orange line) Concatenate each pair of adjacent hashes and hash the result: node1((i+1)/2) = hash(node0i + node0(i+1)), i = 1, 3, 5, 7; for i = 9, node1((i+1)/2) = hash(node0i)
Step 3: (yellow line) Repeat Step 2
Step 4: (green line) Repeat Step 2
Step 5: (blue line) Repeat Step 2 to produce the Merkle root
Clearly, creating a Merkle tree has O(n) complexity (meaning O(n) hash operations), where n is the number of data blocks. The height of the resulting tree is ⌈log2(n)⌉ + 1, which is 5 levels for the 9 blocks above.
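The five steps can be replayed in code to check the level sizes and the number of hash operations. This is a sketch assuming SHA-256; `build_levels` and the block contents are made up for illustration.

```python
import hashlib

def build_levels(blocks):
    """Build the tree bottom-up and count how many hashes were computed."""
    hash_calls = 0

    def h(data: bytes) -> bytes:
        nonlocal hash_calls
        hash_calls += 1
        return hashlib.sha256(data).digest()

    level = [h(b) for b in blocks]                      # Step 1
    levels = [level]
    while len(level) > 1:                               # Steps 2 to 5
        # an unpaired last hash is hashed on its own
        level = [h(b"".join(level[i:i + 2])) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels, hash_calls

blocks = [b"data%02d" % i for i in range(1, 10)]        # 9 data blocks
levels, hash_calls = build_levels(blocks)

print([len(level) for level in levels])   # [9, 5, 3, 2, 1]
print(len(levels))                        # 5 levels
print(hash_calls)                         # 20 hashes, i.e. O(n)
```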
2. Retrieving data blocks
To make this easier to understand, assume there are two machines, A and B, and that A should hold the same 8 files in a directory as B: f1, f2, f3, ..., f8. A Merkle tree lets us compare them quickly. Suppose each machine builds a Merkle tree when the files are created. A concrete example:
From the example we can see that the value of leaf node Node7 is hash(f1), the hash of file f1, and the value of its parent Node3 is hash(v7, v8), the hash of the values of its children Node7 and Node8. This is how the hierarchical relationship is expressed. The value of the root node is in effect a unique fingerprint of the values of all the leaf nodes.
Suppose file 5 on A differs from file 5 on B. How do we find the differing file using the Merkle tree information of the two machines? The comparison proceeds as follows:
Step 1. Compare v0 on both machines. If the values differ, examine the children Node1 and Node2.
Step 2. v1 is the same and v2 is different, so examine Node2's children Node5 and Node6.
Step 3. v5 is different and v6 is the same, so examine Node5's children Node11 and Node12.
Step 4. v11 is different and v12 is the same. Node11 is a leaf node, so retrieve its directory information.
Step 5. The comparison is complete.
The theoretical complexity of this process is O(log N). The process is shown in the following diagram:
As the process shows, the differing file can be located quickly.
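The comparison steps can be sketched as follows. The tree is stored in an array using the same numbering as the example (node 0 is the root, the children of node i are nodes 2i+1 and 2i+2, and the leaves for f1 to f8 are nodes 7 to 14). SHA-256, the function names and the file contents are assumptions made for the illustration.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_array_tree(files):
    """Return the node values v[0 .. 2n-2]; len(files) is assumed to be a power of two."""
    n = len(files)
    v = [b""] * (2 * n - 1)
    for i, content in enumerate(files):
        v[n - 1 + i] = h(content)                 # leaves
    for i in range(n - 2, -1, -1):
        v[i] = h(v[2 * i + 1] + v[2 * i + 2])     # internal nodes
    return v

def diff_leaves(va, vb, node=0):
    """Indices of differing leaf nodes; identical subtrees are skipped."""
    if va[node] == vb[node]:
        return []
    left = 2 * node + 1
    if left >= len(va):          # no children, so this is a leaf
        return [node]
    return diff_leaves(va, vb, left) + diff_leaves(va, vb, left + 1)

files_a = [b"contents of f%d" % i for i in range(1, 9)]
files_b = list(files_a)
files_b[4] = b"f5 was modified on B"

tree_a = build_array_tree(files_a)
tree_b = build_array_tree(files_b)
print(diff_leaves(tree_a, tree_b))   # [11], the leaf node for f5
```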
3. Update, Insert and Delete
Although there is a lot of material about Merkle trees online, most of it discusses retrieval and traversal and does not cover update, insert, and delete operations. This puzzled me, since the operations on a tree structure should include not only lookup but also update, insert, and delete. I later found a question on Stack Exchange that made things somewhat clearer; see [6] for the original.
Updating a data block in a Merkle tree is actually very simple: update the data block, then update the hash values on the path from it to the root. This does not change the structure of the tree. Insert and delete operations, however, necessarily change the structure of the Merkle tree. Take the following insert operation as an example:
After inserting data block 0 (taking the position of the data block into account), the structure of the Merkle tree looks like this:
The asker in [6] has in mind an insertion algorithm that satisfies the following conditions:
- The number of re-hashing operations is bounded by log(n)
- Verifying a data block takes at most log(n) + 1 operations
- Unless the original tree's n is even, the tree has no orphan after the insertion; if there is an orphan, it is the last data block
- The order of the data blocks is preserved
- The Merkle tree remains balanced after the insertion
The result of the insertion above would then be:
According to the answer in [6], insertion into and deletion from a Merkle tree is really an engineering problem, and different problems call for different insertion methods. If you want to keep the tree balanced or keep its height at log(n), you can use any standard balanced binary tree scheme, such as an AVL tree, red-black tree, splay tree, or 2-3 tree. The update procedures of these balanced trees perform an insertion in O(log n) time and keep the tree height at O(log n). It is then easy to see that updating all the Merkle hashes can be done in O((log n)^2) time (O(log n) nodes need to be updated to satisfy the height requirement, and for each of them the O(log n) nodes on the path to the root must be updated). A more careful analysis shows that all the hashes can actually be updated in O(log n) time, because the nodes to be changed are all related: they all lie on a single path from a leaf to the root, or something similar.
The answer in [6] also points out that in most applications the structure of the Merkle tree (whether it is balanced, whether its height is bounded) does not really matter, and that most applications do not require the data blocks to stay in order. You can therefore design your own insert and delete operations for the application at hand; a generic insert or delete operation for Merkle trees is not meaningful.
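The update operation described earlier in this section (replace a block, then recompute only the hashes on its path to the root) can be sketched like this. The array layout, SHA-256 and the function names are assumptions for the example; insert and delete are deliberately left out because, as noted, they depend on the application.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_array_tree(blocks):
    """Array layout: node 0 is the root, children of i are 2i+1 and 2i+2.
    The number of blocks is assumed to be a power of two."""
    n = len(blocks)
    v = [b""] * (2 * n - 1)
    for i, block in enumerate(blocks):
        v[n - 1 + i] = h(block)
    for i in range(n - 2, -1, -1):
        v[i] = h(v[2 * i + 1] + v[2 * i + 2])
    return v

def update_block(v, block_index, new_block):
    """Update one leaf and its ancestors in place; return the number of hashes computed."""
    n = (len(v) + 1) // 2
    node = n - 1 + block_index
    v[node] = h(new_block)
    rehashed = 1
    while node > 0:
        node = (node - 1) // 2                        # move to the parent
        v[node] = h(v[2 * node + 1] + v[2 * node + 2])
        rehashed += 1
    return rehashed

blocks = [b"block%d" % i for i in range(8)]
tree = build_array_tree(blocks)

rehashed = update_block(tree, 4, b"new content")
blocks[4] = b"new content"
print(rehashed)                              # 4 hashes for 8 blocks: log2(8) + 1
print(tree == build_array_tree(blocks))      # True: same result as a full rebuild
```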
Application of Merkle Tree
1. Digital signature
The Merkle tree was originally designed to handle Lamport one-time signatures efficiently. Each Lamport key can sign only one message, but combined with a Merkle tree, many such keys can be authenticated under a single root, so multiple messages can be signed. This approach became an efficient digital signature framework, the Merkle Signature Scheme.
2. P2P Networks
In a P2P network, a Merkle tree is used to ensure that data blocks received from other nodes are undamaged and have not been replaced, and even to check that other nodes are not cheating by sending fake blocks. The familiar BT download uses P2P technology to transfer data between clients, which both speeds up downloads and reduces the load on the download server. BT stands for BitTorrent, a centrally indexed P2P file sharing protocol [7].
To start a download, you must first obtain an index file with the extension .torrent (the so-called seed) from the central index server. The torrent file contains information about the files being shared, including file names, sizes, hash information for the files, and a URL pointing to the tracker [8]. The hash information in the torrent file consists of cryptographic digests of the content to be downloaded, and these digests are used for verification during the download. A large torrent file becomes a bottleneck for web servers and cannot be included directly in RSS or gossiped around (spread with a gossip protocol). A related problem is the use of large data blocks: to keep the torrent file small, the number of block hashes must be small, which means each block is relatively large. Large blocks hurt the efficiency of exchanging data between nodes, because a block can only be exchanged with other nodes after it has been completely downloaded and verified.
Both problems can be solved by replacing the hash list with a simple Merkle tree. A binary tree with enough levels is constructed; its leaf nodes are the hashes of the data blocks, and any leaves left over are filled with zeros. Each upper node is the hash of the concatenation of its child nodes. The hash algorithm is SHA1, the same as in ordinary torrents. The data transfer process is similar to the one described in the first section.
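A rough sketch of the padded tree described above is shown below. It uses SHA1 as the text states, and it assumes that the filler for unused leaves is 20 zero bytes (the length of a SHA1 digest); the function name and the sample blocks are made up, and the code is not taken from any BitTorrent implementation.

```python
import hashlib

ZERO_HASH = b"\x00" * 20   # assumed filler for leaves that have no data block

def padded_merkle_root(data_blocks):
    """Root of a binary SHA1 tree whose leaf count is padded to a power of two."""
    leaves = [hashlib.sha1(block).digest() for block in data_blocks]
    size = 1
    while size < len(leaves):
        size *= 2
    level = leaves + [ZERO_HASH] * (size - len(leaves))
    while len(level) > 1:
        level = [hashlib.sha1(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

data_blocks = [b"piece-0", b"piece-1", b"piece-2"]   # 3 blocks, padded to 4 leaves
root = padded_merkle_root(data_blocks)
print(root.hex())
```

With this layout the torrent file only has to carry the root hash, so its size no longer grows with the number of blocks.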
3. Trusted Computing
Trusted computing was proposed by the Trusted Computing Group to provide endpoint trustworthiness for the computing platforms of nodes participating in a distributed computing environment. Trusted computing technology introduces a Trusted Platform Module (TPM) at the hardware layer of the computing platform, which in effect gives the platform a hardware-based root of trust (RoT). Starting from the root of trust and using a chain-of-trust mechanism, trusted computing technology can measure the hardware and software layers of the local platform and reliably store the measurement results in the Platform Configuration Registers (PCRs) of the TPM. A remote computing platform can then use the remote attestation mechanism (Remote Attestation) to check the measurement results in the local PCRs and thereby verify the trustworthiness of the local platform. [10] Trusted computing lets the participating nodes of a distributed application drop their dependence on a central server and build trust directly on the TPM chip in the user's machine, which makes it possible to create secure distributed applications that are more scalable, more reliable, and more available. The core mechanism of trusted computing technology is remote attestation: the participating nodes of a distributed application establish mutual trust through it, which ensures the security of the application.
A remote attestation mechanism based on Merkle trees (called RAMT below) is proposed in [10]; its core is the integrity measurement hash tree.
First, what RAMT maintains in the kernel is no longer an integrity measurement list (ML) but an integrity measurement hash tree (IMHT). The data objects stored in the leaf nodes of the IMHT are the integrity hash values of the programs measured on the platform to be verified, while the internal nodes are generated dynamically from the hash of the concatenation of their child nodes' values, following the construction rule of the Merkle hash tree.
Second, to protect the integrity of the IMHT leaf nodes, RAMT uses a piece of storage inside the TPM to hold the trusted root hash of the IMHT.
Third, RAMT's integrity verification is based on the authentication path, which is the path from the leaf node being verified to the root hash of the IMHT.
4. IPFS
IPFS (InterPlanetary File System) brings together many excellent Internet technologies, such as DHT (Distributed Hash Table), the Git version control system, and BitTorrent. It creates a P2P swarm that allows IPFS objects to be exchanged. The totality of IPFS objects forms a cryptographically authenticated data structure known as a Merkle DAG.
An IPFS object is a data structure with two fields:
- Data: unstructured binary data, less than 256 kB in size
- Links: an array of Link structures, through which an IPFS object links to other objects
A Link structure has three fields:
- Name: the name of the link
- Hash: the hash of the linked IPFS object
- Size: the cumulative size of the linked IPFS object, including the objects it links to
Through Name and Links, a collection of IPFS objects forms a Merkle DAG (directed acyclic graph).
A small file (< 256 kB) is an IPFS object with no links.
A large file is represented as a collection of file blocks (each < 256 kB), referenced from an object that carries only minimal data indicating that it represents a large file. The names of that object's links are empty strings.
Directory structure: a directory is an IPFS object with no data, whose links point to the files and directories it contains.
IPFS can represent the data structures Git uses, such as the Git commit object. The main feature of a commit object is that it has one or more links named 'parent0', 'parent1', and so on (these links point to previous versions), plus a link named 'object' (called a tree in Git) that points to the file system structure referenced by the commit.
5. Bitcoin and Ethereum[12][13]
The original application of Merkle proofs in blockchains was Bitcoin, described and created by Satoshi Nakamoto in 2009. The Bitcoin blockchain uses Merkle proofs to store the transactions in every block.
The benefit is the concept Nakamoto described as "simplified payment verification" (SPV): instead of downloading every transaction and every block, a "light client" can download only the chain of block headers. A block header is an 80-byte piece of data per block that, apart from a version number, contains only five elements:
- The hash of the previous block header
- Timestamp
- Mining difficulty value
- Proof-of-work nonce
- Root hash of the Merkle tree containing the block's transactions
If a client wants to confirm the status of a transaction, it simply asks for a Merkle proof showing that the particular transaction is in one of the Merkle trees whose root is in a block header of the main chain.
But Bitcoin's light client has its limitations. One limitation is that, although it can prove the inclusion of transactions, it cannot prove anything about the current state (for example, holdings of digital assets, name registrations, the status of financial contracts, and so on).
How many bitcoins do you hold right now? A Bitcoin light client can use a protocol that queries multiple nodes and trusts that at least one of them will notify it of any transaction spending from its addresses, and this goes a long way for that use case. For other, more complex applications, however, it is far from enough. The precise effect of a transaction can depend on the effects of several previous transactions, which in turn depend on earlier transactions, so ultimately you would have to verify every transaction in the entire chain. To get around this, Ethereum takes the Merkle tree concept one step further.
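To show how little a light client has to store per block, the 80-byte header can be packed and unpacked with the standard `struct` module. The layout used here is the serialized Bitcoin header (4-byte version, 32-byte previous block hash, 32-byte Merkle root, then 4-byte time, difficulty bits and nonce, all little-endian); the field values below are made up and do not correspond to a real block, and the function names are hypothetical.

```python
import struct

# version, previous block hash, Merkle root, timestamp, difficulty bits, nonce
HEADER = struct.Struct("<I32s32sIII")

def pack_header(version, prev_hash, merkle_root, timestamp, bits, nonce):
    return HEADER.pack(version, prev_hash, merkle_root, timestamp, bits, nonce)

def parse_header(raw: bytes) -> dict:
    version, prev_hash, merkle_root, timestamp, bits, nonce = HEADER.unpack(raw)
    return {
        "version": version,
        "prev_hash": prev_hash,
        "merkle_root": merkle_root,
        "timestamp": timestamp,
        "bits": bits,
        "nonce": nonce,
    }

# Made-up example values, only to exercise the layout.
raw = pack_header(1, b"\x11" * 32, b"\x22" * 32, 1231006505, 0x1D00FFFF, 42)
header = parse_header(raw)

print(len(raw))                       # 80
print(header["merkle_root"].hex())    # the root a Merkle proof is checked against
```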
Ethereum's Merkle Proof
Every Ethereum block header contains not just one Merkle tree but three trees, one for each of three kinds of objects:
- Transactions
- Receipts (essentially, pieces of data showing the effect of each transaction)
- State
This makes a highly advanced light client protocol possible, one that lets light clients easily make and verify answers to the following kinds of queries:
- Has this transaction been included in a particular block?
- Tell me all instances of an event of type X (for example, a crowdfunding contract reaching its goal) emitted by this address in the past 30 days
- What is the current balance of my account?
- Does this account exist?
- Pretend to run this transaction on this contract. What would the output be?
The first query is handled by the transaction tree, the third and fourth by the state tree, and the second by the receipt tree. The first four are fairly straightforward to compute: the server simply finds the object, fetches the Merkle branch, and replies to the light client with the branch.
The fifth query is also handled by the state tree, but computing it is more complicated. Here we need to construct what can be called a Merkle state transition proof. Essentially, it is a proof of the claim "if you run transaction T on the state with root S, the result will be a state with root S', with log L and output O" ("output" exists as a concept in Ethereum because every transaction is a function call; it is not theoretically necessary).
To compute the proof, the server locally creates a fake block, sets the state to S, and pretends to be a light client while applying the transaction. That is, if applying the transaction requires the client to determine the balance of an account, the light client (simulated by the server) issues a balance query. If the light client needs to check a particular item in the storage of a particular contract, it makes that query as well. The server (while simulating the light client) answers all of its own queries correctly, but also keeps track of all the data it sends back.
The server then combines the data from all of these requests and sends it to the client as a proof.
The client then carries out exactly the same procedure, but uses the supplied proof as its database. If its result is the same as what the server claims, the client accepts the proof.
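The record-and-replay idea behind this procedure can be illustrated with a toy sketch. Everything here is hypothetical: the state is a plain dictionary of balances, `transfer` stands in for a transaction, and the Merkle branches that would authenticate each returned value against the state root S are omitted entirely. The sketch only shows how the server records what it reads and how the client re-runs the transaction using the proof as its database.

```python
class RecordingDB:
    """Server side: answers reads from the full state and records them."""
    def __init__(self, state):
        self.state = state
        self.accessed = {}

    def get(self, key):
        value = self.state[key]
        self.accessed[key] = value
        return value

class ProofDB:
    """Client side: can only answer reads that are contained in the proof."""
    def __init__(self, proof):
        self.proof = proof

    def get(self, key):
        return self.proof[key]   # raises KeyError if the proof is incomplete

def transfer(db, sender, receiver, amount):
    """Toy transaction: returns (changed values, output)."""
    sender_balance = db.get(sender)
    receiver_balance = db.get(receiver)
    if sender_balance < amount:
        return {}, "insufficient funds"
    return {sender: sender_balance - amount,
            receiver: receiver_balance + amount}, "ok"

# Server: run the transaction on the full state and record every read.
state = {"alice": 10, "bob": 3, "carol": 7}
recorder = RecordingDB(state)
server_result = transfer(recorder, "alice", "bob", 5)
proof = recorder.accessed            # only the entries that were touched

# Client: re-run the same transaction using the proof as its database.
client_result = transfer(ProofDB(proof), "alice", "bob", 5)
print(sorted(proof))                     # ['alice', 'bob']
print(client_result == server_result)    # True, so the client accepts
```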
MPT (Merkle Patricia Trees)
As mentioned earlier, the simplest kind of Merkle tree is in most cases a binary tree. The trees used in Ethereum are more complex, however; they are called "Merkle Patricia trees".
Binary Merkle trees are a very good data structure for authenticating information in list format (essentially, a series of data blocks one after another). They are also good for transaction trees, because it does not matter how long it takes to edit a tree once it has been built: the tree is created once and then never changes.
For the state tree, however, the situation is more complicated. The state in Ethereum essentially consists of a key-value map, where the keys are addresses and the values are account declarations listing the balance, nonce, code, and storage of each account (where the storage is itself a tree). For example, the genesis state of the Morden testnet looks as follows:
Unlike the transaction history, however, the state needs to be updated frequently: account balances and nonces change often, and, more importantly, new accounts are frequently inserted and keys in storage are frequently inserted and deleted. We therefore want a data structure in which the new tree root can be computed quickly after an insert, update, or delete operation, without recomputing the hashes of the entire tree. There are also two highly desirable secondary properties:
- The depth of the tree is bounded, even if an attacker deliberately crafts transactions to make the tree as deep as possible. Otherwise, an attacker could mount a denial-of-service (DoS) attack by making the tree so deep that every update becomes extremely slow.
- The root of the tree depends only on the data, not on the order in which updates are made. Making updates in a different order, or even recomputing the tree from scratch, does not change the root.
The Patricia tree (MPT) is the data structure that comes closest to satisfying all of these properties. The simplest explanation of how it works is that a value is stored under a key, and the key is encoded into the path that must be taken down the tree. Each node has 16 children, so the path is determined by the hexadecimal encoding of the key: for example, the key 'dog' is hex-encoded as 6 4 6 15 6 7, so you start at the root, go down the 6th child, then the 4th, then the 6th, then the 15th, and so on until you reach a leaf of the tree.
In practice, there are a few extra optimizations that make the process more efficient when the tree is sparse, but this is the basic principle.
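The key-to-path encoding can be checked with a short sketch. The toy trie below uses nested dictionaries with one slot per hex digit; it only illustrates how a key selects a path, and it leaves out the hashing of nodes and the sparse-tree optimizations, so it is not Ethereum's actual Merkle Patricia tree. All names are hypothetical.

```python
def key_to_nibbles(key: bytes):
    """Split each byte of the key into two hex digits (values 0 to 15)."""
    nibbles = []
    for byte in key:
        nibbles.append(byte >> 4)      # high hex digit
        nibbles.append(byte & 0x0F)    # low hex digit
    return nibbles

def trie_put(root: dict, key: bytes, value):
    """Walk (and create) one child per hex digit, then store the value."""
    node = root
    for nibble in key_to_nibbles(key):
        node = node.setdefault(nibble, {})
    node["value"] = value

def trie_get(root: dict, key: bytes):
    """Follow the path for the key; return None if the path does not exist."""
    node = root
    for nibble in key_to_nibbles(key):
        if nibble not in node:
            return None
        node = node[nibble]
    return node.get("value")

print(key_to_nibbles(b"dog"))    # [6, 4, 6, 15, 6, 7]

root = {}
trie_put(root, b"dog", "puppy")
trie_put(root, b"do", "verb")
print(trie_get(root, b"dog"))    # puppy
print(trie_get(root, b"cat"))    # None
```

Because the path is fixed by the key alone, the shape of the tree does not depend on the order of insertion, which is what makes the root order-independent once hashes are added.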
6. Other applications
There are many other applications of Merkle trees, such as Git, Amazon Dynamo, the Apache Wave protocol, the Tahoe-LAFS backup system, the Certificate Transparency framework, and NoSQL systems such as Apache Cassandra and Riak.
References
[1] https://en.wikipedia.org/wiki/Merkle_tree
[2] https://en.wikipedia.org/wiki/Hash_function#Hash_function_algorithms
[3] http://www.jianshu.com/p/458e5890662f
[4] http://blog.csdn.net/xtu_xiaoxin/article/details/8148237
[5] http://blog.csdn.net/yuanrxdu/article/details/22474697?utm_source=tuicool&utm_medium=referral
[6] http://crypto.stackexchange.com/questions/22669/merkle-hash-tree-updates
[7] https://en.wikipedia.org/wiki/BitTorrent
[8] Liang Chengren, Li Jianyong, Huang Daoying, et al. Optimization strategy of BT system torrent files based on Merkle tree [J]. Computer Engineering, 2008, 34(3): 85-87.
[9] http://bittorrent.org/beps/bep_0030.html
[10] Xu Ziyao, He Yeping, Deng Lingli. An efficient remote attestation mechanism for privacy protection [J]. Journal of Software, 2011, 22(2).
[11] http://whatdoesthequantsay.com/2015/09/13/ipfs-introduction-by-example/
[12] https://www.weusecoins.com/what-is-a-merkle-tree/
[13] http://www.8btc.com/merkling-in-ethereum