Source post address: http://blog.csdn.net/colinchan/archive/2006/05/08/712760.aspx
Petar Maymounkov and David Mazières
{petar, dm}@cs.nyu.edu
http://kademlia.scs.cs.nyu.edu
Abstract
This paper describes a peer-to-peer (P2P) system with provable consistency and performance in a fault-prone network environment. Our system routes queries and locates nodes using a novel topology based on the exclusive-or (XOR) operation, which simplifies the algorithm and eases verification. The topology has the property that every message exchanged conveys or reinforces useful contact information between nodes. The system exploits this information to send parallel, asynchronous query messages that tolerate node failures without imposing timeout delays on users.
1. Introduction
This paper describes Kademlia, a peer-to-peer (P2P) <key, value> storage and lookup system. Kademlia has a number of desirable features not simultaneously offered by any previous P2P system. It minimizes the number of configuration messages nodes must send to learn about each other; configuration information spreads automatically as a side effect of key lookups. Nodes have enough knowledge and flexibility to route queries through low-latency paths. Kademlia uses parallel, asynchronous queries to avoid the timeout delays caused by failed nodes. The algorithm by which nodes record each other's existence resists certain basic denial-of-service (DoS) attacks. Finally, several important properties of Kademlia can be formally proven using only weak assumptions on uptime distributions (assumptions validated by measurements of existing peer-to-peer systems).
Kademlia takes the basic approach common to many P2P systems. Keys are opaque, 160-bit quantities (e.g., the SHA-1 hash of some larger data). Each participating machine has a node ID in the same 160-bit key space. <key, value> pairs are stored on nodes whose IDs are "close" to the key, for some notion of closeness. Finally, a node-ID-based routing algorithm lets anyone locate servers near any given target key.

Many of Kademlia's benefits result from a novel choice: defining the distance between two nodes as the integer value of the exclusive or (XOR) of their IDs. XOR is symmetric, allowing Kademlia participants to receive lookup queries from precisely the same distribution of nodes contained in their routing tables. Without this property, systems such as Chord cannot learn useful routing information from the queries they receive. Worse yet, because Chord's metric is asymmetric, Chord's routing tables are more rigid: each entry in a Chord node's finger table must store the precise node preceding some interval in the ID space, since any node actually inside the interval would be farther from particular keys than the interval's start. Kademlia, in contrast, can send a query to any node within an interval, allowing routes to be selected based on latency, or even sending parallel, asynchronous queries.

To locate nodes near a particular ID, Kademlia uses a single routing algorithm from start to finish. In contrast, other systems use one algorithm to get near the target ID and a different one for the last few hops. Of existing systems, Kademlia most resembles Pastry's first phase (although the authors do not describe it this way), which, measured by Kademlia's XOR metric, successively finds nodes roughly half as far from the target ID. In its second phase, however, Pastry switches distance metrics to the numeric difference between IDs, and it also uses the numeric-difference metric in replication. Unfortunately, nodes close by the second metric can be quite far by the first, creating discontinuities at particular node ID values, reducing performance, and frustrating attempts at formal analysis of worst-case behavior.
2. System Description
Each Kademlia node has a 160-bit node ID. In the Chord system, IDs are constructed by specific rules; in this paper, to simplify, we assume every machine just chooses a random 160-bit value when joining the system. Every message a node transmits includes its node ID, permitting the recipient to record the sender's existence if necessary.

Keys, too, are 160-bit identifiers. To publish and find <key, value> pairs, Kademlia relies on a notion of distance between two identifiers. Given two identifiers, x and y, Kademlia defines the distance between them as their bitwise exclusive or (XOR), interpreted as an integer: d(x, y) = x ⊕ y.

We first note that XOR is a valid, if non-Euclidean, metric. It obviously has the following properties: d(x, x) = 0; d(x, y) > 0 if x ≠ y; and for all x, y: d(x, y) = d(y, x). XOR also satisfies the triangle inequality: d(x, y) + d(y, z) ≥ d(x, z). The triangle inequality holds because d(x, z) = d(x, y) ⊕ d(y, z), and because a + b ≥ a ⊕ b for any a ≥ 0, b ≥ 0.

Like Chord's clockwise circle metric, XOR is unidirectional: for any given point x and distance Δ > 0, there is exactly one point y such that d(x, y) = Δ. Unidirectionality ensures that all lookups for the same key converge along the same path, regardless of the originating node. Thus, caching <key, value> pairs along the lookup path alleviates hot spots. Like Pastry and unlike Chord, the XOR metric is also symmetric (d(x, y) = d(y, x) for all x and y).
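As a concrete illustration (our addition, not part of the original paper), here is a minimal Python sketch of the XOR metric, with 160-bit IDs represented as plain Python integers, and a spot-check of the properties just listed:

import hashlib
import random

ID_BITS = 160

def make_id(data: bytes) -> int:
    # Keys are opaque 160-bit quantities, e.g. the SHA-1 hash of larger data.
    return int.from_bytes(hashlib.sha1(data).digest(), "big")

def distance(x: int, y: int) -> int:
    # The Kademlia distance: bitwise XOR, interpreted as an integer.
    return x ^ y

key = make_id(b"some file contents")

# Spot-check the metric's properties on random 160-bit IDs.
x, y, z = (random.getrandbits(ID_BITS) for _ in range(3))
assert distance(x, x) == 0                                 # identity
assert distance(x, y) == distance(y, x)                    # symmetry
assert x == y or distance(x, y) > 0                        # positivity
assert distance(x, y) + distance(y, z) >= distance(x, z)   # triangle inequality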
2.1 Node State
Kademlia nodes store contact information about each other in order to route query messages. For each 0 ≤ i < 160, every node keeps a list of <IP address, UDP port, node ID> triples for nodes whose distance from itself lies between 2^i and 2^(i+1). We call these lists k-buckets. Each k-bucket is kept sorted by time last seen: the least-recently seen node is at the head, and the most-recently seen node is at the tail. For small values of i, the k-buckets will generally be empty (as no appropriate nodes will exist in the system). For large values of i, the lists can grow up to size k, where k is a system-wide replication parameter. k is chosen such that any given k nodes are very unlikely to fail within an hour of each other (for example, k = 20).
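A brief sketch of how a node might map a contact to its k-bucket, under the integer-ID representation of the previous snippet: the bucket index is the position of the highest differing bit, i.e., the i with 2^i ≤ d(x, y) < 2^(i+1).

def bucket_index(own_id: int, other_id: int) -> int:
    # Returns the i such that 2**i <= d(own_id, other_id) < 2**(i + 1).
    d = own_id ^ other_id
    assert d != 0, "a node does not store itself in a k-bucket"
    return d.bit_length() - 1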
Figure 1: Probability of remaining online another hour as a function of uptime. The x axis represents minutes online; the y axis shows the fraction of nodes that stayed online at least x minutes that also stayed online at least x + 60 minutes.

When a Kademlia node receives any message (request or reply) from another node, it updates the appropriate k-bucket for the sender's node ID. If the sending node already exists in the recipient's k-bucket, the recipient moves it to the tail of the list. If the node is not already in the appropriate k-bucket and the bucket has fewer than k entries, the recipient just inserts the new sender at the tail of the list. If the appropriate k-bucket is full, however, the recipient pings the k-bucket's least-recently seen node to see whether it is still alive. If the least-recently seen node fails to respond, it is evicted from the k-bucket and the new sender is inserted at the tail. If it does respond, it is moved to the tail of the list and the new sender's contact information is discarded.

k-buckets effectively implement least-recently seen eviction, except that live nodes are never removed from the list. This preference for old contacts is driven by analysis of Gnutella trace data collected by Saroiu et al. [4]. Figure 1 shows the percentage of Gnutella nodes that stay online another hour as a function of current uptime: the longer a node has been up, the more likely it is to remain up another hour. By keeping the oldest live contacts around, k-buckets maximize the probability that the nodes they contain will remain online. A second benefit of k-buckets is that they provide resistance to certain DoS attacks. One cannot flush nodes' routing state by flooding the system with new nodes: Kademlia nodes only insert new nodes into the k-buckets when old nodes leave the system.
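The update rule above translates almost directly into code. Below is a minimal, illustrative sketch; the caller-supplied ping callable and hashable contact objects are assumptions of the example, not part of the paper:

from collections import deque

K = 20  # system-wide replication parameter

class KBucket:
    def __init__(self):
        # Head of the deque = least-recently seen, tail = most-recently seen.
        self.contacts = deque()

    def update(self, contact, ping) -> None:
        # 'ping' is a caller-supplied callable: ping(contact) -> bool (alive?).
        if contact in self.contacts:
            self.contacts.remove(contact)
            self.contacts.append(contact)    # known contact: move to tail
        elif len(self.contacts) < K:
            self.contacts.append(contact)    # room left: insert at tail
        else:
            oldest = self.contacts.popleft()
            if ping(oldest):
                # Oldest is alive: keep it, move it to the tail, drop the newcomer.
                self.contacts.append(oldest)
            else:
                # Oldest is dead: evict it and insert the newcomer at the tail.
                self.contacts.append(contact)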
2.2 Kademlia Protocol
The Kademlia protocol consists of four remote procedure calls (RPCs): PING, STORE, FIND_NODE, and FIND_VALUE. The PING RPC probes a node to see if it is online. STORE instructs a node to store a <key, value> pair for later retrieval.

FIND_NODE takes a 160-bit ID as an argument. The recipient of the RPC returns the <IP address, UDP port, node ID> triples for the k nodes it knows about closest to the target ID. These triples can come from a single k-bucket, or from multiple k-buckets if the closest k-bucket is not full. In any case, the RPC recipient must return k items (unless there are fewer than k nodes in all of its k-buckets combined, in which case it returns every node it knows about).

FIND_VALUE behaves like FIND_NODE, returning <IP address, UDP port, node ID> triples, with one exception: if the RPC recipient has received a STORE RPC for the key, it just returns the stored value.

In all RPCs, the recipient must echo a 160-bit random RPC ID, which provides some resistance to address forgery. PINGs can also be piggy-backed on RPC replies to provide extra assurance of the sender's network address.

The most important procedure a Kademlia participant must perform is to locate the k closest nodes to some given node ID. We call this procedure a node lookup. Kademlia employs a recursive algorithm for node lookups. The lookup initiator starts by picking α nodes from its closest non-empty k-bucket (or, if that bucket has fewer than α entries, it just takes the α closest nodes it knows of). The initiator then sends parallel, asynchronous FIND_NODE RPCs to the α nodes it has chosen. α is a system-wide concurrency parameter, such as 3.

In the recursive step, the initiator resends FIND_NODE to nodes it has learned about from previous RPCs (this recursion can begin before all of the previous RPCs have returned). Of the k nodes the initiator has heard of closest to the target, it picks α that it has not yet queried and resends FIND_NODE RPCs to them. Nodes that fail to respond quickly are removed from consideration until and unless they do respond. If a round of FIND_NODEs fails to return a node any closer than the closest already seen, the initiator resends FIND_NODE to all of the k closest nodes it has not already queried. The lookup terminates when the initiator has queried and gotten responses from the k closest nodes it has seen. When α = 1, the lookup algorithm resembles Chord's in terms of message cost and the latency of detecting failed nodes. However, Kademlia can route for lower latency because it has the flexibility of choosing any one of k nodes to forward a request to.

Most operations are implemented in terms of the above lookup procedure. To store a <key, value> pair, a participant locates the k nodes closest to the key and sends them STORE RPCs. Additionally, each node re-publishes all of its <key, value> pairs once an hour. This ensures, with high probability, that the pair persists in the system (as we shall see in the Sketch of Proof section). Ordinarily, we also require the original publisher of a <key, value> pair to republish it every 24 hours. Otherwise, all <key, value> pairs expire 24 hours after the original publication, in order to limit stale information in the system.

Finally, to maintain consistency in the publishing-searching life cycle of a <key, value> pair, we require that whenever a node w observes a new node u that is closer to some of w's <key, value> pairs than w itself is, w replicates those pairs to u without removing them from its own database.
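The following is a simplified, round-based sketch of the node lookup (our own illustration; the real protocol issues the FIND_NODE RPCs asynchronously and in parallel, with rounds here standing in for that). Contacts are assumed to be hashable (node_id, address) tuples, and find_node is an assumed RPC stub:

ALPHA = 3  # system-wide concurrency parameter (the paper suggests 3)
K = 20     # replication parameter, as above

def node_lookup(target: int, seeds: list, find_node) -> list:
    # 'seeds' holds initial (node_id, address) contact tuples; 'find_node'
    # is an assumed RPC stub: find_node(contact, target) -> list of contacts.
    dist = lambda c: c[0] ^ target
    shortlist = sorted(set(seeds), key=dist)[:K]
    queried, batch = set(), ALPHA
    while True:
        unqueried = [c for c in shortlist if c not in queried]
        if not unqueried:
            return shortlist          # the k closest nodes seen, all queried
        best_before = dist(shortlist[0])
        for contact in unqueried[:batch]:
            queried.add(contact)
            shortlist.extend(find_node(contact, target))
        shortlist = sorted(set(shortlist), key=dist)[:K]
        # A round that brings nothing closer widens the next round from
        # alpha to all of the k closest unqueried nodes, per the paper.
        batch = ALPHA if dist(shortlist[0]) < best_before else K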
To find a <key, value> pair, a node starts by performing a lookup to find the k nodes with IDs closest to the key, except that the value lookup uses FIND_VALUE rather than FIND_NODE RPCs. The procedure halts immediately as soon as any node returns the value. For caching purposes, once a lookup succeeds, the requesting node stores the <key, value> pair at the closest node it observed to the key that did not return the value.

Because of the unidirectionality of the topology, future searches for the same key are likely to hit cached entries before querying the closest node. During times of high popularity for a certain key, the system might end up caching it at many nodes. To avoid over-caching, we make the expiration time of a <key, value> pair in any node's database exponentially inversely proportional to the number of nodes between the current node and the node whose ID is closest to the key ID. Simple least-recently-used eviction would result in a similar lifetime distribution, but there is no natural way of choosing the cache size, since nodes cannot know in advance how many values the system will store.

Buckets are generally kept fresh by the traffic of requests traveling through nodes. To handle pathological cases in which there is no traffic, each node refreshes any bucket it has not performed a node lookup in within an hour. Refreshing means picking a random ID in the bucket's range and performing a node lookup for that ID.

To join the network, a node u must have a contact to an already participating node w. u inserts w into the appropriate k-bucket, and then performs a node lookup for its own node ID. Finally, u refreshes all k-buckets farther away than its closest neighbor. During the refreshes, u accomplishes two necessary tasks: it populates its own k-buckets, and it inserts itself into other nodes' k-buckets as necessary.
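A hedged sketch of refresh and join, building on the earlier snippets (node_lookup and bucket_index are defined above; the node object's update_bucket, known_contacts, and find_node methods are assumptions of the example, not the paper's API):

import random

def random_id_in_bucket(own_id: int, i: int) -> int:
    # A random ID whose distance from own_id lies in [2**i, 2**(i + 1)).
    return own_id ^ (2 ** i + random.randrange(2 ** i))

def refresh_bucket(node, i: int) -> None:
    # Refreshing a bucket = a node lookup for a random ID in its range.
    target = random_id_in_bucket(node.node_id, i)
    node_lookup(target, node.known_contacts(), node.find_node)

def join(node, bootstrap_contact) -> None:
    # A joining node inserts its one known contact, looks up its own ID,
    # then refreshes all buckets farther away than its closest neighbor.
    node.update_bucket(bootstrap_contact)
    closest = node_lookup(node.node_id, [bootstrap_contact], node.find_node)
    nearest_index = bucket_index(node.node_id, closest[0][0])
    for i in range(nearest_index + 1, 160):
        refresh_bucket(node, i)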
3. Sketch of Proof
To demonstrate proper function of our system, we need to prove that most operations take ⌈log n⌉ + c time for some small constant c, and that a <key, value> lookup returns a key stored in the system with overwhelming probability.

We start with some definitions. For a k-bucket covering the distance range [2^i, 2^(i+1)), define the index of the bucket to be i. Define the depth, h, of a node to be 160 − i, where i is the smallest index of a non-empty bucket. Define node y's bucket height in node x to be the index of the bucket into which x would insert y minus the index of x's least significant empty bucket. Because node IDs are randomly chosen, highly non-uniform distributions of depth are unlikely. Thus, with overwhelming probability, the height of any given node will be within a constant of log n for a system with n nodes. Moreover, the bucket height of the closest node to an ID in the kth-closest node will likely be within a constant of log k.

Next, assume the invariant that every k-bucket of every node contains at least one contact if a node exists in the appropriate range. Given this invariant, we show that the node lookup procedure is correct and takes logarithmic time. Suppose the closest node to the target ID has depth h. If none of this node's h most significant k-buckets is empty, the lookup procedure will find a node half as close (or rather, whose distance is one bit shorter) in each step, and thus turn up the node in h − log k steps. If one of the node's k-buckets is empty, it could be the case that the target node resides in the range of the empty bucket. In this case, the final steps will not decrease the distance by half. However, the search will proceed exactly as though the bit in the key corresponding to the empty bucket had been flipped. Thus, the lookup algorithm will always return the closest node in h − log k steps. Moreover, once the closest node is found, the concurrency switches from α to k. The number of steps for finding the remaining k − 1 closest nodes can be no more than the bucket height of the closest node in the k closest, which is unlikely to be more than a constant plus log k.

To prove the correctness of the invariant above, first consider the effects of bucket refreshing if the invariant holds. After being refreshed, a bucket will either contain k valid nodes or else contain every node in its range if fewer than k exist (this follows from the correctness of the node lookup procedure). New nodes that join will also be inserted into any buckets that are not full. Thus, the only way to violate the invariant is for k + 1 or more nodes to exist in the range of a particular bucket, and for the k nodes actually contained in the bucket all to fail with no intervening lookups or refreshes. However, k was precisely chosen so that the probability of all k nodes failing within an hour (the maximum refresh interval) is small enough. In practice, the probability of failure is much smaller than the probability of all k nodes leaving within an hour, because every incoming or outgoing request message updates nodes' buckets. This is a consequence of the symmetry of the XOR metric: in an incoming or outgoing request, the IDs of the peers a given node communicates with are evenly distributed over the node's bucket ranges.
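Before turning to failure probabilities, it may help to collect the step counts above in one place (our own restatement in LaTeX, under the definitions in the text):

\[
\begin{aligned}
h &= 160 - \min\{\, i : \text{bucket } i \text{ is non-empty}\,\} && \text{(node depth)}\\
\text{steps to the closest node} &\le h - \log k\\
\text{steps for the remaining } k-1 \text{ nodes} &\le \log k + O(1)\\
\text{total lookup cost} &\le h + O(1) \approx \log n + O(1) && \text{w.h.p.}
\end{aligned}
\]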
Moreover, even if the invariant fails for a single bucket in a single node, this will only affect running time (by adding one hop to some lookups), not the correctness of node lookups. For a lookup to fail, k nodes on a lookup path must each lose k nodes in the same bucket with no intervening lookups or refreshes. If the nodes' buckets all cover different ranges, this happens with probability 2^(−k²). Otherwise, nodes appearing in multiple other nodes' buckets will likely have longer uptimes, and thus a lower probability of failure.

Now consider the recovery of a <key, value> pair. When a <key, value> pair is published, it is stored on the k nodes closest to the key, and it is re-published every hour. Since even new nodes (the least reliable ones) have probability 1/2 of lasting one hour, one hour later the <key, value> pair will still be present on one of the k closest nodes with probability 1 − 2^(−k). This property is not violated by the insertion of new nodes close to the key, because as soon as such nodes are inserted, they contact the nodes closest to them in order to fill their buckets, and thereby receive any nearby <key, value> pairs they should store. Of course, if all of the k nodes closest to the key fail and the <key, value> pair has not been cached elsewhere, Kademlia will lose the pair.
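To put rough numbers on these bounds (a back-of-the-envelope illustration, not from the paper), with the suggested k = 20:

\[
\Pr[\text{pair lost within an hour}] \le 2^{-k} = 2^{-20} \approx 9.5 \times 10^{-7},
\qquad
\Pr[\text{lookup fails}] \le 2^{-k^2} = 2^{-400} \approx 3.9 \times 10^{-121}.
\]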
4. Discussion
The routing algorithm we use is similar to those of Pastry [1] and Tapestry [2]. All three, however, run into problems when they attempt to approach the target ID b bits at a time (for acceleration purposes). Without the XOR topology, an additional algorithmic structure is needed for discovering the target among nodes that share the same prefix but differ in the next b-bit digit. All three algorithms resolve this problem in different ways, each with its own drawbacks: they require secondary routing tables of size O(2^b) in addition to main tables of size O(2^b log_{2^b} n). This increases the cost of bootstrapping and maintenance, complicates the protocols, and, for Pastry and Tapestry, frustrates formal analysis of correctness and consistency. Plaxton has a proof [3], but the system is not suited for highly fault-prone environments such as peer-to-peer networks.

Kademlia, in contrast, is easily optimized with a base other than 2. We can configure our bucket table so as to approach the target b bits per hop. This requires having one bucket for each range of nodes at a distance [j · 2^(160−(i+1)b), (j+1) · 2^(160−(i+1)b)] from us, for each 0 < j < 2^b and 0 ≤ i < 160/b. This is expected to involve no more than (2^b − 1) · log_{2^b} n buckets with actual entries. Our current implementation uses b = 5.
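As a hedged sketch of this base-2^b bucket layout (our own illustration, under the integer-ID assumptions of the earlier snippets): the generalized bucket for a contact can be identified by the number of whole b-bit digits its ID shares with ours, together with the value of the first differing digit.

B = 5  # bits resolved per hop; the current implementation uses b = 5

def generalized_bucket(own_id: int, other_id: int, b: int = B):
    # Returns (i, j): i = number of leading b-bit digits shared with us,
    # j = first differing digit of the XOR, so that the distance lies in
    # [j * 2**(160 - (i + 1) * b), (j + 1) * 2**(160 - (i + 1) * b)).
    d = own_id ^ other_id
    assert d != 0, "a node does not store itself in a bucket"
    i = (160 - d.bit_length()) // b
    j = d >> (160 - (i + 1) * b)
    return i, j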
5. Summary
With its novel XOR-based topology, Kademlia is the first peer-to-peer system to combine provable consistency and performance, latency-minimizing routing, and a symmetric, unidirectional topology. Kademlia additionally introduces a concurrency parameter, α, that lets people trade a constant factor in bandwidth for asynchronous lowest-latency hop selection and delay-free fault recovery. Finally, Kademlia is the first peer-to-peer system to exploit the fact that node failure rates are inversely related to uptime.
References
[1] A. Rowstron and P. Druschel. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. Accepted for Middleware, 2001. http://research.microsoft.com/~antr/pastry/.
[2] Ben Y. Zhao, John Kubiatowicz, and Anthony Joseph. Tapestry: An infrastructure for fault-tolerant wide-area location and routing. Technical Report UCB/CSD-01-1141, U.C. Berkeley, April 2001.
[3] C. Greg Plaxton, Rajmohan Rajaraman, and Andrea W. Richa. Accessing nearby copies of replicated objects in a distributed environment. In Proceedings of the ACM SPAA, pages 311-320, June 1997.
[4] Stefan Saroiu, P. Krishna Gummadi, and Steven D. Gribble. A Measurement Study of Peer-to-Peer File Sharing Systems. Technical Report UW-CSE-01-06-02, University of Washington, Department of Computer Science and Engineering, July 2001.
[5] Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In Proceedings of the ACM SIGCOMM '01 Conference, San Diego, California, August 2001.