Description of the Kademlia code in eMule
When eMule uses the Kademlia network, the failure of a central server is no longer a concern, because the network has no central server. Put another way, every user is both a server and a client, so the network is fully peer-to-peer. This article analyzes the Kademlia network in eMule. This section covers the principles; the remaining sections are organized around the classes that implement Kademlia in eMule:
CKademlia is the master class of the whole Kademlia network. It starts and stops the network directly and contains the Process method that handles routine work.
CPrefs handles the local Kademlia-related information, such as the local ID.
CRoutingZone, CRoutingBin, and CContact together represent the contacts each node has learned about and the data structure that organizes them.
CKademliaUDPListener processes network messages.
CIndexed handles the index information stored locally.
CSearch and CSearchManager handle search-related operations. The former represents a single search task, and the latter manages all search tasks.
CUInt128, mentioned earlier, represents a 128-bit integer and provides the built-in operations on it.
Basic Principles of Kademlia in eMule
Kademlia is a structured overlay network. An overlay network is a virtual network built on top of the physical Internet: every participating node knows the IP addresses of some other nodes, which are called its neighbors. When a node needs to find something, it first searches locally; if it finds nothing, it forwards the query to its neighbors in the hope that they can return a result. Overlay networks come in two kinds, structured and unstructured, and the difference is whether there is a rule governing which other nodes each node knows about. In an unstructured overlay network, a node's neighbors follow no particular rule, so queries rely on a method called flooding: a node that cannot find the desired result locally forwards the request to its neighbors, which forward it to their neighbors, and so on step by step. If this is not handled carefully, the message load on the whole network becomes too large. Many articles have explored in depth how to optimize queries in unstructured overlay networks.
In a structured overlay network, each node follows a rule when choosing which nodes to keep as neighbors. When a node forwards a search request, it can therefore choose the nodes to forward it to according to that rule, which reduces the cost of the search. A structured overlay network usually requires each node to generate a random ID that determines the relationships between nodes. This ID must be independent of the physical network in which the node is located.
In the Kademlia network, this ID is a 128-bit value. Every node uses it to measure its logical distance to other nodes, and the logical distance between two nodes is computed by XORing their IDs. As the Kademlia network forms, each node selects neighbors on the principle that the closer a node is to it in logical distance, the more likely that node is to be added to its neighbor list. Specifically, when a node learns about a new node, whether to add it to the neighbor list is decided by distance. The actual code is described later.
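The XOR metric just described can be sketched in a few lines. This is only an illustration and not eMule's CUInt128 code: the names `xor_distance` and `common_prefix_bits` are invented for the example, and IDs are held as plain Python integers.

```python
ID_BITS = 128  # width of a Kademlia ID as stated in the text

def xor_distance(a: int, b: int) -> int:
    """Logical distance between two IDs: their bitwise XOR, read as an integer."""
    return a ^ b

def common_prefix_bits(a: int, b: int) -> int:
    """Number of leading bits two IDs share (hypothetical helper)."""
    return ID_BITS - xor_distance(a, b).bit_length()

own = 0b1010 << 124    # ID whose first four bits are 1010
near = 0b1011 << 124   # shares the three-bit prefix 101 with `own`
far = 0b0010 << 124    # differs from `own` in the very first bit

print(xor_distance(own, near) < xor_distance(own, far))              # True
print(common_prefix_bits(own, near), common_prefix_bits(own, far))   # 3 0
```

A longer common prefix always means a smaller XOR distance, which is why the two notions are used interchangeably in the rest of the text.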
The advantage of a structured network is that when we want to find a node logically close enough to a specific ID, the number of hops is guaranteed to be on the order of O(log n). We only need to start from a node at some logical distance from the target ID and ask it whether it knows nodes closer to the target, then repeat. The same applies to searching for content. To publish a resource, the file is hashed into a 128-bit ID (or its keywords are hashed), the nodes logically closest to the result are located, and the file or keyword information is sent to them for storage. Someone searching for the same thing uses the same hash algorithm, computes the same ID, and looks for nodes logically close to it, knowing that if the resource exists in the network, those nodes are the most likely to know about it. Resource lookup in a structured network is therefore very efficient. Its disadvantage compared with an unstructured overlay network is that it cannot perform complex queries: you can only search by simple keywords or file hash values. In an unstructured network, queries are forwarded arbitrarily, and every node that receives a query knows its own local resources well, so complex queries can be supported; obviously, though, such a query is unlikely to reach every node. I do not currently know of a way to combine the advantages of the two kinds of overlay network, and I am very curious whether one exists.
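The publish and search symmetry can be illustrated with a small sketch. It assumes MD5 purely as a stand-in hash, chosen because it yields 128 bits and ships with Python; it is not claimed to be the hash eMule uses. The helper names and the 50-node toy network are likewise invented for the example.

```python
import hashlib

def key_id(text: str) -> int:
    """Map a keyword to a 128-bit ID (MD5 is only a stand-in hash here)."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest(), "big")

def closest_nodes(target: int, node_ids: list, count: int) -> list:
    """Return the `count` node IDs with the smallest XOR distance to `target`."""
    return sorted(node_ids, key=lambda n: n ^ target)[:count]

nodes = [key_id(f"node-{i}") for i in range(50)]   # pretend network of 50 nodes
target = key_id("ubuntu")

# Publisher and searcher hash the same keyword independently, so both
# arrive at the same ID and therefore at the same set of closest nodes.
publish_to = closest_nodes(target, nodes, 3)
search_at = closest_nodes(key_id("ubuntu"), nodes, 3)
print(publish_to == search_at)   # True
```

Because the only shared knowledge is the hash function, neither side needs to know who the other is; this is also why only exact keywords or file hashes can be looked up.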
Basic infrastructure classes of Kademlia in eMule
The main control class of Kademlia is CKademlia, which is responsible for starting and stopping all the code related to the Kademlia network. Its Process function handles the routine work of the network. For example, it checks whether a certain ID range holds too few nodes and, if so, looks for new ones; it also checks its neighbors regularly. These are all tasks that must be scheduled routinely, as is the routine processing of all search tasks. CKademlia also acts as the representative of the Kademlia network, returning statistics about it to the rest of the eMule code.
CPrefs is another basic infrastructure class, similar to CPreferences in the general eMule code. However, CPrefs keeps only the local information that relates to the Kademlia network and needs to be stored long term. In this version, that is chiefly the local ID.
Another important piece of infrastructure is CUInt128, which implements the various operations on 128-bit IDs, as mentioned in the previous sections.
Contact List Management for Kademlia in eMule
CRoutingZone, CRoutingBin, and CContact make up the contact list data structure. It must meet the needs of searching: the time to look up a target must be acceptable, and so must the space it occupies.
First, the CContact class holds the information about one contact: the other party's IP address, ID, TCP port, UDP port, Kad version number, and health (m_byType). Health ranges from 0 to 4. A newly added contact, whose health is still unknown, is set to 3. The system checks the health of every contact periodically. A contact that can be reached has its value gradually lowered toward 0, while a contact that cannot be reached has its value gradually raised. A contact that still cannot be reached once its value is 4 is deleted from the contact list.
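The health value can be pictured as a small state machine. The sketch below is not the CContact implementation: the field names are invented, and moving exactly one step per check is an assumption, since the text only says the value changes gradually.

```python
from dataclasses import dataclass

@dataclass
class Contact:
    node_id: int
    ip: str
    udp_port: int
    tcp_port: int
    health: int = 3   # new contact, status unknown (m_byType in the text)

    def on_reply(self) -> None:
        """Contact answered a check: move one step toward 0 (assumed step size)."""
        self.health = max(0, self.health - 1)

    def on_timeout(self) -> bool:
        """Contact did not answer. Returns True when it should be removed."""
        if self.health >= 4:
            return True
        self.health += 1
        return False

c = Contact(node_id=1, ip="10.0.0.1", udp_port=4672, tcp_port=4662)
print(c.on_timeout(), c.health)   # False 4
print(c.on_timeout())             # True: unreachable again at 4, remove it
```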
The CRoutingBin class contains a list of CContact objects. Note that contact information is always accessed through a CRoutingBin; CRoutingZone does not hold contacts directly. You can add a new contact to a specific CRoutingBin and look up contacts in it. It can also find the contacts closest to a given ID and return them as a list, which is very important. Finally, the number of CContact objects a CRoutingBin can hold is limited.
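A bounded bin with a closest-contacts query might look like the following sketch. The class and method names are made up, contacts are reduced to bare integer IDs, and the default capacity of 10 is an assumption, as the text does not give the limit.

```python
class RoutingBin:
    """Bounded list of contact IDs (stand-in for the bin described above)."""

    def __init__(self, capacity: int = 10):   # capacity value is an assumption
        self.capacity = capacity
        self.contacts = []

    def add(self, contact_id: int) -> bool:
        """Add a contact; refuse duplicates and refuse when the bin is full."""
        if contact_id in self.contacts or len(self.contacts) >= self.capacity:
            return False
        self.contacts.append(contact_id)
        return True

    def closest_to(self, target: int, count: int) -> list:
        """Contacts ordered by XOR distance to `target`, nearest first."""
        return sorted(self.contacts, key=lambda c: c ^ target)[:count]

bin_ = RoutingBin(capacity=3)
results = [bin_.add(c) for c in (0b0001, 0b0100, 0b0111, 0b1000)]
print(results)                      # [True, True, True, False]
print(bin_.closest_to(0b0110, 2))   # [7, 4]
```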
The CRoutingZone class sits at the top of the contact data structure and provides the operation interface used by the rest of the Kademlia code. It is structured as a binary tree: each CRoutingZone has two CRoutingZone pointers to its left and right subtrees, and it also contains a CRoutingBin pointer. The CRoutingBin pointer is meaningful only when the current CRoutingZone is a leaf of the tree. The defining feature of this binary tree is that the IDs of all contacts under a node share a common prefix, and the deeper the node, the longer that prefix. For example, the IDs of all contacts in the left subtree of the root have the prefix "0", and those in the right subtree have the prefix "1". Similarly, the IDs of all contacts under the right subtree of the left subtree of the root have the prefix "01", and so on.
Now imagine adding contacts to this binary tree. At the beginning there is only the root, which is also a leaf, so its CRoutingBin is meaningful. As contacts keep being added, the CRoutingBin fills up and a split is required. Two child nodes, left and right, are created; the contacts in the CRoutingBin are copied to the left or right child according to their prefixes; and finally the old CRoutingBin is discarded, which completes the split. After the split, the system tries again to add the contact, this time descending into the subtree that matches its ID. However, not every node may be split, because if arbitrary splits were allowed, the amount of contact information stored locally would grow sharply. This is where the local ID comes in: a node is split only when the local ID shares the prefix of the node to be split, that is, when the local ID itself falls within that node's range.
If a node cannot be split and its CRoutingBin is full, the contact is rejected.
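The add-and-split procedure can be sketched as follows. This is a simplified model under stated assumptions, not CRoutingZone itself: IDs are shortened to 8 bits, the bin capacity is 2 so that splits happen quickly, contacts are bare integers, and all names are invented.

```python
ID_BITS = 8    # shortened IDs for readability; the text uses 128 bits
BIN_SIZE = 2   # tiny bin capacity (assumption) so that splits happen quickly

def bit(value: int, level: int) -> int:
    """Bit of `value` at tree depth `level` (0 = most significant bit)."""
    return (value >> (ID_BITS - 1 - level)) & 1

class Zone:
    def __init__(self, own_id, level=0, covers_own=True):
        self.own_id, self.level, self.covers_own = own_id, level, covers_own
        self.bin = []        # only leaves hold a bin; None once split
        self.children = []   # [subtree for bit 0, subtree for bit 1]

    def add(self, contact: int) -> bool:
        if self.bin is None:                           # inner node: descend
            return self.children[bit(contact, self.level)].add(contact)
        if contact in self.bin:
            return False
        if len(self.bin) < BIN_SIZE:
            self.bin.append(contact)
            return True
        if self.covers_own and self.level < ID_BITS:   # own ID has this prefix
            self._split()
            return self.add(contact)                   # retry after the split
        return False                                   # full and not splittable

    def _split(self) -> None:
        own_bit = bit(self.own_id, self.level)
        self.children = [Zone(self.own_id, self.level + 1, b == own_bit)
                         for b in (0, 1)]
        old, self.bin = self.bin, None                 # the old bin is discarded
        for c in old:
            self.children[bit(c, self.level)].bin.append(c)

root = Zone(own_id=0b10100000)
for c in (0b00000001, 0b00000010, 0b00000011,
          0b10000001, 0b11000000, 0b10100001):
    print(f"{c:08b}", root.add(c))
```

In this run the third contact is refused: it belongs to the "0" half, which is full and does not contain the local ID, so it may not split. The last contact is accepted because the "1" half does contain the local ID and splits again to make room.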
Under this policy, the closer a contact is to the local ID in logical distance (that is, the longer the common prefix), the more likely it is to be added, because the tree node it belongs to is more likely to have been split into child nodes and so has more capacity. As a result, of all the participants each Kademlia participant knows about, a higher proportion are logically close to itself. Because a search only has to keep finding IDs closer to the target, and every step makes progress, the time needed to find a target ID is O(log n). The binary tree structure also shows that only some nodes are split, so the storage cost is O(log n) as well.
In fact, CRoutingZone differs somewhat from the Kademlia theory. For example, there is a minimum split level counted from the root: a node at a shallow enough level is always allowed to split, so that the client knows a little more contact information about other regions.
Kademlia network message processing in eMule
CKademliaUDPListener processes all messages related to the Kademlia network. The basic eMule communication protocol was outlined earlier. The messages CKademliaUDPListener handles are exclusively Kademlia messages, because the sorting has already been done in eMule's normal client UDP processing code. The message format was also described earlier; the message types are described below.
The first is the health check, an ordinary ping-pong mechanism. The messages are KADEMLIA_HELLO_REQ and KADEMLIA_HELLO_RES. When the local contact list is checked, a KADEMLIA_HELLO_REQ is sent to the contacts, and the KADEMLIA_HELLO_RES messages that come back are processed.
The most commonly used message is the node search. In the Kademlia network, node search is the main traffic generated by everyday use, and it is implemented as an iterative search. When a search for an ID starts, the client picks the closest contacts from its local contact list and sends them a search request. The replies usually contain contacts that are closer still, so it sends search requests to those, and by repeating this it eventually obtains the contacts closest to the target ID. The message codes are KADEMLIA_REQ and KADEMLIA_RES.
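The iterative search can be simulated without any networking. In the sketch below, a dictionary lookup stands in for a KADEMLIA_REQ and KADEMLIA_RES round trip; the function name, the `alpha` parameter (how many contacts are asked per round), the stop rule, and the toy topology in which every 8-bit node knows the eight nodes differing from it in one bit are all assumptions made for illustration.

```python
def iterative_lookup(network, start_contacts, target, alpha=3):
    """Repeatedly ask the closest known contacts for contacts nearer `target`.

    `network` maps a node ID to the list of IDs that node knows; indexing it
    stands in for sending a request and reading the response.
    """
    known = set(start_contacts)
    asked = set()
    while True:
        candidates = sorted(known - asked, key=lambda n: n ^ target)[:alpha]
        if not candidates:
            break
        best_before = min(known, key=lambda n: n ^ target)
        for node in candidates:
            asked.add(node)
            # The queried node replies with the contacts it knows nearest the target.
            reply = sorted(network[node], key=lambda n: n ^ target)[:alpha]
            known.update(reply)
        if min(known, key=lambda n: n ^ target) == best_before:
            break   # a whole round brought nothing closer: stop
    return min(known, key=lambda n: n ^ target)

# Toy network of 256 nodes with 8-bit IDs.
network = {n: [n ^ (1 << i) for i in range(8)] for n in range(256)}
print(iterative_lookup(network, [0b11110000], 0b00001111))   # 15
```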
Next come publishing and searching for content, which is easier to follow together with the analysis of the CIndexed class. eMule stores three types of information in the Kademlia network: file sources, keyword information, and file comments. A file source belongs to one specific file, and each file uses the hash of its content as its unique identifier; a file source entry records the fact that someone owns that file. A keyword entry records the fact that a keyword corresponds to a file. Obviously, one keyword may correspond to several files, and one file may have more than one source. Since all of them are indexed by fixed hash algorithms, searching and publishing are simple.
Consider the publishing process first. Each eMule client knows the details of the files it shares. In the traditional setup with a central index server, it uploads the information about all its files to that server. In the Kademlia network, the information has to be spread out instead. The client first splits the file name into words, that is, it extracts keywords one by one from the file name. The splitting method is simple: it finds the delimiter characters in the file name, such as underscores, and cuts the name at those points. After computing the hash of each keyword, it publishes the keyword information to the contacts whose IDs are close to that hash, and it publishes the file source information to the contacts whose IDs are close to the file content hash. The corresponding messages are KADEMLIA_PUBLISH_REQ and KADEMLIA_PUBLISH_RES. In addition, eMule allows users to comment on a file. Comments are stored separately, but the principle is the same.
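The word-splitting step might be sketched like this. The text names only the underscore as a delimiter, so the full delimiter set, the lower-casing, the minimum keyword length, and the removal of duplicates are all assumptions of this example rather than documented eMule behavior.

```python
import re

# Characters treated as separators; the exact set eMule uses is not given
# in the text, so this list is an assumption.
DELIMITERS = r"[ _\-.,()\[\]]+"

def extract_keywords(file_name: str, min_length: int = 3) -> list:
    """Split a file name into lower-cased keywords, dropping very short pieces."""
    words = re.split(DELIMITERS, file_name.lower())
    seen, result = set(), []
    for w in words:
        if len(w) >= min_length and w not in seen:
            seen.add(w)
            result.append(w)
    return result

print(extract_keywords("Big_Buck_Bunny-2008.1080p.avi"))
# ['big', 'buck', 'bunny', '2008', '1080p', 'avi']
```

Each keyword returned here would then be hashed to a 128-bit ID and published to the contacts nearest that ID.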
To search for and download a file through the Kademlia network, a user first searches for a keyword. Because the same hash algorithm is used, the client only has to find the contacts whose IDs are close to the computed hash and then send them a request for that keyword directly. If results come back, the searcher learns which files the keyword corresponds to and lists their information. When the user decides to download one of them, a search for that specific file begins. If it succeeds, the results are the file source entries for that file. eMule then connects to the addresses given in those entries and negotiates the download with them using the traditional eMule protocol. The corresponding messages are KADEMLIA_SEARCH_REQ and KADEMLIA_SEARCH_RES.
The Kademlia2 protocol works on the same principle in practice; only the opcodes and the concrete message formats differ. For example, KADEMLIA_HELLO_REQ and KADEMLIA_HELLO_RES correspond to KADEMLIA2_HELLO_REQ and KADEMLIA2_HELLO_RES, but the latter carry more information in each message. In terms of implementation, 0.47c leans toward Kademlia2, while 0.47a leans toward Kademlia; both versions can of course handle both protocols. In addition, 0.47c adds a feature that tracks the requests that have been sent: a list of TrackPackets_Struct entries records when a request with a given opcode was sent to a given IP address. The purpose is to prevent routing pollution attacks against the DHT. When a client searches for contacts and receives contact information in a reply, it first tries to add that information to its local contact list. An attacker could therefore keep sending KADEMLIA_RES messages filled with false contact information to the eMule client it wants to attack until the victim's contact list is full of garbage. Lacking correct and valid contacts, the victim's Kademlia functions would then be essentially useless. With the feature added in 0.47c, a response for which no request was sent is simply ignored, so the client is not fooled.
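The request-tracking idea can be sketched as a small class. This is not the 0.47c code: the class and method names are invented, the 180-second validity window is an assumption, and `OP_REQ` is a placeholder value, not a real Kademlia opcode.

```python
import time
from typing import Optional

OP_REQ = 0x01   # placeholder opcode value for the example, not a real one

class RequestTracker:
    """Remember which (ip, opcode) requests were sent recently."""

    def __init__(self, max_age: float = 180.0):   # window length is an assumption
        self.max_age = max_age
        self.sent = []   # entries of (ip, opcode, time sent)

    def record_request(self, ip: str, opcode: int, now: Optional[float] = None) -> None:
        self.sent.append((ip, opcode, time.monotonic() if now is None else now))

    def is_expected(self, ip: str, request_opcode: int, now: Optional[float] = None) -> bool:
        """True if a matching request to `ip` was sent within `max_age` seconds."""
        now = time.monotonic() if now is None else now
        self.sent = [e for e in self.sent if now - e[2] <= self.max_age]   # drop stale
        return any(e[0] == ip and e[1] == request_opcode for e in self.sent)

tracker = RequestTracker()
tracker.record_request("203.0.113.5", OP_REQ)
print(tracker.is_expected("203.0.113.5", OP_REQ))    # True: we asked this peer
print(tracker.is_expected("198.51.100.9", OP_REQ))   # False: unsolicited, ignore
```

A listener would call `is_expected` before processing any response and drop the packet when it returns False, so contacts from unsolicited responses never reach the routing table.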
Kademlia distributed index management in eMule
The biggest benefit of the Kademlia network is that the information once stored on a central index server is distributed across the clients. More precisely, it is distributed to the CIndexed classes of the various eMule clients. Let us look at how CIndexed is designed and how it does its job. Before that, here is a closer look at the types of information eMule publishes to the Kademlia network.
A file source entry maps the hash of a file's content to the IP address, the various port numbers, and other details of a client that owns the file. A keyword entry records the relationship between a keyword and a file. In a keyword entry, the file information is more detailed and usually includes the file name, file size, and file content hash. For an MP3 or other media file it also contains the artist, production time, length (the playback length of the media file, measured in time), genre, and other tag information. The file content hash is what distinguishes the different files that correspond to the same keyword.
CIndexed stores this information in a series of maps. CMap is the MFC template class that plays the role of the STL map. CIndexed contains four such maps, which store the file source information, keyword information, file comment information, and load information. File comment information is not kept long term; the other kinds are written to a file on exit and reloaded the next time eMule starts. The load information, moreover, is not published by other contacts but is adjusted dynamically according to the distribution of file source and keyword information. The load of the corresponding ID increases each time a publish request is received, and it is reported back in the response message (KADEMLIA_PUBLISH_RES).
The information in CIndexed is checked regularly. Every thirty minutes, CIndexed clears out everything it stores that has become too old. File source information is kept for five hours, keyword information for twenty-four hours, and file comment information for twenty-four hours as well. File sources and keywords therefore have to be republished periodically. This is actually good for the stability of the whole Kademlia network, because every such contact is also an attempt to add the other party to one's own contact list, and it refreshes the time the contact was last seen.
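The expiry rules above can be written down compactly. The sketch uses the lifetimes and the thirty-minute interval given in the text; the flat list, the class name, and the method names are simplifications invented for the example and do not mirror the four maps in CIndexed.

```python
HOUR = 3600
LIFETIME = {"source": 5 * HOUR, "keyword": 24 * HOUR, "note": 24 * HOUR}
CLEAN_INTERVAL = 30 * 60   # clean up at most every thirty minutes

class Index:
    def __init__(self):
        self.entries = []        # (kind, key_id, value, stored_at)
        self.last_clean = 0.0

    def add(self, kind, key_id, value, now):
        self.entries.append((kind, key_id, value, now))

    def maybe_clean(self, now):
        """Drop expired entries, but at most once per CLEAN_INTERVAL.

        Returns the number of entries removed.
        """
        if now - self.last_clean < CLEAN_INTERVAL:
            return 0
        self.last_clean = now
        before = len(self.entries)
        self.entries = [e for e in self.entries if now - e[3] < LIFETIME[e[0]]]
        return before - len(self.entries)

idx = Index()
idx.add("source", 0x1234, "203.0.113.5:4662", now=0)
idx.add("keyword", 0x5678, "some_file.avi", now=0)
print(idx.maybe_clean(now=6 * HOUR))    # 1: the five-hour source entry expired
print(idx.maybe_clean(now=25 * HOUR))   # 1: the keyword entry expired as well
```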
CIndexed provides the interfaces for adding and looking up information that other code needs. When a search or publish request arrives from the network, CKademliaUDPListener decodes the message and then hands it to CIndexed for processing.
Kademlia search task management in eMule
CSearch and CSearchManager carry out the actual search tasks. A CSearch object corresponds to one search task and covers its whole life from start to finish. Note that a search task is not only a search for file sources or keywords: a publish task also requires creating a CSearch object and running it. CSearchManager keeps track of all search tasks. It contains a CMap holding pointers to all the CSearch objects. A CMap is used because every CSearch task corresponds to one ID, namely the target of that search. Whether the goal is to find a node, to search for information, or to publish it, the client must first find contacts with IDs close to the target. CSearchManager can therefore represent all search tasks with a CMap.
Note that a CSearch adds itself to CSearchManager when it is created. A CSearch must also state its type at creation: it may be a plain node search, a keyword search, or a file source search, and it may also be a task that publishes file source or keyword information. A look at what several CSearch methods do gives a rough picture of the process. Go is the start-up step: it picks the first candidate contacts from the local contact list and begins the search. SendFindValue sends a contact a request for contact information near a specific ID. JumpStart runs once the search has made some progress: given the intermediate results, it starts the next action, which may be another SendFindValue, or, if the contacts found are judged close enough to the target, the start of the substantive request. StorePacket is such a substantive request. For example, in a CSearch whose task is to publish a file source, StorePacket sends KADEMLIA2_PUBLISH_SOURCE_REQ to the target contact (or KADEMLIA_PUBLISH_REQ if the contact does not support Kademlia2). Finally, CSearch processes the various search results and returns them to the code that started it.
CSearchManager is the part that deals directly with the rest of the Kademlia code. For example, when CKademliaUDPListener receives some results, it passes them to CSearchManager, which finds the matching search task and forwards the results to it. CSearchManager also provides the interface for creating the various kinds of search task, much like the factory in design patterns: other code only has to say what kind of search task it needs, and CSearchManager creates the CSearch.
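The relationship between the two classes can be sketched as a factory plus a map keyed by target ID. All names below are invented stand-ins, the search types only loosely follow the kinds listed in the text, and allowing a single active task per target ID is an assumption of the example.

```python
from enum import Enum

class SearchType(Enum):
    NODE = 1
    FIND_KEYWORD = 2
    FIND_SOURCE = 3
    STORE_KEYWORD = 4
    STORE_SOURCE = 5

class Search:
    """One search or publish task aimed at a single target ID."""

    def __init__(self, search_type, target):
        self.type, self.target, self.results = search_type, target, []

    def process_result(self, result):
        self.results.append(result)

class SearchManager:
    def __init__(self):
        self.searches = {}   # target ID -> Search (the CMap in the text)

    def start_search(self, search_type, target):
        """Factory: create a task and register it under its target ID."""
        if target in self.searches:
            return None      # assumption: one active task per target ID
        search = self.searches[target] = Search(search_type, target)
        return search

    def process_result(self, target, result):
        """Called by the listener: route a result to the task that asked for it."""
        search = self.searches.get(target)
        if search is None:
            return False     # no matching task: ignore the result
        search.process_result(result)
        return True

manager = SearchManager()
task = manager.start_search(SearchType.FIND_KEYWORD, 0xABC)
manager.process_result(0xABC, "some_file.avi")
print(task.results)   # ['some_file.avi']
```

Keying the map by target ID is what lets the listener hand over a result with nothing more than the ID carried in the response.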