Note: this article contains no source-code analysis. I think understanding the essence is more useful than reading the source, because once you grasp the essence you may never need to read the code again; you could even write it yourself. This is why the Linux kernel and Cisco websites carry such large amounts of documentation.

Introduction: routing is a core concept of the Internet. In the broad sense, it is what lets the nodes of a packet-switched network remain independent of each other, coupled only by routes; even in circuit-switched networks, virtual circuits rely on routing. A route describes a data path through the network. In the narrow sense, a route is an IP route, and IP routes support the entire IP network.

Because IP is a datagram protocol and establishes no connections, IP packets are forwarded hop by hop, with the path stitched together from per-hop routing information; routing is therefore directly tied to the connectivity of the entire IP network. And because the IP protocol has no direction, not even the concept of a session, routing must be bidirectional; otherwise data will be lost. (Some people advocate NAT to solve the reverse-route problem; in fact NAT has a poor reputation in the public core network and even breaks the principles of the IP protocol. Remember: NAT is generally used only at endpoints.) The Internet being as large as it is, every router holds a huge amount of routing information. How can a router retrieve what it needs, as fast as possible, from that mass of routes? And how is such a massive amount of routing information generated in the first place? This article focuses on the first question. For the second, see Internet Routing Architectures, 2nd Edition (Cisco Press); if you want it, buy it while you can, because Cisco titles like this sell out quickly and can be hard to find.

1. Basic Concepts

The concept of a route: a route is a kind of signpost. Because a packet moves through the network one hop at a time, each hop must point the way. In fact it is not only packet-switched networks that need routing; circuit-switched networks need it too when setting up virtual circuits. Routing is everywhere in daily life. Simply put, a route consists of three elements: the destination address, the mask, and the next hop. Note that there is no output interface in a route entry: that is a link-layer concept. Linux conflates the routing table with the forwarding table; they really ought to be separated (one advantage of separating them is that MPLS becomes easier to implement).
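As a minimal sketch, the three elements might be modeled like this (the field names are illustrative, not the kernel's actual fib structures):

```c
#include <stdint.h>

/* A minimal sketch of the three elements of a route entry.
 * Hypothetical layout for illustration only. */
struct route_entry {
    uint32_t dest;     /* destination network, host byte order  */
    uint32_t mask;     /* netmask, e.g. 0xFFFFFF00 for a /24    */
    uint32_t nexthop;  /* next-hop address                      */
};

/* does dst fall inside this entry's destination network? */
static int route_matches(const struct route_entry *r, uint32_t dst) {
    return (dst & r->mask) == r->dest;
}
```

The match test is the heart of every lookup scheme discussed below: mask the destination, compare against the stored network.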

Route entries reach the kernel in two ways: added by a user-space routing-protocol daemon or by static configuration, or discovered automatically by the host. An automatically discovered route is really a "connected" route/forwarding entry that takes effect when a host NIC comes up: when eth0 starts, the system generates an entry saying that packets destined for eth0's own IP subnet are sent out through eth0.

Routing table: a routing table is a list of entries, each containing the three elements above.

Layers of the routing framework: routing splits into two levels. The first is the generation of route entries; the second is the host's lookup of route entries.

Route entry generation: there are two ways to generate route entries. One is manual configuration by an administrator; the other is dynamic generation by a routing protocol.

Route lookup algorithms: this article focuses on the host-side lookup of route entries; after all, that is a purely technical job... The implementation and configuration of routing protocols, by contrast, is more about people. If you think RIP or OSPF only takes a few commands to configure, try configuring BGP: it is mostly about policy, not pure technique. If I have time I will write a separate article on routing protocols; today I only discuss how routers/hosts look up route entries.

This process matters a great deal. If the efficiency of the router's lookup algorithm improves, end-to-end latency clearly drops; that much is certain.

2. Linux hash Search Algorithm

This is the classic route lookup algorithm in Linux, and it is still the default. It is very simple, and because of that simplicity the kernel development team has long favored it. Despite its limitations, the Linux kernel philosophy is "good enough": since Linux is almost never used as a professional core-network routing system, hash lookup has remained the default choice.

2.1. Lookup Process

The lookup structure is shown in the figure:

The lookup order is shown in the figure:

To achieve the longest-prefix match, the lookup starts from the longest mask. Each mask length has its own hash table; the destination IP address is hashed into a bucket of these tables in turn, and the collision chain is traversed to obtain the final result.

Note that the hash algorithm achieves longest-prefix match strictly by iterating over mask lengths. That is, if a datagram ends up going through the default gateway, at least 32 probes are needed before the result is found. This is very similar to the traditional netfilter filter table, which tries rules one by one, unlike the hipac approach, which is based on search structures. We will see below that high-performance routers use search-structure-based lookup, most commonly a lookup tree.
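The per-mask probing described above can be sketched as follows: one small hash table per prefix length, probed from /32 down to /0. This is a simplification of the idea only; the kernel's actual fib_hash zones and hash function differ.

```c
#include <stdint.h>
#include <stdlib.h>

#define BUCKETS 16

struct fib_node {
    uint32_t prefix;            /* masked destination network */
    uint32_t nexthop;           /* illustrative next-hop id   */
    struct fib_node *next;      /* collision chain            */
};

/* one small hash table per prefix length 0..32 */
static struct fib_node *tables[33][BUCKETS];

static uint32_t plen_mask(int plen) {
    return plen ? 0xFFFFFFFFu << (32 - plen) : 0;
}

static unsigned hash32(uint32_t key) {
    return (key ^ (key >> 7) ^ (key >> 17)) % BUCKETS;
}

void fib_insert(uint32_t prefix, int plen, uint32_t nexthop) {
    struct fib_node *n = malloc(sizeof *n);
    n->prefix = prefix & plen_mask(plen);
    n->nexthop = nexthop;
    unsigned b = hash32(n->prefix);
    n->next = tables[plen][b];
    tables[plen][b] = n;
}

/* longest-prefix match: probe the /32 table first, then /31, ... /0 */
int fib_lookup(uint32_t dst, uint32_t *nexthop) {
    for (int plen = 32; plen >= 0; plen--) {
        uint32_t key = dst & plen_mask(plen);
        for (struct fib_node *n = tables[plen][hash32(key)]; n; n = n->next)
            if (n->prefix == key) { *nexthop = n->nexthop; return plen; }
    }
    return -1;
}
```

Note how a packet routed by the default gateway walks all 33 tables: this is exactly the cost the text describes.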

2.2. Limitations

We know that scalability has always been the weak point of hashing. A given hash function only suits a certain number of entries; it is almost impossible to find one universal hash function that adapts from a handful of entries to tens of millions of them. In general, as the number of entries grows, collisions grow with it, and the time complexity becomes uncontrollable. That is a big problem. It keeps hash-based route lookup out of dedicated core routers and limits the routing scale Linux can handle: hashing cannot cope with the flood of routes produced by large internetworks or by inter-domain protocols such as BGP.

On a core router, the hash algorithm is clearly inappropriate. What is needed is an algorithm whose lookup time is bounded within a known range (space complexity matters less: it has nothing to do with the end-to-end user experience, only with how much money is spent; a router costing 100,000 might carry 4 GB of memory, one costing 1,000,000 might carry 64 GB...). We know tree-based lookup algorithms can provide that bound, and in fact many routers use them. We start with the Linux trie, since its code is easy to consult (even though this article does not analyze code...).

3. Linux LC-trie tree search algorithm

The trie algorithm has three parts: lookup, insert/delete, and balancing. First, whatever the name suggests, you do not need a deep theoretical grasp of tries. Although many textbooks leave insertion for last, I want to discuss insertion first, because once insertion is understood, lookup is self-evident. After insertion I discuss balancing and the multi-way aspect of the trie; done in this order, efficient lookup falls out as an almost inevitable result.

3.1. Basic Theory

Sorry, there is no theory here; everything is simple. We can understand the trie through telephone numbers. A trie is essentially a retrieval tree. Take the global phone book: a phone number has three parts, country code + area code + subscriber number, for example 086 + 372 + 5912345. If you dial this number from the United States, the network must first decide which country the call goes to: the exit switch matches the fixed-length country code against its forwarding table, finds that 086 is China, and forwards the call there. Once in China the area code is matched, the call is routed to Anyang, and finally it reaches subscriber 5912345.

Now the question: how can each hop retrieve the next destination as fast as possible? The best way, I think, is a "bucket" scheme, that is, direct indexing. For example, the exit switch for telephone calls in the United States holds a table with X entries, X being the number of countries and regions in the world. China's country code is 086, so its information sits in entry 86: the switch indexes entry 86 directly, obtains the corresponding exchange information, and the call is forwarded to China over the link that the entry names...
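The "bucket" idea is nothing more than direct array indexing, which makes the lookup a single O(1) memory access. A sketch (the table and link names here are made up for illustration):

```c
#include <string.h>

/* Hypothetical direct-index "bucket" table: one slot per country code.
 * Lookup is a single array index; no search at all. */
static const char *next_link[1000];

static void learn(int code, const char *link) { next_link[code] = link; }

static const char *route_call(int code) { return next_link[code]; }
```

Unlearned codes simply return a null pointer, the analogue of "no route".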

Another example is the computer page table, which we will talk about in section 3.3.

The trie is similar to the structure above, except that there the matched segments are fixed (a country code is three decimal digits) and so is the position of each index (the area code starts at the 4th digit, say). In a trie, the position to examine is not fixed; it is given by pos. Nor is the index length fixed; it is given by bits. Each decision point is a checknode, with the following structure:

```c
struct checknode {
    int pos;                /* first key bit this node examines   */
    int bits;               /* number of key bits it examines     */
    union node *children;   /* array of 1 << bits child slots     */
};

union node {
    struct leaf entry;      /* a route entry (leaf)               */
    struct checknode node;  /* an internal decision point         */
};
```

The diagram is as follows:

As you can see, pos and bits are the core of a checknode. pos says where probing starts; bits sizes the child array. Extracting key[pos .. pos+bits] directly yields the index of the child node.
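The extraction of key[pos .. pos+bits] is just two shifts, treating bit 0 as the most significant bit of the key (bits is assumed to be at least 1 here):

```c
#include <stdint.h>

/* Extract `bits` bits of `key` starting at bit `pos` (bit 0 = MSB),
 * mirroring how a checknode indexes its child array.
 * Assumes 1 <= bits <= 32 and pos + bits <= 32. */
static inline uint32_t extract(uint32_t key, int pos, int bits) {
    return (key << pos) >> (32 - bits);
}
```

The left shift discards the bits already consumed by ancestors; the right shift keeps exactly the `bits` bits this node indexes on.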

3.2. Trie Insertion

I have always thought that when studying a tree structure it is best to understand its insertion algorithm first. Many textbooks start with lookup and only mention insertion in passing, which seems wrong to me. Understand insertion deeply and the subsequent lookup and deletion become very simple; after all, insertion comes first! Important as insertion is, do not imagine it is difficult: understanding what someone else came up with is never the hard part; the hard part is coming up with it yourself! So, how does insertion proceed?

**Step 1: if no checknode exists yet, create the root checknode and create a leaf. Note that every route entry is a leaf. If the root checknode already exists, the insertion position of the new node must be computed.**

**Step 2: find the deepest matching point before the insertion position.** The procedure is as follows:

Walk from the root, performing a series of comparisons based on the pos/bits information of the existing checknodes:

1). Fetch the root checknode.

2). Record the current checknode as prechecknode.

3). Decide whether matching should continue.

4). If it should, compute which child (or child branch) the key falls into, fetch that child checknode as the current checknode, and return to 2.

5). If it should not, exit the matching loop.

The test that decides whether a checknode should continue matching into its child-checknode is as follows:

If newkey differs from the checknode anywhere inside the region marked by the blue dotted line, there is no point continuing into child-checknode: newkey is inserted here, with the current node as its prechecknode. If matching should continue, the child is determined as follows:

**Step 3: determine the insertion point and insert.** The steps are as follows:

0). If step 2 never descended into a child-checknode, newkey is inserted directly as a leaf at index newkey[prechecknode.pos .. prechecknode.pos + prechecknode.bits] of prechecknode's child array. Otherwise, the conflict with child-checknode is resolved by the following steps.

1). Create a new checknode, as shown:

Suppose the bit circled in green is the first place where child-checknode and newkey disagree; call that position miss. A new checknode, newnode, is created with pos = miss and bits = 1. The original child-checknode becomes one child of newnode, a new leaf built from newkey becomes the other, and newnode replaces child-checknode in prechecknode's child array.

**Step 4: complete**

The process above should be clear enough, but an example helps. Let us insert three route entries in order:

1: 192.168.10.0/24

2: 192.168.20.0/24

3: 2.232.20.0/24

Now the figures. First, the bit diagram of the three prefixes:

Then, how the trie takes shape as they are inserted:
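As a sketch of the mechanics, here is a deliberately simplified binary trie over prefixes like the ones above: one bit per node and no path compression, unlike the real Linux LC-trie, which widens and compresses nodes dynamically. The lookup simply remembers the last route seen on the way down, which yields the longest-prefix match without explicit backtracking in this uncompressed form.

```c
#include <stdint.h>
#include <stdlib.h>

/* Simplified binary trie: every level consumes exactly one key bit. */
struct tnode {
    struct tnode *child[2];
    int has_route;          /* 1 if a prefix ends at this node */
    uint32_t nexthop;       /* illustrative next-hop id        */
};

static struct tnode *tnode_new(void) { return calloc(1, sizeof(struct tnode)); }

static int bit_at(uint32_t key, int i) { return (key >> (31 - i)) & 1; }

void trie_insert(struct tnode *root, uint32_t prefix, int plen, uint32_t nh) {
    struct tnode *n = root;
    for (int i = 0; i < plen; i++) {       /* walk/create one bit at a time */
        int b = bit_at(prefix, i);
        if (!n->child[b]) n->child[b] = tnode_new();
        n = n->child[b];
    }
    n->has_route = 1;
    n->nexthop = nh;
}

/* Longest-prefix match: remember the last node carrying a route. */
int trie_lookup(struct tnode *root, uint32_t dst, uint32_t *nh) {
    struct tnode *n = root;
    int found = 0;
    if (n->has_route) { *nh = n->nexthop; found = 1; }   /* default route */
    for (int i = 0; i < 32 && n; i++) {
        n = n->child[bit_at(dst, i)];
        if (n && n->has_route) { *nh = n->nexthop; found = 1; }
    }
    return found;
}
```

Because nothing is compressed here, the tree is 32 levels tall in the worst case; that height is exactly what the balancing and path compression of the following sections attack.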

3.3. Trie Balancing and the Multi-Way Trie

Looking back at section 3.2, we find that the trie so far is just a binary tree; what is remarkable about that? But the trie as a routing-table structure is far more than that. If it is hard to picture a trie serving as a routing table, consider the page table first. It is the key to virtual memory, so processor designers certainly chose a very efficient way to go from virtual address to physical address. The page table uses segmented indexing to locate entries quickly: a virtual address is divided into N segments, each segment locating one index, and chaining these index levels together leads to the final page-table entry. No illustration is given here; material on page tables is plentiful.

Viewed from the page directory down, the page table is a tree with a very wide fan-out, 4096 branches, yet very short, only two to four levels. That is exactly why it is so efficient: each small index quickly selects a branch of the tree, and the leaf is reached in a few steps.

But what does shortness cost? Lower time complexity generally means higher space complexity: too much memory is consumed. So the best solution is a tree that is neither too tall nor too short, and the multi-way trie is designed exactly for that. In the extreme, a multi-way trie can degenerate into a linked list, or flatten into a two-level, 2^32-way tree:

Linked-list case: bits = 0

Flat-tree case: bits = 32

**What the dynamic multi-way trie must maintain is precisely that the tree stays away from these extremes.**
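The trade-off between the two extremes is easy to quantify: with a fixed `bits` per level over a 32-bit key, fan-out grows exponentially while height shrinks only linearly. Illustrative helper functions, not kernel code:

```c
/* Height/fan-out trade-off for a fixed `bits` per level over a 32-bit key:
 * wider nodes mean a shorter tree but exponentially more child slots. */
int levels(int bits)  { return (32 + bits - 1) / bits; }  /* ceil(32/bits) */
long fanout(int bits) { return 1L << bits; }
```

bits = 1 gives 32 levels of 2-way nodes (the degenerate binary case above); bits = 8 gives 4 levels of 256-way nodes, which is exactly the shape of the Cisco structure in section 5; bits = 32 gives the one-level, 2^32-slot extreme.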

First, look at insertion into an ordinary multi-way trie. Note that "inserting into a multi-way trie" is a fiction here: in the Linux implementation it is only the balancing operation that makes the trie multi-way, so the example given here would never arise in Linux; only a balanced trie ends up looking like this. In other words, in this example each checknode's bits is fixed in advance, whereas in the Linux implementation it is adjusted dynamically. The essence of the multi-way trie lies in its "multi-way-ness", and that lives in the checknode's bits field. Continuing the earlier example: suppose one more route entry, and hence one more node, arrives. First the bit diagram:

Then the multi-way trie itself:

This is a multi-way trie.

The balancing operation itself is very simple: every time a new node is inserted, the tree is rebalanced. The principle is as follows:

1). If the tree is too tall, flatten it.

Adjust the checknode's pos and increase its bits by 1, doubling the capacity of the child array; then re-add the children into the new checknode, recursively rebalancing as they are added.

2). If the tree is too fat, make it taller.

Keep the checknode's pos, decrease its bits by 1, halving the capacity of the child array; then re-add the children into the new checknode, recursively rebalancing along the way.

In short, the trie implemented by Linux changes shape dynamically. The advantage is that the trie can adapt to the current system load and memory conditions, improving overall resource utilization. The drawbacks: the algorithm itself is complex, does not extend easily, and, most importantly, is ill-suited to hardware implementation.

3.4. Trie Lookup

Finally, lookup. With the insert and balance operations understood, lookup becomes very simple. Not only is it simple (good algorithms usually are), it is also efficient thanks to the balancing; the only new ingredient is backtracking. This section covers plain backtracking; the next covers its optimization.

Lookup is really so simple that I will not even write out an algorithm flow; a happy occasion like this deserves a drink instead... Let's just take an example. Suppose a packet's destination address is 192.168.10.23. Write it in binary and watch how it is found:

From the trie root we have pos = 0 / bits = 3, so we take the root checknode's child 7 and arrive at checknode2. Likewise we examine the two bits starting at bit 19 of the IP address and reach leaf 1. Since the mask is 24 bits, the lookup succeeds. Before tracing the lookup on the tree, here is the bit diagram with the default gateway added:

Then the trie tree is given:

The entire trie search process is marked by the red line:

Now, backtracking. First, why is it needed? Unlike the page table, the trie does not cover all 32 bits of the address: there are gaps between the bits that successive checknodes examine:

The region inside the blue dotted line is such a gap (see path compression below). If a mismatch falls inside a gap during lookup, the descent cannot detect it, and the search appears to run into a dead end. **Note that the first, descending pass is an exact match; as soon as it fails, the strategy switches immediately from exact match to longest-prefix match. Since nodes closer to the leaves have longer, more precise prefixes (think of the subnet mask), this second pass walks from the leaf back toward the root looking for the longest prefix that matches. That is backtracking.** For example:

1). 111100 and 111110 do not match.

2). But 111110 matches 111000, 110000, 100000, and 000000.

3). Take the longest of these matches: 111000.

For example, with destination 192.169.20.32, the differing bit 16 is skipped during the descent and the walk still ends at leaf 4; the final full check then fails, and longest-prefix matching, i.e. backtracking, begins. Where to backtrack first? checknode3, of course. And what next? Before the next step, look at the backtracking principle: in longest-prefix matching, zeros are what matter. A candidate matches as long as it agrees with the key everywhere except in its trailing zeros; what we need is the longest such match. Which one is that? An algorithm, also used in the Linux kernel, gives the answer:
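That matching rule (everything must agree except where the stored prefix has trailing don't-care zeros) can be sketched as a masked comparison, and finding the "longest" match then amounts to trying prefix lengths from longest to shortest. A sketch of the idea, not the kernel's exact code:

```c
#include <stdint.h>

/* A prefix of length plen matches dst iff they agree on the top plen bits;
 * everything below the prefix is "don't care". */
static int prefix_matches(uint32_t prefix, int plen, uint32_t dst) {
    if (plen == 0) return 1;                  /* default route matches all */
    uint32_t mask = 0xFFFFFFFFu << (32 - plen);
    return ((prefix ^ dst) & mask) == 0;
}

/* Try prefix lengths from longest down; the first agreement is the
 * longest-prefix match, which is what backtracking converges on. */
static int longest_match(uint32_t stored, uint32_t dst, int max_plen) {
    for (int plen = max_plen; plen >= 0; plen--)
        if (prefix_matches(stored, plen, dst)) return plen;
    return -1;  /* unreachable: plen 0 always matches */
}
```

In the real trie, the descent and the per-leaf mask list prune this linear scan drastically; the loop above only makes the underlying rule concrete.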

With that, let's look at the entire process:

Finally, it is worth noting that every checknode and leaf carries a prefix list. For example:

192.168.10.0/24 via 1.2.3.4

192.168.10.0/27 via 4.3.2.1

The two entries share one leaf, but the leaf carries two masks, chained into a list. On a match, the masks on the list are tried in sequence. Two principles guarantee that the final match has the longest prefix: **first, the descent from root to leaf is an exact match; second, each leaf's mask list is sorted so that longer masks are tried first.**

3.5. Backtracking Optimization

Backtracking is very inefficient; in the example above there are two rounds of it. If the unmatched bit could be detected in advance, all that effort could be spared, and in fact that is easy. While taking the next child, check: **does the key differ from the child anywhere in the skipped region between the current checknode's pos + bits and the child's pos? If it does, check whether the child's bits at the differing positions are all 0. If so, consult the child's mask list directly; otherwise, backtrack immediately and avoid the useless descent.** This "ignored mismatch" situation looks like this:

Once such a mismatch is detected, matching switches immediately to longest-prefix mode, and the mask shrinks from 32 bits (exact match) to the position of the first unmatched bit in key[pos + bits] of the current checknode's child:

At the bit where the search key and the entry differ, the search may continue only if the entry has 0 there (and the key 1); otherwise, backtrack! After continuing, matching proceeds as usual; the only difference is the mask: exact match uses a 32-bit mask, while longest-prefix match uses an n-bit one.

3.6. The Essence of the Dynamic Multi-Way Trie: Path Compression

Because a multi-way trie descends quickly from root to a leaf, verifies the match, and backtracks only on failure, the routing trie must be able to route from root to leaf quickly and unambiguously; a tall tree defeats that. Hence there is no need to test every bit of the search key. The pos and bits of each checknode determine which bits are tested, and bits that do not distinguish any of the route entries inserted so far can simply be skipped. That is path compression. See:

The key bits circled in blue need not be tested during exact match; they can wait until longest-prefix matching. The benefit of path compression is fewer comparisons per match. However, as more route entries are inserted, many nodes come to satisfy the equation:

**node.pos = pnode.pos + pnode.bits** (where pnode is node's parent)

If a checknode has too many such children, every match in that branch takes a long walk. To make matching "shorter", it is time for a balancing operation: flatten the tall parts and raise the flat ones.

4. The BSD/Cisco Radix Lookup Algorithm

4.1. Basic Theory

The name alone causes plenty of confusion: radix tree? trie? binary tree? ... Stop!

4.2. Radix Lookup

Having worked through the full complexity of the multi-way trie, is this still difficult? The only real difference is that the BSD tree is more fixed in shape than the Linux one, which makes it easier to implement in parallel hardware. For example, if an IP address is cut into four equal 8-bit pieces, the four indexes are easy to process in parallel; even without parallelism, a hardware crossbar makes the walk quite fast. As you can see, this is very like a page-table walk, except that a failed page-table walk raises a page fault, while a failed route lookup backtracks. And the same old story: like the trie, the basic algorithm relies on each checknode carrying a mask list...

5. The BSD/Cisco 256-Way Tree Lookup Algorithm

5.1. Basic Theory and Lookup

Trading space for time is neither crazy nor illegitimate, because time matters more than space and people are more sensitive to it; broadly speaking, space can be extended without limit, while time has a hard threshold. Parallelism is another direct payoff of trading space for time, and parallelism is itself a temporal concept.

In the optimized Linux trie, a failed match still backtracks, and backtracking involves trial steps: turning 1s into 0s from right to left and retrying the prefix match. That works, but if the next node to try could be pointed out directly, the trial-and-error would disappear. That is exactly what Cisco implements. The legendary 256-way tree uses four fixed 8-bit segments to locate the index, just like a page-table walk; if the child selected by an index is empty or fails to match, the lookup jumps straight to the "next node" recorded in the node structure and continues matching there. For the bit structure, see:

Notice that the 256-way tree has no gaps: every bit participates in index selection, so nothing is skipped. Moreover, at insertion time, where matching should resume after a miss is computed in advance: every empty slot stores a pointer to the next node that might match, and every non-empty node stores one as well (though that one is almost never needed). No dynamic computation is required when backtracking; once the "next possible node" is fetched, the remaining all-zero children can be taken in a single run. That is the "prefix match". The 256-way tree finds it in one step, greatly improving efficiency. The lookup tree looks like this:

The lookup itself is very simple. Call the first eight bits P, the second eight Q, the third L, and the fourth N; then the index at each level of the tree (from level 2 down) is P, Q, L, N, so the final node is located easily. If that node turns out to be an empty slot, there is no exact match, and backtracking begins along the precomputed path, shown as the red line.
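A toy version of that four-level, byte-indexed walk follows. This is a sketch only: it handles just /0, /8, /16, /24 and /32 prefixes, and on a miss it falls back to the last route seen on the way down instead of storing precomputed backtracking pointers in empty slots as the real structure does.

```c
#include <stdint.h>
#include <stdlib.h>

/* Toy 4-level, 256-way lookup table in the spirit of a 256-way mtrie:
 * each level is indexed by one byte of the destination address. */
struct mnode {
    struct mnode *child[256];
    int has_route;
    uint32_t nexthop;       /* illustrative next-hop id */
};

static struct mnode *mnode_new(void) { return calloc(1, sizeof(struct mnode)); }

/* Simplification: plen must be a multiple of 8. */
void mtrie_insert(struct mnode *root, uint32_t prefix, int plen, uint32_t nh) {
    struct mnode *n = root;
    for (int level = 0; level < plen / 8; level++) {
        int idx = (prefix >> (24 - 8 * level)) & 0xFF;
        if (!n->child[idx]) n->child[idx] = mnode_new();
        n = n->child[idx];
    }
    n->has_route = 1;
    n->nexthop = nh;
}

/* Four fixed index steps (P, Q, L, N), remembering the last route seen. */
int mtrie_lookup(struct mnode *root, uint32_t dst, uint32_t *nh) {
    struct mnode *n = root;
    int found = 0;
    if (n->has_route) { *nh = n->nexthop; found = 1; }   /* default route */
    for (int level = 0; level < 4 && n; level++) {
        n = n->child[(dst >> (24 - 8 * level)) & 0xFF];
        if (n && n->has_route) { *nh = n->nexthop; found = 1; }
    }
    return found;
}
```

The point to notice is the bounded cost: at most four index operations per lookup, regardless of the number of routes, which is why this shape maps so naturally onto hardware.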

So a lookup completes within a bounded number of steps; the tree is very short and time efficiency very high. Because every path is determined at insertion time, insertion is the complicated part. But even so, it amounts to no more than computing, at insert time, the same backtracking path that the multi-way trie would compute during lookup, and storing it in the 256-way tree's node entries to be used directly, and efficiently, during route lookup!

5.2. Evaluation

The 256-way tree's lookup structure is a generic routing-table structure. **In fact, the CEF implementation in Cisco routers is an optimization of this 256-way tree: the data structure CEF uses is a 256-way mtrie, likewise four levels deep**, and otherwise no different from the above, except that there are no empty nodes and no static red-line backtracking paths; instead, the information the red line would eventually reach is written directly into what would have been the empty slot. It looks like this:

So CEF really does use a multi-way trie, but one whose shape maps naturally onto hardware, allowing the forwarding table to be built in hardware, whereas the Linux trie is dynamic and software-only.

6. Overall Evaluation

The overall evaluation does not cover the hash algorithm: hash functions scale poorly, and I am not fond of them. Hashing is used all over the Linux kernel, but those hashes bound the scale of what Linux can support, and finding a good hash function is very hard. It is robbing Peter to pay Paul: if your west wall is collapsing right now and you do not care about the east wall, by all means tear down the east wall to shore up the west!

Tree algorithms are a good choice, with strong determinism. The simpler the tree, the higher the efficiency. Why? Because simple trees are easy to implement in hardware, and professional-grade hardware is far more efficient than software running on a general-purpose CPU. **Whether to design an intricate, highly tuned pure-software algorithm or to put a simple, less clever one into hardware is a real question.** It is fairly safe to say that a pure-hardware traversal beats a pure-software hash: hardware is driven by signals and current, software by CPU instructions and clock cycles...

This article has described two trees used for route lookup. The first is the binary trie, for example (image from Google):

The second is the 256-way tree, for example (image from Google):

A third, the dynamic multi-way trie, sits in between: between a binary tree degenerated into a linked list and a two-level, 2^32-way tree.