What kind of computation in a general-purpose computer running a modern operating system looks, and must be, extremely fast? Without a doubt, the answer is memory access.
And what kind of computation in that same computer looks, and in theory is, extremely slow? Without a doubt, the answer is routed addressing.
The early holiday is really boring!
Why do you think memory addressing is fast? Why do you think it must be fast?
Because modern operating systems address memory through virtual addresses, every access needs a mapping from the virtual address to a physical address, which in principle slows memory access down. But the CPU gives us a gift, a cache for address translations: the TLB!
A wise man once said that the core of modern computing technology is addressing. That is true. The processing performance of the CPU core itself is no longer the bottleneck, because the CPU constantly has to communicate with peripherals over buses laid out beyond the core; the long signal paths and electromagnetic effects slow down overall progress. The job of the CPU caches is precisely to avoid this remote addressing as much as possible, and they include the L1/L2/L3 caches and the TLB. The TLB caches the results of address translation. If the TLB misses, you must take the slowest path of all: walk the page tables, map the virtual address to a physical one, and drive that physical address onto the address bus.
Let's broaden our horizons a little.
Today the entire IP network is one huge computer, and you can think of an IP address as a memory address. Compare a 32-bit memory address with an IPv4 address and you may already see where this is going. On today's Internet, CDNs already play the role of the CPU's Lx caches, avoiding long-distance addressing as much as possible, but the performance of IP route lookup itself has never been optimized in any unified way.

The reason there is no unified solution is that the networking field is the opposite of the computing field: transmission capacity keeps growing, and its growth has long outpaced that of the CPU. Network transmission technology has exploded. 10 Gbit Ethernet is now effortless, and those cables can be buried in soil or wound around beams, while the neatly arranged boards in a tall multi-layer switching chassis cannot even match a 10 Gbit/s bus. At this point, if you do pure software route lookup on a CPU, it drags the middlebox below line rate: the CPU, the proudest component on the board, has become the bottleneck of the high-speed network.
Hardware forwarding that bypasses the CPU
For more than a decade, network technology has developed faster than the CPU, which is constrained by the integration of on-chip caches, bus technology, concurrency, and locking. Unlike 10-Gigabit and 100-Gigabit Ethernet, which can evolve as standalone technologies, a general-purpose computer must balance all of its resources against one another: CPU design, the memory bus, electromagnetic compatibility, and so on. The basic approach of a professional router is therefore to throw the CPU out of the data path and forward directly in a dedicated chip. Insert such a card into a general-purpose computer and it becomes a professional router. The data path is completed entirely on the card, bypassing the CPU, as if a small computer were built into the computer; the general-purpose host provides only management and control functions. Cisco's Express Forwarding plays in this style.
What is long divided must unite; what is long united must divide
As on-chip integration and on-board bus technology improved, the caches of general-purpose CPUs grew greatly in size, efficiency, and usability, and memory access got faster. At that point, the CPU's opportunity came to unify the various hardware forwarding cards back into itself. Of course, not all hardware forwarding will be replaced by CPU forwarding; if you chase the very highest speeds, dedicated hardware still beats the CPU. But at the very least, high-speed CPU forwarding can eliminate and absorb most of the hardware forwarding technology on the market.
Emulating virtual-address mapping: the classical algorithms
Anyone proficient in Linux networking knows that the two existing route-lookup algorithms are hash lookup and the trie tree, both of which involve complex, fragmented data structures. As pure software designs they are well crafted, and anyone who knows the BSD stack well knows that BSD's radix tree also behaves respectably for software route lookup. But think about it: these algorithms were all born while hardware route forwarding had already been booming for decades, so they are more like crops spontaneously sprouting on private plots in the shadow of that dominant power, not the product of the main trend. They are, in essence, general-purpose route-lookup algorithms: they make no targeted use of hardware structures such as the CPU cache, and they do not know what platform they will run on; all of that is hidden behind the OS interface, and every architecture-specific optimization exists only as a precompiled macro or a patch.
DXR: an algorithm that mimics virtual-address mapping
Treat the destination IPv4 address as a virtual address in some address space, and the next hop toward that destination as the physical page the virtual address maps to. Can we then build the routing table the way a page table is built? Think about how often virtual addresses are translated to physical pages, and how efficient that is! Unfortunately, there is no MMU inside the CPU that handles IPv4 addresses, and as a general-purpose CPU it should not have one. But there is always something to be learned by analogy.
What decides the efficiency of a routing-table lookup is not the algorithm's time complexity (presumably few people use plain traversal; any algorithm a system would choose already has an acceptable complexity), but the cost of the implementation. If an algorithm can use the CPU cache, it beats an algorithm of the same complexity that cannot by an order of magnitude. To use the CPU cache, the data structure must meet strict requirements: it has to be compact and must not take up too much space. Organizing the routing table into a page-directory/page-table style structure is a good idea: it is compact enough to be loaded into the cache.

With one further optimization, the IPv4 address becomes fully analogous to a virtual address, except that the routing table's "page directory" is not indexed by a fixed number of address bits; instead take k >= 16. And what does that directory point to? Not a page table but a range table. That table could also be indexed directly by the remaining 32-k bits, but since many route entries are aggregated, it can instead be organized as a binary search over ranges, which is still a compact array. This is the core of the DXR algorithm. As you can see, there is nothing new here: the data structures of the existing algorithms have simply been reorganized, and the core idea is visible in the CPU's MMU.
The DXR routing lookup algorithm