All kinds of PCIe acceleration boards


Let me declare up front that this article does not involve any specific APIs or specific vendors. It is worth noting, though, that the successes and failures of some accelerator board makers come down precisely to the question of generality: in an era when people still depend on specialized boards, and boards are still regarded as answers to specialized problems, be cautious of vendors who sell such boards while claiming they solve general problems! Although I am very optimistic about general-purpose boards, I am not an expert. And even if I were, isn't everyone always attacking the experts anyway? In short, you have to reason from one case to the rest yourself.
In the beginning, all kinds of PCIe acceleration boards emerged. Each of these boards tends to focus on one thing, so as to relieve more and more of the CPU's pressure. Once this method of plugging a card into an ordinary computer motherboard to accelerate some specific processing becomes a trend, many of the SDN buzzwords lose their mystery. Damn it, everybody already knows these concepts; the key is how to use them.
A sales agent came to our company and talked about how fierce their board was, which made me a little interested! In fact, I have always wanted a card or device that handles route lookup and packet filtering in hardware, but nothing has ever lived up to expectations. I am curious because I want to know how Cisco does it, and how these boards are designed. It is nothing more than gate circuits; how can such complex algorithms be built from hardware?! In fact, the gate circuit is not complex at all. It fully follows the Unix idea, or rather Unix follows its idea: combine a series of simple gates and you can implement almost anything you want, and you can inspect the implementation of each sub-device along the way.
Before the main body, what I have to talk about is social production. Today, no matter how simple a thing is, we can hardly make it ourselves. Suppose I want to fry a braised lion's head (the pork meatball dish): I would have to waste a whole pot of oil, which is not worth it, so I choose to buy one instead. In this way the division of labor splits us into different jobs! The interface between me and the cook is currency: I give him money, he gives me the lion's head. (Currency is really a good thing; instead of barter, exchange at any value becomes possible.) Of course I did not consider the restaurant owners; I ignore them because I look down on them. Damn, I now find I cannot even make a small toy bucket for my daughter. When I wanted one, my dad made a toy bucket for me... my dad also made a box, exquisitely crafted, which is still usable today. Although it is not a luxury item, he made it entirely himself, from the carpentry to the finishing, and I admire that! But that is no longer enough. If you want to make even something simple today, you need a complete set of tools, and a complete set of tools is far too costly for an individual. That is to say, as society develops, everything comes to depend on the division of labor. Isn't that exactly the idea of Unix?
Doing only one thing is a great idea. It creates the possibility that, through an exchange medium, simple elements can be combined into arbitrarily complex things. This exchange medium is the interface, just as currency was above. The CPU is a microscopic representative of this idea; the multi-core architecture is a macroscopic one. The separation of the control plane and the data plane is yet another.

The OS Kernel in the Multi-core Era

In the single-core era, implementing the protocol stack inside the kernel exposed no problems: user-mode programs and kernel code shared a single running stream, and apart from the cost of frequent switching itself, it was convenient for the kernel to act as a proxy for many transactions. In the multi-core era, however, the kernel should be decentralized as much as possible, acting only as an administrator and handing most other work over to user mode. User mode can then freely bind different tasks to different cores, so that these cores work in a dedicated, flat-out manner without being interrupted by kernel business: kernel preemption, thread switching, process migration between processors, timer interrupts, and so on.
For a long time I admired the Linux kernel's load balancing of processes across cores, and I also appreciated its heuristic, "intelligent" process scheduling. Now I lean the other way. Those mechanisms are indeed clever: a process that has just woken from sleep gets a better chance to run, and process migration is smoothed by historical weights. But such intelligence was introduced precisely because the kernel assumed user-mode processes could not perform well by themselves on a stage like the multi-core architecture, so the kernel took over transactions that should belong to user mode. To its credit, the kernel has always kept an interface for disabling these interventions: you can forcibly bind a process to a core.
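On Linux that interface is sched_setaffinity(2). A minimal sketch (the choice of core 2 is mine, purely for illustration):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(2, &set);           /* run this process on core 2 only */

        /* pid 0 means "the calling process"; after this call the
         * kernel's load balancer will no longer migrate us elsewhere. */
        if (sched_setaffinity(0, sizeof(set), &set) == -1) {
            perror("sched_setaffinity");
            exit(EXIT_FAILURE);
        }

        printf("pinned; now running on core %d\n", sched_getcpu());
        return 0;
    }

Once pinned, the "intelligent" migration machinery simply has nothing left to decide for this process.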
According to the original Unix idea, the kernel is the control plane and user space makes up the data plane. The processor architecture of the time, however, turned this idea into an illusion.
Initially the system had only one processor core, so a time-sharing system had to be designed in which multiple tasks, including the operating system kernel itself, shared a single running stream. This peer-to-peer coexistence of kernel and applications changed people's thinking, and almost all general-purpose operating systems inherited it: the deliberate distinction between data plane and control plane disappeared. Where did people's attention shift? To the core task of improving the efficiency of the time-sharing system, with the system performing (weighted) fair scheduling across all processes, the kernel included. The result was a series of "intelligent" kernel algorithms bred by a spoiled user space. In truth these algorithms are guesses the kernel derives from the little it can observe about a process; the kernel never knows what the process actually is. This leads to misunderstandings: the kernel documentation says enabling some option will improve performance, and the result is an embarrassing performance drop...
The title of this section is "The OS Kernel in the Multi-core Era", but so far I have not said what the OS kernel should look like in this new era. Still, by reading the past as a mirror of the future, and by understanding the history above and the sorrow it brought, a rough outline of the new-era OS kernel emerges.
First, there is no longer any need for the kernel to coexist with user programs in one time-sharing system. Of course, if you want to, you can still let them coexist; virtualization is in fact an embodiment of this, with one multi-core system running multiple OS kernels... As for the use of cores, the kernel needs to hand much more control to user mode. Think about the essence of computer programming: it is programming the CPU so that it runs our program, not programming the system-call interface that the OS kernel happens to provide. At that point the OS kernel only needs to manage and allocate shared resources; when a resource is no longer time-shared, the kernel should give up managing that part, intervening only when one core runs multiple processes in a time-sharing manner, or when a process actively requests it. This makes it possible for a core, or a group of cores, to focus on a single job. It also points to a trend: the super-many-core architecture, where processor cores are allocated per function rather than per process, which can completely avoid the overhead of context switching.
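Linux already exposes knobs pointing in this direction: the isolcpus, nohz_full and rcu_nocbs boot parameters remove cores from the general scheduler, stop the periodic tick on them, and offload RCU callback processing, so a pinned user-mode task can own a core outright. A sketch of such a kernel command-line fragment (the range 2-7 is just an example of mine):

    isolcpus=2-7 nohz_full=2-7 rcu_nocbs=2-7

Combined with the affinity call shown earlier, a worker pinned to one of these cores runs with essentially no kernel interference.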
In the traditional sense, any off-chip operation requires a bus that connects everything, and this bus is often the bottleneck! Intel's architecture made this style popular: back when people compared the performance of their computers, front-side bus bandwidth was a hard metric. AMD later changed this while remaining compatible with IA-32, but only partially. One might assume that on-chip storage would solve the problem; due to process and cooling limitations, however, on-chip storage cannot be very large. Yet even without on-chip storage there is still a way out: change how the processor cores connect to memory, replacing the shared bus with a matrix (crossbar) interconnect. In the multi-core era, the existence of the bus is a mistake; in other words, the bus is a relic of the single-core era.
The PCIe bus made a good start by softening the hard bus: it turned the bus topology into a switch-based star topology, repeating the evolution of Ethernet; perhaps only the original move to switched Ethernet is comparable. PCIe, however, is essentially meant to interconnect peripheral devices and is not suitable for the processor-memory connection, but its idea can be borrowed: change the topology, so that fully interconnected, matrix-style switching fabrics replace the bus topology. Once the bus becomes switched, "bus" is just a name, and like CSMA/CD, the protocols that ran on the front-side bus will cease to exist.
The protocol stack in the new architecture belongs to the data plane. The correct approach is therefore a user-mode protocol stack that spreads its running streams across many cores: one group of cores handles link-layer forwarding, one group handles ACL matching, one group handles packet classification... There is no context switching at all, which keeps every cache fresh and usable, and a matrix interconnect lets each core access its own memory region quickly, reducing bus contention and forced waiting.
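To make this concrete, here is a minimal sketch in C of how such a pipeline is typically wired; every name in it is invented for illustration, and it is not any vendor's API. An RX core hashes each packet's flow and hands it to a fixed worker core through a lock-free single-producer/single-consumer ring, so a given flow always hits the same warm cache and nothing ever context-switches:

    #include <stdint.h>
    #include <stdatomic.h>

    #define RING_SIZE 1024                  /* must be a power of two */

    struct pkt;                             /* opaque packet descriptor */

    /* One SPSC ring per worker core: the RX core enqueues, exactly one
     * worker dequeues, so no locks are needed. (The consumer side,
     * not shown, does a release store of tail after each dequeue.) */
    struct ring {
        _Atomic uint32_t head;              /* written by producer */
        _Atomic uint32_t tail;              /* written by consumer */
        struct pkt *slot[RING_SIZE];
    };

    static int ring_enqueue(struct ring *r, struct pkt *p)
    {
        uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

        if (head - tail == RING_SIZE)
            return -1;                      /* ring full: drop or back off */
        r->slot[head & (RING_SIZE - 1)] = p;
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return 0;
    }

    /* The RX core classifies by flow hash so one flow always lands on
     * the same worker core and its cache stays warm. */
    static void dispatch(struct ring *workers, int n_workers,
                         struct pkt *p, uint32_t flow_hash)
    {
        while (ring_enqueue(&workers[flow_hash % n_workers], p) < 0)
            ;                               /* busy-wait: cores are dedicated anyway */
    }

The busy-wait is deliberate: on a dedicated core, spinning is cheaper than sleeping and waking through the kernel.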
After network data is received by the NIC, it should be placed directly into a user-mode buffer. You think this has nothing to do with the kernel? No! The NIC's data is received by the driver, and the driver is managed by the OS kernel. That is, the receive action on the NIC chip still happens in kernel mode; the kernel simply hands the packet to user mode immediately instead of running it through the kernel protocol stack. A similar idea exists in direct I/O for disk operations: the kernel no longer keeps a buffer of its own, and user mode is solely responsible for buffering.
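Direct I/O is the easiest place to see this division of responsibility. A minimal sketch using Linux's O_DIRECT flag (the filename data.bin is made up), where the kernel keeps no page-cache copy and the data lands straight in the user-supplied buffer:

    #define _GNU_SOURCE                 /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        void *buf;

        /* O_DIRECT bypasses the kernel page cache, so the buffer must be
         * aligned; 4096 bytes satisfies most devices and filesystems. */
        if (posix_memalign(&buf, 4096, 4096) != 0)
            return 1;

        int fd = open("data.bin", O_RDONLY | O_DIRECT);
        if (fd < 0)
            return 1;

        ssize_t n = read(fd, buf, 4096);    /* DMA lands in OUR buffer,
                                             * not in a kernel one */
        close(fd);
        free(buf);
        return n < 0;
    }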
I have never been keen on coprocessors, because they are a rebellion against general-purpose computing, giving up, partially or completely, the convenience, flexibility and economy that programming brings. Take the early ASIC forwarding chips: the instructions were completely frozen into the hardware. Then coprocessors were added: with some flexibility they became programmable, but the functions they could perform were limited. As this shows, things become more universal step by step. Over the past century, CPU + programming replaced all kinds of specialized devices; when that architecture hit a bottleneck, dedicated devices re-emerged, such as the ASICs, encryption cards and coprocessors mentioned above; and now, with GPUs and various super-many-core processors, a higher level of generalization is under way.
The so-called smart chip on current routing cards is actually very specialized; that is, it has no general processing capability. A typical routing chip can be built from ordinary comparators, shift registers, selectors and multiplexers; anyone who understands the principles of computer organization can design one. For the software side, look at the trie-based route lookup in Linux, or the BSD radix-tree route lookup, which is much more complicated than the pure hardware implementation but shares the same basic idea (a toy version is sketched below). To improve performance further, a manufacturer can use parallel CAM matching, which greatly accelerates route lookup; but as the general-purpose processor architecture becomes flatter and flatter (lower frequency, lower power consumption, more cores), I think the myth of the dedicated chip will soon be broken.
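Here is that toy version: a unibit trie for IPv4 longest-prefix match in C. All names are invented; real implementations such as Linux's fib_trie use path-compressed multibit tries, but the principle, consuming the address one bit per level, is exactly what a hardware pipeline of comparators and shift registers does per clock:

    #include <stdint.h>
    #include <stdlib.h>

    struct trie_node {
        struct trie_node *child[2];
        int next_hop;                   /* -1 = no route stored here */
    };

    static struct trie_node *node_new(void)
    {
        struct trie_node *n = calloc(1, sizeof(*n));  /* error checks omitted */
        n->next_hop = -1;
        return n;
    }

    static void trie_insert(struct trie_node *root, uint32_t prefix,
                            int plen, int next_hop)
    {
        struct trie_node *n = root;
        for (int i = 0; i < plen; i++) {
            int bit = (prefix >> (31 - i)) & 1;       /* walk from the MSB */
            if (!n->child[bit])
                n->child[bit] = node_new();
            n = n->child[bit];
        }
        n->next_hop = next_hop;
    }

    static int trie_lookup(struct trie_node *root, uint32_t addr)
    {
        struct trie_node *n = root;
        int best = -1;                  /* longest match seen so far */
        for (int i = 0; i < 32 && n; i++) {
            if (n->next_hop >= 0)
                best = n->next_hop;
            n = n->child[(addr >> (31 - i)) & 1];
        }
        if (n && n->next_hop >= 0)
            best = n->next_hop;
        return best;
    }

    /* usage sketch:
     *   struct trie_node *root = node_new();
     *   trie_insert(root, 0x0A000000, 8, 1);   // 10.0.0.0/8  -> hop 1
     *   trie_insert(root, 0x0A010000, 16, 2);  // 10.1.0.0/16 -> hop 2
     *   trie_lookup(root, 0x0A010203);         // 10.1.2.3 -> 2 (longer match wins)
     */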
Interaction between new architectures and traditional architectures: the new architectures mostly exist in the form of boards, plugged into the PCIe slot of a traditional machine's motherboard. The reason is not only economic, namely the existing investment in traditional motherboards, but also that most of these new-architecture boards are specialized: GPUs, Intel gigabit/10G cards, Cavium encryption cards, Tilera GX processing cards. Personally I prefer the GPU and the Tilera processing card.
Given that, under the influence of existing investment and traditional thinking, the main program still runs on the old architecture while the new architecture exists as a board, interaction with the host device is inevitable. I think this approach is deeply inappropriate: the traditional big-and-all-inclusive, empire-style OS kernel, the traditionally congested motherboard bus, and applications designed with the old single-core mindset all become bottlenecks and burdens. Why not put the program directly on the card? This raises the question of who acts as the control plane for the work on the board; the answer, of course, is an OS kernel of the new-architecture era. That is, a super-lightweight OS kernel runs on the board, delegating maximum authority to user-mode applications, not processing the protocol stack, and providing only management and scheduling of shared resources. The Tilera GX processing card offers exactly this mode. PCIe is then no longer a data-plane interface; it acts mostly as a management interface. The management plane is provided by the old-architecture host, the control plane by the board's OS kernel, and the data plane by the board's user mode. If no management is needed, PCIe is just a power connector.
One thing must be clarified: many people use accelerator boards to make up for the shortcomings of the host processor's clock speed and power consumption, but their focus stays on the motherboard, and to them the system on the traditional motherboard is more than just a management plane. Their demand has driven PCIe bandwidth up continuously, and even so PCIe cannot keep up. In general, most people use these boards only for acceleration, not to let them take over the data completely. This is actually the wrong idea, and its essence is the gatekeeper mentality. Since ancient times people have rejected borderless things, so they built all kinds of doors and gates, such as the ancient city gate. Follow this idea and keep piling traffic onto these doors or ports, and sooner or later they become the bottleneck. So tearing these doors down is the right idea. Rather than have the motherboard and the board collaborate, why not hand everything over to the board directly?!
As for core network technology, Cisco and other vendors adopted such a new architecture long ago. Simply put, their core network services are single-purpose, things like routing, switching and firewalling, so the model was not promoted more widely. It was only after the terminal virtualization era arrived that the pressure on resource subnets such as data centers kept growing, forcing servers either to offload or to adopt a hybrid architecture that separates the data plane, control plane and management plane. If we take SDN to be this concept of separation, then SDN was born jointly of traditional core network technology and of data center technology in the cloud era. What does SDN add besides plane separation? The idea of centralized control, which, like a cable tie, provides a unified view. Every network technology has gone through such a process.
At first there were no roads; when enough people walked the same way, roads formed. At first there was no management; in the end there was centralized management. So it has gone with every road network, whether the ancient Roman roads or China's railway network. The telephone network was no different: look at the telephone poles of the late 19th century and you will see how messy they were, yet are we confused by them today? One could say the whole of human civilization is a process of networking, continually distributing and centralizing, integrating and separating; and the protagonists are the data plane, the control plane and the management plane.
Finally, let's talk about SDN. It is not as simple as you think, and it may take some disruptive thinking to understand it well. Remember that today's networks are controlled in a distributed manner; this is the natural result of technological evolution and the foundation laid by history. Broadly speaking, this is the inevitability of the layered protocol model. In this model the network is divided into an underlay (bearer) network and an overlay network, the overlay running as the payload of the underlay; any layer can act as the underlay and any layer as the overlay. "X over Y" thus falls into three modes:
1. Typical protocol-stack encapsulation, such as UDP over IP, HTTP over TCP, IP over Ethernet;
2. Lower layers encapsulated over upper layers, such as PPPoE and IP over ATM... OpenVPN is also based on this model: it is IP/Ethernet over UDP/TCP;
3. A new layer inserted at will between two existing layers, such as IPsec ESP/AH and SSL/TLS.
But SDN breaks all of this. In the SDN world, data is forwarded entirely according to the flow table, while the building of the flow table is completely separated from data forwarding. In other words, even without a layered protocol stack, you only need to tell the device which port to send a packet out of to complete its routing. Data-plane processing no longer parses packet headers layer by layer, which greatly reduces the workload: it just matches the flow table. Even if you invent a custom non-TCP/IP protocol, you only need to install a flow table entry and it will be forwarded! Building the flow table has nothing to do with any particular protocol; it is an independent protocol of its own! So, to put it simply, you can move the layered protocols entirely to the control plane and discard layering on the data plane altogether...
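As a minimal sketch of that match-action idea (the structures below are invented for illustration and are not OpenFlow's actual wire format), the entire "protocol" of the data plane reduces to a masked comparison plus an output port:

    #include <stdint.h>

    struct flow_key {
        uint8_t bytes[16];              /* raw header fields, already extracted */
    };

    struct flow_entry {
        struct flow_key match;          /* stored pre-masked: match = value & mask */
        struct flow_key mask;           /* wildcarded bits are 0 in the mask */
        int out_port;                   /* the "action": which port to send to */
    };

    /* Linear search for clarity; real switches use TCAM or hash tables,
     * but the point stands: no layered parsing, just match and forward. */
    static int flow_lookup(const struct flow_entry *table, int n,
                           const struct flow_key *key)
    {
        for (int i = 0; i < n; i++) {
            int hit = 1;
            for (int b = 0; b < 16; b++) {
                if ((key->bytes[b] & table[i].mask.bytes[b]) !=
                    table[i].match.bytes[b]) {
                    hit = 0;
                    break;
                }
            }
            if (hit)
                return table[i].out_port;   /* forward; no protocol stack involved */
        }
        return -1;                          /* miss: punt to the control plane */
    }

A table miss is exactly where the control plane re-enters: the packet is punted to the controller, which computes a path and installs a new entry.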
Look again at what the new-architecture boards demonstrate and what we encounter in reality: SDN is not a new idea but an old one. Setting out from the day Unix was born, we have finally walked back to the Unix idea.
