Octeon and DPDK (Intel Data Plane Development Kit)

Source: Internet
Author: User
Tags: prefetch
http://note.youdao.com/share?id=248a6e3b877213484a02b507f69ccf84&type=note

Sina blog link: http://blog.sina.com.cn/s/blog_6b0d60af0101cu1z.html
Octeon is designed for packet processing and provides a large number of hardware acceleration units; its chip design team previously built the Alpha processors, so it also has a strong pedigree in processor design. DPDK is Intel's packet-processing software solution for the x86 architecture under Linux. It boils down to a few main pieces: a hardware access layer, user-space poll-mode drivers, and the mempool, ring, mbuf, and timer libraries. The last four are pure software; there are also a number of other pure software libraries plus x86-specific support.
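As a rough illustration of that structure, here is a minimal, hypothetical DPDK skeleton modeled on the old l2fwd-style sample applications: the EAL plays the role of the hardware access layer, and each worker lcore runs its processing loop entirely in user space.

```c
#include <stdio.h>
#include <rte_eal.h>
#include <rte_lcore.h>
#include <rte_launch.h>

/* Per-lcore worker: the poll-mode processing loop would live here. */
static int lcore_main_loop(void *arg)
{
    (void)arg;
    printf("worker running on lcore %u\n", rte_lcore_id());
    return 0;
}

int main(int argc, char **argv)
{
    /* EAL init: huge pages, PCI probing, memory zones, lcore setup. */
    if (rte_eal_init(argc, argv) < 0)
        return -1;

    /* Run the same loop on every lcore, including the master, then wait. */
    rte_eal_mp_remote_launch(lcore_main_loop, NULL, CALL_MASTER);
    rte_eal_mp_wait_lcore();
    return 0;
}
```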

Let me start with some subjective impressions. I used to be an Octeon FAE and only came to DPDK later, so I am biased. A digression on DPDK's history: DPDK was developed by 6WIND, which first ported its software to Octeon and then built DPDK out of the need to port that software to x86. So if you already know Octeon, understanding DPDK will not cost you more than a pack of cigarettes. Octeon is a special-purpose processor; DPDK tries to recreate Octeon on a general-purpose processor, which makes a side-by-side comparison of the two systems quite interesting.

Octeon's manual is hard to read but its API is concise; DPDK's guide is easy to read but its API is annoying. I think the main reason is that x86 is not as pure and clean as Octeon. In Octeon everything is abstracted just right; getting the abstraction right is a headache, but once the model is in place the API comes out clean and elegant. DPDK's mempool, for example, is muddled, while Octeon's FPA (Free Pool Allocator) is very neat; FPA is simple enough that both software and hardware can use it directly. Initializing a DPDK mempool requires two callback functions (a sketch follows below). The free objects/buffers of a mempool are managed through a ring, presumably to exploit the ring's support for simultaneous access from multiple cores, so that one mempool can be shared by many cores/threads. But the ring costs extra memory, and a DPDK ring is fixed in size. On top of that, mempool adds a per-core cache to cut down on ring accesses and, ultimately, on lock contention, because the DPDK ring is not completely lock-free.
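For reference, this is roughly what the mempool creation described above looks like in the DPDK 1.x era (names such as NB_MBUF and pkt_pool are illustrative); note the two constructor callbacks and the per-lcore cache parameter.

```c
#include <rte_mempool.h>
#include <rte_mbuf.h>
#include <rte_lcore.h>

#define NB_MBUF   8192
#define MBUF_SIZE (2048 + sizeof(struct rte_mbuf) + RTE_PKTMBUF_HEADROOM)

static struct rte_mempool *create_pkt_pool(void)
{
    return rte_mempool_create("pkt_pool",
                              NB_MBUF,        /* number of objects in the pool            */
                              MBUF_SIZE,      /* size of each object                      */
                              32,             /* per-lcore cache, to reduce ring traffic  */
                              sizeof(struct rte_pktmbuf_pool_private),
                              rte_pktmbuf_pool_init, NULL, /* callback #1: pool constructor       */
                              rte_pktmbuf_init, NULL,      /* callback #2: per-object constructor */
                              rte_socket_id(), 0);
}
```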
How does Octeon's FPA avoid these problems? First, FPA is an I/O unit: any request to FPA, whether from software (a core) or from hardware (units other than the cores, such as PIP and PKO), completes as a single I/O read/write, and the underlying hardware serializes those requests. The bus cannot carry two requests at once, so requests arrive one by one, which is equivalent to a lock; the FPA unit therefore never has to think about locking internally and only has to manage buffers. FPA also uses a very clever trick: the management information for a free buffer is stored inside the free buffer itself, so not a single byte of memory is wasted. In addition, an FPA pool has no capacity limit and you can keep adding new free buffers, because internally it is managed as a list rather than an array. You can even free a buffer from one pool into another, as long as the two pools use the same buffer size. The FPA hardware also supports prefetching: it pushes buffer addresses into a core's scratch memory (a carved-out part of the data cache, i.e., the L1 cache), so requesting a free buffer costs only one cycle.

An aside: whether a pool is implemented in software or by FPA, the buffers it manages do not have to come from one contiguous piece of memory. Yet both the Octeon SDK and DPDK only offer creation functions that carve the pool out of a single contiguous region. If you want discontiguous memory in one pool, on the Octeon SDK you can simply call cvmx_fpa_free to add buffers from discontiguous physical memory to the same pool (the buffer size must match); on DPDK that does not work, so you end up implementing your own mempool.

Also, with the memory system in mind, DPDK's mempool aligns the start address of each free object according to the number of DRAM channels, ranks, and DIMMs, so that two adjacent objects/buffers never sit on the same channel/rank/DIMM. This spreads the load across channels and ranks, but two problems remain. First, buffers are not necessarily freed in order: after a program has run for a while the access order may be 1, 3, 5 rather than 1, 2, 3, 4, 5. Second, in a multi-core system there may be several mempools, and the per-mempool layout optimizations can cancel each other out. Octeon optimizes memory access in the DRAM controller instead: with many cores there are many outstanding DRAM requests, and the controller automatically sorts and reorders them based on channel and bank information, completely transparently to software. Presumably Intel's processors do something similar. That was another pack of cigarettes; I will pick this up another day. To be continued...

PIP/PKO and rte_ether

For x86 network interfaces we cannot avoid a few abbreviations; without these features x86 would be useless for packet processing, though not every feature is needed at the same time:
RSS: Receive Side Scaling
VMDq: Virtual Machine Device Queues
DCB: Data Center Bridging (lossless delivery, congestion notification, priority-based flow control, priority groups)
DCA: Direct Cache Access

Intel Ethernet controllers use descriptor rings for free-buffer management on both the receive and transmit side. This creates a problem: software has to decide how many buffers to post to each port's receive ring. Post too many and those buffers are dedicated to that one port and cannot be shared with others, which is wasteful; post too few and a burst can momentarily exhaust the ring and drop packets. It is hard to balance, and the simplest answer is to waste more memory (a sketch of the per-port provisioning follows below). In Octeon, PIP requests a free buffer only after a packet has actually arrived, and PIP prefetches only a small number of free buffers, so the free buffers remain available to every port and every hardware and software unit at all times.
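To make the x86 side concrete, here is a hedged sketch of per-port descriptor-ring provisioning using the rte_ethdev API (the descriptor counts and single-queue layout are illustrative; passing NULL for the queue configuration to take driver defaults follows the later DPDK releases).

```c
#include <rte_ethdev.h>

#define NB_RXD 128   /* RX descriptors: buffers committed up front to this port alone */
#define NB_TXD 512   /* TX descriptors */

static int port_setup(uint8_t port, struct rte_mempool *pool)
{
    struct rte_eth_conf port_conf = { 0 };   /* default port config; RSS etc. would go here */
    int ret;

    ret = rte_eth_dev_configure(port, 1, 1, &port_conf);   /* 1 RX queue, 1 TX queue */
    if (ret < 0)
        return ret;

    /* Buffers for the RX ring are drawn from 'pool' now and stay dedicated to
     * this queue; choosing NB_RXD is the waste-vs-drop trade-off described above. */
    ret = rte_eth_rx_queue_setup(port, 0, NB_RXD,
                                 rte_eth_dev_socket_id(port), NULL, pool);
    if (ret < 0)
        return ret;

    ret = rte_eth_tx_queue_setup(port, 0, NB_TXD,
                                 rte_eth_dev_socket_id(port), NULL);
    if (ret < 0)
        return ret;

    return rte_eth_dev_start(port);
}
```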
In Octeon, PIP places received packets into POW queues, and a core simply asks POW for work (a WQE) from its groups rather than polling any particular port, so as long as there are packets/WQEs waiting, the core never makes a useless round trip. On x86, port polling is unavoidable, whether or not the port has received anything.

On the transmit side, DPDK checks whether a previously used TX descriptor still has an rte_mbuf attached and, if so, frees it. This is a clever trick: it avoids the TX-done interrupt (whose purpose is to tell the software that the controller has finished sending) and it avoids having to poll the descriptor ring periodically. Because releasing the buffer is a purely software action, combining it with DPDK's indirect buffers makes it easy to release packet data buffers correctly in multicast; whether the transmit API actually takes this into account, I would have to check the code. In Octeon, software can choose to let the hardware free the buffer once it has been sent, or have PKO notify the software, which then frees the buffer itself. For multicast you either copy packets or use reference counters, but the hardware does not understand reference counters, so with reference counters the buffer must be freed by software. PKO can notify software in two ways: either it clears a memory word to zero, which software must poll or query, or it submits a WQE, which means solving the buffer-release problem costs yet another WQE. Neither is convenient; if the newer Octeons have a better answer to this, please let me know.

Timer

I will not describe the implementation of the DPDK timer here, but one thing is clear: it relies on each lcore periodically calling rte_timer_manage() to check whether any timer has expired (a minimal sketch of this usage model follows below). That has the same problem as polling ports for packets, namely wasted CPU, though there is no good alternative. The implementation algorithm still needs analysis, since it bears directly on efficiency. Having now read the DPDK timer code: it is a list, traversed from head to tail every time, very inefficient; I am speechless. In the end I had no choice but to implement a software version of Octeon's timer algorithm myself. Thanks to the hardware timer unit and the POW unit, Octeon's timer manages to be both accurate and efficient without any fuss.
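As a reference for the usage model criticized above, here is a minimal sketch modeled on the DPDK timer sample application (the one-second periodic timer and the loop structure are illustrative): callbacks only ever run because each lcore keeps calling rte_timer_manage() from its own loop.

```c
#include <rte_timer.h>
#include <rte_lcore.h>
#include <rte_cycles.h>

static struct rte_timer tim;

/* Runs inside rte_timer_manage() on the lcore that owns the timer. */
static void timer_cb(struct rte_timer *t, void *arg)
{
    (void)t; (void)arg;
    /* ... timeout work ... */
}

static void timer_setup(void)
{
    rte_timer_subsystem_init();
    rte_timer_init(&tim);
    /* Fire once per second, periodically, on the current lcore. */
    rte_timer_reset(&tim, rte_get_timer_hz(), PERIODICAL,
                    rte_lcore_id(), timer_cb, NULL);
}

static int lcore_loop(void *arg)
{
    (void)arg;
    for (;;) {
        /* ... packet processing ... */

        /* Walk this lcore's pending timers and run the expired ones.
         * If nothing ever calls this, no timer callback ever fires. */
        rte_timer_manage();
    }
    return 0;
}
```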
Octeon manages all messages/packets uniformly through the POW unit, and a core handles every kind of work through the single get_work interface. This has two advantages. First, it avoids the CPU waste of useless polling. Second, priorities can arbitrate between different kinds of work at the same time: imagine polling packets and timers through two separate interfaces while still wanting high-priority timer messages handled first and low-priority timer messages handled in time order alongside received packets; doing that in software is troublesome and expensive.

Synchronization

Octeon has the hardware POW unit, and POW has another name: SSO, for Schedule/Synchronization/Order. As the name says, Octeon has hardware to schedule, synchronize, and order packets, tasks, and cores. DPDK, on the other hand, is essentially a user process on Linux, so any synchronization mechanism available to a Linux application is available to DPDK. At the lowest level, Octeon supports three methods of synchronization: POW, shared memory, and inter-core interrupts. DPDK can only use shared memory; locks and semaphores are, underneath, shared memory too. POW is really the essential difference between Octeon and x86. Octeon = I/O + POW + MIPS cores, and the various hardware acceleration engines are in effect I/O (apart from the crypto units inside the cores); x86 is just the CPU as ordinary programmers know it.

What's wrong with DPDK

1. Why do the examples use fewer RX descriptors than TX descriptors? #define RTE_TEST_RX_DESC_DEFAULT 128 versus #define RTE_TEST_TX_DESC_DEFAULT 512.
2. The examples declare variables with a bare "unsigned", without "int". Is that really good style? How many bits does such a variable have?

Q&A

Changfeng (15:20:10): Seconding yonglong2372. I am interested in the hardware access layer: how do you receive packets directly in user mode? Can the kernel be bypassed? I think this is the part that really matters for performance; the other libraries are secondary.
> Answer: user space can access the device registers and the RX/TX descriptor rings directly to send and receive packets. Linux supports this, for example through mmap; DPDK is only reusing existing Linux mechanisms here. Tilera uses the same approach, and the Octeon SDK supports this mode as well. The other pieces matter for performance too, for example the poll-mode driver. To quote a friend from a few days ago: many small optimizations eventually add up to a big gap.
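To illustrate the poll-mode receive path discussed in the Q&A, here is a hedged sketch of a DPDK receive loop (the port/queue numbers and burst size are illustrative): the PMD reads the RX descriptor ring from user space, with no interrupts and no kernel involvement, but the loop spins whether or not packets are arriving.

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static void rx_loop(uint8_t port)
{
    struct rte_mbuf *bufs[BURST_SIZE];
    uint16_t i, nb_rx;

    for (;;) {
        /* Poll the port's RX ring; returns between 0 and BURST_SIZE packets. */
        nb_rx = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);

        for (i = 0; i < nb_rx; i++) {
            /* ... process the packet ... */
            rte_pktmbuf_free(bufs[i]);   /* return the buffer to its mempool */
        }
    }
}
```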
