What exactly is RDMA? Most people who don't deal with high-performance networking probably know little about it, but the arrival of NVMe over Fabrics means the storage community now has to take the time to look at it. This article introduces RDMA as I understand it.
RDMA (Remote Direct Memory Access) means directly accessing a host's memory from a remote machine without involving the host's CPU. For example, when both host and client are equipped with RDMA NICs, data is transferred directly between the two machines' memories by the NICs' DMA engines, without passing through the OS network protocol stack. This makes the technology very attractive for high-bandwidth storage systems on a LAN.
In any networking technology, the protocol is an essential part. In an RDMA environment the traditional TCP/IP stack is too heavyweight, so a dedicated protocol is needed to exploit RDMA's advantages: InfiniBand (IB). InfiniBand defines a complete framework containing most of the concepts we know from Ethernet: switches, routers, subnets, and so on.
Although InfiniBand looks very attractive, building an IB network, especially one with a complex topology, demands too much in skills and cost from users accustomed to Ethernet. To meet this need, protocols for running RDMA over Ethernet were added on top of the IB specification: RoCE and iWARP. With either of these two protocols, RDMA can run over ordinary Ethernet hardware.
Among these protocols, IB delivers the best performance while RoCE sees the widest use. Whichever technology is chosen, RDMA semantics must be preserved.
To understand RDMA intuitively, let's compare it with a conventional NIC.
A conventional NIC provides networking services to the upper layers through the OS's TCP/IP stack. In the Linux kernel there is the well-known structure sk_buff, which temporarily holds data in transit; it travels through the kernel network protocol stack and the NIC driver, and every byte a user sends or receives passes through an sk_buff. It follows that this design requires at least one memory copy, which, combined with TCP/IP processing, introduces significant overhead in latency and CPU time.
In its programming model, RDMA is somewhat similar to sockets, for example exchanging data with send and receive operations. RDMA introduces the concept of a Queue Pair (QP): each QP consists of a send queue and a receive queue, where the send queue sends data and the receive queue receives data from the other end. Both the server and the client must establish such a Queue Pair before they can communicate. RDMA supports multiple QPs; the maximum number is determined by the NIC.
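As a concrete illustration, here is a minimal sketch of creating a Queue Pair with the userspace libibverbs API (the user-space counterpart of the kernel-side ib_verbs.h referenced at the end of this article). Connection establishment, i.e. exchanging QP numbers and addresses with the peer and moving the QP through its states, is omitted; the device index and queue depths are arbitrary values chosen for illustration.

```c
/* Sketch: creating a Queue Pair with libibverbs (link with -libverbs).
 * Peer exchange and QP state transitions are omitted for brevity. */
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA device found\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);        /* first RDMA NIC */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);                     /* protection domain */
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0); /* completion queue */

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,              /* completions of send WRs land here */
        .recv_cq = cq,              /* completions of receive WRs land here */
        .qp_type = IBV_QPT_RC,      /* reliable connection */
        .cap = {
            .max_send_wr  = 16,     /* send queue depth */
            .max_recv_wr  = 16,     /* receive queue depth */
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);
    printf("created QP number 0x%x\n", qp ? qp->qp_num : 0);

    /* ... exchange QP info with the peer, transition the QP to RTS, post WRs ... */

    if (qp) ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```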
Transfers on a QP are carried out through Work Requests (WRs) rather than as a byte stream. The application specifies in the work request the address of the data to be sent or received (RDMA requires that any memory holding such data be registered with the IB driver before use). In addition, the QP's send queue and receive queue are each paired with a completion queue, which holds the results of processed WRs (sent or received); details of a finished WR are obtained from the Work Completion (WC) entries in the completion queue.
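To make this flow concrete, here is a hedged sketch of the usual sequence: register a buffer, post a send work request pointing at it, then poll the completion queue for the Work Completion. It assumes the pd, qp, and cq from the previous sketch and that the QP is already connected; the buffer size and wr_id are arbitrary.

```c
/* Sketch: register a buffer, post a two-sided send WR, and reap its Work
 * Completion. Assumes `pd`, `qp`, `cq` exist and the QP is connected. */
#include <infiniband/verbs.h>
#include <stdint.h>

static int send_one_buffer(struct ibv_pd *pd, struct ibv_qp *qp,
                           struct ibv_cq *cq, void *buf, size_t len)
{
    /* Memory must be registered with the driver before a WR may reference it. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,   /* address of the data to send */
        .length = (uint32_t)len,
        .lkey   = mr->lkey,         /* key proving the buffer is registered */
    };
    struct ibv_send_wr wr = {
        .wr_id      = 42,                 /* echoed back in the Work Completion */
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,        /* plain two-sided send */
        .send_flags = IBV_SEND_SIGNALED,  /* ask for a completion entry */
    };
    struct ibv_send_wr *bad_wr;
    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    /* Busy-poll the completion queue until our Work Completion appears. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;
    ibv_dereg_mr(mr);
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}
```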
Going further, WRs are divided into receive WRs and send WRs. A receive WR specifies where data arriving from the peer should be placed, while a send WR is an actual request to send data. The access flags used when registering host memory (ib_access_flags) determine how a remote client may operate on that memory. With remote access permission, a send WR can directly specify the remote address to operate on (this is an RDMA read/write operation) and the host does not need to participate; without it, the send WR has no say over the remote address, i.e. where the data ends up is not decided by the send WR (only SEND/RECV operations are possible) and the host must process the request. The type of operation is specified in ib_wr_opcode.
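On the passive (host) side, whether the peer may issue RDMA reads and writes is decided entirely by the access flags at registration time. A minimal sketch using userspace verbs, where the flags are named IBV_ACCESS_* rather than the kernel's IB_ACCESS_*; the buffer size is arbitrary:

```c
/* Sketch: registering host memory so the remote side may read/write it
 * directly. The buffer address and rkey must then be communicated to the
 * peer (e.g. inside a SEND/RECV message) so it can build RDMA read/write WRs. */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

static struct ibv_mr *register_for_remote_access(struct ibv_pd *pd, size_t len)
{
    void *buf = malloc(len);
    if (!buf)
        return NULL;

    /* Without the REMOTE_* flags, the peer is limited to SEND/RECV. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (mr)
        printf("advertise to peer: addr=%p rkey=0x%x\n", buf, mr->rkey);
    return mr;
}
```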
These two modes are commonly called one-sided and two-sided. Compared with two-sided, one-sided (RDMA read/write) frees the host's CPU and reduces transfer latency: in the one-sided mode, no WQE is generated on the host side, so there is no work completion to process there.
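The difference between the two modes shows up directly in how the send WR is built. The sketch below (again userspace verbs, where the opcode enum is IBV_WR_* rather than the kernel's ib_wr_opcode) shows a one-sided RDMA WRITE work request; the remote address and rkey would have been obtained from the peer out of band, for instance as advertised in the previous sketch.

```c
/* Sketch: posting a one-sided RDMA WRITE work request. remote_addr and
 * remote_rkey come from the peer, which must have registered that memory
 * with IBV_ACCESS_REMOTE_WRITE. */
#include <infiniband/verbs.h>
#include <stdint.h>

static int rdma_write(struct ibv_qp *qp, struct ibv_mr *local_mr,
                      void *local_buf, uint32_t len,
                      uint64_t remote_addr, uint32_t remote_rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 7,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,        /* one-sided write */
        .send_flags = IBV_SEND_SIGNALED,
        .wr.rdma.remote_addr = remote_addr,     /* target address on the peer */
        .wr.rdma.rkey        = remote_rkey,     /* remote access key from the peer */
    };
    struct ibv_send_wr *bad_wr;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```

Because the WR itself carries the remote address and rkey, the passive side consumes no receive WR and sees no work completion, which is exactly the "host does not participate" property described above.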
Finally, let's return to NVMe over Fabrics and see how NVMF leverages RDMA, using the handling of a client write request as an example.
1. NVMe queues map one-to-one onto client-side RDMA QPs. An NVMe command taken from the NVMe submission queue is placed in a memory region registered with the RDMA QP (possibly together with the I/O payload) and then sent out via the RDMA send queue;
2. When the target's NIC receives the work request, it DMAs the NVMe command into memory registered with the target's RDMA QP and places a work completion in the target QP's completion queue;
3. The target processes the work completion and hands the NVMe command to the back-end PCIe NVMe driver; if the transfer did not carry the I/O payload, the target fetches it with an RDMA read;
4. After the PCIe NVMe driver reports the result, the target returns the NVMe completion to the client via the QP's send queue (a simplified sketch of this target-side flow is given below).
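Putting the steps together, below is a heavily simplified, hypothetical sketch of the target-side loop. It is not the actual Linux nvmet-rdma code: nvmf_capsule, capsule_from_wr_id(), post_rdma_read(), wait_for_read_completion(), nvme_submit_cmd() and post_send_completion() are made-up stand-ins used only to show where each step happens.

```c
/* Hypothetical sketch of target-side handling of an NVMF write request.
 * nvmf_capsule and all helper functions are illustrative stand-ins,
 * not real nvmet-rdma symbols. */
for (;;) {
    struct ibv_wc wc;
    if (ibv_poll_cq(cq, 1, &wc) == 0)
        continue;                       /* step 2: the NIC has DMA'd a command
                                           capsule into registered memory and
                                           queued a work completion */

    struct nvmf_capsule *cap = capsule_from_wr_id(wc.wr_id);

    if (!cap->has_inline_data) {
        post_rdma_read(qp, cap);        /* step 3: fetch the I/O payload from the
                                           client with a one-sided RDMA read */
        wait_for_read_completion(cq);   /* its completion signals the data arrived */
    }

    nvme_submit_cmd(cap);               /* step 3: hand the NVMe command to the
                                           back-end PCIe NVMe driver */

    post_send_completion(qp, cap);      /* step 4: once the drive finishes, return
                                           the NVMe completion via the send queue */
}
```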
The above is how NVMF is implemented today; as you can see, NVMe commands currently travel in the two-sided form. Part of the reason is that the target CPU has to process the QP's work completions, submitting each received NVMe command to the PCIe NVMe driver. If that work could be offloaded, NVMF might achieve one-sided transfers, and performance would be even better.
Summary
This article has described some technical details of RDMA, the first transport implemented for NVMF. RDMA is a mature technology that has mostly been used in high-performance computing, and the advent of NVMe is letting it spread rapidly. For today's storage practitioners, especially those working on NVMe, understanding RDMA is a great help in following NVMe's development. I hope this article helps you understand RDMA; further RDMA details will need to be dug into according to your own needs.
Note
This article was first published on the WeChat public account "The Forefront of Storage Technology".
Thanks to the experts behind the sources below for the material used in this article:
1. http://www.mellanox.com/pdf/whitepapers/IB_Intro_WP_190.pdf
2. "How Ethernet RDMA Protocols Support NVMe over Fabrics", John Kim, Mellanox, SNIA
3. "RDMA Programming Concepts", OpenFabrics Alliance
4. https://www.zurich.ibm.com/sys/rdma/model.html
5. http://lxr.free-electrons.com/source/include/rdma/ib_verbs.h
Related reading
NVMe over Fabrics: concepts, applications and implementations
NVMe over Fabrics makes RDMA technology hot again