zOVN: Lossless Virtual Network


Note: zVALE. This is also a paper published by IBM Research, and it may be seen as an extension of VALE.

Data center networks are currently shaped by two trends. The first is lossless layer-2 fabrics, based on enhanced Ethernet and InfiniBand, which improve performance by preventing packet drops. The second is Software Defined Networking, which makes overlay virtual networks practical (SDN is the driving force behind overlay virtual networks). The problem is that the two trends conflict: the physical fabric uses a flow-based control mechanism to prevent packet loss, while the virtual network, which has no flow control mechanism, still drops packets. This article therefore proposes a zero-loss overlay virtual network; the prototype, zOVN, is described below.

Introduction

This article mainly introduces two background trends: network virtualization and lossless fabrics. For example, FCoE requires Converged Enhanced Ethernet (CEE).

Data center applications demand low latency, yet virtualization and lossless high-performance fabrics have usually been developed along separate lines, each influencing the data center independently. The goal of this article is to analyze and compare the impact of flow control mechanisms on workload performance in a virtualized environment.

Network Virtualization

Server virtualization makes it possible to create, delete, and migrate virtual machines dynamically and automatically. The data center network must support these functions without imposing too many restrictions. Besides VM migration and ease of management, traffic isolation is also important for security. However, network virtualization raises many problems, such as exhaustion of VLANs, IP addresses, and MAC address table capacity. Many network virtualization solutions have been proposed to solve these problems, notably various overlay networks.

Definition from Wikipedia:

An overlay network is a computer network that is built on top of another network. Nodes in the overlay can be thought of as being connected by virtual or logical links, each of which corresponds to a path, perhaps through many physical links, in the underlying network. For example, distributed systems such as peer-to-peer networks and client-server applications are overlay networks because their nodes run on top of the Internet. The Internet was originally built as an overlay upon the telephone network, while today (through the advent of VoIP), the telephone network is increasingly turning into an overlay network built on top of the Internet.

One example of an overlay network is VXLAN, which addresses data center limitations such as the 4096-VLAN ceiling, which cannot meet the needs of large-scale cloud computing data centers.

VXLAN (Virtual eXtensible LAN) is an overlay network technology that encapsulates MAC frames in UDP, adding a packet header of about 50 bytes. Refer to this article.
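To make the 50-byte figure concrete, here is a small C sketch written for this article (the header layout follows RFC 7348, but the struct and names are illustrative) that adds up the outer headers VXLAN prepends to each frame:

```c
/* Illustrative sketch: where VXLAN's ~50 bytes of overhead come from. */
#include <stdint.h>
#include <stdio.h>

struct vxlan_hdr {             /* 8 bytes, per RFC 7348 */
    uint32_t flags_reserved;   /* flags (I bit set) + reserved bits */
    uint32_t vni_reserved;     /* 24-bit VXLAN Network Identifier + reserved */
};

int main(void)
{
    size_t outer_eth = 14;                        /* outer Ethernet header */
    size_t outer_ip  = 20;                        /* outer IPv4 header */
    size_t outer_udp = 8;                         /* outer UDP header */
    size_t vxlan     = sizeof(struct vxlan_hdr);  /* 8 bytes */

    /* 14 + 20 + 8 + 8 = 50 bytes added to every encapsulated frame. */
    printf("VXLAN overhead: %zu bytes\n",
           outer_eth + outer_ip + outer_udp + vxlan);
    return 0;
}
```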

Test packet loss in the virtual environment

First experiment: use Iperf to test whether packet loss occurs.

Iperf is a network performance testing tool (the usual network speed test on Linux). The two generators start injecting traffic at full speed, and we then count the packets lost at each point in Figure 1 from the collected statistics. The results are shown in the following figure.

Because of the different configurations in Table 1, the seven configurations C1 to C7 forward different amounts of traffic in the 10 s window. Performance in a virtual environment is tied to compute resources, so compute-intensive configurations yield lower throughput and, with it, lower packet loss; the e1000, for example, shows no loss at all. Virtualization-optimized NICs such as virtio achieve higher throughput, but this in turn overflows the vswitch. VALE, with its particularly optimized performance, pushes the packet-loss bottleneck further into the virtual machine's kernel stack. All of these losses stem from the absence of a proper flow control mechanism between virtual network devices.

The second experiment:

The purpose is to measure the bandwidth a vswitch can sustain without packet loss.

The generator speed is increased to 5 MB, and we reuse the previous test setup. The results of this experiment are similar to the previous ones; the figure looks cleaner, and the saturation bandwidth of each configuration is measured.

Even far below overload, packet loss in the virtual environment is several orders of magnitude higher than in the physical environment (on the order of 10^-2 versus 10^-8, respectively). These observations confirm the earlier analysis: the virtual environment is less stable because of contention for processor and memory resources.

zOVN Implementation: Design Goals

A converged virtual network must meet all application requirements, such as losslessness (for HPC and storage) and the performance demanded by I/O-intensive workloads (user-facing response times below a millisecond).

So what must be done to meet the lossless requirement? The first step is to analyze the path and ensure that every point along it is lossless, so that the path as a whole is lossless. As packets travel between programs running in virtual machines (VMs), they are passed from one queue to another, with the queues residing in different software or hardware components. Here we describe this queue system in detail and highlight the flow control mechanism between the queues. The packet path is traced in Figure 5.

In this figure we can see components such as the qdisc and its queue, the socket TX and RX buffers, and a variety of NIC and switch queues. The left side shows the send path and the right side the receive path. We must first understand the sending and receiving mechanisms.

Sending rules: Qdisc

By configuring different types of network interface queues, traffic can be shaped by changing the packet transmission rate and priority. When the kernel needs an interface to send packets, the packets are queued according to that interface's qdisc (queuing discipline); the kernel then dequeues as many packets as possible from the qdisc and hands them to the network adapter's driver module. Linux has little control over the receive side, so traffic control is generally applied only to the transmit queue, not to both directions. A qdisc encapsulates classes and filters.
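As an illustration of the default lossy behavior, here is a minimal user-space model of a FIFO qdisc's enqueue decision (a sketch written for this article, not kernel code): a plain pfifo simply tail-drops once its configured limit is reached.

```c
/* User-space model of a pfifo qdisc's enqueue step (illustrative only). */
#include <stdbool.h>
#include <stddef.h>

#define QDISC_LIMIT 1000       /* like "pfifo limit 1000" in tc terms */

struct fifo_qdisc {
    void  *pkts[QDISC_LIMIT];  /* pointers to queued packets (sk_buffs) */
    size_t head, len;          /* circular-buffer bookkeeping */
};

/* Returns true if enqueued, false if the packet would be tail-dropped. */
static bool fifo_enqueue(struct fifo_qdisc *q, void *pkt)
{
    if (q->len == QDISC_LIMIT)
        return false;          /* queue full: the classic lossy case */
    q->pkts[(q->head + q->len++) % QDISC_LIMIT] = pkt;
    return true;
}
```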

After packets are processed in the virtual machine kernel, they reach the hypervisor through the vNIC, which forwards them to the vswitch (a bridge providing communication between the virtual machine and the physical NIC adapter).

In between, this bridge must perform the OVN tunneling function: it encapsulates packets arriving from the virtual machine and forwards them to the physical adapter's queue, from which they are sent to the destination server. There, the physical network adapter forwards them to the virtual bridge, which assumes the role of the OVN terminator: it decapsulates each packet and hands it to the hypervisor, which forwards it to the guest operating system. After processing by the guest kernel, the packet is finally delivered to the target application.

Based on this detailed end-to-end path analysis, we can locate the possible packet loss points: in the vswitch, and in the guest kernel on the receive path.

Receiving mechanism: NAPI

NAPI is a technique used in Linux to improve network processing efficiency. Its core idea is to avoid taking an interrupt for every packet: an interrupt service routine first acknowledges that data has arrived, and the kernel then polls for the remaining packets.
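A hedged skeleton of that pattern follows: the hard interrupt only masks further RX interrupts and schedules a poll, and the poll function drains packets in softirq context. The napi_* calls are the real kernel API (names as in recent kernels); everything prefixed mydrv_ is a hypothetical driver helper.

```c
#include <linux/netdevice.h>
#include <linux/interrupt.h>

struct mydrv_priv {
    struct napi_struct napi;
    struct net_device *dev;
    /* ... ring buffers, device registers ... */
};

/* Hypothetical device helpers, assumed to exist in the driver. */
void mydrv_disable_rx_irq(struct mydrv_priv *priv);
void mydrv_enable_rx_irq(struct mydrv_priv *priv);
struct sk_buff *mydrv_next_rx_skb(struct mydrv_priv *priv);

static irqreturn_t mydrv_irq(int irq, void *data)
{
    struct mydrv_priv *priv = data;

    mydrv_disable_rx_irq(priv);      /* stop the per-packet interrupt storm */
    napi_schedule(&priv->napi);      /* defer the real work to the softirq */
    return IRQ_HANDLED;
}

static int mydrv_poll(struct napi_struct *napi, int budget)
{
    struct mydrv_priv *priv = container_of(napi, struct mydrv_priv, napi);
    int done = 0;

    while (done < budget) {          /* poll packets, interrupts masked */
        struct sk_buff *skb = mydrv_next_rx_skb(priv);
        if (!skb)
            break;
        napi_gro_receive(napi, skb); /* hand the packet to the stack */
        done++;
    }
    if (done < budget) {             /* ring drained: back to interrupt mode */
        napi_complete_done(napi, done);
        mydrv_enable_rx_irq(priv);
    }
    return done;
}
```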

Sending Path Analysis and Processing

On the sender side, the user program generates a packet and issues a send system call, which copies the packet from user space to kernel space. The packet is stored in a data structure called an sk_buff and placed in the TX buffer of the socket opened by the program. The program can tell from the system call's return value whether the send buffer has overflowed, so this step is lossless.
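From user space, the losslessness of this first hop is visible in the send() return value. A minimal sketch, assuming a non-blocking socket fd that is already connected (error handling trimmed):

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Send all `len` bytes on a non-blocking socket, backing off when the
 * socket TX buffer is full instead of losing data. */
static ssize_t send_all(int fd, const char *buf, size_t len)
{
    size_t off = 0;

    while (off < len) {
        ssize_t n = send(fd, buf + off, len - off, 0);
        if (n >= 0) {
            off += (size_t)n;
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            /* TX buffer full: the return value is the flow-control
             * signal; wait briefly rather than dropping. */
            usleep(100);
        } else {
            return -1;               /* a real error */
        }
    }
    return (ssize_t)off;
}
```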

Next, the packet moves from the socket send buffer to the qdisc queue associated with the virtual interface. The qdisc stores a list of pointers to the packets held in the sockets, ordered by a chosen algorithm, typically FIFO. To prevent packet loss at this step, we increase the qdisc queue length to match the combined length of all the sockets' send queues, at the cost of extra memory. The qdisc then tries to push packets into the NIC adapter's transmit queue. If that queue reaches a threshold, the qdisc stops dequeuing and transmission pauses, preventing loss along the kernel's transmit path; when the TX queue falls below the threshold, the qdisc resumes. Thus, as long as the qdisc length is properly configured, transmission inside the guest is lossless.
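In a driver, this back-pressure is the standard stop/wake queue pattern. A hedged sketch (the netif_* calls are the real kernel API; the mydrv_* helpers and the thresholds are hypothetical):

```c
#include <linux/netdevice.h>

#define TX_STOP_THRESH 4    /* stop when fewer free descriptors remain */
#define TX_WAKE_THRESH 16   /* wake once this many are free again */

struct mydrv_priv {
    struct net_device *dev;
    /* ... TX ring state ... */
};

/* Hypothetical ring helpers, assumed to exist in the driver. */
int  mydrv_tx_ring_free(struct mydrv_priv *priv);
void mydrv_post_to_tx_ring(struct mydrv_priv *priv, struct sk_buff *skb);
void mydrv_reclaim_tx_descriptors(struct mydrv_priv *priv);

static netdev_tx_t mydrv_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
    struct mydrv_priv *priv = netdev_priv(dev);

    mydrv_post_to_tx_ring(priv, skb);           /* hand packet to hardware */
    if (mydrv_tx_ring_free(priv) < TX_STOP_THRESH)
        netif_stop_queue(dev);                  /* qdisc pauses dequeuing */
    return NETDEV_TX_OK;
}

/* Called from the TX-completion path. */
static void mydrv_tx_done(struct mydrv_priv *priv)
{
    mydrv_reclaim_tx_descriptors(priv);
    if (netif_queue_stopped(priv->dev) &&
        mydrv_tx_ring_free(priv) >= TX_WAKE_THRESH)
        netif_wake_queue(priv->dev);            /* qdisc resumes dequeuing */
}
```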

Our architecture is based on virtio, so the virtual adapter queues are shared between the guest operating system and the underlying hypervisor (which eliminates the need for an extra copy). The virtio adapter notifies the hypervisor of new packets queued in the adapter's send queue, and the QEMU-based hypervisor software then forwards each packet from the virtual adapter's send queue to the zOVN send queue.
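A hedged sketch of that hand-off from the guest side, using the kernel's virtqueue API (virtqueue_add_outbuf and virtqueue_kick are real calls; the wrapper itself is illustrative):

```c
#include <linux/virtio.h>
#include <linux/scatterlist.h>
#include <linux/errno.h>

/* Expose one packet buffer to the host through the shared virtio ring and
 * notify QEMU; no extra data copy is needed because the ring is shared. */
static int guest_xmit(struct virtqueue *vq, void *pkt, unsigned int len)
{
    struct scatterlist sg;

    sg_init_one(&sg, pkt, len);
    if (virtqueue_add_outbuf(vq, &sg, 1, pkt, GFP_ATOMIC) < 0)
        return -ENOSPC;     /* ring full: the caller must back off */
    virtqueue_kick(vq);     /* notify the hypervisor of the new packet */
    return 0;
}
```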

QEMU's networking code contains two components: the virtual network device and the network backend. The backend here is netmap, which has been merged into recent QEMU versions, with some necessary bug fixes applied. We use a lossless hand-off between the virtual network device and the backend. (By my analysis, the Xen I/O ring mechanism drops packets on the receive path.) The packet is then forwarded across the bridge, whose operation follows the same general principles as an ordinary switch: if the destination is another virtual port, the packet is forwarded based on its MAC address; otherwise it goes to the physical port (set to promiscuous mode). From there, a packet leaving the bridge (LOCAL mode? not necessarily) is encapsulated (since this is an overlay network) and placed in the physical adapter's send queue. The packet is then carried across the enhanced physical network to the destination.

As mentioned earlier, current vswitches still do not support flow control, and our experiments confirmed this across multiple vswitches. We therefore redesigned the VALE vswitch to add an internal flow control mechanism, making the sending process completely lossless.

Receiving Path Analysis and Processing

Packets received from the physical adapter queue are decapsulated by the OVN termination bridge, placed in the virtual bridge's transmit queue, and then forwarded to the receive queue of a virtual machine interface. This vswitch forwarding step is also lossy. The packet is then copied to the virtio virtual device by the QEMU virtual machine monitor. The virtual device's receive queue is shared by the hypervisor and the virtual machine kernel (Xen works similarly, but by my analysis Xen's I/O ring mechanism drops packets on the receive path). The VM monitor sends a notification (like a Xen event channel) when a packet arrives, and the VM receives an interrupt. This interrupt is handled by the Linux NAPI framework: a softirq is raised, which drains the receive queue. The packet is passed to the netif_receive_skb function (which implements IP routing and filtering). If the packet is destined for the local protocol stack, it is placed in the target socket's receive buffer; if that buffer is full, the packet is discarded. For TCP, such loss is recovered by retransmission, but UDP is different. We modified the Linux kernel so that when a target socket's receive queue reaches a threshold, the softirq is stopped and reception pauses; once the program consumes data from the socket, reception resumes. This makes both TCP and UDP sockets lossless.
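The actual kernel patch is not reproduced in this article, but the following user-space model (a sketch written here, with made-up thresholds) captures the behavioral change: near the limit, the delivery path pauses instead of dropping, and the application's reads un-pause it.

```c
#include <stdbool.h>
#include <stddef.h>

#define SOCK_RCVBUF_PKTS 256
#define STOP_THRESH      (SOCK_RCVBUF_PKTS - 16)

struct sock_model {
    void  *rx_queue[SOCK_RCVBUF_PKTS];
    size_t head, len;
    bool   rx_paused;              /* stand-in for "softirq stopped" */
};

/* Stock behavior: tail-drop when the socket RX buffer is full (lossy). */
static bool deliver_lossy(struct sock_model *s, void *pkt)
{
    if (s->len == SOCK_RCVBUF_PKTS)
        return false;                  /* packet discarded */
    s->rx_queue[(s->head + s->len++) % SOCK_RCVBUF_PKTS] = pkt;
    return true;
}

/* Modified behavior: near the limit, stop pulling from the device queue
 * instead of dropping; the packet stays upstream until there is room. */
static bool deliver_lossless(struct sock_model *s, void *pkt)
{
    if (s->len >= STOP_THRESH) {
        s->rx_paused = true;           /* pause reception, do not drop */
        return false;                  /* deferred, not lost */
    }
    return deliver_lossy(s, pkt);
}

/* Application read: consuming data eventually resumes reception. */
static void *app_recv(struct sock_model *s)
{
    if (s->len == 0)
        return NULL;
    void *pkt = s->rx_queue[s->head];
    s->head = (s->head + 1) % SOCK_RCVBUF_PKTS;
    s->len--;
    if (s->rx_paused && s->len < STOP_THRESH / 2)
        s->rx_paused = false;          /* wake the receive path */
    return pkt;
}
```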

zVALE: lossless vswitch

As mentioned earlier, our lossless vswitch derives from the VALE switch we studied, which is built on the netmap architecture. It provides a port for each virtual machine, plus a physical port, and each port has a send queue and a receive queue. The stock forwarding process is lossy, because packets are always pushed to the receive queue as fast as possible, regardless of whether it is full; if it is full, the packet is dropped.

We designed an algorithm to make the vswitch lossless. Each sender (producer) is connected to an input queue Ij, and each receiver (consumer) is connected to an output queue Ok. When a packet is produced, the sender checks whether the input queue is full; if it is, the sender sleeps for a while, waiting for free buffer space before placing the packet into the input queue for forwarding toward the egress queue. The forwarder checks whether the egress queue has enough space; if it does, the forwarder moves the packet to the egress queue and wakes up the corresponding consumer (which may be waiting for a new packet). On the receiver side, the output queue is likewise checked: if it is not empty, a packet is consumed from it; if it is empty, the forwarder moves packets from the input queue into this output queue, and if a pull is pending, the packet is consumed, otherwise the receiver sleeps until woken by the sender. The switch is thus a dual push/pull design. When the sender is faster, it spends most of its time sleeping, waiting for free space, and is woken when the receiver consumes data; when the receiver is faster, it spends most of its time sleeping and is woken only when the sender has new data. This minimizes the overhead of lossless operation. The pseudo-code is shown in the figure.
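The paper's pseudo-code figure is not reproduced here, but the following C sketch (written for this article, with our own names; POSIX threads stand in for the switch's sleep/wake primitives) reconstructs the described behavior: bounded rings where a full queue puts the producer to sleep and an empty queue puts the consumer to sleep, so packets wait instead of being dropped.

```c
#include <pthread.h>

#define RING 64   /* slots per queue; must be a power of two */

struct ring {
    void           *pkt[RING];
    unsigned        head, tail;   /* tail - head == occupancy */
    pthread_mutex_t lock;         /* initialize with PTHREAD_MUTEX_INITIALIZER */
    pthread_cond_t  not_full, not_empty;
};

/* Producer side: sleep while the ring is full instead of dropping. */
static void ring_put(struct ring *r, void *p)
{
    pthread_mutex_lock(&r->lock);
    while (r->tail - r->head == RING)
        pthread_cond_wait(&r->not_full, &r->lock);   /* sleep, don't drop */
    r->pkt[r->tail++ % RING] = p;
    pthread_cond_signal(&r->not_empty);              /* wake a consumer */
    pthread_mutex_unlock(&r->lock);
}

/* Consumer side: sleep while the ring is empty, until woken by a sender. */
static void *ring_get(struct ring *r)
{
    pthread_mutex_lock(&r->lock);
    while (r->tail == r->head)
        pthread_cond_wait(&r->not_empty, &r->lock);
    void *p = r->pkt[r->head++ % RING];
    pthread_cond_signal(&r->not_full);               /* wake a producer */
    pthread_mutex_unlock(&r->lock);
    return p;
}

/* Forwarder: move one packet from a port's input ring to the destination
 * port's output ring; blocking on either ring propagates back-pressure
 * end to end, which is exactly the lossless property described above. */
static void forward_one(struct ring *in, struct ring *out)
{
    ring_put(out, ring_get(in));
}
```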

P.S.: The test and evaluation sections are omitted here.

