OpenVPN performance: the first bottleneck of OpenVPN lies in the tun driver


The first bottleneck of OpenVPN is that the tun character device reads and writes one link-layer frame at a time. The reason the user-mode OpenVPN process must use the same link-MTU on both ends is that each read from the /dev/net/tun character device returns exactly one complete frame, no more and no less; in the library call ssize_t read(int fd, void *buf, size_t count), the count is the MTU value set when OpenVPN starts. If the MTU values at the two ends of OpenVPN differ, say 5000 on the sender and 2000 on the receiver (with the tun devices' ifconfig MTUs set accordingly), then the sending side hands 5000-byte frames to its tun NIC and its OpenVPN reads 5000 bytes from the character device, since that is the length of one link-layer frame; the receiving side, however, only accepts 2000 bytes, and the remaining 3000 bytes are truncated, causing an error. This situation is hard to avoid when OpenVPN is used for forwarding, because the data arriving at the sender's tun NIC comes from a real NIC whose traffic is being forwarded, and the typical MTU of a real NIC is 1500.
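For context, here is roughly what that read looks like from user space. This is a minimal sketch of opening a tun device in tun mode and reading one frame per system call; the device name and MTU value are placeholders and error handling is abbreviated:

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/if.h>
#include <linux/if_tun.h>

#define MTU 1500        /* must match the link-MTU configured on both ends */

int open_tun(const char *name)
{
        struct ifreq ifr;
        int fd = open("/dev/net/tun", O_RDWR);

        if (fd < 0)
                return -1;
        memset(&ifr, 0, sizeof(ifr));
        ifr.ifr_flags = IFF_TUN | IFF_NO_PI;    /* raw IP packets, no extra header */
        strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
        if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
                close(fd);
                return -1;
        }
        return fd;
}

/* With the stock driver, each read() returns exactly one frame of at most MTU bytes: */
/* ssize_t n = read(fd, buf, MTU); */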
We know that with OpenVPN it is best to set the MTU at both ends to the same value. Since we are unlikely to change the MTU of the real networks in front of and behind OpenVPN, the tun device basically sends and receives data in chunks of around 1500 bytes. This faithfully simulates network behavior, but it is fundamentally different from a real network environment. In a real environment, data goes straight onto the wire at essentially line speed; in the OpenVPN environment, the tun character device hands data to the user-space OpenVPN process one frame at a time, OpenVPN encrypts it and sends it to the peer through the physical NIC via a socket, again one frame at a time. The switching between user mode and kernel mode is a bottleneck, and the processing rate is still one frame per round trip. If we could buffer data in the tun character device, so that N frames are handed to OpenVPN at a time and N frames are sent to the peer at a time, the extra overhead of user/kernel switching would be amortized: while the OpenVPN socket sends one batch to the peer, the sender keeps filling the tun character device's queue at roughly the same rate, and OpenVPN sends and receives N frames per operation.
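To make the argument concrete with a rough cost model (the split into S and C is an assumption for illustration, not a measurement): let S be the fixed overhead of one read()/write() system call and C the per-frame cost of copying and encrypting. Moving F frames one at a time costs F * (S + C), while batching N frames per call costs roughly (F / N) * S + F * C, so the per-frame syscall overhead shrinks by a factor of N while the per-frame processing cost stays the same.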
An analogy: a train needs to transfer cargo at a station. A normal transfer simply moves the goods from one train to another and sends them on, which is fast. OpenVPN, however, is like taking the cargo out of the station, having it inspected and repackaged by the relevant departments, and then bringing it back into the station and loading it onto another train. If this is done one piece at a time (say there is no room on the platform to accumulate goods), it is very slow. The obvious solution is to let the goods pile up at the station until there is a full load, carry them out for processing in one trip, and bring them back in one trip; while the first batch is being handled, the second batch quietly accumulates, and when the first batch has reached the peer, the second batch is carried out and processed, and so on. The key point is that the three steps (carrying the cargo out of the station, processing it, and carrying it back in) are expensive, and the total time is the time on the road plus the processing time. If a large amount of cargo is handled in each trip, the ratio of that overhead to the total cargo drops, and efficiency and throughput rise; this is the same reason we do not use a ten-ton freighter to carry a single suitcase.
First, let us prove the point with a simple example rather than with OpenVPN itself; the goal is to show that this "backlog" approach improves throughput and speed. There is in fact no need for the read/write interface of the tun character device to do I/O one link-layer frame at a time. Once the skb leaves the device's start_xmit function it has entered the "physical layer", and for tun the physical layer is the character device plus the user-space program that reads and writes it, so how this "physical link" transmits the data is entirely up to us. The tun character device can therefore transmit in batches, and after a batch is written to the peer end, the link-layer frames can be written into the tun virtual NIC there one by one. Reads and writes on the tun character device and on the socket are all system calls; the overhead of a system call is large and essentially fixed regardless of its arguments, so, much as with mmap, reducing the number of system calls is what increases throughput. The tun driver therefore needs to be modified so that instead of returning one link-layer frame per read it accumulates as many frames as possible. The TSO (TCP segmentation offload) feature of NICs on the PCI bus uses a similar idea to reduce the number of bus accesses: bus access is expensive, and if TCP splits the data into N segments, each segment has to cross the PCI bus separately, N accesses in total; instead, one large TCP segment is handed to the NIC, which does the segmentation itself, so a single bus access is enough. In theory the layered model is a fine thing, but it exists to help us understand the problem; in practice it is worth breaking the model and letting the NIC handle transport-layer data to improve efficiency. This is another victory for pragmatism, just as there is no longer any pure CISC or RISC processor but a mixture of both, and just as twisted pair won out over coaxial cable.
The following change to the tun driver supports only tun mode, not tap mode; supporting tap mode would only require changing how packets are parsed in tun_chr_aio_write. Two functions are modified, tun_chr_aio_read and tun_chr_aio_write, based on kernel 2.6.32.27:
static ssize_t tun_chr_aio_read(struct kiocb *iocb, struct iovec *iv,
				unsigned long count, loff_t pos)
{
	struct file *file = iocb->ki_filp;
	struct tun_file *tfile = file->private_data;
	struct tun_struct *tun = __tun_get(tfile);
	DECLARE_WAITQUEUE(wait, current);
	struct sk_buff *skb;
	ssize_t len, ret = 0;
	char __user *buf = iv->iov_base;	/* user-space buffer */
	int len1 = iv->iov_len;			/* remaining room in the user buffer */
	int getone = 0;				/* frames copied so far */
	int result = 0;				/* total bytes copied */

	if (!tun)
		return -EBADFD;

	len = iov_length(iv, count);
	if (len < 0) {
		ret = -EINVAL;
		goto out;
	}

	add_wait_queue(&tun->socket.wait, &wait);
	while (len1 > 0) {
		current->state = TASK_INTERRUPTIBLE;

		/* Stop if the remaining room cannot hold one more full frame */
		if (len1 - tun->dev->mtu < 0)
			break;

		/* At least one packet must be returned, so only the first
		 * dequeue failure may put the caller to sleep */
		if (!(skb = skb_dequeue(&tun->socket.sk->sk_receive_queue)) && !getone) {
			if (file->f_flags & O_NONBLOCK) {
				ret = -EAGAIN;
				break;
			}
			if (signal_pending(current)) {
				ret = -ERESTARTSYS;
				break;
			}
			if (tun->dev->reg_state != NETREG_REGISTERED) {
				ret = -EIO;
				break;
			}

			/* Nothing to read, let's sleep */
			schedule();
			continue;
		} else if (skb == NULL && getone) {
			/* At least one frame copied and the queue is now empty */
			break;
		}
		netif_wake_queue(tun->dev);

		iv->iov_base = buf;		/* point the iovec at the current position */
		iv->iov_len = tun->dev->mtu;
		ret = tun_put_user(tun, skb, iv, tun->dev->mtu);	/* copy one frame out */
		kfree_skb(skb);

		result += ret;			/* total bytes copied so far */
		buf += ret;			/* advance the user buffer */
		len1 -= ret;			/* shrink the remaining room */
		getone += 1;			/* one more frame accumulated */
		/* no break here: keep draining the queue */
	}

	current->state = TASK_RUNNING;
	remove_wait_queue(&tun->socket.wait, &wait);

out:
	tun_put(tun);
	return result;
}
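For reference, a minimal sketch of how a user-space reader would take advantage of the change (this fragment is illustrative, not part of the original patch; tun_fd is assumed to be an already-configured tun descriptor): the buffer passed to read() is simply made several MTUs large, and a single call may now return several back-to-back frames.

#include <unistd.h>

#define MTU      1500   /* assumed tun MTU */
#define NFRAMES  4      /* how many frames one read may carry */

/* With the modified driver, one read() can return up to NFRAMES
 * complete IP packets placed back to back in buf. */
static ssize_t read_batch(int tun_fd, char *buf)
{
        return read(tun_fd, buf, (size_t)MTU * NFRAMES);
}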
static ssize_t tun_chr_aio_write(struct kiocb *iocb, const struct iovec *iv,
				 unsigned long count, loff_t pos)
{
	struct file *file = iocb->ki_filp;
	struct tun_struct *tun = tun_get(file);
	ssize_t result = 0;
	char __user *buf = iv->iov_base;
	int len = iv->iov_len;
	struct iovec iv2;
	int ret = 0;

	if (!tun)
		return -EBADFD;

	while (len > 0) {
		/* Tun mode only: the data written to the character device from user
		 * space is one complete IP packet or several IP packets laid back to
		 * back.  Bytes 3 and 4 of each IP header hold the Total Length field,
		 * which tells us where this packet ends and the next one begins. */
		uint8_t hi = (uint8_t)*(buf + 2);	/* third byte of the IP header */
		uint8_t lo = (uint8_t)*(buf + 3);	/* fourth byte of the IP header */
		uint16_t tl = (hi << 8) + lo;		/* assembled 16-bit Total Length */

		iv2.iov_base = buf;
		iv2.iov_len = tl;	/* the skb is allocated according to this length */
		ret = tun_get_user(tun, &iv2, iov_length(&iv2, 1),
				   file->f_flags & O_NONBLOCK);

		result += ret;		/* total bytes written so far */
		buf += ret;		/* advance the write buffer */
		len -= ret;		/* shrink the remaining buffer */
	}

	tun_put(tun);
	return result;
}
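If the user-space program wants to inspect each packet rather than pass the batch through transparently (as OpenVPN would need to), it can walk the buffer with the same Total Length logic that tun_chr_aio_write uses above. A minimal sketch, assuming the buffer holds only complete IPv4 packets laid back to back; the handler callback is a placeholder:

#include <stdint.h>
#include <stddef.h>

/* Walk a buffer of back-to-back IPv4 packets and hand each one to a
 * caller-supplied handler.  Returns the number of packets found.
 * Illustrative only; assumes the buffer contains complete packets. */
static int for_each_packet(const uint8_t *buf, size_t len,
                           void (*handle)(const uint8_t *pkt, uint16_t pktlen))
{
        int n = 0;

        while (len >= 4) {
                /* Bytes 3 and 4 of the IPv4 header are the Total Length field */
                uint16_t tl = ((uint16_t)buf[2] << 8) | buf[3];

                if (tl < 20 || tl > len)        /* malformed or truncated packet */
                        break;
                handle(buf, tl);
                buf += tl;
                len -= tl;
                n++;
        }
        return n;
}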

The user-space program can either pass the data through transparently or parse and modify it; the latter suits management needs better (OpenVPN, for example, parses the original IP packet information). The test here uses the simpletun program, whose I/O loop looks like this:
while (1) {
	int ret;
	fd_set rd_set;

	FD_ZERO(&rd_set);
	FD_SET(tap_fd, &rd_set);
	FD_SET(net_fd, &rd_set);

	ret = select(100 + 1, &rd_set, NULL, NULL, NULL);

	if (ret < 0 && errno == EINTR) {
		continue;
	}
	if (ret < 0) {
		perror("select()");
		exit(1);
	}

	if (FD_ISSET(tap_fd, &rd_set)) {
		/* One read may now return several frames from the modified tun driver */
		nread = cread(tap_fd, buffer, BUFSIZE);
		plength = htons(nread);
		nwrite = cwrite(net_fd, (char *)&plength, sizeof(plength));
		nwrite = cwrite(net_fd, buffer, nread);
	}

	if (FD_ISSET(net_fd, &rd_set)) {
		nread = read_n(net_fd, (char *)&plength, sizeof(plength));
		if (nread == 0) {
			break;
		}
		nread = read_n(net_fd, buffer, ntohs(plength));
		/* The whole batch is written to the tun char device in one call;
		 * the modified tun_chr_aio_write splits it back into packets */
		nwrite = cwrite(tap_fd, buffer, nread);
	}
}
BUFSIZE is set to the same value on both the client and the server, large enough to hold several frames:
#define BUFSIZE (1500 * 4)
Test command: ab -k -c 8 -n 500 http://10.0.188.139/5m.html
Machine deployment:
S0:
eth0: 192.168.188.194, MTU 1500, e1000e, 1000BaseT-FD, flow control
tun0: 172.16.0.2, MTU 1500
route: 10.0.188.139 dev tun0
S1:
eth0: 192.168.188.193, MTU 1500, e1000e, 1000BaseT-FD, flow control
eth1: 10.0.188.193, MTU 1500, e1000e, 1000BaseT-FD, flow control
tun0: 172.16.0.1, MTU 1500
S2:
eth1: 10.0.188.139, MTU 1500, e1000e, 1000BaseT-FD, flow control
route: 172.16.0.0 gw 10.0.188.193
Test data:
With the modified tun driver:
Transfer rate: 111139.88 [Kbytes/sec] received
With the native tun driver:
Transfer rate: 102512.37 [Kbytes/sec] received
Without the tun virtual NIC, bare forwarding through the physical NICs via ip_forward:
Transfer rate: 114089.42 [Kbytes/sec] received
In other words, the modified driver is roughly 8% faster than the stock driver and reaches about 97% of the bare ip_forward rate.
Impact:
1. If the tun NIC carries TCP:
While the total throughput increases, the latency of an individual TCP packet also increases, which affects the sliding of the TCP window and gives the sending and receiving ends the illusion of a longer path, thereby adjusting the RTT.
2. If the tun NIC carries UDP:
UDP favors real-time delivery and cares less about packet loss; the modified tun NIC adds per-packet delay, so real-time behavior is worse than before.
3. Parameter coupling:
Three parameters matter: the size of the user-space buffer, the length of the tun NIC's transmit queue, and the MTU of the physical NIC. How these three should be combined to reach the best point on the saddle surface (the best balance between throughput and latency) still needs testing.
A more dramatic effect:

The above test does not show a revolutionary speed gain from the modified tun driver, so modifying the OpenVPN code for it would cost more than it is worth. The reason is that simpletun forwards directly: data read from the tun character device goes straight to the socket, and data received from the socket goes straight back to the tun character device, so the accumulation effect cannot show itself. If the application performed some time-consuming work on the data, the effect would be much more pronounced; that is the subject of the next article, "OpenVPN performance: the second bottleneck of OpenVPN, SSL encryption and decryption".
