Research and Implementation of Zero Copy Technology

By firstdot
E-MAIL: firstdot@163.com

I. Basic Concepts
The basic idea of zero copy is to reduce the number of data copies and system calls so that the CPU takes no part in moving data from the network device to the user program's address space, which eliminates the CPU load of that transfer. Zero copy rests on two techniques: DMA data transfer and memory region mapping.

Traditional network datagram processing needs two copies, one from the network device to operating system memory and one from system memory to the user application's space, and the application also has to issue system calls to the kernel to fetch the data. Zero copy instead uses DMA to deliver network packets directly into an address range pre-allocated by the kernel, which keeps the CPU out of the transfer, and maps the kernel memory region that holds the packets into the address space of the detection program. (An alternative is to create the buffer in user space and map it into kernel space, similar to the kiobuf mechanism in Linux.) The detection program then accesses this memory directly, which removes the kernel-to-user copy and the system call overhead, and so achieves "zero copy".
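
The difference between the copying path and the mapping path can be illustrated in user space alone. The following is only an analogy, assuming a POSIX system and using an ordinary temporary file in place of a NIC buffer; the helper names (make_demo_file, fetch_by_copy, fetch_by_mapping) are hypothetical and do not come from the article's code. With read() the kernel copies the bytes into the caller's buffer, while mmap() hands back a pointer to the pages that already hold them.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create an unlinked temp file holding `len` bytes. */
int make_demo_file(const void *payload, size_t len)
{
    char path[] = "/tmp/zc_demo_XXXXXX";
    int fd;

    assert(payload != NULL && len > 0);
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                 /* the file vanishes once fd is closed */
    if (write(fd, payload, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Copying path: the kernel copies the bytes into the caller's buffer. */
ssize_t fetch_by_copy(int fd, void *buf, size_t len)
{
    if (lseek(fd, 0, SEEK_SET) < 0)
        return -1;
    return read(fd, buf, len);
}

/* Mapping path: return a pointer to the shared pages, no user-side copy. */
const void *fetch_by_mapping(int fd, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

    return p == MAP_FAILED ? NULL : p;
}
```

Both functions yield the same bytes; only fetch_by_copy needs a second buffer to hold them.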

Figure 1. Comparison between traditional data processing and zero-copy technology
II. Implementation
The experiment was done on Red Hat 7.3 by modifying 8139too.c, the driver shipped in the kernel source. When the 8139too NIC driver module is loaded, it allocates a kernel buffer and sets up a data structure to manage it, writes several strings into the buffer, and passes the buffer's address to the user process through the proc file system. The user process obtains the address by reading the proc file, maps the buffer into its own address space, and reads the data from it. To keep things simple, this article tests only the address mapping part of the zero-copy idea and does not implement DMA data transfer (which requires a detailed understanding of the hardware), so the experiment does not amount to the packet capture module of an IDS product. Besides DMA, a few other issues need to be considered; see Section III. The main steps are described below; the complete code is in the appendix.

Step 1: Modify the NIC Driver
A. Allocate a buffer in the NIC driver: the largest contiguous buffer that the Linux 2.4.x kernel can allocate is 2 MB. If you need to store more network packets, you must allocate several non-contiguous buffers and manage them with a linked list, array, or hash table.

#define PAGES_ORDER 9
unsigned long su1_2;
su1_2 = __get_free_pages(GFP_KERNEL, PAGES_ORDER);
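
The relationship between the allocation order and the 2 MB limit is simple arithmetic: an order-9 request asks for 2^9 = 512 contiguous pages. The sketch below works this out, assuming 4 KB pages (as on 32-bit x86) and the 1500-byte slot width used in the appendix; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_BYTES 4096UL   /* assumption: 4 KB pages */

/* Bytes obtained from an allocation of 2^order contiguous pages. */
size_t order_to_bytes(unsigned int order)
{
    assert(order < 20);          /* keep the shift well inside the type */
    return (size_t)(DEMO_PAGE_BYTES << order);
}

/* How many fixed-width slots fit in a buffer of `bytes` bytes. */
size_t slots_in_buffer(size_t bytes, size_t slot_width)
{
    assert(slot_width > 0);
    return bytes / slot_width;
}
```

order_to_bytes(9) gives 2097152 bytes (2 MB), and slots_in_buffer(2097152, 1500) gives 1398 slots, the first of which the appendix code reserves for the management header.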

B. Write data to the buffer: a zero-copy implementation in a real IDS product should use DMA to write the packets received by the NIC hardware directly into the buffer. As a test, I only write a few arbitrary strings into it. If you want to write real network packets into the buffer without dealing with DMA, you can insert the following code in rtl8139_rx_interrupt() of 8139too.c, at the point where it calls netif_rx():

// put_pkt2mem_n++;                  /* number of packets */
// put_mem(skb->data, pkt_size);
For the put_pkt2mem_n variable and put_mem function, see the appendix.

C. Pass the buffer's physical address to user space: the address returned by the allocation in the kernel is a virtual address, but user space needs the physical address of the buffer, so the virtual address must first be converted. On 32-bit x86 Linux, subtracting 3 GB (PAGE_OFFSET) from a kernel virtual address in the directly mapped region gives the corresponding physical address; this is what the __pa() macro does. Passing the address to user space requires a small data transfer between the kernel and user space, which can be done through a character device driver, the proc file system, or other means. The proc file system is used here.

int read_procaddr(char *buf, char **start, off_t offset, int count, int *eof, void *data)
{
    int len;

    /* print the physical address of the cache as a decimal string */
    len = sprintf(buf, "%lu\n", __pa(su1_2));
    *eof = 1;
    return len;     /* number of bytes actually written */
}
create_proc_read_entry("nf_addr", 0, NULL, read_procaddr, NULL);
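
The "subtract 3 GB" rule can be checked with plain integer arithmetic. This is a user-space sketch only, assuming the default 3 GB/1 GB split of 32-bit x86 (PAGE_OFFSET = 0xC0000000); DEMO_PAGE_OFFSET and the two function names are hypothetical stand-ins for the kernel's PAGE_OFFSET, __pa() and __va().

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_OFFSET 0xC0000000u   /* assumption: 3 GB user / 1 GB kernel */

/* Kernel virtual address (directly mapped region) -> physical address. */
uint32_t demo_virt_to_phys(uint32_t kvirt)
{
    assert(kvirt >= DEMO_PAGE_OFFSET);  /* below 3 GB is user space */
    return kvirt - DEMO_PAGE_OFFSET;
}

/* Physical address -> kernel virtual address in the directly mapped region. */
uint32_t demo_phys_to_virt(uint32_t phys)
{
    assert(phys < 0x40000000u);         /* only 1 GB sits above the offset */
    return phys + DEMO_PAGE_OFFSET;
}
```

For example, the kernel virtual address 0xC1000000 corresponds to the physical address 0x01000000 (16 MB).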

Step 2: Access the shared cache in the user program
A. Read the buffer address: it is obtained by reading the proc file directly.

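char addr_buf[16];      /* text read from /proc/nf_addr */
int fd_procaddr;
unsigned long addr;     /* physical address of the cache */
ssize_t n;

fd_procaddr = open("/proc/nf_addr", O_RDONLY);
n = read(fd_procaddr, addr_buf, sizeof(addr_buf) - 1);
addr_buf[n > 0 ? n : 0] = '\0';     /* terminate before parsing */
addr = strtoul(addr_buf, NULL, 10);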
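
Because the value read from the proc file is later handed to mmap() on /dev/mem, it is worth validating the text before using it. The following is one possible sketch using only the standard C library; parse_phys_addr is a hypothetical helper, not part of the article's code.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse a decimal address such as "16777216\n".
 * Returns 0 and stores the value in *out on success, -1 on malformed input.
 */
int parse_phys_addr(const char *text, unsigned long *out)
{
    char *end;
    unsigned long value;

    assert(text != NULL && out != NULL);
    if (!isdigit((unsigned char)text[0]))
        return -1;                      /* rejects "", "-1", " 12", "abc" */
    errno = 0;
    value = strtoul(text, &end, 10);
    if (errno != 0)
        return -1;                      /* out of range for unsigned long */
    if (*end != '\0' && *end != '\n')
        return -1;                      /* trailing garbage after the digits */
    *out = value;
    return 0;
}
```

A caller would pass the NUL-terminated text read from /proc/nf_addr and refuse to map anything if the function returns -1.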

B. Map the buffer into the user process's address space: open the /dev/mem device (which represents physical memory) in the user process and use mmap() to map the buffer allocated by the NIC driver into the process's own address space. The required network packets can then be read from it.

char *su1_2;
int fd;
fd = open("/dev/mem", O_RDWR);
su1_2 = mmap(0, PAGES * 4 * 1024, PROT_READ | PROT_WRITE, MAP_SHARED, fd, addr);
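
mmap() requires its offset argument to be a multiple of the page size. Memory returned by __get_free_pages() is page-aligned, so the address used here already satisfies that. For the general case of an arbitrary physical address, a sketch like the following (with the hypothetical names map_window and page_align_window) computes the aligned base to pass to mmap() and the displacement to add to the returned pointer.

```c
#include <assert.h>
#include <stddef.h>

struct map_window {
    unsigned long base;   /* page-aligned offset to pass to mmap() */
    size_t delta;         /* distance from base to the wanted byte */
    size_t length;        /* bytes to map so the whole region is covered */
};

/* Split a physical address into an aligned base and an in-page delta. */
struct map_window page_align_window(unsigned long phys, size_t bytes,
                                    unsigned long page_size)
{
    struct map_window w;

    /* page_size must be a non-zero power of two */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    w.base = phys & ~(page_size - 1);
    w.delta = (size_t)(phys - w.base);
    w.length = w.delta + bytes;
    return w;
}
```

With an already aligned address such as 0x01000000 the delta is zero and the window length equals the requested size.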

III. Analysis
Synchronization is the most critical issue in zero copy. On one side, the NIC driver in kernel space writes network packets into the buffer; on the other, the user process analyzes the packets in place (it does not copy them before analysis). Because the two run in different address spaces, synchronization becomes more complicated. The buffer is divided into many small blocks, each of which holds one network packet and is described by a data structure. In this experiment, a flag in the packet structure indicates whether the block may be read or written: when the NIC driver fills packet data into the structure it marks the block readable, and when the user process finishes analyzing the data it marks the block writable. (In the appendix code the len field serves as this flag, and a zero length means the block is empty.) This basically solves the synchronization problem. However, because the IDS process performs intrusion analysis directly on the buffered data instead of copying it to user space first, reading is slower than writing, so the NIC driver may find no free block to write to and packets are lost. The remedy is to allocate a sufficiently large buffer, but this is a trade-off: a buffer that is too small loses packets, while one that is too large is hard to manage and hurts system performance.
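
The flag scheme and the packet loss it can cause are easy to see in a small single-threaded simulation. This is a sketch under stated assumptions, not the driver code: the ring is deliberately tiny (RING_SLOTS and SLOT_BYTES are hypothetical values), all names are invented for illustration, and the memory barriers that real kernel/user sharing would need are omitted.

```c
#include <assert.h>
#include <string.h>

#define RING_SLOTS 4     /* hypothetical: tiny, so overflow is easy to trigger */
#define SLOT_BYTES 64    /* hypothetical payload capacity per slot */

struct slot {
    unsigned int len;                /* 0 = writable, non-zero = readable */
    unsigned char data[SLOT_BYTES];
};

struct ring {
    unsigned int wi, ri;             /* write and read indices */
    unsigned long dropped;           /* packets lost because the ring was full */
    struct slot slots[RING_SLOTS];
};

void ring_init(struct ring *r)
{
    memset(r, 0, sizeof *r);
}

/* Writer side: returns 1 if stored, 0 if the packet had to be dropped. */
int ring_put(struct ring *r, const void *pkt, unsigned int len)
{
    struct slot *s = &r->slots[r->wi];

    assert(len > 0 && len <= SLOT_BYTES);
    if (s->len != 0) {               /* reader has not released this slot yet */
        r->dropped++;
        return 0;
    }
    memcpy(s->data, pkt, len);
    s->len = len;                    /* set the flag last: slot is now readable */
    r->wi = (r->wi + 1) % RING_SLOTS;
    return 1;
}

/* Reader side: return the next packet in place (no copy), or NULL if none. */
const unsigned char *ring_peek(struct ring *r, unsigned int *len)
{
    struct slot *s = &r->slots[r->ri];

    if (s->len == 0)
        return NULL;
    *len = s->len;
    return s->data;
}

/* Reader side: mark the current slot writable again after analysis. */
void ring_release(struct ring *r)
{
    struct slot *s = &r->slots[r->ri];

    assert(s->len != 0);
    s->len = 0;
    r->ri = (r->ri + 1) % RING_SLOTS;
}
```

If the writer calls ring_put more than RING_SLOTS times before the reader releases anything, the extra packets are counted in dropped, which is the loss described above; a larger ring only postpones that point.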

IV. Appendix
A. Code added to 8139too.c

/* add_by_liangjian for zero_copy */
#include <linux/wrapper.h>
#include <asm/page.h>
#include <linux/slab.h>
#include <linux/proc_fs.h>
#define PAGES_ORDER 9       /* allocation order: 2^9 pages */
#define PAGES 512           /* number of pages in the cache */
#define MEM_WIDTH 1500      /* width of one slot in bytes */
/* added */

/* add_by_liangjian for zero_copy */
struct mem_data
{
    // int key;
    unsigned short width;   /* buffer width */
    unsigned short length;  /* buffer length */
    // unsigned short wtimes;   /* write process count, reserved, can be written by multiple processes in the future */
    // unsigned short rtimes;   /* read process count, reserved, can be read by multiple processes in the future */
    unsigned short wi;      /* write pointer */
    unsigned short ri;      /* read pointer */
} *mem_data;
struct mem_packet
{
    unsigned int len;
    unsigned char packetp[MEM_WIDTH - 4];   /* sizeof(unsigned int) == 4 */
};
unsigned long su1_2;        /* cache address */
/* added */

/* add_by_liangjian for zero_copy */
/* Delete the cache */
void del_mem(void)
{
    int pages = 0;
    char *addr;

    addr = (char *)su1_2;
    while (pages <= PAGES - 1)
    {
        /* undo the reservation made in init_mem() */
        mem_map_unreserve(virt_to_page(addr));
        addr = addr + PAGE_SIZE;
        pages++;
    }
    free_pages(su1_2, PAGES_ORDER);
}
void init_mem(void)
/********************************************************
 * Initialize the cache                                 *
 * On success su1_2 holds the buffer address;           *
 * on failure su1_2 is 0                                *
 ********************************************************/
{
    int i;
    int pages = 0;
    char *addr;
    char *buf;
    struct mem_packet *curr_pack;

    su1_2 = __get_free_pages(GFP_KERNEL, PAGES_ORDER);
    if (!su1_2)
        return;     /* allocation failed */
    printk("[%lx]\n", su1_2);
    addr = (char *)su1_2;
    while (pages <= PAGES - 1)
    {
        mem_map_reserve(virt_to_page(addr));    /* keep the cache pages resident in memory */
        addr = addr + PAGE_SIZE;
        pages++;
    }
    /* slot 0 holds the management header; packets use slots 1..length-1 */
    mem_data = (struct mem_data *)su1_2;
    mem_data[0].ri = 1;
    mem_data[0].wi = 1;
    mem_data[0].length = PAGES * 4 * 1024 / MEM_WIDTH;
    mem_data[0].width = MEM_WIDTH;
    /* initialize su1_2: mark every packet slot as empty */
    for (i = 1; i < mem_data[0].length; i++)
    {
        buf = (void *)((char *)su1_2 + MEM_WIDTH * i);
        curr_pack = (struct mem_packet *)buf;
        curr_pack->len = 0;
    }
}
int put_mem(char *abuf, unsigned int pack_size)
/*****************************************************************
 * Write one packet into the buffer                              *
 * Input parameters: abuf: address of the data to write          *
 *                   pack_size: number of bytes to write         *
 * Return value: 0: error (no free slot or packet too large)     *
 *               otherwise: number of the slot that was written  *
 *****************************************************************/
{
    register int s, i, width, length, mem_i;
    char *buf;
    struct mem_packet *curr_pack;

    s = 0;
    if (pack_size == 0 || pack_size > MEM_WIDTH - 4)
        return 0;   /* does not fit in one slot */
    mem_data = (struct mem_data *)su1_2;
    width = mem_data[0].width;
    length = mem_data[0].length;
    mem_i = mem_data[0].wi;
    buf = (void *)((char *)su1_2 + width * mem_i);

    for (i = 1; i < length; i++) {
        curr_pack = (struct mem_packet *)buf;
        if (curr_pack->len == 0) {      /* empty slot found */
            memcpy(curr_pack->packetp, abuf, pack_size);
            curr_pack->len = pack_size; /* mark the slot readable */
            s = mem_i;
            mem_i++;
            if (mem_i >= length)
                mem_i = 1;
            mem_data[0].wi = mem_i;
            break;
        }
        mem_i++;
        if (mem_i >= length) {
            mem_i = 1;
            buf = (void *)((char *)su1_2 + width);
        }
        else buf = (char *)su1_2 + width * mem_i;
    }

    if (i >= length)
        s = 0;
    return s;
}
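// proc file read function
int read_procaddr(char *buf, char **start, off_t offset, int count, int *eof, void *data)
{
    int len;

    /* print the physical address of the cache as a decimal string */
    len = sprintf(buf, "%lu\n", __pa(su1_2));
    *eof = 1;
    return len;     /* number of bytes actually written */
}
/* added */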

Add the following code to the rtl8139_init_module() function of 8139too.c:
/* add_by_liangjian for zero_copy */
put_pkt2mem_n = 0;
init_mem();
put_mem("data1dfadfaserty", 16);
put_mem("data2zcvbnm", 11);
put_mem("data39876543210poiuyt", 21);
create_proc_read_entry("nf_addr", 0, NULL, read_procaddr, NULL);
/* added */

Add the following code to the rtl8139_cleanup_module() function of 8139too.c:
/* add_by_liangjian for zero_copy */
del_mem();
remove_proc_entry("nf_addr", NULL);
/* added */

B. User-space code that reads the cache

#include <stdio.h>
#include <stdlib.h>     /* strtoul */
#include <string.h>     /* memcpy */
#include <strings.h>    /* bzero */
#include <unistd.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#define PAGES 512
#define MEM_WIDTH 1500
struct mem_data
{
    // int key;
    unsigned short width;   /* buffer width */
    unsigned short length;  /* buffer length */
    // unsigned short wtimes;   /* write process count, reserved, can be written by multiple processes in the future */
    // unsigned short rtimes;   /* read process count, reserved, can be read by multiple processes in the future */
    unsigned short wi;      /* write pointer */
    unsigned short ri;      /* read pointer */
} *mem_data;

struct mem_packet
{
    unsigned int len;
    unsigned char packetp[MEM_WIDTH - 4];   /* sizeof(unsigned int) == 4 */
};

int get_mem(char *amem, char *abuf, unsigned int *size)
/*****************************************************************
 * Read one packet from the buffer                               *
 * Input parameters: amem: buffer address                        *
 *                   abuf: address for the returned data; it     *
 *                         must be at least one buffer width     *
 * Return value: 0: error (no data)                              *
 *               otherwise: number of the slot that was read     *
 *****************************************************************/
{
    register int i, s, width, length, mem_i;
    char *buf;
    struct mem_packet *curr_pack;

    s = 0;
    mem_data = (void *)amem;
    width = mem_data[0].width;
    length = mem_data[0].length;
    mem_i = mem_data[0].ri;
    buf = (void *)(amem + width * mem_i);

    curr_pack = (struct mem_packet *)buf;
    if (curr_pack->len != 0) {  /* a zero length means this slot is empty */
        memcpy(abuf, curr_pack->packetp, curr_pack->len);
        *size = curr_pack->len;
        curr_pack->len = 0;     /* mark the slot writable again */
        s = mem_data[0].ri;
        mem_data[0].ri++;
        if (mem_data[0].ri >= length)
            mem_data[0].ri = 1;
        goto ret;
    }

    for (i = 1; i < length; i++) {
        mem_i++;    /* keep searching forward; the worst case scans the whole buffer */
        if (mem_i >= length)
            mem_i = 1;
        buf = (void *)(amem + width * mem_i);
        curr_pack = (struct mem_packet *)buf;
        if (curr_pack->len == 0)
            continue;
        memcpy(abuf, curr_pack->packetp, curr_pack->len);
        *size = curr_pack->len;
        curr_pack->len = 0;
        s = mem_data[0].ri = mem_i;
        mem_data[0].ri++;
        if (mem_data[0].ri >= length)
            mem_data[0].ri = 1;
        break;
    }

ret:
    return s;
}

int main(void)
{
    char *su1_2;
    char receive[1500];
    int i, j;
    int fd;
    int fd_procaddr;
    unsigned int size;
    char addr_buf[16];      /* text read from /proc/nf_addr */
    unsigned long addr;     /* physical address of the cache */
    ssize_t n;

    j = 0;
    /* open device 'mem' as a medium to access the RAM */
    fd = open("/dev/mem", O_RDWR);
    fd_procaddr = open("/proc/nf_addr", O_RDONLY);
    n = read(fd_procaddr, addr_buf, sizeof(addr_buf) - 1);
    addr_buf[n > 0 ? n : 0] = '\0';     /* terminate before parsing */
    addr = strtoul(addr_buf, NULL, 10);
    close(fd_procaddr);
    printf("%lu [%8lx]\n", addr, addr);
    /* map the kernel address into user space with mmap() */
    su1_2 = mmap(0, PAGES * 4 * 1024, PROT_READ | PROT_WRITE, MAP_SHARED, fd, addr);
    if (su1_2 == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    while (1)
    {
        bzero(receive, 1500);
        i = get_mem(su1_2, receive, &size);
        if (i != 0)
        {
            j++;
            printf("%d: %s [size = %u]\n", j, receive, size);
        }
        else
        {
            printf("There is no data\n");
            munmap(su1_2, PAGES * 4 * 1024);
            close(fd);
            break;
        }
    }
    return 0;
}


About the author: Liang Jian is a master's student at the North China Institute of Computing Technology whose research area is information security. His thesis is titled "Host-based anomaly intrusion detection and defense based on system call analysis". He has more than two years of research experience with IDS, is familiar with the Linux kernel, Linux C/C++ programming, and Win32 API programming, and is interested in network and operating system security.

 
