Introduction to Linux network code v0.2



Author: yawl <yawl@nsfocus.com>
Home: http://www.nsfocus.com/
1 Preface

Many people are interested in analyzing the network part of the Linux code (mainly src/linux/net, src/linux/include/net, and files in the src/linux/include/linux directory). Indeed, although I have learned a lot of TCP/IP principles from books, I never get a concrete picture in my mind unless I read the source code. One problem with analyzing this part of the code is that there is a lot of code and very little documentation. The purpose of this article is to outline a framework that gives readers a general idea of how TCP/IP works in the kernel. Most of the code analyses I have seen are based on the 2.0 kernel. In newer kernels many functions have been renamed, which is especially hard on beginners. This article uses the 2.4.0-test9 code as its example, which may make the code easier to follow.

In fact, the firewall path is the only part of the network code that I have analyzed carefully; in many other places my understanding is only partial. If you find any mistakes, please correct me.

I recommend using Source Insight (www.soucedyn.com) to create a project and browsing the code while you read, which probably works better. I have tried some other tools, but when analyzing a large amount of code none of them is more convenient.

2 Text

Everyone is familiar with the ISO seven-layer model, although a four-layer model is of course a better fit for the Internet. In both models, network protocols appear as layers. In the Linux kernel code, however, it is hard to separate the layers cleanly, because apart from a few "kernel threads" the whole kernel is effectively a single process. The so-called "network layer" is therefore just a set of related functions, and most layers interact through ordinary function calls.

Logically, the network code can reasonably be divided into the following layers:
. BSD socket layer: handles BSD socket operations. Each socket is represented in the kernel by a struct socket.
The main files are net/socket.c, net/protocols.c, etc.

. INET socket layer: the BSD socket is an interface that can serve many network protocols. When it is used for TCP/IP, that is, when a socket of the AF_INET family is created, some additional parameters have to be kept, hence the struct sock.
The main files are net/ipv4/protocol.c, net/ipv4/af_inet.c, net/core/sock.c, etc.

. TCP/UDP layer: handles transport-layer operations. The transport layer is represented by struct inet_protocol and struct proto.
The main files are net/ipv4/udp.c, net/ipv4/datagram.c, net/ipv4/tcp.c, net/ipv4/tcp_input.c,
net/ipv4/tcp_output.c, net/ipv4/tcp_minisocks.c,
net/ipv4/tcp_timer.c, etc.

. IP layer: handles network-layer operations. The network layer is represented by struct packet_type.
The main files are net/ipv4/ip_forward.c, ip_fragment.c, ip_input.c, ip_output.c, etc.

. Data link layer and drivers: each network device is represented by a struct net_device. Common processing is in dev.c,
and all drivers are in the drivers/net directory.

The network code contains many other files, for firewalling, routing and so on. You can usually guess what a file handles from its name.

Now I want to present a table; the rest of the article exists to explain it (if you find my prose boring, you can skip it and read the code on your own with this table as a guide). When I first read the network code, I liked section 8 of Linux Kernel Internals, which takes the example of a process A sending a packet over the network to a remote process B and describes in detail how the packet travels through the network stack. I think this helps readers see the whole forest more quickly, so this article is organized the same way.

^
|   sys_read                  fs/read_write.c
|   sock_read                 net/socket.c
|   sock_recvmsg              net/socket.c
|   inet_recvmsg              net/ipv4/af_inet.c
|   udp_recvmsg               net/ipv4/udp.c
|   skb_recv_datagram         net/core/datagram.c
|   -------------------------------------------
|   sock_queue_rcv_skb        include/net/sock.h
|   udp_queue_rcv_skb         net/ipv4/udp.c
|   udp_rcv                   net/ipv4/udp.c
|   ip_local_deliver_finish   net/ipv4/ip_input.c
|   ip_local_deliver          net/ipv4/ip_input.c
|   ip_rcv                    net/ipv4/ip_input.c
|   net_rx_action             net/core/dev.c
|   -------------------------------------------
|   netif_rx                  net/core/dev.c
|   el3_rx                    drivers/net/3c509.c
|   el3_interrupt             drivers/net/3c509.c

======================================

|   sys_write                 fs/read_write.c
|   sock_write                net/socket.c
|   sock_sendmsg              net/socket.c
|   inet_sendmsg              net/ipv4/af_inet.c
|   udp_sendmsg               net/ipv4/udp.c
|   ip_build_xmit             net/ipv4/ip_output.c
|   output_maybe_reroute      net/ipv4/ip_output.c
|   ip_output                 net/ipv4/ip_output.c
|   ip_finish_output          net/ipv4/ip_output.c
|   dev_queue_xmit            net/core/dev.c
|   --------------------------------------------
|   el3_start_xmit            drivers/net/3c509.c
V

The environment we assume is as follows: two hosts are connected over the Internet. One machine runs process A and the other runs process B. Process A sends a message, for example "Hello", to process B, and B receives it.
TCP processing is very complex in itself, so to keep the description simple we use UDP as the example from here on.

2.1 Creating a socket

Before any data can be sent, a socket has to be created. Both programs call the following statements:

...
int sockfd;
sockfd = socket(AF_INET, SOCK_DGRAM, 0);
...

This is a system call, so it enters the kernel through interrupt 0x80 and invokes the corresponding kernel function. To find the kernel routine behind a system call, we usually just prepend "sys_" to its name; for fork, for example, it is sys_fork. The socket-related calls are a special case, however: they all enter the kernel through a single entry point, sys_socketcall, which then dispatches on its arguments to the specific function, such as sys_socket or sys_bind.

sys_socket calls sock_create to produce a struct socket (see include/linux/net.h); every socket has such a structure in the kernel. After some common members of the structure have been initialized (an inode is allocated, the type field is set from the second argument, and so on), control is dispatched according to one of the arguments, in this statement:
...
net_families[family]->create(sock, protocol);
...

The first argument in our program is AF_INET, so this function pointer points to inet_create(). (net_families is an array that holds the information for each protocol family; the families are registered with sock_register.)
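The dispatch on the address family can be modeled in user space with a table of function pointers. The following is only a minimal sketch, not kernel code: every name ending in _sketch, the table size and the simplified signatures are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NPROTO_SKETCH  32   /* hypothetical table size */
#define AF_INET_SKETCH 2    /* numeric value of AF_INET on Linux */

struct socket_sketch { int type; int family; };

/* One entry per protocol family, holding its create hook. */
struct net_proto_family_sketch {
    int family;
    int (*create)(struct socket_sketch *sock, int protocol);
};

struct net_proto_family_sketch *net_families_sketch[NPROTO_SKETCH];

/* Plays the role of sock_register(): publish a family in the table. */
int sock_register_sketch(struct net_proto_family_sketch *fam)
{
    if (fam->family < 0 || fam->family >= NPROTO_SKETCH)
        return -1;
    net_families_sketch[fam->family] = fam;
    return 0;
}

/* Stand-in for inet_create(): here it only records the family. */
int inet_create_sketch(struct socket_sketch *sock, int protocol)
{
    (void)protocol;
    sock->family = AF_INET_SKETCH;
    return 0;
}

struct net_proto_family_sketch inet_family_sketch = {
    .family = AF_INET_SKETCH,
    .create = inet_create_sketch,
};

/* Plays the role of sock_create(): set common fields, then dispatch. */
int sock_create_sketch(int family, int type, int protocol,
                       struct socket_sketch *sock)
{
    if (family < 0 || family >= NPROTO_SKETCH || !net_families_sketch[family])
        return -1;                      /* family was never registered */
    sock->type = type;
    return net_families_sketch[family]->create(sock, protocol);
}
```

Until sock_register_sketch(&inet_family_sketch) has been called, sock_create_sketch fails for AF_INET_SKETCH; afterwards the call lands in inet_create_sketch through the table.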

The most important information of a struct socket is stored in a struct sock, which is used all over the network code. I recommend printing it out together with the other commonly used structures (such as struct sk_buff) and keeping it at hand. inet_create allocates memory for this structure and initializes it differently depending on the socket type (the second argument of the socket function):
...
if (sk->prot->init)
        sk->prot->init(sk);
...

If the type is SOCK_STREAM, tcp_v4_init_sock is called. A socket of type SOCK_DGRAM needs no additional initialization, and with that the socket call is finished.

It is worth noting that after inet_create() has been called, sock_map_fd is called. This function allocates a file descriptor and a file structure for the socket, which is why the application layer can handle sockets just like files.

Some of these code paths may be hard to follow at first, mainly because the actual targets of the function pointers vary with the socket type.

2.2 Sending data

When process A wants to send data, the program calls the following statement (sendto goes through a similar path, which is omitted here):
...
write(sockfd, "Hello", strlen("Hello"));
...

The kernel function behind write is sys_write. It first looks up the struct file from the file descriptor. If the file exists (the file pointer is not NULL) and is writable (file->f_mode & FMODE_WRITE is true), it calls the write operation of this file structure:
...
if (file->f_op && (write = file->f_op->write) != NULL)
        ret = write(file, buf, count, &file->f_pos);
...

f_op is a pointer to a struct file_operations. sock_map_fd points it at socket_file_ops, which is defined as follows (net/socket.c):
static struct file_operations socket_file_ops = {
        llseek:         sock_lseek,
        read:           sock_read,
        write:          sock_write,
        poll:           sock_poll,
        ioctl:          sock_ioctl,
        mmap:           sock_mmap,
        open:           sock_no_open,   /* special open code to disallow open via /proc */
        release:        sock_close,
        fasync:         sock_fasync,
        readv:          sock_readv,
        writev:         sock_writev
};

At this point it is clear that the write function pointer points to sock_write. Following it further, we see that this function packs the string buffer into a struct msghdr and finally calls sock_sendmsg.
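The checks made on the way from sys_write to sock_write can be imitated in a few lines of user-space C. This is a sketch under stated assumptions: the _sketch types and functions are hypothetical, the write hook takes fewer arguments than the kernel's, and a small buffer stands in for the socket behind the file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FMODE_WRITE_SKETCH 2    /* hypothetical "opened for writing" bit */

struct file_sketch;

/* Reduced stand-in for struct file_operations: only the write hook. */
struct file_ops_sketch {
    long (*write)(struct file_sketch *f, const char *buf, size_t count);
};

struct file_sketch {
    unsigned int f_mode;
    const struct file_ops_sketch *f_op;
    char sent[64];              /* stands in for the socket's send path */
    size_t sent_len;
};

/* Stand-in for sock_write(): record what was "sent". */
long sock_write_sketch(struct file_sketch *f, const char *buf, size_t count)
{
    if (count > sizeof(f->sent))
        count = sizeof(f->sent);
    memcpy(f->sent, buf, count);
    f->sent_len = count;
    return (long)count;
}

const struct file_ops_sketch socket_file_ops_sketch = {
    .write = sock_write_sketch,
};

/* Same order of checks as described for sys_write above. */
long sys_write_sketch(struct file_sketch *file, const char *buf, size_t count)
{
    if (file == NULL)
        return -1;                              /* no such file */
    if (!(file->f_mode & FMODE_WRITE_SKETCH))
        return -1;                              /* not writable */
    if (file->f_op == NULL || file->f_op->write == NULL)
        return -1;                              /* no write operation */
    return file->f_op->write(file, buf, count);
}
```

The caller never names sock_write_sketch; it is reached only because f_op was pointed at socket_file_ops_sketch, which is the point the text makes about sock_map_fd.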

I am not clear about the scm_send call in sock_sendmsg (scm is short for socket level control messages); fortunately it is not critical here. What we notice is this statement:
...
sock->ops->sendmsg(sock, msg, size, &scm);
...

This is another function pointer. sock->ops is initialized in inet_create(); since ours is a UDP socket, sock->ops points to inet_dgram_ops (that is, sock->ops = &inet_dgram_ops;), which is defined in net/ipv4/af_inet.c:
struct proto_ops inet_dgram_ops = {
        family:         PF_INET,

        release:        inet_release,
        bind:           inet_bind,
        connect:        inet_dgram_connect,
        socketpair:     sock_no_socketpair,
        accept:         sock_no_accept,
        getname:        inet_getname,
        poll:           datagram_poll,
        ioctl:          inet_ioctl,
        listen:         sock_no_listen,
        shutdown:       inet_shutdown,
        setsockopt:     inet_setsockopt,
        getsockopt:     inet_getsockopt,
        sendmsg:        inet_sendmsg,
        recvmsg:        inet_recvmsg,
        mmap:           sock_no_mmap,
};

So we have to look at inet_sendmsg(). It in turn calls yet another function through a function pointer:
...
sk->prot->sendmsg(sk, msg, size);
...

Once again we have to find out where the pointer actually leads. Let us see how to track down such a definition. We already know that sk is a struct sock (defined in include/net/sock.h), where we can see that prot is a struct proto. We then look for all instances of this structure in the source tree (jumping to definitions and finding references is very quick and convenient in Source Insight ^_^) and soon find udp_prot, tcp_prot, raw_prot and so on. Guessing that udp_prot is the one used, we look at its references in the source and find this statement in inet_create:
...
prot = &udp_prot;
...

In fact, if you read inet_create carefully you would have found it earlier, but I was never that careful :).
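The two levels of indirection (sock->ops at the socket level, then sk->prot at the transport level) are easier to see in a reduced model. The sketch below is a user-space illustration only; the _sketch names, the numeric socket types and the simplified sendmsg signatures are assumptions, and the TCP branch is a stub.

```c
#include <assert.h>
#include <stddef.h>

enum { SOCK_STREAM_SKETCH = 1, SOCK_DGRAM_SKETCH = 2 };

struct sock_sketch;
struct socket_sketch;

/* Transport level, in the role of struct proto. */
struct proto_sketch {
    const char *name;
    int (*sendmsg)(struct sock_sketch *sk, const char *msg, int size);
};

/* Socket level, in the role of struct proto_ops. */
struct proto_ops_sketch {
    int (*sendmsg)(struct socket_sketch *sock, const char *msg, int size);
};

struct sock_sketch   { const struct proto_sketch *prot; int last_len; };
struct socket_sketch { const struct proto_ops_sketch *ops; struct sock_sketch *sk; };

int udp_sendmsg_sketch(struct sock_sketch *sk, const char *msg, int size)
{
    (void)msg;
    sk->last_len = size;        /* pretend the datagram was sent */
    return size;
}

int tcp_sendmsg_sketch(struct sock_sketch *sk, const char *msg, int size)
{
    (void)sk; (void)msg; (void)size;
    return -1;                  /* TCP is not modeled here */
}

const struct proto_sketch udp_prot_sketch = { "UDP", udp_sendmsg_sketch };
const struct proto_sketch tcp_prot_sketch = { "TCP", tcp_sendmsg_sketch };

/* Second hop: socket level hands over to the transport protocol. */
int inet_sendmsg_sketch(struct socket_sketch *sock, const char *msg, int size)
{
    return sock->sk->prot->sendmsg(sock->sk, msg, size);
}

const struct proto_ops_sketch inet_dgram_ops_sketch  = { inet_sendmsg_sketch };
const struct proto_ops_sketch inet_stream_ops_sketch = { inet_sendmsg_sketch };

/* Both pointers are chosen once, from the socket type. */
int inet_create_sketch(struct socket_sketch *sock, struct sock_sketch *sk, int type)
{
    switch (type) {
    case SOCK_DGRAM_SKETCH:
        sock->ops = &inet_dgram_ops_sketch;
        sk->prot = &udp_prot_sketch;
        break;
    case SOCK_STREAM_SKETCH:
        sock->ops = &inet_stream_ops_sketch;
        sk->prot = &tcp_prot_sketch;
        break;
    default:
        return -1;
    }
    sk->last_len = 0;
    sock->sk = sk;
    return 0;
}

/* First hop: what sock_sendmsg does with sock->ops. */
int sock_sendmsg_sketch(struct socket_sketch *sock, const char *msg, int size)
{
    return sock->ops->sendmsg(sock, msg, size);
}
```

Because both pointers are fixed at creation time, the send path itself contains no test on the socket type, which is why reading the call chain requires knowing what inet_create assigned.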

Let us follow udp_sendmsg further down.
The main job of this function is to fill in the UDP header (source port, destination port, etc.). It then calls ip_route_output to look up the route, and after that:
...
ip_build_xmit(sk,
              (sk->no_check == UDP_CSUM_NOXMIT ?
               udp_getfrag_nosum :
               udp_getfrag),
              &ufh, ulen, &ipc, rt, msg->msg_flags);
...
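For readers who have not seen the UDP header laid out in code, here is a small user-space sketch of the "fill in the UDP header" step. It is not the kernel's udp_sendmsg; the struct and function names are hypothetical, and only the wire layout (four 16-bit fields in network byte order, with the length covering header plus payload) follows the UDP specification.

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>

/* UDP header as it appears on the wire: four 16-bit fields. */
struct udp_header_sketch {
    uint16_t source;    /* source port, network byte order */
    uint16_t dest;      /* destination port, network byte order */
    uint16_t len;       /* header + payload length in bytes */
    uint16_t check;     /* checksum; 0 means "not computed" over IPv4 */
};

/* Fill the header for a payload of payload_len bytes. */
void udp_fill_header_sketch(struct udp_header_sketch *uh,
                            uint16_t sport, uint16_t dport,
                            uint16_t payload_len)
{
    uh->source = htons(sport);
    uh->dest   = htons(dport);
    uh->len    = htons((uint16_t)(sizeof(*uh) + payload_len));
    uh->check  = 0;     /* the checksum calculation is left out of this sketch */
}
```

For the 5-byte "Hello" payload of our example, the length field becomes 13: 8 bytes of header plus 5 bytes of data.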

A large part of ip_build_xmit is concerned with building the sk_buff and adding the IP header to the packet.
It is followed by this statement:
...
NF_HOOK(PF_INET, NF_IP_LOCAL_OUT, skb, NULL, rt->u.dst.dev, output_maybe_reroute);
...

Put simply, when no firewall code intervenes, you can regard this as a direct call to output_maybe_reroute. (For details, see "Kernel firewall netfilter" in Lumeng magazine, issue 14.)
output_maybe_reroute consists of a single statement:
return skb->dst->output(skb);

Again we have to find where this pointer leads. It is set in ip_route_output (hint: rth->u.dst.output = ip_output;). ip_route_output looks up the route and records the result in skb->dst.

So we turn to ip_output, which immediately hands over to ip_finish_output ~~.
Every network device, such as a NIC, is represented in the kernel by a net_device. ip_finish_output determines the device to use (it too was set up in ip_route_output) and passes it as a parameter to the functions that netfilter has registered at the NF_IP_POST_ROUTING hook. When they are done, ip_finish_output2 is called, and this function calls:
...
hh->hh_output(skb);
...

In our case this calls dev_queue_xmit. At this point we have left the TCP/IP layers behind and data link layer processing begins.

After a few checks, the call that is actually made is:
...
dev->hard_start_xmit(skb, dev);
...

This function is defined in the NIC driver, and every NIC handles it differently. My NIC is the common 3c509 (its driver is 3c509.c); when the NIC is probed (el3_probe) we have:
...
dev->hard_start_xmit = &el3_start_xmit;
...

What follows is I/O: the packet is actually put on the wire, and the sending process is complete.

I have been rather hasty along the way. I ignored all special handling such as errors, blocking and fragmentation, and described only the ideal path.
The purpose of this short article is to help you form a general impression. In reality, every step involves very complicated processing (especially for TCP).

2.3 Receiving data

When data arrives at the NIC, a hardware interrupt is raised and the handler in the NIC driver is called. For my 3c509 card, the handler is el3_interrupt. (The corresponding IRQ number is assigned through request_irq when the system boots and the NIC is initialized.) The first thing the interrupt handler does is read the data with some I/O operations (using the inw function to read from I/O ports). When a data frame has been received successfully, it calls el3_rx(dev) for further processing.

In el3_rx, the received frame is wrapped in a struct sk_buff, leaves the driver and is handed to the generic handler netif_rx (dev.c). For the sake of CPU efficiency, the upper layers are activated through a soft interrupt (softirq). An important job of netif_rx is to put the sk_buff it is given on a wait queue and set the softirq flag; then it can return with peace of mind and wait for the next network packet:
...
__skb_queue_tail(&queue->input_pkt_queue, skb);
__cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
...

In the 2.2 kernel this corresponds to what was commonly called "bottom half" processing. The internal implementation is basically similar, and the goal is the same: to return from the interrupt quickly.

Some time later, a CPU reschedule occurs for whatever reason (for example, a process has used up its time slice). The scheduler function schedule() checks whether any softirq is pending and, if so, runs the corresponding handler:
...
if (softirq_active(this_cpu) & softirq_mask(this_cpu))
        goto handle_softirq;
handle_softirq_back:
...
...
handle_softirq:
        do_softirq();
        goto handle_softirq_back;
...

During system initialization, specifically in net_dev_init, the handler of the receive softirq is set to net_rx_action:
...
open_softirq(NET_RX_SOFTIRQ, net_rx_action, NULL);
...

So when the next reschedule takes place, the system checks whether the NET_RX_SOFTIRQ softirq is pending and, if so, calls net_rx_action.
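The split between the interrupt side (queue the packet, raise a flag, return) and the deferred side (drain the queue later) can be sketched in user space as follows. This is a single-threaded model with hypothetical names; the queue limit BACKLOG_MAX and the drop-when-full behavior are assumptions added for the illustration, and real code would also need protection against concurrent interrupts.

```c
#include <assert.h>
#include <stddef.h>

#define BACKLOG_MAX 4           /* hypothetical queue limit */

struct pkt_sketch { int id; struct pkt_sketch *next; };

struct softnet_sketch {
    struct pkt_sketch *head, *tail;     /* input packet queue */
    int qlen;
    int rx_softirq_pending;             /* the "soft interrupt" flag */
    int delivered;                      /* packets handed to upper layers */
};

/* "Interrupt" side: enqueue, mark work pending, return at once. */
int netif_rx_sketch(struct softnet_sketch *q, struct pkt_sketch *p)
{
    if (q->qlen >= BACKLOG_MAX)
        return -1;                      /* queue full: drop the packet */
    p->next = NULL;
    if (q->tail)
        q->tail->next = p;
    else
        q->head = p;
    q->tail = p;
    q->qlen++;
    q->rx_softirq_pending = 1;
    return 0;
}

/* Deferred side: runs later and drains everything queued so far.
 * Returns the number of packets processed in this run. */
int net_rx_action_sketch(struct softnet_sketch *q)
{
    int n = 0;

    if (!q->rx_softirq_pending)
        return 0;
    q->rx_softirq_pending = 0;
    while (q->head) {
        struct pkt_sketch *p = q->head;

        q->head = p->next;
        if (!q->head)
            q->tail = NULL;
        q->qlen--;
        q->delivered++;                 /* protocol handlers would run here */
        n++;
    }
    return n;
}
```

Several calls to netif_rx_sketch can be served by one later call to net_rx_action_sketch, which is what keeps the time spent in the interrupt handler short.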

net_rx_action corresponds to the net_bh function of the 2.2 kernel. Two global variables in the kernel are used to register network-layer protocols: the linked list ptype_all and the array ptype_base[16]. Together they record every layer-3 protocol (in terms of the OSI seven-layer model) that the kernel can handle. Each network-layer receive handler is represented by a struct packet_type, which is registered in ptype_all or ptype_base through the dev_add_pack function. Only when the type field of the packet_type is ETH_P_ALL is it registered in the ptype_all list; otherwise, as with ip_packet_type, it is placed in the corresponding slot of the array ptype_base[16]. The difference between the two is that a handler registered with type ETH_P_ALL receives packets of every type, while any other handler receives only packets of the type it registered.
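The two registration targets can be modeled with a list and a 16-slot array. The sketch below is a user-space approximation with hypothetical names; in particular, choosing the slot from the low four bits of the protocol number is an assumption of this sketch, and the handlers do nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <arpa/inet.h>

#define ETH_P_ALL_SKETCH 0x0003     /* "every protocol" */
#define ETH_P_IP_SKETCH  0x0800     /* IPv4 EtherType */

struct packet_type_sketch {
    uint16_t type;                          /* network byte order */
    int (*func)(const void *frame);         /* receive handler */
    struct packet_type_sketch *next;
};

struct packet_type_sketch *ptype_all_sketch;
struct packet_type_sketch *ptype_base_sketch[16];

/* Register a handler in one of the two structures. */
void dev_add_pack_sketch(struct packet_type_sketch *pt)
{
    if (pt->type == htons(ETH_P_ALL_SKETCH)) {
        pt->next = ptype_all_sketch;
        ptype_all_sketch = pt;
    } else {
        unsigned int hash = ntohs(pt->type) & 15;   /* assumed slot choice */

        pt->next = ptype_base_sketch[hash];
        ptype_base_sketch[hash] = pt;
    }
}

/* Hand a frame of protocol proto (network byte order) to its handlers.
 * Returns how many handlers were called. */
int deliver_sketch(uint16_t proto, const void *frame)
{
    struct packet_type_sketch *pt;
    int calls = 0;

    for (pt = ptype_all_sketch; pt; pt = pt->next) {
        pt->func(frame);                    /* sees every frame */
        calls++;
    }
    for (pt = ptype_base_sketch[ntohs(proto) & 15]; pt; pt = pt->next) {
        if (pt->type == proto) {            /* only exact matches */
            pt->func(frame);
            calls++;
        }
    }
    return calls;
}

int ip_rcv_sketch(const void *frame)  { (void)frame; return 0; }
int sniffer_sketch(const void *frame) { (void)frame; return 0; }
```

With only an IP handler registered, an ARP frame (EtherType 0x0806) reaches no handler; once an ETH_P_ALL_SKETCH handler is added, it sees both IP and ARP frames, which is how a packet sniffer would hook in.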

skb->protocol is assigned in el3_rx; it is the upper-layer protocol identifier extracted from the Ethernet frame header. In our example the value is ETH_P_IP, so net_rx_action selects the receive handler of the IP layer. From ip_packet_type it is easy to see that this function is ip_rcv().
Just before the call to pt_prev->func (which here points to ip_rcv) there is an atomic_inc(&skb->users) operation (in the 2.2 kernel this was an skb_clone statement, similar in principle), whose purpose is to increase the reference count of the sk_buff. When the network-layer receive handler has finished, or when the sk_buff is dropped for some reason (by the firewall, for example), kfree_skb is called. kfree_skb first checks whether the sk_buff is still in use elsewhere; only if it is not is the memory really released (__kfree_skb), otherwise the counter is simply decremented by one.
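The reference-counting rule described here fits in a few lines. The following user-space sketch uses hypothetical names and a plain int counter; the kernel uses an atomic counter, which this single-threaded illustration does not attempt to reproduce. The skb_freed_count variable exists only so that the effect can be observed.

```c
#include <assert.h>
#include <stdlib.h>

struct skb_sketch { int users; };   /* users = number of holders */

int skb_freed_count;                /* observation hook for this sketch */

/* A new buffer starts with one holder: its creator. */
struct skb_sketch *alloc_skb_sketch(void)
{
    struct skb_sketch *skb = malloc(sizeof(*skb));

    if (skb)
        skb->users = 1;
    return skb;
}

/* Take an extra reference before handing the buffer to a handler. */
void skb_get_sketch(struct skb_sketch *skb)
{
    skb->users++;
}

/* Drop one reference; release the memory only when it was the last. */
void kfree_skb_sketch(struct skb_sketch *skb)
{
    if (--skb->users > 0)
        return;                     /* someone else still holds it */
    free(skb);
    skb_freed_count++;
}
```

After one skb_get_sketch, the first kfree_skb_sketch only decrements the counter and the second one actually frees the buffer.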

Now let us look at ip_rcv (net/ipv4/ip_input.c). What this function does is quite clear: it first checks the validity of the packet (version number, length, checksum and so on) and, if the packet is valid, continues processing. In the 2.4 kernel, to accommodate the firewall code flexibly, the original ip_rcv was split in two: its second half became a separate function, ip_rcv_finish. Part of ip_rcv_finish deals with packets that carry IP options (such as source routing); apart from that, it looks up the route with ip_route_input and records the result in skb->dst. A received packet falls into one of two cases: it is destined for the local host (and must be passed to an upper-layer protocol) or it has to be forwarded (when the host acts as a gateway), and the two cases need different handlers. If the packet is for the local host, ip_local_deliver (net/ipv4/ip_input.c) is called; otherwise ip_forward (net/ipv4/ip_forward.c) is called. The function pointer skb->dst->input steers the datagram onto the right path.

In our example, ip_local_deliver is the one that gets called.
The packet that arrives may well be a fragment. In that case the fragments must be reassembled before being passed to the upper-layer protocol, and this is the first thing ip_local_deliver does. If reassembly succeeds (the returned sk_buff is not NULL), processing continues (for the reassembly algorithm in detail, see the analysis of IP fragment reassembly and common fragment attacks in Lumeng magazine, issue 13).
Here the code is again split in two by netfilter. As before, we go straight to the second half, ip_local_deliver_finish (net/ipv4/ip_input.c).

Transport-layer handlers (TCP, UDP, RAW and so on) are registered in inet_protos (through inet_add_protocol). ip_local_deliver_finish calls the appropriate handler according to the upper-layer protocol field in the IP header (iph->protocol). To keep things simple we use UDP, in which case ipprot->handler is udp_rcv.

As mentioned earlier, every socket created by an application has a struct socket/struct sock in the kernel. udp_rcv first finds the sock in the kernel through udp_v4_lookup and then calls udp_queue_rcv_skb (net/ipv4/udp.c) with it as a parameter. That in turn immediately calls sock_queue_rcv_skb, which puts the sk_buff on the receive queue and notifies the upper layer that data has arrived:
...
skb_set_owner_r(skb, sk);
skb_queue_tail(&sk->receive_queue, skb);
if (!sk->dead)
        sk->data_ready(sk, skb->len);
return 0;
...

sk->data_ready is set when the sock structure is initialized (sock_init_data):
...
sk->data_ready = sock_def_readable;
...

Now let us look from the top down.
To receive the datagram, process B calls the following in its program:
...
read(sockfd, buff, sizeof(buff));
...

The kernel function behind this system call is sys_read (fs/read_write.c). The steps below it mirror those of the write path (see the table above). udp_recvmsg calls skb_recv_datagram. If the data has not arrived yet and the socket is in blocking mode, the process is put to sleep (signal_pending(current)) until data_ready notifies it that the resource is available (wake_up_interruptible(sk->sleep);), and processing then continues.
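The whole exchange can be tried from user space on a single machine. The sketch below plays both roles inside one function: socket B binds to the loopback address, socket A sends "Hello" with write(), and B picks it up with read(). The function name, the use of loopback and the use of connect() on the sending side (so that plain write() has a destination) are choices made for this illustration, not something taken from the article's two-host setup.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Send "Hello" from UDP socket A to UDP socket B over loopback.
 * Returns the number of bytes B read into out, or -1 on error. */
int udp_hello_sketch(char *out, size_t outlen)
{
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);
    ssize_t n = -1;
    int a, b;

    b = socket(AF_INET, SOCK_DGRAM, 0);     /* receiver, "process B" */
    a = socket(AF_INET, SOCK_DGRAM, 0);     /* sender, "process A" */
    if (a < 0 || b < 0)
        goto out;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                      /* let the kernel pick a port */
    if (bind(b, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;
    if (getsockname(b, (struct sockaddr *)&addr, &alen) < 0)
        goto out;                           /* learn the chosen port */

    /* connect() on a UDP socket only fixes the destination address. */
    if (connect(a, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;
    if (write(a, "Hello", strlen("Hello")) != (ssize_t)strlen("Hello"))
        goto out;

    n = read(b, out, outlen);               /* blocks until data arrives */
out:
    if (a >= 0)
        close(a);
    if (b >= 0)
        close(b);
    return (int)n;
}
```

On a system with a working loopback interface, the write() in this function travels down the send path of the table and the read() returns through the receive path, except that no NIC driver is involved.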

2.4 skbuff

The network code contains a great many operations on sk_buff. Although this article has done its best to avoid them, they have to be studied in any careful analysis: packets are passed between the protocol layers and processed in the form of sk_buffs, so it is fair to call sk_buff the most important data structure in the networking code. For details, see Alan Cox's "Network Buffers and Memory Management", published in Linux Journal in October 1996.

Here I reproduce a diagram from Phrack-12. Although it shows only a small corner of sk_buff, it is very useful, especially when, like me, you forget whether skb_put moves a pointer forward or backward :)

---   -----------------  head
 ^                            |
 |                            ^ skb_push
 |
 |    -----------------  data  ------
 |                            ^        |
true                          |        V skb_pull
size                         len
 |                            |        ^ skb_trim
 |                            V        |
 |    -----------------  tail  ------
 |                            |
 |                            V skb_put
 V
---   -----------------  end
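The pointer movements in the diagram can be tried out on an ordinary byte array. The sketch below is a user-space model with hypothetical _sketch names. It also includes a reserve helper for setting aside headroom, which the diagram does not show, and it returns NULL when there is no room, which is a simplification chosen for this sketch rather than a description of kernel behavior.

```c
#include <assert.h>
#include <stddef.h>

struct skb_sketch {
    unsigned char *head, *data, *tail, *end;
    unsigned int len;               /* always tail - data */
};

/* Start with an empty buffer: data and tail both sit at head. */
void skb_init_sketch(struct skb_sketch *skb, unsigned char *buf, size_t size)
{
    skb->head = skb->data = skb->tail = buf;
    skb->end = buf + size;
    skb->len = 0;
}

/* Set aside n bytes of headroom for headers added later (empty buffer only). */
void skb_reserve_sketch(struct skb_sketch *skb, unsigned int n)
{
    skb->data += n;
    skb->tail += n;
}

/* Append n bytes at the tail; returns the start of the new area. */
unsigned char *skb_put_sketch(struct skb_sketch *skb, unsigned int n)
{
    unsigned char *old_tail = skb->tail;

    if ((size_t)(skb->end - skb->tail) < n)
        return NULL;                /* not enough tailroom */
    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* Prepend n bytes: data moves toward head (adding a header). */
unsigned char *skb_push_sketch(struct skb_sketch *skb, unsigned int n)
{
    if ((size_t)(skb->data - skb->head) < n)
        return NULL;                /* not enough headroom */
    skb->data -= n;
    skb->len += n;
    return skb->data;
}

/* Remove n bytes from the front: data moves toward tail (stripping a header). */
unsigned char *skb_pull_sketch(struct skb_sketch *skb, unsigned int n)
{
    if (n > skb->len)
        return NULL;
    skb->data += n;
    skb->len -= n;
    return skb->data;
}

/* Cut the data down to len bytes by moving tail back. */
void skb_trim_sketch(struct skb_sketch *skb, unsigned int len)
{
    if (len < skb->len) {
        skb->len = len;
        skb->tail = skb->data + len;
    }
}
```

A sender following the path in section 2.2 would reserve headroom first, put the payload, and then push the UDP and IP headers in front of it; a receiver pulls the headers off one layer at a time. No payload bytes are moved in either direction.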

Efficiency of the Linux network layer: the Linux network code makes extensive use of pointers in order to avoid data copies and other operations that consume system resources. The data portion of a packet is copied only twice on receive or send: between the NIC and kernel memory, and between kernel memory and user memory. A few days ago I saw that, among attempts to improve sniffer capture efficiency, turbo packet (a kernel patch) lets kernel mode and user mode share a block of memory, which saves one data copy and improves efficiency further.

3 Postscript
This article was written in a rush at the last minute. Looking at the clumsy prose, I really feel I am letting the readers down ~~ If I find the time I will rewrite this part; it is, after all, something I have always wanted to do :)

Publisher: crystal from: Linux technical support site
