Use sockopt to exchange data with the kernel


This document is copyleft by yfydz. It may be freely copied and redistributed under the GPL. Commercial use of any kind is strictly prohibited.
MSN: yfydz_no1@hotmail.com
Source: http://yfydz.cublog.cn

1. Preface

After a network socket has been opened, set/getsockopt(2) can be used for communication between user space and the kernel. In essence this is similar to ioctl. The difference is that set/getsockopt needs no new device, because an existing socket type of the system can be used directly. setsockopt() writes data to the kernel and getsockopt() reads data from the kernel.

The kernel code discussed here is version 2.6.19.2.

2. Basic procedure

First, register the set/getsockopt command words and their handler functions for the protocol concerned in the kernel. After opening a socket of that protocol in user space, call set/getsockopt directly with the desired command word to exchange data. Ordinary TCP and UDP sockets can be used. iptables <-> netfilter and ipvsadm <-> ip_vs both communicate through these two system calls.

3. set/getsockopt(2)

The basic format of the set/getsockopt(2) functions is:

    int setsockopt(int sockfd, int proto, int cmd, void *data, int datalen)
    int getsockopt(int sockfd, int proto, int cmd, void *data, int *datalen)

The first parameter is the socket descriptor. The second is the protocol level: an IP raw socket uses SOL_SOCKET/SOL_IP, and a TCP/UDP socket can use SOL_SOCKET/SOL_IP/SOL_TCP/SOL_UDP. That is, an upper-layer socket can use all the command words of the layers below it. The third parameter, cmd, is the operation command, which you define yourself. The fourth is a pointer to the start of the data buffer: a set operation writes the buffer contents to the kernel, and a get operation reads data from the kernel into the buffer. The fifth is the data length. getsockopt takes a pointer to it so that the kernel can report how many bytes it returned.

4. Kernel implementation

New sockopt commands can be added in two ways: by adding a complete new protocol, or by adding new command words to the command set of an existing protocol.

A sockopt command word has no special format. It is just an integer that only has to be unique within the protocol, unlike an ioctl command word, which has format requirements.

4.1 Complete protocol

Each protocol is described by a struct proto (include/net/sock.h). The Linux kernel defines three by default: TCP, UDP, and RAW, where RAW covers everything that is neither TCP nor UDP. In sock_setsockopt() and sock_getsockopt() in net/core/sock.c the kernel implements a set of sockopt commands common to all sockets. The command words specific to a protocol are defined within that protocol.

struct proto contains setsockopt and getsockopt member functions, which handle the command words specific to each protocol. For example, the UDP setsockopt member function:

    static int udp_setsockopt(struct sock *sk, int level, int optname,
                              char __user *optval, int optlen)
    {
        // First check whether this is a UDP-level option. If not, hand it to the IP layer's sockopt handling.
        if (level != SOL_UDP)
            return ip_setsockopt(sk, level, optname, optval, optlen);
        // A UDP-level command: call the UDP protocol's own sockopt handling
        return do_udp_setsockopt(sk, level, optname, optval, optlen);
    }

    static int do_udp_setsockopt(struct sock *sk, int level, int optname,
                                 char __user *optval, int optlen)
    {
        struct udp_sock *up = udp_sk(sk);
        int val;
        int err = 0;

        if (optlen < sizeof(int))
            return -EINVAL;

        if (get_user(val, (int __user *)optval))
            return -EFAULT;

        // Only these two command words are actually specific to UDP
        switch (optname) {
        case UDP_CORK:
            if (val != 0) {
                up->corkflag = 1;
            } else {
                up->corkflag = 0;
                lock_sock(sk);
                udp_push_pending_frames(sk, up);
                release_sock(sk);
            }
            break;

        // UDP encapsulation, used for IPsec NAT-T
        case UDP_ENCAP:
            switch (val) {
            case 0:
            case UDP_ENCAP_ESPINUDP:
            case UDP_ENCAP_ESPINUDP_NON_IKE:
                up->encap_type = val;
                break;
            default:
                err = -ENOPROTOOPT;
                break;
            }
            break;

        default:
            err = -ENOPROTOOPT;
            break;
        };

        return err;
    }

To implement sockopt control for a new protocol, handle it in the same way. After defining the struct proto structure, register it with the system. Protocols in the IP family use the inet_register_protosw() function, and other protocol families can be handled similarly.

4.2 Command extension

In practice it is rarely necessary to define a whole new protocol. Usually adding new command words is enough. To add new TCP or UDP command words you have to modify the kernel's TCP/UDP implementation, add your own command words, and recompile the kernel for them to take effect. For commands at the IP raw level, netfilter provides nf_register_sockopt() and nf_unregister_sockopt() to register and unregister sockopt commands dynamically, so the original kernel code does not need to be modified.

Netfilter keeps its sockopt operations in a linked list. To define new operations, define a new operation node and hook it onto the list. When a sockopt system call arrives, the list is searched in order and the handler of the first node whose command range matches is called. The new command words therefore must not coincide with those already defined for IP raw sockets. Since a command word is a 32-bit integer, the value space is very large, so conflicts are easy to avoid with a little care. The socket used with netfilter is a raw socket.

The sockopt operation node structure is simple and clear. It defines the range of each kind of command word and the related handler functions:

    /* include/linux/netfilter.h */
    struct nf_sockopt_ops
    {
        // Linked list node
        struct list_head list;

        // Protocol family
        int pf;

        /* Non-inclusive ranges: use 0/0/NULL to never get called. */
        // Minimum value of the set commands
        int set_optmin;
        // Maximum value of the set commands (exclusive)
        int set_optmax;
        // Set handler
        int (*set)(struct sock *sk, int optval, void __user *user, unsigned int len);
        int (*compat_set)(struct sock *sk, int optval,
                          void __user *user, unsigned int len);

        // Minimum value of the get commands
        int get_optmin;
        // Maximum value of the get commands (exclusive)
        int get_optmax;
        // Get handler
        int (*get)(struct sock *sk, int optval, void __user *user, int *len);
        int (*compat_get)(struct sock *sk, int optval,
                          void __user *user, int *len);

        /* Number of users inside set() or get(). */
        unsigned int use;
        struct task_struct *cleanup_task;
    };

Registration and unregistration functions for the operation structure:

    /* net/netfilter/nf_sockopt.c */
    // Netfilter's sockopt list. Every sockopt handler node is hooked onto this list.
    static LIST_HEAD(nf_sockopts);

    /* Functions to register sockopt ranges (exclusive). */
    int nf_register_sockopt(struct nf_sockopt_ops *reg)
    {
        struct list_head *i;
        int ret = 0;
        // Lock
        if (mutex_lock_interruptible(&nf_sockopt_mutex) != 0)
            return -EINTR;

        // Check whether the new node's command ranges overlap a node already on the list
        list_for_each(i, &nf_sockopts) {
            struct nf_sockopt_ops *ops = (struct nf_sockopt_ops *)i;
            if (ops->pf == reg->pf
                && (overlap(ops->set_optmin, ops->set_optmax,
                            reg->set_optmin, reg->set_optmax)
                    || overlap(ops->get_optmin, ops->get_optmax,
                               reg->get_optmin, reg->get_optmax))) {
                NFDEBUG("nf_sock overlap: %u-%u/%u-%u v %u-%u/%u-%u\n",
                        ops->set_optmin, ops->set_optmax,
                        ops->get_optmin, ops->get_optmax,
                        reg->set_optmin, reg->set_optmax,
                        reg->get_optmin, reg->get_optmax);
                ret = -EBUSY;
                goto out;
            }
        }
        // Add the new node to the sockopt list
        list_add(&reg->list, &nf_sockopts);
    out:
        // Unlock
        mutex_unlock(&nf_sockopt_mutex);
        return ret;
    }
    EXPORT_SYMBOL(nf_register_sockopt);

    void nf_unregister_sockopt(struct nf_sockopt_ops *reg)
    {
        /* No point being interruptible: we're probably in cleanup_module() */
    restart:
        mutex_lock(&nf_sockopt_mutex);
        if (reg->use != 0) {
            // The node is still in use. Block this process until all operations on it have completed.
            /* To be woken by nf_sockopt call... */
            /* FIXME: Stuart Young's name appears gratuitously. */
            set_current_state(TASK_UNINTERRUPTIBLE);
            reg->cleanup_task = current;
            mutex_unlock(&nf_sockopt_mutex);
            schedule();
            goto restart;
        }
        // Remove the node from the list
        list_del(&reg->list);
        mutex_unlock(&nf_sockopt_mutex);
    }
    EXPORT_SYMBOL(nf_unregister_sockopt);

Next, let's look at the actual call path. The socket opened is a raw IP socket, so a setsockopt operation on it calls the ip_setsockopt() function:

    /* net/ipv4/ip_sockglue.c */
    int ip_setsockopt(struct sock *sk, int level,
                      int optname, char __user *optval, int optlen)
    {
        int err;

        if (level != SOL_IP)
            return -ENOPROTOOPT;

        // First try the ordinary IP sockopt handling
        err = do_ip_setsockopt(sk, level, optname, optval, optlen);
    #ifdef CONFIG_NETFILTER
        // The kernel must be built with netfilter support
        /* we need to exclude all possible ENOPROTOOPTs except default case */
        if (err == -ENOPROTOOPT && optname != IP_HDRINCL &&
            optname != IP_IPSEC_POLICY && optname != IP_XFRM_POLICY
    #ifdef CONFIG_IP_MROUTE
            && (optname < MRT_BASE || optname > (MRT_BASE + 10))
    #endif
           ) {
            // The IP layer does not know this command word, so try netfilter's sockopt handlers
            lock_sock(sk);
            err = nf_setsockopt(sk, PF_INET, optname, optval, optlen);
            release_sock(sk);
        }
    #endif
        return err;
    }

    /* net/netfilter/nf_sockopt.c */
    int nf_setsockopt(struct sock *sk, int pf, int val, char __user *opt,
                      int len)
    {
        // Call the common nf_sockopt() function
        return nf_sockopt(sk, pf, val, opt, &len, 0);
    }

    static int nf_sockopt(struct sock *sk, int pf, int val,
                          char __user *opt, int *len, int get)
    {
        struct list_head *i;
        struct nf_sockopt_ops *ops;
        int ret;

        if (mutex_lock_interruptible(&nf_sockopt_mutex) != 0)
            return -EINTR;

        // Scan netfilter's sockopt list
        list_for_each(i, &nf_sockopts) {
            // Get the operation node
            ops = (struct nf_sockopt_ops *)i;
            // Decide by protocol family and command range whether this node handles the command
            if (ops->pf == pf) {
                if (get) {
                    // Get operation
                    if (val >= ops->get_optmin
                        && val < ops->get_optmax) {
                        // Increment the use count of the operation node
                        ops->use++;
                        mutex_unlock(&nf_sockopt_mutex);
                        ret = ops->get(sk, val, opt, len);
                        goto out;
                    }
                } else {
                    // Set operation
                    if (val >= ops->set_optmin
                        && val < ops->set_optmax) {
                        ops->use++;
                        mutex_unlock(&nf_sockopt_mutex);
                        ret = ops->set(sk, val, opt, *len);
                        goto out;
                    }
                }
            }
        }
        mutex_unlock(&nf_sockopt_mutex);
        return -ENOPROTOOPT;

    out:
        mutex_lock(&nf_sockopt_mutex);
        // The operation has finished. Decrement the use count of the operation node.
        ops->use--;
        if (ops->cleanup_task)
            wake_up_process(ops->cleanup_task);
        mutex_unlock(&nf_sockopt_mutex);
        return ret;
    }

In this way the registered netfilter sockopt nodes are traversed, and the handler of the matching node carries out the operation.

A concrete example is the ip_vs operation node:

    /* net/ipv4/ipvs/ip_vs_ctl.c */
    static struct nf_sockopt_ops ip_vs_sockopts = {
        .pf         = PF_INET,
        // Range of the set command words
        .set_optmin = IP_VS_BASE_CTL,
        .set_optmax = IP_VS_SO_SET_MAX + 1,
        .set        = do_ip_vs_set_ctl,
        // Range of the get command words
        .get_optmin = IP_VS_BASE_CTL,
        .get_optmax = IP_VS_SO_GET_MAX + 1,
        .get        = do_ip_vs_get_ctl,
    };

The set/get functions are straightforward. They check that the request is valid and then process it according to the command word:

    static int
    do_ip_vs_set_ctl(struct sock *sk, int cmd, void __user *user, unsigned int len)
    {
        int ret;
        unsigned char arg[MAX_ARG_LEN];
        struct ip_vs_service_user *usvc;
        struct ip_vs_service *svc;
        struct ip_vs_dest_user *udest;

        // Check the caller's privileges
        if (!capable(CAP_NET_ADMIN))
            return -EPERM;

        // Check the data length
        if (len != set_arglen[SET_CMDID(cmd)]) {
            IP_VS_ERR("set_ctl: len %u != %u\n",
                      len, set_arglen[SET_CMDID(cmd)]);
            return -EINVAL;
        }
        // Copy the data from user space
        if (copy_from_user(arg, user, len) != 0)
            return -EFAULT;

        /* increase the module use count */
        // IPVS module use count
        ip_vs_use_count_inc();

        // Lock
        if (mutex_lock_interruptible(&__ip_vs_mutex)) {
            ret = -ERESTARTSYS;
            goto out_dec;
        }

        // Process the individual commands
        if (cmd == IP_VS_SO_SET_FLUSH) {
            /* Flush the virtual service */
            ret = ip_vs_flush();
            goto out_unlock;
    ......

5. User space

The user-space side is very simple: open a socket of the relevant protocol type with socket(2) and call set/getsockopt directly to perform the operation.

Example: ipvsadm

    int ipvs_init(void)
    {
        socklen_t len;

        len = sizeof(ipvs_info);
        // Open a raw socket
        if ((sockfd = socket(AF_INET, SOCK_RAW, IPPROTO_RAW)) == -1)
            return -1;
        // Read the basic IPVS information
        if (getsockopt(sockfd, IPPROTO_IP, IP_VS_SO_GET_INFO,
                       (char *)&ipvs_info, &len))
            return -1;

        return 0;
    }

6. Conclusion

Passing data between user space and kernel space with set/getsockopt() is one of the commonly used methods. It is simple and convenient, and different commands on the same socket can carry different data structures. New command words can be added either with a new protocol or to an existing implementation, with no special requirements on their values. The dynamic registration provided by netfilter allows sockopt command words to be added and removed without modifying the original kernel code.
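
The user-space call pattern described above can be tried without root privileges or the ip_vs module. The following is a minimal sketch and is not taken from the original article: it assumes a POSIX system, uses the standard SOL_SOCKET option SO_RCVBUF on a UDP socket in place of a self-defined command word, and the helper name rcvbuf_roundtrip is hypothetical. It writes a value to the kernel with setsockopt() and reads the stored value back with getsockopt(), which takes the length by pointer.

```c
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the article): ask the kernel for a receive
 * buffer of `requested` bytes, then read back the value the kernel stored.
 * Returns 0 on success and -1 on any failure.
 */
int rcvbuf_roundtrip(int requested, int *actual)
{
    socklen_t len = sizeof(*actual);
    int ret = -1;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd == -1)
        return -1;

    /* set: the buffer contents travel from user space into the kernel */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested)) == -1)
        goto out;

    /* get: the kernel fills the buffer and updates len to the size it wrote */
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, actual, &len) == -1)
        goto out;

    ret = 0;
out:
    close(fd);
    return ret;
}
```

A caller might pass 65536 and inspect *actual afterwards. The value read back need not equal the one requested: Linux, for example, documents in socket(7) that it doubles the requested size to leave room for bookkeeping overhead.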
