Linux Kernel Network Protocol Stack Study Notes (6)

Source: Internet
Author: User

This article discusses how IP packets are sent and received (routing is not covered for now).

Let's first look at inet_init.

The first step is to call proto_register to register tcp_prot, udp_prot, and raw_prot. The first half of proto_register initializes the various slab caches, and the second half links the struct proto onto proto_list.

Next comes sock_register. The kernel has a global array of net_proto_family structures called net_families. inet_init calls sock_register to add inet_family_ops to net_families[PF_INET]. inet_family_ops is defined as follows:

static struct net_proto_family inet_family_ops = {
	.family = PF_INET,
	.create = inet_create,
	.owner  = THIS_MODULE,
};

Next, inet_add_protocol is called to fill in the inet_protos array. inet_protos is a global array of pointers, defined as follows:

const struct net_protocol *inet_protos[MAX_INET_PROTOS] ____cacheline_aligned_in_smp;

The size of the array, MAX_INET_PROTOS, is 256, and all protocol numbers are defined in in.h.

/* Standard well-defined IP protocols. */
enum {
	IPPROTO_IP = 0,		/* Dummy protocol for TCP */
	IPPROTO_ICMP = 1,	/* Internet Control Message Protocol */
	IPPROTO_IGMP = 2,	/* Internet Group Management Protocol */
	IPPROTO_IPIP = 4,	/* IPIP tunnels (older KA9Q tunnels use 94) */
	IPPROTO_TCP = 6,	/* Transmission Control Protocol */
	IPPROTO_EGP = 8,	/* Exterior Gateway Protocol */
	IPPROTO_PUP = 12,	/* PUP protocol */
	IPPROTO_UDP = 17,	/* User Datagram Protocol */
	IPPROTO_IDP = 22,	/* XNS IDP protocol */
	IPPROTO_DCCP = 33,	/* Datagram Congestion Control Protocol */
	IPPROTO_RSVP = 46,	/* RSVP protocol */
	IPPROTO_GRE = 47,	/* Cisco GRE tunnels (RFC 1701, 1702) */
	IPPROTO_IPV6 = 41,	/* IPv6-in-IPv4 tunnelling */
	IPPROTO_ESP = 50,	/* Encapsulation Security Payload protocol */
	IPPROTO_AH = 51,	/* Authentication Header protocol */
	IPPROTO_BEETPH = 94,	/* IP option pseudo header for BEET */
	IPPROTO_PIM = 103,	/* Protocol Independent Multicast */
	IPPROTO_COMP = 108,	/* Compression Header protocol */
	IPPROTO_SCTP = 132,	/* Stream Control Transport Protocol */
	IPPROTO_UDPLITE = 136,	/* UDP-Lite (RFC 3828) */
	IPPROTO_RAW = 255,	/* Raw IP packets */
	IPPROTO_MAX
};

In inet_init, only ICMP, IGMP, TCP, and UDP are registered in inet_protos. Taking TCP as an example, its net_protocol is defined as follows:

static const struct net_protocol tcp_protocol = {
	.handler	= tcp_v4_rcv,
	.err_handler	= tcp_v4_err,
	.gso_send_check	= tcp_v4_gso_send_check,
	.gso_segment	= tcp_tso_segment,
	.gro_receive	= tcp4_gro_receive,
	.gro_complete	= tcp4_gro_complete,
	.no_policy	= 1,
	.netns_ok	= 1,
};

When the IP layer passes a packet up, ip_local_deliver_finish looks up the corresponding net_protocol structure in inet_protos based on the protocol of the skb and then calls net_protocol->handler. For example, for a TCP skb, tcp_v4_rcv is called.

Next, the inetsw array is initialized from inetsw_array. inetsw is an array of list_head. Each index corresponds to a socket type (determined by layer 4), such as SOCK_STREAM, SOCK_DGRAM, and SOCK_RAW. The types are defined as follows:

enum sock_type {
	SOCK_STREAM	= 1,
	SOCK_DGRAM	= 2,
	SOCK_RAW	= 3,
	SOCK_RDM	= 4,
	SOCK_SEQPACKET	= 5,
	SOCK_DCCP	= 6,
	SOCK_PACKET	= 10,
};

inetsw_array is an array of inet_protosw, defined as follows:

static struct inet_protosw inetsw_array[] =
{
	{
		.type =       SOCK_STREAM,
		.protocol =   IPPROTO_TCP,
		.prot =       &tcp_prot,
		.ops =        &inet_stream_ops,
		.no_check =   0,
		.flags =      INET_PROTOSW_PERMANENT |
			      INET_PROTOSW_ICSK,
	},

	{
		.type =       SOCK_DGRAM,
		.protocol =   IPPROTO_UDP,
		.prot =       &udp_prot,
		.ops =        &inet_dgram_ops,
		.no_check =   UDP_CSUM_DEFAULT,
		.flags =      INET_PROTOSW_PERMANENT,
	},

	{
		.type =       SOCK_RAW,
		.protocol =   IPPROTO_IP,	/* wild card */
		.prot =       &raw_prot,
		.ops =        &inet_sockraw_ops,
		.no_check =   UDP_CSUM_DEFAULT,
		.flags =      INET_PROTOSW_REUSE,
	}
};

The definition of inet_protosw is as follows:

/* This is used to register socket interfaces for IP protocols. */
struct inet_protosw {
	struct list_head list;

	/* These two fields form the lookup key. */
	unsigned short	 type;	   /* This is the 2nd argument to socket(2). */
	unsigned short	 protocol; /* This is the L4 protocol number. */

	struct proto	 *prot;
	const struct proto_ops *ops;

	char             no_check;   /* checksum on rcv/xmit/none? */
	unsigned char	 flags;      /* See INET_PROTOSW_* below. */
};
#define INET_PROTOSW_REUSE 0x01	     /* Are ports automatically reusable? */
#define INET_PROTOSW_PERMANENT 0x02  /* Permanent protocols are unremovable. */
#define INET_PROTOSW_ICSK      0x04  /* Is this an inet_connection_sock? */

The list member of each inet_protosw links it onto the list_head in inetsw that corresponds to its type.

Finally, arp_init, ip_init, tcp_v4_init, tcp_init, and udp_init are called in turn.

Next, let's talk about the IP protocol. The IP options part is skipped here, because IP options are rarely seen in real-world networks. First, let's look at the IP header.

struct iphdr {
#if defined(__LITTLE_ENDIAN_BITFIELD)
	__u8	ihl:4,
		version:4;
#elif defined(__BIG_ENDIAN_BITFIELD)
	__u8	version:4,
		ihl:4;
#else
#error	"Please fix <asm/byteorder.h>"
#endif
	__u8	tos;
	__be16	tot_len;
	__be16	id;
	__be16	frag_off;
	__u8	ttl;
	__u8	protocol;
	__sum16	check;
	__be32	saddr;
	__be32	daddr;
	/* The options start here. */
};

ihl is measured in units of 4 bytes. An IP header without options is 20 bytes long, so the value is normally 5.

tot_len is the total length of the packet (header plus payload), measured in bytes.

id is used for IP fragmentation and reassembly. All fragments of the same IP packet carry the same id.

protocol is the layer-4 protocol number.

check is the header checksum.

In the sk_buff structure, skb->csum stores the L4 checksum, and skb->ip_summed indicates the checksum status:

CHECKSUM_NONE: the L4 checksum is not valid and must be computed in software.

CHECKSUM_HW: the NIC has computed the L4 checksum in hardware, but the software still needs to verify it.

CHECKSUM_UNNECESSARY: the L4 checksum does not need to be verified.

static struct packet_type ip_packet_type __read_mostly = {
	.type = cpu_to_be16(ETH_P_IP),
	.func = ip_rcv,
	.gso_send_check = inet_gso_send_check,
	.gso_segment = inet_gso_segment,
	.gro_receive = inet_gro_receive,
	.gro_complete = inet_gro_complete,
};

L2 finds the ip_rcv function through ip_packet_type and passes the packet up to L3. ip_rcv is analyzed below:

	/* When the interface is in promisc. mode, drop all the crap
	 * that it receives, do not try to analyse it.
	 */
	if (skb->pkt_type == PACKET_OTHERHOST)
		goto drop;

	IP_UPD_PO_STATS_BH(dev_net(dev), IPSTATS_MIB_IN, skb->len);

	if ((skb = skb_share_check(skb, GFP_ATOMIC)) == NULL) {
		IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INDISCARDS);
		goto out;
	}

	if (!pskb_may_pull(skb, sizeof(struct iphdr)))
		goto inhdr_error;

	iph = ip_hdr(skb);

If the skb was captured in promiscuous mode and is not addressed to the local host, it is dropped immediately. If the skb is shared, skb_share_check clones it so that the rest of the function works on a private copy.

pskb_may_pull is more complicated. Its purpose is to ensure that at least sizeof(struct iphdr) bytes are available in the linear area starting at skb->data. The complication comes from the structure of sk_buff: in many cases part of the packet content is not stored in the skb's linear area. The linear area is followed by a skb_shared_info structure, which describes where scatter-gather data is stored. That data is scattered over different memory pages, each described by one entry of the frags array of skb_frag_t, and the number of entries is stored in nr_frags. If the IP packet is fragmented, there is also frag_list, a list of sk_buff holding the skbs of the fragments. If fewer bytes than requested are available in the linear area after skb->data, pskb_may_pull expands the skb and copies the IP header content from frags or frag_list into the linear area.

	if (iph->ihl < 5 || iph->version != 4)
		goto inhdr_error;

	if (!pskb_may_pull(skb, iph->ihl*4))
		goto inhdr_error;

	iph = ip_hdr(skb);

	if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
		goto inhdr_error;

	len = ntohs(iph->tot_len);
	if (skb->len < len) {
		IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INTRUNCATEDPKTS);
		goto drop;
	} else if (len < (iph->ihl*4))
		goto inhdr_error;

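This code performs basic sanity checks: the header length and version, the header checksum, and the consistency of tot_len with the skb length and the header length. The details are skipped here.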

	/* Our transport medium may have padded the buffer out. Now we know it
	 * is IP we can trim to the true length of the frame.
	 * Note this now means skb->len holds ntohs(iph->tot_len).
	 */
	if (pskb_trim_rcsum(skb, len)) {
		IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INDISCARDS);
		goto drop;
	}

pskb_trim_rcsum trims the skb to len, removing any L2 padding, and makes sure a checksum computed over the untrimmed data is not reused, so that it is recalculated later.

	return NF_HOOK(PF_INET, NF_INET_PRE_ROUTING, skb, dev, NULL,
		       ip_rcv_finish);

Finally, the packet enters netfilter. If it is not dropped or otherwise consumed there, processing continues in ip_rcv_finish.

static int ip_rcv_finish(struct sk_buff *skb)
{
	const struct iphdr *iph = ip_hdr(skb);
	struct rtable *rt;

	/*
	 *	Initialise the virtual path cache for the packet. It describes
	 *	how the packet travels inside Linux networking.
	 */
	if (skb_dst(skb) == NULL) {
		int err = ip_route_input(skb, iph->daddr, iph->saddr, iph->tos,
					 skb->dev);
		if (unlikely(err)) {
			if (err == -EHOSTUNREACH)
				IP_INC_STATS_BH(dev_net(skb->dev),
						IPSTATS_MIB_INADDRERRORS);
			else if (err == -ENETUNREACH)
				IP_INC_STATS_BH(dev_net(skb->dev),
						IPSTATS_MIB_INNOROUTES);
			goto drop;
		}
	}

	if (iph->ihl > 5 && ip_rcv_options(skb))
		goto drop;

	rt = skb_rtable(skb);
	if (rt->rt_type == RTN_MULTICAST) {
		IP_UPD_PO_STATS_BH(dev_net(rt->u.dst.dev), IPSTATS_MIB_INMCAST,
				skb->len);
	} else if (rt->rt_type == RTN_BROADCAST)
		IP_UPD_PO_STATS_BH(dev_net(rt->u.dst.dev), IPSTATS_MIB_INBCAST,
				skb->len);

	return dst_input(skb);

drop:
	kfree_skb(skb);
	return NET_RX_DROP;
}

ip_rcv_finish first calls ip_route_input to look up the route for the destination. The lookup uses the local routing table to decide whether the packet should be delivered locally or forwarded. ip_route_input stores the result in (struct dst_entry *)skb->_skb_dst. The dst_entry->input function pointer is set to ip_local_deliver or ip_forward in ip_route_input_slow, which is called by ip_route_input.

For packets that must be forwarded, ip_route_input_slow calls ip_mkroute_input. Packets that are not forwarded are handled on the local input path. ip_mkroute_input calls __mkroute_input, which calls dst_alloc to create an rtable and sets rth->u.dst.input = ip_forward. The code segment is as follows:

	rth = dst_alloc(&ipv4_dst_ops);
	if (!rth) {
		err = -ENOBUFS;
		goto cleanup;
	}
	atomic_set(&rth->u.dst.__refcnt, 1);
	rth->u.dst.flags = DST_HOST;
	if (IN_DEV_CONF_GET(in_dev, NOPOLICY))
		rth->u.dst.flags |= DST_NOPOLICY;
	if (IN_DEV_CONF_GET(out_dev, NOXFRM))
		rth->u.dst.flags |= DST_NOXFRM;
	rth->fl.fl4_dst	= daddr;
	rth->rt_dst	= daddr;
	rth->fl.fl4_tos	= tos;
	rth->fl.mark    = skb->mark;
	rth->fl.fl4_src	= saddr;
	rth->rt_src	= saddr;
	rth->rt_gateway	= daddr;
	rth->rt_iif	=
		rth->fl.iif	= in_dev->dev->ifindex;
	rth->u.dst.dev	= (out_dev)->dev;
	dev_hold(rth->u.dst.dev);
	rth->idev	= in_dev_get(rth->u.dst.dev);
	rth->fl.oif	= 0;
	rth->rt_spec_dst = spec_dst;
	rth->u.dst.input = ip_forward;
	rth->u.dst.output = ip_output;
	rth->rt_genid = rt_genid(dev_net(rth->u.dst.dev));
	rt_set_nexthop(rth, res, itag);
	rth->rt_flags = flags;

For broadcast input or local input, the following code segment is executed instead:

local_input:
	rth = dst_alloc(&ipv4_dst_ops);
	if (!rth)
		goto e_nobufs;

	rth->u.dst.output = ip_rt_bug;
	rth->rt_genid = rt_genid(net);

	atomic_set(&rth->u.dst.__refcnt, 1);
	rth->u.dst.flags = DST_HOST;
	if (IN_DEV_CONF_GET(in_dev, NOPOLICY))
		rth->u.dst.flags |= DST_NOPOLICY;
	rth->fl.fl4_dst	= daddr;
	rth->rt_dst	= daddr;
	rth->fl.fl4_tos	= tos;
	rth->fl.mark    = skb->mark;
	rth->fl.fl4_src	= saddr;
	rth->rt_src	= saddr;
#ifdef CONFIG_NET_CLS_ROUTE
	rth->u.dst.tclassid = itag;
#endif
	rth->rt_iif	=
		rth->fl.iif	= dev->ifindex;
	rth->u.dst.dev	= net->loopback_dev;
	dev_hold(rth->u.dst.dev);
	rth->idev	= in_dev_get(rth->u.dst.dev);
	rth->rt_gateway	= daddr;
	rth->rt_spec_dst = spec_dst;
	rth->u.dst.input = ip_local_deliver;
	rth->rt_flags	= flags | RTCF_LOCAL;
	if (res.type == RTN_UNREACHABLE) {
		rth->u.dst.input = ip_error;
		rth->u.dst.error = -err;
		rth->rt_flags	&= ~RTCF_LOCAL;
	}
	rth->rt_type	= res.type;
	hash = rt_hash(daddr, saddr, fl.iif, rt_genid(net));
	err = rt_intern_hash(hash, rth, NULL, skb);
	goto done;
