Open vSwitch Research: Flow

Source: Internet
Author: User

struct sw_flow_key defines a unique flow. The structure is fairly complex; refer to the source code for the full definition.

sw_flow_key is divided into four parts, representing the switch, L2, L3, and L4 profiles respectively.

The switch profile is the phy struct, which holds the tunnel ID, priority, and input switch port. The Ethernet profile is the eth struct, which holds the source MAC, destination MAC, VLAN TCI, and protocol. The IP profile is the ip struct, which holds the IP protocol, TOS, and TTL. The L4 profile is the ipv4 or ipv6 struct, which holds the source IP and destination IP; if L4 is TCP or UDP, it also holds the source port and destination port, and if the packet is ARP, it also holds sha and tha.
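
The four parts described above can be pictured with a simplified userspace sketch. The struct name, field types, and helper functions below are assumptions chosen for illustration, and the IPv6 variant is omitted; the kernel source remains the authoritative definition.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

/* Simplified, illustrative layout only; not the kernel definition. */
struct flow_key_sketch {
	struct {
		uint64_t tun_id;	/* tunnel ID */
		uint32_t priority;	/* packet priority */
		uint16_t in_port;	/* input switch port */
	} phy;
	struct {
		uint8_t  src[ETH_ALEN];	/* source MAC */
		uint8_t  dst[ETH_ALEN];	/* destination MAC */
		uint16_t tci;		/* VLAN TCI */
		uint16_t type;		/* Ethernet protocol */
	} eth;
	struct {
		uint8_t proto;		/* IP protocol */
		uint8_t tos;
		uint8_t ttl;
	} ip;
	struct {
		uint32_t src;		/* source IPv4 address */
		uint32_t dst;		/* destination IPv4 address */
		union {
			struct {
				uint16_t src;	/* TCP/UDP source port */
				uint16_t dst;	/* TCP/UDP destination port */
			} tp;
			struct {
				uint8_t sha[ETH_ALEN];	/* ARP sender MAC */
				uint8_t tha[ETH_ALEN];	/* ARP target MAC */
			} arp;
		};
	} ipv4;
};

/* Zero the whole key first so padding bytes compare equal under memcmp. */
static void flow_key_init(struct flow_key_sketch *key)
{
	assert(key != NULL);
	memset(key, 0, sizeof(*key));
}

/* In this sketch two keys identify the same flow when every byte matches. */
static int flow_key_equal(const struct flow_key_sketch *a,
			  const struct flow_key_sketch *b)
{
	return memcmp(a, b, sizeof(*a)) == 0;
}
```

Sharing one union between the TCP/UDP ports and the ARP addresses works because a packet is never both at once.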

struct sw_flow represents a flow:

struct sw_flow {
	struct rcu_head rcu;
	struct hlist_node hash_node[2];
	u32 hash;

	struct sw_flow_key key;
	struct sw_flow_actions __rcu *sf_acts;

	atomic_t refcnt;
	bool dead;

	spinlock_t lock;	/* Lock for values below. */
	unsigned long used;	/* Last used time (in jiffies). */
	u64 packet_count;	/* Number of packets matched. */
	u64 byte_count;		/* Number of bytes matched. */
	u8 tcp_flags;		/* Union of seen TCP flags. */
};

hash_node and hash are used to place the flow in the hash table. In OVS, every packet is matched against a flow and the corresponding actions are executed; these actions are stored in struct sw_flow_actions *sf_acts. key uniquely identifies a flow. The trailing used, packet_count, and byte_count fields hold statistics.

Next we analyze the code in flow.c.

check_header: calls pskb_may_pull to make sure that len bytes of header are present in the linear data area of the skb. It returns 0 on success and an error code otherwise.

This check is used by arphdr_ok, check_iphdr, tcphdr_ok, udphdr_ok, and icmphdr_ok. Because IP and TCP headers can carry options, their checks need a little more code. Take check_iphdr as an example:

static int check_iphdr(struct sk_buff *skb)
{
	unsigned int nh_ofs = skb_network_offset(skb);
	unsigned int ip_len;
	int err;

	err = check_header(skb, nh_ofs + sizeof(struct iphdr));

nh_ofs is the offset of the IP header from skb->data, so this call makes sure that nh_ofs plus the minimum IP header length is available in the skb.

	if (unlikely(err))
		return err;

	ip_len = ip_hdrlen(skb);

ip_len is the length of the IP header including IP options, calculated from iphdr->ihl.

	if (unlikely(ip_len < sizeof(struct iphdr) ||
		     skb->len < nh_ofs + ip_len))
		return -EINVAL;

If the header length is below the minimum or skb->len is not long enough, return an error and exit.

	skb_set_transport_header(skb, nh_ofs + ip_len);

Set skb->transport_header to point at the L4 header.

	return 0;
}
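
The same checks can be tried outside the kernel on a raw byte buffer. The sketch below is a userspace analogue and not the OVS code: check_ipv4_header and its parameters are hypothetical names, and the buffer stands in for the skb.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IPV4_MIN_HDR_LEN 20	/* fixed IPv4 header, without options */

/*
 * buf/len: packet bytes; nh_ofs: offset of the IP header inside buf.
 * On success returns 0 and stores the L4 header offset in *l4_ofs.
 * Returns -1 if the packet is too short or the IHL field is invalid.
 */
static int check_ipv4_header(const uint8_t *buf, size_t len, size_t nh_ofs,
			     size_t *l4_ofs)
{
	size_t ip_len;

	assert(buf != NULL && l4_ofs != NULL);

	/* The fixed part of the header must be present. */
	if (len < nh_ofs + IPV4_MIN_HDR_LEN)
		return -1;

	/* IHL is the low nibble of the first byte, in 32-bit words. */
	ip_len = (size_t)(buf[nh_ofs] & 0x0f) * 4;

	if (ip_len < IPV4_MIN_HDR_LEN || len < nh_ofs + ip_len)
		return -1;

	*l4_ofs = nh_ofs + ip_len;
	return 0;
}
```

For an Ethernet frame without a VLAN tag, nh_ofs would be 14, and a header with IHL 6 places the L4 header at offset 38.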

ovs_flow_actions_alloc: allocates a sw_flow_actions structure. First, let's look at struct nlattr:

struct nlattr {
	uint16_t nla_len;
	uint16_t nla_type;
};

This is the header of a Netlink attribute (its length is NLA_HDRLEN). The linear space behind it is the payload returned by nla_data(), and the nla_len() helper returns the length of that payload. When sw_flow_actions is allocated, in addition to sizeof(struct sw_flow_actions) for the header, a linear space of nla_len() bytes is required.

Let's take a look at the data structure flex_array:

According to the comments in flex_array, the data structure was created to avoid large kmalloc allocations, which are prone to failure. flex_array is composed of multiple flex_array_parts, each of which occupies PAGE_SIZE bytes. You can therefore regard flex_array as a table of pointers to pages rather than one contiguous allocation.

struct flex_array {
	union {
		struct {
			int element_size;
			int total_nr_elements;
			int elems_per_part;
			u32 reciprocal_elems;
			struct flex_array_part *parts[];
		};
		/*
		 * This little trick makes sure that
		 * sizeof(flex_array) == PAGE_SIZE
		 */
		char padding[FLEX_ARRAY_BASE_SIZE];
	};
};

The flex_array structure itself, not counting its flex_array_parts, occupies exactly one page. From this we can calculate the maximum number of part pointers (the FLEX_ARRAY_NR_BASE_PTRS macro).
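
The calculation can be sketched in userspace as follows. The 4096-byte page size and all names ending in _sketch are assumptions for illustration; the struct only mirrors the fixed fields that precede the pointer table.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096	/* assumed page size */

struct part_sketch;		/* opaque: one page of elements */

/* Mirrors the fixed fields that precede the parts[] pointer table. */
struct flex_array_sketch {
	int element_size;
	int total_nr_elements;
	int elems_per_part;
	unsigned int reciprocal_elems;
	struct part_sketch *parts[];
};

/* Number of part pointers that fit in the rest of the base page. */
static size_t nr_base_ptrs(void)
{
	size_t bytes_left = SKETCH_PAGE_SIZE -
			    offsetof(struct flex_array_sketch, parts);

	return bytes_left / sizeof(struct part_sketch *);
}

/* Upper bound on elements: pointers in the base page times elements per page. */
static size_t max_elements(size_t element_size)
{
	assert(element_size > 0 && element_size <= SKETCH_PAGE_SIZE);
	return nr_base_ptrs() * (SKETCH_PAGE_SIZE / element_size);
}
```

With 4096-byte pages, 4-byte ints, and 8-byte pointers, the four fixed fields take 16 bytes, which leaves room for 510 part pointers.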

With that, flex_array_alloc is easy to follow. It takes element_size and total, and from them sets elems_per_part and total_nr_elements. If all the elements fit into the space left in the flex_array page itself, that memory is memset to 0x6c, which serves as a poison value; otherwise the parts must be allocated when they are used.

flex_array_free_parts frees all pages occupied by parts. flex_array_free calls flex_array_free_parts and additionally frees the flex_array struct itself.

__fa_get_part: if the flex_array_part pointer indexed by part_nr exists, it is returned. Otherwise, a page-sized flex_array_part is allocated with kmalloc and stored in flex_array->parts[part_nr].

flex_array_put: copies an element into the flex_array at index element_nr. The function checks whether the flex_array fits entirely in one page. If not, it locates the part page that corresponds to element_nr, calls index_inside_part to find the offset of the element, and finally copies the data into the flex_array.

flex_array_clear: clears element element_nr from the flex_array. Its memory is set to the poison value 0x6c.

flex_array_prealloc: given a start element and a number of elements, allocates the part pages for these elements in advance.

flex_array_get: returns the element at index element_nr in the flex_array.

flex_array_shrink: frees the pages of parts that hold no elements.
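
The element lookup that flex_array_put and flex_array_get rely on can be sketched as below. This is a minimal illustration assuming a 4096-byte page; locate_element and struct elem_location are hypothetical names, and plain division is used where the kernel's reciprocal_elems field suggests a multiplication-based shortcut.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096	/* assumed page size */

struct elem_location {
	size_t part_nr;		/* which page-sized part holds the element */
	size_t offset;		/* byte offset of the element inside that part */
};

/* Elements never straddle a page, so each part holds a whole number. */
static size_t elems_per_part(size_t element_size)
{
	assert(element_size > 0 && element_size <= SKETCH_PAGE_SIZE);
	return SKETCH_PAGE_SIZE / element_size;
}

/* Map an element index to its part number and offset inside the part. */
static struct elem_location locate_element(size_t element_size,
					   size_t element_nr)
{
	size_t per_part = elems_per_part(element_size);
	struct elem_location loc;

	loc.part_nr = element_nr / per_part;
	loc.offset = (element_nr % per_part) * element_size;
	return loc;
}
```

With 24-byte elements a part holds 170 of them, so element 171 lives in part 1 at byte offset 24.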

OVS uses a flex_array as an array of hlist_head entries to store the flow hash table. Flows with the same hash value hang off the same bucket. struct flow_table is defined as follows:

struct flow_table {
	struct flex_array *buckets;
	unsigned int count, n_buckets;
	struct rcu_head rcu;
	int node_ver;
	u32 hash_seed;
	bool keep_flows;
};

The core is buckets, a flex_array that holds the buckets of the hash table. All flows are stored in these buckets.

alloc_buckets: first allocates a flex_array for buckets, then preallocates the part pages for elements 0 through n_buckets - 1.

find_bucket: given a hash value, returns the hlist_head * stored at the corresponding index of the flex_array.

free_buckets: releases the pages that belong to the flex_array and its flex_array_parts.

The following operations are performed on the flow table:

ovs_flow_tbl_alloc: allocates a flow_table structure and calls alloc_buckets to allocate a flex_array for table->buckets.

ovs_flow_tbl_destroy: frees the flow_table and the corresponding flex_array space.

ovs_flow_tbl_next: returns the next flow in the flow_table.

ovs_flow_tbl_insert: the insert operation of the hash table.

ovs_flow_tbl_lookup: searches for a flow by its flow key. ovs_flow_hash calculates the hash value from the key and the key length, and find_bucket is then called to locate the hlist_head *. For each hlist_node under that hlist_head, if flow->key is the same, this flow is returned.

ovs_flow_tbl_remove: because the hlist_node is stored in the flow structure, hlist_del_rcu can be called directly.

ovs_flow_tbl_rehash, ovs_flow_tbl_expand: allocate a new flow_table structure. Because n_buckets changes, the hash values are recalculated, and flow_table_copy_flows is called to copy the flows from the old flow table to the new one.

flow_table_copy_flows: copies the flows from the old flow table to the new flow table and updates the node_ver of the flow table.
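
The insert and lookup paths described for ovs_flow_tbl_insert and ovs_flow_tbl_lookup can be sketched in userspace as below. This is a simplified illustration, not the OVS implementation: the types and function names are hypothetical, a fixed array replaces the flex_array, a singly linked chain replaces hlist, FNV-1a stands in for the real hash function, and there is no RCU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define N_BUCKETS 8	/* must be a power of two for the mask below */
#define KEY_LEN   16	/* stand-in for the size of struct sw_flow_key */

struct flow_sketch {
	struct flow_sketch *next;	/* chain of flows in the same bucket */
	uint32_t hash;
	uint8_t key[KEY_LEN];
};

struct table_sketch {
	struct flow_sketch *buckets[N_BUCKETS];
	unsigned int count;
};

/* FNV-1a, used here only as a stand-in for the real hash function. */
static uint32_t key_hash(const uint8_t *key, size_t len)
{
	uint32_t h = 2166136261u;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= key[i];
		h *= 16777619u;
	}
	return h;
}

/* Pick the bucket by masking the hash with the table size. */
static struct flow_sketch **find_bucket(struct table_sketch *t, uint32_t hash)
{
	return &t->buckets[hash & (N_BUCKETS - 1)];
}

static void table_insert(struct table_sketch *t, struct flow_sketch *flow)
{
	struct flow_sketch **head;

	assert(t != NULL && flow != NULL);
	flow->hash = key_hash(flow->key, KEY_LEN);
	head = find_bucket(t, flow->hash);
	flow->next = *head;	/* push onto the front of the chain */
	*head = flow;
	t->count++;
}

/* Walk the bucket chain and return the flow whose key matches, if any. */
static struct flow_sketch *table_lookup(struct table_sketch *t,
					const uint8_t *key)
{
	uint32_t hash = key_hash(key, KEY_LEN);
	struct flow_sketch *flow;

	for (flow = *find_bucket(t, hash); flow; flow = flow->next)
		if (flow->hash == hash &&
		    memcmp(flow->key, key, KEY_LEN) == 0)
			return flow;
	return NULL;
}
```

Comparing the cached hash before calling memcmp is a cheap way to skip most non-matching entries in a shared bucket.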

ovs_flow_extract: parses the skb and fills in the sw_flow_key.

OVS can also exchange flow information over Netlink. The following functions convert between flows and Netlink messages.

Let's first take a look at how Netlink encapsulation is implemented.

linux/netlink.h contains the Linux definitions for Netlink. Netlink communicates over Netlink sockets, and its message format is as follows:

<----- nlmsg_total_size(payload) ---->
<--- nlmsg_msg_size(payload) -->
+----------+-----+-------------+-----+----------
| nlmsghdr | pad |   payload   | pad | nlmsghdr ...
+----------+-----+-------------+-----+----------

The Netlink message header and payload must be 4-byte aligned. nlmsg_total_size is the length of nlmsghdr + payload after alignment, and nlmsg_msg_size is the length of nlmsghdr + payload before alignment. nlmsg_data(nlmsghdr *) returns the starting position of the payload, and nlmsg_next(nlmsghdr *) returns the starting position of the next Netlink message.
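
The size arithmetic can be sketched as follows. The sk_ prefixed names are hypothetical stand-ins for the kernel helpers, and the 16-byte header size is an assumption about struct nlmsghdr.

```c
#include <assert.h>
#include <stddef.h>

#define SK_ALIGNTO	4u	/* Netlink alignment unit */
#define SK_ALIGN(len)	(((len) + SK_ALIGNTO - 1) & ~(SK_ALIGNTO - 1))
#define SK_NLMSGHDR_LEN	16u	/* assumed sizeof(struct nlmsghdr) */

/* Offset of the payload from the start of the message. */
static unsigned int sk_nlmsg_data_offset(void)
{
	return SK_ALIGN(SK_NLMSGHDR_LEN);
}

/* Aligned header plus payload, without trailing padding. */
static unsigned int sk_nlmsg_msg_size(unsigned int payload)
{
	/* Guard against unsigned overflow in the additions below. */
	assert(payload <= 0xffffffffu - SK_NLMSGHDR_LEN - SK_ALIGNTO);
	return sk_nlmsg_data_offset() + payload;
}

/* Same length rounded up so that the next message starts aligned. */
static unsigned int sk_nlmsg_total_size(unsigned int payload)
{
	return SK_ALIGN(sk_nlmsg_msg_size(payload));
}
```

For a 5-byte payload this gives a message size of 21 bytes and a total size of 24 bytes, so the next header starts 24 bytes after the current one.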

Payload format:

<---- hdrlen --->      <-- nlmsg_attrlen ->
+---------------+-----+------------------+
| family header | pad |    attributes    |
+---------------+-----+------------------+
                      ^-- nlmsg_attrdata(nlmsghdr *, hdrlen)

The payload length can be obtained as nlmsghdr->nlmsg_len - NLMSG_HDRLEN, which shows that nlmsghdr->nlmsg_len is NLMSG_HDRLEN plus the unaligned payload length. nlmsg_attrlen(nlmsghdr, hdrlen) is the payload length minus the aligned hdrlen. nlmsg_attrdata(nlmsghdr *, hdrlen) returns the location of the first attribute.

nlmsg_new: calls alloc_skb to create an skb whose linear space is nlmsg_total_size(payload) bytes.

nlmsg_put: adds a Netlink message to the skb, provided that skb_tailroom at the end of the skb can hold nlmsg_total_size(payload) bytes. It calls __nlmsg_put, which uses skb_put to extend the tail of the skb linear space by the room needed for a message of NLMSG_LENGTH(payload) bytes.

nlmsg_get_pos: returns skb_tail_pointer(skb).

nlmsg_trim: calls skb_trim to trim the skb linear space. It only moves the skb->tail pointer (and updates skb->len accordingly), which is very different from pskb_trim.

After the aligned hdrlen bytes, the payload is an array of attributes. Each attribute starts with a struct nlattr header:

struct nlattr {
	__u16 nla_len;
	__u16 nla_type;
};

The structure of the attributes array is as follows:

<------ nla_total_size(payload) ----->
<--- nla_attr_size(payload) --->
+--------+-----+---------------+-----+--------
| header | pad |    payload    | pad | header
+--------+-----+---------------+-----+--------
                <- nla_len() ->

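The nlattr->nla_len field holds nla_attr_size(payload), that is, the header length plus the unaligned payload length. The nla_len() helper shown under the payload in the diagram returns the payload length only.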
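
The attribute layout can be exercised with a small userspace sketch that appends one attribute to a byte buffer. The sk_ prefixed names are hypothetical; the code only mirrors the sizes described above and is not the kernel implementation of nla_put.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SK_NLA_ALIGN(len)	(((len) + 3u) & ~3u)
#define SK_NLA_HDRLEN		4u	/* two 16-bit fields: nla_len, nla_type */

struct sk_nlattr {
	uint16_t nla_len;	/* header + payload, padding excluded */
	uint16_t nla_type;
};

static unsigned int sk_nla_attr_size(unsigned int payload)
{
	return SK_NLA_HDRLEN + payload;
}

static unsigned int sk_nla_total_size(unsigned int payload)
{
	return SK_NLA_ALIGN(sk_nla_attr_size(payload));
}

/*
 * Append one attribute at buf + *used. Returns 0 on success, -1 if the
 * payload is too long for nla_len or the buffer (of size cap) is too small.
 */
static int sk_nla_put(uint8_t *buf, unsigned int cap, unsigned int *used,
		      uint16_t type, const void *data, uint16_t len)
{
	struct sk_nlattr hdr;
	unsigned int total = sk_nla_total_size(len);

	assert(buf != NULL && used != NULL && *used <= cap);
	if (len > 0xffffu - SK_NLA_HDRLEN || cap - *used < total)
		return -1;

	hdr.nla_len = (uint16_t)sk_nla_attr_size(len);
	hdr.nla_type = type;
	memcpy(buf + *used, &hdr, sizeof(hdr));
	if (len)
		memcpy(buf + *used + SK_NLA_HDRLEN, data, len);
	/* Zero the padding so no stale bytes end up in the message. */
	memset(buf + *used + hdr.nla_len, 0, total - hdr.nla_len);
	*used += total;
	return 0;
}
```

A 5-byte payload yields nla_len = 9 and consumes 12 bytes of the buffer, the last three being padding.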

nla_reserve: reserves nla_total_size(payload) bytes of linear space in the skb.

nla_put: in addition to reserving the space, copies the data into the attribute.

nla_append: appends attribute data to the linear space at skb->tail.

nla_put_xxx: uses a value of type xxx as the payload of an attribute and adds it to the linear space of the skb.

nla_put_flag: adds an attribute that carries only a type and no payload.

nlmsg_find_attr: calls nla_find to find, in the payload of the nlmsg, the struct nlattr * with the given attrtype.

nlmsg_validate: calls nla_validate to verify that the data in the linear space of attributes is valid. For each attribute in the attribute stream, nla_validate calls validate_nla, which checks it against nla_policy. Validity here mainly means that the data type and length match.

nlmsg_parse: calls nla_parse. nla_parse takes the linear memory of the attributes as input and parses the attributes into an array of struct nlattr *. The array size is determined by the number of attribute types.

ovs_flow_to_nlattrs: for each flow member, calls nla_put_uxx or nla_reserve and copies the flow member data into the corresponding attribute.

ovs_flow_from_nlattrs: first calls parse_flow_nlattrs to parse the Netlink message into an array of struct nlattr *. The array size is __OVS_KEY_ATTR_MAX. For details, see enum ovs_key_attr.

enum ovs_key_attr {
	OVS_KEY_ATTR_UNSPEC,
	OVS_KEY_ATTR_ENCAP,	/* Nested set of encapsulated attributes. */
	OVS_KEY_ATTR_PRIORITY,	/* u32 skb->priority */
	OVS_KEY_ATTR_IN_PORT,	/* u32 OVS dp port number */
	OVS_KEY_ATTR_ETHERNET,	/* struct ovs_key_ethernet */
	OVS_KEY_ATTR_VLAN,	/* be16 VLAN TCI */
	OVS_KEY_ATTR_ETHERTYPE,	/* be16 Ethernet type */
	OVS_KEY_ATTR_IPV4,	/* struct ovs_key_ipv4 */
	OVS_KEY_ATTR_IPV6,	/* struct ovs_key_ipv6 */
	OVS_KEY_ATTR_TCP,	/* struct ovs_key_tcp */
	OVS_KEY_ATTR_UDP,	/* struct ovs_key_udp */
	OVS_KEY_ATTR_ICMP,	/* struct ovs_key_icmp */
	OVS_KEY_ATTR_ICMPV6,	/* struct ovs_key_icmpv6 */
	OVS_KEY_ATTR_ARP,	/* struct ovs_key_arp */
	OVS_KEY_ATTR_ND,	/* struct ovs_key_nd */
	OVS_KEY_ATTR_TUN_ID = 63,	/* be64 tunnel ID */
	__OVS_KEY_ATTR_MAX
};

As you can see, each flow member has a corresponding enum ovs_key_attr value, and that value is the nla_type in nlattr. parse_flow_nlattrs reads the nla_type of each nlattr and stores the struct nlattr * that points at the start of that attribute in a[type]. At the same time, a bit is set for each type to indicate that it was parsed successfully.

Based on the types marked as parsed, nla_get_xxx is called to read the data from a[type]. For data that is not a simple type, nla_data is used to take the data from a[type]. The results are combined into a sw_flow_key.
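
The parsing step can be sketched in userspace as below: walk the attribute stream, record a pointer per type, and set one bit per type seen. The sk_ prefixed names are hypothetical, and rejecting unknown or duplicate types is an assumption of this sketch rather than something the text above states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_NLA_ALIGN(len)	(((len) + 3u) & ~3u)
#define SK_NLA_HDRLEN		4u
#define SK_ATTR_MAX		63u	/* highest attribute type accepted here */

struct sk_nlattr {
	uint16_t nla_len;	/* header + payload */
	uint16_t nla_type;
};

/*
 * Walk the attribute stream in buf[0..len). For each attribute store a
 * pointer to its header in a[type] and set bit 'type' in *attrs.
 * Returns 0 on success, -1 on a malformed, unknown or duplicate attribute.
 */
static int sk_parse_attrs(const uint8_t *buf, size_t len,
			  const uint8_t *a[SK_ATTR_MAX + 1], uint64_t *attrs)
{
	size_t ofs = 0;

	assert(buf != NULL && a != NULL && attrs != NULL);
	memset(a, 0, sizeof(a[0]) * (SK_ATTR_MAX + 1));
	*attrs = 0;

	while (ofs < len) {
		struct sk_nlattr hdr;
		size_t step;

		if (len - ofs < SK_NLA_HDRLEN)
			return -1;
		memcpy(&hdr, buf + ofs, sizeof(hdr));
		if (hdr.nla_len < SK_NLA_HDRLEN || hdr.nla_len > len - ofs)
			return -1;
		if (hdr.nla_type > SK_ATTR_MAX ||
		    (*attrs & (UINT64_C(1) << hdr.nla_type)))
			return -1;

		a[hdr.nla_type] = buf + ofs;
		*attrs |= UINT64_C(1) << hdr.nla_type;

		/* The last attribute may omit its trailing padding. */
		step = SK_NLA_ALIGN(hdr.nla_len);
		if (step > len - ofs)
			step = len - ofs;
		ofs += step;
	}
	return 0;
}

/* The payload of a parsed attribute starts right after the 4-byte header. */
static const uint8_t *sk_nla_data(const uint8_t *attr)
{
	return attr + SK_NLA_HDRLEN;
}
```

A caller can then test bits in the mask to decide which entries of a[] to read, and copy each payload into the matching field of the flow key.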
