struct sw_flow_key uniquely identifies a flow. The structure is fairly complex; refer to the source code for the full definition.
sw_flow_key is divided into four parts, describing the switch, L2, L3, and L4 profiles respectively.
The switch profile is struct phy, holding the tunnel ID, priority, and input switch port. The Ethernet profile is struct eth, holding the source MAC, destination MAC, VLAN TCI, and protocol. The IP profile is struct ip, holding the IP protocol, TOS, and TTL. The L4 profile is struct ipv4 / struct ipv6, holding the source and destination IP addresses; for TCP/UDP it also holds the source and destination ports, and for ARP it also holds sha and tha.
struct sw_flow represents a flow:
struct sw_flow {
	struct rcu_head rcu;
	struct hlist_node hash_node[2];
	u32 hash;
	struct sw_flow_key key;
	struct sw_flow_actions __rcu *sf_acts;
	atomic_t refcnt;
	bool dead;
	spinlock_t lock;	/* Lock for values below. */
	unsigned long used;	/* Last used time (in jiffies). */
	u64 packet_count;	/* Number of packets matched. */
	u64 byte_count;		/* Number of bytes matched. */
	u8 tcp_flags;		/* Union of seen TCP flags. */
};
hash_node and hash are used to place the flow in the hash table. In OVS, every packet is matched against a flow and that flow's actions are executed; the actions are stored in struct sw_flow_actions *sf_acts. key uniquely identifies the flow. The trailing used, packet_count, and byte_count fields are statistics.
Next we analyze the code in flow.c.
check_header: calls pskb_may_pull to make sure that the first len bytes of the skb are available in its linear data area. It returns 0 on success and an error code otherwise.
This check is used by arphdr_ok, check_iphdr, tcphdr_ok, udphdr_ok, and icmphdr_ok. IP and TCP headers can carry options, so their checks need a little more code. Take check_iphdr as an example:
static int check_iphdr(struct sk_buff *skb)
{
	unsigned int nh_ofs = skb_network_offset(skb);
	unsigned int ip_len;
	int err;
	err = check_header(skb, nh_ofs + sizeof(struct iphdr));
nh_ofs is the offset of the IP header from skb->data, so this call makes sure that nh_ofs plus the length of a basic IP header is available in the linear data of the skb.
	if (unlikely(err))
		return err;
	ip_len = ip_hdrlen(skb);
ip_len is the length of the IP header including IP options, computed from iphdr->ihl.
	if (unlikely(ip_len < sizeof(struct iphdr) ||
		     skb->len < nh_ofs + ip_len))
		return -EINVAL;
If the header length is invalid or skb->len is not long enough, return an error and exit.
	skb_set_transport_header(skb, nh_ofs + ip_len);
Set skb->transport_header to point at the L4 header.
	return 0;
}
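The same length checks can be tried outside the kernel on a plain byte buffer. The following is only a minimal userspace sketch: ipv4_l4_offset and IPV4_MIN_HDR_LEN are hypothetical names, not OVS or kernel functions, and the sk_buff handling (pskb_may_pull, transport_header) is replaced by a buffer pointer and a length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IPV4_MIN_HDR_LEN 20u  /* size of an IPv4 header without options */

/*
 * Hypothetical helper (not an OVS function): given a packet buffer and the
 * offset of the IPv4 header, validate the header length and return the
 * offset of the L4 header, or -1 if the packet is malformed or too short.
 */
int ipv4_l4_offset(const uint8_t *pkt, size_t pkt_len, size_t nh_ofs)
{
	size_t ip_len;

	assert(pkt != NULL);

	/* The fixed 20-byte header must be present before IHL can be read. */
	if (pkt_len < nh_ofs + IPV4_MIN_HDR_LEN)
		return -1;

	/* IHL is the low nibble of the first byte, counted in 32-bit words. */
	ip_len = (size_t)(pkt[nh_ofs] & 0x0f) * 4;

	/* Reject an IHL below the minimum or one that overruns the packet. */
	if (ip_len < IPV4_MIN_HDR_LEN || pkt_len < nh_ofs + ip_len)
		return -1;

	return (int)(nh_ofs + ip_len);
}
```

For a frame with a 14-byte Ethernet header and an IHL of 6, the L4 header starts at offset 14 + 24 = 38.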
ovs_flow_actions_alloc: allocates a sw_flow_actions structure. First, let's look at struct nlattr:
struct nlattr {
	uint16_t nla_len;
	uint16_t nla_type;
};
This is the header of a Netlink attribute (its length is NLA_HDRLEN). The linear space that follows it is the data part, returned by nla_data, and nla_len() gives the length of that data part. When a sw_flow_actions is allocated, it therefore needs, in addition to sizeof(struct sw_flow_actions) for the header, a linear space of nla_len bytes.
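The "header plus trailing linear space" allocation pattern can be sketched in userspace with a flexible array member. This is an illustration only: struct actions_buf and actions_buf_alloc are hypothetical stand-ins, malloc replaces kmalloc, and the real sw_flow_actions has fields not shown here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for struct sw_flow_actions: a length followed by a
 * flexible array member that holds the copied attribute payload. */
struct actions_buf {
	uint32_t len;          /* number of payload bytes that follow */
	unsigned char data[];  /* payload bytes */
};

/* Hypothetical helper: allocate header and payload in one block and copy
 * the payload in. Returns NULL on allocation failure. */
struct actions_buf *actions_buf_alloc(const void *payload, uint32_t len)
{
	struct actions_buf *buf;

	assert(payload != NULL || len == 0);

	/* One allocation covers the fixed header and the variable payload. */
	buf = malloc(sizeof(*buf) + len);
	if (!buf)
		return NULL;

	buf->len = len;
	if (len)
		memcpy(buf->data, payload, len);
	return buf;
}
```

A single allocation keeps the header and its data contiguous, so one free() releases both.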
Let's take a look at the flex_array data structure:
According to the comments in flex_array, the data structure exists to avoid failures of large kmalloc allocations. A flex_array is made up of multiple flex_array_parts, each of which occupies PAGE_SIZE bytes, so you can regard a flex_array as a table of pointers to pages.
struct flex_array {
	union {
		struct {
			int element_size;
			int total_nr_elements;
			int elems_per_part;
			u32 reciprocal_elems;
			struct flex_array_part *parts[];
		};
		/*
		 * This little trick makes sure that
		 * sizeof(flex_array) == PAGE_SIZE
		 */
		char padding[FLEX_ARRAY_BASE_SIZE];
	};
};
The flex_array structure itself, not counting the flex_array_parts, occupies at most one page, so the maximum number of parts can be computed from it (the FLEX_ARRAY_NR_BASE_PTRS macro).
With that, flex_array_alloc is easy to follow. It takes element_size and total and computes elems_per_part and total_nr_elements. If all the elements fit in the space left over in the flex_array's own page, that memory is memset to 0x6c as poison; otherwise the parts are allocated later, when they are first used.
flex_array_free_parts frees all pages occupied by the parts. flex_array_free calls flex_array_free_parts and then also frees the flex_array structure itself.
__fa_get_part: if the flex_array_part pointer for part_nr exists, it is returned. Otherwise a page-sized flex_array_part is allocated with kmalloc and stored in flex_array->parts[part_nr].
flex_array_put: copies an element into slot element_nr of the flex_array. The function first checks whether the flex_array fits entirely in one page. If not, it locates the part that holds element_nr, calls index_inside_part to find the element's offset within that part, and finally copies the data in.
flex_array_clear: clears element element_nr by overwriting its memory with the poison value 0x6c.
flex_array_prealloc: given a start element and a number of elements, allocates the parts that back those elements in advance.
flex_array_get: returns the element at index element_nr in the flex_array.
flex_array_shrink: frees the pages of parts that contain no elements.
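The index arithmetic behind locating an element (which part, and which offset inside that part) can be sketched as follows. This is a userspace illustration under assumptions: PART_SIZE is fixed at 4096, and struct elem_pos and locate_element are hypothetical names rather than flex_array functions.

```c
#include <assert.h>
#include <stddef.h>

#define PART_SIZE 4096u  /* assumed page size for this sketch */

/* Where an element lives in a two-level, page-per-part array. */
struct elem_pos {
	size_t part_nr;      /* which part (page) holds the element */
	size_t byte_offset;  /* byte offset of the element inside that part */
};

/* Hypothetical helper mirroring the index arithmetic described above. */
struct elem_pos locate_element(size_t element_size, size_t element_nr)
{
	struct elem_pos pos;
	size_t elems_per_part;

	assert(element_size > 0 && element_size <= PART_SIZE);

	/* Elements never straddle two parts, so each part holds a whole
	 * number of elements and any remainder of the page is unused. */
	elems_per_part = PART_SIZE / element_size;
	pos.part_nr = element_nr / elems_per_part;
	pos.byte_offset = (element_nr % elems_per_part) * element_size;
	return pos;
}
```

With 24-byte elements, each part holds 170 elements (4080 bytes), so element 171 lands in part 1 at byte offset 24.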
OVS uses a flex_array as an array of hlist_head * to store the flow hash table. Flows with the same hash value hang off the same bucket. struct flow_table looks like this:
struct flow_table {
	struct flex_array *buckets;
	unsigned int count, n_buckets;
	struct rcu_head rcu;
	int node_ver;
	u32 hash_seed;
	bool keep_flows;
};
The core is buckets, a flex_array that holds the buckets of the hash table. All flows are stored in these buckets.
alloc_buckets: allocates a flex_array for the buckets, then preallocates the part pages for elements 0 through n_buckets.
find_bucket: given a hash value, returns the hlist_head * at the corresponding index of the flex_array.
free_buckets: releases the flex_array and the pages of its flex_array_parts.
The following operations work on the flow table:
ovs_flow_tbl_alloc: allocates a flow_table structure and calls alloc_buckets to assign a flex_array to table->buckets.
ovs_flow_tbl_destroy: frees the flow_table and the space of its flex_array.
ovs_flow_tbl_next: gets the next flow in the flow_table.
ovs_flow_tbl_insert: the insert operation of the hash table.
ovs_flow_tbl_lookup: searches for a flow by flow key. ovs_flow_hash computes the hash value from the key and the key length, then find_bucket locates the hlist_head *. For each hlist_node in that hlist_head, if flow->key matches, the flow is returned.
ovs_flow_tbl_remove: because the hlist_node is stored in the flow structure itself, hlist_del_rcu can be called directly.
ovs_flow_tbl_rehash, ovs_flow_tbl_expand: allocate a new flow_table structure. Because n_buckets may change, the bucket of each flow is recomputed, and flow_table_copy_flows is called to copy the flows from the old flow table into the new one.
flow_table_copy_flows: copies the flows in the old flow table into the new flow table and updates the node_ver of the flow table.
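The insert and lookup operations above follow the usual chained hash table pattern. Below is a minimal userspace sketch of that pattern, not the OVS implementation: the toy_* names and key fields are hypothetical, a plain singly linked chain replaces hlist and RCU, a fixed array replaces flex_array, and FNV-1a stands in for whatever hash OVS uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define N_BUCKETS 8u  /* power of two, so a mask selects the bucket */

/* Simplified stand-ins for sw_flow_key and sw_flow (fields are assumptions). */
struct toy_key {
	uint32_t in_port;
	uint32_t src_ip;
	uint32_t dst_ip;
};

struct toy_flow {
	struct toy_flow *next;  /* chain of flows in the same bucket */
	uint32_t hash;
	struct toy_key key;
};

struct toy_table {
	struct toy_flow *buckets[N_BUCKETS];
	unsigned int count;
};

/* FNV-1a over the key bytes; a stand-in, not the hash OVS uses. */
uint32_t toy_hash(const struct toy_key *key)
{
	const unsigned char *p = (const unsigned char *)key;
	uint32_t h = 2166136261u;
	size_t i;

	for (i = 0; i < sizeof(*key); i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

/* Hash the key and push the flow onto the head of its bucket chain. */
void toy_insert(struct toy_table *t, struct toy_flow *flow)
{
	struct toy_flow **head;

	assert(t != NULL && flow != NULL);
	flow->hash = toy_hash(&flow->key);
	head = &t->buckets[flow->hash & (N_BUCKETS - 1)];
	flow->next = *head;
	*head = flow;
	t->count++;
}

/* Walk the bucket chain and return the flow whose key matches, or NULL. */
struct toy_flow *toy_lookup(const struct toy_table *t, const struct toy_key *key)
{
	uint32_t hash = toy_hash(key);
	struct toy_flow *f;

	for (f = t->buckets[hash & (N_BUCKETS - 1)]; f; f = f->next)
		if (f->hash == hash && !memcmp(&f->key, key, sizeof(*key)))
			return f;
	return NULL;
}
```

Comparing the stored hash before the full key is a cheap filter; the memcmp on the key is what actually decides a match.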
ovs_flow_extract: parses the skb and fills in the sw_flow_key.
OVS can also exchange flow information over Netlink. The following functions convert between flows and Netlink messages.
First, let's take a look at how Netlink messages are laid out.
linux/netlink.h holds Linux's Netlink definitions. Netlink communicates over netlink sockets, and its message format is as follows:
 <------- nlmsg_total_size(payload) ------->
 <---- nlmsg_msg_size(payload) ---->
+------------+-----+----------------+-----+------------
|  nlmsghdr  | pad |    payload     | pad | nlmsghdr...
+------------+-----+----------------+-----+------------
The Netlink message header and payload must be 4-byte aligned. nlmsg_total_size is the length of nlmsghdr + payload after alignment (padding included), and nlmsg_msg_size is the same length without alignment. nlmsg_data(nlmsghdr *) returns the start of the payload, and nlmsg_next(nlmsghdr *) returns the start of the next Netlink message.
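The size arithmetic can be checked with a few lines of userspace C. This is a sketch with hypothetical sk_* names, written out by hand rather than using the kernel helpers; the only facts assumed are the 4-byte alignment and the 16-byte size of struct nlmsghdr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_NLMSG_HDRLEN 16u  /* sizeof(struct nlmsghdr), already a multiple of 4 */

/* Round len up to the next multiple of 4, the Netlink alignment unit. */
size_t sk_align4(size_t len)
{
	assert(len <= SIZE_MAX - 3);
	return (len + 3) & ~(size_t)3;
}

/* Header plus payload, no trailing padding (the value stored in nlmsg_len). */
size_t sk_msg_size(size_t payload)
{
	return SK_NLMSG_HDRLEN + payload;
}

/* Header plus payload plus trailing padding: the space one message occupies. */
size_t sk_total_size(size_t payload)
{
	return sk_align4(sk_msg_size(payload));
}

/* Offset from the start of one message to the start of the next one. */
size_t sk_next_offset(size_t nlmsg_len)
{
	return sk_align4(nlmsg_len);
}
```

With a 5-byte payload the unaligned size is 21 and the total size is 24, so 3 padding bytes follow the payload and the next message starts at offset 24.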
Payload format:
 <------ hdrlen ------>       <- nlmsg_attrlen ->
+----------------------+-----+---------------------+
|    family header     | pad |     attributes      |
+----------------------+-----+---------------------+
                              ^-- nlmsg_attrdata(nlmsghdr *, hdrlen)
The payload length is nlmsghdr->nlmsg_len - NLMSG_HDRLEN; in other words, nlmsghdr->nlmsg_len covers NLMSG_HDRLEN plus the unaligned payload length. nlmsg_attrlen(nlmsghdr, hdrlen) is the payload length minus the aligned hdrlen. nlmsg_attrdata(nlmsghdr *, hdrlen) returns the location of the first attribute header.
nlmsg_new: calls alloc_skb to create a linear skb of size nlmsg_total_size(payload).
nlmsg_put: prepares space for a Netlink message of nlmsg_total_size(payload) bytes in the tail room (skb_tailroom) of the skb. The underlying __nlmsg_put calls skb_put to extend the linear data area of the skb by NLMSG_LENGTH(payload) bytes from its tail.
nlmsg_get_pos: returns skb_tail_pointer(skb).
nlmsg_trim: calls skb_trim to trim the linear data area of the skb. It only operates on the skb->tail pointer, which is very different from pskb_trim.
Apart from the aligned family header of hdrlen bytes, the payload is an array of attributes, each starting with a struct nlattr:
struct nlattr {
	__u16 nla_len;
	__u16 nla_type;
};
The structure of the attributes array is as follows:
 <------- nla_total_size(payload) ------->
 <---- nla_attr_size(payload) ---->
+----------+-----+-----------------+-----+----------
|  header  | pad |     payload     | pad |  header
+----------+-----+-----------------+-----+----------
                  <-nla_len(nla)->
nlattr->nla_len holds nla_attr_size(payload), that is, the header plus the unpadded payload.
nla_reserve: reserves nla_total_size(payload) bytes of linear space in the skb.
nla_put: in addition to reserving the space, copies the data into the attribute.
nla_append: appends attribute data to the linear space at skb->tail.
nla_put_xxx: adds an attribute whose payload is a value of type xxx to the linear space of the skb.
nla_put_flag: adds an attribute with no payload, so only the type of the nlattr is set.
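The reserve-then-copy pattern of nla_reserve and nla_put can be sketched in userspace as follows. Everything named sk_* is hypothetical: a fixed 256-byte buffer stands in for the skb tail room, and the header layout simply mirrors struct nlattr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_ATTR_HDRLEN 4u  /* two 16-bit fields, like struct nlattr */

struct sk_attr {
	uint16_t len;   /* header + payload, padding not included */
	uint16_t type;
};

/* Hypothetical fixed-capacity buffer standing in for the skb tail room. */
struct sk_buf {
	unsigned char data[256];
	size_t used;
};

size_t sk_attr_size(size_t payload)
{
	return SK_ATTR_HDRLEN + payload;
}

size_t sk_attr_total_size(size_t payload)
{
	return (sk_attr_size(payload) + 3) & ~(size_t)3;
}

/* Reserve room for one attribute, write its header, and zero the padding.
 * Returns a pointer to the payload area, or NULL if there is no room. */
void *sk_attr_reserve(struct sk_buf *b, uint16_t type, size_t payload)
{
	struct sk_attr hdr;
	unsigned char *start;
	size_t total = sk_attr_total_size(payload);

	assert(b != NULL && b->used <= sizeof(b->data));
	if (payload > UINT16_MAX - SK_ATTR_HDRLEN ||
	    total > sizeof(b->data) - b->used)
		return NULL;

	start = b->data + b->used;
	hdr.len = (uint16_t)sk_attr_size(payload);
	hdr.type = type;
	memcpy(start, &hdr, sizeof(hdr));
	memset(start + sk_attr_size(payload), 0, total - sk_attr_size(payload));
	b->used += total;
	return start + SK_ATTR_HDRLEN;
}

/* Reserve the attribute and copy the payload in. Returns 0 or -1. */
int sk_attr_put(struct sk_buf *b, uint16_t type, const void *data, size_t payload)
{
	void *dst = sk_attr_reserve(b, type, payload);

	if (!dst)
		return -1;
	if (payload)
		memcpy(dst, data, payload);
	return 0;
}
```

Putting a 2-byte value consumes 8 bytes of the buffer (4 header, 2 payload, 2 padding), while the stored length field is 6. A flag-style attribute is simply a put with a payload of 0 bytes.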
nlmsg_find_attr: calls nla_find to search the payload of the nlmsg for the struct nlattr * with the given attrtype.
nlmsg_validate: calls nla_validate to verify that the attribute data in the linear space of attributes is valid. For each attribute in the attribute stream, nla_validate calls validate_nla, which checks it against the nla_policy. Validity here is mainly about whether the data type and length match.
nlmsg_parse: calls nla_parse, which takes the linear memory holding the attributes, parses each attribute in it, and stores the result in an array of struct nlattr *. The array size is determined by the number of attribute types.
ovs_flow_to_nlattrs: for each member of the flow key, repeatedly calls nla_put_uxx and nla_reserve and copies the member data into the corresponding attribute.
ovs_flow_from_nlattrs: first calls parse_flow_nlattrs to parse the Netlink message into an array of struct nlattr *. The array has __OVS_KEY_ATTR_MAX entries; for details, see enum ovs_key_attr:
enum ovs_key_attr {
	OVS_KEY_ATTR_UNSPEC,
	OVS_KEY_ATTR_ENCAP,	/* Nested set of encapsulated attributes. */
	OVS_KEY_ATTR_PRIORITY,	/* u32 skb->priority */
	OVS_KEY_ATTR_IN_PORT,	/* u32 OVS dp port number */
	OVS_KEY_ATTR_ETHERNET,	/* struct ovs_key_ethernet */
	OVS_KEY_ATTR_VLAN,	/* be16 VLAN TCI */
	OVS_KEY_ATTR_ETHERTYPE,	/* be16 Ethernet type */
	OVS_KEY_ATTR_IPV4,	/* struct ovs_key_ipv4 */
	OVS_KEY_ATTR_IPV6,	/* struct ovs_key_ipv6 */
	OVS_KEY_ATTR_TCP,	/* struct ovs_key_tcp */
	OVS_KEY_ATTR_UDP,	/* struct ovs_key_udp */
	OVS_KEY_ATTR_ICMP,	/* struct ovs_key_icmp */
	OVS_KEY_ATTR_ICMPV6,	/* struct ovs_key_icmpv6 */
	OVS_KEY_ATTR_ARP,	/* struct ovs_key_arp */
	OVS_KEY_ATTR_ND,	/* struct ovs_key_nd */
	OVS_KEY_ATTR_TUN_ID = 63,	/* be64 tunnel ID */
	__OVS_KEY_ATTR_MAX
};
As you can see, each flow member has a corresponding enum ovs_key_attr value, and that value is the nla_type in nlattr. parse_flow_nlattrs reads the nla_type of each nlattr, stores the struct nlattr * that points at the start of that attribute in a[type], and sets a bit for each type to mark that it was parsed successfully.
Based on the types recorded in that bitmask, nla_get_xxxx is called to fetch the data from a[type]. For non-scalar data, nla_data is used to take the data from a[type]. Together these values are assembled into a sw_flow_key.
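The "array indexed by type plus a bitmask of seen types" idea can be sketched in userspace like this. The names are hypothetical, the header layout mirrors struct nlattr, and rejecting out-of-range or duplicate types is a choice of this sketch rather than a statement about what parse_flow_nlattrs does in every version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_ATTR_MAX 63u    /* highest attribute type this sketch accepts */
#define SK_ATTR_HDRLEN 4u

/* Same layout as struct nlattr: length (header + payload), then type. */
struct sk_attr {
	uint16_t len;
	uint16_t type;
};

/*
 * Walk a buffer of attributes, store a pointer to each attribute header in
 * a[type], and set bit `type` in *seen. Returns 0 on success, -1 on a
 * malformed, out-of-range, or duplicate attribute.
 */
int sk_parse_attrs(const unsigned char *buf, size_t buf_len,
		   const unsigned char **a, uint64_t *seen)
{
	size_t ofs = 0;

	assert(buf != NULL && a != NULL && seen != NULL);
	memset(a, 0, sizeof(a[0]) * (SK_ATTR_MAX + 1));
	*seen = 0;

	while (buf_len - ofs >= SK_ATTR_HDRLEN) {
		struct sk_attr hdr;
		size_t step;

		memcpy(&hdr, buf + ofs, sizeof(hdr));  /* avoids unaligned reads */
		if (hdr.len < SK_ATTR_HDRLEN || hdr.len > buf_len - ofs)
			return -1;
		if (hdr.type > SK_ATTR_MAX || (*seen & ((uint64_t)1 << hdr.type)))
			return -1;

		a[hdr.type] = buf + ofs;
		*seen |= (uint64_t)1 << hdr.type;

		/* Advance past the payload and its padding; the last attribute
		 * in the buffer may come without trailing padding. */
		step = ((size_t)hdr.len + 3) & ~(size_t)3;
		if (step > buf_len - ofs)
			step = buf_len - ofs;
		ofs += step;
	}
	return ofs == buf_len ? 0 : -1;
}
```

After a successful parse, a caller can test a bit in the mask to see whether a given type was present and then read the payload at a[type] + SK_ATTR_HDRLEN.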