Open vSwitch Research: vswitchd

Source: Internet
Author: User

vswitchd is a userspace daemon whose core job is to execute the ofproto logic. OVS is implemented in accordance with the OpenFlow switch specification. Taking Layer 2 forwarding as an example, a traditional switch (including the Linux bridge implementation) searches its CAM table for the port that corresponds to the destination MAC. Open vSwitch instead checks whether a flow exists for the incoming packet (skb). If a flow exists, the skb is not the first packet of the stream, and the forwarding port can be found in flow->action. Note that the SDN idea is that every packet corresponds to a flow, and the action applied to a packet is determined by its flow. Traditional actions are limited to forward, accept, or drop; SDN defines more: modifying the skb contents, changing the packet's path, and cloning copies onto different paths.

If the skb has no corresponding flow, it is the first packet of a flow, and a flow must be created for it. vswitchd repeatedly checks in a while loop for pending ofproto requests. These may come from ovs-ofctl, or they may be upcalls sent by openvswitch.ko over Netlink; in most cases an upcall is a flow creation request caused by a flow miss. vswitchd then creates the flow and its actions according to the OpenFlow specification. Let's take a look at this process:

Because Open vSwitch follows a Layer 2 switch model, every packet is received on a port, which means ovs_dp_process_received_packet is called. This function generates a key from the skb via ovs_flow_extract, then calls ovs_flow_tbl_lookup to find the flow for that key. If no flow is found, it calls ovs_dp_upcall, which sends a dp_upcall_info structure to vswitchd for processing (through genlmsg_unicast).
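The hit/miss split described above can be sketched in a few lines. This is only an illustration under simplifying assumptions, not the kernel code: toy_key, toy_flow_table, toy_lookup, toy_install, and toy_receive are hypothetical names, the key is reduced to an input port plus destination MAC, the table is a linear array, and the "upcall" is just a counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX_FLOWS 16

/* Hypothetical, drastically reduced flow key; the real key covers many more
 * header fields. */
struct toy_key {
    uint16_t in_port;
    uint8_t dst_mac[6];
};

struct toy_flow {
    struct toy_key key;
    uint16_t out_port;          /* The only "action": output to this port. */
};

struct toy_flow_table {
    struct toy_flow flows[TOY_MAX_FLOWS];
    size_t n_flows;
    unsigned int n_upcalls;     /* Misses handed to the slow path. */
};

/* Returns the flow matching 'key', or NULL on a miss. */
struct toy_flow *
toy_lookup(struct toy_flow_table *tbl, const struct toy_key *key)
{
    for (size_t i = 0; i < tbl->n_flows; i++) {
        const struct toy_key *k = &tbl->flows[i].key;

        if (k->in_port == key->in_port
            && !memcmp(k->dst_mac, key->dst_mac, sizeof k->dst_mac)) {
            return &tbl->flows[i];
        }
    }
    return NULL;
}

/* Slow-path stand-in: installs a flow.  Returns 0, or -1 if the table is
 * full. */
int
toy_install(struct toy_flow_table *tbl, const struct toy_key *key,
            uint16_t out_port)
{
    assert(tbl != NULL && key != NULL);
    if (tbl->n_flows >= TOY_MAX_FLOWS) {
        return -1;
    }
    tbl->flows[tbl->n_flows].key = *key;
    tbl->flows[tbl->n_flows].out_port = out_port;
    tbl->n_flows++;
    return 0;
}

/* Fast path: returns the output port on a hit.  On a miss, counts an upcall
 * and returns -1. */
int
toy_receive(struct toy_flow_table *tbl, const struct toy_key *key)
{
    struct toy_flow *flow = toy_lookup(tbl, key);

    if (!flow) {
        tbl->n_upcalls++;
        return -1;
    }
    return flow->out_port;
}
```

With a zeroed table, the first toy_receive for a key misses and bumps n_upcalls; after toy_install for that key, later packets hit and never reach the slow path.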

vswitchd processes this Netlink request in handle_upcalls. On a flow table miss, handle_miss_upcalls is called, which in turn calls handle_flow_miss. The following describes how handle_miss_upcalls is implemented.

static void
handle_miss_upcalls(struct dpif_backer *backer, struct dpif_upcall *upcalls,
                    size_t n_upcalls)
{
    ...

    /* Construct the to-do list.
     *
     * This just amounts to extracting the flow from each packet and sticking
     * the packets that have the same flow in the same "flow_miss" structure so
     * that we can process them together. */
    hmap_init(&todo);
    n_misses = 0;

Note: the following loop iterates over the struct dpif_upcall array that Netlink delivers to userspace. Each upcall carries the missed packet and the flow key generated from that packet; packets with the same flow key are processed together.

    for (upcall = upcalls; upcall < &upcalls[n_upcalls]; upcall++) {

        fitness = odp_flow_key_to_flow(upcall->key, upcall->key_len, &flow);
        port = odp_port_to_ofport(backer, flow.in_port);

odp_flow_key_to_flow first calls the parse_flow_nlattrs function (under lib/) to parse upcall->key and upcall->key_len. The attribute types that are present are recorded in the bitmap present_attrs, and the corresponding struct nlattr pointers are stored in struct nlattr *attrs. Next, for each bit set in present_attrs, the value is read from upcall->key and stored in the flow. VLAN parsing is handled separately by parse_8021q_onward.

odp_port_to_ofport converts flow.in_port, the datapath port number, into the OpenFlow port, that is, struct ofport_dpif *port.

        flow_extract(upcall->packet, flow.skb_priority,
                     &flow.tunnel, flow.in_port, &miss->flow);

This parses the packet into a flow. This function partly duplicates the work of odp_flow_key_to_flow.

        /* Add other packets to a to-do list. */
        hash = flow_hash(&miss->flow, 0);
        existing_miss = flow_miss_find(&todo, &miss->flow, hash);
        if (!existing_miss) {
            hmap_insert(&todo, &miss->hmap_node, hash);
            miss->ofproto = ofproto;
            miss->key = upcall->key;
            miss->key_len = upcall->key_len;
            miss->upcall_type = upcall->type;
            list_init(&miss->packets);

            n_misses++;
        } else {
            miss = existing_miss;
        }
        list_push_back(&miss->packets, &upcall->packet->list_node);
    }

flow_hash computes the hash of miss->flow, and flow_miss_find uses that hash to search the todo hmap for an existing struct flow_miss. If the result is NULL, this is the first miss for the flow, so the flow_miss is initialized and added to todo. Finally, the packet is appended to the flow_miss->packets list. This confirms the earlier statement: within one batch of upcalls, packets that belong to the same flow are linked onto the same flow_miss and processed together.
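The grouping step can be illustrated with a small stand-alone sketch. The names toy_upcall, toy_miss, toy_miss_find, and toy_batch_upcalls are hypothetical, the flow is reduced to a single integer key, and a linear search stands in for the hmap lookup, so this shows only the batching idea, not the OVS data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_BATCH 8

struct toy_upcall {
    uint32_t flow_key;          /* Stand-in for the extracted struct flow. */
    int packet_id;              /* Stand-in for the packet buffer. */
};

struct toy_miss {
    uint32_t flow_key;
    int packets[TOY_MAX_BATCH]; /* Packets that share 'flow_key'. */
    size_t n_packets;
};

/* Linear search replaces the hash lookup used by the real code. */
static struct toy_miss *
toy_miss_find(struct toy_miss *misses, size_t n_misses, uint32_t flow_key)
{
    for (size_t i = 0; i < n_misses; i++) {
        if (misses[i].flow_key == flow_key) {
            return &misses[i];
        }
    }
    return NULL;
}

/* Groups 'upcalls' by flow key into 'misses', which must have room for
 * TOY_MAX_BATCH entries.  Returns the number of distinct misses. */
size_t
toy_batch_upcalls(const struct toy_upcall *upcalls, size_t n_upcalls,
                  struct toy_miss *misses)
{
    size_t n_misses = 0;

    assert(n_upcalls <= TOY_MAX_BATCH);
    for (size_t i = 0; i < n_upcalls; i++) {
        struct toy_miss *miss = toy_miss_find(misses, n_misses,
                                              upcalls[i].flow_key);
        if (!miss) {
            /* First packet of this flow in the batch: start a new miss. */
            miss = &misses[n_misses++];
            miss->flow_key = upcalls[i].flow_key;
            miss->n_packets = 0;
        }
        miss->packets[miss->n_packets++] = upcalls[i].packet_id;
    }
    return n_misses;
}
```

Three upcalls with keys 10, 20, 10 produce two misses: the first holds both packets of flow 10, so the flow is translated once rather than once per packet.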

OVS defines a facet, which represents the view that a userspace program such as vswitchd has of an exactly matched flow. The kernel also has its own view of a flow. The facet represents the part the two views have in common, and the parts that differ are represented by subfacets. The actions are stored in struct subfacet.

If the flow key computed by the datapath is exactly the same as the flow key vswitchd computes from the packet, the facet contains a single subfacet. If the flow key computed by the datapath has more members than the one vswitchd computes from the packet, each additional variation becomes its own subfacet.

struct subfacet {
    /* Owners. */
    struct hmap_node hmap_node; /* In struct ofproto_dpif 'subfacets' list. */
    struct list list_node;      /* In struct facet's 'facets' list. */
    struct facet *facet;        /* Owning facet. */

    /* Key.
     *
     * To save memory in the common case, 'key' is NULL if 'key_fitness' is
     * ODP_FIT_PERFECT, that is, odp_flow_key_from_flow() can accurately
     * regenerate the ODP flow key from ->facet->flow. */
    enum odp_key_fitness key_fitness;
    struct nlattr *key;
    int key_len;

    long long int used;         /* Time last used; time created if not used. */

    uint64_t dp_packet_count;   /* Last known packet count in the datapath. */
    uint64_t dp_byte_count;     /* Last known byte count in the datapath. */

    /* Datapath actions.
     *
     * These should be essentially identical for every subfacet in a facet,
     * but may differ in trivial ways due to VLAN splinters. */
    size_t actions_len;         /* Number of bytes in actions[]. */
    struct nlattr *actions;     /* Datapath actions. */

    enum slow_path_reason slow; /* 0 if fast path may be used. */
    enum subfacet_path path;    /* Installed in datapath? */

};

Let's first look at handle_flow_miss:

/* Handles flow miss 'miss' on 'ofproto'.  May add any required datapath
 * operations to 'ops', incrementing '*n_ops' for each new op. */
static void
handle_flow_miss(struct ofproto_dpif *ofproto, struct flow_miss *miss,
                 struct flow_miss_op *ops, size_t *n_ops)
{
    struct facet *facet;
    uint32_t hash;

    /* The caller must ensure that miss->hmap_node.hash contains
     * flow_hash(miss->flow, 0). */
    hash = miss->hmap_node.hash;

    facet = facet_lookup_valid(ofproto, &miss->flow, hash);

This looks up the flow in struct ofproto_dpif *ofproto, the data structure that represents the datapath. ofproto->facets is a hash map: the hash of the missed flow selects a bucket, and the matching flow is then searched for in that bucket's hmap_node list. The comparison is brute force, a direct memcmp of the two flows.

    if (!facet) {
        struct rule_dpif *rule = rule_dpif_lookup(ofproto, &miss->flow);

        if (!flow_miss_should_make_facet(ofproto, miss, hash)) {
            handle_flow_miss_without_facet(miss, rule, ops, n_ops);

In this case it is not worth creating a facet for the flow. For trivial traffic, creating a facet costs more than it saves.

            return;
        }

        facet = facet_create(rule, &miss->flow, hash);

Otherwise, a facet is created for this flow.

    }
    handle_flow_miss_with_facet(miss, facet, ops, n_ops);
}

struct flow_miss wraps a flow and exists to speed up batch processing of missed flows. In most cases a facet is created:

2012-10-26T07:15:43Z|22522|ofproto_dpif|INFO|[qinq] miss flow, create facet: vlan_tci 0, proto 0x806, in_port 1, src mac 0:16:3e:83:0:1, dst mac 0:25:9e:5d:62:53

2012-10-26T07:15:43Z|22529|ofproto_dpif|INFO|[qinq] miss flow, create facet: vlan_tci 0, proto 0x806, in_port 2, src mac 0:25:9e:5d:62:53, dst mac 0:16:3e:83:0:1

We can see that one bidirectional exchange creates two flows and two facets, one per direction.
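The hash-then-memcmp lookup, and the reason a bidirectional exchange yields two facets, can be reproduced with a small sketch. Everything here is hypothetical: toy_flow keeps only four fields, toy_flow_hash is an FNV-1a byte hash swapped in for flow_hash, and toy_facet_table is a fixed-size chained table rather than an hmap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_N_BUCKETS 8
#define TOY_MAX_FACETS 16

/* Reduced flow: 16 bytes with no padding, so memcmp() is a valid equality
 * test for it. */
struct toy_flow {
    uint16_t in_port;
    uint16_t dl_type;
    uint8_t dl_src[6];
    uint8_t dl_dst[6];
};

struct toy_facet {
    struct toy_flow flow;
    uint32_t hash;
    struct toy_facet *next;     /* Next facet in the same bucket. */
};

/* Must be zero-initialized before use. */
struct toy_facet_table {
    struct toy_facet *buckets[TOY_N_BUCKETS];
    struct toy_facet pool[TOY_MAX_FACETS];
    size_t n_facets;
};

/* FNV-1a over the flow bytes; a stand-in, not the OVS flow_hash(). */
uint32_t
toy_flow_hash(const struct toy_flow *flow)
{
    const uint8_t *p = (const uint8_t *) flow;
    uint32_t hash = 2166136261u;

    for (size_t i = 0; i < sizeof *flow; i++) {
        hash ^= p[i];
        hash *= 16777619u;
    }
    return hash;
}

/* Walks the bucket for 'hash' and compares candidates byte for byte. */
struct toy_facet *
toy_facet_lookup(struct toy_facet_table *table, const struct toy_flow *flow,
                 uint32_t hash)
{
    struct toy_facet *f;

    for (f = table->buckets[hash % TOY_N_BUCKETS]; f; f = f->next) {
        if (f->hash == hash && !memcmp(&f->flow, flow, sizeof *flow)) {
            return f;
        }
    }
    return NULL;
}

/* Creates a facet for 'flow' and links it at the head of its bucket. */
struct toy_facet *
toy_facet_create(struct toy_facet_table *table, const struct toy_flow *flow,
                 uint32_t hash)
{
    struct toy_facet *f;

    assert(table->n_facets < TOY_MAX_FACETS);
    f = &table->pool[table->n_facets++];
    f->flow = *flow;
    f->hash = hash;
    f->next = table->buckets[hash % TOY_N_BUCKETS];
    table->buckets[hash % TOY_N_BUCKETS] = f;
    return f;
}
```

Because in_port and the source and destination MAC addresses are all part of the compared bytes, the reply direction of the ARP exchange in the log never matches the request's facet and gets a facet of its own.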

Next, let's look at handle_flow_miss_with_facet, which calls subfacet_make_actions to generate the actions. That function first calls action_xlate_ctx_init to initialize an action_xlate_ctx structure, which is defined as follows:

struct action_xlate_ctx {
/* action_xlate_ctx_init() initializes these members. */

    /* The ofproto. */
    struct ofproto_dpif *ofproto;

    /* Flow to which the OpenFlow actions apply.  xlate_actions() will modify
     * this flow when actions change header fields. */
    struct flow flow;

    /* The packet corresponding to 'flow', or a null pointer if we are
     * revalidating without a packet to refer to. */
    const struct ofpbuf *packet;

    /* Should OFPP_NORMAL update the MAC learning table?  Should "learn"
     * actions update the flow table?
     *
     * We want to update these tables if we are actually processing a packet,
     * or if we are accounting for packets that the datapath has processed,
     * but not if we are just revalidating. */
    bool may_learn;

    /* The rule that we are currently translating, or NULL. */
    struct rule_dpif *rule;

    /* Union of the set of TCP flags seen so far in this flow.  (Used only by
     * NXAST_FIN_TIMEOUT.  Set to zero to avoid updating rules'
     * timeouts.) */
    uint8_t tcp_flags;

/* xlate_actions() initializes and uses these members.  The client might want
 * to look at them after it returns. */

    struct ofpbuf *odp_actions; /* Datapath actions. */
    tag_type tags;              /* Tags associated with actions. */
    enum slow_path_reason slow; /* 0 if fast path may be used. */
    bool has_learn;             /* Actions include NXAST_LEARN? */
    bool has_normal;            /* Actions output to OFPP_NORMAL? */
    bool has_fin_timeout;       /* Actions include NXAST_FIN_TIMEOUT? */
    uint16_t nf_output_iface;   /* Output interface index for NetFlow. */
    mirror_mask_t mirrors;      /* Bitmap of associated mirrors. */

/* xlate_actions() initializes and uses these members, but the client has no
 * reason to look at them. */

    int recurse;                /* Recursion level, via xlate_table_action. */
    bool max_resubmit_trigger;  /* Recursed too deeply during translation. */
    struct flow base_flow;      /* Flow at the last commit. */
    uint32_t orig_skb_priority; /* Priority when packet arrived. */
    uint8_t table_id;           /* OpenFlow table ID where flow was found. */
    uint32_t sflow_n_outputs;   /* Number of output ports. */
    uint16_t sflow_odp_port;    /* Output port for composing sFlow action. */
    uint16_t user_cookie_offset;/* Used for user_action_cookie fixup. */
    bool exit;                  /* No further actions should be processed. */
    struct flow orig_flow;      /* Copy of original flow. */
};

Then xlate_actions is called. OpenFlow 1.0 defines the following actions:

enum ofp10_action_type {
    OFPAT10_OUTPUT,             /* Output to switch port. */
    OFPAT10_SET_VLAN_VID,       /* Set the 802.1Q VLAN id. */
    OFPAT10_SET_VLAN_PCP,       /* Set the 802.1Q priority. */
    OFPAT10_STRIP_VLAN,         /* Strip the 802.1Q header. */
    OFPAT10_SET_DL_SRC,         /* Ethernet source address. */
    OFPAT10_SET_DL_DST,         /* Ethernet destination address. */
    OFPAT10_SET_NW_SRC,         /* IP source address. */
    OFPAT10_SET_NW_DST,         /* IP destination address. */
    OFPAT10_SET_NW_TOS,         /* IP ToS (DSCP field, 6 bits). */
    OFPAT10_SET_TP_SRC,         /* TCP/UDP source port. */
    OFPAT10_SET_TP_DST,         /* TCP/UDP destination port. */
    OFPAT10_ENQUEUE,            /* Output to queue. */
    OFPAT10_VENDOR = 0xffff
};

The data structure passed in differs by action type, e.g.

/* Action structure for OFPAT10_SET_VLAN_VID. */
struct ofp_action_vlan_vid {
    ovs_be16 type;              /* OFPAT10_SET_VLAN_VID. */
    ovs_be16 len;               /* Length is 8. */
    ovs_be16 vlan_vid;          /* VLAN id. */
    uint8_t pad[2];
};

/* Action structure for OFPAT10_SET_VLAN_PCP. */
struct ofp_action_vlan_pcp {
    ovs_be16 type;              /* OFPAT10_SET_VLAN_PCP. */
    ovs_be16 len;               /* Length is 8. */
    uint8_t vlan_pcp;           /* VLAN priority. */
    uint8_t pad[3];
};

union ofp_action {
    ovs_be16 type;
    struct ofp_action_header header;
    struct ofp_action_vendor_header vendor;
    struct ofp_action_output output;
    struct ofp_action_vlan_vid vlan_vid;
    struct ofp_action_vlan_pcp vlan_pcp;
    struct ofp_action_nw_addr nw_addr;
    struct ofp_action_nw_tos nw_tos;
    struct ofp_action_tp_port tp_port;
};

do_xlate_actions takes an array of union ofp_action and performs a different operation for each action:

        case OFPUTIL_OFPAT10_OUTPUT:
            xlate_output_action(ctx, &ia->output);
            break;

        case OFPUTIL_OFPAT10_SET_VLAN_VID:
            ctx->flow.vlan_tci &= ~htons(VLAN_VID_MASK);
            ctx->flow.vlan_tci |= ia->vlan_vid.vlan_vid | htons(VLAN_CFI);
            break;

        case OFPUTIL_OFPAT10_SET_VLAN_PCP:
            ctx->flow.vlan_tci &= ~htons(VLAN_PCP_MASK);
            ctx->flow.vlan_tci |= htons(
                (ia->vlan_pcp.vlan_pcp << VLAN_PCP_SHIFT) | VLAN_CFI);
            break;

        case OFPUTIL_OFPAT10_STRIP_VLAN:
            ctx->flow.vlan_tci = htons(0);
            break;
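The VLAN cases above are plain bit manipulation on a network-byte-order TCI, which a stand-alone sketch makes easy to experiment with. The helper names toy_set_vlan_vid and toy_set_vlan_pcp are hypothetical; the mask and shift values follow the standard 802.1Q TCI layout (3-bit PCP, 1-bit CFI, 12-bit VID) and are assumed to match the constants used in the excerpt.

```c
#include <arpa/inet.h>          /* htons(), ntohs() */
#include <assert.h>
#include <stdint.h>

/* 802.1Q TCI layout: PCP (3 bits) | CFI (1 bit) | VID (12 bits). */
#define VLAN_VID_MASK  0x0fff
#define VLAN_PCP_MASK  0xe000
#define VLAN_PCP_SHIFT 13
#define VLAN_CFI       0x1000

/* Replaces the VID bits of 'tci_be' with 'vid_be' and sets the CFI bit.
 * Both arguments and the result are in network byte order. */
uint16_t
toy_set_vlan_vid(uint16_t tci_be, uint16_t vid_be)
{
    tci_be &= (uint16_t) ~htons(VLAN_VID_MASK);
    tci_be |= vid_be | htons(VLAN_CFI);
    return tci_be;
}

/* Replaces the PCP bits of 'tci_be' with 'pcp' (0..7, host value) and sets
 * the CFI bit.  The result is in network byte order. */
uint16_t
toy_set_vlan_pcp(uint16_t tci_be, uint8_t pcp)
{
    assert(pcp <= 7);
    tci_be &= (uint16_t) ~htons(VLAN_PCP_MASK);
    tci_be |= htons((uint16_t) ((pcp << VLAN_PCP_SHIFT) | VLAN_CFI));
    return tci_be;
}
```

Starting from a TCI of 0, setting VID 100 and then PCP 5 gives 0xb064 in host order: the two updates touch disjoint bit ranges, so each one preserves what the other wrote.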

The most important action for forwarding packets is xlate_output_action. This function calls xlate_output_action__, whose port argument is either an ordinary port number or one of the reserved control values, as the definition of ofp_port shows:

enum ofp_port {
    /* Maximum number of physical switch ports. */
    OFPP_MAX = 0xff00,

    /* Fake output "ports". */
    OFPP_IN_PORT    = 0xfff8,  /* Send the packet out the input port.  This
                                  virtual port must be explicitly used
                                  in order to send back out of the input
                                  port. */
    OFPP_TABLE      = 0xfff9,  /* Perform actions in flow table.
                                  NB: This can only be the destination
                                  port for packet-out messages. */
    OFPP_NORMAL     = 0xfffa,  /* Process with normal L2/L3 switching. */
    OFPP_FLOOD      = 0xfffb,  /* All physical ports except input port and
                                  those disabled by STP. */
    OFPP_ALL        = 0xfffc,  /* All physical ports except input port. */
    OFPP_CONTROLLER = 0xfffd,  /* Send to controller. */
    OFPP_LOCAL      = 0xfffe,  /* Local openflow "port". */
    OFPP_NONE       = 0xffff   /* Not associated with a physical port. */
};
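To make the role of the reserved values concrete, here is a rough sketch of how an output action's port might be dispatched. It is an illustration only: toy_output_kind and toy_classify_output are hypothetical names, and treating OFPP_LOCAL like an ordinary output port is an assumption of this sketch rather than something the text above states.

```c
#include <assert.h>
#include <stdint.h>

/* Reserved port values, copied from the ofp_port definition above. */
enum {
    OFPP_IN_PORT    = 0xfff8,
    OFPP_TABLE      = 0xfff9,
    OFPP_NORMAL     = 0xfffa,
    OFPP_FLOOD      = 0xfffb,
    OFPP_ALL        = 0xfffc,
    OFPP_CONTROLLER = 0xfffd,
    OFPP_LOCAL      = 0xfffe,
    OFPP_NONE       = 0xffff
};

/* Hypothetical result type: what an output action would do. */
enum toy_output_kind {
    TOY_OUT_PORT,               /* Output to one specific port. */
    TOY_OUT_IN_PORT,            /* Send back out the input port. */
    TOY_OUT_TABLE,              /* Resubmit to the flow table. */
    TOY_OUT_NORMAL,             /* L2 learning-switch behavior. */
    TOY_OUT_FLOOD,              /* Flood, honoring STP. */
    TOY_OUT_ALL,                /* All ports except the input port. */
    TOY_OUT_CONTROLLER,         /* Send to the controller. */
    TOY_OUT_NONE                /* Output nowhere. */
};

/* Maps the port of an output action to the kind of processing it selects. */
enum toy_output_kind
toy_classify_output(uint16_t port)
{
    switch (port) {
    case OFPP_IN_PORT:    return TOY_OUT_IN_PORT;
    case OFPP_TABLE:      return TOY_OUT_TABLE;
    case OFPP_NORMAL:     return TOY_OUT_NORMAL;
    case OFPP_FLOOD:      return TOY_OUT_FLOOD;
    case OFPP_ALL:        return TOY_OUT_ALL;
    case OFPP_CONTROLLER: return TOY_OUT_CONTROLLER;
    case OFPP_NONE:       return TOY_OUT_NONE;
    case OFPP_LOCAL:      /* Assumption: handled like an ordinary port. */
    default:              return TOY_OUT_PORT;
    }
}
```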

In xlate_output_action__, most requests take the OFPP_NORMAL path. xlate_normal is called, which calls mac_learning_lookup to find the packet's output port in the MAC table, then output_normal, and finally compose_output_action__:

compose_output_action__(struct action_xlate_ctx *ctx, uint16_t ofp_port,
                        bool check_stp)
{
    const struct ofport_dpif *ofport = get_ofp_port(ctx->ofproto, ofp_port);
    uint16_t odp_port = ofp_port_to_odp_port(ofp_port);
    ovs_be16 flow_vlan_tci = ctx->flow.vlan_tci;
    uint8_t flow_nw_tos = ctx->flow.nw_tos;
    uint16_t out_port;

    ...

    out_port = vsp_realdev_to_vlandev(ctx->ofproto, odp_port,
                                      ctx->flow.vlan_tci);
    if (out_port != odp_port) {
        ctx->flow.vlan_tci = htons(0);
    }
    commit_odp_actions(&ctx->flow, &ctx->base_flow, ctx->odp_actions);
    nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_OUTPUT, out_port);

    ctx->sflow_odp_port = odp_port;
    ctx->sflow_n_outputs++;
    ctx->nf_output_iface = ofp_port;
    ctx->flow.vlan_tci = flow_vlan_tci;
    ctx->flow.nw_tos = flow_nw_tos;
}

commit_odp_actions encodes the pending actions as nlattr attributes and stores them in ctx->odp_actions. The nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_OUTPUT, out_port) call that follows appends the packet's output port. At that point the flow's action list is essentially complete.

Next we discuss the CAM table in vswitchd. The code is in lib/mac-learning.h and lib/mac-learning.c.

vswitchd maintains a MAC/port CAM table whose entries age out after 300 seconds. The CAM table also defines the concept of a flooding VLAN: if a VLAN is a flooding VLAN, no addresses are learned on it, and all forwarding on that VLAN is done by flooding.

/* A MAC learning table entry. */
struct mac_entry {
    struct hmap_node hmap_node; /* Node in a mac_learning hmap. */
    struct list lru_node;       /* Element in 'lrus' list. */
    time_t expires;             /* Expiration time. */
    time_t grat_arp_lock;       /* Gratuitous ARP lock expiration time. */
    uint8_t mac[ETH_ADDR_LEN];  /* Known MAC address. */
    uint16_t vlan;              /* VLAN tag. */
    tag_type tag;               /* Tag for this learning entry. */

    /* Learned port. */
    union {
        void *p;
        int i;
    } port;
};

/* MAC learning table. */
struct mac_learning {
    struct hmap table;          /* Learning table: an hmap of mac_entry,
                                   linked in through mac_entry's hmap_node. */
    struct list lrus;           /* In-use entries, least recently used at the
                                   front, most recently used at the back.
                                   Each mac_entry is linked in through its
                                   lru_node. */
    uint32_t secret;            /* Secret for randomizing hash table. */
    unsigned long *flood_vlans; /* Bitmap of learning disabled VLANs. */
    unsigned int idle_time;     /* Max age before deleting an entry (the
                                   maximum aging time). */
};
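The flood_vlans member is a bitmap with one bit per VLAN ID. A minimal sketch of how such a bitmap can gate learning is shown below; toy_flood_vlans, toy_flood_vlan_set, toy_is_flood_vlan, toy_eth_addr_is_multicast, and toy_may_learn are hypothetical names, and the fixed-size array replaces the dynamically allocated bitmap of the real structure.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_N_VLANS 4096
#define TOY_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define TOY_BITMAP_LONGS \
    ((TOY_N_VLANS + TOY_BITS_PER_LONG - 1) / TOY_BITS_PER_LONG)

/* One bit per VLAN ID; a set bit means "never learn, always flood". */
struct toy_flood_vlans {
    unsigned long bits[TOY_BITMAP_LONGS];
};

/* Marks 'vlan' as a flooding VLAN. */
void
toy_flood_vlan_set(struct toy_flood_vlans *fv, uint16_t vlan)
{
    assert(vlan < TOY_N_VLANS);
    fv->bits[vlan / TOY_BITS_PER_LONG] |= 1UL << (vlan % TOY_BITS_PER_LONG);
}

/* Returns true if 'vlan' is marked as a flooding VLAN. */
bool
toy_is_flood_vlan(const struct toy_flood_vlans *fv, uint16_t vlan)
{
    assert(vlan < TOY_N_VLANS);
    return (fv->bits[vlan / TOY_BITS_PER_LONG]
            >> (vlan % TOY_BITS_PER_LONG)) & 1;
}

/* An Ethernet address is multicast (or broadcast) if the low-order bit of
 * its first octet is set. */
bool
toy_eth_addr_is_multicast(const uint8_t mac[6])
{
    return mac[0] & 1;
}

/* Learning is allowed only for unicast sources on non-flooding VLANs. */
bool
toy_may_learn(const struct toy_flood_vlans *fv, const uint8_t src_mac[6],
              uint16_t vlan)
{
    return !toy_is_flood_vlan(fv, vlan)
           && !toy_eth_addr_is_multicast(src_mac);
}
```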

static uint32_t
mac_table_hash(const struct mac_learning *ml, const uint8_t mac[ETH_ADDR_LEN],
               uint16_t vlan)
{
    unsigned int mac1 = get_unaligned_u32((uint32_t *) mac);
    unsigned int mac2 = get_unaligned_u16((uint16_t *) (mac + 4));
    return hash_3words(mac1, mac2 | (vlan << 16), ml->secret);
}

The hash of a mac_entry is computed from the MAC address, the VLAN, and mac_learning->secret using hash_3words.
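The packing of the 6-byte MAC address and the 16-bit VLAN into two 32-bit words can be tried out in isolation. In the sketch below, toy_mix and toy_hash_3words are hypothetical stand-ins with a simple multiplicative mixing step; they are not the OVS hash_3words algorithm. The memcpy calls play the role of the unaligned loads.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ETH_ADDR_LEN 6

/* Stand-in mixing step (NOT the OVS hash_3words() algorithm). */
static uint32_t
toy_mix(uint32_t hash, uint32_t word)
{
    hash ^= word;
    hash *= 16777619u;
    return hash ^ (hash >> 15);
}

/* Combines three 32-bit words into one hash value. */
uint32_t
toy_hash_3words(uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t hash = 2166136261u;

    hash = toy_mix(hash, a);
    hash = toy_mix(hash, b);
    return toy_mix(hash, c);
}

/* Packs the key as two words: the first four MAC bytes, then the last two
 * MAC bytes in the low half with the VLAN in the high half.  'secret'
 * perturbs the result so bucket placement is hard to predict. */
uint32_t
toy_mac_table_hash(const uint8_t mac[ETH_ADDR_LEN], uint16_t vlan,
                   uint32_t secret)
{
    uint32_t mac1;
    uint16_t mac2;

    assert(mac != NULL);
    memcpy(&mac1, mac, sizeof mac1);        /* Bytes 0..3, any alignment. */
    memcpy(&mac2, mac + 4, sizeof mac2);    /* Bytes 4..5. */
    return toy_hash_3words(mac1, mac2 | ((uint32_t) vlan << 16), secret);
}
```

Because the VLAN occupies the upper 16 bits of the second word, the same MAC address on two VLANs yields two different keys, which is why the table can hold one entry per (MAC, VLAN) pair.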

mac_entry_lookup: uses the MAC address and VLAN to look up the corresponding mac_entry, if one exists.

get_lru: returns the first mac_entry on the LRU list, that is, the least recently used one.

mac_learning_create/mac_learning_destroy: create/destroy the mac_learning table.

mac_learning_may_learn: returns true if the VLAN is not a flooding VLAN and the MAC address is not a multicast address.

mac_learning_insert: inserts a mac_entry into mac_learning. It first uses mac_entry_lookup to check whether an entry for the MAC address and VLAN already exists. If not, and mac_learning already holds MAC_MAX entries, the oldest entry is evicted; a new mac_entry is then created and inserted into the CAM table.

mac_learning_lookup: calls mac_entry_lookup to find the entry for a MAC address on a VLAN in the CAM table.

mac_learning_run: periodically expires mac_entry items that have timed out.
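The interplay of insert, lookup, eviction, and aging can be seen in a compact sketch. It assumes simplifications that the real code does not make: a four-entry array instead of an hmap, the smallest expiration time standing in for the head of the LRU list, and no flooding VLAN check. All toy_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define TOY_MAC_MAX 4           /* Tiny capacity so eviction is easy to see. */
#define TOY_IDLE_TIME 300       /* Seconds, matching the aging time above. */

struct toy_mac_entry {
    bool in_use;
    uint8_t mac[6];
    uint16_t vlan;
    int port;                   /* Learned port. */
    time_t expires;             /* Expiration time. */
};

/* Must be zero-initialized before use. */
struct toy_mac_table {
    struct toy_mac_entry entries[TOY_MAC_MAX];
};

/* Returns the entry for 'mac' on 'vlan', or NULL if none is known. */
struct toy_mac_entry *
toy_mac_lookup(struct toy_mac_table *t, const uint8_t mac[6], uint16_t vlan)
{
    for (int i = 0; i < TOY_MAC_MAX; i++) {
        struct toy_mac_entry *e = &t->entries[i];

        if (e->in_use && e->vlan == vlan && !memcmp(e->mac, mac, 6)) {
            return e;
        }
    }
    return NULL;
}

/* Learns or refreshes 'mac' on 'vlan' at 'port'.  Returns NULL for multicast
 * sources, which are never learned.  When the table is full, the entry with
 * the earliest expiration (the least recently refreshed one) is replaced. */
struct toy_mac_entry *
toy_mac_learn(struct toy_mac_table *t, const uint8_t mac[6], uint16_t vlan,
              int port, time_t now)
{
    struct toy_mac_entry *e;

    assert(t != NULL && mac != NULL);
    if (mac[0] & 1) {
        return NULL;
    }
    e = toy_mac_lookup(t, mac, vlan);
    if (!e) {
        e = &t->entries[0];
        for (int i = 0; i < TOY_MAC_MAX; i++) {
            struct toy_mac_entry *c = &t->entries[i];

            if (!c->in_use) {
                e = c;          /* A free slot always wins. */
                break;
            }
            if (c->expires < e->expires) {
                e = c;          /* Oldest entry seen so far. */
            }
        }
        e->in_use = true;
        memcpy(e->mac, mac, 6);
        e->vlan = vlan;
    }
    e->port = port;
    e->expires = now + TOY_IDLE_TIME;
    return e;
}

/* Ages out entries whose time has come.  Returns how many were removed. */
int
toy_mac_run(struct toy_mac_table *t, time_t now)
{
    int n_expired = 0;

    for (int i = 0; i < TOY_MAC_MAX; i++) {
        if (t->entries[i].in_use && now >= t->entries[i].expires) {
            t->entries[i].in_use = false;
            n_expired++;
        }
    }
    return n_expired;
}
```

A forwarding decision in the style of a learning switch would call toy_mac_learn with the source address of each packet and toy_mac_lookup with its destination address, flooding when the lookup returns NULL.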
