[Repost] Data Center Network Virtualization Tunneling Technologies

Source: SDNLab, http://www.sdnlab.com/12077.html

One of the first problems in data center network virtualization is how to isolate address spaces and data traffic between different tenants and applications. Address-space isolation means that the network (IP) addresses of different tenants and applications do not interfere with one another; in other words, two tenants can use exactly the same network addresses. Data-traffic isolation means that no tenant or application can perceive or capture traffic inside another virtual network. To achieve this, we can build per-tenant overlay networks on top of the physical network, and tunnel encapsulation is the key technology for building an overlay. In this section we discuss the more popular tunnel encapsulation techniques for building overlay networks: VXLAN, VXLAN-GPE, NVGRE, and STT.
1.VXLAN
VXLAN (Virtual eXtensible Local Area Network) is an overlay technology that encapsulates Layer 2 frames within a Layer 4 protocol; specifically, VXLAN uses MAC-in-UDP encapsulation to extend Layer 2 networks. The most widespread use of VXLAN in data centers is to allow free migration of virtual machines across a Layer 3 network. With VXLAN, migrations that were originally confined to the same data center, the same physical Layer 2 network, and the same VLAN are no longer subject to these restrictions and can extend to anywhere in the virtual Layer 2 network as needed. In addition, VXLAN lets users create up to 16M isolated virtual networks, a qualitative improvement over the 4,096 virtual networks that VLANs can support. As shown in Figure 1, a VXLAN packet carries a total of 50 bytes (or 54 bytes) of encapsulation headers: a 14-byte (or 18-byte, with an 802.1Q tag) outer Ethernet header (carrying the MAC address of the physical machine hosting the virtual machine), a 20-byte outer IP header (carrying the IP address of that physical machine), an 8-byte outer UDP header, and an 8-byte VXLAN header.
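As a quick sanity check, the 50-byte overhead can be tallied from the per-layer header sizes given above (field sizes per RFC 7348):

```python
# Per-layer overhead of a VXLAN-encapsulated frame (sizes in bytes).
outer_eth = 14   # outer Ethernet header (18 with an 802.1Q tag)
outer_ip = 20    # outer IPv4 header
outer_udp = 8    # outer UDP header
vxlan = 8        # VXLAN header

overhead = outer_eth + outer_ip + outer_udp + vxlan
print(overhead)  # 50; with an 802.1Q tag on the outer frame it becomes 54
```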

Figure 1. VXLAN packet format

The destination port of the outer UDP header defaults to 4789 and can be modified as needed; the source port is a hash computed from the inner packet, ranging over 49152-65535. The source IP address of the outer IP header is the address of the source VTEP (VXLAN Tunnel End Point), and the destination IP is the address of the destination VTEP (when the destination VTEP for the inner packet is unknown, it is the IP multicast group address corresponding to the virtual network to which the inner packet belongs). The MAC addresses of the outer Ethernet header are the MAC addresses of the VTEPs. Thus the virtual machine's addressing information is completely hidden from the network devices on the transmission path.
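The derivation of the outer source port can be sketched as follows. RFC 7348 only requires the port to be a hash of the inner packet's headers mapped into the ephemeral range; the specific hash here (CRC32) is an illustrative choice, not mandated by the spec:

```python
import zlib

def vxlan_source_port(inner_headers: bytes) -> int:
    """Derive the outer UDP source port from a hash of the inner frame's
    headers, mapped into the ephemeral range 49152-65535.
    The hash function itself is implementation-defined; CRC32 is just an
    illustrative stand-in."""
    h = zlib.crc32(inner_headers)
    return 49152 + (h % (65535 - 49152 + 1))

# All packets of the same inner flow hash to the same port, which keeps
# a flow on one ECMP path while spreading different flows across paths.
port = vxlan_source_port(b"\x00\x11\x22\x33\x44\x55")
```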

As shown in Figure 2, the VXLAN header is 8 bytes long. The currently valid fields are a 1-byte flags field and a 3-byte VNI (VXLAN Network Identifier), which identifies a virtual network. The remaining 4 bytes are reserved for future use and must be set to 0. Within the flags field, the I bit must be set to 1 to indicate a valid VNI, and the remaining 7 bits must be set to 0. VXLAN uses the VNI for network isolation: virtual machines with the same VNI can communicate with each other. Because the VNI is 24 bits wide, users can create 16M isolated virtual networks.
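The 8-byte layout described above can be packed and parsed in a few lines (a sketch following the RFC 7348 field order: flags, 3 reserved bytes, VNI, 1 reserved byte):

```python
import struct

VXLAN_FLAG_I = 0x08  # "I" bit of the flags byte: VNI is valid

def pack_vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header: flags(1) + reserved(3) + VNI(3) + reserved(1)."""
    assert 0 <= vni < (1 << 24), "VNI is a 24-bit field"
    return struct.pack("!B3s3sB", VXLAN_FLAG_I, b"\x00\x00\x00",
                       vni.to_bytes(3, "big"), 0)

def unpack_vni(header: bytes) -> int:
    """Extract the VNI, checking that the I bit is set."""
    assert header[0] & VXLAN_FLAG_I, "I bit must be 1 for a valid VNI"
    return int.from_bytes(header[4:7], "big")

hdr = pack_vxlan_header(300)
```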

Figure 2. VXLAN header format [1]

The encapsulation/decapsulation of VXLAN packets is done by the VTEPs at the two ends of the tunnel. The source VTEP applies the VXLAN encapsulation to a packet and sends it through the tunnel to the destination VTEP. The inner packet is invisible to the network devices on the transmission path, so they do not need to maintain any forwarding state for it. When the destination VTEP receives the packet, it first decapsulates it and then delivers it to the corresponding virtual machine. The VXLAN protocol does not prescribe how a VTEP is implemented; it can be a hardware device or software that supports VXLAN.

Because a communicating VM does not know the IP address of the VTEP serving the peer VM (that is, it does not know which physical machine the peer VM is on), VXLAN adopts a data-plane learning mechanism: each VXLAN virtual network corresponds to an IP multicast group (that is, VNIs and IP multicast addresses map one-to-one), and each VTEP maintains this correspondence locally. When encapsulating a packet, if the local VTEP has no VTEP information for the destination VM, it multicasts the packet within the IP multicast group corresponding to the virtual network. Each VTEP that receives the multicast checks whether the destination VM is local, responds if so, and records locally the correspondence between the source VM and the source VTEP. When the source VTEP receives the response, it learns the correspondence between the destination VM and the destination VTEP. Alternatively, VXLAN can use a central controller to obtain VTEP information: each VTEP reports its local VMs' MAC addresses and its own VTEP IP to the controller, and when a source VTEP does not know the VTEP IP for a destination VM, it simply requests this information from the controller.
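The data-plane learning mechanism boils down to a per-VNI mapping table with a multicast fallback. A minimal sketch (class and method names are illustrative, not from any real VTEP implementation):

```python
class VtepTable:
    """A VTEP's learning state: (VNI, inner MAC) -> remote VTEP IP,
    plus the static VNI -> multicast-group mapping."""

    def __init__(self, vni_to_mcast):
        self.vni_to_mcast = vni_to_mcast  # e.g. {300: "239.0.0.1"}
        self.mac_to_vtep = {}             # learned: (vni, mac) -> VTEP IP

    def learn(self, vni, mac, vtep_ip):
        """Record a source-MAC-to-source-VTEP binding seen in a received packet."""
        self.mac_to_vtep[(vni, mac)] = vtep_ip

    def outer_dst_ip(self, vni, mac):
        """Unicast to the learned VTEP, or fall back to the VNI's multicast group."""
        return self.mac_to_vtep.get((vni, mac), self.vni_to_mcast[vni])

t = VtepTable({300: "239.0.0.1"})
assert t.outer_dst_ip(300, "MAC2") == "239.0.0.1"  # unknown: flood via multicast
t.learn(300, "MAC2", "10.0.0.2")                   # learned from a reply
assert t.outer_dst_ip(300, "MAC2") == "10.0.0.2"   # now unicast
```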

Here is an example of building a virtual network with VXLAN (based on [RFC7348]; a VTEP can be implemented on a physical switch or a physical server, and in the following example the VTEP resides in the hypervisor of the physical server hosting the VM). First, assume the current virtual network (VNI=300) contains two virtual machines, VM1 and VM2. As shown in Figure 3, in order for them to communicate within the virtual network, all related VTEPs need to join the multicast group (239.0.0.1) via the IGMP protocol.

Figure 3. VXLAN initialization: joining a virtual network's multicast group

Now assume VM1 needs to send data to VM2. Because this is their first communication, VM1 does not have VM2's MAC address, so, as shown in Figure 4, VM1 first issues an ARP request. VTEP1 intercepts the request, encapsulates it, and sets the VNI in the VXLAN header to 300 to indicate that the packet belongs to virtual network 300. The outer destination IP address is set to the multicast address 239.0.0.1, and the outer source IP address is VTEP1's address. After multicast forwarding, the destination VTEP2 receives the packet, decapsulates it, and discovers that it has not yet stored VM1's address information, so VTEP2 first caches the mapping between VM1's virtual network number (300), MAC address (MAC1), and VTEP1's IP address. Having learned VM1's address, VTEP2 then broadcasts the decapsulated ARP request to all locally attached virtual machines with VNI=300.

Figure 4. Sending an ARP request in VXLAN

As shown in Figure 5, when VM2 receives the ARP request, it discovers that it is the target host and sends an ARP response to VM1 following the standard procedure. This response is naturally intercepted by VTEP2 and encapsulated. Because VTEP2 has by now learned VM1's address information, it can send the packet as unicast using VTEP1's IP address and VM1's MAC address. On receiving this unicast, VTEP1 learns the mapping from the inner MAC address to the outer IP address, then decapsulates the packet and forwards it to VM1 based on the inner destination MAC address. Finally, VM1 processes the ARP response.

Figure 5. Sending an ARP response in VXLAN

2.VXLAN-GPE
VXLAN defines an encapsulation format for carrying Ethernet frames inside an outer UDP packet. VXLAN-GPE (Generic Protocol Extension VXLAN) is an extension of VXLAN that allows packets of any layer to be encapsulated and also provides support for OAM (Operations, Administration and Management) protocols. VXLAN-GPE extends VXLAN by redefining some reserved bits of the VXLAN header. As shown in Figure 6, VXLAN-GPE makes the following four changes to the VXLAN header:
1) Adds a Next Protocol field: this field indicates the protocol type of the encapsulated datagram. Currently defined Next Protocol values include:

    • 0x1: IPv4
    • 0x2: IPv6
    • 0x3: Ethernet
    • 0x4: Network Service Header (NSH). We mentioned NSH in the previous section when introducing Cisco's virtualization platform; here is a brief review. The NSH header consists of two parts: 1) information about the service path (service nodes use it to select the next service node on the path), and 2) metadata needed by the network and service devices along the path. The NSH header is added by a device or application with a service-classification capability, which determines which packets need servicing, what services they need, and accordingly which service path they traverse.

2) Adds a P bit: the 5th bit of the flags field (zero-based numbering) is defined as the P bit. The P bit must be set to 1 if the 8-bit Next Protocol field is present; if the P bit is 0, the packet is parsed according to the standard VXLAN protocol.

3) Adds an O bit: the 7th bit of the flags field is defined as the O (OAM) bit. When the O bit is set to 1, the inner encapsulated packet is an OAM packet, which triggers OAM processing.

4) Adds a Ver field: two reserved bits of the flags field are defined as Ver (version) to indicate the VXLAN-GPE version (currently version 0).
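The four changes above can be summarized by parsing the flags byte of a VXLAN-GPE header. This sketch assumes the bit layout of draft-quinn-vxlan-gpe (R R Ver Ver I P R O, zero-based from the most significant bit), consistent with the P and O bit positions given above:

```python
# Currently defined Next Protocol values in VXLAN-GPE.
NEXT_PROTO = {0x1: "IPv4", 0x2: "IPv6", 0x3: "Ethernet", 0x4: "NSH"}

def parse_gpe_flags(flags_byte: int):
    """Extract the Ver, I, P, and O bits from the first byte of a
    VXLAN-GPE header (layout assumed from draft-quinn-vxlan-gpe)."""
    ver = (flags_byte >> 4) & 0x3  # Ver: 2 bits, currently 0
    i = (flags_byte >> 3) & 0x1    # I bit: VNI valid
    p = (flags_byte >> 2) & 0x1    # P bit: Next Protocol field present
    o = flags_byte & 0x1           # O bit: inner packet is an OAM packet
    return ver, i, p, o

# 0x0C sets the I and P bits: a valid VNI plus a Next Protocol field;
# with P = 0 the packet would instead be parsed as standard VXLAN.
ver, i, p, o = parse_gpe_flags(0x0C)
```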

Figure 6. VXLAN-GPE header format [2]

3.NVGRE
Before introducing NVGRE, let us first introduce GRE (Generic Routing Encapsulation), originally proposed by Cisco. GRE was designed to solve the problem of encapsulating network protocols of any layer within one another; the material here mainly follows RFC1701 and RFC1702. First, the actual packet that GRE needs to transmit is called the payload packet. To transmit it within a tunnel, we first encapsulate the payload packet with a GRE header; the encapsulated packet is called a GRE packet. Finally, a header of the corresponding outer protocol is added outside the GRE header so the packet can be transmitted over the physical network; for convenience, this outer protocol is called the delivery protocol. The GRE header is shown in Figure 7, and its fields are defined as follows:
1) Flags field: the first 2 bytes of the GRE header are the flags field, where:

    • C (bit 0): Checksum Present; if set to 1, the Checksum field exists.
    • R (bit 1): Routing Present; if set to 1, the optional Routing field exists.
    • K (bit 2): Key Present; if set to 1, the Key field exists.
    • S (bit 3): Sequence Number Present; if set to 1, the Sequence Number field exists.
    • s (bit 4): Strict Source Route. Strict source routing means that the source specifies all routers on the path and that their order may not change. In general, this bit is set to 1 only if all routing information consists of strict source routes.
    • Recur (bits 5-7): the recursion control field, 3 bits indicating the number of additional encapsulation layers allowed; usually set to 0.
    • Flags (bits 8-12): must be set to 0 when transmitting.
    • Ver (bits 13-15): version number.

2) Protocol Type (2 bytes): indicates the protocol type of the inner encapsulated packet; it should be set to 0x0800 if the inner packet is IPv4.

3) Checksum (2 bytes): checksum over the GRE header and payload.

4) Offset (2 bytes): indicates the byte offset from the start of the Routing field to the first valid source route entry. This field is present only when the C or R bit is set to 1, and the information it contains is meaningful only when the R bit is set.

5) Key (4 bytes): packets in the same flow carry the same Key value, and the decapsulating tunnel endpoint uses this value to determine whether packets belong to the same flow. In NVGRE, this field is repurposed to identify the virtual network.

6) Sequence Number (4 bytes): indicates the order in which packets were sent.

7) Routing (variable length): source routing information; when the R bit is set, this field contains multiple Source Route Entries (SREs). The detailed definition of an SRE is omitted here; interested readers can refer to RFC1701.

Figure 7. Format of the GRE header
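The field layout above can be illustrated by building a minimal GRE header. This is a sketch covering only the flags/version word, the Protocol Type, and the optional Key field; the Checksum, Offset, Sequence Number, and Routing fields are omitted for brevity:

```python
import struct

GRE_FLAG_K = 0x2000  # K bit (bit 2 from the MSB of the 16-bit flags word)

def pack_gre_header(protocol_type: int, key=None) -> bytes:
    """Build a minimal GRE header per the RFC 1701 layout:
    2 bytes of flags/version + 2 bytes of Protocol Type,
    followed by the optional 4-byte Key field when K=1."""
    flags = 0
    tail = b""
    if key is not None:
        flags |= GRE_FLAG_K
        tail = struct.pack("!I", key)
    return struct.pack("!HH", flags, protocol_type) + tail

plain = pack_gre_header(0x0800)            # GRE carrying IPv4: 4 bytes
keyed = pack_gre_header(0x6558, key=300)   # keyed GRE carrying Ethernet: 8 bytes
```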

The GRE encapsulation/decapsulation process is similar to VXLAN's, so we do not repeat it here. We now introduce the NVGRE (Network Virtualization using Generic Routing Encapsulation) protocol. Originally proposed by Microsoft, NVGRE implements multi-tenant virtual Layer 2 networks in the data center by encapsulating Ethernet frames within a GRE header and transmitting them over a Layer 3 network (MAC-in-IP) [3]. As the name implies, NVGRE's underlying mechanics are inherited from GRE, so we focus on the differences between NVGRE and traditional GRE. As shown in Figure 8, the packet header format of NVGRE encapsulation is the same as that of GRE encapsulation. The differences are that with NVGRE, the C and S bits in the GRE header must be set to 0, so an NVGRE header carries no checksum or sequence number, while the K bit must be set to 1, making the Key field valid. NVGRE redefines the Key field: its first 3 bytes become the VSID (Virtual Subnet ID) and the fourth byte becomes the FlowID. The 24-bit VSID identifies a Layer 2 virtual network, so NVGRE can support up to 16M virtual networks, the same number as VXLAN. The 8-bit FlowID allows the hypervisor to perform finer-grained handling of different flows within a single virtual network. The FlowID should be generated and inserted by the NVGRE endpoint (NVE); network devices are not allowed to modify it in transit, and if the NVE does not generate a FlowID, the field must be zeroed. In addition, because NVGRE encapsulates an Ethernet frame, the Protocol Type field in the GRE header must be set to 0x6558 (Transparent Ethernet Bridging).
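NVGRE's redefinition of the 4-byte Key field, 24-bit VSID over 8-bit FlowID, can be sketched as simple bit packing:

```python
def nvgre_key(vsid: int, flow_id: int = 0) -> int:
    """Pack NVGRE's reinterpretation of the GRE Key field:
    the 24-bit VSID in the upper three bytes, the 8-bit FlowID
    in the low byte (FlowID 0 means 'not generated by the NVE')."""
    assert 0 <= vsid < (1 << 24), "VSID is 24 bits"
    assert 0 <= flow_id < (1 << 8), "FlowID is 8 bits"
    return (vsid << 8) | flow_id

def nvgre_key_split(key: int):
    """Recover (VSID, FlowID) from a Key value."""
    return key >> 8, key & 0xFF

key = nvgre_key(vsid=300, flow_id=7)
assert nvgre_key_split(key) == (300, 7)
```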

Figure 8. Packet header format with NVGRE encapsulation

4.STT
STT (Stateless Transport Tunneling) is another overlay technology for creating Layer 2 virtual networks on top of a Layer 2/Layer 3 physical network in the data center. It encapsulates data with a stateless TCP-like header, so it can be regarded as a MAC-in-TCP approach. The advantage of using a TCP-like header is that some of the NIC's hardware offload mechanisms can be used to improve system performance, such as TSO (TCP Segmentation Offload) and LRO (Large Receive Offload). With TSO, TCP segmentation work is offloaded to the NIC, which performs the bulk of the segmentation as well as the replication of the MAC, IP, and TCP headers. LRO is the receive-side counterpart: the NIC merges segments into one large packet before raising an interrupt and handing the packet to the operating system. The benefits of TSO and LRO are obvious. First, transferring large packets reduces the number of interrupts and hence interrupt-handling overhead. Second, the cost of the encapsulation header can be amortized over multiple MTU-sized packets, so data-transfer efficiency also improves significantly. To take advantage of this NIC acceleration, STT's encapsulation header mimics the TCP format, but STT does not maintain TCP connection state: for example, no three-way handshake is needed before sending data with STT, and TCP's congestion-control mechanisms do not operate. Although STT can use NIC acceleration to improve performance, it also runs into problems precisely because it does not maintain TCP state. For example, some systems deploy middleboxes, and because some middleboxes inspect the Layer 4 session state of a flow, stateless STT streams can fail to pass through them. Of course, the MAC-in-IP NVGRE scheme has the same problem with such middleboxes, whereas the MAC-in-UDP VXLAN scheme does not.
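A back-of-the-envelope calculation makes the amortization point concrete. The numbers here are illustrative assumptions (a 64 KB STT frame, 54 bytes of encapsulation headers, a 1500-byte MTU), not values fixed by the protocol:

```python
# With TSO, the OS hands the NIC one large STT frame and the NIC segments it,
# so the STT encapsulation header is paid once per frame rather than once per
# MTU-sized packet.
mtu = 1500
encap_overhead = 54                 # assumed total outer-header bytes
payload_per_pkt = mtu - encap_overhead

stt_frame = 64 * 1024               # one 64 KB STT frame
segments = -(-stt_frame // payload_per_pkt)   # ceiling division

# Without the large frame, every one of these segments would carry its own
# full encapsulation header; with STT + TSO the header cost is amortized
# across all of them.
print(segments)
```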

As shown in Figure 9, a packet sent via STT is first encapsulated with an STT frame header, whose format is shown in Figure 10; the meanings of its fields are as follows. From these definitions we can see that the STT header greatly simplifies processing at the receiver: for example, the receiver can easily determine from the IP version bit whether the payload is IPv4 or IPv6, or from the TCP payload bit whether the payload is a TCP packet.

1) Version: currently must be set to 0.

2) Flags: 8 bits, defined as follows:

    • Bit 0: Checksum verified; if set to 1, the checksum of the encapsulated payload packet has been verified.
    • Bit 1: Checksum partial; this bit must be set if the checksum has been computed only over the TCP/IP headers. Note that if the sender uses TSO, this bit must be set. As the definitions imply, bits 0 and 1 cannot both be 1.
    • Bit 2: IP version; set if the payload packet is IPv4, cleared if it is IPv6.
    • Bit 3: TCP payload; set if the payload packet uses TCP.
    • Bits 4-7: reserved; the sender must set them to 0 and the receiver must ignore them.

3) L4 Offset: the offset from the end of the STT frame header to the Layer 4 (TCP/UDP) header of the payload packet. This field exists to facilitate fast processing at the receiver.

4) Reserved field: the sender must set it to 0, and the receiver ignores it.

5) Max Segment Size: the TCP MSS the tunnel endpoint uses when sending data over the network.

6) PCP: priority.

7) V: if set to 1, the following VLAN ID and the preceding PCP field are valid.

8) VLAN ID: virtual network number, 12 bits.

9) Context ID: a 64-bit context identifier. Its purpose is to identify the owner of the STT frame, so it can be understood as a generalized virtual network identifier. The benefit of STT's 64-bit identifier is obvious: it allows STT to support far more virtual networks than VXLAN and NVGRE. Also, because STT frames are segmented, the cost of the STT header can be amortized across the segments.
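The fields listed above can be packed into an STT frame header as follows. This is a sketch assuming the draft-davie-stt layout (version, flags, L4 offset, reserved, MSS, PCP/V/VLAN, 64-bit Context ID, 2 bytes of padding, 18 bytes in total):

```python
import struct

def pack_stt_header(context_id: int, l4_offset: int, mss: int,
                    flags: int = 0, vlan: int = 0, pcp: int = 0) -> bytes:
    """Build an 18-byte STT frame header (layout assumed from draft-davie-stt):
    version(1) + flags(1) + L4 offset(1) + reserved(1) + MSS(2) +
    PCP/V/VLAN(2) + context ID(8) + padding(2)."""
    assert 0 <= vlan < (1 << 12) and 0 <= pcp < (1 << 3)
    v_bit = 1 if (vlan or pcp) else 0          # V: VLAN ID and PCP are valid
    pcp_v_vlan = (pcp << 13) | (v_bit << 12) | vlan
    return struct.pack("!BBBBHHQH",
                       0,            # Version: currently 0
                       flags,
                       l4_offset,    # offset to the payload's L4 header
                       0,            # reserved
                       mss,          # TCP MSS used on the wire
                       pcp_v_vlan,
                       context_id,   # 64-bit virtual-network identifier
                       0)            # padding

hdr = pack_stt_header(context_id=0xDEADBEEF, l4_offset=18, mss=1460)
```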

Figure 9. Packet fragment format using STT encapsulation

Figure 10. STT frame header format before segmentation

As mentioned earlier, STT's advantage over other encapsulation protocols is that it can use the NIC's TCP offload functions to accelerate data transmission. The key to the STT implementation is the TCP-like header shown in Figure 9. As shown in Figure 11, the TCP-like header used in STT has the same layout as the TCP header defined in [RFC0793]; the difference lies in the use of the two fields marked with * in Figure 11. First, the ACK field is used to identify the fragment, functionally the same as the ID field in the IPv4 and IPv6 headers: for one STT data frame this number must be fixed, and the numbers of different STT data frames must not repeat within a certain time window. Second, the 32-bit SEQ field is split into two parts: the high 16 bits identify the length of the entire STT frame (in bytes), and the low 16 bits identify the offset of the current fragment. Thus, for a given STT frame, the high 16 bits of SEQ remain constant while the low 16 bits increase monotonically, so the modified SEQ advances the same way a traditional SEQ does. To ensure that fragments are reassembled correctly, some points must be emphasized. First, for one STT frame, the source port of all fragments must remain constant. Second, to facilitate ECMP and other load-balancing strategies, all STT frames of a given flow should also carry a constant source port number.
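The repurposed SEQ field described above, frame length in the high 16 bits and fragment offset in the low 16, can be sketched as:

```python
def stt_seq(frame_len: int, offset: int) -> int:
    """Encode STT's reuse of the 32-bit TCP SEQ field: the total STT frame
    length (bytes) in the high 16 bits, the current fragment's byte offset
    in the low 16 bits."""
    assert 0 <= frame_len < (1 << 16) and 0 <= offset < (1 << 16)
    return (frame_len << 16) | offset

def stt_seq_decode(seq: int):
    """Recover (frame length, fragment offset) from a SEQ value."""
    return seq >> 16, seq & 0xFFFF

# For one frame, the high half stays constant while the low half grows,
# so successive fragments still carry monotonically increasing SEQ values.
first = stt_seq(frame_len=9000, offset=0)
second = stt_seq(frame_len=9000, offset=1446)
assert second > first
assert stt_seq_decode(second) == (9000, 1446)
```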

Figure 11. STT fragment format

5. Summary
From the above, we can see that building an overlay network requires carrying virtual-network data through tunnels, and constructing a tunnel requires encapsulating the original payload packet. Depending on the layers involved, the options include MAC-in-UDP (VXLAN), MAC-in-IP (NVGRE), MAC-in-TCP (STT), Any-in-UDP (VXLAN-GPE), and Any-in-Any (GRE). Different protocols have different characteristics because of the encapsulation layers they choose. For example, VXLAN uses standard UDP encapsulation and therefore traverses networks most easily. STT uses TCP encapsulation but modifies the meaning of the TCP header and maintains no TCP state, so STT data frames will fail to pass through some middleboxes. NVGRE uses IP encapsulation, so NVGRE packets will fail through middleboxes that require Layer 4 state information. This section has described encapsulation technologies for overlay networks, but we should note that an overlay is not required for network virtualization. For example, the hop-by-hop approach used in NEC's VTN, mentioned in the previous section, is also an option. The overlay approach virtualizes at the host side, while the hop-by-hop approach virtualizes at the controller. With an overlay, the intermediate network devices see only physical-network traffic and virtual-machine traffic is hidden; with hop-by-hop, the network devices see the virtual machines' traffic directly, so they can apply QoS and similar operations to it. At the same time, however, the hop-by-hop approach requires network devices to maintain per-virtual-machine network state, and when the network hosts many virtual machines, this overhead cannot be ignored.

Reference documents
[1] Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks, RFC 7348, http://datatracker.ietf.org/doc/rfc7348/

[2] Generic Protocol Extension for VXLAN, http://tools.ietf.org/html/draft-quinn-vxlan-gpe-04

[3] M. Sridharan, et al., "NVGRE: Network Virtualization using Generic Routing Encapsulation", draft-sridharan-virtualization-nvgre-08, April 2015.

[4] B. McConnell, et al., "A Stateless Transport Tunneling Protocol for Network Virtualization (STT)", draft-davie-stt-01, March 2012, http://datatracker.ietf.org/doc/draft-davie-stt/

Author Profiles:
Bing Zhang (PhD), male, associate researcher at the Institute of Computing Technology, Chinese Academy of Sciences. His main research direction is high-performance interconnection networks for large-scale computer systems, including data center networks and networks-on-chip.

Wu Jie, female, postgraduate at the Institute of Computing Technology, Chinese Academy of Sciences. Her main research direction is data center network virtualization.

Liu Jingyi, female, Huazhong University of Science and Technology. Her main research direction is data center network virtualization. Note: this article covers work done during her internship at the Institute of Computing Technology, Chinese Academy of Sciences.
