DockOne WeChat Share (112): Principle and implementation of the VXLAN backend in flannel

"Editor's note" Overlay network is one of the important solutions of Kubernetes network model, and flannel, as a mature solution for anxiety, can build overlay network based on a variety of backend, which is based on the Vxlan of Linux kernel Backend has obvious advantages in both performance and ease of use.

"Shanghai station |3 Day burn Brain Service Architecture training Camp" Training content includes: DevOps, MicroServices, Spring Cloud, Eureka, Ribbon, feign, Hystrix, Zuul, Spring Cloud Config, spring Cloud sleuth and so on.

This share introduces the VXLAN backend in flannel and covers two topics:

A deep understanding of how VXLAN works in the kernel: use native tools such as iproute2 and bridge to build a VXLAN-based overlay network.

An understanding of how flannel works when using the VXLAN backend: building on the kernel VXLAN principles, analyze part of the flannel source code to understand how its VXLAN backend works.

I. The principle of VXLAN

Virtual Extensible Local Area Network (VXLAN) is a protocol for building a Layer 2 logical network on top of an existing Layer 3 physical network.

The Linux kernel added VXLAN protocol support in v3.7.0 at the end of 2012 (author: Stephen Hemminger), so to use the kernel's VXLAN support you need at least kernel 3.7+ (3.9+ recommended).

Stephen Hemminger also implemented tools such as iproute2 and bridge to manage complex network configurations on Linux; these are available by default in most Linux distributions today.

VXLAN is essentially a tunneling protocol used to implement a virtual Layer 2 network on top of a Layer 3 network. Loosely speaking, a tunneling protocol is a bit like a video conference call: it connects separate meeting rooms so that everyone can talk directly, as if sitting in the same room. Many other tunneling protocols, such as GRE, carry an identifier that plays a role similar to VXLAN's VNI.

Another important feature of tunneling protocols is software extensibility, one of the cornerstones of software-defined networking (SDN).

Flannel has two tunnel-based backends: UDP (the default implementation) and VXLAN. Both are essentially tunneling protocols; the difference lies only in the protocol itself and how it is implemented.

As an aside: tunneling has been supported in much older kernels — as far as I remember, v2.2+ can already create virtual networks with tunnels — so the UDP backend is suitable for Linux versions without VXLAN support, though its performance is somewhat worse than the VXLAN backend's.

That covers the background; next comes an introduction to the kernel's VXLAN support.

Figure 1. VXLAN can build a Layer 2 virtual network across hosts spread over multiple network segments

Figure 2. VXLAN fundamentals: it is still a tunnel; the difference lies only in the tunneling protocol itself

To illustrate the VXLAN principles shown in Figure 1 and Figure 2, we manually build an overlay network between two VPSes on different network segments and run a Docker container on each node. When the containers communicate directly using their virtual-network IPs, the experiment has succeeded.

Figure 3. Network topology of the manually built VXLAN virtual network

Figure 3 mentions the concept of a VTEP (VXLAN Tunnel Endpoint), which is essentially the endpoint of the tunnel described earlier.

Now let's formally start building the virtual network shown in Figure 3 by hand:

Step 1. Create the Docker bridge

The default Docker bridge address range is 172.17.0.1/24 (172.17.42.1/24 on older versions), whereas this experiment requires the subnets on the two nodes to be 192.1.78.1/24 on Node1 and 192.1.87.1/24 on Node2.

Modify the Docker daemon startup parameters, adding the following, and then restart the Docker daemon:

Node1: --bip=192.1.78.1/24

Node2: --bip=192.1.87.1/24
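
On newer Docker versions the same bridge range can also be set in the daemon configuration file rather than as a startup flag (a sketch, assuming /etc/docker/daemon.json is supported by your Docker version; use 192.1.87.1/24 on Node2). The file on Node1 would contain:

{
  "bip": "192.1.78.1/24"
}

After editing the file, restart the Docker daemon for the change to take effect.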



At this point, containers on Node1 and Node2 cannot communicate with each other directly, Node1 cannot reach the containers on Node2 across hosts, and Node2 cannot reach the containers on Node1.

Step 2. Create the VTEPs

Execute the following commands on Node1:
PREFIX=vxlan
IP=$EXTERNAL_IP_OF_NODE_1      # placeholder: the external (public) IP of Node1
DESTIP=$EXTERNAL_IP_OF_NODE_2  # placeholder: the external (public) IP of Node2
PORT=8579
VNI=1
SUBNETID=78
SUBNET=192.$VNI.0.0/16
VXSUBNET=192.$VNI.$SUBNETID.0/32
DEVNAME=$PREFIX.$VNI

ip link delete $DEVNAME        # remove any leftover device; ignore the error if it does not exist
ip link add $DEVNAME type vxlan id $VNI dev eth0 local $IP dstport $PORT nolearning
echo '3' > /proc/sys/net/ipv4/neigh/$DEVNAME/app_solicit   # ask userspace to resolve L3 misses
ip address add $VXSUBNET dev $DEVNAME
ip link set $DEVNAME up
ip route delete $SUBNET dev $DEVNAME scope global
ip route add $SUBNET dev $DEVNAME scope global

Execute the following commands on Node2:
PREFIX=vxlan
IP=$EXTERNAL_IP_OF_NODE_2      # placeholder: the external (public) IP of Node2
DESTIP=$EXTERNAL_IP_OF_NODE_1  # placeholder: the external (public) IP of Node1
VNI=1
SUBNETID=87
PORT=8579
SUBNET=192.$VNI.0.0/16
VXSUBNET=192.$VNI.$SUBNETID.0/32
DEVNAME=$PREFIX.$VNI

ip link delete $DEVNAME
ip link add $DEVNAME type vxlan id $VNI dev eth0 local $IP dstport $PORT nolearning
echo '3' > /proc/sys/net/ipv4/neigh/$DEVNAME/app_solicit
ip -d link show                # optional: inspect the newly created VTEP
ip addr add $VXSUBNET dev $DEVNAME
ip link set $DEVNAME up
ip route delete $SUBNET dev $DEVNAME scope global
ip route add $SUBNET dev $DEVNAME scope global
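
With both scripts run, you can confirm on either node that the VTEP carries the expected attributes (a sketch; the exact flags printed vary slightly between iproute2 versions):

ip -d link show $DEVNAME    # should show "vxlan id 1 ... dstport 8579 nolearning"
ip route show dev $DEVNAME  # should show the 192.1.0.0/16 route via the VTEP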

Step 3. Configure the forwarding table (FDB) of each VTEP
# Node1 ($MAC_OF_VTEP_ON_NODE_2 is a placeholder for the MAC of vxlan.1 on Node2)
node1$ bridge fdb add $MAC_OF_VTEP_ON_NODE_2 dev $DEVNAME dst $DESTIP

# Node2 ($MAC_OF_VTEP_ON_NODE_1 is a placeholder for the MAC of vxlan.1 on Node1)
node2$ bridge fdb add $MAC_OF_VTEP_ON_NODE_1 dev $DEVNAME dst $DESTIP
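
As a concrete illustration with hypothetical values (the VTEP MAC is read from ip link show vxlan.1 on the peer node; 203.0.113.2 stands in for Node2's public IP):

# on Node1 -- both the MAC and the IP below are made-up examples
bridge fdb add 3e:a1:72:5c:0d:9f dev vxlan.1 dst 203.0.113.2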

Step 4. Configure neighbors (IPv4 entries in the ARP table)
# Node1 ($IP_ON_NODE_2 is a placeholder for a virtual-network IP on Node2)
node1$ ip neighbor add $IP_ON_NODE_2 lladdr $MAC_OF_VTEP_ON_NODE_2 dev vxlan.1

# Node2 ($IP_ON_NODE_1 is a placeholder for a virtual-network IP on Node1)
node2$ ip neighbor add $IP_ON_NODE_1 lladdr $MAC_OF_VTEP_ON_NODE_1 dev vxlan.1

Note: the ARP table is not normally updated by hand; in a real VXLAN deployment the corresponding network agent listens for L3 misses and updates it dynamically. The ARP entries here are only for testing, and if multiple IPs are to be reached across hosts, each cross-host IP needs its own ARP entry.
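
If you want to watch these neighbor events yourself while experimenting, iproute2 can print them as they happen (a sketch; flannel receives the same events over a netlink socket rather than by running a command):

ip monitor neigh    # prints neighbor-table changes, including the misses the kernel hands to userspace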

The operations above require root privileges. Once they are complete, the overlay network is up; we now test two kinds of connectivity to wrap up the experiment:
    • A container on Node1 communicates directly with a container on Node2 (container to cross-host container)
    • Node1 communicates directly with a container on Node2, and Node2 with a container on Node1 (host to cross-host container)


First, the test of direct communication between a container and a cross-host container.

Start a busybox container on each of Node1 and Node2:
node1$ docker run -it --rm busybox sh
node1$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
6: eth0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu qdisc noqueue
    link/ether 02:42:c0:01:4e:02 brd ff:ff:ff:ff:ff:ff
    inet 192.1.78.2/24 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:c0ff:fe01:4e02/64 scope link
       valid_lft forever preferred_lft forever
node2$ docker run -it --rm busybox sh
node2$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
10: eth0@if11: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu qdisc noqueue
    link/ether 02:42:c0:01:57:02 brd ff:ff:ff:ff:ff:ff
    inet 192.1.87.2/24 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:c0ff:fe01:5702/64 scope link
       valid_lft forever preferred_lft forever

Next, let's check the connectivity between the containers.
node1@busybox$ ping -c 1 192.1.87.2
PING 192.1.87.2 (192.1.87.2): data bytes
bytes from 192.1.87.2: seq=0 ttl=62 time=2.002 ms

node2@busybox$ ping -c 1 192.1.78.2
PING 192.1.78.2 (192.1.78.2): data bytes
bytes from 192.1.78.2: seq=0 ttl=62 time=1.360 ms

Then the test of connectivity between a host and a cross-host container.
node1$ ping -c 1 192.1.87.2
PING 192.1.87.2 (192.1.87.2) bytes of data.
bytes from 192.1.87.2: icmp_seq=1 ttl=63 time=1.49 ms

node2$ ping -c 1 192.1.78.2
PING 192.1.78.2 (192.1.78.2) bytes of data.
bytes from 192.1.78.2: icmp_seq=1 ttl=63 time=1.34 ms

A recording of the full experiment can be viewed at: https://asciinema.org/a/bavkebqxc4wjgb2zv0t97es9y.

II. The implementation of the VXLAN backend in flannel

Once you understand how VXLAN works in the kernel, flannel's mechanism is not hard to understand.

Note:
    • With the VXLAN backend, data is forwarded by the kernel; flannel does not forward data itself, it only maintains ARP entries dynamically
    • The UDP backend does take on data forwarding itself (its implementation is not expanded on here): it ships with a proxy written in C that connects the tunnel endpoints on different nodes


The source code discussed here is based on the latest stable version at the time, v0.7.0.
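
For reference, the VXLAN backend is selected through flannel's network configuration stored in etcd. A minimal sketch (key names follow the flannel documentation; the CIDR, VNI and port values are illustrative, chosen to match the experiment above):

etcdctl set /coreos.com/network/config '{
  "Network": "192.1.0.0/16",
  "Backend": {
    "Type": "vxlan",
    "VNI": 1,
    "Port": 8579
  }
}'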

When it starts, the VXLAN backend dynamically launches two concurrent tasks:
    1. Listen for L3 misses from the kernel and deserialize them into Golang objects
    2. Automatically update the local neighbor configuration, based on the L3 misses and the subnet configuration in etcd
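
On a node running flanneld with the VXLAN backend, the entries it maintains can be inspected with ordinary iproute2/bridge commands (a sketch; flannel names its VTEP flannel.<VNI>, so flannel.1 by default):

ip -d link show flannel.1      # the VTEP created by flannel
ip neigh show dev flannel.1    # neighbor entries flannel fills in for remote subnets
bridge fdb show dev flannel.1  # forwarding entries pointing at remote VTEPs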


The relevant source code can be found here.

Finally, there is a small implementation detail in flannel, added in v0.7.0: the VTEP IP is now configured with a /32 mask instead of the /16 mask used in earlier versions. This avoids broadcasts and solves the "network storm" problem that broadcast traffic could cause on a VXLAN network.
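
In iproute2 terms, the change amounts to the difference between the following two commands (addresses illustrative, following the naming used earlier):

ip addr add 192.1.78.0/16 dev flannel.1   # pre-0.7.0: /16 mask, one large broadcast domain spanning all nodes
ip addr add 192.1.78.0/32 dev flannel.1   # 0.7.0+: /32 mask, no broadcast traffic on the VTEP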

III. Summary

    • Flannel has several backends; the VXLAN backend forwards data through the kernel, while the UDP backend forwards data through a proxy running as a user-space process
    • With the VXLAN backend, briefly stopping and restarting flanneld does not cause a network outage; with the UDP backend it does
    • Many third-party tests show that the UDP backend's network performance is roughly an order of magnitude worse than the VXLAN backend's; as long as the kernel supports it (v3.9+), the VXLAN backend is recommended
    • When using flannel's VXLAN backend, upgrading to 0.7+ is recommended, because earlier versions have a potential network-storm problem


IV. Q&A

Q: Can flannel create multiple networks and provide isolation between them?

A: Yes. The latest flannel adds support for managing multiple networks: you can set up several networks, the configuration format in etcd differs slightly, and flanneld takes a startup parameter that selects which networks to initialize.
Q: If cross-node access does not work when using flannel, what are convenient ways to start troubleshooting?

A: First check whether the virtual network you specified conflicts with segments already used in the physical network; then check that the UDP ports between the nodes are reachable; finally, confirm that the current kernel supports VXLAN — minimum v3.7, v3.9+ recommended. The default CentOS 7 kernel already meets the requirement.
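A minimal checklist along those lines (generic Linux tooling; eth0 and port 8472 — flannel's usual default VXLAN UDP port — are assumptions to adjust for your environment):

uname -r                          # kernel must be 3.7+, ideally 3.9+
lsmod | grep vxlan                # is the vxlan module available/loaded?
ip -d link show flannel.1         # does the VTEP exist with the expected VNI and port?
tcpdump -ni eth0 udp port 8472    # do encapsulated packets actually leave and arrive?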
Q: Some test data: VM to VM (VLAN) gives 7.74 Gbits/sec, while two containers over flannel's VXLAN backend reach 1.71 Gbits/sec. Is this normal? Where does the VXLAN bandwidth loss occur, and what are the tuning ideas? Thanks.

A: First I would confirm whether your results are for TCP or UDP; it is best to test both. These are my results on two DigitalOcean VPSes, for reference only:
https://github.com/yangyuqian/… (the "Performance Evaluation" section)
Once you understand the principle, it is easy to judge where the bottleneck lies: the L2 frames between nodes are encapsulated in UDP and forwarded, and I suspect that part accounts for most of the overhead.
Q: When using flannel, if I need to add a new network segment, how does each node get the latest routing information? Do I need to update the flannel configuration on every node and restart flannel?

A: A good, practical question. First, you can restart flanneld to pick up new network configuration. Second, flannel re-leases the networks within the cluster every 24h, so even without a restart the local network information is refreshed every 24h; if the local configuration no longer matches what flannel has configured in etcd, the network configuration is regenerated.
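The subnet currently leased to the local node can also be checked in the environment file flanneld writes out (a sketch; the path is the common default and the values shown are illustrative):

cat /run/flannel/subnet.env
# FLANNEL_NETWORK=192.1.0.0/16
# FLANNEL_SUBNET=192.1.78.1/24
# FLANNEL_MTU=1450
# FLANNEL_IPMASQ=false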
Q: I use the flannel VXLAN backend in a project. According to this article, forwarding is done by the kernel, so flanneld going down should not affect the network. In practice, however, when flanneld went down, access to the Docker containers from outside did break. What could the reason be?

A: First, to clarify: it is not that flanneld going down has no effect at all. When flanneld is down, the local ARP entries can no longer be updated automatically, but the network environment that has already been set up remains usable. You can compare this with the manual overlay-network setup earlier: what flanneld maintains is essentially the ARP table.
The above content was organized from the group share on the evening of March 28, 2017. The speaker, Yang Yuqian, is a senior software engineer on FreeWheel's infrastructure team, working mainly on the research, development, and promotion of service frameworks and container platforms; his main technical interests are Golang, Docker, Kubernetes, and so on. DockOne organizes weekly technical shares; interested readers can add WeChat: Liyingjiesz to join the group, and topic requests or offers to share are welcome.