Docker Network Basics
Since Kubernetes uses the Docker container as the carrier for application delivery, Docker's network characteristics also mean that Kubernetes, when building a network for container interoperability, must first address the issues of Docker's own networking.
Network namespaces
To support multiple instances of the network protocol stack, Linux introduces network namespaces. Resources in different namespaces are completely isolated and cannot communicate with each other directly. With different network namespaces, you can virtualize multiple distinct network environments on a single host. Docker uses network namespaces to implement network isolation between containers.
In the Linux network namespace, you can configure your own independent iptables rules to set up packet forwarding, NAT, packet filtering, and so on.
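As a minimal sketch of what this looks like in practice, assuming a namespace named test (created in the steps below); the rules and addresses shown here are purely illustrative:
# ip netns exec test iptables -t nat -A POSTROUTING -s 172.16.0.0/24 -j MASQUERADE   # NAT rule local to this namespace only
# ip netns exec test iptables -A FORWARD -p tcp --dport 80 -j ACCEPT                 # packet-filter rule local to this namespace only
# ip netns exec test iptables -t nat -L -n                                           # these tables are independent of the host's iptables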
Because network namespaces are isolated from each other and cannot communicate directly, if you want to connect two isolated network namespaces and exchange data between them, you need a veth device pair. One of the main functions of a veth device pair is to bridge different network protocol stacks; it works like a network cable, with each end connected to a different network namespace's protocol stack.
If you want to communicate between two namespaces, you must have a veth device pair.
Operation of the network namespace
1. Create a network namespace named test:
# ip netns add test
2. Execute the ip a command in this namespace:
# ip netns exec test ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
If you want to execute multiple commands, you can enter this network namespace directly and run them there:
# ip netns exec test sh
To leave the namespace shell again:
# exit
Devices can be moved between different network namespaces, such as the veth devices mentioned above. Because a device can only belong to one network namespace at a time, once a device is moved it can no longer be seen in the current namespace.
Veth device pairs
Because a veth device needs to connect two different network namespaces, veth devices generally come in pairs; the device at the other end is called the peer.
1. Create a veth device pair:
# ip link add veth0 type veth peer name veth1
This creates a veth device pair with veth0 on one end and veth1 on the other.
2. Check the device pair:
# ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT qlen 1000
    link/ether 52:54:00:7f:52:5a brd ff:ff:ff:ff:ff:ff
3: veth1@veth0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT qlen 1000
    link/ether 26:3f:dd:c0:70:cb brd ff:ff:ff:ff:ff:ff
4: veth0@veth1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT qlen 1000
    link/ether a2:91:f4:9c:5b:6b brd ff:ff:ff:ff:ff:ff
3. Move veth1 into the test network namespace:
# ip link set veth1 netns test
4. Check the current state of the device pair:
# ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT qlen 1000
    link/ether 52:54:00:7f:52:5a brd ff:ff:ff:ff:ff:ff
4: veth0@if3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT qlen 1000
    link/ether a2:91:f4:9c:5b:6b brd ff:ff:ff:ff:ff:ff link-netnsid 0
5. Look inside the test network namespace and confirm that the device has been moved there:
# ip netns exec test ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
3: veth1@if4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 26:3f:dd:c0:70:cb brd ff:ff:ff:ff:ff:ff link-netnsid 0
6. Because the devices on both ends have no addresses yet, they cannot communicate. Now assign addresses:
# ip addr add 172.16.0.1/24 dev veth0                       # assign an IP address to veth0 on this end
# ip netns exec test ip addr add 172.16.0.2/24 dev veth1    # configure an IP for veth1 on the peer end
7. Check the state of the veth devices; by default they are down:
# ip a | grep veth
4: veth0@if3: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast state DOWN qlen 1000
    inet 172.16.0.1/24 scope global veth0
# ip netns exec test ip a | grep veth
3: veth1@if4: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast state DOWN qlen 1000
    inet 172.16.0.2/24 scope global veth1
8. Bring up the veth device pair and check whether the link works:
# ip link set dev veth0 up
# ip netns exec test ip link set dev veth1 up
# ping 172.16.0.2
PING 172.16.0.2 (172.16.0.2) 56(84) bytes of data.
64 bytes from 172.16.0.2: icmp_seq=1 ttl=64 time=0.150 ms
64 bytes from 172.16.0.2: icmp_seq=2 ttl=64 time=0.028 ms
9. View the peer device
When there are many devices, it is not easy to tell which devices form a pair. You can use the ethtool command to look up the peer's interface index:
# ethtool -S veth0                        # check the peer interface index of veth0
NIC statistics:
     peer_ifindex: 3                      # the peer's interface index is 3
# ip netns exec test ip link | grep 3:    # information about the peer device with index 3
3: veth1@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT qlen 1000
# ip link | grep veth                     # the local veth0 has index 4
4: veth0@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT qlen 1000
# ip netns exec test ethtool -S veth1     # verify from the peer side
NIC statistics:
     peer_ifindex: 4
Network Bridge
A network bridge in Linux is similar to a real-world switch: it is a virtual Layer 2 device. A bridge can attach several network interface devices, such as eth0 and eth1. When data reaches the bridge, the bridge decides, based on the MAC address in the frame, whether to forward or discard it. The bridge automatically learns the internal MAC-to-port mappings and updates them periodically.
The difference between a Linux bridge and a real-world switch is that data arriving from a network interface is handed directly to the bridge itself, rather than being received on a particular physical port.
A Linux bridge can be assigned an IP address. When a device such as eth0 is added to the bridge, the IP bound to that device stops working; to restore communication, you need to configure an IP address on the bridge itself.
1. To configure the bridge, install the bridge-utils tool:
# yum install bridge-utils -y
2. Add a bridge device br0:
# brctl addbr br0
3. Add eth0 to br0 (after this step, the IP on eth0 stops working: the address is still configured on eth0, but it can no longer receive data; if you are connected over SSH, the session will be disconnected):
# brctl addif br0 eth0
4. Delete the IP on eth0:
# ip addr del dev eth0 10.0.0.1/24
5. Add this IP to br0:
# ifconfig br0 10.0.0.1/24 up
6. Add a default route via br0:
# route add default gw 10.0.0.254
7. You can check the current routing information with any of the following commands:
# ip route list
# netstat -rn
# route -n
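On newer distributions the same bridge operations can also be done with the ip/bridge tools instead of brctl. A rough equivalent of steps 2-6 above, assuming the same interface names and addresses, might look like this:
# ip link add br0 type bridge          # equivalent of: brctl addbr br0
# ip link set eth0 master br0          # equivalent of: brctl addif br0 eth0
# ip addr del 10.0.0.1/24 dev eth0     # remove the address from eth0
# ip addr add 10.0.0.1/24 dev br0      # move it to the bridge
# ip link set br0 up
# ip route add default via 10.0.0.254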
Docker Network implementation
In a pure Docker environment, Docker supports four network modes:
- Host mode: Use the host's IP and port
- Container mode: Share network with existing containers
- None Mode: No network configuration
- Bridge mode: the default mode, using a bridged network; Kubernetes uses this mode.
Because only bridge mode is used in Kubernetes, only bridge mode is discussed here.
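As a quick, hedged illustration of how these modes are selected (the image and container names here are just examples), the mode is chosen with the --network flag of docker run:
# docker run --network=host nginx                # host mode: shares the host's network stack
# docker run --network=container:web busybox     # container mode: shares the netns of an existing container named "web"
# docker run --network=none busybox              # none mode: no network configured
# docker run nginx                               # bridge mode (default): attached to docker0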
Docker Network Model
Network Example diagram:
From the diagram, the network structure of the container is clear: the NIC eth0 inside the container and the vethXXX device attached to the docker0 bridge form a veth device pair. The vethXXX end is bound to the docker0 bridge and therefore has no IP address; the eth0 inside the container is assigned an address in the same network segment as docker0, which is what allows containers to communicate with each other.
Looking at a host that runs two containers:
# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:15:c2:12 brd ff:ff:ff:ff:ff:ff
    inet 192.168.20.17/24 brd 192.168.20.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe15:c212/64 scope link
       valid_lft forever preferred_lft forever
3: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 02:42:fa:6f:13:18 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:faff:fe6f:1318/64 scope link
       valid_lft forever preferred_lft forever
7: veth37e9040@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP
    link/ether f2:4e:50:a5:fb:b8 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::f04e:50ff:fea5:fbb8/64 scope link
       valid_lft forever preferred_lft forever
19: veth36fb1f6@if18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP
    link/ether 7a:96:bc:c7:03:d8 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet6 fe80::7896:bcff:fec7:3d8/64 scope link
       valid_lft forever preferred_lft forever
By viewing the bridge information, you can verify that the two veth devices are bound to docker0:
# brctl show
bridge name     bridge id               STP enabled     interfaces
docker0         8000.0242fa6f1318       no              veth36fb1f6
                                                        veth37e9040
By default, Docker hides the network namespace configuration. If you want to see these namespaces with the ip netns list command, you need to do the following:
# docker inspect 506a694d09fb | grep Pid
            "Pid": 2737,
            "PidMode": "",
            "PidsLimit": 0,
# mkdir /var/run/netns
# ln -s /proc/2737/ns/net /var/run/netns/506a694d09fb
# ip netns list
506a694d09fb (id: 0)
6d9742fb3c2d (id: 1)
View the IP of two containers separately:
# ip netns exec 506a694d09fb ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
6: eth0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.2/16 scope global eth0
       valid_lft forever preferred_lft forever
# ip netns exec 6d9742fb3c2d ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
18: eth0@if19: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 02:42:ac:11:00:03 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.3/16 scope global eth0
       valid_lft forever preferred_lft forever
You can see that the two containers belong to different network namespaces but sit in the same network segment; through the veth device pairs bound to docker0, they can communicate with each other.
You can use ethtool -S <veth-name> to find the corresponding peer, as demonstrated earlier; in fact, the veth name itself (in the form vethXXXX@ifN) already encodes the peer's interface index N.
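As an alternative sketch for matching a container's eth0 to its host-side veth (the container name and index value here are illustrative), you can read the peer index recorded under /sys inside the container and look it up on the host:
# docker exec <container> cat /sys/class/net/eth0/iflink    # prints the peer's interface index, e.g. 7
# ip link | grep "^7:"                                      # the host interface with that index is the matching vethXXXX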
Network implementation of Kubernetes
Kubernetes networking mainly solves the following problems:
- Container-to-container communication
- Abstract pod-to-pod communication
- Pod-to-service communication
- Communication between the outside of the cluster and within the cluster
Communication between containers
Containers in the same pod belong to the same network namespace and share the same Linux network stack, so they can reach the other containers in the pod directly over the local localhost network.
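A minimal sketch of verifying this, assuming a hypothetical pod named mypod that runs an nginx container plus a busybox sidecar (names and port are illustrative): the sidecar can reach nginx simply via localhost, even though nginx runs in a different container.
# kubectl exec mypod -c sidecar -- wget -qO- http://localhost:80    # served by the nginx container in the same pod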
Communication between pods
Communication on the same host:
On the same host, pods are attached to the docker0 bridge, so they can communicate with each other directly; the principle is similar to the multi-container interoperability in the pure Docker environment described above.
The other scenario is communication between pods on different hosts; the schematic diagram is as follows:
Network Plugin
CNI (Container Network Interface) is a container network specification proposed by CoreOS, which defines a simple interface specification between the container runtime and the network plug-in.
The CNI model involves two concepts:
- Container: an environment that has its own Linux network namespace, for example one created by Docker or rkt.
- Network: a network represents a set of entities that can communicate with each other; these entities have independent, unique IP addresses.
The network plug-in used in Kubernetes
Kubernetes currently supports a variety of network plug-ins; any provider can integrate with Kubernetes by implementing the CNI plug-in specification. When specifying a plug-in in Kubernetes, you need to pass the plug-in parameters in the kubelet service startup parameters:
... --network-plugin=cni \
    --cni-conf-dir=/etc/cni/net.d \           # configuration files in this directory must conform to the CNI specification
    --cni-bin-dir=/opt/kubernetes/bin/cni ...
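For reference, a minimal CNI configuration file that such a directory might contain could look like the following; this is only a sketch using the standard bridge and host-local IPAM plug-ins, and the file name, network name, and subnet are illustrative:
# cat /etc/cni/net.d/10-mynet.conf
{
    "cniVersion": "0.3.1",
    "name": "mynet",
    "type": "bridge",
    "bridge": "cni0",
    "isGateway": true,
    "ipMasq": true,
    "ipam": {
        "type": "host-local",
        "subnet": "10.244.0.0/16",
        "routes": [ { "dst": "0.0.0.0/0" } ]
    }
}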
Several open source projects can be deployed into Kubernetes in the form of network plug-ins, including Calico, Canal, Cilium, Contiv, Flannel, Romana, Weave Net, and others.
Flannel Network Implementation principle
Flannel schematic diagram:
The reason for using a separate third-party network plug-in to extend Kubernetes is mainly that, in an environment that uses Docker, the default docker0 network segment on every node is 172.17.0.0/16. To allow pods (which here can be understood as containers) on different host nodes to communicate with each other, we cannot rely on the default network segment provided by docker0. Instead, we deploy a Flannel overlay network so that the docker0 network on each node is in a different segment; then, by adding routing and forwarding policies, every pod in the cluster can communicate within the same virtual network.
Flannel first connects to etcd and uses etcd to manage the assignable IP address segments. It also watches the actual address of each pod in etcd and builds a pod routing table in memory. It encapsulates the packets handed to it by docker0 and uses the physical network connection to deliver the data to the flannel process on the target node, which completes pod-to-pod communication.
To avoid conflicts with pod IPs on other nodes, Flannel registers the address segment it obtains in etcd each time it acquires one. By default, Flannel uses UDP as the underlying transport protocol.
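As a rough illustration of what this looks like in practice (the values are examples, assuming the etcd backend and Flannel's default key prefix), the cluster-wide address space sits in etcd and each node's leased subnet is written to a local environment file:
# etcdctl get /coreos.com/network/config          # cluster-wide Flannel configuration stored in etcd
{"Network": "10.244.0.0/16", "Backend": {"Type": "udp"}}
# cat /run/flannel/subnet.env                     # subnet leased by this node's flannel daemon
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.52.1/24
FLANNEL_MTU=1472
FLANNEL_IPMASQ=true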
Calico Network Implementation principle
If network policy is to be implemented in Kubernetes, a Flannel network alone is not enough: Flannel itself only solves the problem of pod interconnection. If you need the NetworkPolicy feature, you must use a plug-in that supports it, such as Calico, Romana, Weave Net, or Trireme. The principle of the commonly used Calico is described here.
Calico Introduction
Calico is a BGP-based, pure Layer 3 network solution. Calico uses the Linux kernel on each node to implement an efficient vRouter that is responsible for data forwarding. Each vRouter advertises the routing information of the containers running on its node to the entire Calico network via the BGP protocol and automatically sets up routing and forwarding rules toward the other nodes. Calico ensures that all data traffic between containers is interconnected via IP routing. A Calico node network can directly use the data center's network structure (L2 or L3) without additional NAT, tunnels, or overlay networks, so there is no extra encapsulation and decapsulation overhead; this saves CPU cycles and improves network efficiency, making a Calico network more efficient than Flannel.
Overlay Network and Calico network packet structure comparison (diagram):
Characteristics:
- In small clusters, Calico nodes can be interconnected directly (full BGP mesh); in large clusters this can be achieved with additional BGP route reflectors.
- Based on iptables, Calico also provides rich network policies, implementing Kubernetes NetworkPolicy to limit network reachability between containers (a sketch of such a policy appears below).
- In environments where an overlay network is required, Calico uses IP-in-IP tunneling and can coexist with other overlay networks such as Flannel.
- Calico also provides a dynamic implementation of network security rules.
- Calico is better suited to large Kubernetes clusters deployed on physical machines or in private clouds; compared with overlay networks such as Flannel, it offers higher performance and is simpler to deploy and maintain.
With Calico's simple policy language, you can achieve fine-grained control of communication between containers, virtual machine workloads, and bare-metal host endpoints.
Calico v3.0's integration with Kubernetes and OpenShift has undergone large-scale production validation.
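As referenced in the list above, a Kubernetes NetworkPolicy enforced by Calico might look like the following sketch; the namespace, labels, and port are illustrative, and the policy simply allows pods labeled app=frontend to reach pods labeled app=backend on TCP 8080:
# cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: backend
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
EOF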
Calico Architecture and Components
Calico architecture diagram:
Calico Components:
- Felix: the Calico agent. It runs on every host and is mainly responsible for configuring network resources (IP addresses, routing rules, iptables rules, etc.) for containers or virtual machines, ensuring cross-host container network interoperability.
- etcd: the data store for Calico.
- Orchestrator Plugin: orchestrator-specific code that integrates Calico tightly into the orchestrator; it mainly provides API translation and reports the status of the Felix agent back to the integration platform.
- BIRD: a BGP client component that distributes each node's routing information to the Calico network (using the BGP protocol).
- BGP Route Reflector (BIRD): one or more BGP route reflectors can be used for hierarchical route distribution in large clusters (optional component).
- calicoctl: Calico command-line tool.
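For example, the state of these components can be inspected with calicoctl; this is only a sketch, assuming calicoctl is configured to reach the Calico datastore:
# calicoctl node status          # shows the local BIRD BGP sessions and their peers
# calicoctl get ippool -o wide   # shows configured IP pools and whether IP-in-IP is enabled
# calicoctl get nodes            # lists the nodes registered in the Calico datastore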
Functions and implementation of each component
Felix
Felix is a daemon that runs on every machine that provides endpoints: in most cases, that means it runs on the nodes that host the managed containers or VMs. It is responsible for setting up routes and ACLs, and anything else required on the host, to provide the desired connectivity for the endpoints on that host.
Felix is generally responsible for the following tasks:
==Interface Management==:
Felix programs some information about interfaces into the kernel so that the kernel can correctly handle the traffic emitted by each endpoint. In particular, it ensures that the host responds to ARP requests from each workload with the host's MAC address and enables IP forwarding for the interfaces it manages. It also monitors interfaces as they appear and disappear so that the programming for those interfaces is applied at the appropriate time.
==Route Planning==:
Felix is responsible for programming routes to the endpoints on its host into the Linux kernel FIB (Forwarding Information Base). This ensures that packets destined for endpoints that arrive at the host are forwarded accordingly.
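As a hedged illustration of what this looks like on a node (addresses and interface names are examples, and the tunl0 route assumes IP-in-IP is enabled), routes programmed by Felix and learned via BIRD typically show up in the kernel routing table like this:
# ip route | grep -E 'bird|cali'
10.233.65.0/26 via 192.168.20.18 dev tunl0 proto bird onlink    # block of pod IPs on another node, learned via BGP
10.233.64.7 dev cali1a2b3c4d5e6 scope link                      # local pod endpoint reachable through its veth
blackhole 10.233.64.0/26 proto bird                             # this node's own address block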
==ACL Planning==:
Felix is also responsible for programming ACLs into the Linux kernel. These ACLs are used to ensure that only valid traffic is sent between endpoints and that the endpoint cannot bypass calico security.
==Status Reporting==:
Felix is responsible for providing data on the health status of the network. In particular, it reports errors and problems when configuring its host. The data is written to Etcd to make it visible to other components and operators of the network.
Orchestrator plug-in (optional)
Unlike Felix, there is no single orchestrator plugin: each major cloud orchestration platform (such as Kubernetes) has its own. These plug-ins bind Calico more tightly to the orchestrator, allowing users to manage the Calico network just like the network tools built into their orchestrator. In Kubernetes, the CNI plugin can be used to provide this functionality directly.
A good example of an orchestrator plug-in is the Calico Neutron ML2 mechanism driver. This component integrates with Neutron's ML2 plug-in, allowing users to configure Calico networks through Neutron API calls and providing seamless integration with Neutron. Its main features are:
==API Translation==:
The orchestrator inevitably has its own set of APIs for managing networks. The main task of the orchestrator plug-in is to translate those APIs into the Calico data model and store the result in the Calico data store.
Some of this translation is very simple; other parts may be more complex, turning a single complex operation (for example, live migration) into the series of simpler operations that the rest of the Calico network expects.
==Feedback==:
If necessary, the orchestrator plugin provides feedback from the Calico network to the orchestrator. Examples include providing information about Felix liveness, or marking certain endpoints as failed if their network setup fails.
Etcd
Calico uses etcd to provide communication between components and, as a consistent data store, to ensure that Calico can always build an accurate view of the network.
Depending on the orchestrator plug-in, etcd can either be the primary data store or a lightweight mirror of a separate data store. Main features:
==Data Storage==:
etcd stores the data for the Calico network in a distributed, fault-tolerant manner (this assumes an etcd cluster of at least three nodes). This ensures that the Calico network is always in a known-good state.
This distributed storage of calico data also improves the ability of calico components to read from the database, allowing them to distribute reads around the cluster.
==Communications Hub==:
etcd is also used as a communication bus between components. Each component watches for changes to the data in etcd and acts on them accordingly.
BGP Client (BIRD)
Calico deploys a BGP client on every node that also hosts Felix. The role of the BGP client is to read the routing state that Felix programs into the kernel and distribute it around the data center.
In Calico, the most commonly used BGP client is BIRD, but any BGP client that can pull routes from the kernel and distribute them (such as GoBGP) can fill this role.
==Route Distribution==:
When Felix inserts a route into the Linux kernel fib, the BGP client receives them and distributes them to other nodes in the deployment. This ensures that traffic is effectively routed around the deployment.
BGP Route Reflector (BIRD)
For large cluster deployments, simple BGP can become a limiting factor, because it requires every BGP client to connect to every other BGP client in a full mesh topology. The number of connections therefore grows as N^2, which becomes increasingly difficult to maintain as nodes are added. For this reason, in large cluster deployments Calico deploys a BGP route reflector. This component, commonly used on the Internet, acts as a central point for BGP client connections, so that clients no longer need to talk to every other BGP client in the cluster. For redundancy, multiple BGP route reflectors can be deployed seamlessly.
BGP route reflectors participate purely in network control: no endpoint data passes through them. In Calico, this BGP component is also most commonly BIRD, configured as a BGP route reflector rather than a standard BGP client.
==Route Distribution in Large Clusters==:
When the Calico BGP client advertises the route from its fib to the BGP route reflector, the BGP route reflector advertises the routes to the other nodes in the Calico network.
Kubernetes Network principle