Docker Core Technology and Implementation Principles

Source: Internet
Author: User
Tags: iptables, docker run

When virtualization technology comes up, Docker is usually the first thing that comes to mind. After four years of rapid development, Docker has become a de facto standard at many companies and is no longer a toy that can only be used during development. As a product widely deployed in production environments, Docker has a very mature community and a large user base, and its code base has grown very large.

For the same reasons (the project's growth, the splitting of functionality into separate components, and the constant churn of pull requests), it has become harder to understand Docker's overall architecture.

Although Docker now has many components and a very complex implementation, this article does not cover those implementation details. Instead, it discusses the core technologies that make Docker's style of virtualization possible.

First of all, Docker exists because back-end development and operations genuinely need a virtualization technology that solves the problem of keeping development and production environments consistent. With Docker we can put the environment a program runs in under version control, ruling out the possibility that differences in environment lead to different results at runtime. Such demand drove the emergence of virtualization technology, but without the right underlying technical support there would still be no complete product. The remainder of this article introduces the core technologies Docker relies on; once we understand how they are used and how they work, the principles behind Docker become clear.

Namespaces

Namespaces are a mechanism Linux provides for isolating resources such as process trees, network interfaces, mount points, and inter-process communication. In day-to-day use of Linux or macOS we rarely need to run multiple completely isolated environments, but if we start several services on one server, those services actually affect one another: each service can see the other services' processes and can access any file on the host machine. Often this is not what we want; we would prefer different services running on the same machine to be completely isolated, as if they were running on separate machines.

In that situation, once one service on the server is compromised, the intruder can reach every other service and file on the machine, which is exactly what we want to avoid. Docker prevents this by isolating containers from one another with Linux namespaces.

The Linux namespace mechanism provides seven different namespaces: CLONE_NEWCGROUP, CLONE_NEWIPC, CLONE_NEWNET, CLONE_NEWNS, CLONE_NEWPID, CLONE_NEWUSER, and CLONE_NEWUTS. With these seven flags we can choose, when creating a new process, which resources it should have isolated from the host machine.

Process

A process is a very important concept in Linux and other modern operating systems: it represents a program being executed and is a unit of work in a modern time-sharing system. On every *nix operating system we can print the processes currently executing with the ps command; on Ubuntu, for example, it produces output like the following:

$ ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 Apr08 ?        00:00:09 /sbin/init
root         2     0  0 Apr08 ?        00:00:00 [kthreadd]
root         3     2  0 Apr08 ?        00:00:05 [ksoftirqd/0]
root         5     2  0 Apr08 ?        00:00:00 [kworker/0:0H]
root         7     2  0 Apr08 ?        00:07:10 [rcu_sched]
root         2     0 Apr08 ?        00:00:00 [migration/0]
root         2     0 Apr08 ?        00:01:54 [watchdog/0]
...

Among the many processes running on the machine, two are special: /sbin/init with PID 1 and kthreadd with PID 2. Both are created by the idle process (PID 0) in Linux. The former performs part of the kernel's initialization and system configuration and creates registration processes such as getty; the latter is responsible for managing and scheduling other kernel threads.

If we run a new Docker container on the current Linux machine, enter its bash with docker exec, and print all of its processes, we get the following results:

root@iz255w13cy6z:~# docker run -it -d ubuntu
b809a2eb3630e64c581561b08ac46154878ff1c61c6519848b4a29d412215e79
root@iz255w13cy6z:~# docker exec -it b809a2eb3630 /bin/bash
root@b809a2eb3630:/# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 15:42 pts/0    00:00:00 /bin/bash
root         9     0  0 15:42 pts/1    00:00:00 /bin/bash
root         9     0 15:43 pts/1    00:00:00 ps -ef

The ps command inside the new container prints a very clean process list: only three processes, including the ps -ef we just ran, and the dozens of processes on the host machine are nowhere to be seen.

The Docker container has successfully isolated the processes inside it from the processes on the host machine. Conversely, if we print all processes on the host, we find the following three Docker-related entries:

UID        PID  PPID  C STIME TTY          TIME CMD
root     29407     1  0 Nov16 ?        00:08:38 /usr/bin/dockerd --raw-logs
root      1554 29407  0 Nov19 ?        00:03:28 docker-containerd -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --metrics-interval=0 --start-timeout 2m --state-dir /var/run/docker/libcontainerd/containerd --shim docker-containerd-shim --runtime docker-runc
root      5006  1554  0 08:38 ?        00:00:00 docker-containerd-shim b809a2eb3630e64c581561b08ac46154878ff1c61c6519848b4a29d412215e79 /var/run/docker/libcontainerd/b809a2eb3630e64c581561b08ac46154878ff1c61c6519848b4a29d412215e79 docker-runc

On the host, these processes form a tree: dockerd (PID 29407) is the parent of docker-containerd (PID 1554), which is in turn the parent of each docker-containerd-shim (here PID 5006), and the shim's children are the processes running inside the container.

This is achieved by passing CLONE_NEWPID when the new process is created with clone(2); in other words, process isolation is implemented with Linux namespaces, and no process inside a Docker container knows anything about the host machine's processes.

Inside the Docker daemon, the call chain that configures these namespaces when a container starts looks like this:

containerRouter.postContainersStart
└── daemon.ContainerStart
    └── daemon.createSpec
        └── setNamespaces
            └── setNamespace

When we run docker run or docker start, the daemon creates a Spec in the following method to set up process isolation:

func (daemon *Daemon) createSpec(c *container.Container) (*specs.Spec, error) {
	s := oci.DefaultSpec()

	// ...
	if err := setNamespaces(daemon, &s, c); err != nil {
		return nil, fmt.Errorf("linux spec namespaces: %v", err)
	}

	return &s, nil
}

The setNamespaces method sets not only the PID namespace but also the namespaces associated with the user, network, IPC, and UTS:

func setNamespaces(daemon *Daemon, s *specs.Spec, c *container.Container) error {
	// user
	// network
	// ipc
	// uts

	// pid
	if c.HostConfig.PidMode.IsContainer() {
		ns := specs.LinuxNamespace{Type: "pid"}
		pc, err := daemon.getPidContainer(c)
		if err != nil {
			return err
		}
		ns.Path = fmt.Sprintf("/proc/%d/ns/pid", pc.State.GetPID())
		setNamespace(s, ns)
	} else if c.HostConfig.PidMode.IsHost() {
		oci.RemoveNamespace(s, specs.LinuxNamespaceType("pid"))
	} else {
		ns := specs.LinuxNamespace{Type: "pid"}
		setNamespace(s, ns)
	}

	return nil
}

The Spec carrying all the namespace settings is eventually passed as a parameter to the Create function when the new container is created:

daemon.containerd.Create(context.Background(), container.ID, spec, createOptions)

All of the namespace-related settings are made in the two functions above; through them, Docker successfully isolates the container's processes and network from the host machine.

Network

Even if a Docker container isolates its network from the host through Linux namespaces, it is of limited use if it has no way to reach the Internet through the host's network. So although Docker can create an isolated network environment with namespaces, the services running in Docker still need a connection to the outside world in order to be useful.

Every container started with docker run has its own separate network namespace, and Docker provides four different network modes: host, container, none, and bridge.

In this section we introduce Docker's default network mode: bridge mode. In this mode, besides assigning an isolated network namespace, Docker also assigns an IP address to each container. When the Docker server starts on a host, it creates a new virtual bridge, docker0, and every container subsequently started on that host is connected to this bridge by default.

By default, a pair of virtual network interfaces (a veth pair) is created for each container. The two interfaces form a data channel: one end is placed inside the container, and the other is attached to the docker0 bridge. We can inspect the bridge's interfaces with the following command:

$ brctl show
bridge name	bridge id		STP enabled	interfaces
docker0		8000.0242a6654980	no		veth3e84d4f
							veth9953b75

docker0 assigns a new IP address to each container and sets the docker0 address as the container's default gateway. The bridge is connected to the host machine's network interface through iptables rules: all matching requests are forwarded to docker0 by iptables and then distributed by the bridge to the corresponding container.

$ iptables -t nat -L
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
DOCKER     all  --  anywhere             anywhere             ADDRTYPE match dst-type LOCAL

Chain DOCKER (2 references)
target     prot opt source               destination
RETURN     all  --  anywhere             anywhere

If we start a new Redis container on the current machine with docker run -d -p 6379:6379 redis and then look at the iptables NAT configuration again, we see a new rule in the DOCKER chain:

DNAT       tcp  --  anywhere             anywhere             tcp dpt:6379 to:192.168.0.4:6379

This rule forwards TCP packets sent from any source to port 6379 of the current machine on to 192.168.0.4:6379.

That address is the IP Docker assigned to the Redis container, and if we ping it directly from the host machine, we find it is reachable:

$ ping 192.168.0.4
PING 192.168.0.4 (192.168.0.4) 56(84) bytes of data.
64 bytes from 192.168.0.4: icmp_seq=1 ttl=64 time=0.069 ms
64 bytes from 192.168.0.4: icmp_seq=2 ttl=64 time=0.043 ms
^C
--- 192.168.0.4 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.043/0.056/0.069/0.013 ms

From this series of observations we can infer how Docker exposes a container's ports and forwards packets: when a container needs to expose a service to the host machine, Docker assigns it an IP address and appends a new rule to iptables.

When we use redis-cli to access 127.0.0.1:6379 from the host machine's command line, the iptables NAT PREROUTING chain redirects the packets to 192.168.0.4. The redirected packets then pass through the rules in the iptables filter table, and finally, in the NAT POSTROUTING phase, the source address is disguised as 127.0.0.1. From the outside it looks as if we are talking to 127.0.0.1:6379, but it is actually the port exposed by the Docker container.

$ redis-cli -h 127.0.0.1 -p 6379 ping
PONG

Docker achieves network isolation through Linux namespaces and packet forwarding through iptables, allowing Docker containers to gracefully serve the host machine or other containers.

libnetwork

The entire networking stack is implemented in libnetwork, a library split out of Docker. It provides the implementations that connect different containers, and it also offers applications a consistent programming interface and network-layer abstraction through the Container Network Model.
