Docker core technology and implementation principles


When virtualization technology is mentioned, the first thing that comes to mind is Docker. After four years of rapid development, Docker has become the de facto standard at many companies and is no longer a toy used only during development. As a product widely used in production environments, Docker has a very mature community and a huge number of users, and its codebase has grown correspondingly large.

On top of that, the ongoing development of the project, the splitting of its functionality into separate components, and the odd renamings introduced by various PRs have made it even more difficult to understand Docker's overall architecture.

Although Docker has many components and its implementation is very complex, this article does not try to cover every detail of Docker's implementation. Instead, we want to talk about the core technologies that made Docker, as a virtualization technology, possible.

First of all, Docker appeared because back-end development and operations genuinely needed a virtualization technology that could keep development and production environments consistent. With Docker we can put a program's environment under version control as well, ruling out differences in results caused by differences in environment. But while these requirements drove the emergence of virtualization technology, we would still not have such a polished product without the right underlying technologies to support it. The remainder of this article covers some of the core technologies Docker uses; once we understand what they are and how they are used, Docker's implementation becomes much easier to follow.

Namespaces

Namespaces are a mechanism Linux provides for isolating resources such as process trees, network interfaces, mount points, and inter-process communication. When we use Linux or macOS day to day, we have no need to run multiple completely separated servers, but if we start several services on one server, these services actually affect each other: each service can see the processes of the other services and can access any file on the host machine. This is often not what we want; we would prefer that different services running on the same machine be completely isolated, as if they were running on several different machines.

In such a setup, once one service on the server is compromised, the intruder can access all services and files on the machine, which is also something we do not want to see. Docker actually isolates different containers from each other through Linux namespaces.

The Linux namespace mechanism provides seven different namespaces: CLONE_NEWCGROUP, CLONE_NEWIPC, CLONE_NEWNET, CLONE_NEWNS, CLONE_NEWPID, CLONE_NEWUSER, and CLONE_NEWUTS. With these seven options we can decide which resources of a new process should be isolated from the host machine when the process is created.
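To make this more concrete, here is a minimal sketch (not Docker's code) that uses Go's syscall package to start a shell with a few of these flags; it must be run as root on Linux:

package main

import (
	"os"
	"os/exec"
	"syscall"
)

// Minimal sketch (not Docker's code): start /bin/sh in new UTS, PID and mount
// namespaces by passing CLONE_* flags to clone(2) through SysProcAttr.
// The shell becomes PID 1 inside its new PID namespace; note that ps(1) will
// still read the host's /proc unless /proc is remounted in the new mount
// namespace.
func main() {
	cmd := exec.Command("/bin/sh")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
	}
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}

Docker does essentially the same thing, at a much larger scale, through runc and libcontainer, which translate a container's configuration into the appropriate set of clone flags.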

Process

A process is a very important concept in Linux and in modern operating systems: it represents a program being executed and is the unit of work in a modern time-sharing system. On every *nix operating system we can use ps to print the processes currently executing; for example, running the command on Ubuntu gives the following results:

$ ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 Apr08 ?        00:00:09 /sbin/init
root         2     0  0 Apr08 ?        00:00:00 [kthreadd]
root         3     2  0 Apr08 ?        00:00:05 [ksoftirqd/0]
root         5     2  0 Apr08 ?        00:00:00 [kworker/0:0H]
root         7     2  0 Apr08 ?        00:07:10 [rcu_sched]
root        39     2  0 Apr08 ?        00:00:00 [migration/0]
root        40     2  0 Apr08 ?        00:01:54 [watchdog/0]
...

There are many processes currently executing on the machine. Two of the processes above are very special: one is /sbin/init with pid 1, the other is kthreadd with pid 2. Both are created by the idle process (pid 0) in Linux; the former is responsible for part of the kernel's initialization work and system configuration and also creates registration processes such as getty, while the latter is responsible for managing and scheduling the other kernel processes.

If we run a new Docker container on the current Linux system, enter it via exec to get an internal bash, and print all of its processes, we get the following results:

root@iZ255w13cy6Z:~# docker run -it -d ubuntu
b809a2eb3630e64c581561b08ac46154878ff1c61c6519848b4a29d412215e79
root@iZ255w13cy6Z:~# docker exec -it b809a2eb3630 /bin/bash
root@b809a2eb3630:/# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 15:42 pts/0    00:00:00 /bin/bash
root         9     0  0 15:42 pts/1    00:00:00 /bin/bash
root        17     9  0 15:43 pts/1    00:00:00 ps -ef

Running ps -ef inside the new container prints a very clean list of processes, with only the three processes above; the dozens of processes on the host machine are gone.

The current Docker container successfully isolates the processes in the container from the processes on the host machine, and if we print all current processes on the host machine, we get the following three Docker-related results:

UID        PID  PPID  C STIME TTY          TIME CMD
root     29407     1  0 Nov16 ?        00:08:38 /usr/bin/dockerd --raw-logs
root      1554 29407  0 Nov19 ?        00:03:28 docker-containerd -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --metrics-interval=0 --start-timeout 2m --state-dir /var/run/docker/libcontainerd/containerd --shim docker-containerd-shim --runtime docker-runc
root      5006  1554  0 08:38 ?        00:00:00 docker-containerd-shim b809a2eb3630e64c581561b08ac46154878ff1c61c6519848b4a29d412215e79 /var/run/docker/libcontainerd/b809a2eb3630e64c581561b08ac46154878ff1c61c6519848b4a29d412215e79 docker-runc

On the current host machine, these processes form a process tree: dockerd is the parent of docker-containerd, which in turn is the parent of a docker-containerd-shim process for each running container.

This isolation is implemented by passing CLONE_NEWPID when the new process is created with clone(2); by using a Linux namespace to isolate processes, no process inside the Docker container is aware of the host machine's processes. Inside the Docker daemon, the call chain that sets up these namespaces when a container is started looks like this:

containerRouter.postContainersStart
└── daemon.ContainerStart
    └── daemon.createSpec
        └── setNamespaces
            └── setNamespace

Docker containers use the technology above to isolate processes from the host machine. Every time we run docker run or docker start, a Spec used to set up inter-process isolation is created in the following method:

func (daemon *Daemon) createSpec(c *container.Container) (*specs.Spec, error) {
	s := oci.DefaultSpec()

	// ...

	if err := setNamespaces(daemon, &s, c); err != nil {
		return nil, fmt.Errorf("linux spec namespaces: %v", err)
	}

	return &s, nil
}

In the setNamespaces method, not only is the process-related namespace set, but also the namespaces associated with users, network, IPC, and UTS:

func setNamespaces(daemon *Daemon, s *specs.Spec, c *container.Container) error {
	// user
	// network
	// ipc
	// uts

	// pid
	if c.HostConfig.PidMode.IsContainer() {
		ns := specs.LinuxNamespace{Type: "pid"}
		pc, err := daemon.getPidContainer(c)
		if err != nil {
			return err
		}
		ns.Path = fmt.Sprintf("/proc/%d/ns/pid", pc.State.GetPID())
		setNamespace(s, ns)
	} else if c.HostConfig.PidMode.IsHost() {
		oci.RemoveNamespace(s, specs.LinuxNamespaceType("pid"))
	} else {
		ns := specs.LinuxNamespace{Type: "pid"}
		setNamespace(s, ns)
	}

	return nil
}

All namespace-related settings in the Spec are finally passed as arguments to the Create function when the new container is created:

daemon.containerd.Create(context.Background(), container.ID, spec, createOptions)

All namespace-related settings are completed in the two functions above, and through namespaces Docker successfully isolates a container's processes and network from the host machine.

Network

Docker containers achieve network isolation from the host through Linux network namespaces, but a service that cannot reach the outside world through the host's network would be of very limited use. So although Docker can create an isolated network environment through namespaces, services running in Docker still need to connect to the outside world in order to be useful.

Every container started with docker run has a separate network namespace, and Docker provides four different network modes: Host, Container, None, and Bridge.

In this section we introduce Docker's default network mode: Bridge mode. In this mode, besides allocating an isolated network namespace, Docker also assigns an IP address to every container. When the Docker daemon starts on the host it creates a new virtual bridge called docker0, and all containers subsequently started on that host are connected to this bridge in one way or another.

By default, each container is created together with a pair of virtual network interfaces (a veth pair) that form a data channel: one end is placed inside the newly created container and the other is added to the bridge named docker0. We can use the following command to view the interfaces on the current bridge:

$ brctl show
bridge name    bridge id          STP enabled    interfaces
docker0        8000.0242a6654980  no             veth3e84d4f
                                                 veth9953b75
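The veth pairs themselves are ordinary kernel network devices. As a rough sketch of what happens under the hood (not Docker's code; it uses the vishvananda/netlink library, which Docker's networking stack also builds on, and the interface names and target PID are made up), a veth pair can be created and one end moved into another process's network namespace like this:

package main

import (
	"github.com/vishvananda/netlink"
)

// Rough sketch (not Docker's code): create a veth pair and move one end into
// the network namespace of the process with PID 12345 (a made-up value that
// would in practice be the container's init process). Requires root.
func main() {
	veth := &netlink.Veth{
		LinkAttrs: netlink.LinkAttrs{Name: "veth-host"},
		PeerName:  "veth-container",
	}
	if err := netlink.LinkAdd(veth); err != nil {
		panic(err)
	}

	peer, err := netlink.LinkByName("veth-container")
	if err != nil {
		panic(err)
	}
	// Move the container end into the target network namespace; the host end
	// would then be attached to the docker0 bridge.
	if err := netlink.LinkSetNsPid(peer, 12345); err != nil {
		panic(err)
	}
}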

docker0 assigns a new IP address to each container and sets docker0's own IP address as the containers' default gateway. The bridge docker0 is connected to the host machine's network interface through iptables rules: all eligible requests are forwarded to docker0 by iptables and then distributed by the bridge to the corresponding container.

$ iptables -t nat -L
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
DOCKER     all  --  anywhere             anywhere             ADDRTYPE match dst-type LOCAL

Chain DOCKER (2 references)
target     prot opt source               destination
RETURN     all  --  anywhere             anywhere

We started a new Redis container on the current machine with the command docker run -d -p 6379:6379 redis, and when we then looked at the iptables NAT configuration we saw a new rule in the DOCKER chain:

DNAT       tcp  --  anywhere             anywhere             tcp dpt:6379 to:192.168.0.4:6379

This rule forwards TCP packets sent from any source to port 6379 of the current machine on to the address 192.168.0.4:6379.

This address is in fact the IP address Docker assigned to the Redis service, and if we ping this IP address directly on the current machine we can see that it is reachable:

$ ping 192.168.0.4
PING 192.168.0.4 (192.168.0.4) 56(84) bytes of data.
64 bytes from 192.168.0.4: icmp_seq=1 ttl=64 time=0.069 ms
64 bytes from 192.168.0.4: icmp_seq=2 ttl=64 time=0.043 ms
^C
--- 192.168.0.4 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.043/0.056/0.069/0.013 ms

From the above we can infer how Docker exposes ports inside a container and forwards packets: when a Docker container needs to expose a service to the host machine, Docker assigns an IP address to the container and appends a new rule to iptables.

When we access 127.0.0.1:6379 from the redis-cli command line on the host machine, the iptables NAT PREROUTING chain redirects the destination IP address to 192.168.0.4; the redirected packets then pass through the iptables FILTER rules, and in the NAT POSTROUTING phase the address is finally masqueraded as 127.0.0.1. So although from the outside it looks as if we are requesting 127.0.0.1:6379, the request is actually served by the port exposed by the Docker container.

$ redis-cli -h 127.0.0.1 -p 6379 ping
PONG

Docker implements network isolation through the Linux namespace, and packet forwarding via iptables, allowing Docker containers to gracefully serve host machines or other containers.

Libnetwork

The entire networking functionality of Docker is implemented by libnetwork, which provides an implementation for connecting different containers as well as a Container Network Model that gives applications a consistent programming interface and a network-layer abstraction.

The goal of libnetwork is to deliver a robust Container Network Model that provides a consistent programming interface and the required network abstractions for applications.

The most important concept in libnetwork, the Container Network Model (CNM), consists of several main components: the Sandbox, the Endpoint, and the Network.

In the Container Network Model, each container contains a Sandbox, which stores the container's network stack configuration, including its interfaces, routing table, and DNS settings; Linux implements this sandbox with a network namespace. Each sandbox may have one or more Endpoints, which on Linux are virtual network interfaces (veth); through an Endpoint, the sandbox joins the corresponding Network, which may be the Linux bridge mentioned above or a VLAN.
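To make the relationships between these three concepts easier to picture, here is a tiny illustrative sketch in Go; these types are made up for explanation and are not libnetwork's actual interfaces:

package main

// Illustrative types only; not libnetwork's real API.

// Network is something endpoints can join, e.g. a Linux bridge or a VLAN.
type Network struct {
	Name      string
	Endpoints []*Endpoint
}

// Endpoint connects a sandbox to exactly one network; on Linux it is
// typically one end of a veth pair.
type Endpoint struct {
	Name    string
	Network *Network
}

// Sandbox holds a container's network stack: interfaces, routes, and DNS.
// On Linux it is implemented with a network namespace.
type Sandbox struct {
	ContainerID string
	Endpoints   []*Endpoint // a sandbox may have several endpoints
}

// Join attaches the sandbox to a network through a new endpoint.
func (s *Sandbox) Join(n *Network, name string) *Endpoint {
	ep := &Endpoint{Name: name, Network: n}
	s.Endpoints = append(s.Endpoints, ep)
	n.Endpoints = append(n.Endpoints, ep)
	return ep
}

func main() {
	bridge := &Network{Name: "docker0"}
	sandbox := &Sandbox{ContainerID: "b809a2eb3630"}
	sandbox.Join(bridge, "eth0")
}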

For more information about libnetwork or the Container Network Model, you can read the Libnetwork Design document, and of course you can also read the source code to understand how the container network model is implemented on different operating systems.

Mount point

Although we have solved the problems of process and network isolation through Linux namespaces, so that a process inside Docker can no longer see the host machine's processes or use its network without restriction, processes in a Docker container can still access or modify other directories on the host machine, which is not what we want.

To create an isolated mount point namespace for a new process, the CLONE_NEWNS flag needs to be passed to the clone function, so that the child process gets a copy of its parent's mount points; if this flag is not passed, the child process shares the mount namespace with its parent, and its view of the file system stays synchronized with that of the parent process and the entire host.

If a container is to be started, it must be given a root file system (rootfs); the container uses this root file system to create its new process, and all binaries are executed inside it.

To start the container properly, a few specific directories (such as /proc, /sys, and /dev) need to be mounted inside rootfs, and besides these mounts we also need to create some symbolic links to ensure that system I/O works without problems.

To ensure that the current container process has no way to access other directories on the host machine, we also need to change the root of the process's file system with the pivot_root or chroot functions provided through libcontainer:

// pivot_root
put_old = mkdir(...);
pivot_root(rootfs, put_old);
chdir("/");
unmount(put_old, MS_DETACH);
rmdir(put_old);

// chroot
mount(rootfs, "/", NULL, MS_MOVE, NULL);
chroot(".");
chdir("/");

In this way we mount the directories the container needs into the container while preventing the container process from accessing other directories on the host machine, guaranteeing the isolation of the different file systems.

This part of the content comes from the spec.md file in libcontainer, which contains a description of the file system used by Docker. As to whether Docker really uses chroot to ensure that the current process cannot access the host machine's directories, the author actually has no definite answer: the Docker codebase is too large to know where to start, and the answers found through Google were either unanswered questions or conflicted with the description in the spec. If any reader has a clear answer, please leave a message under the blog; thank you very much.

Chroot

Here we briefly introduce chroot (change root). In a Linux system, the default directory structure starts at the root directory /. By using chroot we can change the root directory of the current process; changing the root directory lets us restrict a user's privileges, because under the new root there is no way to reach files in the old system root, which establishes a directory structure completely isolated from the original system.

The content about chroot here is based on the article Understanding Chroot; readers can read that article for more detailed information.
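As a concrete illustration (a minimal sketch written for this article, not Docker's implementation), a process can confine itself to a new root directory with the chroot and chdir system calls; it must be run as root:

package main

import (
	"fmt"
	"os"
	"syscall"
)

// Minimal illustration of chroot (not Docker's code): confine the current
// process to the directory given on the command line. Must be run as root.
func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: chroot-demo <new-root>")
		os.Exit(1)
	}
	newRoot := os.Args[1]
	if err := syscall.Chroot(newRoot); err != nil {
		fmt.Println("chroot failed:", err)
		os.Exit(1)
	}
	// After chroot, "/" refers to newRoot; change into it explicitly.
	if err := syscall.Chdir("/"); err != nil {
		fmt.Println("chdir failed:", err)
		os.Exit(1)
	}
	// Any path lookup from here on is resolved inside newRoot only.
	entries, _ := os.ReadDir("/")
	for _, e := range entries {
		fmt.Println(e.Name())
	}
}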

CGroups

Through Linux namespaces we isolate the newly created process's file system, network, and process tree from the host machine, but namespaces do not give us isolation of physical resources such as CPU or memory. If several containers that know nothing about each other or about the host run on the same machine, they jointly occupy the host machine's physical resources.

If one of these containers is running a CPU-intensive task, it can affect the performance and efficiency of the tasks in the other containers, so that multiple containers interfere with each other and fight over resources. How to limit the resource usage of multiple containers becomes the main problem once process-level isolation of virtual resources is solved, and Control Groups (CGroups) can isolate physical resources on the host machine, such as CPU, memory, disk I/O, and network bandwidth.

Each CGroup is a group of processes constrained by the same criteria and parameters, and different CGroups are related hierarchically, meaning that they can inherit some of the criteria and parameters used to limit resource usage from their parent group.

Linux CGroups allow us to allocate resources to a group of processes, namely the CPU, memory, network bandwidth, and other resources mentioned above. Through this allocation, CGroups provide capabilities such as limiting resource usage, prioritization, accounting of resource consumption, and process control (for example, suspending and resuming a whole group).

In CGroups, every task is a system process, and a CGroup is a group of processes divided according to some criterion. In the CGroup mechanism, all resource control is implemented with the CGroup as the unit, and every process can join a CGroup at any time and can also leave one at any time.

– CGroup Introduction, Application Examples and Principle Description

Linux uses the file system to implement CGroups; we can use the following command to see which subsystems exist in the current CGroup hierarchy:

$ lssubsys -m
cpuset /sys/fs/cgroup/cpuset
cpu /sys/fs/cgroup/cpu
cpuacct /sys/fs/cgroup/cpuacct
memory /sys/fs/cgroup/memory
devices /sys/fs/cgroup/devices
freezer /sys/fs/cgroup/freezer
blkio /sys/fs/cgroup/blkio
perf_event /sys/fs/cgroup/perf_event
hugetlb /sys/fs/cgroup/hugetlb

Most Linux distributions have very similar subsystems. cpuset, cpu, and the others listed above are called subsystems because they can allocate resources to their control groups and limit resource usage.

If we want to create a new cgroup, we only need to create a new folder under the subsystem in which we want to allocate or limit resources, and a lot of files will automatically appear in that folder. If Docker is installed on the Linux machine, you will find that every subsystem's directory contains a folder named docker:

$ ls cpu
cgroup.clone_children  ...  cpu.stat  docker  notify_on_release  release_agent  tasks
$ ls cpu/docker/
9c3057f1291b53fd54a3d12023d2644efe6a7db6ddf330436ae73ac92d401cf1
cgroup.clone_children  ...  cpu.stat  notify_on_release  release_agent  tasks

The folder 9c3057xxx corresponds to a Docker container we are running: when this container is launched, Docker creates a CGroup named after the container identifier underneath the docker control group, so the CGroups on the current host form a hierarchy such as /sys/fs/cgroup/cpu/docker/9c3057xxx.

Each of these CGroups has a tasks file that stores the PIDs of all processes belonging to the control group. In the subsystem responsible for the CPU, the contents of the cpu.cfs_quota_us file can limit CPU usage: if this file contains 50000 (with cpu.cfs_period_us at its default of 100000), the processes in the control group cannot use more than 50% of a CPU in total.
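Doing this by hand only takes a few file operations. Here is a rough Go sketch (illustrative only, assuming a cgroup v1 host; the group name demo is made up, and it must run as root) that creates a control group, caps it at 50% of one CPU, and adds the current process to it:

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Hand-rolled cgroup v1 sketch (illustrative only, not Docker's code):
// create a control group under the cpu subsystem, cap it at 50% of one CPU,
// and move the current process into it.
func main() {
	cg := "/sys/fs/cgroup/cpu/demo" // "demo" is a made-up group name
	if err := os.MkdirAll(cg, 0755); err != nil {
		panic(err)
	}
	// 50000us of CPU time per default 100000us period = at most 50% of one CPU.
	if err := os.WriteFile(filepath.Join(cg, "cpu.cfs_quota_us"), []byte("50000"), 0644); err != nil {
		panic(err)
	}
	// Writing our PID into tasks moves this process into the control group.
	pid := fmt.Sprintf("%d", os.Getpid())
	if err := os.WriteFile(filepath.Join(cg, "tasks"), []byte(pid), 0644); err != nil {
		panic(err)
	}
	// Busy-loop so the 50% cap can be observed with top(1).
	for {
	}
}

Docker does not do much more than this: it creates the per-container folders and writes the limits configured for the container into the corresponding files.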

If a system administrator wants to control the resource usage of a specific Docker container, they can find the corresponding child control group under the docker parent control group and change the contents of the corresponding files; of course, we can also pass parameters directly when starting the container and let the Docker process change the contents of the files for us.

$ docker run -it -d --cpu-quota=50000 busybox
53861305258ecdd7f5d2a3240af694aec9adb91cd4c7e210b757f71153cdd274
$ cd 53861305258ecdd7f5d2a3240af694aec9adb91cd4c7e210b757f71153cdd274/
$ ls
cgroup.clone_children  cgroup.event_control  cgroup.procs  cpu.cfs_period_us  cpu.cfs_quota_us  cpu.shares  cpu.stat  notify_on_release  tasks
$ cat cpu.cfs_quota_us
50000

When we use Docker to shut down a running container, the folder corresponding to that container's child control group is removed by the Docker process as well. Docker's use of CGroups amounts to little more than creating folders and changing file contents, but CGroups genuinely solve the problem of limiting the resources of the containers below them: a system administrator can allocate resources to multiple containers rationally, and no container can preempt the resources of the others.

UnionFS

Linux namespaces and control groups solve different isolation problems: the former isolate the process tree, network, and file system, while the latter isolate CPU, memory, and other physical resources. But there is one more very important problem in Docker that needs to be solved, namely images.

What an image is and how it is composed and organized were questions that confused the author for quite some time after starting to use Docker; with docker run we can very easily download a Docker image from a remote registry and run it locally.

A Docker image is essentially a compressed archive, and we can use the following commands to export the files inside a Docker image:

$ docker export $(docker create busybox) | tar -C rootfs -xvf -
$ ls
bin  dev  etc  home  proc  root  sys  tmp  usr  var

You can see that the directory structure in the busybox image is not very different from the contents of the root directory of a Linux system; in a sense, a Docker image is just a bundle of files.

Storage Drivers

Docker uses a series of different storage drivers to manage the file systems inside images and to run containers. They are somewhat different from Docker volumes, which manage storage that can be shared across multiple containers.

To understand the storage drivers Docker uses, we first need to understand how Docker builds and stores images and how those images are used by each container. Every image in Docker consists of a series of read-only layers, and each command in a Dockerfile creates a new layer on top of the existing read-only layers:

FROM ubuntu:15.04
COPY . /app
RUN make /app
CMD python /app/app.py

Each layer contains only a very small modification of the file system beneath it, and the Dockerfile above builds an image consisting of four such layers.

When a container is started from the image with the docker run command, a writable layer is added on top of the image's layers; this is the container layer, and all changes made to the running container are in fact modifications of this read-write layer.

The difference between a container and an image is that all images are read-only, while each container is effectively an image plus a read-write layer; in other words, the same image can correspond to multiple containers.

AUFS

UnionFS is a file system service designed for the Linux operating system that "unions" multiple file systems onto the same mount point. AUFS, short for Advanced UnionFS, is essentially an upgraded version of UnionFS that provides better performance and efficiency.

As a union file system, AUFS is able to union the layers kept in different folders into a single folder; these folders are called branches in AUFS, and the whole "union" process is called a union mount.
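Outside of Docker, a union mount can be tried by hand. The following is a rough sketch (it assumes an aufs-enabled kernel and root privileges, and the directory names are made up and must already exist) that merges a writable branch and a read-only branch under a single mount point:

package main

import "syscall"

// Illustrative union mount with aufs (not Docker's code). /tmp/upper is the
// writable branch, /tmp/lower is read-only, and the contents of both become
// visible under /tmp/union; writes land in /tmp/upper only.
func main() {
	opts := "br=/tmp/upper=rw:/tmp/lower=ro"
	if err := syscall.Mount("none", "/tmp/union", "aufs", 0, opts); err != nil {
		panic(err)
	}
}

Docker's AUFS storage driver performs essentially this kind of mount, with the image layers as read-only branches and the container layer as the single writable branch.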

Every image layer or container layer is a subfolder of the /var/lib/docker/ directory; in Docker, the contents of all image layers and container layers are stored in the /var/lib/docker/aufs/diff/ directory:

$ ls /var/lib/docker/aufs/diff/
00adcccc1a55a36a610a6ebb3e07cc35577f2f5a3b671be3dbc0e74db9ca691c       93604f232a831b22aeb372d5b11af8c8779feb96590a6dc36a80140e38e764d8
00adcccc1a55a36a610a6ebb3e07cc35577f2f5a3b671be3dbc0e74db9ca691c-init  93604f232a831b22aeb372d5b11af8c8779feb96590a6dc36a80140e38e764d8-init
019a8283e2ff6fca8d0a07884c78b41662979f848190f0658813bb6a9a464a90       93b06191602b7934fafc984fbacae02911b579769d0debd89cf2a032e7f35cfa
...

The /var/lib/docker/aufs/layers/ directory stores the metadata of the image layers, one file per layer, and /var/lib/docker/aufs/mnt/ contains the mount points of the images and container layers, which Docker finally assembles through a union mount.

In this assembly process, every image layer is built on top of another image layer, all image layers are read-only, and only the topmost container layer of each container can be read and written by the user. All containers are built on top of underlying kernel facilities, including namespaces, control groups, and rootfs. This way of assembling containers provides great flexibility, and the read-only image layers can be shared to reduce disk usage.

Other storage Drivers

AUFS is just one of the storage drivers Docker can use. Besides AUFS, Docker also supports storage drivers such as devicemapper, overlay2, zfs, and vfs. Among them, overlay2 has replaced aufs as the recommended storage driver, but on machines without the overlay2 driver, aufs is still used as Docker's default driver.

Different storage drivers also store image and container files in completely different ways; interested readers can find the details in Docker's official document Select a storage driver.

To see which storage driver Docker uses on the current system, you can get the information with the following command:

$ docker info | grep Storage
Storage Driver: aufs

Because the author's Ubuntu machine has no overlay2 storage driver, aufs is used as Docker's default storage driver.
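On hosts whose kernel does support overlay2, the driver can be chosen explicitly. A minimal /etc/docker/daemon.json would look like the following (a hedged example, assuming a Docker version that reads this configuration file; the daemon must be restarted afterwards and existing images are not migrated automatically):

{
  "storage-driver": "overlay2"
}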

Summary

Docker has become a very mainstream technology and is used in the production environments of many mature companies, but Docker's core technologies have in fact existed for many years. Linux namespaces, control groups, and UnionFS are the three technologies that support the current Docker implementation, and they are also the most important reason Docker could appear at all.

While studying the principles behind Docker, the author learned a great deal about how Docker is implemented and about the Linux operating system, but because Docker's current codebase is too large, it is already very difficult to understand all the implementation details from the source code alone. If you are really interested in those details, you can start from the source code of Docker CE to understand Docker's principles.

Reference

  • Chapter 4: Docker Fundamentals, in Using Docker by Adrian Mouat
  • Techniques Behind Docker
  • Docker Overview
  • Unifying Filesystems with Union Mounts
  • Docker Foundation Technology: AUFS
  • Resource Management Guide
  • Kernel Korner: Unionfs - Bringing Filesystems Together
  • Union File Systems: Implementations, Part I
  • Improving Docker with Unikernels: Introducing HyperKit, VPNKit and DataKit
  • Separation Anxiety: A Tutorial for Isolating Your System with Linux Namespaces
  • Understanding Chroot
  • Linux Init Process / PC Boot Procedure
  • Docker Networking in Detail and Pipework Source Code Interpretation and Practice
  • Understand Container Communication
  • Docker Bridge Network Driver Architecture
  • Linux Firewall Tutorial: iptables Tables, Chains, Rules Fundamentals
  • Traversing of Tables and Chains
  • Docker Network Execution Flow Analysis (libnetwork Source Code Interpretation)
  • Libnetwork Design
  • Profiling the Docker Filesystem: AUFS and Devicemapper
  • Linux: Understanding the Mount Namespace & Clone CLONE_NEWNS Flag
  • Kernel Knowledge Behind Docker: Namespace Resource Isolation
  • Infrastructure for Container Projects
  • Libcontainer Spec
  • Docker Basic Technology: Linux Namespace (Part 1)
  • Docker Basic Technology: Linux CGroup
  • "Write Your Own Docker" Book Excerpt 3: Linux UnionFS
  • Introduction to Docker
  • Understand Images, Containers, and Storage Drivers
  • Use the AUFS Storage Driver

About pictures and reprint


This work is licensed under the Creative Commons Attribution 4.0 International license. When reproducing it, please indicate the original link; when using the pictures, please keep all of their content, scale them only appropriately, and attach a link to this article where they are referenced. The pictures were drawn with Sketch.

About comments and messages

If you have questions about the content of this article on Docker core technology and implementation principles, please leave a comment in the comment system below.

Original link: Docker core technology and implementation principle · Faith-Oriented programming

Follow: draveness on GitHub
