Many developers assume by default that a Docker container is a sandbox for applications: that they can run any application in Docker with root privileges, and that Docker's security mechanisms will protect the host system. For example, some people believe that a process inside a Docker container is as safe as a process inside a virtual machine. Some download Docker images from arbitrary sources and, without verifying them, run them on the host to experiment, learn, and do research. Some companies offering PaaS services even allow users to submit their own customized Docker images to a multi-tenant system. Please note that all of these practices are unsafe.
This article describes the isolation and security of Docker, and why both are weaker than those of traditional virtual machines.
What is security?
For Docker itself, security comes down to two points:
A container must not affect the host
A container must not affect other containers
So more than 90% of security problems can be attributed to isolation problems. Docker's security problems are essentially those of container technology in general: the kernel is shared, and the isolation provided by namespaces is incomplete:
/proc, /sys, and similar file systems are not fully isolated
The information shown by commands such as top, free, and iostat is not isolated
The root user is not isolated
/dev devices are not isolated
Kernel modules are not isolated
SELinux, time, syslog, and everything else not covered by an existing namespace are not isolated
Of course, an image that is itself insecure can also cause security problems.
Is it really less secure than a virtual machine?
In fact, traditional virtual machines are not 100% secure either: breaking the hypervisor is enough to bring down every virtual machine on it. The question is who can break it that easily. As mentioned above, Docker's isolation is based mainly on namespaces. Traditionally, every PID in Linux is unique, and under normal circumstances users never see a duplicate PID. With namespaces, as used by Docker, the same PID can exist independently in different namespaces. For example, PID 1 in container A is one program, and PID 1 in container B can be another program. Although namespaces let Docker carve out seemingly self-contained spaces, the Linux kernel itself cannot be namespaced, so no matter how many containers there are, every system call is actually handled by the host's kernel. This leaves Docker with an undeniable security problem.
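The duplicate-PID point can be made concrete with a minimal Python sketch. It assumes a Linux kernel of version 4.1 or later, where /proc/<pid>/status carries an NSpid line listing the process's PID in each nested PID namespace, outermost first. The helper name pids_per_namespace and the two sample status excerpts are made up for illustration.

```python
def pids_per_namespace(status_text):
    """Return the PIDs from the NSpid line, outermost namespace first."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    raise ValueError("no NSpid line found (kernel older than 4.1?)")


# Hypothetical excerpts of /proc/<pid>/status as seen from the host.
CONTAINER_A_INIT = "Name:\tnginx\nPid:\t4321\nNSpid:\t4321\t1\n"
CONTAINER_B_INIT = "Name:\tredis\nPid:\t5876\nNSpid:\t5876\t1\n"

a = pids_per_namespace(CONTAINER_A_INIT)
b = pids_per_namespace(CONTAINER_B_INIT)
print(a, b)
```

In this sample both processes are PID 1 inside their own containers, yet the host kernel tracks them under two different PIDs, which is a reminder that a single kernel serves both.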
Traditional virtual machines also hand many operations to a kernel, but that is the virtual machine's own kernel, not the host kernel. So when something goes wrong, at most the virtual machine itself is affected. Of course, one can argue that an attacker could first compromise the virtual machine's kernel, then find a hypervisor vulnerability without being detected, then break SELinux, and finally attack the host kernel. That is complicated enough to put into words, let alone to carry out in practice. So Docker is very useful, but when migrating business systems to it, be sure to pay attention to security!
How to solve it?
Having accepted that "the container is not completely closed", the open source community, and Red Hat in particular, has worked with Docker to improve Docker's security, chiefly by protecting the host from what runs inside a container and by preventing containers from damaging one another. The open source community's efforts to address Docker security issues include the following proposals, each listed with the reason it was not merged into the kernel:
Audit namespace
Function: isolate the audit subsystem.
Reason not merged: little benefit, and it would make audit more complex and harder to maintain.
Syslog namespace
Function: isolate the system log.
Reason not merged: it is hard to tell exactly which log entries belong to a particular container.
Device namespace
Function: isolate devices (allowing a device to be used in multiple containers at the same time).
Reason not merged: almost every driver would have to be modified, so the change is too large.
Time namespace
Function: give each container its own system time.
Reason not merged: some design details were not agreed on, and there seemed to be few use cases.
Task count cgroup
Function: limit the number of processes in a cgroup, which addresses the fork bomb problem.
Reason not merged: not really necessary, adds complexity, and kmemlimit can achieve a similar effect. (It may be merged in the near future.)
/proc/meminfo isolation
Function: let a container see its own meminfo information.
Reason not merged: cgroupfs already exports all of the information, and the presentation done by /proc can be implemented in user space, for example with FUSE.
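The last item says that the cgroup file system already exports what a container needs to know about its own memory, so a user-space tool can do the presentation. Here is a rough Python sketch of that idea. It assumes the cgroup v2 layout, in which a memory.max file holds either a byte count or the word max; the function name container_mem_total and the sample values are illustrative only.

```python
def container_mem_total(meminfo_text, memory_max_text):
    """Return the memory, in bytes, that the container can actually use.

    meminfo_text    -- contents of /proc/meminfo (reports host totals)
    memory_max_text -- contents of the cgroup v2 memory.max file (assumed layout)
    """
    host_total = None
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # /proc/meminfo reports this value in units of 1024 bytes.
            host_total = int(line.split()[1]) * 1024
            break
    if host_total is None:
        raise ValueError("MemTotal not found in meminfo text")

    limit = memory_max_text.strip()
    if limit == "max":  # no cgroup limit configured
        return host_total
    return min(host_total, int(limit))


# Illustrative inputs: a host with about 16 GB and a container capped at 512 MiB.
SAMPLE_MEMINFO = "MemTotal:       16384000 kB\nMemFree:         1234567 kB\n"
print(container_mem_total(SAMPLE_MEMINFO, "536870912\n"))
print(container_mem_total(SAMPLE_MEMINFO, "max\n"))
```

A tool such as free reads only /proc/meminfo, so inside a container it would report the host total rather than the smaller cgroup limit.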
However, since cgroups and namespaces took their basic shape in 2008, no new namespace has been added to the kernel as of this writing. Cgroups has gained only a few simple subsystems, and most of the work has gone into refining the existing ones. The kernel community's principle for the isolation that container technology requires is that good enough is enough: the kernel must not become too complex.
Some enterprises have nevertheless done a lot of work. For example, some project teams adopt layered security mechanisms. These optional mechanisms are as follows:
1. File system level protection
Read-only file systems: Some Linux kernel file systems must be mounted into the container environment, otherwise the processes in the container will stop working. This is very convenient for malicious processes, but most applications running in containers do not actually need to write to these file systems. In that case, the developer can mount them in read-only mode, for example: /sys, /proc/sys, /proc/sysrq-trigger, /proc/irq, /proc/bus
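One way to check whether these paths really are read-only inside a container is to inspect the mount table. The following Python sketch parses text in the /proc/mounts format (device, mount point, file system type, options, and two trailing numbers). The function name readonly_status and the sample mount table are made up for illustration, so the real table in any given container may look different.

```python
def readonly_status(mounts_text, paths):
    """Map each path to True (mounted ro), False (mounted rw) or None (not a mount point)."""
    status = {path: None for path in paths}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3].split(",")
        if mount_point in status:
            # A later entry for the same mount point shadows an earlier one.
            status[mount_point] = "ro" in options
    return status


# Hypothetical mount table, in the format of /proc/mounts.
SAMPLE_MOUNTS = """\
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
sysfs /sys sysfs ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/sys proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/sysrq-trigger proc ro,nosuid,nodev,noexec,relatime 0 0
"""

result = readonly_status(SAMPLE_MOUNTS, ["/sys", "/proc", "/proc/sys", "/proc/irq"])
print(result)
```

On a live system the first argument would be the contents of /proc/mounts read from inside the container.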
Copy-on-write: Docker uses this kind of file system. All running containers first share one base file system image, and whenever a container needs to write data, the write is directed to a separate file system layer that belongs to that container alone. This mechanism prevents one container from seeing another container's data, and it prevents a container from affecting other containers by modifying file system contents.
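The copy-on-write idea can be modeled in a few lines of Python. This is a toy model only, not how Docker's storage drivers are implemented: a dictionary stands in for the shared base image, and collections.ChainMap gives each container a private writable layer on top of it. The function name new_container and the file paths are invented for the example.

```python
from collections import ChainMap


def new_container(base_image):
    """Return a container view: a private writable layer over the shared image.

    ChainMap looks keys up layer by layer but sends every write to the
    first mapping, which here is the container's own empty layer.
    """
    return ChainMap({}, base_image)


# The shared base image, modeled as path -> file contents.
base = {"/etc/hostname": "base", "/bin/sh": "<binary>"}

container_a = new_container(base)
container_b = new_container(base)

# Container A overwrites one file and creates another.
container_a["/etc/hostname"] = "a"
container_a["/data/secret"] = "only in A"

print(container_a["/etc/hostname"])   # A sees its own copy
print(container_b["/etc/hostname"])   # B still sees the base image
print("/data/secret" in container_b)  # B cannot see A's new file
```

Deleting a file that exists only in the base layer is not handled by this toy model, since a real layered file system needs extra bookkeeping for that case.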
2. Capability mechanism
The Linux documentation explains the capability mechanism fairly clearly. For the purpose of permission checks, traditional UNIX implementations distinguish two categories of processes: privileged processes (user ID 0, the superuser or root) and unprivileged processes (nonzero UID). Privileged processes bypass all permission checks, while unprivileged processes are subject to full permission checking based on their UID, GID, and group list. Starting with kernel 2.2, Linux divides the privileges traditionally associated with the superuser into distinct units, called capabilities, each of which can be enabled or disabled independently. Dropping capabilities without care can cause applications to crash, so a container platform like Docker has to be both secure and usable. Developers need to weigh capability settings in terms of functionality, usability, and security. The list of capabilities that Docker enables by default has long been a point of contention in the developer community. Ordinary developers can change the defaults on the command line.
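To see which capabilities a process actually holds, one can decode the hexadecimal mask on the CapEff line of /proc/<pid>/status. Below is a small Python sketch of that decoding. The table covers only capability numbers 0 to 31 as defined in the kernel headers, and the function name decode_caps is made up. The sample mask is an illustrative value of the kind often reported inside a default container, and the real value depends on the Docker version and on options such as --cap-add and --cap-drop.

```python
# Capability names indexed by bit number (0 to 31 only; higher bits exist).
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID",
    "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
]


def decode_caps(hex_mask):
    """Turn a CapEff-style hex string into a list of capability names."""
    mask = int(hex_mask, 16)
    names = []
    bit = 0
    while mask >> bit:
        if (mask >> bit) & 1:
            # Bits beyond the table are reported by number instead of name.
            names.append(CAP_NAMES[bit] if bit < len(CAP_NAMES) else f"bit_{bit}")
        bit += 1
    return names


# Illustrative mask; not guaranteed to match any particular Docker release.
sample = decode_caps("00000000a80425fb")
print(len(sample), sample)
```

With this sample mask, CAP_SYS_ADMIN and CAP_SYS_MODULE are absent, which shows how a root process inside a container can still hold far fewer privileges than root on the host.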
3. Namespace mechanism
Some of the namespaces that Docker uses also contribute to security. The PID namespace, for example, hides every process that is not running in the current container. If a malicious program cannot see those processes, attacking them becomes harder. In addition, if the process with PID 1 in a namespace is terminated, all processes in that container are terminated automatically, which means an administrator can shut a container down easily. There is also the network namespace, which lets administrators build a container's network environment with routing rules and iptables, so that processes in the container can use only the networks the administrator permits. Examples are a container that can reach only the public network, a container that can reach only the local network, and a container that sits between two others to filter content.
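Whether two processes share a namespace can be checked by comparing the symbolic links under /proc/<pid>/ns, whose targets look like pid:[4026531836]. The Python sketch below parses and compares such link targets. The function names and the sample identifiers are invented for illustration, and reading the links of another user's process usually requires elevated privileges.

```python
import os
import re

_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(link):
    """Split a link target such as 'pid:[4026531836]' into (type, identifier)."""
    match = _NS_LINK.match(link)
    if match is None:
        raise ValueError(f"unexpected namespace link: {link!r}")
    return match.group(1), int(match.group(2))


def same_namespace(link_a, link_b):
    """True when both links name the same namespace of the same type."""
    return parse_ns_link(link_a) == parse_ns_link(link_b)


def read_ns_link(pid, ns_type):
    """Read the link for a live process (Linux only)."""
    return os.readlink(f"/proc/{pid}/ns/{ns_type}")


# Illustrative link targets for a host process and a containerized process.
HOST_PID_NS = "pid:[4026531836]"
CONTAINER_PID_NS = "pid:[4026532201]"
print(same_namespace(HOST_PID_NS, HOST_PID_NS))
print(same_namespace(HOST_PID_NS, CONTAINER_PID_NS))
```

The same comparison works for the net entry, which tells whether two processes share a network namespace.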
4. Cgroups mechanism
Cgroups are aimed primarily at denial-of-service attacks. A malicious process can attack a system by consuming all of its resources. The cgroups mechanism can prevent this. For example, the CPU cgroup can cap a Docker container's CPU usage when a malicious process in it tries to monopolize the CPU. Administrators need to set up additional cgroups to control processes that open too many files or spawn too many child processes.
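As a rough illustration of how such limits can be inspected, the Python sketch below interprets two kinds of cgroup control files. It assumes the cgroup v2 file formats: cpu.max holds a quota and a period in microseconds (or the word max for no quota), while pids.max and pids.current hold process counts. The function names and sample values are made up.

```python
def parse_cpu_max(text):
    """Return the CPU limit in cores from a cpu.max value, or None if unlimited.

    The assumed format is '<quota> <period>', for example '50000 100000',
    or 'max <period>' when no quota is set.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def pids_headroom(pids_current_text, pids_max_text):
    """Return how many more processes the cgroup may create, or None if unlimited."""
    if pids_max_text.strip() == "max":
        return None
    return int(pids_max_text) - int(pids_current_text)


# Illustrative file contents.
print(parse_cpu_max("50000 100000\n"))  # half a core
print(parse_cpu_max("max 100000\n"))    # unlimited
print(pids_headroom("12\n", "100\n"))   # room for 88 more processes
```

A monitoring script could raise an alert when the headroom approaches zero, which would be one sign of a fork bomb in progress.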
5. SELinux
SELinux is a labeling system: every process, file, directory, and system object carries a label. SELinux enforces security through policy rules that govern access between labeled processes and labeled objects. It implements a model called MAC (mandatory access control), in which the owner of an object does not control access to that object.
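The label-and-rule idea can be sketched as a toy model in Python. This is not how SELinux is implemented or configured. It only shows that the decision depends on the labels and a central policy rather than on who owns the object. An SELinux label has the form user:role:type:level. The rule table, the function names, and the choice of example type names are assumptions made for illustration.

```python
from typing import NamedTuple


class Context(NamedTuple):
    user: str
    role: str
    type: str
    level: str


def parse_context(label):
    """Split a label of the form user:role:type:level into its four fields."""
    user, role, type_, level = label.split(":", 3)
    return Context(user, role, type_, level)


# Toy policy: (process type, object type, action) triples that are allowed.
TOY_POLICY = {
    ("container_t", "container_file_t", "read"),
    ("container_t", "container_file_t", "write"),
}


def is_allowed(process_label, object_label, action, policy=TOY_POLICY):
    """Decide access from the labels and the policy alone, ignoring ownership."""
    process = parse_context(process_label)
    target = parse_context(object_label)
    return (process.type, target.type, action) in policy


PROCESS = "system_u:system_r:container_t:s0:c57,c902"
print(is_allowed(PROCESS, "system_u:object_r:container_file_t:s0:c57,c902", "read"))
print(is_allowed(PROCESS, "system_u:object_r:etc_t:s0", "read"))
```

Anything not listed in the policy is denied, so a container process is refused access to a host file with a different label even if ordinary file permissions would allow it.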
The simplest approach is not to treat Docker containers as something that can completely replace virtual machines. For a long time to come, the applications that run in Docker containers will be chosen selectively, usually only test systems or trusted services.
An approach with a somewhat higher threshold is to strip the system down and achieve security through a variety of restrictions. This is also the most mainstream and effective way to harden security, as with the mechanisms described in the previous section. At the same time, the security and stability of the kernel must be ensured. External tools such as monitoring and fault-tolerance systems are also essential.
In short, with adaptation and hardening, a Docker container solution can fully meet commercial security standards. It does, however, place higher technical demands on the people who implement it.