You have probably read in many places that "Docker builds containers on namespaces, cgroups, chroot, and other technologies," but have you ever wondered why building a container requires all of these? Why not a single system call? The reason is that the Linux kernel has no concept of a "Linux container"; a container is a user-space concept.
Docker software engineer Michael Crosby is writing a series of blog posts that dig into Docker's internals and explore what happens behind docker run. This first post in the series looks at how Docker uses namespaces.
Namespaces
In this first installment, I'll discuss how Linux namespaces are created, as Docker does when it uses them. In later posts we will discuss how namespaces combine with other features, such as cgroups and an isolated filesystem, to achieve something more useful.
Fundamentally, namespaces are a basic concept of the Linux system, and the kernel implements several different types. To follow along and dig into each of them, run docker run -it --privileged --net host crosbymichael/make-containers. The image comes with some files and configuration preloaded. We will be creating namespaces inside a container that is itself run by Docker, but don't let that confuse you; I chose a container simply as a way to ship all the dependencies preloaded. I use the --net host flag so that the host's network interfaces are visible inside the container. You also need the --privileged flag so that you have the permissions required to create new namespaces from within the container.
Here's what's in the Dockerfile:
FROM debian:jessie
RUN apt update && apt install -y \
    gcc \
    vim \
    emacs
COPY containers/ /containers/
WORKDIR /containers
CMD ["bash"]
I will use C for these examples, because it makes the low-level details easier to explain than Go.
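Before creating any namespaces, it can help to see which ones the current process already belongs to. The kernel exposes each namespace of a process as a symbolic link under /proc/self/ns, and the link target identifies the namespace. The following is a minimal sketch, not part of the original skeleton; the function name namespace_id is made up for illustration, and it assumes /proc is mounted.

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Copy the identity of one of the calling process's namespaces into buf,
 * for example "net:[4026531956]". name is an entry under /proc/self/ns
 * such as "net", "mnt", "uts", "ipc", "pid" or "user".
 * Returns 0 on success, -1 if the link cannot be read.
 * (namespace_id is a hypothetical helper name.)
 */
int namespace_id(const char *name, char *buf, size_t size)
{
        char path[64];
        ssize_t n;

        if (size == 0)
                return -1;
        snprintf(path, sizeof(path), "/proc/self/ns/%s", name);
        n = readlink(path, buf, size - 1);
        if (n < 0)
                return -1;
        buf[n] = '\0';  /* readlink does not terminate the string */
        return 0;
}
```

Two processes in the same namespace report the same identifier, while a child created with one of the clone flags discussed below should report a different one for that namespace type.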
NET Namespace
The network namespace gives a process its own view of the system's network stack, including its own localhost. Make sure you are in the crosbymichael/make-containers container and run ip a to see all the network interfaces on your host.
> ip a
root@development:/containers# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:19:ca:f2 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fe19:caf2/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:20:84:47 brd ff:ff:ff:ff:ff:ff
    inet 192.168.56.103/24 brd 192.168.56.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fe20:8447/64 scope link
       valid_lft forever preferred_lft forever
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 56:84:7a:fe:97:99 brd ff:ff:ff:ff:ff:ff
    inet 172.17.42.1/16 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::5484:7aff:fefe:9799/64 scope link
       valid_lft forever preferred_lft forever
Those are all the network interfaces currently on my host system. Now let's write some code to create a new network namespace. To do this we will write a small C skeleton around the clone system call. The file skeleton.c should be in the working directory of the demo container, and we will use it as the basis for every example. Here is the code:
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>     // strerror
#include <sched.h>      // clone
#include <signal.h>     // SIGCHLD
#include <sys/wait.h>   // waitpid
#include <errno.h>
#include <unistd.h>     // execvp

#define STACKSIZE (1024*1024)
static char child_stack[STACKSIZE];

struct clone_args {
        char **argv;
};

// child_exec is the func that will be executed as the result of clone
static int child_exec(void *stuff)
{
        struct clone_args *args = (struct clone_args *)stuff;
        if (execvp(args->argv[0], args->argv) != 0) {
                fprintf(stderr, "failed to execvp arguments: %s\n",
                        strerror(errno));
                exit(-1);
        }
        // we should never reach here!
        exit(EXIT_FAILURE);
}

int main(int argc, char **argv)
{
        struct clone_args args;
        args.argv = &argv[1];

        int clone_flags = SIGCHLD;

        // the result of this call is that our child_exec will be run in another
        // process, returning its pid; the stack grows down, so we pass its top
        pid_t pid =
            clone(child_exec, child_stack + STACKSIZE, clone_flags, &args);
        if (pid < 0) {
                fprintf(stderr, "clone failed WTF!!!! %s\n", strerror(errno));
                exit(EXIT_FAILURE);
        }
        // wait on our child process here before we, the parent, exit
        if (waitpid(pid, NULL, 0) == -1) {
                fprintf(stderr, "failed to wait pid %d\n", pid);
                exit(EXIT_FAILURE);
        }
        exit(EXIT_SUCCESS);
}
This is a small C program that lets you run something like ./a.out ip a. It takes the arguments you pass on the command line and uses them as the program, with its arguments, to execute in the new process. Don't worry too much about the implementation, because the interesting changes are still to come. It will execute whatever program you want with whatever arguments you want. This means that for any of the demos below you can spawn a shell session instead and "hang out" in your new namespace, exploring and inspecting it in your own way. So let's copy this file and start with the network namespace.
> cp skeleton.c network.c
In this file there is one very special variable called clone_flags, and it is where most of our changes will happen. Namespaces are mainly controlled through clone flags. The clone flag for the network namespace is CLONE_NEWNET. We need to change the line int clone_flags = SIGCHLD; to int clone_flags = CLONE_NEWNET | SIGCHLD; so that the call to clone creates a new network namespace for our process. Save this change in network.c, then compile and run.
> gcc -o net network.c
> ./net ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
The result looks very different from the first run of ip a. This time we see only a loopback interface, and it is down. That is because the process we created sees the network through its own network namespace, not the host's. That is all it takes to create a new network namespace.
Docker uses a new network namespace to set up a veth interface so that your container has its own IP address on a bridge, usually docker0. We won't go into how to set up interfaces inside a namespace here; that will be covered in another post.
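The same check can be made from code instead of through ip a. The sketch below is an illustration under a few assumptions: it uses fork plus unshare(CLONE_NEWNET) rather than the clone call from skeleton.c, the function name count_interfaces_in_new_netns is invented for this example, and it needs the same privileges as the demos above.

```c
#define _GNU_SOURCE
#include <net/if.h>
#include <sched.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child, move it into a new network namespace with unshare(), and
 * count the interfaces it can see there.
 * Returns the count, or -1 if the namespace could not be created
 * (for example when the caller lacks CAP_SYS_ADMIN).
 * (count_interfaces_in_new_netns is a hypothetical helper name.)
 */
int count_interfaces_in_new_netns(void)
{
        int status;
        pid_t pid = fork();

        if (pid < 0)
                return -1;
        if (pid == 0) {
                struct if_nameindex *list, *i;
                int count = 0;

                if (unshare(CLONE_NEWNET) != 0)
                        _exit(255);
                list = if_nameindex();
                if (list == NULL)
                        _exit(255);
                /* the list ends with an entry whose if_index is 0 */
                for (i = list; i->if_index != 0; i++)
                        count++;
                if_freenameindex(list);
                /* the exit status is one byte; 255 is reserved for errors */
                _exit(count < 255 ? count : 254);
        }
        if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
                return -1;
        return WEXITSTATUS(status) == 255 ? -1 : WEXITSTATUS(status);
}
```

In a fresh namespace the count is small because only the loopback device is guaranteed to exist; some kernels also create tunnel devices there, so the sketch does not assume exactly one.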
MNT Namespace
The mount namespace gives a process its own scoped view of the mount points on the system. People often confuse it with jailing a process in a chroot, or think the two are similar, and some say that containers use the mount namespace to jail a process in its root filesystem. That is wrong!
Let's make a copy of skeleton.c for our mount-related changes. We can quickly build and run it with the mount command to see what our current mount points look like.
> cp skeleton.c mount.c
> gcc -o mount mount.c
> ./mount mount
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev type tmpfs (rw,nosuid,mode=755)
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
/dev/disk/by-uuid/d3aa2880-c290-4586-9da6-2f526e381f41 on /etc/resolv.conf type ext4 (rw,relatime,errors=remount-ro,data=ordered)
/dev/disk/by-uuid/d3aa2880-c290-4586-9da6-2f526e381f41 on /etc/hostname type ext4 (rw,relatime,errors=remount-ro,data=ordered)
/dev/disk/by-uuid/d3aa2880-c290-4586-9da6-2f526e381f41 on /etc/hosts type ext4 (rw,relatime,errors=remount-ro,data=ordered)
devpts on /dev/console type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
Those are the mount points I see in my demo container. To create a new mount namespace we use the CLONE_NEWNS flag. You may notice that the name of this flag is a bit odd: why not CLONE_NEWMOUNT or CLONE_NEWMNT? That is because the mount namespace was the first namespace in Linux, so the flag name is a little unorthodox; when we implement a new feature we often fail to anticipate what it will finally grow into. Anyway, add CLONE_NEWNS to the clone_flags variable so that it reads int clone_flags = CLONE_NEWNS | SIGCHLD; then compile mount.c again and run the same command.
> gcc -o mount mount.c
> ./mount mount
Nothing changed this time. Why? Because a new mount namespace starts out as a copy of its parent's mount points, and the process running in it still sees the underlying system's /proc. The result is that the new process inherits the view of the underlying mounts. There are ways to prevent this, such as pivot_root, which will be covered in detail in later posts.
However, we can try mounting something in the new mount namespace. For example, let's mount a new tmpfs on /mytmp. We will make the mount call from the C code before executing the command passed as arguments. To do this, we add code before the call to execvp in the child_exec function, as follows:
// child_exec is the func that will be executed as the result of clone
// (mount() is declared in <sys/mount.h>, so include it at the top of mount.c)
static int child_exec(void *stuff)
{
        struct clone_args *args = (struct clone_args *)stuff;
        // mount a fresh tmpfs on /mytmp inside the new mount namespace
        if (mount("none", "/mytmp", "tmpfs", 0, "") != 0) {
                fprintf(stderr, "failed to mount tmpfs %s\n", strerror(errno));
                exit(-1);
        }
        if (execvp(args->argv[0], args->argv) != 0) {
                fprintf(stderr, "failed to execvp arguments: %s\n",
                        strerror(errno));
                exit(-1);
        }
        // we should never reach here!
        exit(EXIT_FAILURE);
}
Before compiling and running, we need to create the /mytmp directory; then build and run with the changes above.
> mkdir /mytmp
> gcc -o mount mount.c
> ./mount mount
# cutting out the common output ...
none on /mytmp type tmpfs (rw,relatime)
As you can see, a new tmpfs mount point shows up. Now run mount directly in your current shell and compare.
Notice that the tmpfs mount point doesn't show up? That is because the mount point was created inside our own mount namespace, not in the parent namespace.
As I said before, a mount namespace is not the same thing as a filesystem jail. Run our ./mount binary with the ls command and you have your proof: the rest of the filesystem is still visible.
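The comparison between the child's and the parent's mount tables can also be made in code. This is a hedged sketch rather than the article's mount.c: it uses fork and unshare(CLONE_NEWNS) instead of clone, and the names is_mounted and tmpfs_is_private_to_child are invented for illustration. It also marks every mount as private first, an extra step that matters on hosts where / uses shared propagation (the systemd default), because there a mount made in the new namespace would otherwise propagate back to the parent.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/wait.h>
#include <unistd.h>

/* Return 1 if path is listed as a mount point in /proc/self/mounts, else 0. */
int is_mounted(const char *path)
{
        char line[1024], target[512];
        int found = 0;
        FILE *f = fopen("/proc/self/mounts", "r");

        if (f == NULL)
                return 0;
        while (fgets(line, sizeof(line), f) != NULL) {
                /* each line reads "<source> <target> <type> <options> ..." */
                if (sscanf(line, "%*s %511s", target) == 1 &&
                    strcmp(target, path) == 0) {
                        found = 1;
                        break;
                }
        }
        fclose(f);
        return found;
}

/*
 * Mount a tmpfs on dir from a child living in a new mount namespace.
 * Returns 0 if the child saw the mount and the parent does not,
 * 1 if the mount leaked into the parent, -1 if the check could not run
 * (for example without CAP_SYS_ADMIN).
 */
int tmpfs_is_private_to_child(const char *dir)
{
        int status;
        pid_t pid = fork();

        if (pid < 0)
                return -1;
        if (pid == 0) {
                if (unshare(CLONE_NEWNS) != 0)
                        _exit(2);
                /* stop mount events from propagating back to the parent */
                if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) != 0)
                        _exit(2);
                if (mount("none", dir, "tmpfs", 0, "") != 0)
                        _exit(2);
                _exit(is_mounted(dir) ? 0 : 2);
        }
        if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status) ||
            WEXITSTATUS(status) != 0)
                return -1;
        return is_mounted(dir);
}
```

The mount disappears together with the namespace when the child exits, so nothing needs to be unmounted afterwards as long as the function returns 0.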
UTS Namespace
The UTS namespace (UNIX Timesharing System: the name, version, underlying architecture, and other information used to identify the system) covers the hostname and the domain name. It lets a container have its own hostname, independent of the host system and of the other containers on it. Let's start by copying skeleton.c and running the hostname command with it.
> cp skeleton.c uts.c
> gcc -o uts uts.c
> ./uts hostname
development
This prints the hostname of your system (development in my case). As before, let's add the clone flag for the UTS namespace to the clone_flags variable; the flag is CLONE_NEWUTS. When you compile and run, you will see the same output. That is expected: the values in a new UTS namespace are inherited from its parent. In this new namespace, however, we can change the hostname without affecting the host system or the other containers on it, since each has its own isolated UTS namespace.
Let's change the hostname in the child_exec function. This requires #include <unistd.h> for the sethostname function and #include <string.h> for the strlen call on the name we pass to it, so add either header if your copy does not already include it. The modified child_exec should read as follows:
// child_exec is the func that will be executed as the result of clone
static int child_exec(void *stuff)
{
        struct clone_args *args = (struct clone_args *)stuff;
        const char *new_hostname = "myhostname";
        // this only changes the hostname inside the new UTS namespace
        if (sethostname(new_hostname, strlen(new_hostname)) != 0) {
                fprintf(stderr, "failed to set hostname: %s\n",
                        strerror(errno));
                exit(-1);
        }
        if (execvp(args->argv[0], args->argv) != 0) {
                fprintf(stderr, "failed to execvp arguments: %s\n",
                        strerror(errno));
                exit(-1);
        }
        // we should never reach here!
        exit(EXIT_FAILURE);
}
Make sure the clone_flags variable in your main function reads int clone_flags = CLONE_NEWUTS | SIGCHLD; then compile and run with the same arguments. This time the hostname command returns the new value. To verify that the change did not affect your current shell, run hostname there and confirm that it still returns the original value.
> gcc -o uts uts.c
> ./uts hostname
myhostname
> hostname
development
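The before-and-after check on the parent's hostname can be automated as well. The following sketch is an illustration, not the article's uts.c: it uses fork and unshare(CLONE_NEWUTS) instead of clone, and parent_hostname_survives is a made-up name. The child renames the host only after unshare has succeeded, so a failed unshare can never touch the real hostname.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Change the hostname from a child in a new UTS namespace, then check
 * that the parent's hostname is untouched.
 * Returns 1 if the parent's hostname is unchanged, 0 if it changed,
 * -1 on error. The result is also 1 when the child could not create the
 * namespace, because in that case it never calls sethostname().
 * (parent_hostname_survives is a hypothetical helper name.)
 */
int parent_hostname_survives(const char *new_name)
{
        char before[256], after[256];
        pid_t pid;

        if (gethostname(before, sizeof(before)) != 0)
                return -1;
        before[sizeof(before) - 1] = '\0';

        pid = fork();
        if (pid < 0)
                return -1;
        if (pid == 0) {
                /* only rename once we really are in a new namespace */
                if (unshare(CLONE_NEWUTS) != 0)
                        _exit(1);
                _exit(sethostname(new_name, strlen(new_name)) != 0);
        }
        if (waitpid(pid, NULL, 0) < 0)
                return -1;

        if (gethostname(after, sizeof(after)) != 0)
                return -1;
        after[sizeof(after) - 1] = '\0';
        return strcmp(before, after) == 0;
}
```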
IPC Namespace
The IPC namespace isolates interprocess communication resources such as System V message queues. Let's make a copy of skeleton.c for this namespace.
> cp skeleton.c ipc.c
To test the IPC namespace, we create a message queue on the host and make sure we cannot see it from a new process spawned in a new IPC namespace. First, let's create a message queue in the current shell and run our unmodified copy of the skeleton to look at the queues.
> ipcmk -Q
Message queue id: 65536
> gcc -o ipc ipc.c
> ./ipc ipcs -q
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0xfe7f09d1 65536      root       644        0            0
Without a new IPC namespace, you can see the same message queue that we just created. Now let's add the CLONE_NEWIPC flag to our clone_flags variable to create a new IPC namespace for our process. The variable should read int clone_flags = CLONE_NEWIPC | SIGCHLD; recompile and run the same command again:
> gcc -o ipc ipc.c
> ./ipc ipcs -q
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
Done! The child process is now in a new IPC namespace, with a completely separate view of, and access to, message queues.
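The ipcmk and ipcs experiment can be reproduced with the underlying System V calls. This is a minimal sketch under stated assumptions: it uses fork and unshare(CLONE_NEWIPC) instead of the clone call in ipc.c, and the function name queue_visible_in_new_ipcns is hypothetical. The queue is created in the parent's namespace and looked up by id from the child.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Create a System V message queue, then look it up by id from a child in
 * a new IPC namespace.
 * Returns 0 if the child could not see the queue, 1 if it could,
 * -1 if the queue or the namespace could not be created.
 * (queue_visible_in_new_ipcns is a hypothetical helper name.)
 */
int queue_visible_in_new_ipcns(void)
{
        struct msqid_ds info;
        int status, result = -1;
        int id = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
        pid_t pid;

        if (id < 0)
                return -1;
        pid = fork();
        if (pid == 0) {
                if (unshare(CLONE_NEWIPC) != 0)
                        _exit(2);
                /* the id was allocated in the parent's namespace, so the
                 * lookup is expected to fail in this empty one */
                _exit(msgctl(id, IPC_STAT, &info) == 0 ? 1 : 0);
        }
        if (pid > 0 && waitpid(pid, &status, 0) == pid &&
            WIFEXITED(status) && WEXITSTATUS(status) != 2)
                result = WEXITSTATUS(status);
        /* remove the queue from the parent's namespace again */
        msgctl(id, IPC_RMID, NULL);
        return result;
}
```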
PID Namespace
This one is fun. The PID (process ID) namespace partitions the PIDs that a process can see and interact with. When we create a new PID namespace, the first process in it gets PID 1. When that process exits, the kernel kills the remaining processes in the namespace. Let's start by making a copy of skeleton.c for our changes.
> cp skeleton.c pid.c
To create a new PID namespace, we set clone_flags to include CLONE_NEWPID. The variable should read int clone_flags = CLONE_NEWPID | SIGCHLD; Let's run ps aux in the shell first, then compile and run our pid binary with the same arguments.
> ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.1  20332  3388 ?        Ss   21:50   0:00 bash
root       147  0.0  0.1  17492  2088 ?        R+   22:49   0:00 ps aux
> gcc -o pid pid.c
> ./pid ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.1  20332  3388 ?        Ss   21:50   0:00 bash
root       153  0.0  0.0   5092   728 ?        S+   22:50   0:00 ./pid ps aux
root       154  0.0  0.1  17492  2064 ?        R+   22:50   0:00 ps aux
We expected ps aux to have PID 1, or at least not to see any PIDs from the parent. Why didn't that happen? ps reads its information from /proc, and the process we spawned still has the parent's view of /proc, that is, the /proc mounted by the host system. So how do we fix this? How do we make sure our new process can only see the PIDs in its own namespace? We can start by remounting /proc.
Since we are going to deal with mounts, we can take this opportunity to use what we learned about the mount namespace and combine it with the PID namespace, so that we don't mess with the host system's /proc.
We start by including the clone flag for the mount namespace alongside the one for the PID namespace: int clone_flags = CLONE_NEWPID | CLONE_NEWNS | SIGCHLD; Then we edit the child_exec function to remount proc, using the umount and mount system calls. Because we are in a new mount namespace, this will not mess up the host system. The result should look like this:
// child_exec is the func that will be executed as the result of clone
// (umount2() and mount() are declared in <sys/mount.h>)
static int child_exec(void *stuff)
{
        struct clone_args *args = (struct clone_args *)stuff;
        // drop the /proc inherited from the parent ...
        if (umount2("/proc", 0) != 0) {
                fprintf(stderr, "failed unmount /proc %s\n", strerror(errno));
                exit(-1);
        }
        // ... and mount a fresh one that reflects the new PID namespace
        if (mount("proc", "/proc", "proc", 0, "") != 0) {
                fprintf(stderr, "failed mount /proc %s\n", strerror(errno));
                exit(-1);
        }
        if (execvp(args->argv[0], args->argv) != 0) {
                fprintf(stderr, "failed to execvp arguments: %s\n",
                        strerror(errno));
                exit(-1);
        }
        // we should never reach here!
        exit(EXIT_FAILURE);
}
Build and run again to see what happens.
> gcc -o pid pid.c
> ./pid ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0   9076   784 ?        R+   23:05   0:00 ps aux
Perfect! Our new PID namespace now works, with a little help from the mount namespace!
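The stale /proc only misleads tools that read it; the new PID namespace itself is in effect as soon as clone returns. A small sketch can show this by asking the kernel directly with getpid, which does not depend on /proc. The code follows the clone pattern of skeleton.c, but check_pid and first_pid_is_one are names invented for this illustration.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACKSIZE (1024 * 1024)
static char child_stack[STACKSIZE];

/* Runs as the first process of the new PID namespace. The return value
 * becomes the child's exit status: 0 if it sees itself as PID 1. */
static int check_pid(void *unused)
{
        (void)unused;
        return getpid() == 1 ? 0 : 1;
}

/*
 * Returns 1 if a child cloned with CLONE_NEWPID sees itself as PID 1,
 * 0 if it does not, -1 if the namespace could not be created
 * (typically EPERM without CAP_SYS_ADMIN).
 * (first_pid_is_one is a hypothetical helper name.)
 */
int first_pid_is_one(void)
{
        int status;
        /* the stack grows down, so pass the top of the buffer */
        pid_t pid = clone(check_pid, child_stack + STACKSIZE,
                          CLONE_NEWPID | SIGCHLD, NULL);

        if (pid < 0)
                return -1;
        if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
                return -1;
        return WEXITSTATUS(status) == 0;
}
```

Note that the value clone returns to the parent is the child's PID as seen from the parent's namespace, which is why the two sides disagree about the same process.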
USER Namespace
The user namespace is the most recently added of the namespaces covered here; it lets you create users that are independent of those in other namespaces. This is done through GID and UID mappings.
Here is an example without any mapping specified. If we add CLONE_NEWUSER to clone_flags and then run id or ls -la, we see nobody in the output, because no users are defined inside the new namespace yet.
> cp skeleton.c user.c
# add the clone flag
> gcc -o user user.c
> ./user ls -la
total 84
drwxr-xr-x 1 nobody nogroup 4096 modified 16 23:10.
drwxr-xr-x 1 nobody nogroup 4096 modified 16 22:17.
-rwxr-xr-x 1 nobody nogroup 8336 modified mount
-rw-r--r-- 1 nobody nogroup 1577 modified 22:15
-rwxr-xr-x 1 nobody nogroup 8064 modified 21:52 net
-rw-r--r-- 1 nobody nogroup 1441 modified 21:52
-rwxr-xr-x 1 nobody nogroup 8544 modified pid
-rw-r--r-- 1 nobody nogroup 1772 modified 23:02
-rw-r--r-- 1 nobody nogroup 1426 modified 21:59
-rwxr-xr-x 1 nobody nogroup 8056 modified 23:10 user
-rw-r--r-- 1 nobody nogroup 1442 modified 23:10
-rwxr-xr-x 1 nobody nogroup 8408 modified 22:40
-rw-r--r-- 1 nobody nogroup 1694 modified 22:36
This is a very simple example, but think about what it means: with a user namespace you can run as root inside the container without being root on the host system. Don't forget that you can replace ls -la with bash at any time and explore these namespaces in more depth from a shell.
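The mappings mentioned above are set by writing lines of the form "ID-inside-namespace ID-outside-namespace length" to /proc/<pid>/uid_map and /proc/<pid>/gid_map of a process in the new user namespace. The sketch below shows one way this could look; format_id_map and write_id_map are hypothetical helper names and are not part of the article's code. Keep in mind that the kernel restricts who may write these files and which IDs may be mapped, and that on newer kernels an unprivileged writer must first write "deny" to /proc/<pid>/setgroups before gid_map can be set.

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Format one mapping line in the layout used by uid_map and gid_map:
 * "<id inside ns> <id outside ns> <range length>\n".
 * Returns the line length, or -1 if buf is too small.
 */
int format_id_map(char *buf, size_t size, unsigned inside,
                  unsigned outside, unsigned count)
{
        int n = snprintf(buf, size, "%u %u %u\n", inside, outside, count);

        return (n < 0 || (size_t)n >= size) ? -1 : n;
}

/*
 * Write a single-range mapping for process pid. file is "uid_map" or
 * "gid_map". Returns 0 on success, -1 on failure.
 */
int write_id_map(pid_t pid, const char *file, unsigned inside,
                 unsigned outside, unsigned count)
{
        char path[64], line[64];
        FILE *f;
        int ok;

        if (format_id_map(line, sizeof(line), inside, outside, count) < 0)
                return -1;
        snprintf(path, sizeof(path), "/proc/%ld/%s", (long)pid, file);
        f = fopen(path, "w");
        if (f == NULL)
                return -1;
        ok = fputs(line, f) >= 0;
        /* the write reaches the kernel when the stream is flushed,
         * so a rejected mapping shows up as an fclose() error */
        if (fclose(f) != 0)
                ok = 0;
        return ok ? 0 : -1;
}
```

For example, the parent could call write_id_map(pid, "uid_map", 0, getuid(), 1) after clone returns, so that the child appears as root inside its namespace while remaining the parent's ordinary user outside it. The child would have to wait for that write before calling execvp, which the skeleton does not do yet.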
Summary
In this article we went through the mount, network, user, PID, UTS, and IPC namespaces in Linux. We did not change much code; mostly we just added a few flags. The complicated work lies in managing the interactions between multiple kernel subsystems. As I said at the start, namespaces are just one of the tools we use to create containers, and I hope the PID example gives you a sense of how multiple namespaces work together to build a container.
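Since every demo came down to choosing clone flags, the six flags can be gathered in one table. This is only an illustrative sketch: the short names are chosen here to match the entries under /proc/<pid>/ns, and ns_flag_for and ns_flags_for are hypothetical helpers, not something Docker or the skeleton provides.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <string.h>

struct ns_flag {
        const char *name;
        int flag;
};

/* the namespaces covered in this article and their clone flags */
static const struct ns_flag ns_flags[] = {
        { "mnt",  CLONE_NEWNS },
        { "net",  CLONE_NEWNET },
        { "uts",  CLONE_NEWUTS },
        { "ipc",  CLONE_NEWIPC },
        { "pid",  CLONE_NEWPID },
        { "user", CLONE_NEWUSER },
};

/* Return the clone flag for a namespace name, or 0 if the name is unknown. */
int ns_flag_for(const char *name)
{
        size_t i;

        for (i = 0; i < sizeof(ns_flags) / sizeof(ns_flags[0]); i++)
                if (strcmp(ns_flags[i].name, name) == 0)
                        return ns_flags[i].flag;
        return 0;
}

/* OR together the flags for count names; unknown names contribute nothing. */
int ns_flags_for(const char *const *names, size_t count)
{
        int flags = 0;
        size_t i;

        for (i = 0; i < count; i++)
                flags |= ns_flag_for(names[i]);
        return flags;
}
```

With such a table, the PID demo's flags would be ns_flags_for applied to "pid" and "mnt", ORed with SIGCHLD before the call to clone.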