This article is from my personal blog, www.chenbiaolong.com. Welcome to visit.
Overview
Traditionally, many Linux resources are managed globally. For example, every process in the system is identified by a PID, which means the kernel maintains a global PID table in which each process number must be unique. The same holds for file system mount point information, user IDs, and so on. Container virtualization requires that each container has its own set of these resources so that containers do not affect one another. How can these global tables be made local? The answer is namespaces. A namespace turns a traditionally global resource into one that is local to that namespace. The namespaces currently implemented by the Linux kernel are:
1. Mount namespace (CLONE_NEWNS): system mount points
2. UTS namespace (CLONE_NEWUTS): hostname and related information
3. IPC namespace (CLONE_NEWIPC): interprocess communication
4. PID namespace (CLONE_NEWPID): process numbers
5. Network namespace (CLONE_NEWNET): network-related resources
6. User namespace (CLONE_NEWUSER): user IDs
As you can see, without namespaces all of these system resources are managed globally by the kernel. To support container virtualization, the Linux kernel added the six namespaces above to make these global resources local, so that each namespace has its own separate set of system resources. This article is mainly about the principles behind Docker virtualization; for reasons of length, it concentrates on the Linux PID namespace from the kernel's point of view. The PID namespace allows processes that belong to different namespaces to have the same process number, which is critical to implementing Docker virtualization.
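As a small illustration of the list above, the following sketch maps a clone flag mask to the namespaces it would create. The helper name requested_namespaces and the table layout are assumptions made for this example; only the CLONE_NEW* flag names come from the kernel, and the fallback values are the ones found in the Linux UAPI headers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* lets <sched.h> expose the CLONE_NEW* flags */
#endif
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Fallbacks in case <sched.h> on this system does not define the flags. */
#ifndef CLONE_NEWNS
#define CLONE_NEWNS   0x00020000
#endif
#ifndef CLONE_NEWUTS
#define CLONE_NEWUTS  0x04000000
#endif
#ifndef CLONE_NEWIPC
#define CLONE_NEWIPC  0x08000000
#endif
#ifndef CLONE_NEWUSER
#define CLONE_NEWUSER 0x10000000
#endif
#ifndef CLONE_NEWPID
#define CLONE_NEWPID  0x20000000
#endif
#ifndef CLONE_NEWNET
#define CLONE_NEWNET  0x40000000
#endif

struct ns_kind {
    unsigned long flag;
    const char *name;
};

/* Same order as the list in the text. */
static const struct ns_kind ns_kinds[] = {
    { CLONE_NEWNS,   "mount" },
    { CLONE_NEWUTS,  "uts"   },
    { CLONE_NEWIPC,  "ipc"   },
    { CLONE_NEWPID,  "pid"   },
    { CLONE_NEWNET,  "net"   },
    { CLONE_NEWUSER, "user"  },
};

/*
 * Stores the names of the namespaces requested by `flags` in `out`
 * (at most `max` entries) and returns how many were stored.
 */
size_t requested_namespaces(unsigned long flags, const char **out, size_t max)
{
    size_t found = 0;

    assert(out != NULL);
    for (size_t i = 0; i < sizeof ns_kinds / sizeof ns_kinds[0]; i++) {
        if ((flags & ns_kinds[i].flag) && found < max)
            out[found++] = ns_kinds[i].name;
    }
    return found;
}
```

Calling requested_namespaces(CLONE_NEWPID | CLONE_NEWNET, names, 6) stores "pid" and "net" in names and returns 2.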
Namespace-related kernel structures
The task_struct structure contains a pointer to a struct nsproxy. The nsproxy structure collects the namespaces supported by the kernel.
struct task_struct {
    ...
    /* namespaces */
    struct nsproxy *nsproxy;
    ...
};
nsproxy is defined as follows (linux/include/linux/nsproxy.h):
struct nsproxy {
    atomic_t count;
    struct uts_namespace *uts_ns;
    struct ipc_namespace *ipc_ns;
    struct mnt_namespace *mnt_ns;
    struct pid_namespace *pid_ns_for_children;
    struct net *net_ns;
};
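The kernel source used in this article is version 3.19. Although user namespaces have been implemented since kernel 3.8, struct nsproxy above has no user namespace member: a task's user namespace is referenced from its credentials (struct cred) rather than from nsproxy.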
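To make the role of the count field more concrete, here is a simplified user-space model of how tasks might share or replace an nsproxy. All type and function names ending in _model are hypothetical and exist only in this sketch; the real kernel code also takes references on the individual namespaces and uses atomic operations, which this model leaves out.

```c
#include <assert.h>
#include <stdlib.h>

struct pid_ns_model {
    unsigned int level;
};

/* Stand-in for struct nsproxy: a reference count plus namespace pointers. */
struct nsproxy_model {
    int count;
    struct pid_ns_model *pid_ns_for_children;
};

struct task_model {
    struct nsproxy_model *nsproxy;
};

/* No CLONE_NEW* flag: the child shares the parent's nsproxy. */
void share_namespaces_model(struct task_model *child,
                            const struct task_model *parent)
{
    assert(parent->nsproxy != NULL);
    child->nsproxy = parent->nsproxy;
    child->nsproxy->count++;
}

/*
 * A new PID namespace was requested: the child gets a private copy of the
 * nsproxy in which only the PID namespace pointer is replaced.
 * Returns 0 on success and -1 if memory could not be allocated.
 */
int new_pid_namespace_model(struct task_model *child,
                            const struct task_model *parent,
                            struct pid_ns_model *new_ns)
{
    struct nsproxy_model *copy = malloc(sizeof *copy);

    if (copy == NULL)
        return -1;
    *copy = *parent->nsproxy;
    copy->count = 1;
    copy->pid_ns_for_children = new_ns;
    child->nsproxy = copy;
    return 0;
}
```

In this model an ordinary fork only increments count, while a fork that requests a new namespace leaves the parent's nsproxy untouched and gives the child its own copy.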
The system has a default nsproxy, init_nsproxy, which is assigned when the initial task is set up:
#define INIT_TASK(tsk) \
{ \
    ... \
    .nsproxy = &init_nsproxy, \
    ... \
}
init_nsproxy is defined as follows:
struct nsproxy init_nsproxy = {
    .count = ATOMIC_INIT(1),
    .uts_ns = &init_uts_ns,
#if defined(CONFIG_POSIX_MQUEUE) || defined(CONFIG_SYSVIPC)
    .ipc_ns = &init_ipc_ns,
#endif
    .mnt_ns = NULL,
    .pid_ns_for_children = &init_pid_ns,
#ifdef CONFIG_NET
    .net_ns = &init_net,
#endif
};
PID Namespace Analysis
Among these namespaces, this article focuses on pid_namespace; the other namespaces work on similar principles.
Let's look at the pid_namespace structure:
struct pid_namespace {
    struct kref kref;
    struct pidmap pidmap[PIDMAP_ENTRIES];
    /* most recently allocated PID in this namespace; the search for a new process's PID starts at last_pid + 1 */
    int last_pid;
    /* plays the role of the init process for this namespace: reclaims the resources of dead processes that belong to it */
    struct task_struct *child_reaper;
    struct kmem_cache *pid_cachep;
    /* nesting depth of this namespace; the top-level namespace is at level 0 */
    unsigned int level;
    /* parent namespace */
    struct pid_namespace *parent;
#ifdef CONFIG_PROC_FS
    struct vfsmount *proc_mnt;
#endif
#ifdef CONFIG_BSD_PROCESS_ACCT
    struct bsd_acct_struct *bacct;
#endif
};
Pay particular attention to the pidmap member: struct pidmap is a bitmap that records whether each process number is in use. From the analysis above we know that when the kernel starts, all processes are assigned to an initial namespace. For the PID namespace, this initial namespace is init_pid_ns.
init_pid_ns has a namespace level of 0 and no parent namespace. Its pidmap starts out empty, indicating that no PID has been assigned yet. last_pid is set to 0, so PID allocation starts at 1. In addition, child_reaper is set to init_task in init_pid_ns; the child_reaper is responsible for reclaiming the resources of dead processes that belong to the namespace. The initial process is special. Strictly speaking, init_task is the kernel's boot task (PID 0), and the process 1 that it creates takes over as child_reaper. Process 1 is the ancestor of all user processes, and once the kernel has started it carries out a series of environment initialization operations, such as running the Linux startup scripts. It is also responsible for reaping all zombie processes in the initial namespace. Here init_pid_ns differs slightly from child namespaces: because init_pid_ns is created at the very beginning of kernel startup, its process 1 also initializes the system environment. A new namespace created by clone appears after the kernel is already up, so process 1 of the new namespace performs no environment initialization. Process 1 of child namespaces is analyzed below.
The creation of a new namespace is implemented through clone. The clone system call ends up in do_fork, which, by way of copy_process and copy_namespaces, eventually calls create_new_namespaces. The call hierarchy is shown in the following figure:
[Figure: call hierarchy from clone to create_new_namespaces]
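From user space, creating a PID namespace through clone can be sketched roughly as follows. This is a minimal example, assuming Linux with glibc; the function names pid_seen_in_new_namespace and report_own_pid and the stack size are choices made for this sketch. Creating a PID namespace normally requires CAP_SYS_ADMIN, so the function returns -1 when the kernel refuses the request.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes clone() and CLONE_NEWPID in <sched.h> */
#endif
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILD_STACK_SIZE (1024 * 1024)

/* Runs as the first process of the new PID namespace and reports its own PID. */
static int report_own_pid(void *arg)
{
    int write_fd;
    int self = (int)getpid();

    assert(arg != NULL);
    write_fd = *(int *)arg;
    if (write(write_fd, &self, sizeof self) != (ssize_t)sizeof self)
        return 1;
    return 0;
}

/*
 * Creates a child in a new PID namespace and returns the PID that the child
 * sees for itself, or -1 if the namespace could not be created.
 */
int pid_seen_in_new_namespace(void)
{
    int fds[2];
    int seen = -1;
    char *stack;
    pid_t child;

    if (pipe(fds) != 0)
        return -1;

    stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* The stack grows downward on common architectures, so pass its top. */
    child = clone(report_own_pid, stack + CHILD_STACK_SIZE,
                  CLONE_NEWPID | SIGCHLD, &fds[1]);
    close(fds[1]);
    if (child > 0) {
        if (read(fds[0], &seen, sizeof seen) != (ssize_t)sizeof seen)
            seen = -1;
        waitpid(child, NULL, 0);
    }
    close(fds[0]);
    free(stack);
    return seen;
}
```

When the call succeeds, the child is the first process of its namespace and therefore sees itself as PID 1, while the value clone returns to the parent is the child's PID in the parent's namespace.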
Inside copy_process, we find the code that handles process 1 of a namespace:
/*
 * is_child_reaper returns true if the pid is the init process
 * of the current namespace. As this one could be checked before
 * pid_ns->child_reaper is assigned in copy_process, we check
 * with the pid number.
 */
static struct task_struct *copy_process(...)
{
    ...
    if (is_child_reaper(pid)) { /* the new process is process 1 of its namespace */
        /* process 1 is responsible for reaping the zombie processes of its namespace */
        ns_of_pid(pid)->child_reaper = p;
        /* mark the process as unkillable */
        p->signal->flags |= SIGNAL_UNKILLABLE;
    }
    ...
}
As you can see, for process 1 of a namespace, the kernel assigns the process to the namespace's child_reaper so that it is responsible for reaping the zombie processes of that namespace. The process is also flagged SIGNAL_UNKILLABLE to mark that it cannot be killed. Note that the unkillable attribute is only effective within this namespace: to its parent namespace the process is just an ordinary process and can still be killed. Unlike in init_pid_ns, process 1 of a child namespace performs no environment initialization of the kind done by the init process. Therefore, when a new Docker container is created, process 1 in the container does not execute initialization scripts such as bashrc. In fact, some of Docker's device initialization and environment variable initialization is done explicitly by the Docker program itself rather than by process 1.
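The "process 1 of its own namespace" test can be modelled in user space as below. This is only a sketch: the _model types, the depth limit, and the helper pid_nr_at_level_model are assumptions for illustration, and the check mirrors the idea of comparing the PID number at the process's own level with 1.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_DEPTH 4   /* arbitrary nesting limit for this sketch */

/* One PID number per namespace level the process is visible in. */
struct pid_model {
    unsigned int level;              /* level of the namespace the process was created in */
    int numbers[MODEL_MAX_DEPTH];    /* numbers[i] is the PID as seen from level i */
};

/* True if the process is number 1 in the namespace it was created in. */
bool is_child_reaper_model(const struct pid_model *pid)
{
    assert(pid->level < MODEL_MAX_DEPTH);
    return pid->numbers[pid->level] == 1;
}

/* PID as seen from `level`, or 0 if the process is not visible from there. */
int pid_nr_at_level_model(const struct pid_model *pid, unsigned int level)
{
    return level <= pid->level ? pid->numbers[level] : 0;
}
```

With a hypothetical container init process that is PID 4521 on the host and PID 1 in its container, is_child_reaper_model returns true, yet from level 0 the same process is just PID 4521, an ordinary process that the host can still kill.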
With PID namespaces, the same process has a different PID number in each namespace from which it is visible. The kernel uses the alloc_pid function to assign the process a PID number in each of these namespaces. The relevant code is:
struct pid *alloc_pid(struct pid_namespace *ns)
{
    ... /* some code omitted */
    for (i = ns->level; i >= 0; i--) {
        /* tmp is the namespace currently being visited; allocate a PID value from its own bitmap */
        nr = alloc_pidmap(tmp);
        if (nr < 0)
            goto out_free;
        pid->numbers[i].nr = nr;
        pid->numbers[i].ns = tmp;
        tmp = tmp->parent;
    }
    ... /* some code omitted */
}
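The loop above can be tried out with a small user-space model. This is a sketch under simplifying assumptions: the PID space is tiny, a byte-per-PID table stands in for the kernel's bitmap, and every name ending in _model is hypothetical. It allocates one number per level, walking from the process's own namespace up to the root, and releases the numbers already taken if a level is exhausted.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PID_MAX   64   /* tiny PID space for the sketch */
#define MODEL_MAX_DEPTH 4    /* arbitrary nesting limit */

struct pid_ns_model {
    unsigned char used[MODEL_PID_MAX];  /* used[n] != 0 means PID n is taken */
    int last_pid;
    unsigned int level;
    struct pid_ns_model *parent;
};

struct pid_model {
    unsigned int level;
    struct {
        int nr;
        struct pid_ns_model *ns;
    } numbers[MODEL_MAX_DEPTH];
};

/* Finds a free PID, searching from last_pid + 1 and wrapping around; PID 0 is never handed out. */
static int alloc_pidmap_model(struct pid_ns_model *ns)
{
    for (int step = 1; step <= MODEL_PID_MAX; step++) {
        int candidate = (ns->last_pid + step) % MODEL_PID_MAX;

        if (candidate == 0 || ns->used[candidate])
            continue;
        ns->used[candidate] = 1;
        ns->last_pid = candidate;
        return candidate;
    }
    return -1;
}

/* Assigns one PID per namespace level, from ns up to the root. Returns 0 or -1. */
int alloc_pid_model(struct pid_ns_model *ns, struct pid_model *pid)
{
    struct pid_ns_model *tmp = ns;

    if (ns->level >= MODEL_MAX_DEPTH)
        return -1;
    pid->level = ns->level;
    for (int i = (int)ns->level; i >= 0; i--) {
        int nr = alloc_pidmap_model(tmp);

        if (nr < 0) {
            /* Undo the allocations made at the deeper levels. */
            while (++i <= (int)ns->level)
                pid->numbers[i].ns->used[pid->numbers[i].nr] = 0;
            return -1;
        }
        pid->numbers[i].nr = nr;
        pid->numbers[i].ns = tmp;
        tmp = tmp->parent;
    }
    return 0;
}
```

If two processes already exist in the root namespace, the first process allocated in a child namespace receives PID 1 at level 1 and PID 3 at level 0 in this model.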
Summary
This article has analyzed the PID namespace and discussed how namespaces are implemented from the perspective of the kernel code. In short, namespaces make some of the Linux kernel's global resources local, which is the core foundation of Docker virtualization. With namespaces, some of Linux's traditionally global resources can be owned by a specific namespace, and the resources of different namespaces do not conflict with one another, which provides environment isolation. Note that the system resources isolated by namespaces are kernel resources such as PID numbers, not actual physical resources such as CPU and memory. Docker limits and controls physical resources through cgroups.