Linux process identification

Last Update:2018-02-13 Source: Internet

Author: User


I. Preface

About two years ago this site published a document on process identification, but it was very brief and its code came from the 2.6 kernel. With the introduction of Linux containers, PID namespaces and related concepts, process identification has changed dramatically, so this material needs to be reorganized.

This article describes process identification in four parts. The second chapter gives some basic background on the various IDs. The third chapter describes the basic concepts: PID, PID number, PID namespace and so on. The fourth chapter focuses on how the kernel abstracts these concepts into concrete data structures. Finally, the fifth chapter briefly analyzes the kernel source code related to process identification (taken from Linux 4.4.6).

II. Overview of the various IDs

A process is a program in execution. Compared with the static program, a process is a running entity that owns a variety of resources: an address space (not necessarily the full address space, but a set of memory mappings laid out within it), open files, pending signals, one or more threads of execution, data entities in the kernel (such as one or more task_struct objects), kernel stacks (also one or more), and so on. A process is identified by its process ID, the PID. The PID of the current process and the PID of its parent can be obtained through getpid and getppid.

A thread of execution within a process is called a thread; it is the entity that is actually active in the process. On the one hand, all threads in a process share some resources; on the other hand, each thread has resources of its own, such as its own PC value, user stack, kernel stack, hardware context, and scheduling policy. We usually speak of process scheduling, but the thread is actually the basic unit the scheduler works with. The Linux kernel implements threads in a particular way, unlike classic Unix. In Linux, processes and threads are not distinguished: both are abstracted by task_struct, except that a multithreaded process is represented by a set of task_structs that share data structures such as the memory descriptor. A thread ID uniquely identifies a thread within a process. POSIX only requires the thread ID to be unique within the owning process, but in the Linux kernel implementation the thread ID is unique across the whole system; for portability, application software should not rely on this. In user space, the thread ID of the current thread can be obtained through the gettid function. For a single-threaded process, the process ID and the thread ID are the same. For a multithreaded process, each thread has its own thread ID, but all threads share a single PID.

To support the shell's job control, a set of processes needs to be organized into a process group. These concepts are described in detail in the document on processes and terminals and are not discussed here. To identify a process group, we introduce the process group ID. The ID of the first process in the group is typically used as the process group ID, and all processes in the group share it. In user space, the process group ID can be accessed through interface functions such as setpgid, getpgid, setpgrp, and getpgrp.

Moving up through the thread ID, process ID, and process group ID, we finally reach the topmost ID, the session ID. It identifies one round of user interaction with the computer system: the user logs in, repeatedly submits tasks (that is, jobs, or process groups) and observes the results, and finally logs out, which destroys the session. The concept of a session is also described in detail in the document on processes and terminals and is not repeated here. In user space, the session ID can be manipulated through getsid and setsid.

III. Basic concepts

1. How user space sees process ID

We use the following block diagram to describe how user space and kernel space look at process IDs:

From the perspective of user space, each process can call getpid to obtain the ID that identifies it, which we call the PID and whose type is pid_t. In user space, then, a process is uniquely identified by a positive integer (we call this positive integer the PID number). With the introduction of containers, things become a little more complicated: a PID number uniquely identifies a process only within its container. That is, if container 1 and container 2 both exist in the system, there can be two processes whose PID equals a, one in container 1 and one in container 2. Processes can also live outside any container, such as process x and process y, which behave like processes on a traditional Linux system. Alternatively, you can think of process x and process y as living in a system-level top container 0, which contains process x, process y, and the two containers. By the same logic, container 2 can itself contain a nested container, which produces a container hierarchy.

A container (Linux container) is an OS-level virtualization approach: virtualization done purely in software, with little overhead and light weight, and of course with limitations of its own. Linux containers are built mainly on the kernel's cgroup and namespace isolation mechanisms. Most of these are outside the scope of this document; we are mainly concerned with the PID namespace.

A process running on Linux has access to many system resources, such as PIDs, user IDs, network devices, the protocol stack, IP addresses, port numbers, and the filesystem hierarchy. On traditional Linux these resources are global: when one process unmounts a file system mount point and so changes its own view of the filesystem hierarchy, the directory structure seen by every process changes too (the umount is perceived by all processes). Is it possible to isolate these resources? That is exactly what namespaces are for, and the PID namespace isolates the PID number space.

A process is not aware of its PID namespace. It only knows that it can obtain its own ID through getpid; it does not know that it is actually locked in a PID namespace cage. From this point of view, user space is simple and happy. Kernel space is not so fortunate: it needs complex data structures to represent PIDs in this hierarchy.

Finally, although the description above is about the PID, the same applies to the TID, PGID, and SID. Originally each of these IDs could uniquely identify an entity on its own; now a (PID namespace, ID) pair is needed to uniquely identify an entity.

2. How kernel space sees process ID

From user space, a single positive integer is enough to represent a PID, but in kernel space one positive integer is definitely not enough. Consider a 2-level PID namespace hierarchy (the case in the picture above). PID namespace 0 is the parent of PID namespaces 1 and 2. The process whose PID equals a in PID namespace 1 corresponds to the process whose PID equals m in PID namespace 0. In other words, the kernel needs two positive integers, one in each of the two namespaces, to record the ID information of that one process. Generalizing, in an n-level PID namespace hierarchy, a process in the x-th level (counting the root as the first level) needs x positive integer IDs to represent it.

In addition, the kernel records the relationships between PID namespaces: which is the root, which are the leaves, which is the parent of which, and so on.

IV. Data abstraction in the kernel

1. How is a PID number abstracted?

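struct upid {
    int nr;                         /* ID number within ns */
    struct pid_namespace *ns;       /* namespace the number belongs to */
    struct hlist_node pid_chain;    /* link in the global hash table */
};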

Although user space uses a positive integer to represent the various IDs, the kernel has to use a pair (PID namespace, ID number), because a bare PID number is meaningless: only once its PID namespace is fixed is the ID number unique. The nr and ns members of struct upid are then easy to understand; they hold the ID number and the PID namespace respectively. In addition, when user space passes an ID number into the kernel to request a service (for example, to send a signal to an ID), the kernel must quickly find the corresponding upid object from the ID number. To meet this need, the kernel stores every upid in the system in a hash table, and the pid_chain member links the upid into its hash bucket.

2. How are the TID, PID, SID and PGID abstracted?

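struct pid
{
    atomic_t count;                         /* reference count */
    unsigned int level;                     /* level of the owning namespace */
    struct hlist_head tasks[PIDTYPE_MAX];   /* lists of tasks that use this pid */
    struct rcu_head rcu;
    struct upid numbers[1];                 /* one (namespace, number) pair per level */
};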

Although it is named pid, this data structure abstracts more than a thread ID or process ID; it also covers the process group ID and the session ID. Because several task structs can share one pid (for example, all task structs in a session point to the same struct pid object that represents the session ID), it is not surprising to find a member such as count, which is the reference count of the data object.

Once the PID namespace hierarchy is understood, the level member is not hard to understand. Every pid allocated in the system belongs to a namespace, and that namespace sits at some level of the PID namespace hierarchy; pid->level is the level of the namespace the pid belongs to. Because a pid is also visible in every ancestor PID namespace, the level value also determines how many PID namespaces can see the pid object.

For every PID namespace in which the pid is visible there is one (PID namespace, PID number) pair; numbers is the array that holds the PID number at each level. The tasks member links the pid to the tasks that use it, which is described in the next section.

3. How does the process descriptor represent the TID, PID, SID and PGID?

Because several tasks can share an ID (any of the four IDs mentioned above), the data structures have to support two operations:

(1) quickly finding the corresponding struct pid from a task struct

(2) traversing all tasks that use a given struct pid

To meet these requirements, the kernel uses an auxiliary data structure:

struct pid_link
{
    struct hlist_node node;     /* link in the pid's list of tasks */
    struct pid *pid;            /* the pid this task uses */
};

Here node is the list node that chains the task struct onto the struct pid's list of tasks, and pid points to the struct pid itself. With this in place, an array of pid_link can be embedded in the task struct:

struct task_struct {
    ......
    struct pid_link pids[PIDTYPE_MAX];
    ......
};

The pids member of the task struct is an array whose entries represent the TID (PID), PGID, and SID of the task respectively. The following PID types are defined:

enum pid_type
{
    PIDTYPE_PID,
    PIDTYPE_PGID,
    PIDTYPE_SID,
    PIDTYPE_MAX
};

We have been talking about four kinds of IDs (TID, PID, SID, PGID), so why is one fewer defined? Early versions of the kernel did define four PID types, but the TID and PID types were later merged to save memory. Now that quite a few data structures have been introduced, let's use a picture to describe the relationships between them:

For the threads in a process, each thread can find the struct pid object that represents its thread ID through task->pids[PIDTYPE_PID].pid. Every thread also belongs to a process, so there is a struct pid object that represents its process ID. How is that found? This needs a bridge: the group_leader member of the task struct (task->group_leader), through which a thread can always find its thread group leader. The PID of the thread group leader is the process ID of the thread. Therefore, for any thread, task->group_leader->pids[PIDTYPE_PID].pid points to the struct pid object that represents its process ID. For the thread group leader, the thread ID and the process ID are represented by one and the same struct pid object; for ordinary threads that are not the thread group leader, the thread ID and the process ID are represented by different struct pid objects.

A struct pid has three list heads. If the pid identifies only a thread ID, the list under its PID list head contains a single element: the task struct that uses the pid. If the pid represents a process ID, the list under the PID list head contains several task structs, one for each thread that belongs to the process, and the first task struct in the list is the thread group leader. If the pid represents neither a process group ID nor a session ID, the PGID list head and the SID list head in the struct pid are both empty. If the pid represents a process group ID, the structure is as follows:

For a multithreaded process the kernel has several task structs for the one process, but for simplicity, the task struct shown for process x in the picture above is the task struct of its thread group leader. The PGID pointer of each of these task structs (task->pids[PIDTYPE_PGID].pid) points to the struct pid object of the process group. The PGID list head in that pid chains together all the task structs that use the pid (only the task structs of thread group leaders are chained), and the first node in the list is the process group leader.

The session case is similar and is left to the reader.

4. How is the PID namespace abstracted?

This is a little more complicated and is left as a TODO for now.

V. Code analysis

1. How to get the thread ID corresponding to a task struct?

static inline struct pid *task_pid(struct task_struct *task)
{
    return task->pids[PIDTYPE_PID].pid;
}

In the same way, the PGID and SID of a task are easy to obtain. The process ID takes one extra step: first find the task's thread group leader; the thread ID of the thread group leader is the process ID of the task.

2. How to get the current PID namespace from a task struct?

struct pid_namespace *task_active_pid_ns(struct task_struct *tsk)
{
    return ns_of_pid(task_pid(tsk));
}

This takes two steps: first find the struct pid for the task's thread ID, and then find the current PID namespace from that struct pid. The code for the second step is as follows:

static inline struct pid_namespace *ns_of_pid(struct pid *pid)
{
    struct pid_namespace *ns = NULL;
    if (pid)
        ns = pid->numbers[pid->level].ns;
    return ns;
}

A struct pid is hierarchical: it corresponds to (PID namespace, PID number) pairs at several levels. The topmost level is the root PID namespace and the bottom level (the leaf) is the current PID namespace. pid->level gives the current level, so pid->numbers[pid->level].ns is the current PID namespace.

3. How is getpid implemented?

After trapping into the kernel, it is easy to get the current task struct (based on the value of sp_svc). That is the starting point; the code that follows is:

static inline pid_t task_tgid_vnr(struct task_struct *tsk)
{
    return pid_vnr(task_tgid(tsk));
}

task_tgid returns the struct pid of the task's thread group leader, whose thread ID is in fact the process ID. The current PID namespace can be obtained through task_active_pid_ns. With these two parameters, pid_nr_ns can be called to get the PID number of the task:

pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
{
    struct upid *upid;
    pid_t nr = 0;

    if (pid && ns->level <= pid->level) {
        upid = &pid->numbers[ns->level];
        if (upid->ns == ns)
            nr = upid->nr;
    }
    return nr;
}

A pid can span several PID namespaces, but not every PID namespace can see the pid and obtain a PID number for it. The code therefore starts with a check: if the given PID namespace lies deeper in the hierarchy than the pid's own namespace (ns->level > pid->level), it returns 0 directly. If the level is acceptable, the code then checks whether the namespace recorded in the pid at that level is the given PID namespace. If it is, the corresponding PID number is returned; otherwise 0 is returned.

The gettid and getppid interfaces follow the same overall idea as getpid and are not described again.

4. Given a thread ID number, how to find the corresponding task struct?

The inputs here are the ID number and the current PID namespace. How is the corresponding task found from them? There are two steps. The first step is to find the corresponding struct pid:

struct pid *find_pid_ns(int nr, struct pid_namespace *ns)
{
    struct upid *pnr;

    hlist_for_each_entry_rcu(pnr,
            &pid_hash[pid_hashfn(nr, ns)], pid_chain)
        if (pnr->nr == nr && pnr->ns == ns)
            return container_of(pnr, struct pid,
                    numbers[ns->level]);

    return NULL;
}

The system holds a great many struct pid objects, and each pid has (PID namespace, PID number) pairs at several levels, so finding the pid that matches a given PID number and namespace by scanning would be very time-consuming. It is also a frequent operation; a simple example is sending a signal to a specified process (PID number) with kill. Because the operation is both frequent and expensive, the system maintains a global hash table for it: pid_hash points to an array of hash list heads (the number of heads depends on the memory configuration). The table is used to find the struct upid that matches a given PID namespace and ID number. Once the upid is found, container_of yields the enclosing struct pid object.

The second step is to find the task struct from the struct pid. The code is as follows:

struct task_struct *pid_task(struct pid *pid, enum pid_type type)
{
    struct task_struct *result = NULL;
    if (pid) {
        struct hlist_node *first;
        first = rcu_dereference_check(hlist_first_rcu(&pid->tasks[type]),
                                      lockdep_tasklist_lock_is_held());
        if (first)
            result = hlist_entry(first, struct task_struct, pids[(type)].node);
    }
    return result;
}

5. How is getpgid implemented?

SYSCALL_DEFINE1(getpgid, pid_t, pid)
{
    struct task_struct *p;
    struct pid *grp;
    int retval;

    rcu_read_lock();
    if (!pid)
        grp = task_pgrp(current);
    else {
        retval = -ESRCH;
        p = find_task_by_vpid(pid);
        if (!p)
            goto out;
        grp = task_pgrp(p);
        if (!grp)
            goto out;

        retval = security_task_getpgid(p);
        if (retval)
            goto out;
    }
    retval = pid_vnr(grp);
out:
    rcu_read_unlock();
    return retval;
}

When the PID number passed in is 0, getpgid returns the process group ID number of the calling process; task_pgrp yields the struct pid that represents the process group ID the process uses. If a non-zero PID number is passed, getpgid is asked for the PGID of the process with that PID number. In that case find_task_by_vpid is called to find the task struct for the PID number, and once the task struct is found it is easy to get the PGID it uses (a struct pid object). At this point, whichever case applies (PID number equal to 0 or not), we have the struct pid object for the PGID. User space ultimately needs a PGID number, so pid_vnr is called to translate that pid into its number in the current namespace.

The code logic of getsid is similar to that of getpgid and is not described again.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
