Introduction to Linux Kernel Engineering -- How User-Space Processes Use Kernel Resources

Source: Internet
Author: User

This article is mostly reproduced and assembled from other sources; I simply felt this material belonged together in one place.

How a process uses system resources

Most processes request memory through glibc, but glibc is itself just an application library: ultimately it calls the operating system's memory-management interfaces. In most cases glibc is transparent to both the user and the operating system, so directly observing the memory usage the operating system records for a process is very helpful. But glibc's own implementation has quirks, so in special cases an account of a process's memory use must also consider the glibc layer. Usage of other operating-system resources can be viewed directly through the proc file system.

The kinds of memory a process requires

A process needs memory, but not necessarily physical memory. A process can request 1 GB of memory and the kernel will happily approve it, but the kernel only backs the request with actual physical memory when the process really uses it: the first access triggers a page fault, and the kernel's fault-handling code hands the process real pages. It is like keeping money in a bank: most of the time your assets are just a number, and only when you withdraw cash does the bank have to produce it (because it promised to earlier). Normally the bank's cash on hand is far smaller than its depositors' total assets (an asset bubble works the same way), and when everyone runs on the bank at once, the bank goes bankrupt. Operating-system memory is the same: when every process demands its "cash" at once, the kernel cannot honor all its promises and has to fail.

The kinds of memory a process uses: memory holding its data (heap), memory for executing it (stack), physical memory shared with other processes (such as shared libraries and shared memory), the size of its virtual address space (2^32 or 2^64, depending on whether the system is 32-bit or 64-bit), the physical memory the process is actually using (RSS), and the memory holding the code it executes (TRS). In general, the physical memory a process actually uses consists of three parts: data, stack, and executable code; for most processes the data portion dominates.

Entry points for analyzing a process

Every process has a parent process, a process group, and a session (in general a process group belongs to a session, and a process belongs to a process group).

Every process has its own thread group (and possibly coroutines).

The number of page faults a process takes can be used to diagnose memory-sensitive behavior or excessive memory use.

The running time in kernel mode and user mode, and the cumulative wait time of the task, show how much the process depends on system calls (perhaps the logic should be moved into the kernel, or switched to non-blocking calls).

The scheduling policy, the CPU the process runs on, and its priority can be used to skew resource allocation deliberately when system resources are scarce.

The number of pages swapped out and the various memory tables in use illustrate how heavily the process consumes memory.

The process's capability set shows whether the process holds more permissions than it needs.

First, /proc/pid/statm

statm reports the process's memory usage at the current moment, measured in pages; it is a snapshot, not a counter accumulated since boot.

/proc/1 # cat statm

550 70 62 451 0 97 0

Output interpretation

The meaning of each field (in the example above) is:

Field interpretation (values correspond to /proc/1/status, assuming 4 kB pages):

size (pages) = 550: total virtual address space of the task (VmSize / 4)

resident (pages) = 70: physical memory the application is actually using (VmRSS / 4)

shared (pages) = 62: shared pages

trs (pages) = 451: executable virtual memory owned by the program (VmExe / 4)

lrs (pages) = 0: libraries mapped into the task's virtual address space (VmLib / 4)

drs (pages) = 97: program data segment plus user-mode stack ((VmData + VmStk) / 4)

dt (pages) = 0: dirty pages

Second, /proc/pid/stat

stat contains status and accounting information for the process; its time and fault counters accumulate over the life of the process.

/proc/1 # cat stat

1 (linuxrc) S 0 0 0 0 -1 8388864 50 633 20 4 2 357 72 342 16 0 1 0 22 2252800 70 4294967295 32768 1879936 3199270704 3199269552 1113432 0 0 0 674311 3221479524 0 0 0 0 0 0

Each parameter means:

Parameter interpretation

pid = 1: ID of the process (including lightweight processes, i.e. threads)

comm = linuxrc: name of the application or command

task_state = S: state of the task — R: running, S: sleeping (TASK_INTERRUPTIBLE), D: disk sleep (TASK_UNINTERRUPTIBLE), T: stopped, t: tracing stop, Z: zombie, X: dead

ppid = 0: parent process ID

pgid = 0: process group ID

sid = 0: ID of the session the task belongs to

tty_nr = 0: device number of the task's controlling TTY; major = int(tty_nr / 256), minor = tty_nr - major * 256

tty_pgrp = -1: process group ID of the terminal's foreground task, i.e. the PID of whatever is currently running in the foreground (including the shell) on the task's terminal

task->flags = 8388864: process flag bits describing properties of the task

min_flt = 50: number of page faults the task serviced without copying data from disk (minor faults)

cmin_flt = 633: cumulative minor faults of all the task's waited-for child processes

maj_flt = 20: number of page faults that required copying data in from disk (major faults)

cmaj_flt = 4: cumulative major faults of all the task's waited-for child processes

When a process takes a page fault, it traps into kernel mode and performs the following steps:

1. Check whether the faulting virtual address is legitimate.

2. Find or allocate a physical page.

3. Fill in the page's contents (read from disk, zero it directly, or do nothing).

4. Establish the mapping from virtual address to physical address.

5. Re-execute the instruction that triggered the fault.

If step 3 had to read the disk, the fault counts as majflt; otherwise it is minflt.

utime = 2: time the task has run in user mode, in jiffies

stime = 357: time the task has run in kernel mode, in jiffies

cutime = 72: cumulative user-mode run time of all the task's waited-for child processes, in jiffies

cstime = 342: cumulative kernel-mode run time of all the task's waited-for child processes, in jiffies

priority = 16: dynamic priority of the task

nice = 0: static priority of the task

num_threads = 1: number of threads in the task's thread group

it_real_value = 0: delay until the interval timer next delivers SIGALRM to the process, in jiffies

start_time = 22: time after boot at which the task started, in jiffies

vsize = 2252800 (bytes): size of the task's virtual address space

rss = 70 (pages): size of the physical address space the task currently has resident

These pages may hold code, data, and stack.

rlim = 4294967295 = 0xffffffff (bytes): maximum resident set size allowed for the task

start_code = 32768 = 0x8000: start address of the task's code segment in the virtual address space (determined by the linker)

end_code = 1879936: end address of the task's code segment in the virtual address space

start_stack = 3199270704 = 0xbeb0ff30: start address of the task's stack in the virtual address space

kstkesp = 3199269552: current value of ESP (the 32-bit stack pointer), as recorded in the process's kernel stack page

kstkeip = 1113432 = 0x10fd58: current value of EIP (the 32-bit instruction pointer), pointing at the next instruction to execute

pendingsig = 0: bitmap of pending signals, recording ordinary signals sent to the process

block_sig = 0: bitmap of blocked signals

sigign = 0: bitmap of ignored signals

sigcatch = 674311: bitmap of caught signals

wchan = 3221479524: if the process is sleeping, the address of the kernel function it is waiting in

nswap = 0: number of pages swapped out

cnswap = 0: number of pages swapped out by all child processes

exit_signal = 0: signal sent to the parent process when the process exits

task_cpu(task) = 0: which CPU the task runs on

task_rt_priority = 0: real-time priority of the task

task_policy = 0: scheduling policy of the process — 0: non-real-time; 1: FIFO real-time; 2: RR real-time

Third, /proc/pid/status

status presents the process's state, memory usage, and signal/capability information in human-readable form; like statm, it is a snapshot of the current moment.

/proc/286 # cat status

Name:   mmtest
State:  R (running)
SleepAVG:       0%
Tgid:   286
Pid:    286
PPid:   243
TracerPid:      0
Uid:    0 0 0 0
Gid:    0 0 0 0
FDSize: 32
Groups:
VmPeak: 1464 kB
VmSize: 1464 kB
VmLck:  0 kB
VmHWM:  344 kB
VmRSS:  344 kB
VmData: 20 kB
VmStk:  84 kB
VmExe:  4 kB
VmLib:  1300 kB
VmPTE:  6 kB
Threads:        1
SigQ:   0/256
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 0000000000000000
CapInh: 0000000000000000
CapPrm: 00000000fffffeff
CapEff: 00000000fffffeff

Output interpretation

Parameter interpretation

Name: name of the application or command

State: task state — running / sleeping / zombie / ...

SleepAVG: the task's average sleep time. Interactive tasks sleep often and for long stretches, so their sleep_avg is correspondingly larger and their computed priority correspondingly higher.

Tgid = 286: thread group ID

Pid = 286: task ID

PPid = 243: parent process ID

TracerPid = 0: PID of the process tracing this one (0 if not traced)

Uid: uid, euid, suid, fsuid

Gid: gid, egid, sgid, fsgid

FDSize = 32: current size of the file-descriptor table (files->fds), i.e. the number of slots for open file handles

Groups: supplementary group list

VmPeak: peak size of the process's virtual address space

VmHWM: peak resident set size (the high-water mark of physical memory use)

VmSize (kB) = 1464: size of the task's virtual address space (total_vm - reserved_vm), where total_vm is the size of the process's address space and reserved_vm covers reserved or special memory regions

VmLck (kB) = 0: physical memory the task has locked; locked memory cannot be swapped out to disk (locked_vm)

VmRSS (kB) = 344: physical memory the application is actually using — the same value the ps command reports as RSS

VmData (kB) = 20: size of the program's data segment (virtual memory holding initialized data) (total_vm - shared_vm - stack_vm)

VmStk (kB) = 84: size of the task's user-mode stack (stack_vm)

VmExe (kB) = 4: executable virtual memory owned by the program — its code segment, excluding libraries the task uses (end_code - start_code)

VmLib (kB) = 1300: size of the libraries mapped into the task's virtual memory space (exec_lib)

VmPTE = 6 kB: size of all the process's page tables, in kB

Threads = 1: number of tasks sharing this signal descriptor; in a POSIX multithreaded program, all threads in the thread group share the same signal descriptor

SigQ: number of queued signals awaiting handling (current/limit)

SigPnd: bitmap of signals pending for the thread

ShdPnd: bitmap of signals pending for the whole thread group

SigBlk: bitmap of blocked signals

SigIgn: bitmap of ignored signals

SigCgt: bitmap of caught signals

CapInh: inheritable — capabilities this process can pass on across an exec

CapPrm: permitted — capabilities the process is allowed to use; it may contain capabilities not in CapEff, which the process has temporarily dropped. CapEff is a subset of CapPrm; dropping unneeded capabilities improves security.

CapEff: effective — the capabilities currently in effect for the process

Fourth, /proc/loadavg

This file gives load averages aggregated over all CPUs; per-CPU load information cannot be obtained here.

/proc # cat loadavg

1.00 1.00 0.93 2/19 301

The meaning of each value is:

Parameter interpretation

lavg_1 (1.00): 1-minute load average

lavg_5 (1.00): 5-minute load average

lavg_15 (0.93): 15-minute load average

nr_running (2): number of tasks on the run queues at sampling time, the same value as procs_running in /proc/stat

nr_threads (19): number of tasks in the system at sampling time (excluding tasks that have already finished)

last_pid (301): the most recently assigned PID, counting lightweight processes (i.e. threads)

If, say, the 1-minute load were 4.62 on a two-CPU machine, the average number of runnable tasks per CPU would be 4.62 / 2 = 2.31.

Fifth, /proc/286/smaps

This file shows the size and residency of each linear region (virtual memory area) of the process.

/proc/286 # cat smaps

00008000-00009000 r-xp 00000000 00:0c 1695459    /memtest/mmtest
Size:             4 kB
Rss:              4 kB
Shared_Clean:     0 kB
Shared_Dirty:     0 kB
Private_Clean:    4 kB
Private_Dirty:    0 kB

00010000-00011000 rw-p 00000000 00:0c 1695459    /memtest/mmtest
Size:             4 kB
Rss:              4 kB
Shared_Clean:     0 kB
Shared_Dirty:     0 kB
Private_Clean:    0 kB
Private_Dirty:    4 kB

00011000-00012000 rwxp 00011000 00:00 0          [heap]
Size:             4 kB
Rss:              0 kB
Shared_Clean:     0 kB
Shared_Dirty:     0 kB
Private_Clean:    0 kB
Private_Dirty:    0 kB

40000000-40019000 r-xp 00000000 00:0c 2413396    /lib/ld-2.3.2.so
Size:           100 kB
Rss:             96 kB

Interface methods for user-space processes to use kernel memory management

From the operating system's perspective, a process allocates memory in two ways, via two system calls: brk and mmap (leaving shared memory aside).

1. brk pushes _edata, the pointer to the highest address of the data segment (.data), toward higher addresses (see the glibc section below).

2. mmap finds a piece of free virtual memory in the process's virtual address space (in the file-mapping area between the heap and the stack).

Both methods allocate only virtual memory; no physical memory is assigned. On the first access to the allocated range, a page fault occurs, the operating system allocates physical memory, and the mapping between virtual and physical memory is established.

The standard C library provides the malloc/free functions to allocate and release memory; they are implemented on top of the brk, mmap, and munmap system calls.

How glibc manages memory: predefined process memory regions

http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=0650&db=man&raw=1&fname=/usr/share/catman/p_man/cat3/standard/_rt_symbol_table_size.z

How glibc allocates memory

The following example illustrates the principle of memory allocation. By default glibc uses ptmalloc; jemalloc and tcmalloc are later implementations that improve on the same ideas.

Principle

When malloc requests less than 128K of memory, glibc allocates with brk, pushing _edata toward higher addresses. Only virtual address space is granted — no physical memory backs it yet (which is also why it is not initialized). On the first read or write, a page fault occurs, the kernel allocates the corresponding physical memory, and the virtual addresses are mapped onto it. For example:


1. When the process starts, the initial layout of its (virtual) memory space is as shown in Figure 1.

The mmap memory-mapped files (for example libc-2.2.93.so and other data files) sit between the heap and the stack; they are omitted here for simplicity. The _edata pointer (defined inside glibc) points to the highest address of the data segment.

2. After the process calls A = malloc(30K), the memory space looks like Figure 2:

The malloc function invokes the brk system call and pushes the _edata pointer 30K toward higher addresses, completing the virtual memory allocation.

You may ask: does just moving _edata up by 30K really complete the allocation?

Indeed, _edata + 30K only completes the allocation of virtual addresses; A's memory still has no physical pages behind it. Only when the process first reads or writes A does a page fault occur, at which point the kernel allocates the physical pages for A. In other words, if malloc allocates a block that is never accessed, no corresponding physical pages are ever assigned.

3. After the process calls B = malloc(40K), the memory space looks like Figure 3.

Case two: when malloc requests more than 128K of memory, glibc allocates with mmap, finding a piece of free memory between the heap and the stack (an independent mapping, initialized to zero). For example:


4. After the process calls C = malloc(200K), the memory space looks like Figure 4:

By default, when the requested size exceeds 128K (tunable via the M_MMAP_THRESHOLD option), malloc does not push the _edata pointer; instead it uses the mmap system call to allocate a piece of virtual memory between the heap and the stack.

This is done mainly because:

brk-allocated memory can only be returned to the kernel after the memory above it is freed (for example, A cannot be released before B is — which is how memory fragmentation arises; when trimming happens is covered below), whereas mmap-allocated memory can be released on its own. There are other trade-offs too, good and bad; interested readers can study the malloc code inside glibc.

5. After the process calls D = malloc(100K), the memory space looks like Figure 5.

6. After the process calls free(C), C's virtual memory and physical memory are released together.


7. After the process calls free(B), as in Figure 7: B's virtual and physical memory are not released, because there is only one _edata pointer — if it were pushed back down, what would happen to D? Of course, B's memory can be reused: if a 40K request arrives now, malloc will very likely hand B's block back out.

8. After the process calls free(D), as in Figure 8: B and D are merged into a single 140K block of free memory.

9. By default, a trim operation runs when the free memory at the top of the heap exceeds 128K (tunable via the M_TRIM_THRESHOLD option). During the previous free, the top of the heap was found to hold more than 128K of free memory, so the heap was trimmed as shown in Figure 9.

Experiment

With the allocation principle understood, consider the following phenomenon:

1. During a stress test, the system under test performed poorly. Concretely:

the process consumed about 20% CPU in system (kernel) mode and 10% in user mode, while the system was roughly 70% idle.

2. Running `ps -o majflt,minflt -C program` showed that majflt grew by 0 per second while minflt grew by more than 10,000 per second.

Preliminary analysis

majflt stands for major fault; minflt stands for minor fault. The two values count the page faults the process has taken since it started. When a process takes a page fault, it traps into kernel mode and performs the following steps:

- Check whether the faulting virtual address is legitimate.

- Find or allocate a physical page.

- Fill in the page's contents (read from disk, zero it directly, or do nothing).

- Establish the mapping from virtual address to physical address.

- Re-execute the instruction that triggered the fault.

If the third step had to read the disk, the fault counts as majflt; otherwise it is minflt.

With minflt this high — more than 10,000 per second — it had to be closely related to the process's kernel-mode CPU consumption.

Analyze code

Reading the code turned this up: each request allocated 2 MB of memory with malloc and freed it when the request finished. The logs showed the allocation statement took 10 µs, while the average request took 1000 µs to process. The cause was found!

Although the allocation statement itself takes only a tiny fraction of a request's processing time, it severely affects performance. To explain why, you first need to understand how memory allocation works.

Truth

With the allocation principle in hand, the reason for the test module's high kernel-mode CPU consumption is clear: every request mallocs a 2 MB block, and by default malloc satisfies it with mmap; when the request ends, munmap frees it again. If each request touches, say, 6 physical pages, each request produces 6 page faults; at 2000 requests per second, that is over 10,000 page faults per second. None of them need to read the disk, so they are minflt; but page faults execute in kernel mode, so the process's kernel-mode CPU consumption is very high. The faults are scattered throughout the handling of each request, which is why the allocation statement's own cost (10 µs) looks tiny relative to the whole request (1000 µs).

Solutions

Option 1: change the dynamic allocation to static allocation, or malloc once per thread at startup and keep the buffer in thread data. Due to the particulars of this module, however, neither static nor startup-time allocation was workable. In addition, the default stack size limit under Linux is 10 MB, so allocating several MB on the stack carries risk.

Option 2: forbid malloc from allocating memory via mmap, and suppress heap trimming.

At process startup, add the following two lines of code:

mallopt(M_MMAP_MAX, 0);         // forbid malloc from allocating memory via mmap

mallopt(M_TRIM_THRESHOLD, -1);  // suppress heap trimming

Effect: with these two lines added, ps showed that under steady load both majflt and minflt stayed at 0, and the process's system-mode CPU dropped from 20% to 10%.

Summary

You can use the command `ps -o majflt,minflt -C program` to view a process's majflt and minflt values; both are cumulative counts that grow from process start. They deserve extra attention when stress-testing programs with high performance requirements.

If a process uses mmap to map a large data file into its virtual address space, pay particular attention to majflt: compared with minflt, majflt's damage to performance is fatal — a random disk read costs on the order of milliseconds — whereas minflt only affects performance when it occurs in very large numbers.

Implementations of other memory-allocation algorithms

The malloc in glibc is not the only memory manager available. There are also dlmalloc (long used by Android's Bionic), Google's tcmalloc, and jemalloc, widely regarded as the strongest. jemalloc's core idea is to divide allocations into three size classes (small, large, and huge) and to spread threads across multiple arenas, each with its own memory pool; tcmalloc instead puts a local cache in front of each thread, backed by a central heap. Roughly speaking, jemalloc suits a fixed set of threads, while tcmalloc suits workloads where the number of threads varies widely.

Copyright notice: this article is the blogger's original work; do not reproduce it without the blogger's permission.
