When tuning a Linux system, first consider the structure of the operating system as a whole, and then optimize each of its parts. The following diagram shows the components of a Linux system:
As the diagram shows, we can tune the applications, the libraries, the kernel, the drivers, and the hardware itself, so the rest of this article introduces each of these in turn; all of them affect system performance. The kernel subsystems involved are mainly: 1. Network 2. I/O (the input/output subsystem) 3. Process 4. Memory (RAM) 5. File system

I. Linux process management

1. Process definition: a process is the unit of resource allocation in a computer. It owns scheduled system resources such as CPU time and memory space, and it is a running copy of a program. A process also holds a range of resources: open files, pending signals, kernel data, processor state, a memory address mapping, one or more threads of execution, and the global variables of its data segment.

2. Process states: running, interruptible (interruptible sleep), uninterruptible (uninterruptible sleep), stopped, and zombie.
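To see these states on a live system before walking through them one by one, a minimal sketch (the STAT letters come from ps):

#ps axo pid,comm,stat | head      // STAT: R=running, S=interruptible sleep, D=uninterruptible, T=stopped, Z=zombie
#ps axo stat | sort | uniq -c     // count how many processes are in each state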
TASK_RUNNING: the process is either executing on a CPU or waiting in the run queue for CPU time.
TASK_INTERRUPTIBLE: the process sleeps until its environment changes; a hardware interrupt, the release of a resource it is waiting for, or a signal can wake it back to TASK_RUNNING. Such processes are generally I/O-bound.
TASK_UNINTERRUPTIBLE: like TASK_INTERRUPTIBLE, except that the sleep cannot be interrupted by a signal.
TASK_STOPPED: the process has stopped executing; it enters this state when it receives a SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU signal.
TASK_ZOMBIE: the child process has finished executing, but its parent has not yet reaped it, so the resources the child occupied cannot be reclaimed.

3. Process life cycle: a child process is a copy of its parent, identical except for its PID and PPID; when the child executes a program of its own it needs a separate address space, and when execution completes the resources the child consumed must be reclaimed by its parent. That is one process life cycle.

4. Threads: a thread is a smaller unit of resource scheduling than a process, so it is also called a lightweight process (LWP), and thread scheduling is cheaper for the system. With processes, even two tasks in the same program each need their own resource allocation; threads within one process can instead share and reuse those resources. This is why Nginx performs better than Apache: Nginx is thread-based, while Apache's prefork mode is process-based, and each prefork process allocates about 4M per connection whereas a thread needs only about 4K, a significant difference. For a comparison of process versus thread resource allocation, the diagram below makes the difference easy to see:
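Beyond the diagram, you can compare the two models on a live system; a small sketch, assuming an nginx service is running:

#ps -eLf | grep nginx                            // one row per thread; the LWP column is the thread ID
#grep Threads /proc/$(pidof -s nginx)/status     // how many threads one process holds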
5. Process scheduling: switching between processes according to a scheduling algorithm. Common scheduler complexities are O(1), O(log n), O(n), O(n^2), and O(2^n). The O(1) scheduler (used in 2.6 kernels) schedules in constant time and divides CPU time fairly, but strict fairness can waste time: given an editor and a movie player, the mostly idle editor wastes its share. To address the shortcomings of O(1), the deadline idea was borrowed: the deadline scheduler keeps three queues, an active queue, an expired queue, and a deadline queue; each request gets a timer, and when its deadline arrives it is served regardless of priority: "rescue whoever is closest to dying". (Deadline is now used as a disk I/O scheduler.) The process scheduler now uses CFS (the Completely Fair Scheduler): instead of allocating a fixed time slice to each process, it allocates each process a proportion of CPU time.

6. Process types: interactive processes, batch processes, and real-time processes.

7. Process priority: static priority: 1-99 and 100-139. Dynamic priority: adjustable in the 100-139 range by changing the nice value, which is what is displayed; nice ranges from -20 to 19. Real-time priority: 0-99 (the higher the number, the higher the priority).

8. Process scheduling policies:
SCHED_FIFO: first in, first out, priorities 1-99; can only schedule real-time processes.
SCHED_RR: round robin; introduces time slices and is an improved FIFO; can also only schedule real-time processes.
SCHED_OTHER: schedules the traditional time-shared processes in the 100-139 range.
SCHED_BATCH: only schedules processes with nice=0, i.e. priority 120.
Use #ps axo comm,rtprio to display real-time priorities.
To start a real-time process with a given scheduling class and priority:
chrt -f [1-99] /path/to/program arguments   // add it to the FIFO queue
chrt -r [1-99] /path/to/program arguments   // add it to the RR queue
nice and renice adjust the dynamic priority.

To see whether the CPU is the system's bottleneck:
1. Load average (#yum install sysstat): as a rule of thumb, the three load averages should not stay above 3. Tools: w, uptime, top, sar -q 1 3, vmstat 1 5.
2. CPU utilization: mpstat -P ALL (around 60% of time in user space is normal), sar -P ALL 1, iostat -c 1, cat /proc/stat.

9. CPU cache: cache accesses divide into cache hits and cache misses.
#yum install x86info
#x86info -c   // view the CPU's cache types
#valgrind --tool=cachegrind ls   // view a command's cache hit behaviour
#dmesg | grep -i cache   // view cache sizes

10. Balancing multiple CPUs and multi-core CPUs:
#watch -n .5 'ps axo comm,pid,psr | grep httpd'   // see which CPU each process is on
taskset [opts]
#taskset -c -p cpulist PID   // bind a process to specific CPUs
#taskset -c -p 0 5533        // bind process 5533 to CPU 0
This increases the cache hit rate, but it undermines load balancing.

11. Scheduling domains: binding a group of processes to a domain made up of several CPUs. The point of a scheduling domain is to define a CPU group as the domain; one CPU set is one scheduling domain. To configure a local scheduling domain:
#mkdir /cpusets
#vim /etc/fstab   // add the line: cpuset /cpusets cpuset defaults 0 0
#cd /cpusets
#ls
#mkdir ro
#cd ro
#echo 0 > cpus
This creates an "ro" scheduling domain bound to the first CPU.
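Putting the steps above together, a minimal sketch of the whole domain setup; PID 5533 is reused from the taskset example and is hypothetical, and on cpuset filesystems the mems file must be set as well:

#mount -t cpuset cpuset /cpusets   // or add the fstab line above and run mount -a
#mkdir /cpusets/ro
#echo 0 > /cpusets/ro/cpus         // the domain may only run on CPU 0
#echo 0 > /cpusets/ro/mems         // ...and may only allocate from memory node 0
#echo 5533 > /cpusets/ro/tasks     // move process 5533 into the domain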
12. The address space of a process: a process works within its own memory space, so every process has its own address space, each with its own characteristics and data sizes. For the Linux kernel, each process's dynamically allocated memory is structured as shown: the process address space divides into the code segment, the data segment, the BSS, the heap, and the stack.

II. Memory management

1. To manage memory you must understand the memory scheduling mechanism, which I will walk through briefly below:
Going from left to right: the user issues a request, and through library calls it crosses from user space into kernel space; in kernel space the data is read from disk into memory to be operated on, and when the operation completes it is synchronized back to disk. To keep the CPU efficient, the MMU (Memory Management Unit) performs address translation on the CPU's behalf throughout this process. First the swap map and the slab allocator's directory are consulted to find where the resource lives, and then the corresponding memory or disk space is accessed. To avoid external fragmentation when memory is reclaimed, the zoned buddy allocator is introduced to allocate memory sensibly; once the space a process occupied is no longer needed, its resources must be reclaimed, and pdflush is used here to synchronize dirty pages in memory back to disk.

PTE: page table entry
PAE: Physical Address Extension
TLB: translation lookaside buffer; only x86info -c and dmesg can view it

2. Memory layout: a comparison of the memory space of 32-bit and 64-bit systems:
Because of hardware limitations, the kernel cannot treat all pages the same way, so memory is divided into zones. In detail:

1. On a 32-bit system, the maximum address space of a single process is 4G (2^32), divided into 1G of kernel space and 3G of user space. Within the 1G kernel space, the first 16MB is used for DMA (the direct memory access zone, including its pages); the 880MB from 16MB to 896MB is used directly by the kernel; and the remaining 128MB is used for mapping from virtual address space to physical address space. The 3G above it is the process's full user address space. On a 64-bit system, 1G is used for DMA and the rest can be addressed directly, so on a server there is no question: install a 64-bit system.

2. The usual causes of page faults: the data is still on disk, or the data is on the swap partition. If data is continuously being scheduled in from the swap partition, you can conclude directly that memory is the bottleneck: there is not enough memory, and the fix needs no further explanation: add more.

3. How to view each process's resource usage:
#cat /proc/PID/status or #cat /proc/PID/statm, then examine the contents; every process started occupies a certain amount of address space.
You can also use the graphical monitor: #gnome-system-monitor
#pmap PID   // view the library files a process maps, or locate a memory bottleneck
#yum install glibc-utils
#memusage command   // view the memory consumed by a single command, displayed as a chart
#memusage --help   // more help on the command
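A small sketch tying these commands together to size a single process; httpd is only an example name:

#pid=$(pidof -s httpd)
#grep -E 'VmSize|VmRSS|VmData' /proc/$pid/status   // virtual size, resident set, data segment
#pmap $pid | tail -1                               // the total mapped address space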
#ps axo minflt,majflt   // see the page faults generated by all processes
Here minflt counts minor faults, which are satisfied from memory without disk I/O, and majflt counts major faults, where the page must be read in from disk or from the swap partition. From this you can judge: if a workload generates major faults against the swap partition over a long period, memory is the bottleneck.

4. Memory allocation happens in several stages: (1) a process forks or execs a child process; (2) the new process requests memory; (3) the new process uses the memory; (4) the process frees the memory.

5. Memory types: SRAM: static RAM. DRAM: dynamic RAM, including SDRAM, DDR, and RDRAM (used in servers; narrow bus, memory with parity).

6. Improving TLB performance: the TLB (translation lookaside buffer) caches translations from virtual to physical addresses, i.e. it is a buffer for page-table entries. On 32-bit systems the TLB supports 4K and 4M page sizes; other systems support more. To learn about it, install the help documentation with #yum install kernel-doc and, after installation, see /usr/share/doc/kernel-doc-2.6.18/Documentation. If we want each TLB entry to cover more memory, we can enable hugetlb pages (large page tables) through kernel parameters, which raises the TLB cache hit ratio and reduces the number of PTEs needed, speeding up address translation (see the sketch after this subsection).

To adjust the number of huge pages:
Temporary:
#echo 4 > /proc/sys/vm/nr_hugepages
#sysctl -w vm.nr_hugepages=n
Permanent:
#vim /etc/sysctl.conf   // add vm.nr_hugepages=n
#sysctl -p   // save and take effect
To use the large page table afterwards, we can mount a large-page filesystem:
#mkdir /hugepages
#mount -t hugetlbfs none /hugepages
From then on it can be used directly. Of course, to monitor a process's scheduling in detail, you can also trace its system calls:
#strace -p PID   // view a process's detailed system-call activity
#strace ls   // see the detailed system calls a command issues
#strace -o /tmp/tmp.txt ls   // save the trace to the specified file
#strace -c -p PID   // count the time spent in and number of each system call, then analyze the results
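As referenced in the hugetlb discussion above, a minimal end-to-end sketch of the setup; 128 pages is an arbitrary example count:

#echo 'vm.nr_hugepages = 128' >> /etc/sysctl.conf
#sysctl -p                            // apply the permanent setting
#grep -i huge /proc/meminfo           // verify HugePages_Total and HugePages_Free
#mkdir /hugepages
#mount -t hugetlbfs none /hugepages   // expose the large pages as a filesystem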
7. Memory use should follow these policies:
a. Reduce the cost of small memory objects: the tool is the slab cache.
b. Reduce the time spent on slow system services:
   i. filesystem: buffer cache (slab cache)
   ii. disk I/O: page cache
   iii. interprocess communication: shared memory
   iv. network I/O: buffer cache, ARP cache, connection tracking
(A buffer exists to smooth a rate mismatch between two sides; a cache exists for reuse, to raise the hit ratio.)

With the theory above in place, the next step is to adjust the memory kernel parameters in detail:

1. vm.min_free_kbytes: the minimum number of kilobytes to keep free in memory. When an application repeatedly allocates and frees large amounts of memory, disk bandwidth is high, CPU utilization is low, and free memory is scarce, change this parameter; if it is set too small, service times degrade, other applications cannot get enough memory, and ZONE_NORMAL comes under great pressure. Default: 2890.

2. vm.overcommit_memory: memory overcommit. This parameter takes three values:
0 | heuristic overcommit (default)
1 | always overcommit (not recommended)
2 | allow only the swap space plus a percentage of all RAM (30 is the recommended setting; 50% is already large; the percentage only takes effect together with value 2). Set it with Committed_AS as the reference wherever possible.

3. vm.overcommit_ratio: the overcommit percentage, i.e. the RAM percentage added to swap above (30% is recommended; 50% is already large). Default: 50.

4. Adjust the slab cache: it caches small memory objects such as file inodes (indexes), and each kind of slab can cache only one kind of small memory object. The information lives in the /proc/slabinfo file; we can take a look:
Here you can set the values of limit, batchcount, and sharedfactor, where limit = N * batchcount. For example:
#echo "ext3_inode_cache 108 8" > /proc/slabinfo
#slabtop   // view current slab allocation information
You can also use vmstat -m to view current slab allocation information.
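A quick sketch for checking a single slab cache before and after such a change; ext3_inode_cache is reused from the example above:

#grep ext3_inode_cache /proc/slabinfo   // active/total objects, object size, limit, batchcount
#slabtop -o -s c | head -15             // one-shot listing of the largest caches by size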
5. Adjust the ARP cache:
#cat /proc/net/arp
As you can see, the current ARP cache can also be viewed with these commands:
#arp -a
#ip neigh show
#arp -d hostname   // delete an ARP cache entry
#ip neigh flush dev eth0   // mark the entries invalid; they are cleared a moment later

To keep the ARP cache from tying up system resources, you can limit the number of stored entries:
net.ipv4.neigh.default.gc_thresh1   // the minimum level; below this many entries the cache is not cleaned (default 128)
net.ipv4.neigh.default.gc_thresh2   // soft limit, 512
net.ipv4.neigh.default.gc_thresh3   // hard limit, 1024
net.ipv4.neigh.default.gc_interval   // garbage collection interval, default 30s
These settings control the size of the ARP cache.

6. Page cache: the page cache holds page data and is used for direct reads, reads and writes, reading and writing fast block devices, accessing memory-mapped files, and accessing swap-mapped files.
vm.lowmem_reserve_ratio: the percentage of space reserved in the low address space, generally meaning ZONE_NORMAL, to guard against OOM (memory exhaustion).
vm.vfs_cache_pressure: defines how strongly the kernel reclaims the slab memory used for directory and inode caches relative to pagecache and swapcache. The default is 100; below 100 the kernel is less inclined to reclaim them, and 0 means they are never reclaimed (not recommended). If you open files frequently, raise this value to reclaim memory quickly; if memory is plentiful, keep it at 100. There is generally no need to lower it.
vm.page_cluster: page clustering controls how many pages are swapped out to the swap partition at a time, 2^n pages. The default is 3, i.e. 8 pages; if applications use the swap partition heavily, raise it.
vm.zone_reclaim_mode: the memory zone reclaim mode, i.e. how a zone's space is reclaimed when the zone runs low. Its values are 1 (turn the reclaim mechanism on), 2 (write the zone's dirty pages out to disk during reclaim), and 4 (reclaim the space occupied by swap pages).

7. Anonymous pages: anonymous pages mainly hold program data: arrays, the heap, allocations, and so on.
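To see how much memory anonymous pages occupy versus the page cache, a sketch based on standard /proc/meminfo fields:

#grep -E 'AnonPages|^Cached|SwapCached' /proc/meminfo   // anonymous pages versus file-backed cache
#sar -B 1 5                                             // pgpgin/s and pgpgout/s show ongoing paging activity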
8. SysV IPC (interprocess communication):
1. Semaphores. Adjust via kernel.sem:
#cat /proc/sys/kernel/sem
250 32000 32 128   // semaphores per array, system-wide total, operations per call, number of arrays
2. Message queues (RabbitMQ, for example, exchanges information this way). Adjust:
kernel.msgmnb 16384   // how many bytes a single message queue can hold
kernel.msgmni 16   // the maximum number of message queues
kernel.msgmax 8192   // the upper limit of a single message, 8K
3. Shared memory. Adjust:
kernel.shmmni 4096   // the maximum number of shared memory segments
kernel.shmall 2097152   // the total number of shared pages usable at one time
kernel.shmmax   // the maximum size of a shared memory segment that can be created (shmall * 4K)
Use the ipcs command to view the interprocess communication facilities on the current machine; ipcs -l displays the limits for interprocess communication.
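A short sketch of inspecting and raising the limits above; the shmmax value is only an example:

#ipcs -l   // all IPC limits at a glance
#cat /proc/sys/kernel/sem   // the four semaphore values on one line
#sysctl -w kernel.shmmax=4294967295   // example: allow 4G shared memory segments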
If files are not modified much, you can put the data in /dev/shm; it goes straight into memory, but once power is lost the data is gone.

9. Commands for detailed memory usage:
free -m   // display memory usage
page tables: cat /proc/vmstat
system memory: cat /proc/meminfo (total physical memory, cache, active use, inactive)
or use vmstat -s
#yum install sysstat
sar -r 1 10   // sample once per second, ten times

10. Page categories:
1. free: free pages
2. inactive clean: can be reclaimed and reused directly
3. inactive dirty: the data on the page has not been synchronized to disk; once it is synchronized, the page can be reclaimed
4. active: in use

The buddy system (partner system) avoids external fragmentation when memory is reclaimed, keeping the saved memory space contiguous. A kernel thread called kswapd reclaims free pages: it moves little-used memory (in the inactive clean state) out toward the swap partition to make room for other uses. Use vmstat -a or vmstat -s to display page-frame usage. Pages play two main roles, the page cache and process address space, and all reclaimable pages live in the page cache.

11. How dirty pages are reclaimed:
1. Synchronize them to disk via the pdflush kernel thread (it synchronizes periodically; several instances can run at the same time, and you can define the number of pdflush threads, for instance more than the number of CPUs or disks, to improve concurrency).
2. Move them to the swap partition.
vm.nr_pdflush_threads defines the number of pdflush threads; one pdflush thread per disk is recommended.
vm.dirty_background_ratio: when dirty pages exceed this percentage of total memory, pdflush starts.
vm.dirty_ratio: when a single process's dirty pages exceed this percentage, pdflush starts flushing them.
vm.dirty_expire_centisecs: data dirty for longer than this is flushed (default 30s; 0 means disabled).
vm.dirty_writeback_centisecs: how often the flush runs, default 5s.

12. How clean pages are reclaimed:
sync / fsync
echo s > /proc/sysrq-trigger   // manually sync to disk
echo x > /proc/sys/vm/drop_caches, where x = 1 frees the pagecache, 2 frees dentries and inodes, 3 frees buffers and cache (pagecache plus dentries and inodes).

The out-of-memory killer kills processes when memory fills up. /proc/PID/oom_score is a process's score; the higher the score, the earlier it is killed.

13. Tuning the OOM strategy for when memory runs out:
echo n > /proc/PID/oom_adj   // adjust a process's OOM weight
echo f > /proc/sysrq-trigger   // manually trigger the OOM killer to kill a process
vm.panic_on_oom=1 disables OOM killing, but exhausting memory will then panic the system.

14. Detecting memory leaks. Memory leaks come in two kinds:
1. virtual memory leak: memory is only allocated, never used
2. real leak: freeing the memory fails
#sar -R   // view memory allocation and release rates; for frmpg/s a negative value means page frames are being allocated and a positive value means they are being freed
#yum install valgrind
#valgrind --tool=memcheck ls   // check for memory leaks
As an administrator, always keep an eye out for memory leaks.

15. Swap: swap-out (page-out) moves data from memory to the swap partition; swap-in moves data from the swap partition back into memory. Which pages get swapped: inactive pages and anonymous pages.
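Before tuning swap behaviour, a quick sketch of checking what is being swapped right now:

#swapon -s   // active swap areas and their usage
#vmstat 1 5   // the si/so columns show pages swapped in/out per second
#grep -E 'SwapTotal|SwapFree' /proc/meminfo   // overall swap headroom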
16. Improving swap performance:
1. Reduce decision time: swap out quickly chosen small pages, i.e. anonymous pages.
2. Reduce the number of accesses: build swap clusters; give multiple swap partitions the same priority.
3. Reduce service time: use partitions rather than swap files where possible; place swap partitions on the outer tracks of the disk; and put swap partitions on RAID0 where possible.
vm.swappiness defines how inclined the kernel is to swap out anonymous pages; it is a percentage, and once the ratio is reached swap is used. If you do not want to use the swap partition, set the value low; the default is 60.

17. Swap partition sizing:
batch servers: > 4 * RAM
database servers: <= 1G
application servers: >= 0.5 * RAM
vm.page_cluster=n (swap 2^n pages at a time)
vm.swap_token_timeout: once frequent swapping is detected, the holder of the swap token is left alone for this long, to ease thrashing.

18. Memory monitoring tools:
vmstat -n   // display memory and swap partition data
sar -r   // display memory and swap partition data
sar -R   // display memory allocation rates
sar -W   // display swap data
sar -B   // display paging data
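Tying the swappiness setting and the monitoring tools together, a short sketch; 10 is an example value, not a recommendation from the text:

#sysctl -w vm.swappiness=10   // prefer reclaiming page cache over swapping anonymous pages
#sar -W 1 5                   // pswpin/s and pswpout/s confirm whether swapping subsides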