Summary
This article discusses what tools we can use to assist analysis when problems arise on a Linux system, walks through the possible causes and lines of thinking for one such problem, and introduces some Linux kernel mechanisms to the application-layer development teams so that they can choose more appropriate usage strategies.
Objective
Some time ago the search team's servers were under very high CPU load (the load average reached more than 80). As the saying goes, every trade has its specialists: the search colleagues' understanding of low-level Linux technology is not very deep, so this problem troubled them for a while.
I believe we have all had a similar experience: when a problem touches an area we are unfamiliar with, we tend to be caught unprepared.
Since the virtualization team has some background knowledge of Linux, after learning of the difficulties the search team had run into, we began to assist them in locating the root cause and helping them solve the problem.
I would like to take this opportunity to introduce the tools we can use to assist in analyzing problems on Linux systems, some of Linux's mechanisms for memory management, and our usage strategies.
When there are problems with a Linux system, how do we analyze them?
As the saying goes, a workman who wants to do his work well must first sharpen his tools. To solve a problem, we first have to locate its cause.
There are many problem-locating tools on a Linux system that can help us analyze problems. So, looking at the current symptoms on the search servers, let's think about which tools could be used to find the cause.
When a Linux system responds slowly, from the kernel's point of view there are roughly the following situations:
- A thread executes too long in kernel mode, exceeding the execution time allotted by the scheduling algorithm, consuming the CPU in kernel mode for a long time without returning to user mode. There is a term for this phenomenon: softlockup.
Let's use a phenomenon to briefly illustrate kernel mode versus user mode. We may all have run into this: a command is executing, Ctrl+C cannot kill it, yet hitting the keyboard still gets a response.
This is probably because the process is running in kernel mode. Ctrl+C notifies the process by sending it a signal, and a process only checks for pending signals when it returns from kernel mode to user mode, so it
cannot be killed while it stays in kernel mode. (It can also be because the process has blocked the SIGINT signal in its code.) This phenomenon gives us the feeling that the system is very busy.
For softlockup, the kernel has a detection mechanism and provides users with a sysctl debug interface: kernel.softlockup_panic. We can set this value to 1, so that when this problem occurs,
the kernel takes the initiative to panic and dumps out some scene information.
PS: It is recommended that our servers enable this option.
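A minimal sketch of enabling it (sysctl name as given above; verify that it exists on your kernel version):
$ sudo sysctl -w kernel.softlockup_panic=1
$ echo "kernel.softlockup_panic = 1" | sudo tee -a /etc/sysctl.conf   # persist across reboots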
- The CPU load value is high, indicating that too many threads are in the running state or the D state.
A thread that waits for a resource goes to sleep; if it enters the D state (disk sleep, i.e. uninterruptible sleep), it cannot be interrupted, and it sleeps until the resource it is waiting for is released and wakes it up.
(The general principle: when a thread waits for a resource, such as a semaphore, it is added to that semaphore's wait queue; when another thread releases the semaphore, it checks the semaphore's wait queue
and wakes up the threads in the queue.)
Threads in the D state can be viewed with the ps aux command, where "D" marks the D state:
root       732  0.0  0.0      0     0 ?        S    Oct28   0:00 [scsi_eh_7]
root       804  0.0  0.0      0     0 ?        D    Oct28   5:52 [jbd2/sda1-8]
root       805  0.0  0.0      0     0 ?        S    Oct28   0:00 [ext4-dio-unwrit]
root       806  0.0  0.0      0     0 ?        D    Oct28  12:16 [flush-8:0]
Normally these threads sit in the sleep (S) state on the CPU's scheduling queue:
$ ps aux | grep "flush\|jbd2"
root       796  0.0  0.0      0     0 ?        S     52:39 [jbd2/sda1-8]
root      1225  0.0  0.0      0     0 ?        S    108:38 [flush-8:0]
yafang   15030  0.0  0.0 103228   824 pts/0    S+     0:00 grep flush\|jbd2
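To list only the D-state threads across the whole system, a quick sketch (column layout may differ slightly between ps versions):
$ ps -eo stat,pid,comm | awk '$1 ~ /^D/'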
For this phenomenon the kernel also has a term, hung task, and a corresponding monitoring mechanism. By default, if a thread stays in the D state for 120 s, the kernel prints a warning message; this timeout can be adjusted, and we can also make the kernel panic when it happens.
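A sketch of the relevant knobs, assuming the sysctl names from the 2.6.32-era kernel (check your version):
$ sudo sysctl -w kernel.hung_task_timeout_secs=120   # warn after 120 s in the D state
$ sudo sysctl -w kernel.hung_task_panic=1            # panic instead of only warning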
PS: The purpose of having the kernel panic is to dump the scene information at panic time.
In kernel mode, besides process context there is also interrupt context. (PS: on the 2.6.32 kernel we use, interrupts still borrow the kernel stack of the interrupted task; they do not have their own independent stack space.) Interrupts can also go wrong, for example interrupts staying disabled for a long time; this phenomenon is called hardlockup. For hardlockup the kernel also has a monitoring mechanism: the NMI watchdog. You can check /proc/interrupts to see whether the system has the NMI watchdog enabled:
$ cat /proc/interrupts | grep NMI
NMI:     320993     264474     196631   Non-maskable interrupts
A non-zero NMI count indicates that the system has the NMI watchdog enabled. We can then set kernel.nmi_watchdog to 1 via sysctl, which proactively lets the kernel panic when the NMI watchdog fires, so hardlockup faults are monitored too. Conversely, if the Non-maskable interrupts count is 0, the NMI watchdog is not in use on that machine.
PS: The NMI watchdog on our servers should be enabled.
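A minimal sketch of turning it on (on some kernels the NMI watchdog can only be enabled with the nmi_watchdog=1 boot parameter rather than at runtime):
$ sudo sysctl -w kernel.nmi_watchdog=1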
Deploy tools to collect the scene information
We deployed the tools above to collect the fault scene information to help us find the root cause. The Linux kernel's big gun for collecting fault scene information is kdump + kexec.
The basic principle of kdump: at boot, the kernel first reserves a region of physical memory for a crash kernel. When a panic occurs, kexec is called to boot the crash kernel directly inside that reserved memory region, bypassing the boot loader and a series of initializations,
so the rest of memory is left untouched (i.e. the scene of the accident is preserved). The newly started kernel then dumps the contents of memory (which can be trimmed down) and stores them on disk.
Let's see whether kdump is enabled on our system:
$ cat /proc/cmdline
ro root=UUID=1ad1b828-e9ac-4134-99ce-82268bc28887 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=<size>M@0M KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet
The crashkernel=<size>M@0M kernel boot parameter tells us that the system has kdump enabled. Of course, the crash kernel's address space is not actually at 0M:
$ cat /proc/iomem | grep Crash
  03000000-0b4fffff : Crash kernel
Then check whether the kdump service is running:
$ sudo service kdump status
Kdump is operational
which shows it is already running.
Kdump is configured via /etc/kdump.conf. By default it dumps the kernel scene information (that is, the vmcore) to the /var/crash directory; the vmcore can then be analyzed with the crash command.
PS: It is recommended that all our servers have kdump configured.
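For reference, a minimal /etc/kdump.conf sketch (these directives match the usual CentOS 6 defaults; adjust to your environment):
path /var/crash
core_collector makedumpfile -c --message-level 1 -d 31   # compress the dump and skip pages we don't need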
Capture the scene information: what is the CPU doing?
One morning, just after we came to the office (the bug appeared at a very considerate time of day), a server raised a high-load alarm. Logging in was very slow, but possible; typed commands were very, very slow, but still responded. Using perf top we observed that many threads were in the spinlock or spin_lock_irq (spinning with interrupts disabled) state. (PS: Forgive me for not saving the picture at the time.) It looked as if there was a deadlock or something similar inside the kernel.
PS: perf is a powerful tool for assisting problem analysis; it is recommended that perf be installed on all our servers.
So what is the CPU doing at this moment? For this we brought out the ultimate big gun: use SysRq to make the kernel panic (this server's traffic had already been drained away, so it could be restarted), and let the panic trigger kdump to save the scene information. (There is an even more ultimate weapon: the keyboard's SysRq key plus a letter combination, which applies when we cannot log in to the server at all; with the help of the management card, the keyboard interrupt can be used to trigger the collection of information.)
PS: We recommend configuring SysRq on all our servers.
First, SysRq has to be enabled:
$ cat /proc/sys/kernel/sysrq
1
Then let the kernel panic:
$ echo c > /proc/sysrq-trigger
We then obtained the vmcore in the /var/crash directory.
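To keep SysRq enabled across reboots, a small sketch:
$ echo "kernel.sysrq = 1" | sudo tee -a /etc/sysctl.conf
$ sudo sysctl -p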
Analysis of the scene information
The vmcore can be analyzed with the crash command; because this vmcore is not in ELF format, tools such as gdb cannot be used to analyze it.
Below is part of the key information from the analysis of this vmcore:
$ crash /usr/lib/debug/lib/modules/2.6.32-431.el6.x86_64/vmlinux vmcore
crash> bt -a
PID: 8400   TASK: ffff880ac686b500  CPU: 0   COMMAND: "crond"
...
 #6 [ffff88106ed1d668] _spin_lock at ffffffff8152a311
 #7 [ffff88106ed1d670] shrink_inactive_list at ffffffff81139f80
 #8 [ffff88106ed1d820] shrink_mem_cgroup_zone at ffffffff8113a7ae
 #9 [ffff88106ed1d8f0] shrink_zone at ffffffff8113aa73
#10 [ffff88106ed1d960] zone_reclaim at ffffffff8113b661
...
PID: 8355   TASK: ffff880e67cf2aa0  CPU: 12  COMMAND: "java"
...
 #6 [ffff88106ed49598] _spin_lock_irq at ffffffff8152a235
 #7 [ffff88106ed495a0] shrink_inactive_list at ffffffff8113a0c5
 #8 [ffff88106ed49750] shrink_mem_cgroup_zone at ffffffff8113a7ae
 #9 [ffff88106ed49820] shrink_zone at ffffffff8113aa73
#10 [ffff88106ed49890] zone_reclaim at ffffffff8113b661
...
PID: 4106   TASK: ffff880103f39540  CPU: 15  COMMAND: "sshd"
 #6 [ffff880103e713b8] _spin_lock at ffffffff8152a311
 #7 [ffff880103e713c0] shrink_inactive_list at ffffffff81139f80
 #8 [ffff880103e71570] shrink_mem_cgroup_zone at ffffffff8113a7ae
 #9 [ffff880103e71640] shrink_zone at ffffffff8113aa73
#10 [ffff880103e716b0] zone_reclaim at ffffffff8113b661
...
PID: 19615  TASK: ffff880ed279e080  CPU: 16  COMMAND: "dnsmasq"
...
 #6 [ffff880ac68195a8] shrink_inactive_list at ffffffff81139daf
 #7 [ffff880ac6819750] shrink_mem_cgroup_zone at ffffffff8113a7ae
 #8 [ffff880ac6819820] shrink_zone at ffffffff8113aa73
 #9 [ffff880ac6819890] zone_reclaim at ffffffff8113b661
...
PID: 8356   TASK: ffff880ed267c040  CPU: 17  COMMAND: "java"
 #6 [ffff88106ed4b5d8] _spin_lock at ffffffff8152a30e
 #7 [ffff88106ed4b5e0] shrink_inactive_list at ffffffff81139f80
 #8 [ffff88106ed4b790] shrink_mem_cgroup_zone at ffffffff8113a7ae
 #9 [ffff88106ed4b860] shrink_zone at ffffffff8113aa73
#10 [ffff88106ed4b8d0] zone_reclaim at ffffffff8113b661
...
The general picture is that every thread that needs to allocate memory is calling zone_reclaim to reclaim page cache from the inactive list itself; this is what is called direct reclaim.
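Some other standard crash sub-commands that help in this kind of analysis, as a sketch of a typical session:
crash> kmem -i        # overall memory usage: free, cached, slab, ...
crash> ps | grep UN   # threads in the uninterruptible (D) state
crash> runq           # per-CPU run queues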
A brief summary of this information follows.
There are 24 CPUs in total; let's look at the state of each CPU at this moment:
CPU | Running program | Running function
  0 | crond           | _spin_lock (&zone->lru_lock)
  1 | bash            | _spin_lock (&zone->lru_lock)
  2 | crond           | _spin_lock (&zone->lru_lock)
  3 | bash            | _spin_lock (&zone->lru_lock)
  4 | swapper         | idle
  5 | java            | _spin_lock (&zone->lru_lock)
  6 | bash            | sysrq
  7 | crond           | _spin_lock (&zone->lru_lock)
  8 | swapper         | idle
  9 | swapper         | idle
 10 | swapper         | idle
 11 | swapper         | idle
 12 | java            | _spin_lock (&zone->lru_lock)
 13 | sh              | _spin_lock (&zone->lru_lock)
 14 | bash            | _spin_lock (&zone->lru_lock)
 15 | sshd            | _spin_lock (&zone->lru_lock)
 16 | dnsmasq         | shrink_inactive_list
 17 | java            | _spin_lock (&zone->lru_lock)
 18 | lldpd           | _spin_lock_irq (&zone->lru_lock)
 19 | swapper         | idle
 20 | sendmail        | _spin_lock (&zone->lru_lock)
 21 | swapper         | idle
 22 | swapper         | idle
 23 | swapper         | idle
From this table we can see that all the threads requesting memory are waiting on the zone->lru_lock spinlock, which is currently held by the dnsmasq thread on CPU 16, which in turn is working hard to reclaim page cache onto the free list. So every thread that requests memory from this zone has to wait here, and thus the load value is high. The external symptom is that the system responds slowly: ssh cannot get in (because sshd also has to request memory), and even when logged in, typed commands barely respond (because those commands also need to request memory).
The knowledge behind the problem: page cache
The cause of the problem is that when a thread requests memory, it finds there are not enough free pages on the zone's free list, so it has to reclaim inactive pages from the zone's LRU list itself; this is direct reclaim. The reason direct reclaim is time-consuming is that it does not distinguish between dirty pages and clean pages: if it reclaims a dirty page, it triggers disk I/O, first writing the dirty page's contents to disk and only then putting the page on the free list.
Let's first look at the relationship among memory, the page cache, and disk I/O.
For example, when we open a file without the O_DIRECT flag, that is buffered file I/O: all access to the file on disk goes through memory, and memory caches that portion of the data. If the O_DIRECT flag is used, that is direct I/O: it bypasses memory and accesses the disk directly, the data accessed this way is not cached, and performance is naturally much lower for repeated access.
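You can feel the difference from the shell; a quick sketch (testfile is a hypothetical file, and iflag=direct requires a filesystem and kernel that support O_DIRECT):
$ dd if=testfile of=/dev/null bs=1M               # buffered read: goes through the page cache
$ dd if=testfile of=/dev/null bs=1M iflag=direct  # direct read: bypasses the page cache
Run the buffered read twice: the second run is served from the page cache and finishes much faster.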
Page Reclaim
Intuitively, we have the perception that a file we read now will be cached in memory, but if we do not access it for the next month (and do not shut down or restart the machine in that month), then after a month the file should no longer be in memory. This is the kernel's management policy for the page cache: LRU (least recently used). That is, the least recently used page cache pages are reclaimed into free pages first.
There are two page reclaim mechanisms in the kernel: background reclaim and direct reclaim.
Background reclaim is done by a kernel thread, kswapd. When the free pages in memory drop below a watermark (pages_low), this kernel thread is woken up, and it reclaims page cache from the LRU list into memory's free list, continuing until the free pages reach another watermark, pages_high.
Direct reclaim happens when a page fault finds there is not enough free memory available, so the faulting thread simply reclaims memory itself, 32 pages at a time.
So, we should avoid direct reclaim.
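The per-zone watermarks can be inspected directly; a sketch (the layout is that of /proc/zoneinfo, the numbers here are only illustrative, and the unit is pages):
$ grep -A4 "zone   Normal" /proc/zoneinfo
Node 0, zone   Normal
  pages free     2055687
        min      11345
        low      14181
        high     17017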
Memory Zone
On multi-core NUMA systems, memory is organized into nodes. Different CPUs access different memory nodes at different speeds, so a CPU preferentially uses the memory node close to itself (that is, the relatively faster memory region).
Within a node, the kernel divides memory into different zones according to the memory's properties. On 64-bit systems (that is, the systems we are using now), a memory node contains three zones: Normal, DMA, and DMA32. On 32-bit systems, a memory node contains Normal, HighMem, and DMA. HighMem exists to work around insufficient linear address space; on 64-bit systems the linear address space is large enough, so that zone does not exist.
Different zones also serve the principle of data locality. We know when writing code that placing related data together improves performance, and memory zones follow the same idea, so the kernel tries as much as possible to satisfy a process's allocations from the same zone. Everything has pros and cons, though, and there is a good chance this preference can bring some harm, as we will see below.
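The zones on each node, and how much free memory each has, can be seen from the shell; a sketch (each column is the count of free blocks of increasing order in the buddy allocator, and the numbers here are only illustrative):
$ cat /proc/buddyinfo
Node 0, zone      DMA      2      1      2      1      1 ...
Node 0, zone    DMA32    724    381    235    104     57 ...
Node 0, zone   Normal   1602    924    409    217     95 ...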
Root cause known: fix the problem
To avoid direct reclaim, we have to ensure that there are enough free pages when a process requests memory. From the background above, we can see that raising the low watermark wakes kswapd earlier, letting kswapd do background reclaim sooner. For this, the kernel provides a dedicated sysctl interface for users: vm.extra_free_kbytes.
So we increased this value (for example, to 5 GB, hohoho), and it did solve the problem. Increasing this value raises the low watermark, so that when memory is requested and free memory drops below the watermark, kswapd is woken up to do page reclaim; and because enough free memory remains available, processes can allocate normally without triggering direct reclaim.
PS: extra_free_kbytes is an interface exported by CentOS 6 (kernel 2.6.32). It was never merged into the mainline kernel; it is a patch Red Hat carried on their own. CentOS 7 (kernel 3.10) removed this interface and relies on the dirty ratio to trigger writeback instead, because the direct reason direct reclaim takes so long is that it reclaims dirty pages. CentOS 7's approach is therefore more reasonable; to some extent, extra_free_kbytes also wastes some memory.
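The change we made, as a sketch (vm.extra_free_kbytes only exists on kernels carrying the Red Hat patch; 5 GB = 5 * 1024 * 1024 kB):
$ sudo sysctl -w vm.extra_free_kbytes=5242880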
However, this alone was not enough.
From the kernel call traces in the dump above, we can also see that a thread's reclaim is tied to a memory zone. That is, the free pages inside the Normal zone were not enough, so direct reclaim was triggered. But what if there are enough free pages in the DMA zone at that moment? Will the thread request memory from the DMA zone?
Let's keep looking, at another problem.
Just a few days after that problem was solved, the search team's Solr servers ran into a similar problem, but it differed a bit from the previous one. Let's first look at their free memory:
$ free -m
             total       used       free     shared    buffers     cached
Mem:         64391      62805       1586          0        230      27665
-/+ buffers/cache:      34909      29482
Swap:        15999          0      15999
As we can see, it had plenty of free pages at this time: more than 1 GB.
Let's also see how many dirty pages it had:
$ cat /proc/vmstat
nr_free_pages 422123
nr_inactive_anon 1039139
nr_active_anon 7414340
nr_inactive_file 3827150
nr_active_file 3295801
...
nr_dirty 4846
nr_writeback 0
...
Meanwhile, its dirty page count was also quite high, at 4846.
This is not the same as the previous problem: there, free pages were scarce; here, there are plenty of free pages, and plenty of dirty pages as well.
This is exactly the memory zone question we raised above: the free pages are in other zones, so threads go back to reclaim their own zone's page cache instead of using the other zones' free pages. For this, the kernel also provides an interface for users: vm.zone_reclaim_mode. On this machine the value was originally 1
(that is, prefer to reclaim your own zone's page cache and do not request the other zones' free pages). I changed it to 0 (that is, as long as other zones have free pages, request memory from them), and that solved the problem (the system returned to normal right after the setting).
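The change itself, as a sketch:
$ cat /proc/sys/vm/zone_reclaim_mode
1
$ sudo sysctl -w vm.zone_reclaim_mode=0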
Changing this value from 1 to 0 had an immediate effect:
$ free -m
             total       used       free     shared    buffers     cached
Mem:         64391      64062        329          0        233      28921
-/+ buffers/cache:      34907      29484
Swap:        15999          0      15994
You can see that the free pages dropped immediately (they went back to serving as page cache), and the dirty pages also decreased (with nobody competing with the flush threads for the zone->lru_lock, they naturally flush dirty pages quickly too).
Summary & thinking: mechanism and strategy
As can be seen from the problems we discussed above, the Linux kernel provides a variety of mechanisms, and we then select the strategy for using them based on the specific scenario.
Our aim is certainly to improve the system's performance as much as possible without compromising stability; but if we have to choose between stability and performance, there is no doubt that we choose stability.
The variety of Linux mechanisms also brings some distress to upper-layer developers: without a deep understanding of the lower layers, it is difficult to choose a good strategy for using these kernel mechanisms.
However, there is no universal formula for using these mechanisms; it still depends on the specific scenario. Because the search servers do a lot of bulk file operations, the page cache is used very heavily, so we chose the strategy that triggers background reclaim earlier. If your
file operations are infrequent, there is clearly no need to wake the background reclaim thread that early.
Likewise, a file server's demand for the page cache is very large: the more memory is used as page cache, the better the system's overall performance, so we do not need to reserve the DMA zone's memory for data locality;
comparing the two, the performance gain from more page cache certainly outweighs the gain from data locality. And if you do not do a lot of file operations, it may well be better to keep zone_reclaim enabled.
As a lower-layer developer, I hope to recommend reasonable kernel usage strategies to the application-layer development teams. So application development teams are very welcome to come discuss with us when they run into low-level confusion, so that I can understand the application-layer implementation and give more reasonable suggestions.