Remotely triggering SysRq to obtain the latest dmesg output: a nearly useless solution

Last Update:2015-02-02 Source: Internet

Author: User

Tags dmesg

In the first half of this article, I will tell a story that sounds attractive and reasonable, and then I will explain why this beautiful story does not actually happen. Finally, I will give a summary. In the next article, I will propose a solution of my own.
Part 1: A wonderful story
xtables-addons contains a particularly interesting small module, xt_SYSRQ. Loaded into the kernel as an iptables target, it lets a remote host send SysRq commands to the local machine. This is a powerful capability, and I deployed it in a real product during a project last year. Looking at it again today, however, I found a few shortcomings that are worth improving:
1. The original xt_SYSRQ has no feedback mechanism. You can trigger a SysRq command remotely, but you receive nothing back: neither whether the command succeeded nor the contents of the kernel ring buffer. Sometimes I only want to look at the dmesg output, and sometimes I want a dump of the kernel stacks. I do not always want to simply restart the machine.
2. The authentication mechanism of the original xt_SYSRQ is somewhat crude. The SYSRQ module in xtables-addons implements simple password authentication and uses a sequence-number window to resist password-guessing attacks. However, such protection is easy to break. First, the packets travel over UDP. Second, the kernel is a poor place to perform strong authentication, especially after a panic. It therefore seems better to move authentication outside the kernel.
Those are the two problems I found, but any fix for them should stop at the bare minimum. Why?
Do you really need this method to obtain dmesg output remotely? If the system is still alive, isn't SSH the better tool? As the xtables-addons documentation says, this feature is only meant for when the system is already dead. The whole idea rests on one point: when the system hangs or panics, hardware interrupts may still be serviced, and the Netfilter hooks, which run as softirqs after the hard interrupt, may still run too. Some may ask how a panicked system can still respond to interrupts. The answer is that a panic is less drastic than you think: the kernel is panicking, not dead. The confusion may come from a kernel bug that has corrupted memory or some other state, and continuing to run would have unknown consequences, so the best course is to freeze in place. That is what a panic is. The interrupt-handling logic is self-contained, so interrupts can still be serviced. However, if the memory that interrupt handling depends on has been corrupted, there is no chance at all.
Next, let's look at the security mechanism of remote SYSRQ. In fact, the author's original approach is safe enough. Again, there is no need to extend it and complicate the path: the system is already hung, so processing should be as simple as possible while still meeting the basic security requirements. First, the password is never sent over the network in plain text; only a digest is transmitted. This is a well-worn practice in network authentication. Second, the window mechanism is a neat trick for preventing replay attacks and password guessing. So all that is really needed is a feedback mechanism; problem 2 above is basically not a problem.
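The digest-plus-sequence-number idea described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the field layout (`keys,seqno,salt,digest`), the use of SHA-1, and the names `build_request` and `Verifier` are assumptions made for this example, and a strictly increasing sequence number stands in for the window. Consult the xtables-addons documentation for the real wire format.

```python
import hashlib
import hmac


def build_request(keys: str, seqno: int, salt: str, password: str) -> str:
    """Return 'keys,seqno,salt,digest'. The password itself is never sent."""
    body = f"{keys},{seqno},{salt}"
    digest = hashlib.sha1(f"{body},{password}".encode()).hexdigest()
    return f"{body},{digest}"


class Verifier:
    """Receiver side: checks the digest and rejects stale sequence numbers."""

    def __init__(self, password: str, last_seqno: int = 0):
        self.password = password
        self.last_seqno = last_seqno

    def check(self, request: str) -> bool:
        try:
            keys, seqno_text, salt, digest = request.split(",")
            seqno = int(seqno_text)
        except ValueError:
            return False  # malformed request
        if seqno <= self.last_seqno:
            return False  # replayed or out-of-date request
        expected = build_request(keys, seqno, salt, self.password).rsplit(",", 1)[1]
        if not hmac.compare_digest(expected, digest):
            return False  # wrong password or tampered fields
        self.last_seqno = seqno  # only accepted requests advance the counter
        return True
```

A captured request cannot be replayed because its sequence number is no longer greater than the last accepted one, and the password never appears on the wire.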
When everything is in place and you know exactly what to do and how to do it, you crash the system with SysRq-c and find that everything stops instantly; even NIC interrupts get no response at all... What happened?
Part 2: The reality
In my article on panic and BUG_ON, I once said: "Even after a panic, you can still ping the machine from outside." I have to admit that this is only the best case. In 99% of cases, interrupts stop being serviced. This is not hard to understand. In an operating system, a user-space process is only one kind of execution flow; the others include kernel threads, hard interrupts, bottom halves running in interrupt context, and bottom halves running in thread context. Almost all packets forwarded through the machine, as well as ARP, are handled in interrupt or softirq context. In addition, many softirqs are processed in interrupt context (in irq_exit, after the hard interrupt handler completes). If all these execution flows were allowed to carry on as usual, could it still be called a panic? So the best course is to stop all of them! But it is more complicated than that. If the panic is triggered in interrupt context, softirqs will clearly not be run in irq_exit, because irq_exit only runs pending softirqs when it is leaving the outermost interrupt level, and a CPU that panicked inside an interrupt never leaves it. So even if interrupts are not stopped after the panic, softirqs are not guaranteed to run in interrupt context; it depends on whether the panic itself was triggered in interrupt context.
What happens after a panic is not very complicated. The sequence is as follows:
1. Disable preemption (a no-op if the kernel was built without preemption support);
2. Print the panic message and the stack;
3. Call the kexec logic;
4. Notify the other CPUs to stop all work;
5. Call the panic notifier chain;
6. If a panic timeout is set, wait for that long and then restart the machine;
7. If no panic timeout is set, enter the "blinking" state and stay there forever.
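The sequence above can be written down as a small model. This is only a sketch of the list as given in this article, not kernel code; the function name `panic_sequence` and its parameters are made up for illustration.

```python
from typing import List


def panic_sequence(preempt_enabled: bool, panic_timeout: int) -> List[str]:
    """Return the ordered steps taken after a panic, following the list above.

    preempt_enabled: whether the kernel was built with preemption support.
    panic_timeout:   seconds to wait before restarting; 0 means "not set".
    """
    steps = []
    if preempt_enabled:
        steps.append("disable preemption")
    steps.append("print message and stack")
    steps.append("call kexec logic")
    steps.append("stop other CPUs")
    steps.append("run panic notifier chain")
    if panic_timeout > 0:
        steps.append(f"wait {panic_timeout}s, then restart")
    else:
        steps.append("blink forever")
    return steps
```

For example, `panic_sequence(True, 5)` ends with a restart after five seconds, while `panic_sequence(True, 0)` ends in the blinking state.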
The most important part is step 1, which is also the key to whether the system will go on handling interrupts and softirqs afterwards. Before going into detail, let me first explain why the other CPUs must be told to stop. The panic was triggered on the current CPU, and the kernel data structures shared among the CPUs may already be corrupted, so the other CPUs have to be stopped right away. What does stopping involve? Mainly operations on the external bus and the interrupt controller associated with each CPU. Put simply, they are shut down, which isolates those CPUs from external events as far as possible. This is also the safe practice: if power is cut abruptly, signal levels may be left inconsistent after a warm restart, so an orderly shutdown sequence is always preferable. Is there a more direct approach? Of course!
If we set the machine's restart method to cold restart (which almost all boards support), there is no level inconsistency and the CPUs can simply be cut off at once, so there is no need to follow the orderly shutdown sequence. For efficiency (remember, the system is about to restart anyway), there is no need to disable the interrupt controller either. In that case the CPU keeps responding to interrupts. Whether softirqs are processed depends on whether the panic was triggered outside interrupt context. Because preemption is disabled and no execution flow ever returns to user space, task switching is impossible. So even if softirqs are processed, they run not in the ksoftirqd context but in interrupt context, in irq_exit.
Everything is so miserable!
Specifically, to keep responding to interrupts after a panic, you need to set a kernel boot parameter: reboot=f,c. A forced (cold) restart does not shut down the local CPU's APIC interrupt controller. To keep processing softirqs as well, the panic must not happen in interrupt context, which is like telling someone not to die: it is beyond your control. Even if the panic does not happen in interrupt context, ksoftirqd cannot be scheduled, so any softirq work that cannot be finished in interrupt context (which retries only MAX_SOFTIRQ_RESTART times) is starved completely. Moreover, letting a panicked system carry on with interrupt and softirq processing is extremely wrong: it can corrupt more kernel data structures, damage the disk cache or buffers, or write to a stray address or register, with unpredictable consequences... Everything is so miserable, just because the kernel panicked!
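The conditions described above can be condensed into a toy model. It is a sketch of the reasoning in this article rather than of any kernel interface: the function names are hypothetical, and the value 10 for MAX_SOFTIRQ_RESTART is the one in mainline kernel/softirq.c, which may differ in your tree.

```python
MAX_SOFTIRQ_RESTART = 10  # mainline kernel/softirq.c; an assumption for your kernel


def after_panic(forced_reboot: bool, panic_in_irq_context: bool) -> dict:
    """Which execution flows can still run on the panicked CPU, per the text."""
    # With reboot=f,c the local APIC is left enabled, so hard interrupts arrive.
    hard_irqs = forced_reboot
    # Softirqs run from irq_exit only if the panic happened outside interrupt context.
    softirqs = hard_irqs and not panic_in_irq_context
    # Preemption is disabled and nothing returns to user space: no scheduling.
    return {"hard_irqs": hard_irqs, "softirqs": softirqs, "ksoftirqd": False}


def softirq_passes(passes_needed: int, ksoftirqd_can_run: bool) -> tuple:
    """Return (passes_done, passes_starved) for one burst of softirq work."""
    done = min(passes_needed, MAX_SOFTIRQ_RESTART)
    starved = passes_needed - done
    if ksoftirqd_can_run:
        done, starved = passes_needed, 0  # ksoftirqd picks up the remainder
    return done, starved
```

In this model, a burst that needs 25 passes gets only 10 of them after a panic, because the remainder would normally be handed to ksoftirqd, which can no longer be scheduled.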
Part 3: The correct method
The correct method is not to install an iptables rule and wait for someone to trigger SYSRQ remotely, since that requires interrupts to stay enabled and softirqs to keep being processed... The correct method is to restart the system immediately after the panic, after doing a little cleanup work first. If you configure kexec, so much the better, but if you have no intention of debugging, that is entirely unnecessary. The correct practice is as follows:
1. Set the reboot=f,c parameter at boot, so that the interrupt controller is not disabled after a panic;
2. After the system has booted, run sysctl -w kernel.panic=5 so that the system restarts 5 seconds after a panic;
3. Register a panic notifier that builds an Ethernet frame with the broadcast destination address and broadcasts the contents of the kernel ring buffer (including the stack and other information).
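The frame layout implied by step 3 can be sketched in user space as follows. This only illustrates how the ring buffer contents might be split across broadcast Ethernet frames; a real implementation would live in a kernel panic notifier. The EtherType (0x88B5, an IEEE local experimental value), the 2-byte chunk index, the chunk size, and the name `build_frames` are all assumptions made for this example.

```python
import struct
from typing import List

BROADCAST_MAC = b"\xff" * 6       # destination: every host on the segment
ETHERTYPE_EXPERIMENTAL = 0x88B5   # assumed; IEEE local experimental EtherType
CHUNK = 1400                      # assumed payload size, below a 1500-byte MTU


def build_frames(src_mac: bytes, log: bytes, chunk: int = CHUNK) -> List[bytes]:
    """Split `log` into Ethernet frames addressed to the broadcast MAC.

    Each frame is: 6-byte destination, 6-byte source, 2-byte EtherType,
    2-byte big-endian chunk index, then up to `chunk` bytes of log data.
    No ARP lookup and no destination IP address are needed.
    """
    if len(src_mac) != 6:
        raise ValueError("src_mac must be exactly 6 bytes")
    header = BROADCAST_MAC + src_mac + struct.pack("!H", ETHERTYPE_EXPERIMENTAL)
    frames = []
    for index, offset in enumerate(range(0, len(log), chunk)):
        payload = struct.pack("!H", index) + log[offset:offset + chunk]
        frames.append(header + payload)
    return frames
```

A receiver on the same segment can reassemble the log by sorting frames on the chunk index and concatenating the data after the first 16 bytes of each frame.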
Why broadcast? Because I do not want to send an ARP request and then wait for the ARP reply, which adds a round trip (the ARP reply is processed in a softirq, and since there is no guarantee that the panic was triggered outside interrupt context, there is no guarantee that the softirq will ever run). Besides, whom should I send it to? A destination IP address could certainly be made configurable, but leaving aside the extra configuration, that address lives in memory after all. The more memory you touch, the more likely you are to read corrupted data after a panic, so after a panic you should get by with as little information as possible. Sending packets at all already involves a great deal of machinery, but that cannot be helped.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
