MCE Processing: MCE Learning Notes for Linux

Source: Internet
Author: User
Tags intel pentium

1. What is a machine check?

A machine check is a hardware mechanism for reporting internal errors. There are two kinds: machine check exceptions and silent machine checks.

A machine check exception (MCE) occurs when the hardware cannot correct an internal error. In that case the program currently running on the CPU is interrupted and a special exception handler is called. This situation usually has to be dealt with in software, by the machine check exception handler.

When the hardware is able to correct the internal error by itself, the event is usually called a silent machine check. In this case the hardware records the error information in a special register. The operating system or the firmware (BIOS) can later read this register, then log and analyze the error information to help predict hardware failures in advance.


2. Why machine checks are important

With each chip generation the number of transistors grows and the feature size shrinks, so the probability of hardware errors increases. Handling these errors therefore becomes more and more important.

In addition, it is increasingly common to combine many computers into clusters for high-performance scientific computing. In such a cluster, the probability that some machine suffers a hardware error is higher than for a single ordinary computer, so these errors must be handled properly to ensure reliability.

Machine checks have many possible causes, such as faults in the CPU, the caches, the internal buses, or the memory. They can also be triggered by a software error in a driver.


3. x86 Machine Check Architecture overview

Intel and AMD chips both implement the x86 architecture. Early IBM PCs used parity memory and raised a non-maskable interrupt (NMI) when a memory error occurred. Later machines dropped parity memory but still reported some hardware errors. Basic machine check support was then added to the CPU with the Intel Pentium, and the Machine Check Architecture (MCA) was subsequently introduced. The MCA includes a standard exception (vector 18) and a set of standard MSRs (usually expanded as "model specific register", sometimes as "machine specific register"). These registers allow software to check whether a machine check has occurred, to enable or disable machine checks, and to determine whether an error was corrected or whether it corrupted the state of the CPU.

In addition, the MCA defines further registers that are organized into banks. A bank groups the errors produced by one specific subsystem, such as the CPU core, the bus unit, the caches, or the north bridge. The number and meaning of the banks depend on the specific CPU. Each bank has a number of sub-errors that can be enabled or disabled individually. Typically, a generic machine check handler enables all errors in all banks. A bank also holds the address associated with the error. The advantage of this generic architecture is that a single machine check handler can work on many different CPUs. When a machine check is detected, the kernel reads all machine check registers and determines which banks reported an error.
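As an illustration of the bank layout, the following is a minimal sketch of how the per-bank registers can be addressed. The MSR numbers are the architectural ones documented by Intel (global capability register at 0x179, bank 0 starting at 0x400, four registers per bank); the helper names mc_bank_msr and mcg_cap_bank_count are made up for this example and are not kernel APIs. The sketch only computes register numbers and does not read any MSR.

```c
#include <stdint.h>

/* Architectural MSR numbers as documented by Intel for the MCA. */
#define MSR_IA32_MCG_CAP    0x179u  /* bits 7:0 hold the number of banks */
#define MSR_IA32_MCG_STATUS 0x17Au  /* global machine check status */
#define MSR_IA32_MC0_CTL    0x400u  /* first register of bank 0 */

/* The four registers of a bank, in MSR order. */
enum mc_bank_reg { MC_CTL = 0, MC_STATUS = 1, MC_ADDR = 2, MC_MISC = 3 };

/* Hypothetical helper: each bank occupies four consecutive MSRs. */
static inline uint32_t mc_bank_msr(unsigned bank, enum mc_bank_reg reg)
{
    return MSR_IA32_MC0_CTL + 4u * bank + (uint32_t)reg;
}

/* Hypothetical helper: extract the bank count from a raw MCG_CAP value. */
static inline unsigned mcg_cap_bank_count(uint64_t mcg_cap)
{
    return (unsigned)(mcg_cap & 0xFFu);
}
```

A generic handler would loop from bank 0 to the count reported by MCG_CAP and read the status register of each bank in turn.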

Decoding and interpreting a specific error depends on the particular CPU and is left to the user. Some generic handling is still possible. For example, when a bank register contains a valid error address, the handler assumes that the error occurred at that memory location. The handler then acts according to whether the error was corrected and whether it corrupted the CPU context.
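The generic part of this decision can be sketched from the flag bits of a bank status register. The bit positions below follow the architectural layout of the bank status registers (valid, uncorrected, address valid, processor context corrupt); the enum mc_severity and the helper functions are illustrative names chosen for this sketch, not the kernel's own interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural flag bits of a bank status register. */
#define MCI_STATUS_VAL   (1ULL << 63)  /* register holds a valid error */
#define MCI_STATUS_UC    (1ULL << 61)  /* error was not corrected */
#define MCI_STATUS_ADDRV (1ULL << 58)  /* the bank's address register is valid */
#define MCI_STATUS_PCC   (1ULL << 57)  /* processor context corrupt */

/* Hypothetical severity classes for this sketch. */
enum mc_severity { MC_NONE, MC_CORRECTED, MC_UNCORRECTED, MC_CONTEXT_CORRUPT };

/* Classify one bank status value, most severe condition first. */
static inline enum mc_severity mc_classify(uint64_t status)
{
    if (!(status & MCI_STATUS_VAL))
        return MC_NONE;
    if (status & MCI_STATUS_PCC)
        return MC_CONTEXT_CORRUPT;
    if (status & MCI_STATUS_UC)
        return MC_UNCORRECTED;
    return MC_CORRECTED;
}

/* True only if the bank reports a valid error with a usable address. */
static inline bool mc_has_address(uint64_t status)
{
    return (status & MCI_STATUS_VAL) && (status & MCI_STATUS_ADDRV);
}
```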

4. Why it is difficult to write a machine check handler function

The main reason is that the handler cannot use the ordinary kernel services. Kernel code runs either in process context or in interrupt context, and interrupt context may do less than process context. A function called from interrupt context must protect its data structures against concurrent access from multiple interrupts. Such functions are called "interrupt-safe".

However, a machine check exception can occur at any time, even inside critical sections that run with all interrupts disabled. If the machine check exception handler calls one of these interrupt-safe functions in that situation, it may deadlock on a spinlock.

To keep the code simple, the silent machine check handler and the machine check exception handler share the same code path, so the problem discussed above also applies to the silent machine check handler.

Likewise, it is important to handle a machine check as quickly as possible, because after a hardware error the state of the machine may become increasingly unstable. If the handler waits for the machine to reach a more manageable state, the event may become impossible to handle. For example, while the handler waits, another error may occur on the same bank and overwrite the previous one, which then can no longer be handled.

For some complex RAM errors, the handler has no alternative but to wait, because it needs to synchronize with kernel locks. Unlike other exceptions, machine checks are asynchronous: the CPU does not report the error at the point where it occurred, but possibly hundreds of clock cycles later, which makes the handling less reliable.

5. Logging machine checks

Traditionally, machine checks are logged by the firmware (i.e. the BIOS): when the operating system has no machine check handler, the machine check registers are not cleared. After the next warm boot, the BIOS finds the information left by the last machine check and writes it to a log. This approach has obvious drawbacks: errors are logged only when the machine reboots, multiple errors on the same bank cannot be recorded, and it is difficult to collect the information over the network or to write the log to disk.

Therefore, the best approach is to hand the logging task over to the operating system. However, most Linux users currently work in X, so the console is not visible. When the operating system logs a fatal machine check, the X display simply appears frozen and does not respond to the user. To solve this problem, the fatal machine check is logged again at the next reboot. It can then also be written to a log file on disk, which makes later analysis by support staff possible.

The machine check log needs to be kept separate from the software error log, because users may not be able to tell the two kinds of error apart. Experience has shown that it is best to separate the two log files completely.

6. The rewritten x86-64 handler

The x86-64 machine check handler in the original Linux 2.4 kernel was inherited from the i386 version. It was later found to contain several bugs and design errors, so the x86-64 handler was rewritten for the Linux 2.5 kernel. The rewrite follows the Intel and AMD guidelines for machine check handlers. It contains no CPU-specific code and is written entirely against the generic x86 Machine Check Architecture. In addition, the rewritten code distinguishes uncorrectable errors from errors that corrupt the CPU state. In the first case it kills the affected process when it is safe to do so, rather than panicking the system; the previous handler panicked in both cases. However, if the process is in kernel mode and holds a lock, killing it would deadlock the system. A deadlock is harder to deal with than a panic, so the kernel chooses to panic when a machine check hits a process running in kernel mode.
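The policy described in this paragraph can be summarized in a few lines. This is only a sketch of the decision as the text states it, not the kernel's code; struct mc_event, enum mc_action and mc_decide are names invented for the example, and the configurable tolerance level is left out.

```c
#include <stdbool.h>

/* Hypothetical outcome of handling one machine check. */
enum mc_action { MC_ACT_LOG, MC_ACT_KILL, MC_ACT_PANIC };

/* Hypothetical summary of what the handler knows about the event. */
struct mc_event {
    bool uncorrected;      /* hardware could not correct the error */
    bool context_corrupt;  /* CPU state can no longer be trusted */
    bool in_kernel;        /* interrupted code was running in kernel mode */
};

static inline enum mc_action mc_decide(const struct mc_event *ev)
{
    if (ev->context_corrupt)
        return MC_ACT_PANIC;   /* nothing can be trusted any more */
    if (!ev->uncorrected)
        return MC_ACT_LOG;     /* corrected error: just record it */
    /* Uncorrected: killing a kernel-mode process could deadlock on a
     * held lock, so prefer a panic there and kill only in user mode. */
    return ev->in_kernel ? MC_ACT_PANIC : MC_ACT_KILL;
}
```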

The new handler introduces a lock-free binary logging system that is completely separate from the printk log. It records machine checks in a buffer; once the buffer is full, subsequent records are discarded. The buffer can be read from user space through the character device /dev/mcelog. The user-space application mcelog reads and decodes this device at regular intervals.
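To show what "lock-free" and "discard when full" can mean in practice, here is a small user-space sketch using C11 atomics. It is an assumption about the general technique, not a copy of the kernel code: a writer reserves a slot with a compare-and-swap on the next-free index, fills it, and only then marks it as finished so that a reader never sees a half-written entry. The names mc_log, mc_record, mc_log_add and the buffer length of 32 are chosen for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MCLOG_LEN 32  /* illustrative buffer size */

struct mc_record {
    uint64_t status;
    uint64_t addr;
    atomic_bool finished;  /* set last: entry is complete and readable */
};

struct mc_log {
    atomic_uint next;                   /* index of the next free slot */
    struct mc_record entry[MCLOG_LEN];
};

/* Append a record without taking a lock. Returns false if the buffer
 * is full and the record was dropped. */
static inline bool mc_log_add(struct mc_log *buf, uint64_t status, uint64_t addr)
{
    unsigned slot = atomic_load(&buf->next);

    do {
        if (slot >= MCLOG_LEN)
            return false;  /* full: discard, never wait */
        /* On failure the CAS reloads slot with the current index. */
    } while (!atomic_compare_exchange_weak(&buf->next, &slot, slot + 1));

    buf->entry[slot].status = status;
    buf->entry[slot].addr = addr;
    atomic_store(&buf->entry[slot].finished, true);  /* publish the entry */
    return true;
}
```

Because the writer never blocks, it can be called from a context where taking a spinlock might deadlock; the price is that records arriving after the buffer fills up are lost until a reader drains it.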

When a fatal machine check is encountered, the error is read by the BIOS or the kernel after a warm reboot. Other, silent machine checks are collected by mcelog at regular intervals and written to a special log file.

The mce structure is as follows:

/* A machine check record */
struct mce {
    __u64 status;     /* Bank status register */
    __u64 misc;       /* Misc register (always 0 right now) */
    __u64 addr;       /* Address or 0 */
    __u64 mcgstatus;  /* Global MC status register */
    __u64 rip;        /* Program counter or 0 for silent error */
    __u64 tsc;        /* CPU time stamp counter */
    __u64 res1;       /* For future extension */
    __u64 res2;       /* Ditto */
    __u8  cs;         /* Code segment */
    __u8  bank;       /* Machine check bank */
    __u8  cpu;        /* CPU that raised the error */
    __u8  finished;   /* Entry is valid */
    __u32 pad;
};
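For experimenting with such records in user space, the layout can be mirrored with fixed-width types. The following is a sketch under that assumption: struct mce_record and mce_format are illustrative names, the field order is taken from the structure above, and the one-line output format is made up for this example rather than being what the mcelog tool prints.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* User-space mirror of the record layout shown above. */
struct mce_record {
    uint64_t status, misc, addr, mcgstatus, rip, tsc, res1, res2;
    uint8_t  cs, bank, cpu, finished;
    uint32_t pad;
};

/* Eight 64-bit fields, four bytes, one 32-bit pad: 72 bytes, no holes. */
_Static_assert(sizeof(struct mce_record) == 72, "unexpected record padding");

/* Format one record into out; returns the snprintf result. */
static inline int mce_format(const struct mce_record *m, char *out, size_t len)
{
    return snprintf(out, len,
                    "CPU %u BANK %u STATUS %016" PRIx64 " ADDR %" PRIx64,
                    (unsigned)m->cpu, (unsigned)m->bank, m->status, m->addr);
}
```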

7. Configuring the new x86-64 handler

The new handler can be configured at run time by reading or writing the configuration files under /sys/devices/system/machinecheck/machinecheck0/. The valid fields are:

(1) tolerant: the tolerance level. The higher the level, the more risk the machine check handler takes to keep the machine running. The valid levels are:

0  always panic when an uncorrected error occurs
1  panic if a deadlock could occur
2  risk a small chance of deadlock rather than panic
3  never panic or exit (for testing only)

Specifying oops=panic on the kernel command line implies a tolerance level of 0. For a clustered computer, setting tolerant to 0 is probably best; setting panic=10 at the same time forces the machine to reboot after a panic.

(2) check_interval: the interval, in seconds, at which the kernel polls for silent machine checks. The default is 5 minutes; a value of 0 disables the background check.

(3) bank0ctl ... bankNctl: the bit mask of errors for bank N. By default all errors in the bank are enabled. A disabled error is ignored.
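The bank masks can be pictured with a small sketch. The directory and the bankNctl file naming come from the text above; the helper names, and the assumption that a mask is handled as a 64-bit value with one bit per sub-error, are made up for this example. The sketch only builds the file path and manipulates a mask in memory; it does not touch sysfs.

```c
#include <stdint.h>
#include <stdio.h>

#define MCE_SYSFS_DIR "/sys/devices/system/machinecheck/machinecheck0"

/* All bits set: every sub-error of the bank is enabled (the default). */
#define MCE_BANK_ALL_ENABLED UINT64_MAX

/* Hypothetical helper: path of the control file for one bank. */
static inline int mce_bank_ctl_path(unsigned bank, char *out, size_t len)
{
    return snprintf(out, len, "%s/bank%uctl", MCE_SYSFS_DIR, bank);
}

/* Hypothetical helper: clear one sub-error bit (bit must be below 64). */
static inline uint64_t mce_bank_disable(uint64_t mask, unsigned bit)
{
    return mask & ~(1ULL << bit);
}

/* Hypothetical helper: test whether a sub-error bit is enabled. */
static inline int mce_bank_is_enabled(uint64_t mask, unsigned bit)
{
    return (int)((mask >> bit) & 1u);
}
```

To ignore a single sub-error, one would compute the new mask with mce_bank_disable and write it to the file named by mce_bank_ctl_path; the exact text format the file expects should be checked against the kernel documentation for the running kernel.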

8. Future work: new RAM/cache error handling

RAM errors are one of the most common sources of machine check events, but because the memory controller and the CPU run asynchronously, the error report can be inaccurate. The handler assumes that the error belongs to the process that was active when the exception was raised, and it uses whether that process was in kernel or user mode to decide between killing the process and panicking. Because the report can be delayed, this information may already be stale, for example when a system call or a context switch has happened in between. A more reliable approach is to take the physical error address provided in the MCn_ADDR register and use the VM data structures to find the process that owns this address. The memory may be shared by multiple processes.
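The first step of such a lookup is to turn the reported physical address into a page frame number, which is the index that per-page data structures are keyed by. The sketch below assumes 4 KiB pages (a page shift of 12), which is the usual base page size on x86; the function names are illustrative.

```c
#include <stdint.h>

#define PAGE_SHIFT 12  /* assumption: 4 KiB pages */

/* Page frame number containing the physical address. */
static inline uint64_t phys_to_pfn(uint64_t paddr)
{
    return paddr >> PAGE_SHIFT;
}

/* Byte offset of the address within its page. */
static inline uint64_t phys_page_offset(uint64_t paddr)
{
    return paddr & ((1ULL << PAGE_SHIFT) - 1);
}
```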

The handler first needs to synchronize with the state of the process, because the VM locks are not safe to take from the machine check handler. It should first get into an interrupt context by raising a self-interrupt on the same CPU. This interrupt may be delayed until local interrupts are next enabled, and it puts the handler into a standard interrupt context. The interrupt handler can then queue a work queue item that runs a callback function in the events process on the local CPU.

This callback function can use the mem_map and rmap data structures provided by Linux 2.6 to look up the owner of the faulty page. There are a number of scenarios in the kernel page cache:


This method can also handle uncorrectable cache errors. In the future, an application could be allowed to respond to the machine check itself: instead of killing it unconditionally, the kernel would send it a signal that carries the faulty address as payload. The program can then decide what to do with the corrupted memory. For example, a database server with a data cache could discard the corrupted cache page and reread it from its copy on disk.
