1. Symptom description
On the surface, the man-machine interface (HMI) program hangs and none of its functions work.
This problem appeared many times, and it took a week to understand. Afterwards we wrote up the tracking process and the analysis, and we propose three solutions.
2. Problem tracking
2.1 Tracking Step 1 -- View thread status with the top command
The thread status (CPU usage and so on) shows that thread 848 of hmi has reached 95% CPU usage.
2.2 Tracking Step 2 -- Attach strace to thread 848
The output is shown below. The thread calls futex continuously, and every call fails with EINVAL (invalid argument).
Process 848 attached - interrupt to quit
SYS_333(0x2, 0x80, 0x2, 0, 0x80) = 1620625728 <0.000000>
futex(0x8016dc66, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EINVAL (Invalid argument) <0.000000>
SYS_335(0x2, 0x2, 0x2, 0, 0x80) = 2 <0.001000>
SYS_333(0x2, 0x80, 0x2, 0, 0x80) = 1620625728 <0.001000>
futex(0x8016dc66, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EINVAL (Invalid argument) <0.000000>
SYS_335(0x2, 0x2, 0x2, 0, 0x80) = 2 <0.000000>
SYS_333(0x2, 0x80, 0x2, 0, 0x80) = 1620625728 <0.000000>
futex(0x8016dc66, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EINVAL (Invalid argument) <0.000000>
2.3 Tracking Step 3 -- Look up address 0x8016dc66 in the map file
The map file shows that address 0x8016dc66 corresponds to the global variable gs_util_mutex in glbsem.c, where it is defined with type pthread_mutex_t. In other words, gs_util_mutex is a thread lock.
From the analysis above we draw conclusion 1:
An error occurs when the lock is taken (or released): the futex call made by the locking API does not return successfully.
Given this conclusion, and given that our use of the thread lock follows the specification, we make inference 1:
The cause may be that the memory of gs_util_mutex has been corrupted, or that the library implementation of lock (or unlock) is faulty.
2.4 Tracking Step 4 -- Check whether the gs_util_mutex memory is corrupted
Open the map file and look at which variables sit at the addresses adjacent to 0x8016dc66, then check whether any operation on those variables could overwrite the gs_util_mutex data. The operations on gs_currsemfreefile stand out in the code below (aha, we thought the problem was finally solved). If the condition on row 8 (the gs_debug_sel & gs_log_flow test) is not met, fname is never initialized, yet row 27 (the strcpy into gs_currsemfreefile) still copies the contents of fname. The length of that copy is then unknown, so it may overwrite the gs_util_mutex data. Further careful analysis found no other problems.
st_void gs_free_semx (sd_const st_char *srcfile, st_int srclinenum)
{
    st_char fname[slog_max_fname + 1];    /* not initialized */
    st_int idx;

    if (gs_track)
    {
        if (gs_debug_sel & gs_log_flow)    /* fname is only filled in when this holds */
        {
            if (srcfile != NULL)
                slogtrimfilename (fname, srcfile);
            else
                strcpy (fname, "unknown");
        }
        idx = gs_currsemownerindex;
        --gs_currsemownerindex;
        if (gs_currsemownerindex == -1)
        {
            glbsem_log_cflow0 ("the semaphore should now be free");
        }
        else if (gs_currsemownerindex >= 0 && gs_currsemownerindex < max_sem_nest_track)
        {
            glbsem_log_cflow2 ("file %s, line %d now has the semaphore",
                               gs_currsemownerfile[gs_currsemownerindex],
                               gs_currsemownerline[gs_currsemownerindex]);

            strcpy (gs_currsemfreefile[idx], fname);    /* copies whatever fname holds */
            gs_currsemfreeline[idx] = srclinenum;
            if (strcmpi (gs_currsemfreefile[idx], gs_currsemownerfile[idx]))
            {
                glbsem_log_err2 ("possible problem: %s (%d)", gs_currsemownerfile[idx],
                                 gs_currsemownerline[idx]);
            }
        }
        else if (gs_currsemownerindex >= 0 && gs_currsemownerindex >= max_sem_nest_track)
        {
            glbsem_log_cflow0 ("nested too deep to track");
        }
        else if (gs_currsemownerindex < -1)
        {
            glbsem_log_err0 ("glbsem gs_free_semx error: SEM track index negative");
            gs_currsemownerindex = -1;
        }
    }
    gs_mutex_free (&gs_glb_mutex);
}
2.5 Tracking Step 5 -- Modify and verify
In the code above, change row 3 to "st_char fname[slog_max_fname + 1] = {0};", then recompile and verify. Frustratingly, the problem persists.
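Although this change did not cure the hang, the unbounded strcpy is still worth guarding. The following is a minimal sketch of a bounded copy, not the original library's code: copy_fname_bounded and DEMO_MAX_FNAME are hypothetical names introduced for illustration, with DEMO_MAX_FNAME standing in for slog_max_fname.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define DEMO_MAX_FNAME 16   /* hypothetical; stands in for slog_max_fname */

/* Copies src into dst (capacity dst_size bytes), always NUL-terminating.
   Returns 1 if the whole name fit, 0 if it had to be truncated. */
int copy_fname_bounded(char *dst, size_t dst_size, const char *src)
{
    int n;

    assert(dst != NULL && src != NULL && dst_size > 0);
    n = snprintf(dst, dst_size, "%s", src);   /* never writes past dst_size */
    return n >= 0 && (size_t)n < dst_size;
}
```

With a zero-initialized source buffer and a bounded copy like this, the copy can no longer run past the destination array into a neighbouring variable, whatever the source contains.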
2.6 Tracking Step 6 -- Write simple code to verify the thread lock
We wrote a small verification program that uses the thread lock the same way the original program does. It ran for 24 hours without any problem, which left us baffled. Is it really not a thread lock problem?
2.7 Tracking Step 7 -- Logical reasoning
Start from "futex(0x8016dc66, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EINVAL (Invalid argument)". After reading the futex documentation, related posts, and the corresponding kernel source, we traced the call with these parameters: the second parameter (the operation) selects the wait path, which calls get_futex_key, and get_futex_key contains the following code:
if (unlikely((address % sizeof(u32)) != 0))
    return -EINVAL;
These two lines mean that if the remainder of address divided by 4 is not 0, EINVAL (invalid argument) is returned. Here address is exactly the first parameter of futex, 0x8016dc66 (the address of gs_util_mutex), and 0x8016dc66 is not divisible by 4 (the remainder is 2). So that is the problem!
Conclusion 2: The start address of a thread lock variable must be a multiple of 4.
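The arithmetic behind conclusion 2 can be checked in a few lines. This is a sketch that mirrors the kernel's test in user space; futex_addr_is_aligned and check_lock_address are hypothetical helper names, not part of any library.

```c
#include <assert.h>
#include <stdint.h>

/* Same test as in get_futex_key: the address must be a multiple of
   sizeof(u32), that is, 4 bytes. */
int futex_addr_is_aligned(uintptr_t address)
{
    return (address % sizeof(uint32_t)) == 0;
}

/* Debug guard that could be called before a lock is initialized. */
void check_lock_address(const void *lock)
{
    assert(futex_addr_is_aligned((uintptr_t)lock));
}
```

For the address from the strace output, futex_addr_is_aligned(0x8016dc66) is false because 0x8016dc66 % 4 is 2, while the neighbouring address 0x8016dc64 would pass.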
2.8 Tracking Step 8 -- Reuse the code from Step 6 to verify conclusion 2
The map file of the Step 6 program shows that the start address of its thread lock variable is a multiple of 4. Modify the source and recompile so that the start address is not a multiple of 4, then run the program. It hangs almost immediately.
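The EINVAL itself can also be reproduced in isolation, without any thread lock. The sketch below assumes a Linux system where the raw futex system call is reachable through syscall(); futex_wait_misaligned_errno is a hypothetical name, and the 10 ms timeout is only a safety net in case the call were to block.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Calls FUTEX_WAIT_PRIVATE on an address that is deliberately off by
   2 bytes. Returns the errno set by the kernel, or 0 on success. */
int futex_wait_misaligned_errno(void)
{
    static uint32_t storage[2] = {2, 2};             /* 4-byte aligned backing store */
    unsigned char *misaligned = (unsigned char *)storage + 2;
    struct timespec timeout = {0, 10 * 1000 * 1000}; /* 10 ms */
    long rc;

    assert(((uintptr_t)misaligned % 4u) == 2u);      /* same remainder as 0x8016dc66 */
    errno = 0;
    rc = syscall(SYS_futex, misaligned, FUTEX_WAIT_PRIVATE, 2, &timeout, NULL, 0);
    return rc == -1 ? errno : 0;
}
```

On a kernel that performs the alignment check quoted in Step 7, the function is expected to return EINVAL immediately, matching the strace output.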
3. Principle analysis
3.1 Application background
In the application, the thread lock is created with its type attribute set to PTHREAD_MUTEX_RECURSIVE.
3.2 Failure mechanism
3.2.1 Thread locks are based on futex
The futex manual describes the call as follows:
(NOTE: Refer to http://www.kernel.org/doc/man-pages/online/pages/man2/futex.2.html)
long futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
           u32 __user *uaddr2, u32 val2, u32 val3)
The uaddr argument needs to point to an aligned integer which stores the counter. The operation to execute is passed via the op argument, along with a value val.
3.2.2 Alignment mechanism of thread locks in libc
The thread lock type is defined as follows; it is a union. The member __align is meant to give variables of this type the alignment of long int, which on most 32-bit targets equals sizeof(long int), that is, 4. (NOTE: Refer to libc 2.10.1)
typedef union
{
  struct __pthread_mutex_s
  {
    int __lock;
    unsigned int __count;
    int __owner;
    /* KIND must stay at this position in the structure to maintain
       binary compatibility.  */
    int __kind;
    unsigned int __nusers;
    __extension__ union
    {
      int __spins;
      __pthread_slist_t __list;
    };
  } __data;
  char __size[__SIZEOF_PTHREAD_MUTEX_T];
  long int __align;
} pthread_mutex_t;
m68k uses 2-byte alignment by default, whereas GCC on most common targets aligns long int to 4 bytes, which guarantees that the start address of a thread lock variable is a multiple of 4. When compiling for m68k, the __align member only guarantees a multiple of 2, so the start address of a thread lock variable may not be a multiple of 4.
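One way to catch this at build time is to ask the compiler what alignment it guarantees for the type. The following is a sketch assuming a C11 compiler; mutex_type_alignment and mutex_type_is_futex_safe are hypothetical helper names. On a target with default 2-byte alignment such as the m68k build described here, the static assertion would be expected to fail at compile time, which is the point of including it.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Build-time guard: refuse to compile if the type is not guaranteed
   to start on a 4-byte boundary. */
static_assert(_Alignof(pthread_mutex_t) % sizeof(uint32_t) == 0,
              "pthread_mutex_t is not guaranteed to be 4-byte aligned");

/* Alignment the compiler guarantees for every pthread_mutex_t object. */
size_t mutex_type_alignment(void)
{
    return _Alignof(pthread_mutex_t);
}

/* 1 if that guarantee is sufficient for the kernel's futex check. */
int mutex_type_is_futex_safe(void)
{
    return mutex_type_alignment() % sizeof(uint32_t) == 0;
}
```

The runtime helpers are useful where a hard build failure is not wanted, for example to log a warning at program start.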
3.2.3 Why the parameter error causes high CPU usage
Assume two threads, pthr1 and pthr2, share one lock. pthr1 has taken the lock and not yet released it when the scheduler switches to pthr2. When pthr2 tries to take the lock, it calls futex, which in turn calls futex_wait to block pthr2. futex_wait gets an invalid-argument error from its call to get_futex_key and returns it, but the implementation of pthread_mutex_lock does not handle this return value. It only sees that the lock value has not changed and calls futex again, entering an endless loop of futex calls, so the thread's CPU usage stays high.
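The loop described above can be modelled with a stub in place of the kernel. This is a simplified, single-threaded model and not the libc implementation: fake_futex_wait and model_lock_slow_path are hypothetical names, the exchange on the lock word is deliberately non-atomic, and max_spins exists only so that the demonstration terminates.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

static int fake_futex_calls;   /* how many times the stub was entered */

/* Stub standing in for the kernel: always rejects the address,
   as happens when the lock word is misaligned. */
static int fake_futex_wait(int *uaddr, int expected)
{
    (void)uaddr;
    (void)expected;
    ++fake_futex_calls;
    return -EINVAL;
}

/* Model of a lock slow path that ignores the futex return value.
   Returns the number of futex calls made before acquiring the lock
   or giving up after max_spins calls. */
int model_lock_slow_path(int *lock_word, int max_spins)
{
    assert(lock_word != NULL && max_spins >= 0);
    fake_futex_calls = 0;
    while (fake_futex_calls < max_spins) {
        int old = *lock_word;              /* stand-in for an atomic exchange */
        *lock_word = 2;                    /* 2 = locked, waiters present */
        if (old == 0)
            break;                         /* lock was free: acquired */
        fake_futex_wait(lock_word, 2);     /* error result is ignored */
    }
    return fake_futex_calls;
}
```

With a held lock the model never blocks and simply burns through every allowed iteration, which corresponds to the repeated futex lines in the strace output. With a free lock it makes no futex call at all.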
4. Solutions
1. Obtain the source code of all libraries, such as libc, and build 4-byte-aligned versions yourself, so that the application can also be compiled with 4-byte alignment. This solves this problem as well as alignment problems with semaphores and similar objects. (We strongly recommend this approach, because we do not know how many other alignment-related problems are hidden in our applications.)
2. Wrap the creation and destruction of thread lock variables in our own interface, which uses malloc and free to allocate and release their storage and ensures that the start address is 4-byte aligned.
3. Declare thread lock variables with __attribute__((aligned(4))) to ensure that their start address is 4-byte aligned.
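Solutions 2 and 3 can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the project: aligned_mutex_create, aligned_mutex_destroy, mutex_addr_is_aligned, demo_static_mutex and MUTEX_ALIGN are hypothetical names, posix_memalign is swapped in for plain malloc because it lets the caller request the alignment directly (the memory is still released with free), and the attribute syntax is the GCC/Clang one.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* posix_memalign needs a power of two that is a multiple of sizeof(void *). */
#define MUTEX_ALIGN (sizeof(void *) > 4 ? sizeof(void *) : 4)

int mutex_addr_is_aligned(const pthread_mutex_t *m)
{
    return ((uintptr_t)m % 4u) == 0;
}

/* Solution 2: allocate the lock on an aligned heap address. */
pthread_mutex_t *aligned_mutex_create(void)
{
    void *mem = NULL;
    pthread_mutexattr_t attr;
    pthread_mutex_t *m;
    int rc;

    if (posix_memalign(&mem, MUTEX_ALIGN, sizeof(pthread_mutex_t)) != 0)
        return NULL;
    m = mem;
    assert(mutex_addr_is_aligned(m));
    if (pthread_mutexattr_init(&attr) != 0) {
        free(mem);
        return NULL;
    }
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    if (rc != 0) {
        free(mem);
        return NULL;
    }
    return m;
}

void aligned_mutex_destroy(pthread_mutex_t *m)
{
    if (m != NULL) {
        pthread_mutex_destroy(m);
        free(m);
    }
}

/* Solution 3: request at least 4-byte alignment for a static lock. */
static pthread_mutex_t demo_static_mutex __attribute__((aligned(4))) =
    PTHREAD_MUTEX_INITIALIZER;

pthread_mutex_t *demo_static_mutex_ptr(void)
{
    return &demo_static_mutex;
}
```

Both variants fix only the locks that go through them. Locks embedded in library structures, and other futex-based objects such as semaphores, are not covered, which is why solution 1 is the one recommended above.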