While testing our in-house modified Redis recently, we found that the child process forked to write an RDB snapshot always hung. Attaching gdb to the child gave the following stack:
```
(gdb) bt
#0  0x0000003f6d4f805e in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x0000003f6d49dcad in _L_lock_2164 () from /lib64/libc.so.6
#2  0x0000003f6d49da67 in __tz_convert () from /lib64/libc.so.6
#3  0x0000000000421004 in redisLogRaw (level=2, msg=0x7fff9f412b50 "[infq_info]: [infq.c:1483] infq persistent dump, suffix:405665, start_index:35637626, ele_count:4") at redis.c:332
#4  0x0000000000421256 in redisLog (level=2, fmt=0x4eedcf "[infq_info]:%s") at redis.c:363
#5  0x000000000043926b in infq_info_log (msg=0x7fff9f413090 "[infq.c:1483] infq persistent dump, suffix:405665, start_index:35637626, ele_count:4") at object.c:816
#6  0x00000000004b5677 in infq_log (level=1, file=0x501465 "infq.c", lineno=1483, fmt=0x5024e0 "infq persistent dump, suffix:%d, start_index:%lld, ele_count:%d") at logging.c:81
#7  0x00000000004b37f8 in dump_push_queue (infq=0x17d07f0) at infq.c:1480
#8  0x00000000004b1566 in infq_dump (infq=0x17d07f0, buf=0x7fff9f413650 "", buf_size=1024, data_size=0x7fff9f413a5c) at infq.c:720
#9  0x00000000004440e6 in rdbSaveObject (rdb=0x7fff9f413c50, o=0x7f8c0b4d0470) at rdb.c:600
#10 0x000000000044429c in rdbSaveKeyValuePair (rdb=0x7fff9f413c50, key=0x7fff9f413b90, val=0x7f8c0b4d0470, expiretime=-1, now=1434687031023) at rdb.c:642
#11 0x0000000000444471 in rdbSaveRio (rdb=0x7fff9f413c50, error=0x7fff9f413c4c) at rdb.c:686
#12 0x0000000000444704 in rdbSave (filename=0x7f8c0b410040 "dump.rdb") at rdb.c:750
#13 0x00000000004449cd in rdbSaveBackground (filename=0x7f8c0b410040 "dump.rdb") at rdb.c:831
#14 0x0000000000422b0e in serverCron (eventLoop=0x7f8c0b45a150, id=0, clientData=0x0) at redis.c:1240
#15 0x000000000041d47e in processTimeEvents (eventLoop=0x7f8c0b45a150) at ae.c:311
#16 0x000000000041d7c0 in aeProcessEvents (eventLoop=0x7f8c0b45a150, flags=3) at ae.c:423
#17 0x000000000041d8de in aeMain (eventLoop=0x7f8c0b45a150) at ae.c:455
#18 0x0000000000429ae3 in main (argc=2, argv=0x7fff9f414168) at redis.c:3843
```
The child is blocked inside redisLog while printing a log line: to build the timestamp for each line, redisLogRaw calls localtime. Here is the relevant glibc code, glibc-2.9/time/localtime.c:
```c
/* Return the `struct tm' representation of *T in local time,
   using *TP to store the result.  */
struct tm *
__localtime_r (t, tp)
     const time_t *t;
     struct tm *tp;
{
  return __tz_convert (t, 1, tp);
}
weak_alias (__localtime_r, localtime_r)

/* Return the `struct tm' representation of *T in local time.  */
struct tm *
localtime (t)
     const time_t *t;
{
  return __tz_convert (t, 1, &_tmbuf);
}
libc_hidden_def (localtime)
```
Both localtime and localtime_r delegate the actual work to __tz_convert. That function lives in glibc-2.9/time/tzset.c:
```c
/* This locks all the state variables in tzfile.c and this file.  */
__libc_lock_define_initialized (static, tzset_lock)

/* Return the `struct tm' representation of *TIMER in the local timezone.
   Use local time if USE_LOCALTIME is nonzero, UTC otherwise.  */
struct tm *
__tz_convert (const time_t *timer, int use_localtime, struct tm *tp)
{
  long int leap_correction;
  int leap_extra_secs;

  if (timer == NULL)
    {
      __set_errno (EINVAL);
      return NULL;
    }

  /* Lock */
  __libc_lock_lock (tzset_lock);

  /* ... conversion logic omitted ... */

  /* Unlock */
  __libc_lock_unlock (tzset_lock);

  return tp;
}
```
__tz_convert serializes its work on tzset_lock, a static global lock. Because of that lock, localtime_r is thread-safe; localtime, however, returns a pointer to the shared static buffer _tmbuf, so it is not thread-safe. Neither function is async-signal-safe, so calling either one from a signal handler risks deadlock. For example, if the program calls localtime_r and a signal arrives after the lock has been taken, a handler that also calls localtime_r will block forever, because the lock is held by the very code the handler interrupted.
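A common way to respect this restriction is to do nothing in the handler except set a flag of type volatile sig_atomic_t, and to do the logging (including the localtime_r call) later from normal context. The sketch below assumes a SIGUSR1 handler; on_signal, install_handler and handle_pending_signal are hypothetical names, not part of Redis or glibc.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Set by the handler, cleared by normal code. Writing a volatile
   sig_atomic_t is one of the few things a handler may safely do. */
static volatile sig_atomic_t got_signal = 0;

/* Async-signal-safe handler: it only records the event. Calling
   localtime_r() here instead could block forever if the interrupted code
   was itself inside localtime_r() holding glibc's tzset_lock. */
static void on_signal(int sig)
{
    (void)sig;
    got_signal = 1;
}

/* Installs on_signal for SIGUSR1. Returns 0 on success, -1 on error. */
int install_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Called from normal context (e.g. an event loop), where localtime_r() is
   allowed. If a signal was pending, writes a timestamped log line into buf
   and returns 1; otherwise returns 0 and leaves buf untouched. */
int handle_pending_signal(char *buf, size_t len)
{
    char ts[16] = "--:--:--";
    struct tm t;
    time_t now;

    if (!got_signal)
        return 0;
    got_signal = 0;

    now = time(NULL);
    if (localtime_r(&now, &t) != NULL)
        strftime(ts, sizeof ts, "%H:%M:%S", &t);
    snprintf(buf, len, "[%s] received SIGUSR1", ts);
    return 1;
}
```

The handler never touches any libc function that takes an internal lock; the expensive, lock-taking work is deferred to a point where the program is not in the middle of an interrupted critical section.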
Why doesn't this localtime deadlock happen in stock Redis?
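Because stock Redis does not call localtime from multiple threads: logging, and therefore localtime, happens in the main thread, which is the same thread that calls fork. By the time that thread forks the child, any localtime call it made has already returned, so the lock has been both acquired and released.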
Our modified Redis, however, is multi-threaded, and its other threads also print logs through redisLog. When the main thread forks the child, another thread may be in the middle of a localtime call, holding the lock but not yet having released it. The child gets a copy-on-write copy of the parent's memory, including the lock in its locked state, but only the thread that called fork is duplicated. The thread that would have released the lock does not exist in the child, so the lock stays held forever and the child blocks on its first localtime call.
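The failure can be reproduced without glibc internals by replacing tzset_lock with an ordinary mutex. In the sketch below (fake_tz_lock, logger_thread and child_sees_held_lock are hypothetical names), a worker thread holds the mutex, as if it were in the middle of localtime, while the main thread forks. The child uses pthread_mutex_trylock instead of pthread_mutex_lock so that it reports the stuck lock instead of hanging the way the RDB child did. Strictly speaking, POSIX only guarantees async-signal-safe functions in the child of a multi-threaded process, so treat this as a demonstration rather than a pattern to copy.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for glibc's private tzset_lock (hypothetical; the real lock is
   internal to libc and cannot be touched from user code). */
static pthread_mutex_t fake_tz_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t lock_taken;   /* worker -> main: "I hold the lock now" */
static sem_t may_release;  /* main -> worker: "you may unlock now"  */

/* Simulates a logging thread that is inside localtime(), i.e. between
   __libc_lock_lock and __libc_lock_unlock. */
static void *logger_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&fake_tz_lock);
    sem_post(&lock_taken);
    sem_wait(&may_release);
    pthread_mutex_unlock(&fake_tz_lock);
    return NULL;
}

/* Forks while another thread holds the lock. Returns 1 if the child found
   the lock already held (a real lock call would hang there), 0 if the
   child could take it, -1 on error. */
int child_sees_held_lock(void)
{
    pthread_t tid;
    int result = -1;

    if (sem_init(&lock_taken, 0, 0) != 0 || sem_init(&may_release, 0, 0) != 0)
        return -1;
    if (pthread_create(&tid, NULL, logger_thread, NULL) != 0)
        return -1;
    sem_wait(&lock_taken);            /* worker now owns fake_tz_lock */

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: only this thread exists, but the mutex was copied in its
           locked state and nobody is left to unlock it. */
        int rc = pthread_mutex_trylock(&fake_tz_lock);
        _exit(rc == EBUSY ? 1 : 0);
    }
    if (pid > 0) {
        int status;
        if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
            result = WEXITSTATUS(status);
    }

    sem_post(&may_release);           /* let the worker unlock and finish */
    pthread_join(tid, NULL);
    sem_destroy(&lock_taken);
    sem_destroy(&may_release);
    return result;
}
```

In the parent the worker eventually unlocks and everything proceeds normally; only the child is left with a lock that can never be released.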
So what is the solution?
If the lock is one we control, we can call the library function pthread_atfork before creating the child with fork, registering handlers that leave the lock in a consistent state in the child.
```c
#include <pthread.h>

int pthread_atfork(void (*prepare)(void), void (*parent)(void), void (*child)(void));
```
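prepare is called in the parent just before fork creates the child; after fork returns, parent is called in the parent process and child is called in the child process. The usual pattern is for prepare to acquire every lock the program owns, which waits for any thread currently inside a critical section to leave it, and for parent and child to release those locks again. Both processes then continue with every lock in a consistent, unlocked state.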
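For a lock the application owns, the handlers look like the sketch below, which follows the prepare-locks, parent/child-unlock scheme. log_lock, busy_logger and child_gets_clean_lock are hypothetical names. busy_logger holds the lock for about 100 ms while the main thread forks, and the prepare handler makes fork wait until that critical section ends.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <semaphore.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* A lock the application owns, e.g. one guarding a log buffer (hypothetical). */
static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t holding;

/* prepare: runs in the parent before fork(). Taking the lock here waits
   for any thread inside the critical section to leave it. */
static void atfork_prepare(void) { pthread_mutex_lock(&log_lock); }
/* parent / child: run after fork() returns, each in its own process. */
static void atfork_parent(void)  { pthread_mutex_unlock(&log_lock); }
static void atfork_child(void)   { pthread_mutex_unlock(&log_lock); }

static pthread_once_t atfork_once = PTHREAD_ONCE_INIT;
static int atfork_rc = -1;
static void register_handlers(void)
{
    atfork_rc = pthread_atfork(atfork_prepare, atfork_parent, atfork_child);
}

/* Holds log_lock for ~100 ms, so it is still held when fork() is called. */
static void *busy_logger(void *arg)
{
    struct timespec ts = {0, 100L * 1000 * 1000};
    (void)arg;
    pthread_mutex_lock(&log_lock);
    sem_post(&holding);
    nanosleep(&ts, NULL);
    pthread_mutex_unlock(&log_lock);
    return NULL;
}

/* Returns 1 if the child could take log_lock, 0 if it was stuck held,
   -1 on error. */
int child_gets_clean_lock(void)
{
    pthread_t tid;
    int result = -1;

    pthread_once(&atfork_once, register_handlers);
    if (atfork_rc != 0 || sem_init(&holding, 0, 0) != 0)
        return -1;
    if (pthread_create(&tid, NULL, busy_logger, NULL) != 0)
        return -1;
    sem_wait(&holding);               /* busy_logger now owns log_lock */

    pid_t pid = fork();               /* prepare blocks until it unlocks */
    if (pid == 0) {
        int ok = pthread_mutex_trylock(&log_lock) == 0;
        _exit(ok ? 1 : 0);
    }
    if (pid > 0) {
        int status;
        if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
            result = WEXITSTATUS(status);
    }
    pthread_join(tid, NULL);
    sem_destroy(&holding);
    return result;
}
```

Compared with the previous failure mode, the only change is the three registered handlers: fork now briefly waits for the lock instead of copying it mid-critical-section.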
The lock used by localtime is internal to glibc, so we have no way to take or release it, and the method above is not feasible here. Instead we use a compromise: the Redis serverCron timer calls localtime periodically and saves the result in a global variable, and the component's threads, when printing logs, simply read this cached value instead of calling localtime themselves. Because serverCron runs at intervals of at most 10 ms, the timestamp error is small and perfectly acceptable for logging.
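A minimal sketch of this caching scheme, under stated assumptions: update_cached_time is meant to be called from the main thread's timer (serverCron here), which is also the thread that calls fork, and format_cached_time is what logging threads would use instead of localtime. Both names are hypothetical. Packing the time fields into one 64-bit atomic is our own choice for the sketch (the original only says the value is saved in a global variable); it lets readers get a consistent value with a single lock-free load instead of racing on a plain global struct tm.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Local time packed into one word (hypothetical layout):
   year << 26 | month << 22 | day << 17 | hour << 12 | min << 6 | sec.
   Zero until the first update. */
static _Atomic uint64_t cached_tm;

static uint64_t pack_tm(const struct tm *t)
{
    return ((uint64_t)(t->tm_year + 1900) << 26) |
           ((uint64_t)(t->tm_mon + 1) << 22) |
           ((uint64_t)t->tm_mday << 17) |
           ((uint64_t)t->tm_hour << 12) |
           ((uint64_t)t->tm_min << 6) |
           (uint64_t)t->tm_sec;
}

/* Timer side: the only caller of localtime_r(). If no other thread calls
   localtime/localtime_r, tzset_lock cannot be held by another thread when
   this same thread later calls fork(). */
void update_cached_time(time_t now)
{
    struct tm t;
    if (localtime_r(&now, &t) != NULL)
        atomic_store(&cached_tm, pack_tm(&t));
}

/* Logging side: takes no application lock and calls no libc time function.
   Writes "YYYY-MM-DD HH:MM:SS" into buf; returns snprintf's result. */
int format_cached_time(char *buf, size_t len)
{
    uint64_t v = atomic_load(&cached_tm);
    return snprintf(buf, len, "%04u-%02u-%02u %02u:%02u:%02u",
                    (unsigned)(v >> 26),
                    (unsigned)((v >> 22) & 0xF),
                    (unsigned)((v >> 17) & 0x1F),
                    (unsigned)((v >> 12) & 0x1F),
                    (unsigned)((v >> 6) & 0x3F),
                    (unsigned)(v & 0x3F));
}
```

In this outline, serverCron would call update_cached_time(time(NULL)) on each tick, and the log routine would call format_cached_time where it previously formatted the result of localtime.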
To sum up: functions that take a global lock internally, such as localtime, malloc and free, are not async-signal-safe. In addition, when such functions are called from multiple threads, a fork can leave the child process deadlocked. To avoid this, make sure no such lock is held at the moment of fork, either by not calling these functions from multiple threads or by controlling the locked regions yourself.
Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission.
localtime deadlock: forking a child process from a multi-threaded program