Symptom
We implemented a library named libilvrfplugin.so, which links against libiubsntconflib.so, which in turn links against libipconflib.so. libipconflib.so implements a function, check_vrf_r(), that checks the validity of a VRF.
To put it simply: lib A links lib B, and lib B links lib C. Lib C implements the check_vrf_r() function.
In some scenarios the system loads lib A dynamically, but lib A does not use check_vrf_r(). Note that lib A is loaded at run time, that is, its user calls dlopen() to load it.
In the new version, we found that lib A was not being used, as if it did not exist at all.
We can find the following log in Syslog:
Apr 17 15:46:27.718075 info CFPU-0 validator: CPluginManager : Unable to load plugin libilvrfplugin.so Error:/opt/nokiasiemens/lib64/libiubsntconflib.so: undefined symbol: check_vrf_r
From the syslog we can see that libilvrfplugin.so (lib A) fails to load dynamically because the symbol check_vrf_r cannot be found.
Question 1: These libraries worked normally in the previous version and have not changed in the current one. Why does the error appear now?
Question 2: libilvrfplugin.so never uses the check_vrf_r symbol. Why does the loader complain that it cannot be found?
Tracking Down the Problem
Locate the code that produces the error, based on the clue in syslog:
...
lib_handle = dlopen(ep[count]->d_name, RTLD_LAZY);
if (!lib_handle) {
    error = dlerror();
    TRACER(TRC_INFO) << "CPluginManager : Unable to load plugin "
                     << ep[count]->d_name << " Error:" << error << std::endl;
    free(ep[count]);
    continue;
} // if
...
Here dlopen() loads the shared library with the flag RTLD_LAZY. The flag controls how symbols are resolved when dlopen() loads a library: with RTLD_LAZY, undefined function symbols are not resolved at load time but only when they are first called. In this example the symbol that cannot be resolved is check_vrf_r (library A only sees it through the corresponding include).
By the way, dlopen() also has the mode RTLD_NOW, which requires every undefined symbol to be resolved before dlopen() returns, whether or not the symbol is ever used.
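The difference between the two modes can be checked directly on a suspect library. The following is a minimal sketch, not code from the system described here; try_load is a hypothetical helper name. It attempts to load a library with a given mode and reports the dlerror() text on failure. Keep in mind that lazy binding only applies to function references; references to variables are always bound when the library is loaded.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper: try to load `path` with the given binding mode
 * (RTLD_LAZY or RTLD_NOW). Returns 0 on success and -1 on failure.
 * On success the handle is closed again immediately. */
int try_load(const char *path, int mode)
{
    void *handle;

    dlerror();                          /* clear any stale error state */
    handle = dlopen(path, mode);
    if (handle == NULL) {
        const char *msg = dlerror();    /* e.g. "...: undefined symbol: ..." */
        fprintf(stderr, "dlopen(%s, %s) failed: %s\n", path,
                (mode & RTLD_NOW) ? "RTLD_NOW" : "RTLD_LAZY",
                msg ? msg : "unknown error");
        return -1;
    }
    dlclose(handle);
    return 0;
}
```

Calling try_load() on the plugin path once with RTLD_LAZY and once with RTLD_NOW shows whether the library loads only as long as unresolved function symbols are tolerated. On older glibc versions the program may need to be linked with -ldl.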
In this example, however, the plugin manager passes the right mode to dlopen(). We do not expect check_vrf_r to be resolved at load time, so why is it being resolved anyway?
One possibility is that glibc's implementation changed and introduced some obscure bug; another is that something replaced the implementation of dlopen(). Incidentally, because C has no namespaces, you can define a function with the same name as a system function and have it override the system one. This should be avoided in most cases.
After confirming that glibc had not changed between the two versions we released, the likely explanation was that dlopen() had been replaced or interfered with. As a last resort, we could only go through all the code changes in our version.
To our surprise, we found that the following line had been added to a script:
export LD_PRELOAD=/opt/nokiasiemens/SS_FConfigure/lib/libdlopeninterceptor.so
Judging by the name, it has something to do with dlopen. "dlopen hijacking" is quite an imposing name! Next, let's see what this library does.
Here the environment variable LD_PRELOAD is exported. It names shared libraries that the dynamic linker loads before any other library the application needs. In other words, if such a library implements a function with the same name as a system function, its version overrides the system one.
Full of anticipation, we checked the implementation of this library:
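One way to confirm from inside a process whether a function such as dlopen() has been overridden is to ask the dynamic linker which shared object provides it. The sketch below is an illustration under assumptions, not part of the original investigation: symbol_provider and report_dlopen_provider are hypothetical names, and the code relies on the glibc extensions dladdr() and RTLD_DEFAULT.

```c
#define _GNU_SOURCE             /* for dladdr() and RTLD_DEFAULT on glibc */
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: return the path of the shared object that provides
 * the first definition of `symbol` in the global lookup order, or NULL if
 * the symbol cannot be found. */
const char *symbol_provider(const char *symbol)
{
    Dl_info info;
    void *addr = dlsym(RTLD_DEFAULT, symbol);

    if (addr == NULL || dladdr(addr, &info) == 0)
        return NULL;
    return info.dli_fname;
}

/* Print the LD_PRELOAD setting and the object that dlopen() resolves to. */
void report_dlopen_provider(void)
{
    const char *preload = getenv("LD_PRELOAD");
    const char *provider = symbol_provider("dlopen");

    printf("LD_PRELOAD = %s\n", preload ? preload : "(unset)");
    printf("dlopen comes from %s\n", provider ? provider : "(unknown)");
}
```

In a process started without LD_PRELOAD, report_dlopen_provider() should name the C library (or libdl on older glibc). If a preloaded library defines dlopen(), its path is expected to appear instead, because preloaded objects come first in the lookup order.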
#include <dlfcn.h>
#include <syslog.h>
#include <stdlib.h>

#ifdef __cplusplus__
extern "C" {
#endif

typedef void* (*dlopen_func_t)(const char* filename, int flag);
static dlopen_func_t _glibc_dlopen = NULL;

void* dlopen(const char* filename, int flag)
{
    int realflag = flag;
    if (NULL == _glibc_dlopen) {
        _glibc_dlopen = (dlopen_func_t)dlsym(RTLD_NEXT, "dlopen");
        if (NULL == _glibc_dlopen) {
            syslog(LOG_CRIT, "dlopeninterceptor:Failed to resolve dlopen, got error:%s", dlerror());
            return NULL;
        }
    }
    if (realflag & RTLD_LAZY) {
        realflag = realflag & ~RTLD_LAZY;
        realflag = realflag | RTLD_NOW;
        syslog(LOG_DEBUG, "dlopeninterceptor:Changing dlopen flag from to %d to %d when opening %s", flag, realflag, filename);
    }
    return _glibc_dlopen(filename, realflag);
}

#ifdef __cplusplus__
}
#endif
We were surprised to find that this library does indeed override dlopen(). If the flag passed to dlopen() includes RTLD_LAZY, it is forcibly converted to RTLD_NOW. With the root cause found, we could finally relax.
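To make the effect concrete, the flag rewrite done by the interceptor can be isolated as a pure function. This is only a sketch that mirrors the logic in the listing; effective_dlopen_flag is a hypothetical name that does not exist in the real library.

```c
#include <dlfcn.h>

/* Hypothetical mirror of the interceptor's flag rewrite: if RTLD_LAZY is
 * set, clear it and set RTLD_NOW instead. All other bits are untouched. */
int effective_dlopen_flag(int flag)
{
    if (flag & RTLD_LAZY) {
        flag &= ~RTLD_LAZY;
        flag |= RTLD_NOW;
    }
    return flag;
}
```

So the plugin manager's dlopen(..., RTLD_LAZY) effectively becomes dlopen(..., RTLD_NOW). Every undefined symbol in the dependency chain, including check_vrf_r in lib B, must then resolve at load time, which answers both questions: nothing in the libraries changed, and the symbol is demanded even though lib A never calls it.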
Conclusion
Once the root cause was found, the rest was easy: report the issue to the responsible organization or department and attach the analysis, and the problem is soon fixed. I later heard that a colleague had mistakenly added a line to the script, which made the dlopen() hijacking take effect.
For problems like this, the root cause is very simple and the fix takes little effort. The hard part is locating the problem step by step in a complex system, which can mean checking code across the entire system. Because you are not familiar with the system's other modules, you also need to ask for support when necessary. I would like to thank the colleagues who supported me in this process. I am also glad that the company has a good mechanism and atmosphere, so that strong support is available when you need it.