Linux system() function error: errno is ECHILD


Reproduced from: http://my.oschina.net/renhc/blog/54582

Today a program that had been running for nearly a year suddenly failed. The problem was traced to the system() function, whose basic usage I introduced in a previous article: http://my.oschina.net/renhc/blog/53580

Let's take a look at the problem.

A simple wrapper around the system() function:

int Pox_system(const char *cmd_line)
{
    return system(cmd_line);
}

Function call:
int ret = 0;
ret = Pox_system("gzip -c /var/opt/i00005.xml > /var/opt/i00005.z");
if (0 != ret)
{
    Log("Zip file failed\n");
}

Problem: every time execution reached this point, the zip failed. Running the same command by itself in a shell always succeeded, and in fact this code had been running for a long time without ever having a problem.

Bad log

When analyzing the log, all we could see was our own "Zip file failed" message, with no clue as to why it failed. Well, let's try to get more clues:

int ret = 0;
ret = Pox_system("gzip -c /var/opt/i00005.xml > /var/opt/i00005.z");
if (0 != ret)
{
    Log("Zip file failed: %s\n", strerror(errno)); // try to print the system error message
}

We added the errno set by system() to the log and got a very useful clue: system() failed because of "No child processes". On to the root cause.

Who moved errno?

From the clue above we know that system() set errno to ECHILD, but the man page for system() says nothing about ECHILD. We do know that system() works roughly like this: fork() -> exec() -> waitpid(). Obviously waitpid() is the prime suspect, so let's check its man page to see whether it can set ECHILD:

ECHILD (for waitpid() or waitid()) The process specified by pid (waitpid()) or idtype and id (waitid()) does not exist or is not a child of the calling process. (This can happen for one's own child if the action for SIGCHLD is set to SIG_IGN. See also the Linux Notes section about threads.)

Sure enough, if the disposition of SIGCHLD is set to SIG_IGN, waitpid() may fail with ECHILD because the child process can no longer be found. It seems we have found the fix: reset SIGCHLD to its default disposition, that is, call signal(SIGCHLD, SIG_DFL), before calling system(). In our excitement we only glanced at the Linux Notes section, then added the code and tested it. Sure enough, the problem was solved.
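The behaviour described in the man page excerpt can be reproduced with a short sketch. It assumes a Linux system; the helper name wait_errno_with_disposition is made up for this illustration and is not part of the original program. It forks a child that exits at once and reports which errno, if any, waitpid() produced under a given SIGCHLD disposition.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not from the original program): fork a child that
 * exits immediately and wait for it while SIGCHLD has the given disposition.
 * Returns 0 if waitpid() succeeded, otherwise the errno it reported.
 * The previous disposition is restored before returning. */
int wait_errno_with_disposition(void (*disposition)(int))
{
    void (*old_handler)(int) = signal(SIGCHLD, disposition);
    if (old_handler == SIG_ERR)
        return errno;

    int result = 0;
    pid_t pid = fork();
    if (pid == 0)
        _exit(0);                      /* child: terminate at once */
    if (pid < 0) {
        result = errno;                /* fork() failed */
    } else {
        int status;
        if (waitpid(pid, &status, 0) == -1)
            result = errno;            /* ECHILD when SIGCHLD is ignored */
    }

    signal(SIGCHLD, old_handler);      /* put the old disposition back */
    return result;
}
```

Called with SIG_IGN the helper is expected to return ECHILD on Linux, and called with SIG_DFL it is expected to return 0, which matches what we saw with system().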

Is this your style of dealing with problems?

Just as we were rushing to check in the code, a question arose: "Why didn't this error happen before?" Indeed, why would a perfectly good program suddenly fail? Our code had not changed, so it had to be an external factor. At the thought of external factors we began to complain: "It must be another group's program affecting us!" But complaining is useless; if you believe that, produce the evidence. Calm analysis shows that it cannot be the influence of another program, because other processes cannot change the way our process handles signals.

system() did not fail before because it relies on a property of the system: the kernel initializes a process with the SIGCHLD disposition set to SIG_DFL. What does that mean? When a child process terminates, the kernel sends SIGCHLD to the parent, and the parent handles it in the SIG_DFL way. And what is the SIG_DFL way? SIG_DFL is a macro that selects the default action for a signal, and for SIGCHLD the default action is to do nothing: the signal is discarded, while the terminated child stays around until the parent waits for it. This is exactly what system() requires: system() first calls fork() to create a child process that executes the command, and after the command finishes it uses waitpid() to reap the child.
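The fork() -> exec() -> waitpid() sequence can be sketched as follows. This is a simplified illustration, not the real library implementation: an actual system() also blocks SIGCHLD and ignores SIGINT and SIGQUIT while it waits. The name simple_system is an assumption made for this example.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Simplified sketch of what system() does; simple_system is a made-up name.
 * Returns the wait status of the shell, or -1 if fork() or waitpid() fails. */
int simple_system(const char *cmd_line)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;                         /* could not create a child */

    if (pid == 0) {
        /* Child: let the shell run the command line. */
        execl("/bin/sh", "sh", "-c", cmd_line, (char *)NULL);
        _exit(127);                        /* only reached if exec failed */
    }

    /* Parent: reap the child, retrying if a signal interrupts the wait. */
    int status;
    while (waitpid(pid, &status, 0) == -1) {
        if (errno != EINTR)
            return -1;                     /* e.g. ECHILD if SIGCHLD is ignored */
    }
    return status;
}
```

The last step is the one that matters here: if the child has already been discarded because SIGCHLD is ignored, waitpid() has nothing left to collect and the function returns -1 with errno set to ECHILD.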

From the analysis above we can see clearly that, before system() was executed, the SIGCHLD disposition must have been changed so that it was no longer SIG_DFL. What it was changed to we did not yet know, and in fact we did not need to know. We only need to remember to explicitly set the SIGCHLD disposition to SIG_DFL before calling system(), recording the original disposition, and to restore the original disposition after system() returns. This shields us from system upgrades or other changes to signal handling.

Verifying the conjecture

Our company uses continuous integration and agile development. A dedicated team runs the automated test cases every day, and each run is called a build. We compared the system versions used by the current build and the previous build and found that the version had indeed been upgraded. So we asked the relevant team to verify, describing the problem in detail, and they soon replied. The following is the original mail reply:

Libgen has added new handling for SIGCHLD: the signal is now ignored, in order to avoid zombie processes.

It seems our conjecture was right. With the analysis done, the solution was clear, so we modified our Pox_system() function:

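#include <signal.h>
#include <stdlib.h>

typedef void (*sighandler_t)(int);
int Pox_system(const char *cmd_line)
{
    int ret = 0;
    sighandler_t old_handler;

    /* system() needs the default SIGCHLD disposition so that waitpid() works */
    old_handler = signal(SIGCHLD, SIG_DFL);
    ret = system(cmd_line);
    /* restore whatever disposition was in place before the call */
    signal(SIGCHLD, old_handler);

    return ret;
}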


I think this is the perfect way to call system(). Wrapping it in the Pox_system() function also brings great maintainability: we only need to modify this one function, and none of its callers need to change.

Later, looking at the code the other team had changed, we did indeed find the answer there:

/* Ignore SIGCHLD to avoid zombie process */
if (signal(SIGCHLD, SIG_IGN) == SIG_ERR) {
    return -1;
} else {
    return 0;
}


Other thoughts

Our company manages its code with SVN, and by now there are many branches. Gradually the problem showed up on almost every branch, so I fixed it branch by branch. That took almost a whole day, because some branches were locked: to merge code into them I had to find the person responsible and explain how serious the problem was, and then test in different environments. While doing all this I kept asking myself whether this system upgrade was appropriate.

First, the system upgrade caused problems with our code to show up in testing, and we then had to rush out a fix, which put us on the back foot. I think this was their mistake: an upgrade must take into account its impact on other teams, all the more so when it is a system upgrade. Before upgrading you need to do a risk assessment and inform everyone of the possible impact; that is what being professional means.

Moreover, by their account the signal handling was changed to avoid zombie processes. The intention is good, of course, but such an upgrade affects the use of functions related to child processes, such as system(), wait(), waitpid(), and fork(). If you want to use wait() or waitpid() to reap a child process, you must use the method described above: set SIGCHLD to SIG_DFL before the call (in fact, before fork()), and set the signal disposition back to its previous value after the call (in fact, after wait()/waitpid()). Your system upgrade forces everyone to improve their code, and it does improve code quality, but I do not agree with this upgrade. Just think how many fork() -> waitpid() sequences you have ever seen with code around them to set the SIGCHLD disposition.
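Applied to a plain fork()/waitpid() pair, the save-and-restore pattern might look like the sketch below. The helper fork_and_reap and its parameter are invented for this illustration; a real child would do actual work instead of exiting immediately.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits with child_exit_code and
 * return the exit code collected by waitpid(), or -1 on any failure.
 * SIGCHLD is set to SIG_DFL before fork() and restored after waitpid(). */
int fork_and_reap(int child_exit_code)
{
    void (*old_handler)(int) = signal(SIGCHLD, SIG_DFL);
    if (old_handler == SIG_ERR)
        return -1;

    int exit_code = -1;
    pid_t pid = fork();
    if (pid == 0)
        _exit(child_exit_code);        /* child: real work would go here */
    if (pid > 0) {
        int status;
        if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
            exit_code = WEXITSTATUS(status);
    }

    signal(SIGCHLD, old_handler);      /* restore the previous disposition */
    return exit_code;
}
```

Even when the caller has set SIGCHLD to SIG_IGN beforehand, the helper should still be able to collect the child's exit code, and the ignored disposition is back in place when it returns.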

Recommendations for using the system() function

The above gives a safer way to call system(), but system() is still error-prone. Where is the pitfall? It is the return value of system(); for an introduction to the return value, see the previous article. system() is sometimes convenient, but it should not be abused.
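As a rough illustration of why the return value is easy to get wrong, the sketch below separates the main cases: system() itself failing (return value -1, the only case in which errno describes the failure), the command exiting normally, and the command being killed by a signal. The helper names system_exit_code and run_and_report are assumptions for this example and do not come from the earlier article.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>

/* Hypothetical helper: interpret a system() return value.
 * Returns the command's exit code (0..255) if it exited normally,
 * -1 if system() itself failed (errno is set), -2 otherwise. */
int system_exit_code(int ret)
{
    if (ret == -1)
        return -1;
    if (WIFEXITED(ret))
        return WEXITSTATUS(ret);       /* 127 may mean the shell could not run */
    return -2;
}

/* Hypothetical helper: run a command and report failures on stderr. */
int run_and_report(const char *cmd_line)
{
    errno = 0;
    int ret = system(cmd_line);
    int code = system_exit_code(ret);

    if (code == -1)
        fprintf(stderr, "system() failed: %s\n", strerror(errno));
    else if (code == -2 && WIFSIGNALED(ret))
        fprintf(stderr, "command killed by signal %d\n", WTERMSIG(ret));
    else if (code != 0)
        fprintf(stderr, "command exited with code %d\n", code);
    return code;
}
```

With a log like this, the original failure would have shown up as "system() failed: No child processes" rather than a bare "Zip file failed".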

1. It is recommended to use system() only to execute shell commands, because generally speaking a return value other than 0 means that something went wrong;

2. It is recommended to check the value of errno after system() returns, so that more useful information is available when an error occurs;

3. It is recommended to consider popen() as an alternative to system(); its usage is introduced in another article of mine.

Qdurenhongcai@163.com

Please indicate the source when reprinting.
