Exit Codes

Address: http://www.slac.stanford.edu/BFROOT/www/Computing/Environment/Tools/Batch/exitcode.html

----------------------------------------------------

Job crashes and exit codes

This webpage is a collection of information about job crashes and exit codes, gleaned from hypernews and wherever else I could find it.

If you have anything useful to add, or if any of the information is incorrect, please feel free to edit the page!

Quick diagnosis

The overall impression I get from searching hypernews is:

  • If you get a core dump, there was a problem in your code, and you should use the debugger.
  • If you do not get a core dump, then your exit code probably means you ran out of CPU time.

Exit codes and kill-job signals

The exit code from a batch job is a standard UNIX termination status, the same sort of number you get in a shell script from checking the "$?" variable after executing a command.

Typically, exit code 0 (zero) means successful completion. Codes 1-127 are typically generated by your job itself calling exit() with a non-zero value to terminate itself and indicate an error. In BaBar we don't make very much use of this. The most common such value you might see is 64, which is the value used by Framework to say that its event loop is being stopped before all the requested data have been read, typically because time ran out. In recent BaBar releases you might also see 125, which we use as a code for a generic "severe error"; the job log should contain a message stating what the error was.

Exit codes in the range 129-255 represent jobs terminated by UNIX "signals". Each type of signal has a number, and what's reported as the job exit code is the signal number plus 128. Signals can arise from within the process itself (as for SEGV, see below) or be sent to the process by some external agent (such as the batch control system, or your use of the "bkill" command).

By way of example, then, exit code 64 means that the job deliberately terminated its execution by calling "exit(64)", exit code 137 means that the job was sent signal 9, and exit code 140 represents signal 12.
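Both cases are easy to reproduce at a shell prompt. The following is only a minimal sketch for sh or bash, not part of the BaBar tools: the first child shell exits deliberately with code 64, and the second sends itself signal 9 (KILL), so the parent sees 128 + 9. The variable names are purely illustrative.

```shell
# Case 1: the process terminates itself by calling exit(64).
sh -c 'exit 64'
status_exit=$?

# Case 2: the process is terminated by signal 9 (KILL), here sent to itself.
sh -c 'kill -KILL $$'
status_signal=$?

echo "deliberate exit:  $status_exit"
echo "killed by signal: $status_signal (signal $((status_signal - 128)))"
```

Signal 9 is used here because KILL has the same number on all common UNIX platforms; most other signal numbers should be checked on the platform where the job ran.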

The specific meaning of the signal numbers is platform-dependent. If you are trying to figure out a problem that was seen on Linux, you have to run the following commands on Linux. We don't have Solaris or Mac OS batch resources in BaBar at the moment, but if we did, you would have to match platforms similarly when debugging.

terminationDecoder

BaBar provides a little program that will take your exit code and spit out an explanation. The program is called terminationDecoder. Examples:

[yakut] terminationDecoder 137
terminated by signal 9 (Killed)
[yakut] terminationDecoder 64
exited with code 64 (in Framework: stop requested, e.g., by CpuCheck)
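The source of terminationDecoder is not reproduced here. As a rough sketch of the same idea, a hypothetical shell function (decode_exit_code is a made-up name, and its messages are not the real program's) could apply the ranges described earlier: 0 for success, 1-127 for a deliberate exit(), and 129-255 for a signal number plus 128.

```shell
# decode_exit_code CODE
# Classify a batch exit code using the ranges described earlier on this page.
# Hypothetical helper, not the real terminationDecoder.
decode_exit_code() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "exited normally (code 0)"
    elif [ "$code" -ge 1 ] && [ "$code" -le 127 ]; then
        echo "exited with code $code"
    elif [ "$code" -ge 129 ] && [ "$code" -le 255 ]; then
        echo "terminated by signal $((code - 128))"
    else
        echo "code $code is outside the ranges described here"
    fi
}

decode_exit_code 137
decode_exit_code 64
```

Unlike the real tool, this sketch knows nothing about BaBar-specific meanings such as 64 or 125; it only separates deliberate exits from signals.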

More details

You can also look this up yourself; if you know the signal number, then you can find out why the job was killed using the command "kill -l":

[yakut] kill -l
HUP INT QUIT ILL TRAP ABRT BUS FPE KILL USR1 SEGV USR2 PIPE ALRM TERM STKFLT
CHLD CONT STOP TSTP TTIN TTOU URG XCPU XFSZ VTALRM PROF WINCH POLL PWR SYS
RTMIN RTMIN+1 RTMIN+2 RTMIN+3 RTMAX-3 RTMAX-2 RTMAX-1 RTMAX

So for example, if your job was killed by signal 6, then it got an "ABRT" (the sixth word in the list), which is short for abort.
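Counting along the list by hand is error-prone. In POSIX-style shells, "kill -l" also accepts a number and prints the matching name. A small sketch, assuming such a shell; signal_name_from_exit is a hypothetical helper name, and the answer is only meaningful on the platform where the job ran:

```shell
# signal_name_from_exit CODE
# Convert a batch exit code in the range 129-255 to a signal name by
# subtracting 128 and asking the shell's "kill -l" for the name.
signal_name_from_exit() {
    kill -l "$(( $1 - 128 ))"
}

signal_name_from_exit 134
signal_name_from_exit 137
```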

To find out what all the "kill -l" words mean, you can use the command:

man 7 signal    

(Or, on Solaris, "man -s 3HEAD signal".) This will give you the man page for signal(7). Scroll down a bit and you will get a list of the kill-signal words, each with its number, its default action (Term for terminate, Core for terminate and dump core) and a short explanation. Here is a sample:

SIGHUP        1       Term    Hangup detected on controlling terminal
                              or death of controlling process
SIGINT        2       Term    Interrupt from keyboard
SIGQUIT       3       Core    Quit from keyboard
SIGILL        4       Core    Illegal Instruction
SIGABRT       6       Core    Abort signal from abort(3)
SIGFPE        8       Core    Floating point exception
SIGKILL       9       Term    Kill signal
SIGSEGV      11       Core    Invalid memory reference
SIGPIPE      13       Term    Broken pipe: write to pipe with no readers
SIGALRM      14       Term    Timer signal from alarm(2)
SIGTERM      15       Term    Termination signal

(Obviously, these are just the "kill -l" words, but with "SIG" in front of them.)

You may also find it useful to look at the file signal.h. On a Linux machine, the location is:

/usr/include/asm/signal.h

Hypernews examples

Here are some specific exit codes that came up in hypernews. I have recorded the HN responses here; however, they might not be correct in all cases. (Maybe the exit codes can mean other things, too.)

Exit code 9: Ran out of CPU time.

Exit code 64: The Framework ended the job nicely for you, most likely because the job was running out of CPU time. But it means you did not go through all the data requested. The solution is to submit the job to a queue with more resources (a bigger CPU time limit).

Exit code 125: An ErrMsg(severe) was reached in your job.

Exit code 127: Something wrong with the machine?

Exit code 130: The job ran out of CPU or swap time. If swap time is the culprit, check for memory leaks.

Exit code 131: The job ran out of CPU or swap time. If swap time is the culprit, check for memory leaks.

Exit code 134: The job was killed with an abort signal, and you probably got a core dump. Often this is caused by either assert() or ErrMsg(fatal) being hit in your job. There may be a run-time bug in your code. Use a debugger like gdb or dbx to find out what's wrong.

Exit code 137: The job was killed because it exceeded the time limit.

Exit code 139: Segmentation violation.

Exit code 140: The job exceeded the "wall clock" time limit (as opposed to the CPU time limit).

Howto's guide to job-kill signals

The following is copied from howto-Basic-debugging, which you should definitely consult to learn how to interpret, report, and deal with errors and crashes:

SEGV
A segmentation violation or segmentation fault typically means that something is trying to access memory that it shouldn't be accessing. One common example of this is trying to access memory through a null pointer, for example:

sunprompt> cat main.c
#include <iostream>
int main()
{
  int* bunk(0);
  std::cout << *bunk << std::endl;
}
sunprompt> CC main.c
sunprompt> ./a.out
Segmentation fault (core dumped)

ABRT
Asserts are one common source of the "Abort" signal, for example:

sunprompt> cat main.c
#include <assert.h>
int main()
{
  int i=0;
  assert(i!=0);
}
sunprompt> CC main.c
sunprompt> ./a.out
Assertion failed: i!=0, file main.c, line 5
Abort (core dumped)

Note that the assertion that failed and its location are also printed. An ABRT can also be generated from the BaBar ErrMsg(fatal) construct, in which case your job log should contain a message explaining the error.

FPE
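A "floating point exception" usually indicates a numerical problem such as a division by zero or an overflow. Whether a given operation actually raises the signal is platform-dependent: on Linux/x86, for instance, an integer division by zero does, while a floating-point division by zero normally just produces infinity. One example would be: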

osfprompt> cat main.c
int main()
{
  float a = 1.;
  float b = 0.;
  float c = a/b;
}
osfprompt> g++ main.c
osfprompt> ./a.out
Floating exception (core dumped)

ILL
This signal ("illegal instruction") means that, while running, your program tried to execute a machine instruction that does not exist. This can happen for a variety of reasons, including:

  • A memory overwrite that happens to overwrite part of the program stored in memory. This may result in the program trying, for example, to execute data as if it is a machine instruction.
  • An attempt to take an executable compiled on one platform for use on another, for example on an earlier version of the same chip.
  • A truncated or corrupted executable is loaded for execution.
  • Incomplete recompilation of source code, i.e. you changed one C++ class and didn't recompile all other code affected by that change.

BUS
A "Bus Error" may come, for example, from accessing unaligned data (i.e. trying to access a 4-byte integer with a pointer to the middle of it). What this means will vary from platform to platform. (I haven't come up with a good example of this one yet.)

A "Bus Error" can also often indicate a memory overwrite, e.g. somebody wrote a number where a pointer is kept. This is often caused by going past the end of an array and into the system pointers at the start of the next memory block.

How do you know if you've exceeded your CPU time?

To find out whether your job has exceeded the CPU time limit, you have to do 3 things:

  1. Look at your log file to get the job's CPU time.
  2. Use the machine-dependent CPU factor (cpuf) to convert the CPU time to SLAC time. The formula is: SLAC time = CPU time * cpuf.
  3. Compare this to the time allowed by the queue in which the job was run.

Here is an example.

First, look at the end of your log file:

Job <VubRecoilUserApp VubXlnu.tcl SP-1237-BSemiExcl-Run5-R18b-1 MC> was submitted from host <yakut02> by user <penguin>.
Job was executed on host(s) <cob0313>, in queue <xlong>, as user <penguin>.
</u/br/penguin> was used as the home directory.
</u/br/penguin/vubrecoil/vub30/workdir> was used as the working directory.
Started at Wed Feb  8 17:25:33 2006
Results reported at Wed Feb  8 19:27:28 2006

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
VubRecoilUserApp VubXlnu.tcl SP-1237-BSemiExcl-Run5-R18b-1 MC
------------------------------------------------------------

Exited with exit code 134.

Resource usage summary:

    CPU time   :   7058.71 sec.
    Max Memory :      2863 MB
    Max Swap   :      2968 MB

    Max Processes  :         3
    Max Threads    :         3
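Pulling the two interesting numbers out of such a log is easy to script. The sketch below assumes the log ends with a summary in which "Exited with exit code" and "CPU time" each sit on their own line; job_exit_code and job_cpu_seconds are hypothetical helper names, and logs laid out differently would need different patterns.

```shell
# job_exit_code LOGFILE: print the number from "Exited with exit code N."
job_exit_code() {
    sed -n 's/^Exited with exit code \([0-9][0-9]*\)\.$/\1/p' "$1"
}

# job_cpu_seconds LOGFILE: print the number from "CPU time : N sec."
job_cpu_seconds() {
    sed -n 's/^ *CPU time *: *\([0-9.][0-9.]*\) sec\.$/\1/p' "$1"
}

# Small stand-in for the tail of a real log file.
log=$(mktemp)
cat > "$log" <<'EOF'
Exited with exit code 134.

Resource usage summary:

    CPU time   :   7058.71 sec.
    Max Memory :      2863 MB
EOF

job_exit_code "$log"
job_cpu_seconds "$log"
rm -f "$log"
```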
 

The job was run on the machine cob0313.

> bhosts -l cob0313

This tells you (among other things) that the cpuf for cob0313 is 7.65.

The SLAC time for your job is thus:

SLAC time = (CPU time) * cpuf = (7058.71 sec) * 7.65 = 53999.1 sec = 900 min

The next step is to find out whether this exceeds the CPU limit of the queue in which the job was run. In this example, the job was in the xlong queue:

> bqueues -l xlong

Among other things, this tells you the CPU limit for the queue:

 CPULIMIT 2900.0 min of slac

The job used only 900 minutes of SLAC time, less than the 2900 allowed by the xlong queue. So the job did not exceed its CPU time limit. It must have crashed for some other reason; its exit code, 134 (128 + 6), points to an ABRT signal instead.
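The arithmetic in this example can be wrapped in a short script. This is only a sketch: the function names are made up for illustration, the numbers are the ones from the example above, and awk is used because plain shell arithmetic is integer-only.

```shell
# slac_minutes CPU_SECONDS CPUF
# Normalised (SLAC) time in minutes: CPU time * cpuf / 60.
slac_minutes() {
    awk -v cpu="$1" -v cpuf="$2" 'BEGIN { printf "%.1f\n", cpu * cpuf / 60 }'
}

# exceeded_cpu_limit CPU_SECONDS CPUF LIMIT_MINUTES
# Succeeds (returns 0) only if the normalised time is above the queue limit.
exceeded_cpu_limit() {
    awk -v cpu="$1" -v cpuf="$2" -v limit="$3" \
        'BEGIN { exit !(cpu * cpuf / 60 > limit) }'
}

slac_minutes 7058.71 7.65
if exceeded_cpu_limit 7058.71 7.65 2900.0; then
    echo "CPU limit exceeded"
else
    echo "within CPU limit"
fi
```

The cpuf and the queue limit still have to be read off the "bhosts -l" and "bqueues -l" output by hand; the sketch does not attempt to parse them.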

Memory leaks

Jobs can also crash because of memory leaks and related problems, such as dangling pointers or array overruns. The following links may be helpful for tracking down memory leaks:

  • Memory leaks webpage
  • Valgrind at BaBar




Author: Sheila McLachlin
Created: Feb 09, 2006.
Last Updated: Feb 13, 2006 by Gregory Dubois-Felsmann.
