Verify and Report Exit Codes and States for All Steps
Currently, only the allocation exit code is reported. Use the exit code of each step to report how many steps of each job failed.
IMPORTANT: Normally, the allocation exit code reported by sacct (shown when the -X flag is used) corresponds to the exit code of the last command in the submitted batch script, which may not necessarily be a job step.
As a result, even if srun (or mpirun) fails, the exit code is reported as 0:0 and the job state is marked COMPLETED rather than FAILED whenever a later command in the script (e.g., time) completes successfully.
Here’s an example of a job with two steps: the first completed successfully, but the second failed. Because a serial command that ran after the last srun returned 0, the allocation exit code is reported as 0:0. The query with -X shows only the allocation, while the query without -X also lists the individual steps, including the failed step 10382893.1:
$ sacct -Xj 10382893 -o jobid%20,state,exitcode,allocnodes,alloccpus
               JobID      State ExitCode AllocNodes  AllocCPUS
-------------------- ---------- -------- ---------- ----------
10382893              COMPLETED      0:0        128       4096
$ sacct -j 10382893 -o jobid%20,state,exitcode,allocnodes,alloccpus
               JobID      State ExitCode AllocNodes  AllocCPUS
-------------------- ---------- -------- ---------- ----------
10382893              COMPLETED      0:0        128       4096
10382893.batch        COMPLETED      0:0          1         32
10382893.extern       COMPLETED      0:0        128       4096
10382893.0            COMPLETED      0:0          1         32
10382893.1               FAILED      1:0        128       4096
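The per-step counting requested here could be sketched roughly as follows. This is only an illustration, not the project's implementation: the function name count_failed_steps and the SAMPLE text are hypothetical, and the sketch assumes the step records are obtained with sacct's --parsable2 and --noheader options so that fields are separated by "|". It also assumes that every record with a step suffix (including .batch and .extern) counts as a step, and that a step has failed when its ExitCode is anything other than 0:0.

```python
from collections import Counter

# Hypothetical sample mirroring the listing above, in the format printed by
#   sacct -j 10382893 --parsable2 --noheader -o jobid,state,exitcode
SAMPLE = """\
10382893|COMPLETED|0:0
10382893.batch|COMPLETED|0:0
10382893.extern|COMPLETED|0:0
10382893.0|COMPLETED|0:0
10382893.1|FAILED|1:0
"""


def count_failed_steps(sacct_text):
    """Return {job_id: number of step records whose ExitCode is not 0:0}."""
    failed = Counter()
    for line in sacct_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        jobid, _state, exitcode = line.split("|")[:3]
        # "10382893.1" -> job "10382893", step "1"; the allocation has no suffix.
        job, dot, _step = jobid.partition(".")
        failed[job] += 0  # make sure every job appears, even with no failures
        if not dot:
            continue  # allocation record, not a step
        # Assumption: a non-zero return code or signal marks the step as failed.
        if exitcode.strip() != "0:0":
            failed[job] += 1
    return dict(failed)


if __name__ == "__main__":
    for job, n_failed in count_failed_steps(SAMPLE).items():
        print(f"{job}: {n_failed} failed step(s)")
```

For the job above this reports one failed step, even though the allocation itself is COMPLETED with exit code 0:0. Whether .batch and .extern should be counted, and whether the State column should be consulted as well, are open design choices.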