
Frequently Encountered Problems

A list of common errors encountered in running jobs.

NOTE: most of these items use Grid Engine rather than Slurm. This page is being updated; however, the underlying ideas are the same, only the commands differ.

Job quits immediately with no output, or immediately goes into Error state

Diagnosing

Grid Engine: If you look at the full queue status, your job appears with an "E" (error) flag:

[juser@proteusa01 ~]$ qstat -f -u \* | less
...
###############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
###############################################################################
 123456 1.01025 myjob.sh   juser    Eqw   05/23/2016 16:29:01     8

If you do

[juser@proteusa01 ~]$ qstat -j 123456

in the output you should see the "error reason":

error reason    1:          02/12/2015 14:31:56 [1062:71145]: execvp(/cm/local/apps/sge/var/spool/ic21n03/job_scripts/123456, "/cm/local/apps/sge/var/spool/ic21n03/job_scripts/123456") failed: No such file or directory

Problem

There are two probable causes:

  • The most likely issue is that your job script was created on a Windows machine, and uses DOS line endings rather than Unix line endings.[1]
  • The other possibility is that the "shebang" line in your job script has a typo. ("Shebang" comes from "hash bang", the two characters which begin the line.) The "shebang" line should be:

#!/bin/bash

Solutions

  • For the case where the script was created on Windows, convert the job script to Unix line endings using dos2unix:[2]

[juser@proteusi01 ~]$ dos2unix myjob.sh

  • For the case where the shebang line was wrong, correct it with the full path to the script interpreter (usually "/bin/bash").
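If dos2unix is not available, the same conversion can be done with standard tools. A sketch using sed (the filename myjob.sh is illustrative; the in-place -i flag assumes GNU sed, as found on most Linux clusters):

```shell
# Create a job script with DOS (CRLF) line endings to demonstrate the problem.
printf '#!/bin/bash\r\necho hello\r\n' > myjob.sh

# Detect DOS line endings: look for carriage-return characters.
if grep -q $'\r' myjob.sh; then
    echo "myjob.sh has DOS line endings"
fi

# Strip the trailing carriage returns in place (GNU sed; on other systems,
# 'tr -d "\r"' into a new file works too).
sed -i 's/\r$//' myjob.sh

if ! grep -q $'\r' myjob.sh; then
    echo "myjob.sh now has Unix line endings"
fi
```

The `file` command is another quick check: it reports "with CRLF line terminators" for DOS-format text files.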

Unable to Submit Job - "Invalid account or account/partition combination specified"

Diagnosing

You are a member of more than one group/project. When you submit (sbatch) a job under a project different from your usual one, you get the error "Invalid account or account/partition combination specified".

Problem

Your default Slurm billing account is set to an unbillable account. This happens when you are a member of multiple accounts.

Solutions

Specify the billing account to be used (charged) for the job. Either:

  • in the job script

#SBATCH -A somethingPrj

  • in the command line

[myname@picotte001 ~]$ sbatch -A somethingPrj myjobscript.sh
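Put together, a minimal job script header might look like the following sketch. The account name somethingPrj and the resource values are placeholders; substitute your own. (To list the accounts you belong to, `sacctmgr show associations user=$USER format=Account` is one option, though the output format varies by site.)

```shell
#!/bin/bash
#SBATCH -A somethingPrj        # billing account (placeholder name)
#SBATCH --time=1:00:00         # illustrative resource requests
#SBATCH --ntasks=1

echo "job running under the requested billing account"
```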

Job Stays in Pending List and Does Not Run

Diagnosing

Output from squeue shows that the job is in the pending list even though there are free resources, and the per-group slot quota is not in effect.

Problem

Usually caused by resource requests which are not possible to fulfill, including:

  • time > 48 hours
  • total amount of memory requested per node exceeds amount of RAM installed on node
  • number of slots requested is too high

Solution

Use scontrol to modify the resource requests, or scancel the job and resubmit after adjusting the job script.

You may use sbatch --test-only to test your job script. See the man page for sbatch.

Libraries Not Found

Diagnosing

You get error messages about libraries not being found.

Problem

The LD_LIBRARY_PATH environment variable is not set properly. If you also see "no module" errors, meaning the "module" command itself is not found, the environment was not set up to provide the module command.

Solution

One of these will work:

  1. Add the command ". /etc/profile.d/modules.sh" to your job script before you do any "module load".
  2. Change the first line of your job script to: #!/bin/bash -l

Do not add the "#$ -V" option to your job script.
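Option 1 can be made defensive so the job script also runs where the setup file is absent. A sketch (the path is the one given above; the variable name is arbitrary):

```shell
# Source the environment-modules setup only if it exists, before any
# `module load` lines in the job script.
modules_setup=/etc/profile.d/modules.sh
if [ -f "$modules_setup" ]; then
    . "$modules_setup"
fi

# Report whether the `module` command is now available.
if command -v module >/dev/null 2>&1; then
    echo "module command available"
else
    echo "module command not found; check the setup path"
fi
```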

Unable to Login With SSH

Diagnosing

SSH login attempts fail with a disconnect.

Problem

There is a security measure in place which blocks IP addresses after 3 failed login attempts. The block lasts 90 minutes. This applies only to off-campus IP addresses.

Solutions

Use the Drexel VPN. This assigns your computer an on-campus IP address.

Disk Quota Exceeded

Diagnosing

An operation or job fails with an error message:

Disk quota exceeded

Problem

You have exceeded the quota for your home directory.
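To confirm that your home directory is the culprit, checking its usage is a reasonable first step. A sketch using du (your site's quota-reporting command, if any, may give more precise numbers):

```shell
# Total size of the home directory; compare against the quota
# (64 GB on Picotte at the time of writing).
du -sh "$HOME"

# The largest subdirectories are usually the place to start cleaning up.
du -sh "$HOME"/* 2>/dev/null | sort -rh | head -5
```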

Solution

Move files out of your home directory into a subdirectory of your group directory.

  • To move files out of your home directory and into the group directory, use the "cp" command to copy the files, then delete the originals after verifying the copy is complete. The "mv" command may not accomplish this properly due to the way it handles directories.

cp -R ~/data /mnt/HA/groups/myrsrchGrp/users/myname

  • You can also use "rsync" -- NB there is a trailing "/" on the source "~/data/" but not on the destination directory:

mkdir /mnt/HA/groups/myrsrchGrp/users/myname
rsync -av ~/data/ /mnt/HA/groups/myrsrchGrp/users/myname/data
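Whichever tool you use, a copy-verify-delete sequence is safer than removing the originals blindly. A sketch using temporary directories as stand-ins for ~/data and the group directory:

```shell
# Stand-ins for ~/data and /mnt/HA/groups/.../users/myname
src=$(mktemp -d)
dest=$(mktemp -d)
mkdir -p "$src/data"
echo "example" > "$src/data/file.txt"

cp -R "$src/data" "$dest"                  # 1. copy the tree
if diff -r "$src/data" "$dest/data"; then  # 2. verify: no output means identical
    rm -rf "$src/data"                     # 3. only then delete the originals
    echo "copy verified, originals removed"
fi
```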

Unable to Create Any New Files

Diagnosing

Any attempt to create a new file fails with an error. The error message will vary by application.

Problem

You have exceeded the quota for your home directory.

Solution

See above: #Disk Quota Exceeded

Unable to Upload Data

Diagnosing

SFTP program gives an error, saying it is unable to upload data.

Problem

You are probably trying to upload to your home directory, which has a quota: on Picotte, it is 64 GB. (Quotas are subject to change.)

Solution

All data should be stored in the group directory:

  • Picotte: /ifs/groups/groupnameGrp

See above: #Disk Quota Exceeded

TMPDIR local scratch directory does not exist for interactive srun session

This can happen sometimes when requesting multiple tasks in srun rather than multiple CPUs (cores).

Diagnosing

Request an interactive session on a GPU node requesting multiple tasks:

[juser@picotte001]$ srun --ntasks=12 --time=15:00 --pty /bin/bash -l
[juser@node012]$ echo $TMPDIR
/local/scratch/1234567
[juser@node012]$ ls -ld $TMPDIR
/bin/ls: cannot access '/local/scratch/1234567': No such file or directory

Problem

The root cause is not known.

Solution

Request multiple CPUs (i.e. CPU cores) per task, instead; the default number of tasks is one per node.

[juser@picotte001]$ srun --cpus-per-task=12 --time=15:00 --pty /bin/bash -l
[juser@node012]$ echo $TMPDIR
/local/scratch/1234568
[juser@node012]$ ls -ld $TMPDIR
drwxrwxrwt 529 root root 36864 Oct  7 17:58 /local/scratch/1234568


References

[1] Wikipedia:Newline

[2] Tips for Windows Users#Scripts Created on Windows