Frequently Encountered Problems

A list of common errors encountered in running jobs.

Unable to Submit Job - "Invalid account or account/partition combination specified"

Diagnosing

You are a member of more than one group/project. When you submit (sbatch) a job under a project different from your usual one, you get the error 'Invalid account or account/partition combination specified'.

Problem

Your default Slurm billing account is set to an unbillable account. This happens when you are a member of multiple accounts.

Solutions

Specify the billing account to be used (charged) for the job. Either:

  • in the job script

#SBATCH -A somethingPrj

  • in the command line

[myname@picotte001 ~]$ sbatch -A somethingPrj myjobscript.sh
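
To see which accounts you can charge jobs to, you can list your Slurm associations (a generic check; output varies by user):

[myname@picotte001 ~]$ sacctmgr show associations user=$USER format=Account,Partition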

Job Stays in Pending List and Does Not Run

Diagnosing

Output from squeue shows that the job remains in the pending state (PD) even though resources are free and the per-group slot quota is not in effect.
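
To see why Slurm is holding the job, check the reason code in the last column of squeue output (job ID, partition, and job name here are placeholders):

[myname@picotte001 ~]$ squeue -j 1234567
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1234567       def    myjob   myname PD       0:00      1 (Resources)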

Problem

Usually caused by resource requests which cannot be fulfilled, including:

  • requested wall time exceeds the 48-hour limit
  • total memory requested per node exceeds the RAM installed on the node
  • the number of slots (CPU cores) requested exceeds what is available

Solution

Use scontrol to modify the resource requests, or scancel the job and resubmit after adjusting the job script.

You may use sbatch --test-only to test your job script. See the man page for sbatch.
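
For example, to lower a pending job's time limit so it fits within the 48-hour maximum, or to dry-run a script without actually submitting it (job ID is hypothetical):

[myname@picotte001 ~]$ scontrol update JobId=1234567 TimeLimit=24:00:00
[myname@picotte001 ~]$ sbatch --test-only myjobscript.sh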

Libraries Not Found

Diagnosing

You get error messages about libraries not being found.

Problem

The LD_LIBRARY_PATH environment variable is not set properly. If you also get error messages saying the "module" command is not found, the environment was not set up properly to provide the module command.

Solution

One of these will work:

  1. Add the command ". /etc/profile.d/modules.sh" to your job script before you do any "module load".
  2. Change the first line of your job script to: #!/bin/bash -l

Do not add the "#$ -V" option (a Grid Engine directive) to your job script.
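
For example, a minimal job script combining both fixes (account, module, and program names are placeholders):

#!/bin/bash -l
#SBATCH -A somethingPrj
#SBATCH --time=1:00:00

# Make the module command available before any "module load".
. /etc/profile.d/modules.sh
module load example-module

./myprogram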

Disk Quota Exceeded

Diagnosing

An operation or job fails with an error message:

Disk quota exceeded

Problem

You have exceeded the quota for your home directory.

Solution

Move files out of your home directory into a subdirectory of your group directory.

  • To move files out of your home directory and into the group directory, use the "cp" command to copy the files, then delete the originals after the copy is complete. The "mv" command may not accomplish this properly: when moving across filesystems it copies and then deletes, and an interrupted move can leave directories only partially transferred.

cp -R ~/data /ifs/groups/myrsrchGrp/users/myname

  • You can also use "rsync". Note the trailing "/" on "~/data/" but not on the destination directory:

mkdir /ifs/groups/myrsrchGrp/users/myname
rsync -av ~/data/ /ifs/groups/myrsrchGrp/users/myname/data
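
To find what is using the space in your home directory, standard tools suffice (largest items listed last):

[myname@picotte001 ~]$ du -sh ~/* ~/.[!.]* 2>/dev/null | sort -h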

Unable to Create Any New Files

Diagnosing

Any attempt to create a new file fails with an error. The error message will vary by application.
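
A quick way to confirm from the command line (the exact path in the message will differ):

[myname@picotte001 ~]$ touch ~/quota-test
touch: cannot touch '/home/myname/quota-test': Disk quota exceeded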

Problem

You have exceeded the quota for your home directory.

Solution

See above: Disk Quota Exceeded

Unable to Upload Data

Diagnosing

Your SFTP program gives an error saying it is unable to upload data.

Problem

You are probably trying to upload to your home directory, which has a quota: on Picotte, it is 64 GB. (Quotas are subject to change.)

Solution

All data should be stored in the group directory:

  • Picotte: /ifs/groups/groupnameGrp

See above: Disk Quota Exceeded
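
For example, in a command-line SFTP session, change to the group directory before uploading (hostname, group name, and file name are placeholders):

[localuser@mylaptop ~]$ sftp myname@picotte001
sftp> cd /ifs/groups/myrsrchGrp/users/myname
sftp> put mydata.tar.gz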

TMPDIR local scratch directory does not exist for interactive srun session

This can happen when requesting multiple tasks in srun rather than multiple CPUs (cores).

Diagnosing

Request an interactive session on a GPU node with multiple tasks:

[juser@picotte001]$ srun --ntasks=12 --time=15:00 --pty /bin/bash -l
[juser@node012]$ echo $TMPDIR
/local/scratch/1234567
[juser@node012]$ ls -ld $TMPDIR
/bin/ls: cannot access '/local/scratch/1234567': No such file or directory

Problem

The root cause is not known.

Solution

Request multiple CPUs (i.e. CPU cores) per task instead; the default number of tasks is one per node.

[juser@picotte001]$ srun --cpus-per-task=12 --time=15:00 --pty /bin/bash -l
[juser@node012]$ echo $TMPDIR
/local/scratch/1234568
[juser@node012]$ ls -ld $TMPDIR
drwxrwxrwt 529 root root 36864 Oct  7 17:58 /local/scratch/1234568
