Frequently Encountered Problems
A list of common errors encountered in running jobs.
Unable to Submit Job - "Invalid account or account/partition combination specified"
Diagnosing
You are a member of more than one group/project. When you submit (sbatch) a job under a project different from your usual one, you get the error "Invalid account or account/partition combination specified".
Problem
Your default Slurm billing account is set to an unbillable account. This happens when you are a member of multiple accounts.
Solutions
Specify the billing account to be used (charged) for the job. Either:
- in the job script:
#SBATCH -A somethingPrj
- or on the command line:
[myname@picotte001 ~]$ sbatch -A somethingPrj myjobscript.sh
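If you are not sure which accounts you may charge to, you can list your Slurm associations; a minimal sketch, assuming sacctmgr is available on the login node:
[myname@picotte001 ~]$ sacctmgr show associations user=$USER format=account,partition
Use the project account name shown there with the -A option as above.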
Job Stays in Pending List and Does Not Run
Diagnosing
Output from squeue shows that the job is in the pending list even though there are free resources, and the per-group slot quota is not in effect.
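To see why Slurm is holding the job, check the reason code it reports; a quick sketch, where 1234567 stands in for your job ID:
[myname@picotte001 ~]$ squeue -j 1234567 -o "%.18i %.8T %r"
[myname@picotte001 ~]$ scontrol show job 1234567 | grep -i reason
Reason codes such as "Resources" or "Priority" are normal waits; codes like "PartitionTimeLimit" or "ReqNodeNotAvail" usually indicate a request that cannot be fulfilled.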
Problem
Usually caused by resource requests which are not possible to fulfill, including:
- time > 48 hours
- total amount of memory requested per node exceeds amount of RAM installed on node
- number of slots requested is too high
Solution
Use scontrol to modify the resource requests, or scancel the job and resubmit after adjusting the job script.
You may use sbatch --test-only to test your job script. See the man page for sbatch.
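For example, to lower the time limit of a pending job in place, or to dry-run a script before submitting it (the job ID and limits here are examples only):
[myname@picotte001 ~]$ scontrol update JobId=1234567 TimeLimit=24:00:00
[myname@picotte001 ~]$ sbatch --test-only myjobscript.sh
Note that regular users can generally only decrease a job's time limit, not increase it.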
Libraries Not Found
Diagnosing
You get error messages about libraries not being found.
Problem
The LD_LIBRARY_PATH environment variable is not set properly. If you also get error messages about "no module", meaning the "module" command itself is not found, the environment was not set up properly to use the module command.
Solution
One of these will work:
- Add the command ". /etc/profile.d/modules.sh" to your job script before you do any "module load".
- Change the first line of your job script to:
#!/bin/bash -l
Do not add the "#$ -V" option to your job script.
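A minimal job script sketch using the second fix; the account, resource requests, and module name are examples only:
#!/bin/bash -l
#SBATCH -A somethingPrj
#SBATCH --time=01:00:00
#SBATCH --mem=4G
module load gcc
./my_program
With "#!/bin/bash -l", the login shell startup files set up the module command, so "module load" works without sourcing /etc/profile.d/modules.sh explicitly.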
Disk Quota Exceeded
Diagnosing
An operation or job fails with an error message:
Disk quota exceeded
Problem
You have exceeded the quota for your home directory.
Solution
Move files out of your home directory into a subdirectory of your group directory.
- To move files out of your home directory and into the group directory, use the "cp" command to copy the files, and then delete the originals after the copy is complete. The "mv" command may not accomplish this properly due to the way it handles directories.
cp -R ~/data /ifs/groups/myrsrchGrp/users/myname
- You can also use "rsync" -- N.B. there is a trailing "/" on "~/data/" but not on the destination directory:
mkdir /ifs/groups/myrsrchGrp/users/myname
rsync -av ~/data/ /ifs/groups/myrsrchGrp/users/myname/data
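To find out what is using the space, and to check the copy before deleting the originals, something like the following can help (same example paths as above):
[myname@picotte001 ~]$ du -sh ~/*
[myname@picotte001 ~]$ diff -r ~/data /ifs/groups/myrsrchGrp/users/myname/data
[myname@picotte001 ~]$ rm -rf ~/data
"diff -r" prints nothing if the two directory trees match; only then remove the originals.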
Unable to Create Any New Files
Diagnosing
Any attempt to create a new file fails with an error. The error message will vary by application.
Problem
You have exceeded the quota for your home directory.
Solution
See above: Disk Quota Exceeded
Unable to Upload Data
Diagnosing
SFTP program gives an error, saying it is unable to upload data.
Problem
You are probably trying to upload to your home directory, which has a quota: on Picotte, it is 64 GB. (Quotas are subject to change.)
Solution
All data should be stored in the group directory:
- Picotte: /ifs/groups/groupnameGrp
See above: Disk Quota Exceeded
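For example, in a command-line SFTP session you can change into the group directory before uploading; the hostname below is a placeholder for whatever address you normally connect to, and groupnameGrp is your group's directory:
sftp myname@<picotte-login-hostname>
sftp> cd /ifs/groups/groupnameGrp
sftp> put mydata.tar.gz
Graphical SFTP clients have an equivalent "change remote directory" option.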
TMPDIR local scratch directory does not exist for interactive srun session
This can happen sometimes when requesting multiple tasks in srun rather than multiple CPUs (cores).
Diagnosing
Request an interactive session on a GPU node with multiple tasks:
[juser@picotte001]$ srun --ntasks=12 --time=15:00 --pty /bin/bash -l
[juser@node012]$ echo $TMPDIR
/local/scratch/1234567
[juser@node012]$ ls -ld $TMPDIR
/bin/ls: cannot access '/local/scratch/1234567': No such file or directory
Problem
The root cause is not known.
Solution
Request multiple CPUs (i.e. CPU cores) per task, instead; the default number of tasks is one per node.
[juser@picotte001]$ srun --cpus-per-task=12 --time=15:00 --pty /bin/bash -l
[juser@node012]$ echo $TMPDIR
/local/scratch/1234568
[juser@node012]$ ls -ld $TMPDIR
drwxrwxrwt 529 root root 36864 Oct 7 17:58 /local/scratch/1234568
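The same applies to batch jobs; a sketch of a job script that uses local scratch, with example resource requests and program names:
#!/bin/bash -l
#SBATCH --cpus-per-task=12
#SBATCH --time=15:00
cd $TMPDIR
cp $SLURM_SUBMIT_DIR/input.dat .
$SLURM_SUBMIT_DIR/my_program input.dat
cp output.dat $SLURM_SUBMIT_DIR/
$SLURM_SUBMIT_DIR is set by Slurm to the directory from which the job was submitted.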