Frequently Encountered Problems
A list of common errors encountered in running jobs.
NOTE: most of these items use Grid Engine commands rather than Slurm; they are being updated. However, the underlying ideas are the same; only the commands differ.
Job quits immediately with no output, or immediately goes into Error state
Diagnosing
Grid Engine: If you look at the full queue status, your job appears with an "E" (error) flag:
[juser@proteusa01 ~]$ qstat -f -u \* | less
...
###############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
###############################################################################
 123456 1.01025 myjob.sh   juser        Eqw   05/23/2016 16:29:01                                    8
If you do
[juser@proteusa01 ~]$ qstat -j 123456
you should see the "error reason" in the output:
error reason 1: 02/12/2015 14:31:56 [1062:71145]: execvp(/cm/local/apps/sge/var/spool/ic21n03/job_scripts/123456,
"/cm/local/apps/sge/var/spool/ic21n03/job_scripts/123456") failed: No such file or directory
Problem
There are two probable causes:
- The most likely issue is that your job script was created on a Windows machine, and uses DOS line endings rather than Unix line endings.[1]
- The other possibility is that the "shebang" line in your job script has a typo. ("Shebang" comes from "hash bang", the two characters which begin the line.) The shebang line should be:
#!/bin/bash
Solutions
- For the case where you created your script in Windows, convert the job script to Unix line endings using dos2unix[2] (a quick check for DOS line endings is shown after this list):
[juser@proteusi01 ~]$ dos2unix myjob.sh
- For the case where the shebang line was wrong, correct it with the full path to the script interpreter (usually "/bin/bash").
Unable to Submit Job - "Invalid account or account/partition combination specified"
Diagnosing
You are a member of more than one group/project. When you submit (sbatch) a job under a project different from your usual one, you get the error "Invalid account or account/partition combination specified".
Problem
Your default Slurm billing account is set to an unbillable account. This happens when you are a member of multiple accounts.
Solutions
Specify the billing account to be used (charged) for the job. Either:
- in the job script
#SBATCH -A somethingPrj
- in the command line
[myname@picotte001 ~]$ sbatch -A somethingPrj myjobscript.sh
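To see which billing accounts you are associated with, you can query the Slurm accounting database (a quick sketch; the "format=" fields shown are optional, and the output depends on the site's Slurm configuration):
[myname@picotte001 ~]$ sacctmgr show associations user=$USER format=Account%20,Partition%12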
Job Stays in Pending List and Does Not Run
Diagnosing
Output from squeue shows that the job is in the pending list even though there are free resources, and the per-group slot quota is not in effect.
Problem
Usually caused by resource requests which are not possible to fulfill, including:
- time > 48 hours
- total amount of memory requested per node exceeds amount of RAM installed on node
- number of slots requested is too high
Solution
Use scontrol to modify the resource requests, or scancel the job and resubmit after adjusting the job script.
You may use sbatch --test-only to test your job script. See the man page for sbatch.
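For example, to see why a specific job is pending, shorten its time limit in place, or validate a script without actually submitting it (the job ID 123456 and the 24-hour limit are illustrative):
[myname@picotte001 ~]$ squeue -j 123456 -o "%.18i %.9T %R"
[myname@picotte001 ~]$ scontrol update JobId=123456 TimeLimit=24:00:00
[myname@picotte001 ~]$ sbatch --test-only myjobscript.sh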
Libraries Not Found
Diagnosing
You get error messages about libraries not being found.
Problem
The LD_LIBRARY_PATH environment variable is not set properly. If you also get error messages about "no module", i.e. the "module" command is not found, that is because the environment was not set up properly to use the module command.
Solution
One of these will work:
- Add the command ". /etc/profile.d/modules.sh" to your job script before you do any "module load".
- Change the first line of your job script to:
#!/bin/bash -l
Do not add the "#$ -V" option to your job script.
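A minimal job script header showing the second fix, followed by a module load (the module name "example" and account "somethingPrj" are placeholders):
#!/bin/bash -l
#SBATCH -A somethingPrj
#SBATCH --time=01:00:00
module load example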
Unable to Login With SSH
Diagnosing
SSH login attempts fail with a disconnect.
Problem
There is a security measure in place which blocks IP addresses after 3 failed login attempts. The block lasts 90 minutes. This applies only to off-campus IP addresses.
Solutions
Use the Drexel VPN. This assigns your computer an on-campus IP address.
Disk Quota Exceeded
Diagnosing
An operation or job fails with an error message:
Disk quota exceeded
Problem
You have exceeded the quota for your home directory.
Solution
Move files out of your home directory into a subdirectory of your group directory.
- To move files out of your home directory and into the group directory, use the "cp" command to copy the files, and then delete the originals after the copy is complete. The "mv" command may not accomplish this properly due to the way it handles directories.
cp -R ~/data /mnt/HA/groups/myrsrchGrp/users/myname
- You can also use "rsync" -- NB there is a trailing "/" on "~/data/" but not on the destination directory:
mkdir /mnt/HA/groups/myrsrchGrp/users/myname
rsync -av ~/data/ /mnt/HA/groups/myrsrchGrp/users/myname/data
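To see which subdirectories are taking up space before moving anything, and to verify a copy before deleting the originals, something like the following can be used (a sketch; substitute your own group and directory names):
du -h --max-depth=1 ~ | sort -h
cp -R ~/data /mnt/HA/groups/myrsrchGrp/users/myname
diff -r ~/data /mnt/HA/groups/myrsrchGrp/users/myname/data && rm -rf ~/data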
Unable to Create Any New Files
Diagnosing
Any attempt to create a new file fails with an error. The error message will vary by application.
Problem
You have exceeded the quota for your home directory.
Solution
See above: #Disk Quota Exceeded
Unable to Upload Data
Diagnosing
SFTP program gives an error, saying it is unable to upload data.
Problem
You are probably trying to upload to your home directory, which has a quota: on Picotte, it is 64 GB. (Quotas are subject to change.)
Solution
All data should be stored in the group directory:
- Picotte: /ifs/groups/groupnameGrp
See above: #Disk Quota Exceeded
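For example, with a command-line SFTP client you can change to the group directory before uploading (a sketch; "groupnameGrp" and the "data" directory are placeholders, and "picotte001" stands for the full login node hostname):
sftp myname@picotte001
sftp> cd /ifs/groups/groupnameGrp
sftp> put -r data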
TMPDIR local scratch directory does not exist for interactive srun session
This can happen sometimes when requesting multiple tasks in srun rather than multiple cpus (cores).
Diagnosing
Request an interactive session on a GPU node, specifying multiple tasks:
[juser@picotte001]$ srun --ntasks=12 --time=15:00 --pty /bin/bash -l
[juser@node012]$ echo $TMPDIR
/local/scratch/1234567
[juser@node012]$ ls -ld $TMPDIR
/bin/ls: cannot access '/local/scratch/1234567': No such file or directory
Problem
The root cause is not known.
Solution
Request multiple CPUs (i.e. CPU cores) per task, instead; the default number of tasks is one per node.
[juser@picotte001]$ srun --cpus-per-task=12 --time=15:00 --pty /bin/bash -l
[juser@node012]$ echo $TMPDIR
/local/scratch/1234568
[juser@node012]$ ls -ld $TMPDIR
drwxrwxrwt 529 root root 36864 Oct 7 17:58 /local/scratch/1234568
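For batch jobs, the equivalent resource request, with the job staging work in the node-local $TMPDIR scratch directory, might look like the following sketch (the account, time limit, and file names are placeholders):
#!/bin/bash -l
#SBATCH -A somethingPrj
#SBATCH --cpus-per-task=12
#SBATCH --time=15:00
cd $TMPDIR
# ... run the computation here, writing output to local scratch ...
cp -R results $SLURM_SUBMIT_DIR/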