Job Submission Issues
Common reasons for job failures include:
- Errors in job script parameters.
- Job scripts edited on Windows that contain Windows (CRLF) line endings.
- The number of CPU cores requested exceeds the total number available in the cluster.
- Missing runtime dependencies for the job, such as lib*.so files that are not on the library search path, missing cuDNN libraries, or an incorrect cuDNN library version.
1. Job submission failed with no job ID generated?
Possible Cause: There are errors in the job submission command or in the parameters within the job script.
Solution: Read the error message printed at submission, correct the command or the script parameters, and resubmit the job.
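A script can be checked before it is resubmitted. The following is a minimal sketch, assuming the job script is a bash script; the function name check_job_script and the file name job.sh are hypothetical. bash -n catches shell syntax errors, and sbatch --test-only asks Slurm to validate the request and estimate a start time without actually submitting the job.

```shell
# check_job_script: validate a job script without submitting it.
# The function name is illustrative; it is not part of Slurm.
check_job_script() {
    local script="$1"
    # Shell syntax check only. #SBATCH lines are plain comments to bash.
    bash -n "$script" || return 1
    # Ask Slurm to validate the resource request; no job is queued.
    sbatch --test-only "$script"
}

# Usage (job.sh is a hypothetical file name):
#   check_job_script job.sh
```

If the syntax check fails, the function stops before contacting Slurm, so the two kinds of error are easy to tell apart.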
2. Job remains in the queue indefinitely without starting?
Possible Cause: The requested resources (CPU cores or GPUs) exceed the total cluster capacity or the maximum limits of the partition (queue).
Solution: Adjust the resource requests (e.g., reduce CPU cores or GPUs) and resubmit the job.
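To see why a job is still pending, the scheduler's stated reason can be compared with the partition limits. This is a sketch under the assumption that squeue and sinfo are available on the login node; the function names are hypothetical.

```shell
# pending_reason: print the job ID, state, and scheduler reason for one job.
pending_reason() {
    # %i = job ID, %T = job state, %r = reason the job is waiting
    squeue --noheader -j "$1" -o "%i %T %r"
}

# partition_limits: print the size and limits of each partition.
partition_limits() {
    # %P = partition, %D = node count, %c = CPUs per node,
    # %G = generic resources (e.g., GPUs), %l = time limit
    sinfo -o "%P %D %c %G %l"
}

# Usage:
#   pending_reason <job_id>
#   partition_limits
```

A reason such as Resources or Priority usually means the job is simply waiting its turn, while a reason such as PartitionNodeLimit suggests the request exceeds what the partition allows and should be reduced.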
3. Job fails with missing system library files?
Possible Cause: Required lib*.so files are missing or not on the library search path, or the cuDNN version is incompatible with the CUDA version in use.
Solutions:
- Locate the required lib* files in the software installation directory and configure the library path.
- Contact the administrator to install the missing libraries.
- Verify the CUDA version and load the matching cuDNN version.
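The first solution above can be sketched as follows. ldd reports which shared libraries a program cannot resolve, and LD_LIBRARY_PATH tells the dynamic linker where else to look. The function names and the paths in the usage lines are hypothetical.

```shell
# missing_libs: list shared libraries the dynamic linker cannot resolve.
# Prints nothing (and returns non-zero) when every library is found.
missing_libs() {
    ldd "$1" 2>/dev/null | grep "not found"
}

# prepend_lib_path: put a directory at the front of LD_LIBRARY_PATH.
prepend_lib_path() {
    export LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}

# Usage (program and directory are hypothetical):
#   missing_libs ./my_program
#   prepend_lib_path /opt/mysoftware/lib
```

The export belongs in the job script itself so that it takes effect on the compute node. On clusters that provide environment modules, loading the matching CUDA and cuDNN modules normally sets these paths; module names are site-specific.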
4. Job script encoding issues?
Possible Cause: Scripts edited on Windows and uploaded to Linux have Windows (CRLF) line endings, which the Linux shell cannot interpret correctly.
Solution: Convert the line endings using dos2unix: dos2unix <script_file>
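Where dos2unix is not installed, the same check and conversion can be approximated with standard tools. This is a sketch assuming GNU grep and GNU sed; the function names are hypothetical.

```shell
# has_crlf: succeed if the file contains at least one carriage return.
has_crlf() {
    grep -q "$(printf '\r')" "$1"
}

# strip_crlf: remove the trailing carriage return from every line, in place.
# Has the same effect on line endings as dos2unix; requires GNU sed.
strip_crlf() {
    sed -i 's/\r$//' "$1"
}

# Usage (job.sh is a hypothetical file name):
#   has_crlf job.sh && strip_crlf job.sh
```

The file command gives a quick visual check as well: it reports "with CRLF line terminators" for scripts saved with Windows line endings.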
5. How to troubleshoot job failures?
Solution: Check the log files specified by the -e (standard error) and -o (standard output) parameters of the job to identify the root cause.
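As an illustration, the sketch below assumes the job script names its logs job_%j.out and job_%j.err, where %j expands to the job ID. Those file names and both function names are hypothetical, and sacct only returns data when job accounting is enabled on the cluster.

```shell
# Assumed job script directives (%j expands to the job ID):
#   #SBATCH -o job_%j.out
#   #SBATCH -e job_%j.err

# show_job_logs: print the last 20 lines of the error and output logs.
show_job_logs() {
    tail -n 20 "job_$1.err" "job_$1.out"
}

# job_exit_state: print the final state and exit code recorded for a job.
job_exit_state() {
    sacct -j "$1" --format=JobID,JobName,State,ExitCode
}

# Usage:
#   show_job_logs <job_id>
#   job_exit_state <job_id>
```

Reading the error log first usually shortens the search, since the failing command's message tends to appear in its last lines.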
6. Communication issues between compute nodes and the management node?
Possible Causes:
- Configuration discrepancies: Incomplete hosts files, inconsistent system time, or mismatched Slurm configurations.
- Network problems: Faulty network devices or misconfigurations.
Solutions:
- Verify and correct the hosts files, system time, and Slurm configuration on all nodes so that they are consistent. Disable the problematic node's slurmd service if necessary.
- Use system logs and diagnostic tools to identify and resolve network issues.
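A few standard commands cover the first round of checks. The sketch below is a starting point and not a complete diagnostic procedure; the function names are hypothetical, and the slurm.conf location in the usage lines is an assumption that varies by installation. Note that drain_node uses draining, which removes a node from scheduling without stopping its slurmd service, as an alternative to disabling the service; it requires administrator privileges.

```shell
# check_controller: ask whether the Slurm controller(s) respond.
check_controller() {
    scontrol ping
}

# list_down_nodes: list down, drained, or failing nodes with the recorded reason.
list_down_nodes() {
    sinfo -R
}

# config_checksum: print a checksum of a configuration file.
# Run it on each node and compare the values to spot mismatched copies.
config_checksum() {
    md5sum "$1" | cut -d' ' -f1
}

# drain_node: stop scheduling new jobs on a node until it is repaired.
drain_node() {
    scontrol update NodeName="$1" State=DRAIN Reason="$2"
}

# Usage (node name and file path are hypothetical):
#   check_controller
#   list_down_nodes
#   config_checksum /etc/slurm/slurm.conf
#   drain_node node01 "network fault"
```

Running date on the management node and on the affected compute node is a quick way to spot inconsistent system time.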
7. How to requeue a completed or failed job?
Solution: Use scontrol requeue to reset the job status to PENDING: scontrol requeue <job_id>