Job Submission Related

Common reasons for job failures include:

Errors in job script parameters.
Job scripts edited on Windows and saved with Windows (CRLF) line endings.
The number of CPU cores requested exceeds the total number available in the cluster.
An incomplete runtime environment for the job, such as lib*.so files that cannot be found, missing cuDNN libraries, or an incorrect cuDNN library version.

1. Job submission failed with no job ID generated?

Possible Cause: The job submission command or the parameters in the job script contain errors.

Solution: Check the command and the script for errors, correct the parameters, and resubmit the job.
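
Before resubmitting, it can help to catch obvious script errors locally. The following is a minimal sketch, assuming the job script is a Bash script. The function name precheck_script and the file name job.sh are hypothetical. The sbatch option --test-only asks Slurm to validate the script and estimate a start time without actually submitting the job.

```shell
# precheck_script is a hypothetical helper, not part of Slurm.
# It catches a few common mistakes before the script reaches sbatch.
precheck_script() {
    script="$1"
    # The file must exist and be readable.
    [ -r "$script" ] || { echo "cannot read $script" >&2; return 1; }
    # sbatch rejects batch scripts whose first line is not an interpreter line.
    first_line=$(head -n 1 "$script")
    case "$first_line" in
        '#!'*) ;;
        *) echo "$script: first line is not a #! interpreter line" >&2; return 1 ;;
    esac
    # Parse the script without running it (assumes a Bash job script).
    bash -n "$script" || return 1
}

# Typical use on a login node (job.sh is a placeholder name):
#   precheck_script job.sh && sbatch --test-only job.sh
```

This only checks shell syntax and the interpreter line. Invalid #SBATCH values are reported by sbatch itself.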

2. Job remains in the queue indefinitely without starting?

Possible Cause: The requested resources (CPU cores or GPUs) exceed the total capacity of the cluster or the maximum limits of the partition (queue).

Solution: Adjust the resource request (for example, reduce the number of CPU cores or GPUs) so that it fits within the partition limits, and resubmit the job.
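
To see why a job is still waiting and what a partition can actually provide, the standard squeue and sinfo commands can be queried directly. The sketch below wraps them in two hypothetical helper functions (pending_reason and partition_limits are illustrative names, and 12345 is a placeholder job ID).

```shell
# Hypothetical helpers wrapping standard Slurm commands.

# Print the scheduler's reason for a pending job (for example Resources,
# Priority, or PartitionNodeLimit). -h suppresses the header; %r is the
# reason field.
pending_reason() {
    squeue -h -j "$1" -o "%r"
}

# Summarize each partition: name, node count, CPUs per node, GRES (GPUs),
# and time limit.
partition_limits() {
    sinfo -o "%P %D %c %G %l"
}

# Example (12345 is a placeholder job ID):
#   pending_reason 12345
#   partition_limits
#   scontrol show partition    # full per-partition settings such as MaxNodes
```

Comparing the reported reason with the partition summary usually shows whether the request needs to be reduced or the job is simply waiting for busy resources.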

3. Job fails with missing system library files?

Possible Cause: Required lib* files are missing, or the cuDNN version is incompatible with the CUDA version in use.

Solutions:

Locate the required lib* files in the software installation directory and add that directory to the library search path (for example, LD_LIBRARY_PATH).
Contact the administrator to install the missing libraries.
Verify the CUDA version and load the matching cuDNN version.
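
The first and third solutions can be sketched as follows. This is a minimal example under stated assumptions: missing_libs and add_lib_dir are hypothetical helper names, the paths are placeholders, and the module command is only available on clusters that use environment modules.

```shell
# Hypothetical helpers; the paths and module names below are placeholders.

# List shared libraries that the dynamic loader cannot resolve for a binary.
# ldd marks each unresolved library with the text "not found".
missing_libs() {
    ldd "$1" | grep "not found"
}

# Prepend a directory to LD_LIBRARY_PATH, avoiding an empty path entry
# when the variable was previously unset.
add_lib_dir() {
    LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export LD_LIBRARY_PATH
}

# Example inside a job script:
#   missing_libs ./my_app                # which libraries are unresolved?
#   add_lib_dir /path/to/software/lib    # directory containing the lib*.so files
#   nvcc --version                       # CUDA toolkit version, if nvcc is on PATH
#   module avail cudnn                   # only where environment modules are used
```

Setting the path inside the job script, rather than only in the login shell, makes sure the compute node sees the same library search path.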

4. Job script encoding issues?

Possible Cause: Scripts edited on Windows and uploaded to Linux may contain Windows (CRLF) line endings, which Linux shells do not handle correctly.

Solution: Convert the line endings with dos2unix: dos2unix <script_name>
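
A quick way to confirm the problem and fix it is sketched below. The helper names has_crlf and fix_line_endings and the file name job.sh are hypothetical. The fallback with tr is an assumption for systems where dos2unix is not installed.

```shell
# Hypothetical helpers; job.sh is a placeholder file name.

# Succeeds if the file contains carriage returns (the \r of CRLF endings).
has_crlf() {
    grep -q "$(printf '\r')" "$1"
}

# Convert in place with dos2unix when it is installed; otherwise strip
# carriage returns with tr, which is adequate for plain-text job scripts.
fix_line_endings() {
    if command -v dos2unix >/dev/null 2>&1; then
        dos2unix "$1"
    else
        # Write back with cat so the original file permissions are kept.
        tr -d '\r' < "$1" > "$1.tmp" && cat "$1.tmp" > "$1" && rm -f "$1.tmp"
    fi
}

# Example:
#   has_crlf job.sh && fix_line_endings job.sh
```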

5. How to troubleshoot job failures?

Solution: Check the log files specified by the -o (standard output) and -e (standard error) options to identify the root cause.
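
As a rough illustration, the sketch below assumes the job script names its logs with the %j pattern, which Slurm replaces with the job ID. The helper scan_log, the file names, and the keyword list are hypothetical and only meant as a starting point.

```shell
# Assumes the job script contains directives such as:
#   #SBATCH -o job_%j.out     # standard output; %j expands to the job ID
#   #SBATCH -e job_%j.err     # standard error
# The file names are placeholders; scan_log is a hypothetical helper.

# Print numbered lines from a log that mention common failure keywords.
scan_log() {
    grep -n -i -E 'error|not found|no such file|permission denied|killed|out of memory' "$1"
}

# Example (12345 is a placeholder job ID):
#   tail -n 50 job_12345.err
#   scan_log job_12345.err
#   sacct -j 12345 --format=JobID,State,ExitCode   # needs Slurm accounting enabled
```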

6. Communication issues between compute nodes and the management node?

Possible Causes:

Configuration discrepancies: Incomplete hosts files, inconsistent system time, or mismatched Slurm configurations.
Network problems: Faulty network devices or misconfigurations.

Solutions:

Verify and correct configurations across all nodes. Disable the problematic node's slurmd service if necessary.
Use system logs and diagnostic tools to identify and resolve network issues.
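
The checks above might look like the following when run from the management node. This is a sketch under assumptions: node01 is a placeholder node name, the slurm.conf location varies by installation (/etc/slurm/slurm.conf is used here only as an example), and same_config is a hypothetical helper.

```shell
# Hypothetical helper; node names and the slurm.conf path are placeholders.

# Succeeds if two copies of a configuration file have identical contents.
same_config() {
    [ "$(cksum < "$1")" = "$(cksum < "$2")" ]
}

# Example, run from the management node:
#   scontrol ping                     # are the slurmctld controllers responding?
#   sinfo -R                          # nodes that are down or drained, with reasons
#   scontrol show node node01         # state and reason for one node
#   getent hosts node01               # does the node name resolve?
#   date; ssh node01 date             # compare clocks between the two machines
#   scp node01:/etc/slurm/slurm.conf /tmp/node01.conf
#   same_config /etc/slurm/slurm.conf /tmp/node01.conf || echo "slurm.conf differs"
```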

7. How to requeue a completed or failed job?

Solution: Use scontrol requeue to reset the job status to PENDING: scontrol requeue <job_id>
