
Job Submission Issues

Common reasons for job failures include:

  • Errors in job script parameters.
  • Job scripts saved with Windows (CRLF) line endings.
  • More CPU cores requested than the cluster has in total.
  • An incomplete job environment: required lib*.so shared libraries cannot be found, or the CuDNN library is missing or of the wrong version.

1. Job submission failed with no job ID generated?

Possible Cause: The job submission command or the parameters in the job script contain errors.

Solution: Read the error message printed at submission, correct the command or parameters, and resubmit the job.
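One way to catch parameter errors before submitting is sbatch's --test-only option, which validates the script and its options without creating a job. The sketch below wraps it in a helper; the function name check_job_script and the file name job.sh are illustrative assumptions, not part of the cluster's tooling.

```shell
# Validate a job script without submitting it (no job ID is created).
# check_job_script is a hypothetical helper name.
check_job_script() {
    script="$1"
    if [ ! -f "$script" ]; then
        echo "Script not found: $script" >&2
        return 1
    fi
    # --test-only validates the script and options; nothing is queued.
    sbatch --test-only "$script"
}
```

Usage: check_job_script job.sh. If sbatch reports an error here, fix the flagged option before running a real sbatch job.sh.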

2. Job remains in the queue indefinitely without starting?

Possible Cause: The requested resources (CPU cores or GPUs) exceed the total cluster capacity or the maximum limits of the partition (queue).

Solution: Reduce the request (for example, fewer CPU cores or GPUs) so that it fits within those limits, then resubmit the job.
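To see why a job is still waiting, it can help to ask Slurm for the pending reason and compare the request with what each partition offers. The following is a minimal sketch; the helper names pending_reason and partition_limits are assumptions introduced here for illustration.

```shell
# Print job ID, state, and the reason Slurm gives for the job's state.
# -h suppresses the header; %i = job ID, %T = state, %r = reason.
pending_reason() {
    squeue -h -j "$1" -o "%i %T %r"
}

# Summarize each partition so the request can be compared against it.
# %P = partition, %D = node count, %c = CPUs per node,
# %G = generic resources (e.g. GPUs), %l = time limit.
partition_limits() {
    sinfo -o "%P %D %c %G %l"
}
```

Usage: pending_reason <job_id>, then partition_limits. A reason such as Resources or Priority usually means the job is simply waiting its turn, while a reason such as PartitionNodeLimit suggests the request does not fit the partition as configured.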

3. Job fails with missing system library files?

Possible Cause: Missing lib* files or incompatible CuDNN versions.

Solutions:

  • Locate the required lib* files in the software installation directory and add that directory to the library search path (LD_LIBRARY_PATH).
  • Contact the administrator to install the missing libraries.
  • Verify the CUDA version and load the correct CuDNN version.
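The first solution above can be sketched with ldd, which reports shared libraries the dynamic linker cannot resolve, followed by an update to LD_LIBRARY_PATH. The helper names missing_libs and add_lib_path, and any paths passed to them, are hypothetical.

```shell
# List shared libraries that the dynamic linker cannot resolve
# for the given executable or .so file.
missing_libs() {
    ldd "$1" | awk '/not found/ { print $1 }'
}

# Prepend a directory to LD_LIBRARY_PATH; the ${VAR:+...} form avoids
# a trailing colon when the variable is empty or unset.
add_lib_path() {
    export LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}
```

Usage: run missing_libs on the program that failed, locate each reported file in the software installation directory, and call add_lib_path with that directory inside the job script before the program starts. On clusters that provide environment modules (an assumption about the site setup), module avail lists the CUDA and CuDNN versions that can be loaded instead.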

4. Job script encoding issues?

Possible Cause: Scripts edited on Windows and uploaded to Linux keep Windows (CRLF) line endings, which Linux shells do not handle correctly.

Solution: Convert the line endings with dos2unix.
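A short sketch of detecting and fixing the problem follows. The helper names has_crlf and fix_line_endings are illustrative assumptions; the fallback with tr is a substitute for dos2unix on systems where that tool is not installed.

```shell
# Succeed if the file contains at least one carriage return (CR).
has_crlf() {
    grep -q "$(printf '\r')" "$1"
}

# Convert the file in place: use dos2unix when available,
# otherwise delete every CR character with tr.
fix_line_endings() {
    if command -v dos2unix >/dev/null 2>&1; then
        dos2unix "$1"
    else
        # Write back through cat so the file keeps its permissions.
        tr -d '\r' < "$1" > "$1.tmp" && cat "$1.tmp" > "$1" && rm -f "$1.tmp"
    fi
}
```

Usage: has_crlf job.sh && fix_line_endings job.sh, where job.sh stands for your own script. Resubmit the job after the conversion.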

5. How to troubleshoot job failures?

Solution: Check the log files written to the paths given by the -o (output) and -e (error) parameters to identify the root cause.
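As an illustration, suppose the job script names its logs with the job ID, using the directives #SBATCH -o job_%j.out and #SBATCH -e job_%j.err (%j is replaced by the job ID). Those file names are an assumption for this sketch, as is the helper name show_job_logs.

```shell
# Show the tail of the error and output logs for one job.
# Assumes logs are named job_<id>.err and job_<id>.out in the
# current directory (see the #SBATCH directives described above).
show_job_logs() {
    job_id="$1"
    for f in "job_${job_id}.err" "job_${job_id}.out"; do
        if [ -s "$f" ]; then
            echo "== $f =="
            tail -n 20 "$f"
        else
            echo "== $f is missing or empty =="
        fi
    done
}
```

Usage: show_job_logs <job_id> from the directory where the job was submitted. If job accounting is enabled on the cluster, sacct -j <job_id> --format=JobID,State,ExitCode additionally reports the final state and exit code.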

6. Communication issues between compute nodes and the management node?

Possible Causes:

  • Configuration discrepancies: Incomplete hosts files, inconsistent system time, or mismatched Slurm configurations.
  • Network problems: Faulty network devices or misconfigurations.

Solutions:

  • Verify and correct the configuration on all nodes. If necessary, stop the slurmd service on the problematic node until it is fixed.
  • Use system logs and diagnostic tools to identify and resolve network issues.
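A first pass at these checks might look like the sketch below, run from the management node or a login node. The function name check_node is hypothetical, and the node name is whatever your site uses; the Slurm commands themselves (scontrol ping, sinfo -R, scontrol show node) are standard.

```shell
# Basic connectivity and state checks for one compute node.
check_node() {
    node="$1"
    # Does the name resolve (via the hosts file or DNS) on this machine?
    getent hosts "$node" || echo "cannot resolve $node"
    # Do the slurmctld controller(s) respond?
    scontrol ping
    # Nodes marked down or drained, with the reason Slurm recorded.
    sinfo -R
    # State and reason lines for this specific node.
    scontrol show node "$node" | grep -E 'State=|Reason='
}
```

Usage: check_node <node_name>. To check for inconsistent system time, compare the output of date on the node with the management node; for deeper diagnosis, consult the slurmd and slurmctld log files at the locations configured for your site.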

7. How to requeue a completed or failed job?

Solution: Use scontrol requeue, which returns a batch job to the PENDING state so that it is scheduled again: scontrol requeue <job_id>