
Slurm Job Management

Displaying queue and node information: sinfo

sinfo allows you to see what queues and nodes exist on the system and their status.
For example: sinfo -l

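A few common variations (standard sinfo options; the partition name below is one of the queues from the table that follows):

    sinfo -l                  # long format: availability, time limit, node counts and states per partition
    sinfo -N -l               # one line per node instead of per partition
    sinfo -p i64m512u -l      # restrict the output to a single partition
    sinfo -t idle             # list only nodes that are currently idle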

HPC AI Convergence Intelligent Computing Centre Platform Queues (Phase II)

CPU Nodes Compute Pools

  Queues: i64m512u (sharing), i64m512ue (exclusive)
  Resources: 110 units; CPU: Intel 2*8358P, 32C, 2.6GHz; Memory: 512GB; System disk: SSD 2*960GB; OS: Ubuntu
  Limits: 7 days per job; 1024 cores in total per user account

  Queues: i64m512r (sharing), i64m512re (exclusive)
  Resources: 30 units; CPU: Intel 2*8358P, 32C, 2.6GHz; Memory: 512GB; System disk: SSD 2*960GB; Data disk: SSD 6*1.92TB; OS: Red Hat
  Limits: 7 days per job; 128 cores in total per user account

  Queues: a128m512u (sharing), a128m512ue (exclusive)
  Resources: 20 units; CPU: 2*AMD 7763, 64C, 2.45GHz; Memory: 512GB; System disk: SSD 2*960GB; OS: Ubuntu
  Limits: 7 days per job; 256 cores in total per user on the sharing queue, 128 cores on the exclusive queue

  Queue: long_cpu (shares resources with the i64m512u queue)
  Limits: 14 days per job; 1024 cores in total per user

Large Memory Nodes Compute Pools

  Queues: i96m3tu (sharing), i96m3tue (exclusive)
  Resources: 6 units; CPU: Intel 2*6348H, 24C, 2.3GHz; Memory: 3072GB; System disk: SSD 2*960GB; OS: Ubuntu
  Limits: 7 days per job; 192 cores in total per user

CPU Emergency Pools

  Queue: emergency_cpu (shares resources with the i64m512u queue)
  Limits: 14 days per job; 512 cores in total per user

GPU Nodes Computing Pools

  Queues: i64m1tga800u (sharing), i64m1tga800ue (exclusive)
  Resources: 50 units, hosts gpu1-[1-65]: CPU: Intel 2*8358P, 32C, 2.6GHz; Memory: 1024GB; GPU: 8*A800; System disk: SSD 2*960GB; OS: Ubuntu. 15 units, hosts gpu2-[1-15]: CPU: Intel 2*8358P, 32C, 2.6GHz; Memory: 1024GB; GPU: 8*A800; System disk: SSD 2*960GB; Data disk: SSD 6*1.92TB; OS: Ubuntu
  Limits: i64m1tga800u: 7 days per job, 128 cores and 16 GPU cards in total per user; i64m1tga800ue: 7 days per job, 64 cores and 8 GPU cards in total per user

  Queues: i64m1tga40u (sharing), i64m1tga40ue (exclusive)
  Resources: 14 units; CPU: Intel 2*8358P, 32C, 2.6GHz; Memory: 1024GB; GPU: 8*A40; System disk: SSD 2*960GB; OS: Ubuntu
  Limits: 7 days per job; 128 cores and 16 GPU cards in total per user

  Queue: long_cpu (shares resources with the i64m1tga800u queue)
  Limits: 14 days per job; 128 cores and 16 GPU cards in total per user

GPU A800 Emergency Pools

  Queue: emergency_gpu (shares resources with the i64m1tga800u queue)
  Limits: 14 days per job; 64 cores and 8 GPU cards in total per user

GPU A40 Emergency Pools

  Queue: emergency_gpua40 (shares resources with the i64m1tga40u queue)
  Limits: 14 days per job; 64 cores and 8 GPU cards in total per user

Debug Test

  Queue: debug
  Resources: 6 units in total (1 A40 node and 5 CPU nodes):
    A40 node: CPU: Intel 2*8358P, 32C, 2.6GHz; Memory: 1024GB; GPU: 8*A40; System disk: SSD 2*960GB; OS: Ubuntu
    CPU nodes: CPU: Intel 2*8358P, 32C, 2.6GHz; Memory: 512GB; System disk: SSD 2*960GB; OS: Ubuntu
  Limits: 30 minutes per job; 8 cores and 1 GPU card in total per user

Debug CPU and GPU resources are intended for CPU and CUDA software adaptation, code debugging and tuning, special container environment image tuning, and teaching/training.
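As a sketch only, an interactive debug session that stays within the limits above (8 cores, 1 GPU card, 30 minutes) could be requested roughly like this; the exact GRES name for the GPU may differ on this platform:

    srun -p debug -n 8 --gres=gpu:1 -t 00:30:00 --pty bash   # interactive shell on the debug queue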

Pricing plan: see the platform's published pricing table.

View information about jobs in the queue: squeue

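Typical invocations (standard squeue options; JOBID is a placeholder):

    squeue                    # all jobs in all partitions
    squeue -u $USER           # only your own jobs
    squeue -p i64m1tga800u    # jobs in a single partition
    squeue -j JOBID           # one specific job
    squeue --start            # estimated start times of pending jobs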

View detailed partition (queue) information: scontrol show partition

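Run without an argument it lists every partition; adding a partition name (here one from the queue table above) restricts the output:

    scontrol show partition              # all partitions
    scontrol show partition i64m512u     # a single partition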

View detailed node information: scontrol show node

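A node name can be appended to show a single node; gpu1-1 below is taken from the host ranges in the queue table:

    scontrol show node            # all nodes
    scontrol show node gpu1-1     # a single node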

View detailed job information: scontrol show job $JOBID

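For example, with a placeholder job ID taken from squeue:

    scontrol show job 12345       # full record: partition, nodes, limits, working directory, state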

View a job's live output: speek


Note: This command is not part of standard Slurm; it is a wrapper provided by the platform.
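Because speek is a platform-specific wrapper, its options are not covered here. On a standard Slurm setup a similar effect can usually be obtained by following the job's output file (slurm-<jobid>.out by default):

    tail -f slurm-12345.out       # 12345 is a placeholder job ID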

Terminate a job: scancel job_id

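Common forms (standard scancel options; the job ID and name are placeholders):

    scancel 12345                 # cancel one job by ID
    scancel -u $USER              # cancel all of your own jobs
    scancel --name=myjob          # cancel jobs by name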

Hold a queued job: scontrol hold job_id

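For example (placeholder job ID); a held job stays in the queue but is not scheduled until it is released:

    scontrol hold 12345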

Release a held job: scontrol release job_id

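For example, releasing the job held above:

    scontrol release 12345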

Suspend a running job: scontrol suspend job_id

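For example (placeholder job ID); suspend applies to jobs that are already running:

    scontrol suspend 12345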

Resume a suspended job: scontrol resume job_id

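For example, resuming the job suspended above:

    scontrol resume 12345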