Slurm Job Management
Displaying queue and node information: sinfo
sinfo allows you to see what queues and nodes exist on the system and their status.
For example: sinfo -l
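A few common invocations (the exact columns shown depend on the Slurm version installed; the partition name used with -p is one of the queue names in the table below):

```bash
sinfo                  # one line per partition/state: availability, time limit, node count
sinfo -l               # long format: adds job size, oversubscription and reason fields
sinfo -N -l            # one line per node, with its state and owning partition
sinfo -p i64m512u      # restrict the listing to a single partition
```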
HPC AI Convergence Intelligent Computing Centre Platform Queues (Phase II):

| Type | Queue name | Resources | Description |
|---|---|---|---|
| CPU node compute pools | i64m512u (sharing), i64m512ue (exclusive) | 110 nodes. CPU: 2× Intel 8358P, 32 cores, 2.6 GHz; Memory: 512 GB; System disk: 2× 960 GB SSD; OS: Ubuntu | Limited to 7 days; a user account may use at most 1024 cores in total |
| CPU node compute pools | i64m512r (sharing), i64m512re (exclusive) | 30 nodes. CPU: 2× Intel 8358P, 32 cores, 2.6 GHz; Memory: 512 GB; System disk: 2× 960 GB SSD; Data disk: 6× 1.92 TB SSD; OS: RedHat | Limited to 7 days; a user account may use at most 128 cores in total |
| CPU node compute pools | a128m512u (sharing), a128m512ue (exclusive) | 20 nodes. CPU: 2× AMD 7763, 64 cores, 2.45 GHz; Memory: 512 GB; System disk: 2× 960 GB SSD; OS: Ubuntu | Limited to 7 days; sharing queue: at most 256 cores in total per user; exclusive queue: at most 128 cores in total per user |
| CPU node compute pools | long_cpu | Shares resources with the i64m512u queue | Limited to 14 days; a user may use at most 1024 cores in total |
| Large-memory node compute pools | i96m3tu (sharing), i96m3tue (exclusive) | 6 nodes. CPU: 2× Intel 6348H, 24 cores, 2.3 GHz; Memory: 3072 GB; System disk: 2× 960 GB SSD; OS: Ubuntu | Limited to 7 days; a user may use at most 192 cores in total |
| CPU emergency pools | emergency_cpu | Shares resources with the i64m512u queue | Limited to 14 days; a user may use at most 512 cores in total |
| GPU node compute pools | i64m1tga800u (sharing), i64m1tga800ue (exclusive) | 50 nodes (hosts gpu1-[1-65]). CPU: 2× Intel 8358P, 32 cores, 2.6 GHz; Memory: 1024 GB; GPU: 8× A800; System disk: 2× 960 GB SSD; OS: Ubuntu. Plus 15 nodes (hosts gpu2-[1-15]). CPU: 2× Intel 8358P, 32 cores, 2.6 GHz; Memory: 1024 GB; GPU: 8× A800; System disk: 2× 960 GB SSD; Data disk: 6× 1.92 TB SSD; OS: Ubuntu | i64m1tga800u: limited to 7 days, at most 128 cores and 16 GPU cards in total per user. i64m1tga800ue: limited to 7 days, at most 64 cores and 8 GPU cards in total per user |
| GPU node compute pools | i64m1tga40u (sharing), i64m1tga40ue (exclusive) | 14 nodes. CPU: 2× Intel 8358P, 32 cores, 2.6 GHz; Memory: 1024 GB; GPU: 8× A40; System disk: 2× 960 GB SSD; OS: Ubuntu | Limited to 7 days; a user may use at most 128 cores and 16 GPU cards in total |
| GPU node compute pools | long_cpu | Shares resources with the i64m1tga800u queue | Limited to 14 days; a user may use at most 128 cores and 16 GPU cards in total |
| GPU A800 emergency pools | emergency_gpu | Shares resources with the i64m1tga800u queue | Limited to 14 days; a user may use at most 64 cores and 8 GPU cards in total |
| GPU A40 emergency pools | emergency_gpua40 | Shares resources with the i64m1tga40u queue | Limited to 14 days; a user may use at most 64 cores and 8 GPU cards in total |
| Debug/test | debug | 6 nodes in total: 1 A40 node (CPU: 2× Intel 8358P, 32 cores, 2.6 GHz; Memory: 1024 GB; GPU: 8× A40; System disk: 2× 960 GB SSD; OS: Ubuntu) and 5 CPU nodes (CPU: 2× Intel 8358P, 32 cores, 2.6 GHz; Memory: 512 GB; System disk: 2× 960 GB SSD; OS: Ubuntu) | Limited to 30 minutes; a user may use at most 8 cores and 1 GPU card in total. Intended for CPU/CUDA software adaptation, code debugging and tuning, tuning of special container environment images, and teaching/training |
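The queue names above are used as Slurm partition names when submitting jobs. Below is a minimal sketch of a GPU batch script, assuming the i64m1tga800u partition, the usual gres/gpu GPU configuration, and a hypothetical workload ./train.py; adjust cores, GPU cards and wall time to stay within the per-user limits in the table:

```bash
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --partition=i64m1tga800u   # queue name from the table above
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8        # CPU cores requested for this job
#SBATCH --gres=gpu:2               # GPU cards; must stay within the per-user card limit
#SBATCH --time=1-00:00:00          # wall time; this queue caps jobs at 7 days

python ./train.py                  # hypothetical workload
```

Submit the script with `sbatch job.sh`.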
View information about jobs in the queue: squeue
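Typical usage (standard squeue options; output columns can be customised with -o/--format):

```bash
squeue                 # all jobs currently pending or running
squeue -u $USER        # only your own jobs
squeue -p i64m512u     # jobs in a specific partition
squeue -j 12345        # a single job by job ID
```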
View detailed partition (queue) information: scontrol show partition
View detailed node information: scontrol show node
View detailed job information: scontrol show job $JOBID
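For example (standard scontrol subcommands; the node name and job ID below are placeholders):

```bash
scontrol show partition i64m512u   # limits and node list for one queue
scontrol show node gpu1-1          # CPUs, memory, GRES and state of one node
scontrol show job 12345            # full record of one job: resources, paths, reason
```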
View a job's output while it is running: speek
Note: speek is not a standard Slurm command; it is a wrapper provided by this platform.
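If speek is unavailable, a similar effect can usually be achieved by following the job's output file directly; by default Slurm writes standard output to slurm-<jobid>.out in the submission directory (the job ID below is a placeholder):

```bash
tail -f slurm-12345.out    # follow the job's standard output as it is written
```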
Terminate a job: scancel job_id
Hold a queued job: scontrol hold job_id
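For example (job IDs are placeholders; scontrol release is the standard counterpart to hold):

```bash
scancel 12345              # cancel a pending or running job
scontrol hold 12345        # keep a pending job from starting
scontrol release 12345     # allow a held job to be scheduled again
```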