Cannot allocate GPU in Slurm

I'm having trouble allocating GPU resources on a Slurm cluster.

When I request 1 GPU and run srun as shown below, it reports that the gres resources cannot be allocated. The result is the same if I request more than one GPU.

$ srun --gres=gpu:1 --pty bash
srun: error: Unable to create step for job 73: Invalid generic resource (gres) specification
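
The wording of the error ("Unable to create step for job 73") may be a hint worth ruling out: srun creates a step inside an existing job when SLURM_JOB_ID is already set in the environment, for example inside a salloc shell. As a minimal sketch, assuming a POSIX shell, the helper below (check_allocation is a hypothetical name, not a Slurm command) reports whether that is the case in the current shell.

```shell
#!/bin/sh
# Hypothetical helper: report whether this shell already belongs to a
# Slurm allocation. Slurm sets SLURM_JOB_ID inside salloc/sbatch/srun
# environments; it is normally unset in a plain login shell.
check_allocation() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        echo "inside job ${SLURM_JOB_ID}"
    else
        echo "no allocation"
    fi
}

check_allocation
```

If this prints a job ID, `scontrol show job "$SLURM_JOB_ID"` shows what that allocation actually requested. Whether this is the cause here is only an assumption.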

The compute nodes' gres information seems to be reported correctly, as shown below:

$ sinfo -o "%20N  %10c  %10m  %25f  %10G "
NODELIST              CPUS        MEMORY      AVAIL_FEATURES             GRES       
gpu_svr[1-4]          72          515484      (null)                     gpu:8

The node configuration in slurm.conf is as follows:

/etc/slurm/slurm.conf

GresTypes=gpu
NodeName=gpu_svr1 NodeAddr=x.x.x.1 CPUs=72 RealMemory=515484 Sockets=2 CoresPerSocket=18 ThreadsPerCore=2 Gres=gpu:8 State=UNKNOWN
NodeName=gpu_svr2 NodeAddr=x.x.x.2 CPUs=72 RealMemory=515484 Sockets=2 CoresPerSocket=18 ThreadsPerCore=2 Gres=gpu:8 State=UNKNOWN
NodeName=gpu_svr3 NodeAddr=x.x.x.3 CPUs=72 RealMemory=515484 Sockets=2 CoresPerSocket=18 ThreadsPerCore=2 Gres=gpu:8 State=UNKNOWN
NodeName=gpu_svr4 NodeAddr=x.x.x.4 CPUs=72 RealMemory=515484 Sockets=2 CoresPerSocket=18 ThreadsPerCore=2 Gres=gpu:8 State=UNKNOWN
PartitionName=v100 Nodes=ALL Default=YES MaxTime=INFINITE State=UP

Here is gres.conf on the compute nodes:

gres.conf 

NodeName=gpu_svr[1-4] Name=gpu File=/dev/nvidia[0-7]
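
One sanity check that may be worth running on each compute node is whether the device files named in gres.conf actually exist and whether their number matches Gres=gpu:8 in slurm.conf. The sketch below assumes a POSIX shell; count_gpu_devices is a hypothetical helper, and the directory argument exists only so that the function can be exercised outside /dev.

```shell
#!/bin/sh
# Hypothetical helper: count device files named nvidia0..nvidia7 in a
# directory (default /dev), mirroring File=/dev/nvidia[0-7] in gres.conf.
count_gpu_devices() {
    dir="${1:-/dev}"
    n=0
    for f in "$dir"/nvidia[0-7]; do
        # An unmatched glob stays literal, so test that the path exists.
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Expected to print 8 on a node that matches the configuration above.
count_gpu_devices /dev
```

A count other than 8 would suggest that gres.conf and the hardware visible on that node disagree. `scontrol show node gpu_svr1` can then be used to see which Gres value the controller has registered for the node.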


Read more here: https://stackoverflow.com/questions/65701099/cannot-allocate-gpu-in-slurm

Content Attribution

This content was originally published by SEUNG SIK KIM at Recent Questions - Stack Overflow, and is syndicated here via their RSS feed. You can read the original post over there.
