We are observing that the GPU Operator incorrectly sets the node label `nvidia.com/gpu.present=false` on nodes that have compatible NVIDIA GPUs (A10G on g5.12xlarge) and where `nvidia-smi` runs successfully.
✅ Expected Behavior:
- The GPU Operator installs the driver as a DaemonSet (`nvidia-driver-daemonset`)
- The GPU health check passes
- `nvidia.com/gpu.present=true` is set once the GPU is detected and healthy
❌ Actual Behavior:
- No `nvidia-driver-daemonset` is created
- GPU Operator logs show:
  ```
  "Setting node label","Label":"nvidia.com/gpu.present","Value":"false"
  "No GPU node in the cluster, do not create DaemonSets"
  "Failed to detect GPU"
  ```
- The node is a g5.12xlarge with an A10G, and `nvidia-smi` returns correct output when run manually
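The "Failed to detect GPU" message suggests the operator never classified this node as a GPU node in the first place. As far as I understand (an assumption worth verifying), the GPU Operator decides GPU presence from Node Feature Discovery (NFD) PCI labels for NVIDIA's vendor ID `0x10de`, not from `nvidia-smi`. A minimal sketch of checking for that label; the label dump is simulated here, and `<node-name>` is a placeholder:

```shell
# On a live cluster, dump the node's labels one per line first:
#   kubectl get node <node-name> --show-labels | tr ',' '\n' > /tmp/labels.txt
# Simulated here with a node that NFD has labeled correctly.
cat > /tmp/labels.txt <<'EOF'
kubernetes.io/arch=amd64
feature.node.kubernetes.io/pci-10de.present=true
nvidia.com/gpu.present=true
EOF

# The operator keys off the NFD PCI label (vendor 0x10de = NVIDIA),
# so this is the label to look for on the affected node.
if grep -q 'feature.node.kubernetes.io/pci-10de.present=true' /tmp/labels.txt; then
  echo "NFD sees an NVIDIA PCI device"
else
  echo "NFD PCI label missing: operator will report no GPU"
fi
```

If the `pci-10de` label is absent on the real node, the operator's behavior above is consistent, and the root cause is upstream of the operator itself.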
🔧 Configuration:
- GPU Operator version: v25.3.1
- ClusterPolicy driver settings:
  ```yaml
  driver:
    enabled: true
    useNvidiaDriverCRD: false
    usePrecompiled: false
    kernelModuleType: dkms
    version: "570.148.08"
  ```
- OS: Amazon Linux 2 (EKS GPU AMI: amazon-eks-gpu-node-1.32-v20250519)
- K8s: v1.32.3-eks
- `nvidia-smi` works on the node
- Manually labeling the node with `nvidia.com/gpu.present=true` works, but only temporarily (the label is apparently reverted on the next reconcile)
- GPU Operator pod logs confirm it skips DaemonSet deployment because GPU presence detection fails
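If NFD is indeed the detection path, one common cause is NFD being disabled in the chart (or its worker not running on the GPU node), in which case the PCI labels the operator relies on never appear. A hypothetical Helm values fragment for the `gpu-operator` chart; the key names assume the chart's defaults, so verify against your installed values:

```yaml
# values.yaml fragment (assumption: default gpu-operator chart layout).
# With nfd.enabled: false and no externally managed NFD, the operator
# never receives the PCI labels it uses to set nvidia.com/gpu.present.
nfd:
  enabled: true
```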
❓ Question:
What conditions could cause the operator to skip driver deployment and label the node as having no GPU, even when a working GPU is present? Can this behavior be suppressed or debugged further?
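For anyone else debugging this, one way to narrow it down is to determine whether the breakage is NFD-side (no PCI label on the node) or operator-side (label present but ignored). A hedged triage sketch; the `gpu-operator` namespace and DaemonSet names are assumptions based on a default Helm install, and the label file is simulated here:

```shell
# On a live cluster, capture the node's labels first:
#   kubectl get node <node-name> --show-labels | tr ',' '\n' > /tmp/node-labels.txt
# Simulated here with a node where NFD produced no PCI labels at all.
printf 'kubernetes.io/arch=amd64\nkubernetes.io/os=linux\n' > /tmp/node-labels.txt

if grep -q 'pci-10de' /tmp/node-labels.txt; then
  # NFD sees the GPU, so look at the operator side next, e.g.:
  #   kubectl -n gpu-operator logs deploy/gpu-operator
  echo "suspect operator/ClusterPolicy: check operator logs"
else
  # NFD never labeled the device; check its worker on this node, e.g.:
  #   kubectl -n gpu-operator logs ds/gpu-operator-node-feature-discovery-worker
  echo "suspect NFD: check nfd-worker logs on the node"
fi
```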