kubernetes-1.35: add kubelet-env-nvidia template #860

Open
arnaldo2792 wants to merge 1 commit into bottlerocket-os:develop from arnaldo2792:label-nodes/core-kit

Contributor

arnaldo2792 commented Mar 10, 2026

Issue number:

Related to: bottlerocket-os/bottlerocket#4756

Description of changes:

Add a kubelet-env-nvidia template that hardcodes the nvidia.com/gpu.present=true node label. NVIDIA components such as the
DRA driver use this label in their deployment nodeAffinity. Without it, they refuse to schedule on NVIDIA GPU hosts.
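To illustrate why the label matters for scheduling, here is a minimal sketch of how a required nodeAffinity term is evaluated against node labels. The `GPU_TERMS` value is an assumption for illustration only; the exact affinity shipped by the DRA driver is not shown in this PR. Terms are ORed, expressions inside a term are ANDed, and an empty term matches nothing.

```python
def matches_expression(labels, expr):
    """Evaluate one matchExpression against a node's label map."""
    key, op = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if op == "In":
        return key in labels and labels[key] in values
    if op == "NotIn":
        # A missing key also satisfies NotIn.
        return key not in labels or labels[key] not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    raise ValueError(f"unsupported operator in this sketch: {op}")


def node_is_eligible(labels, node_selector_terms):
    """Terms are ORed; expressions within a term are ANDed."""
    for term in node_selector_terms:
        exprs = term.get("matchExpressions", [])
        # An empty term matches no nodes.
        if exprs and all(matches_expression(labels, e) for e in exprs):
            return True
    return False


# Hypothetical affinity term, assumed for illustration only.
GPU_TERMS = [
    {
        "matchExpressions": [
            {"key": "nvidia.com/gpu.present", "operator": "In", "values": ["true"]}
        ]
    }
]

labeled = {"kubernetes.io/os": "linux", "nvidia.com/gpu.present": "true"}
unlabeled = {"kubernetes.io/os": "linux"}
print(node_is_eligible(labeled, GPU_TERMS))
print(node_is_eligible(unlabeled, GPU_TERMS))
```

Under that assumed term, a GPU host without the label is never eligible, which matches the behavior described above.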

Testing done:

In combination with: bottlerocket-os/bottlerocket#4784

Launched a Kubernetes 1.35 NVIDIA variant and confirmed the label is registered:

if k get nodes --show-labels | rg -q nvidia.com;
   echo "Label present"
end

Label present
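The check above matches any label text containing `nvidia.com`. A stricter variant could look for the exact label key in the JSON node list. This is a sketch, assuming the input has the shape produced by `kubectl get nodes -o json` (a list object with `items[].metadata.labels`); the helper name `nodes_with_label` and the sample node names are hypothetical.

```python
import json


def nodes_with_label(node_list_json, key):
    """Return {node name: label value} for every node that carries `key`."""
    node_list = json.loads(node_list_json)
    found = {}
    for node in node_list.get("items", []):
        meta = node.get("metadata", {})
        labels = meta.get("labels") or {}
        if key in labels:
            found[meta.get("name", "")] = labels[key]
    return found


# Hypothetical sample standing in for real cluster output.
SAMPLE = json.dumps(
    {
        "items": [
            {"metadata": {"name": "node-a", "labels": {"kubernetes.io/os": "linux"}}},
            {
                "metadata": {
                    "name": "node-b",
                    "labels": {
                        "kubernetes.io/os": "linux",
                        "nvidia.com/gpu.present": "true",
                    },
                }
            },
        ]
    }
)

print(nodes_with_label(SAMPLE, "nvidia.com/gpu.present"))
```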

Confirmed that the only new label is nvidia.com/gpu.present:

=== Label Diff ===
  Node A: ip-192-168-58-243.us-west-2.compute.internal
  Node B: ip-192-168-67-54.us-west-2.compute.internal
  Total labels: 16 (identical: 11, different: 4, only-A: 0, only-B: 1)

≠  Different values:
     LABEL                                     NODE A                                        NODE B
  ---------------------------------------------------------------------------------------------------------------------------------------
   ≠ failure-domain.beta.kubernetes.io/zone    us-west-2c                                    us-west-2a
   ≠ kubernetes.io/hostname                    ip-192-168-58-243.us-west-2.compute.internal  ip-192-168-67-54.us-west-2.compute.internal
   ≠ topology.k8s.aws/zone-id                  usw2-az3                                      usw2-az2
   ≠ topology.kubernetes.io/zone               us-west-2c                                    us-west-2a

→  Only on Node B:
     LABEL                                     NODE A                                        NODE B
  ---------------------------------------------------------------------------------------------------------------------------------------
   → nvidia.com/gpu.present                                                                  true

=  Identical:
     LABEL                                     NODE A                                        NODE B
  ---------------------------------------------------------------------------------------------------------------------------------------
     alpha.eksctl.io/cluster-name              bottlerocket-test-k8s-1-34                    bottlerocket-test-k8s-1-34
     alpha.eksctl.io/nodegroup-name            aws-k8s-1-34-x86-64                           aws-k8s-1-34-x86-64
     beta.kubernetes.io/arch                   amd64                                         amd64
     beta.kubernetes.io/instance-type          g4dn.2xlarge                                  g4dn.2xlarge
     beta.kubernetes.io/os                     linux                                         linux
     failure-domain.beta.kubernetes.io/region  us-west-2                                     us-west-2
     k8s.io/cloud-provider-aws                 f03e107b1cf397a788e2ef10f07cdab3              f03e107b1cf397a788e2ef10f07cdab3
     kubernetes.io/arch                        amd64                                         amd64
     kubernetes.io/os                          linux                                         linux
     node.kubernetes.io/instance-type          g4dn.2xlarge                                  g4dn.2xlarge
     topology.kubernetes.io/region             us-west-2                                     us-west-2
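The tool that produced the report above is not part of this PR. As a rough sketch of the same classification (identical, different, only on A, only on B), assuming each node's labels are available as a plain dictionary, one could write the following. The function name `diff_labels` is hypothetical and the sample uses a small subset of the labels listed above.

```python
def diff_labels(a, b):
    """Classify label keys from two nodes into four buckets."""
    keys_a, keys_b = set(a), set(b)
    shared = keys_a & keys_b
    return {
        "identical": sorted(k for k in shared if a[k] == b[k]),
        "different": sorted(k for k in shared if a[k] != b[k]),
        "only_a": sorted(keys_a - keys_b),
        "only_b": sorted(keys_b - keys_a),
    }


# Subset of the labels from the report, for illustration.
node_a = {
    "kubernetes.io/arch": "amd64",
    "topology.kubernetes.io/zone": "us-west-2c",
}
node_b = {
    "kubernetes.io/arch": "amd64",
    "topology.kubernetes.io/zone": "us-west-2a",
    "nvidia.com/gpu.present": "true",
}

result = diff_labels(node_a, node_b)
for bucket, keys in result.items():
    print(bucket, keys)
```

The per-node zone and hostname labels are expected to differ, so the bucket of interest is `only_b`.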

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

Add a kubelet-env-nvidia template that hardcodes the
nvidia.com/gpu.present=true node label. NVIDIA components such as the
DRA driver use this label in their deployment nodeAffinity. Without it,
they refuse to schedule on NVIDIA GPU hosts.

Signed-off-by: Arnaldo Garcia Rincon <agarrcia@amazon.com>
std = { version = "v1", helpers = ["join_map"] }
+++
NODE_IP={{settings.kubernetes.node-ip}}
NODE_LABELS=nvidia.com/gpu.present=true,{{join_map "=" "," "no-fail-if-missing" settings.kubernetes.node-labels}}

Does this cause any errors or warnings if no other node labels are present? The rendered line would end with a trailing comma:

NODE_LABELS=nvidia.com/gpu.present=true,

Contributor Author

I have a node that doesn't set any labels through user data, and I still saw the new label applied:

❯ kubectl get nodes -l nvidia.com/gpu.present=true
NAME                                           STATUS   ROLES    AGE     VERSION
ip-192-168-68-129.us-west-2.compute.internal   Ready    <none>   8m10s   v1.35.0-eks-ac2d5a0

I inspected the kubelet logs and I didn't see any warnings or info logs for the labels.
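The exchange above turns on what the template renders when `settings.kubernetes.node-labels` is empty. The sketch below is a stand-in, not the real template engine: `join_map` here is a hypothetical Python approximation of the helper, and it assumes a missing or empty map renders as an empty string. `parse_labels` is likewise a hypothetical consumer that skips empty entries, shown only as one way a trailing comma can be harmless; it makes no claim about how the kubelet parses its flags.

```python
def join_map(kv_sep, pair_sep, mapping):
    """Stand-in for the template helper: join key/value pairs into one string."""
    mapping = mapping or {}  # assumption: a missing map renders as empty
    return pair_sep.join(f"{k}{kv_sep}{v}" for k, v in mapping.items())


def render_node_labels(user_labels):
    """Mimic the NODE_LABELS line from the template in this PR."""
    return "NODE_LABELS=nvidia.com/gpu.present=true," + join_map("=", ",", user_labels)


def parse_labels(value):
    """Hypothetical tolerant parser: split on commas and skip empty entries."""
    labels = {}
    for entry in value.split(","):
        if not entry:
            continue
        key, _, val = entry.partition("=")
        labels[key] = val
    return labels


no_user_labels = render_node_labels({})
with_user_labels = render_node_labels({"team": "ml"})
print(no_user_labels)
print(with_user_labels)
print(parse_labels(no_user_labels.split("=", 1)[1]))
```

With no user labels the rendered value ends in a comma, which is the case the reviewer asked about.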
