kubernetes-1.35: add kubelet-env-nvidia template #860
Open
arnaldo2792 wants to merge 1 commit into bottlerocket-os:develop
Add a kubelet-env-nvidia template that hardcodes the nvidia.com/gpu.present=true node label. NVIDIA components such as the DRA driver use this label in their deployment nodeAffinity. Without it, they refuse to schedule on NVIDIA GPU hosts.
Signed-off-by: Arnaldo Garcia Rincon <agarrcia@amazon.com>
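To illustrate why a missing label keeps such components off a node, here is a minimal sketch of how a single nodeAffinity term with `In` and `Exists` expressions is evaluated against node labels. The `term_matches` function and the `gpu_term` expression are hypothetical and written for illustration only; they are not copied from the DRA driver's manifests or from the Kubernetes scheduler.

```python
def term_matches(node_labels, match_expressions):
    """Return True if every expression of one nodeSelectorTerm matches the node."""
    for expr in match_expressions:
        key, op = expr["key"], expr["operator"]
        if op == "In":
            # A missing label yields None, which is never in the allowed values.
            if node_labels.get(key) not in expr["values"]:
                return False
        elif op == "Exists":
            if key not in node_labels:
                return False
        else:
            raise ValueError(f"operator not covered by this sketch: {op}")
    return True


# Hypothetical affinity term keyed on the label this PR adds.
gpu_term = [{"key": "nvidia.com/gpu.present", "operator": "In", "values": ["true"]}]

without_label = {"kubernetes.io/os": "linux"}
with_label = {**without_label, "nvidia.com/gpu.present": "true"}

print(term_matches(without_label, gpu_term))  # False
print(term_matches(with_label, gpu_term))     # True
```

Under this model, a node without the label never satisfies the term, which is consistent with the scheduling refusal described above.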
yeazelm reviewed Mar 10, 2026
ytsssun approved these changes Mar 12, 2026
yeazelm approved these changes Mar 12, 2026
bcressey reviewed Mar 12, 2026
std = { version = "v1", helpers = ["join_map"] }
+++
NODE_IP={{settings.kubernetes.node-ip}}
NODE_LABELS=nvidia.com/gpu.present=true,{{join_map "=" "," "no-fail-if-missing" settings.kubernetes.node-labels}}
Contributor
This doesn't cause any errors or warnings if no other node labels are present?
NODE_LABELS=nvidia.com/gpu.present=true,
Contributor
Author
I have a node that doesn't set any labels through user data, and I still saw the new label applied:
❯ kubectl get nodes -l nvidia.com/gpu.present=true
NAME                                           STATUS   ROLES    AGE     VERSION
ip-192-168-68-129.us-west-2.compute.internal   Ready    <none>   8m10s   v1.35.0-eks-ac2d5a0
I inspected the kubelet logs and didn't see any warnings or info logs for the labels.
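The trailing-comma case raised in this thread can be reproduced with a small simulation. This is a sketch under stated assumptions: `join_map` here is a Python stand-in for the template helper (it drops the "no-fail-if-missing" argument and treats a missing map as empty), and `parse_labels_lenient` is a hypothetical parser that skips empty entries. That lenient behavior would be consistent with the observation above, but this thread does not confirm how the kubelet actually parses the value.

```python
def join_map(kv_sep, pair_sep, mapping):
    """Stand-in for the template helper: join key/value pairs into one string."""
    return pair_sep.join(f"{k}{kv_sep}{v}" for k, v in (mapping or {}).items())


def render_node_labels(user_labels):
    """Mimic the NODE_LABELS line of the template under review."""
    return "NODE_LABELS=nvidia.com/gpu.present=true," + join_map("=", ",", user_labels)


def parse_labels_lenient(value):
    """Hypothetical parser: split on commas and ignore empty entries."""
    labels = {}
    for entry in value.split(","):
        if not entry:
            continue  # a trailing comma produces an empty entry; skip it
        key, _, val = entry.partition("=")
        labels[key] = val
    return labels


# No user-supplied labels: the rendered line ends with a comma.
print(render_node_labels({}))
# With one user-supplied label, the comma acts as a normal separator.
print(render_node_labels({"team": "ml"}))
# A lenient parser still recovers only the hardcoded label.
print(parse_labels_lenient("nvidia.com/gpu.present=true,"))
```

An alternative design would be to emit the separator only when user labels exist, which avoids depending on how the consumer treats empty entries.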
Issue number:
Related to: bottlerocket-os/bottlerocket#4756
Description of changes:
Add a kubelet-env-nvidia template that hardcodes the nvidia.com/gpu.present=true node label. NVIDIA components such as the DRA driver use this label in their deployment nodeAffinity. Without it, they refuse to schedule on NVIDIA GPU hosts.
Testing done:
In combination with: bottlerocket-os/bottlerocket#4784
Launched a Kubernetes 1.35 NVIDIA variant and confirmed the label is registered:
Confirmed that the only new label is nvidia.com/gpu.present:
=== Label Diff ===
Node A: ip-192-168-58-243.us-west-2.compute.internal
Node B: ip-192-168-67-54.us-west-2.compute.internal
Total labels: 16 (identical: 11, different: 4, only-A: 0, only-B: 1)

≠ Different values:
  LABEL                                     NODE A                                        NODE B
  ---------------------------------------------------------------------------------------------------------------------------------------
≠ failure-domain.beta.kubernetes.io/zone    us-west-2c                                    us-west-2a
≠ kubernetes.io/hostname                    ip-192-168-58-243.us-west-2.compute.internal  ip-192-168-67-54.us-west-2.compute.internal
≠ topology.k8s.aws/zone-id                  usw2-az3                                      usw2-az2
≠ topology.kubernetes.io/zone               us-west-2c                                    us-west-2a

→ Only on Node B:
  LABEL                                     NODE A                                        NODE B
  ---------------------------------------------------------------------------------------------------------------------------------------
→ nvidia.com/gpu.present                                                                  true

= Identical:
  LABEL                                     NODE A                                        NODE B
  ---------------------------------------------------------------------------------------------------------------------------------------
  alpha.eksctl.io/cluster-name              bottlerocket-test-k8s-1-34                    bottlerocket-test-k8s-1-34
  alpha.eksctl.io/nodegroup-name            aws-k8s-1-34-x86-64                           aws-k8s-1-34-x86-64
  beta.kubernetes.io/arch                   amd64                                         amd64
  beta.kubernetes.io/instance-type          g4dn.2xlarge                                  g4dn.2xlarge
  beta.kubernetes.io/os                     linux                                         linux
  failure-domain.beta.kubernetes.io/region  us-west-2                                     us-west-2
  k8s.io/cloud-provider-aws                 f03e107b1cf397a788e2ef10f07cdab3              f03e107b1cf397a788e2ef10f07cdab3
  kubernetes.io/arch                        amd64                                         amd64
  kubernetes.io/os                          linux                                         linux
  node.kubernetes.io/instance-type          g4dn.2xlarge                                  g4dn.2xlarge
  topology.kubernetes.io/region             us-west-2                                     us-west-2

Terms of contribution:
By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.
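The tool that produced the label diff under "Testing done" is not shown in this PR. As a rough sketch of how such a comparison could be reproduced, the hypothetical `diff_labels` function below classifies the labels of two nodes into the same four buckets. It assumes the label maps have already been obtained, for example from the `.metadata.labels` field of each node's JSON representation; the sample dictionaries use a subset of the values reported above.

```python
def diff_labels(a, b):
    """Classify the labels of node A and node B into four buckets."""
    keys_a, keys_b = set(a), set(b)
    shared = keys_a & keys_b
    return {
        "identical": sorted(k for k in shared if a[k] == b[k]),
        "different": sorted(k for k in shared if a[k] != b[k]),
        "only_a": sorted(keys_a - keys_b),
        "only_b": sorted(keys_b - keys_a),
    }


# Subset of the labels from the comparison above (node A lacks the new label).
node_a = {
    "kubernetes.io/arch": "amd64",
    "topology.kubernetes.io/zone": "us-west-2c",
}
node_b = {
    "kubernetes.io/arch": "amd64",
    "topology.kubernetes.io/zone": "us-west-2a",
    "nvidia.com/gpu.present": "true",
}

for bucket, keys in diff_labels(node_a, node_b).items():
    print(f"{bucket}: {keys}")
```

With the full label sets of both nodes as input, the "only_b" bucket is where a newly hardcoded label would be expected to appear.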