Add dra #81

@@ -0,0 +1,34 @@
---
weight: 20
---

# Enable CDI in Containerd

CDI (Container Device Interface) provides a standard mechanism for device vendors to describe what is required to provide access to a specific resource, such as a GPU, beyond a simple device name.

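For reference, a CDI spec is a small JSON or YAML file that a vendor toolkit (for example, `nvidia-ctk cdi generate` for NVIDIA GPUs) places under `/etc/cdi` or `/var/run/cdi`. The sketch below only illustrates the shape of such a file; the vendor name and device node path are made-up examples, not values produced by this guide:

```yaml
# /etc/cdi/example-vendor.yaml -- illustrative sketch only
cdiVersion: "0.6.0"
kind: "example.com/gpu"            # vendor/class identifier
devices:
  - name: gpu0                     # requested as example.com/gpu=gpu0
    containerEdits:
      deviceNodes:
        - path: /dev/example-gpu0  # device node injected into the container
```
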
CDI support is enabled by default in containerd version 2.0 and later. In earlier versions, starting from 1.7.0, the feature must be enabled manually.

## Steps to Enable CDI in containerd v1.7.x

1. Update the containerd configuration.
   Edit the configuration file:
   ```bash
   vi /etc/containerd/config.toml
   ```
   Add or modify the following section:
   ```toml
   [plugins."io.containerd.grpc.v1.cri"]
     enable_cdi = true
   ```
2. Restart containerd:
   ```bash
   systemctl restart containerd
   systemctl status containerd
   ```
   Ensure the service is running correctly.
3. Verify that CDI is enabled:
   ```bash
   journalctl -u containerd | grep "EnableCDI:true"
   ```
   If the command returns matching log entries (it may take a moment after the restart), the setup was successful.
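   Alternatively, assuming the `containerd` CLI is available on the node, you can inspect the effective configuration instead of searching the logs:
   ```bash
   # Print the merged containerd configuration and check the CDI switch
   containerd config dump | grep enable_cdi
   # Expected output once CDI is enabled:
   #   enable_cdi = true
   ```
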
@@ -0,0 +1,11 @@
---
weight: 30
i18n:
  title:
    en: How To
    zh: How To
---

# How To

<Overview />

@@ -0,0 +1,58 @@
---
weight: 30
---

# Enable DRA (Dynamic Resource Allocation) and corresponding API groups in Kubernetes

DRA support is enabled by default in Kubernetes 1.34 and later. In earlier versions, starting from 1.32, the feature must be enabled manually.

## Steps to Enable DRA in Kubernetes 1.32–1.33

On all master nodes:

1. Edit the `kube-apiserver` manifest in `/etc/kubernetes/manifests/kube-apiserver.yaml`:
   ```yaml
   spec:
     containers:
     - command:
       - kube-apiserver
       - --feature-gates=DynamicResourceAllocation=true # required
       - --runtime-config=resource.k8s.io/v1beta1=true,resource.k8s.io/v1beta2=true # required
       # ... other flags
   ```
2. Edit the `kube-controller-manager` manifest in `/etc/kubernetes/manifests/kube-controller-manager.yaml`:
   ```yaml
   spec:
     containers:
     - command:
       - kube-controller-manager
       - --feature-gates=DynamicResourceAllocation=true # required
       # ... other flags
   ```
3. Edit the `kube-scheduler` manifest in `/etc/kubernetes/manifests/kube-scheduler.yaml`:
   ```yaml
   spec:
     containers:
     - command:
       - kube-scheduler
       - --feature-gates=DynamicResourceAllocation=true
       # ... other flags
   ```
4. For kubelet, edit `/var/lib/kubelet/config.yaml` on all nodes:

   ```yaml
   apiVersion: kubelet.config.k8s.io/v1beta1
   kind: KubeletConfiguration
   featureGates:
     DynamicResourceAllocation: true
   ```

   Restart kubelet:

   ```bash
   sudo systemctl restart kubelet
   ```

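After the control-plane components and kubelet have restarted, you can, for example, confirm that the `resource.k8s.io` API group is being served:

```bash
kubectl api-resources --api-group=resource.k8s.io
# Expect resources such as deviceclasses, resourceclaims,
# resourceclaimtemplates and resourceslices to be listed once DRA is enabled.
```
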
@@ -0,0 +1,6 @@
---
weight: 83
---

# Alauda Build of NVIDIA DRA Driver for GPUs

<Overview />

@@ -0,0 +1,177 @@
---
weight: 20
---

# Installation

## Prerequisites

- **NVIDIA Driver v565+**
- **Kubernetes v1.32+**
- **ACP v4.1+**
- **Cluster administrator access to your ACP cluster**
- **CDI must be enabled in the underlying container runtime such as containerd (see [Enable CDI](how_to/cdi_enable_containerd.mdx))**
- **DRA and the corresponding API groups must be enabled (see [Enable DRA](how_to/k8s_dra_enable.mdx))**

## Procedure

### Installing the NVIDIA driver on your GPU node

Refer to the [NVIDIA official installation guide](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/).

### Installing the NVIDIA Container Runtime

Refer to the [NVIDIA Container Toolkit installation guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).

### Downloading the Cluster Plugin

:::info

The `Alauda Build of NVIDIA DRA Driver for GPUs` cluster plugin can be retrieved from the Customer Portal.

Please contact Customer Support for more information.

:::

### Uploading the Cluster Plugin

For more information on uploading the cluster plugin, please refer to <ExternalSiteLink name="acp" href="ui/cli_tools/index.html#uploading-cluster-plugins" children="Uploading Cluster Plugins" />

### Installing Alauda Build of NVIDIA DRA Driver for GPUs

1. Add the label `nvidia-device-enable=pgpu-dra` to your GPU node so that the `nvidia-dra-driver-gpu-kubelet-plugin` can be scheduled onto it (a quick check of the label is shown after this list):
   ```bash
   kubectl label nodes {nodeid} nvidia-device-enable=pgpu-dra
   ```
   :::info
   **Note: On the same node, you can only set one of the following labels: `gpu=on`, `nvidia-device-enable=pgpu`, or `nvidia-device-enable=pgpu-dra`.**
   :::

2. Go to the `Administrator` -> `Marketplace` -> `Cluster Plugin` page, switch to the target cluster, and then deploy the `Alauda Build of NVIDIA DRA Driver for GPUs` cluster plugin.
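
Assuming you have `kubectl` access to the target cluster, you can confirm that the label from step 1 was applied before (or after) deploying the plugin:

```bash
# List the nodes that carry the DRA GPU label
kubectl get nodes -l nvidia-device-enable=pgpu-dra
```
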
### Verify DRA setup

1. Check DRA driver and DRA controller pods:

   ```bash
   kubectl get pods -n kube-system | grep "nvidia-dra-driver-gpu"
   ```
   You should get results similar to:
   ```
   nvidia-dra-driver-gpu-controller-675644bfb5-c2hq4   1/1     Running   0     18h
   nvidia-dra-driver-gpu-kubelet-plugin-65fjt          2/2     Running   0     18h
   ```
2. Verify ResourceSlice objects:
   ```bash
   kubectl get resourceslices -o yaml
   ```

   For GPU nodes, you should see output similar to:

   ```yaml
   apiVersion: resource.k8s.io/v1beta1
   kind: ResourceSlice
   metadata:
     generateName: 192.168.140.59-gpu.nvidia.com-
     name: 192.168.140.59-gpu.nvidia.com-gbl46
     ownerReferences:
     - apiVersion: v1
       controller: true
       kind: Node
       name: 192.168.140.59
       uid: 4ab2c24c-fc35-4c75-bcaf-db038356575c
   spec:
     devices:
     - basic:
         attributes:
           architecture:
             string: Pascal
           brand:
             string: Tesla
           cudaComputeCapability:
             version: 6.0.0
           cudaDriverVersion:
             version: 12.8.0
           driverVersion:
             version: 570.124.6
           pcieBusID:
             string: 0000:00:0b.0
           productName:
             string: Tesla P100-PCIE-16GB
           resource.kubernetes.io/pcieRoot:
             string: pci0000:00
           type:
             string: gpu
           uuid:
             string: GPU-b87512d7-c8a6-5f4b-8d3f-68183df62d66
         capacity:
           memory:
             value: 16Gi
       name: gpu-0
     driver: gpu.nvidia.com
     nodeName: 192.168.140.59
     pool:
       generation: 1
       name: 192.168.140.59
       resourceSliceCount: 1
   ```
3. Deploy workloads with DRA.
   :::info
   **Note: Fill in the `selector` field of the following `ResourceClaimTemplate` resource according to your specific GPU model. You can use the [Common Expression Language (CEL)](https://cel.dev) to select devices based on specific attributes.**
   :::
   Create the spec file:
   ```bash
   cat <<EOF > dra-gpu-test.yaml
   ---
   apiVersion: resource.k8s.io/v1beta1
   kind: ResourceClaimTemplate
   metadata:
     name: gpu-template
   spec:
     spec:
       devices:
         requests:
         - name: gpu
           deviceClassName: gpu.nvidia.com
           selectors:
           - cel:
               expression: "device.attributes['gpu.nvidia.com'].productName == 'Tesla P100-PCIE-16GB'" # [!code callout]
   ---
   apiVersion: v1
   kind: Pod
   metadata:
     name: dra-gpu-workload
   spec:
     tolerations:
     - key: "nvidia.com/gpu"
       operator: "Exists"
       effect: "NoSchedule"
     runtimeClassName: nvidia
     restartPolicy: OnFailure
     resourceClaims:
     - name: gpu-claim
       resourceClaimTemplateName: gpu-template
     containers:
     - name: cuda-container
       image: "ubuntu:22.04"
       command: ["bash", "-c"]
       args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
       resources:
         claims:
         - name: gpu-claim
   EOF
   ```
   Apply the spec:

   ```bash
   kubectl apply -f dra-gpu-test.yaml
   ```

   Obtain the output of the container in the pod:
   ```bash
   kubectl logs dra-gpu-workload -f
   ```
   The output is expected to show the GPU UUID from the container. Example:

   ```text
   GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-b87512d7-c8a6-5f4b-8d3f-68183df62d66)
   ```
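   When you have finished testing, you can remove the example resources created above:

   ```bash
   kubectl delete -f dra-gpu-test.yaml
   ```
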
@@ -0,0 +1,6 @@
---
weight: 10
---

# Introduction

Dynamic Resource Allocation (DRA) is a Kubernetes feature that provides a more flexible and extensible way to request and allocate hardware resources like GPUs. Unlike traditional device plugins that only support simple counting of identical resources, DRA enables fine-grained resource selection based on device attributes and capabilities.
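
As a rough illustration of the difference, a device-plugin request can only ask for a count of an opaque resource, while a DRA request can express constraints on device attributes. The snippet below is a conceptual sketch only; the attribute and product name are examples, not values guaranteed by any particular driver:

```yaml
# Device plugin style: "give me one GPU", nothing more specific
resources:
  limits:
    nvidia.com/gpu: 1
---
# DRA style: "give me a GPU whose productName matches"
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: one-p100
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.nvidia.com
      selectors:
      - cel:
          expression: "device.attributes['gpu.nvidia.com'].productName == 'Tesla P100-PCIE-16GB'"
```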