From c05c928054f42f3f2655cc0687ce51d720f5b6fd Mon Sep 17 00:00:00 2001
From: luohua13
Date: Thu, 15 Jan 2026 18:30:26 +0800
Subject: [PATCH 1/2] dra

---
 .../pgpu_dra/how_to/cdi_enable_containerd.mdx | 34 ++++++++++
 docs/en/pgpu_dra/how_to/index.mdx             | 11 ++++
 docs/en/pgpu_dra/how_to/k8s_dra_enable.mdx    |  7 +++
 docs/en/pgpu_dra/index.mdx                    |  6 ++
 docs/en/pgpu_dra/install.mdx                  | 63 +++++++++++++++++++
 docs/en/pgpu_dra/intro.mdx                    |  6 ++
 6 files changed, 127 insertions(+)
 create mode 100644 docs/en/pgpu_dra/how_to/cdi_enable_containerd.mdx
 create mode 100644 docs/en/pgpu_dra/how_to/index.mdx
 create mode 100644 docs/en/pgpu_dra/how_to/k8s_dra_enable.mdx
 create mode 100644 docs/en/pgpu_dra/index.mdx
 create mode 100644 docs/en/pgpu_dra/install.mdx
 create mode 100644 docs/en/pgpu_dra/intro.mdx

diff --git a/docs/en/pgpu_dra/how_to/cdi_enable_containerd.mdx b/docs/en/pgpu_dra/how_to/cdi_enable_containerd.mdx
new file mode 100644
index 0000000..d2dfc57
--- /dev/null
+++ b/docs/en/pgpu_dra/how_to/cdi_enable_containerd.mdx
@@ -0,0 +1,34 @@
+---
+weight: 20
+---
+
+# Enable CDI in Containerd
+
+CDI (Container Device Interface) provides a standard mechanism for device vendors to describe what is required to provide access to a specific resource, such as a GPU, beyond a simple device name.
+
+CDI support is enabled by default in containerd version 2.0 and later. In earlier versions, starting from 1.7.0, the feature must be enabled manually.
+
+## Steps to Enable CDI in Containerd (1.7.0 <= version < 2.0.0)
+
+1. Update the containerd configuration.
+   Edit the configuration file:
+   ```bash
+   vi /etc/containerd/config.toml
+   ```
+   Add or modify the following section:
+   ```toml
+   [plugins."io.containerd.grpc.v1.cri"]
+     enable_cdi = true
+   ```
+2. Restart containerd.
+   ```bash
+   systemctl restart containerd
+   systemctl status containerd
+   ```
+   Ensure the service is running correctly.
+
+3. Verify that CDI is enabled.
+   ```bash
+   journalctl -u containerd | grep "EnableCDI:true"
+   ```
+   If a matching log entry appears, CDI was enabled successfully.
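+
+With CDI enabled in containerd, the runtime also needs a CDI specification on each GPU node so that device names can be resolved. A minimal sketch, assuming the NVIDIA Container Toolkit (which ships the `nvidia-ctk` CLI) is installed:
+
+```bash
+# Generate a CDI specification for the NVIDIA devices visible on this node
+nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
+
+# List the device names that can now be resolved, e.g. nvidia.com/gpu=0
+nvidia-ctk cdi list
+```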
diff --git a/docs/en/pgpu_dra/how_to/index.mdx b/docs/en/pgpu_dra/how_to/index.mdx
new file mode 100644
index 0000000..1fcbef6
--- /dev/null
+++ b/docs/en/pgpu_dra/how_to/index.mdx
@@ -0,0 +1,11 @@
+---
+weight: 30
+i18n:
+  title:
+    en: How To
+    zh: How To
+---
+
+# How To
+
+
diff --git a/docs/en/pgpu_dra/how_to/k8s_dra_enable.mdx b/docs/en/pgpu_dra/how_to/k8s_dra_enable.mdx
new file mode 100644
index 0000000..caa5c1a
--- /dev/null
+++ b/docs/en/pgpu_dra/how_to/k8s_dra_enable.mdx
@@ -0,0 +1,7 @@
+---
+weight: 30
+---
+
+# Enable DRA (Dynamic Resource Allocation) in Kubernetes
+
+
diff --git a/docs/en/pgpu_dra/index.mdx b/docs/en/pgpu_dra/index.mdx
new file mode 100644
index 0000000..d1577c3
--- /dev/null
+++ b/docs/en/pgpu_dra/index.mdx
@@ -0,0 +1,6 @@
+---
+weight: 83
+---
+# Alauda Build of NVIDIA DRA Driver for GPUs
+
+
diff --git a/docs/en/pgpu_dra/install.mdx b/docs/en/pgpu_dra/install.mdx
new file mode 100644
index 0000000..d745382
--- /dev/null
+++ b/docs/en/pgpu_dra/install.mdx
@@ -0,0 +1,63 @@
+---
+weight: 20
+---
+
+# Installation
+
+## Prerequisites
+
+- **NVIDIA driver v565+**
+- **Kubernetes v1.32+**
+- **ACP v4.1+**
+- **Cluster administrator access to your ACP cluster**
+- **CDI must be enabled in the underlying container runtime (such as containerd)**
+- **DRA and the corresponding API groups must be enabled**
+
+## Procedure
+
+### Installing the NVIDIA driver on your GPU node
+Refer to the [installation guide on the official NVIDIA website](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/).
+
+### Installing the NVIDIA Container Runtime
+Refer to the [installation guide of the NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).
+
+### Downloading the Cluster plugin
+
+:::info
+
+The `Alauda Build of NVIDIA DRA Driver for GPUs` cluster plugin can be retrieved from the Customer Portal.
+
+Please contact Customer Support for more information.
+
+:::
+
+### Uploading the Cluster plugin
+
+For more information on uploading the cluster plugin, please refer to
+
+### Installing Alauda Build of NVIDIA DRA Driver for GPUs
+
+1. Add the label `nvidia-device-enable=pgpu-dra` to your GPU node so that the `nvidia-dra-driver-gpu-kubelet-plugin` can be scheduled onto it.
+   ```bash
+   kubectl label nodes {nodeid} nvidia-device-enable=pgpu-dra
+   ```
+   :::info
+   **Note: On the same node, you can only set one of the following labels: `gpu=on`, `nvidia-device-enable=pgpu`, or `nvidia-device-enable=pgpu-dra`.**
+   :::
+
+2. Go to the `Administrator` -> `Marketplace` -> `Cluster Plugin` page, switch to the target cluster, and then deploy the `Alauda Build of NVIDIA DRA Driver for GPUs` Cluster plugin.
+
+3. Verify the result. The plugin should show an `Installed` status in the UI, or you can check the pod status:
+   ```bash
+   kubectl get pods -n kube-system | grep "nvidia-dra-driver-gpu"
+   ```
+   You should get results similar to:
+   ```
+   nvidia-dra-driver-gpu-controller-675644bfb5-c2hq4   1/1   Running   0   18h
+   nvidia-dra-driver-gpu-kubelet-plugin-65fjt          2/2   Running   0   18h
+   ```
+
+### Upgrading Alauda Build of NVIDIA DRA Driver for GPUs
+
+1. Upload the new version of the **Alauda Build of NVIDIA DRA Driver for GPUs** plugin package to ACP.
+2. Go to the `Administrator` -> `Clusters` -> `Target Cluster` -> `Functional Components` page, then click the `Upgrade` button; you will see that `Alauda Build of NVIDIA DRA Driver for GPUs` can be upgraded.
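+3. After the upgrade completes, you can verify that the plugin pods were rolled out again, in the same way as after installation:
+   ```bash
+   kubectl get pods -n kube-system | grep "nvidia-dra-driver-gpu"
+   ```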
diff --git a/docs/en/pgpu_dra/intro.mdx b/docs/en/pgpu_dra/intro.mdx
new file mode 100644
index 0000000..e0f4e65
--- /dev/null
+++ b/docs/en/pgpu_dra/intro.mdx
@@ -0,0 +1,6 @@
+---
+weight: 10
+---
+# Introduction
+
+Dynamic Resource Allocation (DRA) is a Kubernetes feature that provides a more flexible and extensible way to request and allocate hardware resources like GPUs. Unlike traditional device plugins that only support simple counting of identical resources, DRA enables fine-grained resource selection based on device attributes and capabilities.
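+
+For example, instead of requesting an opaque count such as `nvidia.com/gpu: 1`, a workload references a claim that selects devices by their published attributes or capacity. A minimal sketch (the `gpu.nvidia.com` device class and capacity names follow the NVIDIA DRA driver's conventions; adapt them to your environment):
+
+```yaml
+apiVersion: resource.k8s.io/v1beta1
+kind: ResourceClaimTemplate
+metadata:
+  name: single-large-gpu
+spec:
+  spec:
+    devices:
+      requests:
+      - name: gpu
+        deviceClassName: gpu.nvidia.com
+        selectors:
+        - cel:
+            # Only match GPUs that advertise at least 16Gi of memory (illustrative)
+            expression: "device.capacity['gpu.nvidia.com'].memory.compareTo(quantity('16Gi')) >= 0"
+```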
From cb088061a3f067ec7b8b811b03e73a2d3f0b735c Mon Sep 17 00:00:00 2001
From: luohua13
Date: Mon, 19 Jan 2026 15:16:51 +0800
Subject: [PATCH 2/2] add dra support

---
 .../pgpu_dra/how_to/cdi_enable_containerd.mdx |   2 +-
 .../pgpu_dra/how_to/index.mdx                 |   0
 .../pgpu_dra/how_to/k8s_dra_enable.mdx        |  58 ++++++
 .../device_management}/pgpu_dra/index.mdx     |   0
 .../device_management/pgpu_dra/install.mdx    | 177 ++++++++++++++++++
 .../device_management}/pgpu_dra/intro.mdx     |   0
 docs/en/pgpu_dra/how_to/k8s_dra_enable.mdx    |   7 -
 docs/en/pgpu_dra/install.mdx                  |  63 -------
 8 files changed, 236 insertions(+), 71 deletions(-)
 rename docs/en/{ => infrastructure_management/device_management}/pgpu_dra/how_to/cdi_enable_containerd.mdx (93%)
 rename docs/en/{ => infrastructure_management/device_management}/pgpu_dra/how_to/index.mdx (100%)
 create mode 100644 docs/en/infrastructure_management/device_management/pgpu_dra/how_to/k8s_dra_enable.mdx
 rename docs/en/{ => infrastructure_management/device_management}/pgpu_dra/index.mdx (100%)
 create mode 100644 docs/en/infrastructure_management/device_management/pgpu_dra/install.mdx
 rename docs/en/{ => infrastructure_management/device_management}/pgpu_dra/intro.mdx (100%)
 delete mode 100644 docs/en/pgpu_dra/how_to/k8s_dra_enable.mdx
 delete mode 100644 docs/en/pgpu_dra/install.mdx

diff --git a/docs/en/pgpu_dra/how_to/cdi_enable_containerd.mdx b/docs/en/infrastructure_management/device_management/pgpu_dra/how_to/cdi_enable_containerd.mdx
similarity index 93%
rename from docs/en/pgpu_dra/how_to/cdi_enable_containerd.mdx
rename to docs/en/infrastructure_management/device_management/pgpu_dra/how_to/cdi_enable_containerd.mdx
index d2dfc57..9fa18e0 100644
--- a/docs/en/pgpu_dra/how_to/cdi_enable_containerd.mdx
+++ b/docs/en/infrastructure_management/device_management/pgpu_dra/how_to/cdi_enable_containerd.mdx
@@ -8,7 +8,7 @@ CDI (Container Device Interface) provides a standard mechanism for device vendor
 
 CDI support is enabled by default in containerd version 2.0 and later. In earlier versions, starting from 1.7.0, the feature must be enabled manually.
 
-## Steps to Enable CDI in Containerd (1.7.0 <= version < 2.0.0)
+## Steps to Enable CDI in containerd v1.7.x
 
 1. Update the containerd configuration.
    Edit the configuration file:
diff --git a/docs/en/pgpu_dra/how_to/index.mdx b/docs/en/infrastructure_management/device_management/pgpu_dra/how_to/index.mdx
similarity index 100%
rename from docs/en/pgpu_dra/how_to/index.mdx
rename to docs/en/infrastructure_management/device_management/pgpu_dra/how_to/index.mdx
diff --git a/docs/en/infrastructure_management/device_management/pgpu_dra/how_to/k8s_dra_enable.mdx b/docs/en/infrastructure_management/device_management/pgpu_dra/how_to/k8s_dra_enable.mdx
new file mode 100644
index 0000000..2a67108
--- /dev/null
+++ b/docs/en/infrastructure_management/device_management/pgpu_dra/how_to/k8s_dra_enable.mdx
@@ -0,0 +1,58 @@
+---
+weight: 30
+---
+
+# Enable DRA (Dynamic Resource Allocation) and corresponding API groups in Kubernetes
+
+DRA support is enabled by default in Kubernetes 1.34 and later. In earlier versions, starting from 1.32, the feature must be enabled manually.
+
+## Steps to Enable DRA in Kubernetes 1.32–1.33
+
+On all master nodes:
+1. Edit the `kube-apiserver` component manifest in `/etc/kubernetes/manifests/kube-apiserver.yaml`:
+   ```yaml
+   spec:
+     containers:
+     - command:
+       - kube-apiserver
+       - --feature-gates=DynamicResourceAllocation=true # required
+       - --runtime-config=resource.k8s.io/v1beta1       # required
+       - --runtime-config=resource.k8s.io/v1beta2       # required
+       # ... other flags
+   ```
+
+2. Edit the `kube-controller-manager` component manifest in `/etc/kubernetes/manifests/kube-controller-manager.yaml`:
+   ```yaml
+   spec:
+     containers:
+     - command:
+       - kube-controller-manager
+       - --feature-gates=DynamicResourceAllocation=true # required
+       # ... other flags
+   ```
+
+3. Edit the `kube-scheduler` component manifest in `/etc/kubernetes/manifests/kube-scheduler.yaml`:
+   ```yaml
+   spec:
+     containers:
+     - command:
+       - kube-scheduler
+       - --feature-gates=DynamicResourceAllocation=true
+       # ... other flags
+   ```
+
+4. For the kubelet, edit `/var/lib/kubelet/config.yaml` on all nodes:
+
+   ```yaml
+   apiVersion: kubelet.config.k8s.io/v1beta1
+   kind: KubeletConfiguration
+   featureGates:
+     DynamicResourceAllocation: true
+   ```
+
+   Restart the kubelet:
+
+   ```bash
+   sudo systemctl restart kubelet
+   ```
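+
+5. Verify that DRA is active. As a quick check with standard `kubectl` commands (the exact resource list depends on your Kubernetes version):
+
+   ```bash
+   # Once DRA is enabled, the resource.k8s.io group should serve deviceclasses,
+   # resourceclaims, resourceclaimtemplates, and resourceslices
+   kubectl api-resources --api-group=resource.k8s.io
+   ```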
diff --git a/docs/en/pgpu_dra/index.mdx b/docs/en/infrastructure_management/device_management/pgpu_dra/index.mdx
similarity index 100%
rename from docs/en/pgpu_dra/index.mdx
rename to docs/en/infrastructure_management/device_management/pgpu_dra/index.mdx
diff --git a/docs/en/infrastructure_management/device_management/pgpu_dra/install.mdx b/docs/en/infrastructure_management/device_management/pgpu_dra/install.mdx
new file mode 100644
index 0000000..5414629
--- /dev/null
+++ b/docs/en/infrastructure_management/device_management/pgpu_dra/install.mdx
@@ -0,0 +1,177 @@
+---
+weight: 20
+---
+
+# Installation
+
+## Prerequisites
+
+- **NVIDIA driver v565+**
+- **Kubernetes v1.32+**
+- **ACP v4.1+**
+- **Cluster administrator access to your ACP cluster**
+- **CDI must be enabled in the underlying container runtime such as containerd (see [Enable CDI](how_to/cdi_enable_containerd.mdx))**
+- **DRA and the corresponding API groups must be enabled (see [Enable DRA](how_to/k8s_dra_enable.mdx))**
+
+## Procedure
+
+### Installing the NVIDIA driver on your GPU node
+Refer to the [installation guide on the official NVIDIA website](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/).
+
+### Installing the NVIDIA Container Runtime
+Refer to the [installation guide of the NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).
+
+### Downloading the Cluster plugin
+
+:::info
+
+The `Alauda Build of NVIDIA DRA Driver for GPUs` cluster plugin can be retrieved from the Customer Portal.
+
+Please contact Customer Support for more information.
+
+:::
+
+### Uploading the Cluster plugin
+
+For more information on uploading the cluster plugin, please refer to
+
+### Installing Alauda Build of NVIDIA DRA Driver for GPUs
+
+1. Add the label `nvidia-device-enable=pgpu-dra` to your GPU node so that the `nvidia-dra-driver-gpu-kubelet-plugin` can be scheduled onto it.
+   ```bash
+   kubectl label nodes {nodeid} nvidia-device-enable=pgpu-dra
+   ```
+   :::info
+   **Note: On the same node, you can only set one of the following labels: `gpu=on`, `nvidia-device-enable=pgpu`, or `nvidia-device-enable=pgpu-dra`.**
+   :::
+
+2. Go to the `Administrator` -> `Marketplace` -> `Cluster Plugin` page, switch to the target cluster, and then deploy the `Alauda Build of NVIDIA DRA Driver for GPUs` Cluster plugin.
+
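+Before verifying, you can confirm that the node label from step 1 is in place (a standard `kubectl` query):
+
+```bash
+kubectl get nodes -l nvidia-device-enable=pgpu-dra
+```
+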
+### Verify DRA setup
+
+1. Check the DRA driver controller and kubelet-plugin pods:
+
+   ```bash
+   kubectl get pods -n kube-system | grep "nvidia-dra-driver-gpu"
+   ```
+   You should get results similar to:
+   ```
+   nvidia-dra-driver-gpu-controller-675644bfb5-c2hq4   1/1   Running   0   18h
+   nvidia-dra-driver-gpu-kubelet-plugin-65fjt          2/2   Running   0   18h
+   ```
+
+2. Verify the ResourceSlice objects:
+   ```bash
+   kubectl get resourceslices -o yaml
+   ```
+
+   For GPU nodes, you should see output similar to:
+
+   ```yaml
+   apiVersion: resource.k8s.io/v1beta1
+   kind: ResourceSlice
+   metadata:
+     generateName: 192.168.140.59-gpu.nvidia.com-
+     name: 192.168.140.59-gpu.nvidia.com-gbl46
+     ownerReferences:
+     - apiVersion: v1
+       controller: true
+       kind: Node
+       name: 192.168.140.59
+       uid: 4ab2c24c-fc35-4c75-bcaf-db038356575c
+   spec:
+     devices:
+     - basic:
+         attributes:
+           architecture:
+             string: Pascal
+           brand:
+             string: Tesla
+           cudaComputeCapability:
+             version: 6.0.0
+           cudaDriverVersion:
+             version: 12.8.0
+           driverVersion:
+             version: 570.124.6
+           pcieBusID:
+             string: 0000:00:0b.0
+           productName:
+             string: Tesla P100-PCIE-16GB
+           resource.kubernetes.io/pcieRoot:
+             string: pci0000:00
+           type:
+             string: gpu
+           uuid:
+             string: GPU-b87512d7-c8a6-5f4b-8d3f-68183df62d66
+         capacity:
+           memory:
+             value: 16Gi
+       name: gpu-0
+     driver: gpu.nvidia.com
+     nodeName: 192.168.140.59
+     pool:
+       generation: 1
+       name: 192.168.140.59
+       resourceSliceCount: 1
+   ```
+3. Deploy a workload with DRA.
+   :::info
+   **Note: Fill in the `selector` field of the following `ResourceClaimTemplate` resource according to your specific GPU model. You can use the [Common Expression Language (CEL)](https://cel.dev) to select devices based on specific attributes.**
+   :::
+   Create the spec file:
+   ```bash
+   cat <<EOF > dra-gpu-test.yaml
+   ---
+   apiVersion: resource.k8s.io/v1beta1
+   kind: ResourceClaimTemplate
+   metadata:
+     name: gpu-template
+   spec:
+     spec:
+       devices:
+         requests:
+         - name: gpu
+           deviceClassName: gpu.nvidia.com
+           selectors:
+           - cel:
+               expression: "device.attributes['gpu.nvidia.com'].productName == 'Tesla P100-PCIE-16GB'" # [!code callout]
+   ---
+   apiVersion: v1
+   kind: Pod
+   metadata:
+     name: dra-gpu-workload
+   spec:
+     tolerations:
+     - key: "nvidia.com/gpu"
+       operator: "Exists"
+       effect: "NoSchedule"
+     runtimeClassName: nvidia
+     restartPolicy: OnFailure
+     resourceClaims:
+     - name: gpu-claim
+       resourceClaimTemplateName: gpu-template
+     containers:
+     - name: cuda-container
+       image: "ubuntu:22.04"
+       command: ["bash", "-c"]
+       args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
+       resources:
+         claims:
+         - name: gpu-claim
+   EOF
+   ```
+   Apply the spec:
+
+   ```bash
+   kubectl apply -f dra-gpu-test.yaml
+   ```
+
+   Check the container output in the pod:
+   ```bash
+   kubectl logs pod/dra-gpu-workload -f
+   ```
+   The output is expected to show the GPU UUID from inside the container. Example:
+
+   ```text
+   GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-b87512d7-c8a6-5f4b-8d3f-68183df62d66)
+   ```
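+
+   When you are done, you can clean up the test workload and its generated claim:
+   ```bash
+   kubectl delete -f dra-gpu-test.yaml
+   ```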
diff --git a/docs/en/pgpu_dra/intro.mdx b/docs/en/infrastructure_management/device_management/pgpu_dra/intro.mdx
similarity index 100%
rename from docs/en/pgpu_dra/intro.mdx
rename to docs/en/infrastructure_management/device_management/pgpu_dra/intro.mdx
diff --git a/docs/en/pgpu_dra/how_to/k8s_dra_enable.mdx b/docs/en/pgpu_dra/how_to/k8s_dra_enable.mdx
deleted file mode 100644
index caa5c1a..0000000
--- a/docs/en/pgpu_dra/how_to/k8s_dra_enable.mdx
+++ /dev/null
@@ -1,7 +0,0 @@
----
--weight: 30
----
--
--# Enable DRA (Dynamic Resource Allocation) in Kubernetes
--
--
diff --git a/docs/en/pgpu_dra/install.mdx b/docs/en/pgpu_dra/install.mdx
deleted file mode 100644
index d745382..0000000
--- a/docs/en/pgpu_dra/install.mdx
+++ /dev/null
@@ -1,63 +0,0 @@
----
--weight: 20
----
--
--# Installation
--
--## Prerequisites
--
--- **NVIDIA driver v565+**
--- **Kubernetes v1.32+**
--- **ACP v4.1+**
--- **Cluster administrator access to your ACP cluster**
--- **CDI must be enabled in the underlying container runtime (such as containerd)**
--- **DRA and the corresponding API groups must be enabled**
--
--## Procedure
--
--### Installing the NVIDIA driver on your GPU node
--Refer to the [installation guide on the official NVIDIA website](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/).
--
--### Installing the NVIDIA Container Runtime
--Refer to the [installation guide of the NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).
--
--### Downloading the Cluster plugin
--
--:::info
--
--The `Alauda Build of NVIDIA DRA Driver for GPUs` cluster plugin can be retrieved from the Customer Portal.
--
--Please contact Customer Support for more information.
--
--:::
--
--### Uploading the Cluster plugin
--
--For more information on uploading the cluster plugin, please refer to
--
--### Installing Alauda Build of NVIDIA DRA Driver for GPUs
--
--1. Add the label `nvidia-device-enable=pgpu-dra` to your GPU node so that the `nvidia-dra-driver-gpu-kubelet-plugin` can be scheduled onto it.
--   ```bash
--   kubectl label nodes {nodeid} nvidia-device-enable=pgpu-dra
--   ```
--   :::info
--   **Note: On the same node, you can only set one of the following labels: `gpu=on`, `nvidia-device-enable=pgpu`, or `nvidia-device-enable=pgpu-dra`.**
--   :::
--
--2. Go to the `Administrator` -> `Marketplace` -> `Cluster Plugin` page, switch to the target cluster, and then deploy the `Alauda Build of NVIDIA DRA Driver for GPUs` Cluster plugin.
--
--3. Verify the result. The plugin should show an `Installed` status in the UI, or you can check the pod status:
--   ```bash
--   kubectl get pods -n kube-system | grep "nvidia-dra-driver-gpu"
--   ```
--   You should get results similar to:
--   ```
--   nvidia-dra-driver-gpu-controller-675644bfb5-c2hq4   1/1   Running   0   18h
--   nvidia-dra-driver-gpu-kubelet-plugin-65fjt          2/2   Running   0   18h
--   ```
--
--### Upgrading Alauda Build of NVIDIA DRA Driver for GPUs
--
--1. Upload the new version of the **Alauda Build of NVIDIA DRA Driver for GPUs** plugin package to ACP.
--2. Go to the `Administrator` -> `Clusters` -> `Target Cluster` -> `Functional Components` page, then click the `Upgrade` button; you will see that `Alauda Build of NVIDIA DRA Driver for GPUs` can be upgraded.