# Kubernetes Dynamic Resource Allocation (DRA) driver for CPU resources
This repository implements a DRA driver that enables Kubernetes clusters to manage and assign CPU resources to workloads using the DRA framework.
The driver can be configured with the following command-line flags:
- `--cpu-device-mode`: Sets the mode for exposing CPU devices.
  - `"individual"`: Exposes each allocatable CPU as a separate device in the `ResourceSlice`. This mode provides fine-grained control, exposing granular information specific to each CPU as device attributes in the `ResourceSlice`.
  - `"grouped"` (default): Exposes a single device representing a group of CPUs. This mode treats CPUs as a consumable capacity within the group, improving scalability by reducing the number of API objects.
- `--cpu-device-group-by`: When `--cpu-device-mode` is set to `"grouped"`, this flag determines the grouping strategy.
  - `"numanode"` (default): Groups CPUs by NUMA node.
  - `"socket"`: Groups CPUs by socket.
- `--reserved-cpus`: Specifies a set of CPUs to reserve for system and kubelet processes. These CPUs are not allocatable by the DRA driver and are excluded from the `ResourceSlice`. The value is a cpuset, e.g., `0-1`. The semantics match what the kubelet applies with its `static` CPU Manager policy when strict CPU reservation (`strict-cpu-reservation`) is enabled and the CPUs reserved for system daemons are specified with `reservedSystemCPUs`. For correct CPU accounting, the number of CPUs reserved with this flag should match the sum of the kubelet's `kubeReserved` and `systemReserved` settings; this ensures the kubelet subtracts the correct number of CPUs from `Node.Status.Allocatable`.
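As an illustration, these flags could be set on the driver container in the DaemonSet pod template; the container name and image below are placeholders, not taken from the actual manifest:

```yaml
# Illustrative fragment of a DaemonSet pod template (names and image are placeholders).
containers:
- name: dra-driver-cpu                      # placeholder container name
  image: example.com/dra-driver-cpu:latest  # placeholder image
  args:
  - --cpu-device-mode=grouped     # expose CPU groups as consumable capacity...
  - --cpu-device-group-by=socket  # ...grouped by socket instead of the default NUMA node
  - --reserved-cpus=0-1           # keep CPUs 0-1 for system and kubelet processes
```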
The driver is deployed as a DaemonSet which contains two core components:
- DRA driver: This component is the main control loop and handles the interaction with the Kubernetes API server for Dynamic Resource Allocation.
  - Topology Discovery: It discovers the node's CPU topology, including details like sockets, NUMA nodes, cores, SMT siblings, Last-Level Cache (LLC), and core types (e.g., Performance-cores, Efficiency-cores). This is done by parsing `/proc/cpuinfo` and reading sysfs files.
  - ResourceSlice Publication: Based on the `--cpu-device-mode` flag, it publishes `ResourceSlice` objects to the API server:
    - In `individual` mode, each allocatable CPU becomes a device in the `ResourceSlice`, with attributes detailing its topology.
    - In `grouped` mode, devices represent larger CPU aggregates (like NUMA nodes or sockets). These devices support consumable capacity, indicating the number of available CPUs within that group.
  - Claim Allocation: When a `ResourceClaim` is assigned to the node, the DRA driver handles the allocation:
    - In `individual` mode, the scheduler has already selected specific CPU devices. The driver enforces this selection through CDI and NRI.
    - In `grouped` mode, the claim requests a quantity of CPUs from the group device. The driver then uses topology-aware allocation logic (imported from the kubelet's CPU Manager) to select the physical CPUs within the group. Strict compatibility with the kubelet's CPU Manager allocation behavior is not a goal of this driver; this decision will be revisited in future releases.
  - CDI Spec Generation: Upon successful allocation, the driver generates a CDI (Container Device Interface) specification.
- CDI (Container Device Interface): The driver uses CDI to communicate the allocated CPU set to the container runtime.
  - A CDI JSON spec file is created or updated for the allocated claim.
  - This spec instructs the runtime to inject an environment variable (e.g., `DRA_CPUSET_<claimUID>=<cpuset>`) into the container.
  - The driver includes mechanisms for thread-safe, atomic updates to the CDI spec files.
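A minimal sketch of what such a CDI spec might contain; the `cdiVersion`, `kind`, device name, and cpuset value below are illustrative assumptions, not the driver's exact output:

```yaml
# Illustrative CDI spec fragment (field values are assumptions for illustration).
cdiVersion: "0.6.0"
kind: dra.cpu/cpu
devices:
- name: claim-<claimUID>       # hypothetical per-claim device name
  containerEdits:
    env:
    - DRA_CPUSET_<claimUID>=4-7,36-39   # the cpuset allocated to this claim
```

The NRI plugin later reads this environment variable from the container to know which CPUs to pin.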
- NRI Plugin: This component integrates with the container runtime via the Node Resource Interface (NRI).
  - For containers with guaranteed CPUs (those with a DRA ResourceClaim), the plugin reads the environment variable injected via CDI and pins the container to its exclusive CPU set using the cgroup cpuset controller.
  - For all other containers, it confines them to a shared pool of CPUs, which consists of all allocatable CPUs not exclusively assigned to any guaranteed container.
  - It dynamically updates the shared pool cpuset for all shared containers whenever guaranteed allocations change (containers are created or removed).
  - On restart, the NRI plugin can synchronize its state by inspecting existing containers and their environment variables to rebuild the current CPU allocations.
- Exclusive CPU Allocation: Pods that request CPUs via a ResourceClaim are allocated exclusive CPUs based on the chosen mode and topology.
- Shared CPU Pool Management: All other containers without a ResourceClaim are confined to a shared pool of CPUs that are not reserved.
- Topology Awareness: The driver discovers detailed CPU topology including sockets, NUMA nodes, cores, SMT siblings, L3 cache (UncoreCache), and core types (Performance/Efficiency).
- Advanced CPU Allocation Strategies: When in `"grouped"` mode, the driver utilizes allocation logic adapted from the kubelet's CPU Manager, including:
  - NUMA-aware best-fit allocation.
  - Packing or spreading CPUs across cores.
  - Preference for aligning allocations to UncoreCache boundaries.
- CDI Integration: Manages CDI spec files to inject environment variables containing the allocated cpuset into the container.
- State Synchronization: On restart, the driver synchronizes with all existing pods on the node to rebuild its state of CPU allocations from environment variables injected by CDI.
- Multiple Device Exposure Modes:
- Individual Mode: Each CPU is a device, allowing for selection based on attributes like CPU ID, core type, NUMA node, etc. This mode is ideal for workloads requiring fine-grained control over CPU placement, common in HPC or performance-critical applications.
- Grouped Mode: CPUs are grouped (e.g., by NUMA node or socket) and treated as a consumable capacity within that group. This helps in reducing the number of devices exposed to the API server, especially on systems with a large number of CPUs, thus improving scalability. This mode is suitable for workloads needing alignment with other DRA resources within the same group (e.g., NUMA node) or where the exact CPU IDs are less critical than the quantity.
- This driver currently only manages CPU resources. Memory allocation and management are not supported.
- While the driver is topology-aware, the grouped mode currently abstracts some of the fine-grained details within the group. Future enhancements may explore combining consumable capacity with partitionable devices for more hierarchical control.
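In individual mode, a claim can filter devices with a CEL selector over the published attributes. A sketch of such a request fragment, assuming the attribute names shown in the `ResourceSlice` examples below:

```yaml
# Illustrative ResourceClaim request fragment: restrict candidate CPU devices
# to NUMA node 0 with a given core type (attribute names from the driver's
# ResourceSlice output; the surrounding request fields are omitted).
selectors:
- cel:
    expression: |-
      device.attributes["dra.cpu"].numaNodeID == 0 &&
      device.attributes["dra.cpu"].coreType == "standard"
```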
- If needed, create a kind cluster. The repo provides a configuration for one, which can be deployed as follows:

  ```sh
  make kind-cluster
  ```
- Deploy the driver and all necessary RBAC configurations using the provided manifest:

  ```sh
  kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/dra-driver-cpu/refs/heads/main/install.yaml
  ```
- Create a ResourceClaim: this requests a specific number of exclusive CPUs from the driver.

  ```sh
  kubectl apply -f hack/examples/sample_cpu_resource_claims.yaml
  ```
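The example file above is the authoritative reference. As a rough sketch, a grouped-mode claim for a quantity of CPUs might look like the following; the claim name and device class name are placeholders, and since the consumable-capacity API is still evolving, treat the exact field shape as an assumption:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: exclusive-cpus       # placeholder claim name
spec:
  devices:
    requests:
    - name: cpus
      exactly:
        deviceClassName: dra.cpu   # assumed device class name
        capacity:
          requests:
            dra.cpu/cpu: "4"       # request 4 CPUs from a group device
```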
- Create a Pod: reference the ResourceClaim in your pod spec to receive the allocated CPUs.

  ```sh
  kubectl apply -f hack/examples/sample_pod_with_cpu_resource_claim.yaml
  ```
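Tying the steps together, a pod that consumes such a claim follows the standard DRA pattern; the pod name, image, and referenced claim name below are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pinned-workload               # placeholder pod name
spec:
  resourceClaims:
  - name: cpus
    resourceClaimName: exclusive-cpus # placeholder: an existing ResourceClaim
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    resources:
      claims:
      - name: cpus                    # consume the claimed CPUs in this container
```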
Here's how the ResourceSlice objects might look for the different modes:
Individual mode: each CPU is listed as a separate device with detailed attributes.
```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: dra-driver-cpu-worker-dra.cpu-qskwf
  # ... other metadata
spec:
  driver: dra.cpu
  nodeName: dra-driver-cpu-worker
  pool:
    generation: 1
    name: dra-driver-cpu-worker
    resourceSliceCount: 1
  devices:
  - attributes:
      dra.cpu/cacheL3ID:
        int: 0
      dra.cpu/coreID:
        int: 1
      dra.cpu/coreType:
        string: standard
      dra.cpu/cpuID:
        int: 1
      dra.cpu/numaNodeID:
        int: 0
      dra.cpu/socketID:
        int: 0
      dra.net/numaNode:
        int: 0
    name: cpudev0
  - attributes:
      dra.cpu/cacheL3ID:
        int: 0
      dra.cpu/coreID:
        int: 1
      dra.cpu/coreType:
        string: standard
      dra.cpu/cpuID:
        int: 33
      dra.cpu/numaNodeID:
        int: 0
      dra.cpu/socketID:
        int: 0
      dra.net/numaNode:
        int: 0
    name: cpudev1
  # ... other CPU devices
```

Grouped mode: CPUs are grouped, and the device entry shows consumable capacity.
```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: dra-driver-cpu-worker-dra.cpu-tp869
  # ... other metadata
spec:
  driver: dra.cpu
  nodeName: dra-driver-cpu-worker
  pool:
    generation: 1
    name: dra-driver-cpu-worker
    resourceSliceCount: 1
  devices:
  - allowMultipleAllocations: true
    attributes:
      dra.cpu/smtEnabled:
        bool: true
      dra.cpu/numCPUs:
        int: 64
      dra.cpu/numaNodeID:
        int: 0
      dra.cpu/socketID:
        int: 0
      dra.net/numaNode:
        int: 0
    capacity:
      dra.cpu/cpu:
        value: "64"
    name: cpudevnuma0
  - allowMultipleAllocations: true
    attributes:
      dra.cpu/smtEnabled:
        bool: true
      dra.cpu/numCPUs:
        int: 64
      dra.cpu/numaNodeID:
        int: 1
      dra.cpu/socketID:
        int: 0
      dra.net/numaNode:
        int: 1
    capacity:
      dra.cpu/cpu:
        value: "64"
    name: cpudevnuma1
```

Learn how to engage with the Kubernetes community on the community page.
You can reach the maintainers of this project at:
Participation in the Kubernetes community is governed by the Kubernetes Code of Conduct.
This project is managed by its OWNERS and is licensed under Creative Commons 4.0.