cuda.core.system: add conveniences to convert between device types #1508
base: main
Conversation
Auto-sync is disabled for ready-for-review pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.
Pull request overview
This PR adds convenience methods to convert between cuda.core.Device (CUDA API) and cuda.core.system.Device (NVML API), enabling easier interoperability between the two device types.
Changes:
- Added `to_system_device()` method to `cuda.core.Device` for converting to the NVML device representation (see the usage sketch below)
- Added `to_cuda_device()` method to `cuda.core.system.Device` for converting to the CUDA device representation
- Added comprehensive tests for both conversion methods
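A minimal usage sketch of the round trip between the two device types, assuming the module paths named in this PR (`cuda.core.Device` and `cuda.core.system.Device`); everything beyond the two new methods and the `uuid` attribute is illustrative.

```python
from cuda.core import Device  # CUDA-facing device type; module path per this PR's description

cuda_dev = Device(0)                      # CUDA view of device 0
sys_dev = cuda_dev.to_system_device()     # NVML view of the same physical GPU
round_trip = sys_dev.to_cuda_device()     # and back to the CUDA device type

assert round_trip.uuid == cuda_dev.uuid   # the conversions match devices by UUID
```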
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| cuda_core/cuda/core/_device.pyx | Implements to_system_device() method for converting CUDA Device to system Device using UUID mapping |
| cuda_core/cuda/core/system/_device.pyx | Implements to_cuda_device() method for converting system Device to CUDA Device by searching all devices for matching UUID |
| cuda_core/tests/test_device.py | Adds test for to_system_device() conversion, verifying UUID and PCI bus ID mapping |
| cuda_core/tests/system/test_system_device.py | Adds test for to_cuda_device() conversion across all devices, verifying UUID and PCI bus ID mapping |
/ok to test
/ok to test
cpcloud left a comment
Would really like to see if we can address the UUID inconsistency, but that isn't blocking and can be done in a follow-up.
```python
        if cuda_device.uuid == uuid:
            return cuda_device

    raise RuntimeError("No corresponding CUDA device found for this NVML device.")
```
More of a rant I guess than anything: I know this is pre-existing, but I really dislike that we raise RuntimeError everywhere.
What would be more appropriate here? A custom exception along the lines of DeviceNotFoundError?
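For concreteness, a dedicated exception along those lines might look like the sketch below; the name and its subclassing of `RuntimeError` follow the suggestion above and are not part of this PR.

```python
# Hypothetical exception type; subclassing RuntimeError would keep any
# existing `except RuntimeError` handlers working unchanged.
class DeviceNotFoundError(RuntimeError):
    """No CUDA device corresponds to the given NVML device."""


# The lookup shown in the earlier diff would then end with:
#     raise DeviceNotFoundError("No corresponding CUDA device found for this NVML device.")
```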
```python
# this matching when it has a `GPU-` prefix, but for now we just strip
# it. If a matching CUDA device can't be found, we will get a helpful
# exception, anyway, below.
uuid = self.uuid[4:]
```
Even though these are two different Device APIs, having their UUIDs not be identical feels like a design flaw. Can we fix that or is there a specific reason they must be this way?
Is the prefix actually providing unique information?
For example, is it ever the case where you have GPU-$UUID and MIG-$UUID and $UUID are equal?
Keeping the prefix also makes this not actually a UUID, and therefore incompatible with any tool that consumes UUIDs directly: every value has to have the prefix sliced off before use.
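To make that concrete, Python's standard `uuid` module rejects the prefixed form outright (the value below is made up):

```python
import uuid

nvml_style = "GPU-12345678-1234-1234-1234-123456789abc"  # made-up example value

try:
    uuid.UUID(nvml_style)           # the "GPU-" prefix makes this a badly formed UUID string
except ValueError:
    pass

parsed = uuid.UUID(nvml_style[4:])  # stripping the prefix yields a valid UUID
```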
UUID is a complex mess, see nvbugs 4868877. Is there a way we can avoid using UUID while still allowing CUDA and NVML to handshake?
> For example, is it ever the case where you have GPU-$UUID and MIG-$UUID and $UUID are equal?
I don't /know/ but I really doubt it. I'd be fine with truncating the GPU and MIG prefix in NVML (and adding another method to obtain it for users who might care). That's probably better than requiring truncation by our users.
> Is there a way we can avoid using UUID while still allowing CUDA and NVML to handshake?
Using indices is clearly not the right choice -- there are envvars that control their assignment and a bunch of other things that make them unsuitable.
The NVML docs say:
> The order in which NVML enumerates devices has no guarantees of consistency between reboots. For that reason it is recommended that devices be looked up by their PCI ids or UUID. See nvmlDeviceGetHandleByUUID() and nvmlDeviceGetHandleByPciBusId_v2().
I chose UUID because /technically speaking/ some really old devices and maybe some future devices may not actually be on a PCI bus. And PCI bus ids are inconsistent between CUDA and NVML (in more problematic ways, such as CUDA only exposing the lower 16 bits of the domain).
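As a sketch of what the docs recommend, a UUID-based lookup through the standalone `pynvml` (nvidia-ml-py) binding looks roughly like this; the UUID is a placeholder, and older binding versions may require it to be passed as `bytes`.

```python
import pynvml

pynvml.nvmlInit()
try:
    # NVML expects the prefixed "GPU-..." form of the UUID here (placeholder value).
    handle = pynvml.nvmlDeviceGetHandleByUUID("GPU-12345678-1234-1234-1234-123456789abc")
    print(pynvml.nvmlDeviceGetName(handle))
finally:
    pynvml.nvmlShutdown()
```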
/ok to test
```python
# TODO: If the user cares about the prefix, we will expose that in the
# future using the MIG-related APIs in NVML.

return nvml.device_get_uuid(self._handle)[4:]
```
I am thinking the system device should return the full UUID and we document the different expectations between cuda.core and cuda.core.system (or CUDA vs NVML). @mdboom thoughts?
I can totally see it both ways. My original implementation did what you suggested (following upstream NVML behavior). But @cpcloud convinced me this is weird -- UUID has a well-defined meaning in our field that NVML deviates from. I don't feel super strongly either way -- we just need to break the tie ;)
IIUC NVML is the only way for us to tell, from inside a running process, if we are using MIG instances or otherwise (bare-metal GPU, MPS, etc.). CUDA purposely hides MIG from end users. So my thinking is that if we don't follow NVML, there is no other way for Python users to query this. Could you check if my impression is correct?
There is another API, `nvmlDeviceIsMigDeviceHandle`, that could be used to query whether it's MIG, and IMHO, that's better than the user needing to parse a string to get that info.
```c
/**
 * Test if the given handle refers to a MIG device.
 *
 * A MIG device handle is an NVML abstraction which maps to a MIG compute instance.
 * These overloaded references can be used (with some restrictions) interchangeably
 * with a GPU device handle to execute queries at a per-compute instance granularity.
 *
 * For Ampere &tm; or newer fully supported devices.
 * Supported on Linux only.
 *
 * @param device       NVML handle to test
 * @param isMigDevice  True when handle refers to a MIG device
 */
nvmlReturn_t DECLDIR nvmlDeviceIsMigDeviceHandle(nvmlDevice_t device, unsigned int *isMigDevice);
```
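For comparison on the Python side, checking the same thing through the standalone `pynvml` (nvidia-ml-py) binding would look roughly like this; `cuda.core.system` would presumably wrap the same NVML call through its own `nvml` layer rather than `pynvml`.

```python
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    # Truthy when the handle refers to a MIG compute instance rather than a whole GPU.
    if pynvml.nvmlDeviceIsMigDeviceHandle(handle):
        print("MIG device")
    else:
        print("whole GPU")
finally:
    pynvml.nvmlShutdown()
```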
This adds two convenience methods to convert between `cuda.core.Device` and `cuda.core.system.Device`. I have a few design questions for this one:

- Are the names `Device.to_system_device` and `Device.to_cuda_device` ok? Another option would be `cuda.core.Device(nvml_device)`, but that requires importing the other module from the constructor, which could be a performance / cyclical import nightmare (see the sketch below).
Device.to_system_deviceandDevice.to_cuda_deviceok?cuda.core.Device(nvml_device), but that requires importing the other module from the constructor which could be a performance / cyclical import nightmare.