Conversation

@mdboom (Contributor) commented Jan 16, 2026

This adds two convenience methods to convert between cuda.core.Device and cuda.core.system.Device; a rough usage sketch follows the questions below.

I have a few design questions for this one:

  • Is the naming of Device.to_system_device and Device.to_cuda_device ok?
  • This could be a new type accepted by the constructor, e.g. cuda.core.Device(nvml_device), but that requires importing the other module from the constructor, which could be a performance / cyclical-import nightmare.
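
A rough usage sketch of the two proposed methods, assuming they behave as described above (the import path is an assumption, not final API):

# Hypothetical usage of the conversion methods proposed in this PR; the exact
# import path (cuda.core vs. cuda.core.experimental) is an assumption here.
from cuda.core import Device

cuda_dev = Device(0)                       # CUDA driver/runtime view of device 0
nvml_dev = cuda_dev.to_system_device()     # proposed: CUDA -> NVML (system) device
print(nvml_dev.uuid)                       # NVML-side attributes are now reachable
round_tripped = nvml_dev.to_cuda_device()  # proposed: NVML -> CUDA device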

@copy-pr-bot (bot) commented Jan 16, 2026

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.


@mdboom mdboom force-pushed the cuda.core.system-convert-devices branch from 922270e to 23aefa3 Compare January 16, 2026 17:44
@mdboom mdboom requested a review from Copilot January 16, 2026 17:44
Copilot AI (Contributor) left a comment


Pull request overview

This PR adds convenience methods to convert between cuda.core.Device (CUDA API) and cuda.core.system.Device (NVML API), enabling easier interoperability between the two device types.

Changes:

  • Added to_system_device() method to cuda.core.Device for converting to NVML device representation
  • Added to_cuda_device() method to cuda.core.system.Device for converting to CUDA device representation
  • Added comprehensive tests for both conversion methods

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File descriptions:

  • cuda_core/cuda/core/_device.pyx: Implements the to_system_device() method for converting a CUDA Device to a system Device using UUID mapping
  • cuda_core/cuda/core/system/_device.pyx: Implements the to_cuda_device() method for converting a system Device to a CUDA Device by searching all devices for a matching UUID (sketched below)
  • cuda_core/tests/test_device.py: Adds a test for to_system_device() conversion, verifying UUID and PCI bus ID mapping
  • cuda_core/tests/system/test_system_device.py: Adds a test for to_cuda_device() conversion across all devices, verifying UUID and PCI bus ID mapping
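
A standalone sketch of the matching strategy described for to_cuda_device(); the function name and the iterable of CUDA devices are stand-ins, not real cuda.core APIs, and only the UUID-matching logic itself is illustrated:

# Stand-alone sketch of the search-by-UUID strategy; `cuda_devices` is assumed
# to be an iterable of CUDA Device objects exposing a `uuid` attribute.
def find_cuda_device_by_uuid(nvml_uuid: str, cuda_devices):
    # NVML reports "GPU-<uuid>" / "MIG-<uuid>" strings, while the CUDA side is
    # assumed to report the bare UUID, so normalize before comparing (this
    # prefix mismatch is exactly what the review discussion below is about).
    bare = nvml_uuid.removeprefix("GPU-").removeprefix("MIG-")
    for dev in cuda_devices:
        if str(dev.uuid) == bare:
            return dev
    raise RuntimeError("No corresponding CUDA device found for this NVML device.")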


@mdboom (Contributor Author) commented Jan 16, 2026

/ok to test


@mdboom (Contributor Author) commented Jan 16, 2026

/ok to test

@mdboom mdboom requested a review from leofang January 20, 2026 15:00
@cpcloud (Contributor) left a comment


Would really like to see if we can address the UUID inconsistency, but that isn't blocking and can be done in a follow-up.

if cuda_device.uuid == uuid:
    return cuda_device

raise RuntimeError("No corresponding CUDA device found for this NVML device.")
Contributor:

More of a rant I guess than anything: I know this is pre-existing, but I really dislike that we raise RuntimeError everywhere.

Contributor Author:

What would be more appropriate here? A custom exception along the lines of DeviceNotFoundError?
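
For what it's worth, a minimal sketch of what such an exception could look like; neither name below exists in cuda.core today, this is purely illustrative:

# Hypothetical: a narrower exception than a bare RuntimeError.
class DeviceNotFoundError(RuntimeError):
    """Raised when no CUDA device matches the requested NVML device."""


def cuda_device_for_uuid(uuid, cuda_devices):
    # Same search as in the snippet quoted above, but with a more descriptive error.
    for dev in cuda_devices:
        if dev.uuid == uuid:
            return dev
    raise DeviceNotFoundError(f"No corresponding CUDA device found for UUID {uuid!r}.")

Subclassing RuntimeError would keep any existing except RuntimeError handlers working while letting new code catch the narrower type.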

# this matching when it has a `GPU-` prefix, but for now we just strip
# it. If a matching CUDA device can't be found, we will get a helpful
# exception, anyway, below.
uuid = self.uuid[4:]
Contributor:

Even though these are two different Device APIs, having their UUIDs not be identical feels like a design flaw. Can we fix that or is there a specific reason they must be this way?

Is the prefix actually providing unique information?

For example, is it ever the case where you have GPU-$UUID and MIG-$UUID and $UUID are equal?

Keeping the prefix also means this is not actually a UUID, and therefore it is incompatible with any tool that consumes UUIDs directly. Every consumer has to slice the prefix off before using the value.
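
To make the incompatibility concrete, a quick standard-library illustration (the UUID value is made up):

import uuid

nvml_style = "GPU-5c8c1b6d-7e9a-4f5e-9d3a-1b2c3d4e5f60"  # made-up example value

try:
    uuid.UUID(nvml_style)            # the prefixed form is not a parseable UUID
except ValueError:
    print("not a valid UUID as-is")

print(uuid.UUID(nvml_style.removeprefix("GPU-")))  # the bare suffix is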

Member:

UUID is a complex mess, see nvbugs 4868877. Is there a way we can avoid using UUID while still allowing CUDA and NVML to handshake?

Contributor Author:

For example, is it ever the case where you have GPU-$UUID and MIG-$UUID and $UUID are equal?

I don't /know/ but I really doubt it. I'd be fine with truncating the GPU- and MIG- prefixes in NVML (and adding another method to obtain them for users who might care). That's probably better than requiring truncation by our users.

Is there a way we can avoid using UUID while still allowing CUDA and NVML to handshake?

Using indices is clearly not the right choice -- there are envvars that control their assignment and a bunch of other things that make them unsuitable.

The NVML docs say:

The order in which NVML enumerates devices has no guarantees of consistency between reboots. For that reason it is recommended that devices be looked up by their PCI ids or UUID. See nvmlDeviceGetHandleByUUID() and nvmlDeviceGetHandleByPciBusId_v2().

I chose UUID because /technically speaking/ some really old devices and maybe some future devices may not actually be on a PCI bus. And PCI bus IDs are inconsistent between CUDA and NVML (in more problematic ways, such as CUDA only exposing the lower 16 bits of the domain).
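
For illustration, here is what the NVML-recommended enumeration and UUID reporting look like through the separate pynvml bindings (not this PR's cuda.core.system module); whether the strings come back prefixed, and as bytes or str, depends on driver and binding versions, so the normalization is an assumption:

# Illustration via pynvml, not cuda.core.system: list each NVML device's UUID
# in both its native (prefixed) form and the bare form CUDA-style consumers want.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        raw = pynvml.nvmlDeviceGetUUID(handle)       # e.g. "GPU-xxxxxxxx-..."
        text = raw.decode() if isinstance(raw, bytes) else raw
        bare = text.removeprefix("GPU-").removeprefix("MIG-")
        print(i, text, bare)
finally:
    pynvml.nvmlShutdown()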

@mdboom mdboom requested a review from cpcloud January 20, 2026 18:11
@mdboom (Contributor Author) commented Jan 20, 2026

/ok to test

Comment on lines +1069 to +1072
# TODO: If the user cares about the prefix, we will expose that in the
# future using the MIG-related APIs in NVML.

return nvml.device_get_uuid(self._handle)[4:]
Member:

I am thinking the system device should return the full UUID and we document the different expectations between cuda.core and cuda.core.system (or CUDA vs NVML). @mdboom thoughts?

Contributor Author:

I can totally see it both ways. My original implementation did what you suggested (following upstream NVML behavior). But @cpcloud convinced me this is weird -- UUID has a well-defined meaning in our field that NVML deviates from. I don't feel super strongly either way -- we just need to break the tie ;)

Member:

IIUC NVML is the only way for us to tell, from inside a running process, if we are using MIG instances or otherwise (bare-metal GPU, MPS, etc). CUDA purposely hides MIG from end users. So my thinking is if we don't follow NVML there is no other way for Python users to query. Could you check if my impression is correct?

Contributor Author:

There is another API, nvmlDeviceIsMigDeviceHandle, that could be used to query whether it's MIG, and IMHO, that's better than the user needing to parse a string to get that info.

/**
 * Test if the given handle refers to a MIG device.
 *
 * A MIG device handle is an NVML abstraction which maps to a MIG compute instance.
 * These overloaded references can be used (with some restrictions) interchangeably
 * with a GPU device handle to execute queries at a per-compute instance granularity.
 *
 * For Ampere &tm; or newer fully supported devices.
 * Supported on Linux only.
 *
 * @param device                               NVML handle to test
 * @param isMigDevice                          True when handle refers to a MIG device
 */
nvmlReturn_t DECLDIR nvmlDeviceIsMigDeviceHandle(nvmlDevice_t device, unsigned int *isMigDevice);
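
For reference, a minimal sketch of that query from Python, again via the separate pynvml bindings rather than cuda.core.system (assuming pynvml exposes nvmlDeviceIsMigDeviceHandle, which recent versions do):

# Ask NVML directly whether a handle refers to a MIG compute instance,
# instead of parsing a "MIG-" prefix out of the UUID string.
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    is_mig = bool(pynvml.nvmlDeviceIsMigDeviceHandle(handle))
    print("MIG device handle:", is_mig)
finally:
    pynvml.nvmlShutdown()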
