cuda.core.system: add conveniences to convert between device types #1508
base: main
Conversation
Auto-sync is disabled for ready-for-review pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.
Pull request overview
This PR adds convenience methods to convert between cuda.core.Device (CUDA API) and cuda.core.system.Device (NVML API), enabling easier interoperability between the two device types.
Changes:
- Added `to_system_device()` method to `cuda.core.Device` for converting to the NVML device representation (see the usage sketch below)
- Added `to_cuda_device()` method to `cuda.core.system.Device` for converting to the CUDA device representation
- Added comprehensive tests for both conversion methods
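A minimal usage sketch of the round trip between the two device types, assuming the module paths named in this PR (`cuda.core.Device` and `cuda.core.system.Device`); everything beyond the two new methods and the `uuid` attribute is illustrative.

```python
from cuda.core import Device  # CUDA-facing device type; module path per this PR's description

cuda_dev = Device(0)                      # CUDA view of device 0
sys_dev = cuda_dev.to_system_device()     # NVML view of the same physical GPU
round_trip = sys_dev.to_cuda_device()     # and back to the CUDA device type

assert round_trip.uuid == cuda_dev.uuid   # the conversions match devices by UUID
```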
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| cuda_core/cuda/core/_device.pyx | Implements to_system_device() method for converting CUDA Device to system Device using UUID mapping |
| cuda_core/cuda/core/system/_device.pyx | Implements to_cuda_device() method for converting system Device to CUDA Device by searching all devices for matching UUID |
| cuda_core/tests/test_device.py | Adds test for to_system_device() conversion, verifying UUID and PCI bus ID mapping |
| cuda_core/tests/system/test_system_device.py | Adds test for to_cuda_device() conversion across all devices, verifying UUID and PCI bus ID mapping |
/ok to test
/ok to test
cpcloud left a comment
Would really like to see if we can address the UUID inconsistency, but that isn't blocking and can be done in a follow-up.
```python
        if cuda_device.uuid == uuid:
            return cuda_device

    raise RuntimeError("No corresponding CUDA device found for this NVML device.")
```
More of a rant I guess than anything: I know this is pre-existing, but I really dislike that we raise RuntimeError everywhere.
What would be more appropriate here? A custom exception along the lines of DeviceNotFoundError?
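For concreteness, a dedicated exception along those lines might look like the sketch below; the name and its subclassing of `RuntimeError` follow the suggestion above and are not part of this PR.

```python
# Hypothetical exception type; subclassing RuntimeError would keep any
# existing `except RuntimeError` handlers working unchanged.
class DeviceNotFoundError(RuntimeError):
    """No CUDA device corresponds to the given NVML device."""


# The lookup shown in the earlier diff would then end with:
#     raise DeviceNotFoundError("No corresponding CUDA device found for this NVML device.")
```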
```python
# this matching when it has a `GPU-` prefix, but for now we just strip
# it. If a matching CUDA device can't be found, we will get a helpful
# exception, anyway, below.
uuid = self.uuid[4:]
```
Even though these are two different Device APIs, having their UUIDs not be identical feels like a design flaw. Can we fix that or is there a specific reason they must be this way?
Is the prefix actually providing unique information?
For example, is it ever the case where you have GPU-$UUID and MIG-$UUID and $UUID are equal?
Keeping the prefix also makes this not actually a UUID, and therefore incompatible with any tool that consumes UUIDs directly: every value has to have the prefix sliced off before use.
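To make that concrete, Python's standard `uuid` module rejects the prefixed form outright (the value below is made up):

```python
import uuid

nvml_style = "GPU-12345678-1234-1234-1234-123456789abc"  # made-up example value

try:
    uuid.UUID(nvml_style)           # the "GPU-" prefix makes this a badly formed UUID string
except ValueError:
    pass

parsed = uuid.UUID(nvml_style[4:])  # stripping the prefix yields a valid UUID
```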
UUID is a complex mess, see nvbugs 4868877. Is there a way we can avoid using UUID while still allowing CUDA and NVML to handshake?
> For example, is it ever the case where you have GPU-$UUID and MIG-$UUID and $UUID are equal?
I don't /know/ but I really doubt it. I'd be fine with truncating the GPU and MIG prefix in NVML (and adding another method to obtain it for users who might care). That's probably better than requiring truncation by our users.
> Is there a way we can avoid using UUID while still allowing CUDA and NVML to handshake?
Using indices is clearly not the right choice -- there are envvars that control their assignment and a bunch of other things that make them unsuitable.
The NVML docs say:
> The order in which NVML enumerates devices has no guarantees of consistency between reboots. For that reason it is recommended that devices be looked up by their PCI ids or UUID. See nvmlDeviceGetHandleByUUID() and nvmlDeviceGetHandleByPciBusId_v2().
I chose UUID because /technically speaking/ some really old devices and maybe some future devices may not actually be on a PCI bus. And PCI bus ids are inconsistent between CUDA and NVML (in more problematic ways, such as CUDA only exposing the lower 16 bits of the domain).
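As a sketch of what the docs recommend, a UUID-based lookup through the standalone `pynvml` (nvidia-ml-py) binding looks roughly like this; the UUID is a placeholder, and older binding versions may require it to be passed as `bytes`.

```python
import pynvml

pynvml.nvmlInit()
try:
    # NVML expects the prefixed "GPU-..." form of the UUID here (placeholder value).
    handle = pynvml.nvmlDeviceGetHandleByUUID("GPU-12345678-1234-1234-1234-123456789abc")
    print(pynvml.nvmlDeviceGetName(handle))
finally:
    pynvml.nvmlShutdown()
```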
/ok to test
```python
# TODO: If the user cares about the prefix, we will expose that in the
# future using the MIG-related APIs in NVML.

return nvml.device_get_uuid(self._handle)[4:]
```
I am thinking the system device should return the full UUID and we document the different expectations between cuda.core and cuda.core.system (or CUDA vs NVML). @mdboom thoughts?
I can totally see it both ways. My original implementation did what you suggested (following upstream NVML behavior). But @cpcloud convinced me this is weird -- UUID has a well-defined meaning in our field that NVML deviates from. I don't feel super strongly either way -- we just need to break the tie ;)
IIUC NVML is the only way for us to tell, from inside a running process, if we are using MIG instances or otherwise (bare-metal GPU, MPS, etc.). CUDA purposely hides MIG from end users. So my thinking is that if we don't follow NVML, there is no other way for Python users to query this. Could you check if my impression is correct?
There is another API, `nvmlDeviceIsMigDeviceHandle`, that could be used to query whether it's MIG, and IMHO, that's better than the user needing to parse a string to get that info.
```c
/**
 * Test if the given handle refers to a MIG device.
 *
 * A MIG device handle is an NVML abstraction which maps to a MIG compute instance.
 * These overloaded references can be used (with some restrictions) interchangeably
 * with a GPU device handle to execute queries at a per-compute instance granularity.
 *
 * For Ampere &tm; or newer fully supported devices.
 * Supported on Linux only.
 *
 * @param device       NVML handle to test
 * @param isMigDevice  True when handle refers to a MIG device
 */
nvmlReturn_t DECLDIR nvmlDeviceIsMigDeviceHandle(nvmlDevice_t device, unsigned int *isMigDevice);
```
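For comparison on the Python side, checking the same thing through the standalone `pynvml` (nvidia-ml-py) binding would look roughly like this; `cuda.core.system` would presumably wrap the same NVML call through its own `nvml` layer rather than `pynvml`.

```python
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    # Truthy when the handle refers to a MIG compute instance rather than a whole GPU.
    if pynvml.nvmlDeviceIsMigDeviceHandle(handle):
        print("MIG device")
    else:
        print("whole GPU")
finally:
    pynvml.nvmlShutdown()
```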
This adds two convenience methods to convert between `cuda.core.Device` and `cuda.core.system.Device`. I have a few design questions for this one:

- Are the names `Device.to_system_device` and `Device.to_cuda_device` ok? Another option would be `cuda.core.Device(nvml_device)`, but that requires importing the other module from the constructor, which could be a performance / cyclical import nightmare (see the sketch below).
Device.to_system_deviceandDevice.to_cuda_deviceok?cuda.core.Device(nvml_device), but that requires importing the other module from the constructor which could be a performance / cyclical import nightmare.