ggml/hip: fix APU compatibility - soft error handling for hipMemAdviseSetCoarseGrain #20536
…eSetCoarseGrain

On AMD APU/iGPU devices (unified memory architecture), hipMemAdviseSetCoarseGrain returns hipErrorInvalidValue because the hint is not applicable to UMA systems. The previous CUDA_CHECK() call treated this as a fatal error, causing crashes on APU systems such as AMD Strix Halo (gfx1151).

Fix: treat hipMemAdviseSetCoarseGrain as an optional performance hint - call it without error checking and clear any resulting error with hipGetLastError().

Also add pre-allocation debug logging (GGML_LOG_DEBUG) to help diagnose memory issues on APU systems, and store totalGlobalMem in device info.

Context: AMD APUs on Windows are affected by a ROCm runtime bug that limits hipMallocManaged to ~64GB regardless of available system RAM. A fix has been submitted upstream: ROCm/rocm-systems#4077

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Related to #20472, which fixes the same class of AMD APU issues on Linux.
Together with #20472, these fixes make AMD APUs fully functional for LLM
This PR fixes the immediate crash issue ( However, the underlying >64GB allocation limit reported in #19764 is a ROCm Once both fixes are merged, Windows APU users (Strix Halo, etc.) will be able Related: #20472 fixes the Linux memory reporting side.
Make one PR per fix please.
@JohannesGaessler Thanks for reviewing! These two PRs fix different issues: #20472 (by @hogeheer499-commits):
This PR (#20536):
They're complementary fixes for AMD APU compatibility:
Additional context: There's also an upstream ROCm bug limiting Windows APU allocations to ~64GB (reported in #19764). I've submitted a Without that ROCm fix, Windows APU users hit the 64GB limit regardless. But this PR still has value:
Would you like me to simplify this PR (remove debug logging / total_vram tracking) to focus only on the
You are making unrelated changes to Also according to the llama.cpp AI usage policy:
I resubmitted the clean code. Test Environment Test Results Notes
This test ran on Windows, where hipMallocManaged returns hipSuccess (not hipErrorNotSupported), so execution reaches the hipMemAdviseSetCoarseGrain call. Without the fix, it crashes at that point. Also, sorry: I only used AI to translate my text, not to write the PR, which may be why it read as AI-generated.
Thank you for clarifying. Unfortunately, we had to ban it because that is the only feasible way for us to avoid sifting through a lot of incorrect or hallucinated issues/PRs.

Description:
Problem
On AMD APU/iGPU devices (unified memory architecture, e.g. AMD Strix Halo gfx1151), hipMemAdviseSetCoarseGrain returns hipErrorInvalidValue because this hint is not applicable to UMA systems. The current code wraps this call in CUDA_CHECK(), which treats it as a fatal error and crashes.
Fix
Treat hipMemAdviseSetCoarseGrain as an optional performance hint:
- Call it without the CUDA_CHECK() wrapper
- Clear any resulting error with hipGetLastError() to prevent propagation

This matches the intent of the existing comment ("fall back to cudaMalloc if not supported") and is consistent with how optional hints are handled elsewhere.
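The pattern described above can be sketched roughly as follows. This is not the code from the PR: every `mock_*` name, the `g_*` globals, and `alloc_unified` are hypothetical stand-ins so the example compiles without ROCm. The mock only assumes what the description states (the advice call fails with an invalid-value error on UMA devices) plus the usual runtime behavior that reading the last error also resets it.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-ins for the HIP error codes; not the real hipError_t values.
enum mock_hip_error {
    MOCK_HIP_SUCCESS             = 0,
    MOCK_HIP_ERROR_INVALID_VALUE = 1,
    MOCK_HIP_ERROR_OUT_OF_MEMORY = 2,
};

static mock_hip_error g_last_error    = MOCK_HIP_SUCCESS; // models the runtime's "last error" slot
static bool           g_device_is_uma = true;             // pretend we are running on an APU/iGPU

// Stand-in for hipMallocManaged: plain host allocation.
static mock_hip_error mock_malloc_managed(void ** ptr, std::size_t size) {
    *ptr = std::malloc(size);
    return *ptr != nullptr ? MOCK_HIP_SUCCESS : MOCK_HIP_ERROR_OUT_OF_MEMORY;
}

// Stand-in for the coarse-grain advice call: rejected on UMA devices.
static mock_hip_error mock_mem_advise_coarse_grain(void * /*ptr*/, std::size_t /*size*/) {
    if (g_device_is_uma) {
        g_last_error = MOCK_HIP_ERROR_INVALID_VALUE;
        return g_last_error;
    }
    return MOCK_HIP_SUCCESS;
}

// Stand-in for hipGetLastError: returns the stored error and resets it.
static mock_hip_error mock_get_last_error() {
    mock_hip_error err = g_last_error;
    g_last_error = MOCK_HIP_SUCCESS;
    return err;
}

// Soft handling: allocation failures still propagate, but a failed hint is ignored.
static mock_hip_error alloc_unified(void ** ptr, std::size_t size) {
    mock_hip_error err = mock_malloc_managed(ptr, size);
    if (err != MOCK_HIP_SUCCESS) {
        return err;
    }
    (void) mock_mem_advise_coarse_grain(*ptr, size); // optional hint, no fatal check
    (void) mock_get_last_error();                    // clear it so later checks do not trip on it
    return MOCK_HIP_SUCCESS;
}
```

With `g_device_is_uma` set to `true`, `alloc_unified` still reports success and leaves no pending error behind, which is the behavior the fix aims for on APUs. The trade-off of this approach is that a genuinely unexpected failure of the hint is also silenced.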
Additional Changes
- Add GGML_LOG_DEBUG pre-allocation memory logging to help diagnose memory issues on APU systems
- Store totalGlobalMem in device info struct for future use
Testing
Tested on AMD Strix Halo (gfx1151), 128GB unified memory, Windows 11:
hipMemAdviseSetCoarseGrain with hipErrorInvalidValue
Context: ROCm APU Large BAR Bug
AMD APUs on Windows are currently limited to ~64GB hipMallocManaged allocations due to a ROCm runtime bug where largeBar_ is unconditionally disabled for all APU devices in HIP mode. This causes the Windows GART allocator's 50%-of-RAM cap to trigger prematurely.
A fix has been submitted to ROCm upstream:
ROCm/rocm-systems#4077
Without that ROCm fix, APUs are still limited to ~64GB regardless of this change. However, this PR is independently valuable:
Impact
GGML_CUDA_ENABLE_UNIFIED_MEMORY