ggml-cuda : fix UMA memory detection for HIP/ROCm on AMD APUs#20472
ggml-cuda : fix UMA memory detection for HIP/ROCm on AMD APUs#20472hogeheer499-commits wants to merge 2 commits intoggml-org:masterfrom
Conversation
Bug verification on AMD Ryzen AI MAX+ 395 (gfx1151, 128GB unified memory)Wrote a test program that simulates the exact code path in On AMD APUs, The |
Note on end-to-end testingI was unable to reproduce the However, the mechanism is clearly demonstrated above: On my 128GB system the 91 GiB reported by The fix itself is minimal and clearly correct: |
AMD APUs report prop.integrated=1 which triggers the UMA memory path from ggml-org#17368. This overrides hipMemGetInfo() (accurate) with /proc/meminfo MemAvailable (too low), losing ~30 GiB on a 128GB Strix Halo system. For HIP builds, only enter the UMA path when GGML_CUDA_ENABLE_UNIFIED_MEMORY is explicitly set. This preserves correct behavior for both cases: - Default: hipMemGetInfo() reports accurate TTM-backed memory - GGML_CUDA_ENABLE_UNIFIED_MEMORY=1: /proc/meminfo is used (system RAM mode) Tested on AMD Ryzen AI MAX+ 395, Radeon 8060S (gfx1151), 128GB, ROCm 7.1. Fixes: ggml-org#18159
8674daa to
73357da
Compare
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
|
Great fix! For Windows users with the same APU, there's a complementary |
|
On my Strix Halo system I get the following on master: ggml_cuda_init: found 1 ROCm devices (Total VRAM: 62206 MiB):
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 62206 MiB (62202 MiB free)
build: 8323 (57819b8d4) with GNU 15.2.1 for Linux x86_64
llama_params_fit_impl: projected to use 5438 MiB of device memory vs. 121595 MiB of free device memory
llama_params_fit_impl: will leave 116157 >= 1024 MiB of free device memory, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.18 seconds
main: printing fitted CLI arguments to stdout...
-c 0 -ngl -1With this PR I get: ggml_cuda_init: found 1 ROCm devices (Total VRAM: 62206 MiB):
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 62206 MiB (62202 MiB free)
build: 8325 (e0dace50d) with GNU 15.2.1 for Linux x86_64
llama_params_fit_impl: projected to use 5438 MiB of device memory vs. 62060 MiB of free device memory
llama_params_fit_impl: will leave 56621 >= 1024 MiB of free device memory, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.18 seconds
main: printing fitted CLI arguments to stdout...
-c 0 -ngl -1So at the very least there are some edge cases that this PR does not handle correctly and it cannot be merged like this. |
|
Thanks for testing! I see the issue — on your system I think the better fix is actually simpler: instead of adding HIP-specific logic, just change the existing UMA path to take the maximum instead of unconditionally overwriting: size_t proc_free = (size_t)available_memory_kb * 1024;
if (proc_free > *free) {
*free = proc_free;
}This way it works for both CUDA and HIP without any
One question: this results in I'll push once you confirm. |
|
According to the llama.cpp AI usage policy:
|
|
Yeah fair point, I used AI to help structure that comment since English isn't my first language, but I should've just written it myself. Won't happen again. The fix itself (max instead of always overwriting) I do understand and stand behind. Want me to push it? |
AMD APUs report
prop.integrated == 1, which triggers the UMA memory detection from #17368. This replaces the accuratehipMemGetInfo()value withMemAvailablefrom/proc/meminfo, which reports significantly less memory on systems with large TTM allocations (e.g. 122 GiB vs 91 GiB on a 128GB Strix Halo system).For HIP builds, skip the
prop.integratedcheck and only enter the UMA path whenGGML_CUDA_ENABLE_UNIFIED_MEMORYis explicitly set. This wayhipMemGetInfo()is used by default (which correctly reports TTM-backed memory), while the explicit env var override still works for users who need it.Verified on AMD Ryzen AI MAX+ 395 (gfx1151, 128GB unified memory, ROCm 7.1) that
prop.integratedreturns 1 andhipMemGetInfo()returns 122880 MiB whileMemAvailablereports ~91 GiB.Fixes #18159
Related: #19818, #19764, #18650