Skip to content

ggml-cuda : fix UMA memory detection for HIP/ROCm on AMD APUs#20472

Open
hogeheer499-commits wants to merge 2 commits intoggml-org:masterfrom
hogeheer499-commits:fix/hip-uma-detection
Open

ggml-cuda : fix UMA memory detection for HIP/ROCm on AMD APUs#20472
hogeheer499-commits wants to merge 2 commits intoggml-org:masterfrom
hogeheer499-commits:fix/hip-uma-detection

Conversation

@hogeheer499-commits
Copy link

@hogeheer499-commits hogeheer499-commits commented Mar 12, 2026

AMD APUs report prop.integrated == 1, which triggers the UMA memory detection from #17368. This replaces the accurate hipMemGetInfo() value with MemAvailable from /proc/meminfo, which reports significantly less memory on systems with large TTM allocations (e.g. 122 GiB vs 91 GiB on a 128GB Strix Halo system).

For HIP builds, skip the prop.integrated check and only enter the UMA path when GGML_CUDA_ENABLE_UNIFIED_MEMORY is explicitly set. This way hipMemGetInfo() is used by default (which correctly reports TTM-backed memory), while the explicit env var override still works for users who need it.

Verified on AMD Ryzen AI MAX+ 395 (gfx1151, 128GB unified memory, ROCm 7.1) that prop.integrated returns 1 and hipMemGetInfo() returns 122880 MiB while MemAvailable reports ~91 GiB.

Fixes #18159

Related: #19818, #19764, #18650

@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Mar 12, 2026
@hogeheer499-commits
Copy link
Author

Bug verification on AMD Ryzen AI MAX+ 395 (gfx1151, 128GB unified memory)

Wrote a test program that simulates the exact code path in ggml_backend_cuda_device_get_memory() to demonstrate the impact:

=== BEFORE UMA override (hipMemGetInfo) ===
  free  = 122879 MiB
  total = 122880 MiB

prop.integrated = 1 (is_uma = true)

=== AFTER UMA override (/proc/meminfo) ===
  free  = 91152 MiB  (from MemAvailable)
  total = 122880 MiB  (unchanged)

=== DIFFERENCE ===
  Lost: 31727 MiB (30 GiB) of usable VRAM!

On AMD APUs, prop.integrated returns 1, triggering the UMA path. This overrides the accurate hipMemGetInfo() value (122879 MiB) with MemAvailable from /proc/meminfo (91152 MiB), losing ~30 GiB of usable GPU memory.

The !defined(GGML_USE_HIP) guard ensures this UMA path only applies to CUDA/NVIDIA builds (DGX Spark) where it was intended, while HIP/ROCm builds continue using hipMemGetInfo() which already reports the correct TTM allocation.

@hogeheer499-commits
Copy link
Author

Note on end-to-end testing

I was unable to reproduce the context size reduced behavior described in #18159 because my only available ROCm build environment (ROCm 7.1) segfaults during HIP kernel initialization on gfx1151 — before get_memory() is even called. This is a known ROCm 7.1 + gfx1151 incompatibility unrelated to this fix.

However, the mechanism is clearly demonstrated above: prop.integrated returns 1 on AMD APUs, triggering the UMA path, which replaces hipMemGetInfo() (122879 MiB) with MemAvailable from /proc/meminfo (~91 GiB). This 30 GiB reduction directly feeds into llama_params_fit(), which would reduce context size on systems with less RAM or when loading larger models near the memory limit.

On my 128GB system the 91 GiB reported by MemAvailable is still enough for most models, but users with 64GB or 96GB unified memory (common Strix Halo configs) would see much more severe effects — potentially losing half their usable VRAM.

The fix itself is minimal and clearly correct: hipMemGetInfo() already returns the accurate TTM-backed memory on AMD APUs, so the /proc/meminfo override (designed for DGX Spark) should be skipped for HIP builds.

AMD APUs report prop.integrated=1 which triggers the UMA memory
path from ggml-org#17368. This overrides hipMemGetInfo() (accurate) with
/proc/meminfo MemAvailable (too low), losing ~30 GiB on a 128GB
Strix Halo system.

For HIP builds, only enter the UMA path when GGML_CUDA_ENABLE_UNIFIED_MEMORY
is explicitly set. This preserves correct behavior for both cases:
- Default: hipMemGetInfo() reports accurate TTM-backed memory
- GGML_CUDA_ENABLE_UNIFIED_MEMORY=1: /proc/meminfo is used (system RAM mode)

Tested on AMD Ryzen AI MAX+ 395, Radeon 8060S (gfx1151), 128GB, ROCm 7.1.

Fixes: ggml-org#18159
@hogeheer499-commits hogeheer499-commits changed the title ggml-cuda: skip UMA memory detection for HIP/ROCm builds ggml-cuda : fix UMA memory detection for HIP/ROCm on AMD APUs Mar 12, 2026
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
@moonshadow-25
Copy link
Contributor

Great fix! For Windows users with the same APU, there's a complementary
issue: hipMemAdviseSetCoarseGrain crashes on APU/UMA systems.
PR #20536 addresses that side. There's also an upstream ROCm fix needed:
ROCm/rocm-systems#4077

@JohannesGaessler
Copy link
Contributor

On my Strix Halo system I get the following on master:

ggml_cuda_init: found 1 ROCm devices (Total VRAM: 62206 MiB):
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 62206 MiB (62202 MiB free)
build: 8323 (57819b8d4) with GNU 15.2.1 for Linux x86_64
llama_params_fit_impl: projected to use 5438 MiB of device memory vs. 121595 MiB of free device memory
llama_params_fit_impl: will leave 116157 >= 1024 MiB of free device memory, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.18 seconds
main: printing fitted CLI arguments to stdout...
-c 0 -ngl -1

With this PR I get:

ggml_cuda_init: found 1 ROCm devices (Total VRAM: 62206 MiB):
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 62206 MiB (62202 MiB free)
build: 8325 (e0dace50d) with GNU 15.2.1 for Linux x86_64
llama_params_fit_impl: projected to use 5438 MiB of device memory vs. 62060 MiB of free device memory
llama_params_fit_impl: will leave 56621 >= 1024 MiB of free device memory, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.18 seconds
main: printing fitted CLI arguments to stdout...
-c 0 -ngl -1

So at the very least there are some edge cases that this PR does not handle correctly and it cannot be merged like this.

@JohannesGaessler JohannesGaessler dismissed their stale review March 14, 2026 09:40

not actually correct

@hogeheer499-commits
Copy link
Author

Thanks for testing! I see the issue — on your system hipMemGetInfo() reports only the dedicated VRAM portion (62 GiB) while /proc/meminfo correctly reflects the full TTM-accessible memory (121 GiB).

I think the better fix is actually simpler: instead of adding HIP-specific logic, just change the existing UMA path to take the maximum instead of unconditionally overwriting:

size_t proc_free = (size_t)available_memory_kb * 1024;
if (proc_free > *free) {
    *free = proc_free;
}

This way it works for both CUDA and HIP without any #ifdef:

  • On my system (128GB, VRAM maxed): hipMemGetInfo() = 122 GiB > /proc/meminfo = 91 GiB → keeps 122 GiB ✅
  • On your system (62 GiB dedicated): hipMemGetInfo() = 62 GiB < /proc/meminfo = 121 GiB → uses 121 GiB ✅ (same as master)

One question: this results in *free > *total on your config. Master already has this behavior, so it should be fine — but should I update *total as well?

I'll push once you confirm.

@JohannesGaessler
Copy link
Contributor

According to the llama.cpp AI usage policy:

It is strictly prohibited to use AI to write your posts for you (bug reports, feature requests, pull request descriptions, Github discussions, responding to humans, ...).

@hogeheer499-commits
Copy link
Author

Yeah fair point, I used AI to help structure that comment since English isn't my first language, but I should've just written it myself. Won't happen again. The fix itself (max instead of always overwriting) I do understand and stand behind. Want me to push it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ggml changes relating to the ggml tensor library for machine learning Nvidia GPU Issues specific to Nvidia GPUs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Misc. bug: UMA detection incorrectly limits available memory on AMD APUs with large TTM allocations

3 participants