
llama-bench: introduce -hf and -hff flags & use --mmap 1 by default #20211

Merged
taronaeo merged 3 commits into ggml-org:master from taronaeo:feat/bench-mmap-on on Mar 9, 2026

Conversation

taronaeo (Contributor) commented Mar 7, 2026

This PR introduces the -hf and -hff flags for downloading models from Hugging Face, matching the behavior of llama-cli. It also turns mmap on by default.

I've been using llama-cli and llama-completions with the -hf flag for quite a while now, but when I have to run benchmarks, the -hf and -hff flags were not available in llama-bench, forcing me to manually locate the model in the cache path, which isn't ideal.

Test

$ build/bin/llama-bench -hf ibm-granite/granite-3.3-2b-instruct-GGUF:Q4_K_M,ggml-org/gpt-oss-20b-GGUF

common_download_file_single_online: using cached file (same etag): /Users/taronaeo/Library/Caches/llama.cpp/ibm-granite_granite-3.3-2b-instruct-GGUF_granite-3.3-2b-instruct-Q4_K_M.gguf
common_download_file_single_online: using cached file (same etag): /Users/taronaeo/Library/Caches/llama.cpp/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.007 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 26800.60 MB
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| granite 3B Q4_K - Medium       |   1.44 GiB |     2.53 B | MTL,BLAS   |       8 |           pp512 |        663.39 ± 0.25 |
| granite 3B Q4_K - Medium       |   1.44 GiB |     2.53 B | MTL,BLAS   |       8 |           tg128 |         59.44 ± 0.04 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | MTL,BLAS   |       8 |           pp512 |        500.93 ± 2.78 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | MTL,BLAS   |       8 |           tg128 |         42.92 ± 0.04 |

build: 938907293 (8227)

@taronaeo taronaeo requested a review from ggerganov as a code owner March 7, 2026 19:33
taronaeo added 3 commits March 8, 2026 14:36
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
@taronaeo taronaeo force-pushed the feat/bench-mmap-on branch from fc82d0c to 2920f54 Compare March 8, 2026 06:36
taronaeo (Contributor Author) commented Mar 9, 2026

2 CI failures seem unrelated to this PR. Merging.

@taronaeo taronaeo merged commit ae87863 into ggml-org:master Mar 9, 2026
144 of 149 checks passed
  /* tensor_split */          { std::vector<float>(llama_max_devices(), 0.0f) },
  /* tensor_buft_overrides */ { std::vector<llama_model_tensor_buft_override>{ { nullptr, nullptr } } },
- /* use_mmap */              { false },
+ /* use_mmap */              { true },
Contributor commented:
This default was intentionally set to avoid the situation where people run --direct-io 0,1 and see no difference because the parameter got overridden by mmap.

taronaeo (Contributor Author) replied:

Sorry, I missed this. In that case, would it be better if we automatically disabled the other flag and warned the user about it?

For example, with --mmap 1 (the default), if the user specifies --direct-io 1, we disable mmap automatically and show the change in the benchmark markdown table.

Or, we could check whether mmap and direct-io are turned on together. If they are, we fail early and make the user choose, either by setting --mmap 0 or by removing the --direct-io flag.

WDYT?

Contributor replied:

Something like that would be better, yes. Overall I think this is confusing because these are two flags that can't coexist. That means we have 4 configurations they can be set in, but only 3 are valid. It might be worth overhauling this into something like --loading-mode none|mmap|direct-io.

The difficulty with llama-bench is that it behaves differently with these values: what do you do if the user runs --direct-io 0,1? Handle both 0 and 1 the same and disable mmap for both? Or keep mmap for dio 0 but disable it for dio 1?
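For illustration, the suggested three-valued flag could be parsed roughly as follows. This is only a sketch: the flag, enum, and function names are hypothetical and not part of llama.cpp.

```cpp
#include <cassert>
#include <string>

// Hypothetical three-valued replacement for the --mmap/--direct-io pair,
// as suggested above. This only illustrates collapsing two mutually
// exclusive booleans into one enum; none of these names exist in llama.cpp.
enum class loading_mode { NONE, MMAP, DIRECT_IO };

// Maps a --loading-mode argument string to the enum; returns false on an
// unknown value so the caller can print a usage error.
static bool parse_loading_mode(const std::string & arg, loading_mode & out) {
    if (arg == "none")      { out = loading_mode::NONE;      return true; }
    if (arg == "mmap")      { out = loading_mode::MMAP;      return true; }
    if (arg == "direct-io") { out = loading_mode::DIRECT_IO; return true; }
    return false;
}
```

With a single enum, the invalid mmap-plus-direct-io combination simply cannot be expressed on the command line.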

taronaeo (Contributor Author) replied Mar 12, 2026:

> It might be worth overhauling this into something like --loading-mode none|mmap|direct-io.

I agree. Although, I'd assume that introducing this flag would also mean overhauling the other binaries (e.g., llama-cli) that have --mmap and -dio as well.

> The difficulty with llama-bench is just that it behaves differently with these values, what do you do if the user runs --direct-io 0,1? Handle both 0 and 1 the same and disable mmap for both? Keep mmap for dio 0, but disable for 1?

I was thinking that if only the --direct-io flag is specified, we disable mmap completely.
If both --mmap and --direct-io are specified, here is my thinking in pseudocode:

If mmap.empty() && !directio.empty() -> disable mmap
Else if mmap.size() != directio.size() -> exit early  // both flags need the same number of arguments
Else
  Loop through both mmap and directio
    If mmap[index] == 1 && directio[index] == 1 -> exit  // mmap and direct-io cannot be enabled together
    // we can't just check mmap[index] == directio[index], because both can be turned off without issue
    Else continue with the benchmark

Hope my thought process is clear. I may be missing something. Let me know if I should create a PR to get started on this :)
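The pseudocode above can be sketched as a small C++ check. The function and variable names are hypothetical (not actual llama-bench code), assuming the 0/1 flag values have been collected into vectors:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the checks above (not actual llama-bench code).
// mmap_vals / dio_vals hold the comma-separated 0/1 values the user passed
// for --mmap and --direct-io; an empty vector means the flag was omitted.
// Returns true if the run can proceed, false if the user must resolve a
// conflict. Rewrites mmap_vals when only --direct-io was given.
static bool validate_load_flags(std::vector<int> & mmap_vals,
                                const std::vector<int> & dio_vals) {
    if (mmap_vals.empty() && !dio_vals.empty()) {
        // only --direct-io given: disable mmap across the whole run
        mmap_vals.assign(dio_vals.size(), 0);
        return true;
    }
    if (!dio_vals.empty() && mmap_vals.size() != dio_vals.size()) {
        // both flags must list the same number of configurations
        return false;
    }
    for (size_t i = 0; i < dio_vals.size(); ++i) {
        if (mmap_vals[i] == 1 && dio_vals[i] == 1) {
            // mmap and direct I/O cannot be enabled together
            return false;
        }
        // both off at the same index is fine, so no == comparison here
    }
    return true;
}
```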

taronaeo (Contributor Author) replied:

Created a PR for this: #20461. PTAL
