
llama-bench: introduce -hf and -hff flags & use --mmap 1 by default #20211

Merged
taronaeo merged 3 commits into ggml-org:master from taronaeo:feat/bench-mmap-on on Mar 9, 2026

Conversation

taronaeo (Contributor) commented Mar 7, 2026

This PR introduces the -hf and -hff flags for downloading models from Hugging Face, matching the behavior of llama-cli. It also turns mmap on by default.

I've been using llama-cli and llama-completions with the -hf flag for quite a while now, but when I have to run benchmarks, the -hf and -hff flags were not available in llama-bench, forcing me to manually locate the model in the cache path, which isn't ideal.

Test

$ build/bin/llama-bench -hf ibm-granite/granite-3.3-2b-instruct-GGUF:Q4_K_M,ggml-org/gpt-oss-20b-GGUF

common_download_file_single_online: using cached file (same etag): /Users/taronaeo/Library/Caches/llama.cpp/ibm-granite_granite-3.3-2b-instruct-GGUF_granite-3.3-2b-instruct-Q4_K_M.gguf
common_download_file_single_online: using cached file (same etag): /Users/taronaeo/Library/Caches/llama.cpp/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.007 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 26800.60 MB
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| granite 3B Q4_K - Medium       |   1.44 GiB |     2.53 B | MTL,BLAS   |       8 |           pp512 |        663.39 ± 0.25 |
| granite 3B Q4_K - Medium       |   1.44 GiB |     2.53 B | MTL,BLAS   |       8 |           tg128 |         59.44 ± 0.04 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | MTL,BLAS   |       8 |           pp512 |        500.93 ± 2.78 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | MTL,BLAS   |       8 |           tg128 |         42.92 ± 0.04 |

build: 938907293 (8227)

@taronaeo taronaeo requested a review from ggerganov as a code owner March 7, 2026 19:33
taronaeo added 3 commits March 8, 2026 14:36
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
@taronaeo taronaeo force-pushed the feat/bench-mmap-on branch from fc82d0c to 2920f54 Compare March 8, 2026 06:36
taronaeo (Contributor Author) commented Mar 9, 2026

2 CI failures seem unrelated to this PR. Merging.

@taronaeo taronaeo merged commit ae87863 into ggml-org:master Mar 9, 2026
144 of 149 checks passed
  /* tensor_split */          { std::vector<float>(llama_max_devices(), 0.0f) },
  /* tensor_buft_overrides */ { std::vector<llama_model_tensor_buft_override>{ { nullptr, nullptr } } },
- /* use_mmap */              { false },
+ /* use_mmap */              { true },
Contributor commented:
This default was intentionally set to avoid the situation where people run --direct-io 0,1 and see no difference because the parameter got overridden by mmap.

taronaeo (Contributor Author) replied:

Sorry, I missed this. In that case, would it be better if we automatically disabled the other flag and warned the user about it?

For example, with --mmap 1 (the default), if the user specifies --direct-io 1, we disable mmap automatically and show the change in the benchmark markdown table.

Or, we could check whether mmap and direct-io are turned on together. If they are, we fail early and make the user choose, either by setting --mmap 0 or by removing the --direct-io flag.

WDYT?

Contributor replied:

Something like that would be better, yes. Overall I think this is confusing because these are two flags that can't coexist. That means we have 4 configurations they can be set in, but only 3 are valid. It might be worth overhauling this into something like --loading-mode none|mmap|direct-io.

The difficulty with llama-bench is that it behaves differently with these values: what do you do if the user runs --direct-io 0,1? Handle both 0 and 1 the same and disable mmap for both? Or keep mmap for dio 0 but disable it for dio 1?
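For illustration, the suggested three-valued flag could be parsed roughly as follows. This is only a sketch: the flag, enum, and function names are hypothetical and not part of llama.cpp.

```cpp
#include <cassert>
#include <string>

// Hypothetical three-valued replacement for the --mmap/--direct-io pair,
// as suggested above. This only illustrates collapsing two mutually
// exclusive booleans into one enum; none of these names exist in llama.cpp.
enum class loading_mode { NONE, MMAP, DIRECT_IO };

// Maps a --loading-mode argument string to the enum; returns false on an
// unknown value so the caller can print a usage error.
static bool parse_loading_mode(const std::string & arg, loading_mode & out) {
    if (arg == "none")      { out = loading_mode::NONE;      return true; }
    if (arg == "mmap")      { out = loading_mode::MMAP;      return true; }
    if (arg == "direct-io") { out = loading_mode::DIRECT_IO; return true; }
    return false;
}
```

With a single enum, the invalid mmap-plus-direct-io combination simply cannot be expressed on the command line.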

taronaeo (Contributor Author) replied Mar 12, 2026:

> It might be worth overhauling this into something like --loading-mode none|mmap|direct-io.

I agree. Although, I'd assume that introducing this flag would also mean overhauling the other binaries (e.g., llama-cli) that have --mmap and -dio as well.

> The difficulty with llama-bench is just that it behaves differently with these values, what do you do if the user runs --direct-io 0,1? Handle both 0 and 1 the same and disable mmap for both? Keep mmap for dio 0, but disable for 1?

I was thinking that if only the --direct-io flag is specified, we disable mmap completely.
If both --mmap and --direct-io are specified, here is my thinking in pseudocode:

If mmap.empty() && !directio.empty() -> disable mmap
Else if mmap.size() != directio.size() -> exit early  // both flags need the same number of arguments
Else
  Loop through both mmap and directio
    If mmap[index] == 1 && directio[index] == 1 -> exit  // mmap and direct-io cannot be enabled together
    // we can't just check mmap[index] == directio[index], because both can be turned off without issue
    Else continue with the benchmark

Hope my thought process is clear. I may be missing something. Let me know if I should create a PR to get started on this :)
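The pseudocode above can be sketched as a small C++ check. The function and variable names are hypothetical (not actual llama-bench code), assuming the 0/1 flag values have been collected into vectors:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the checks above (not actual llama-bench code).
// mmap_vals / dio_vals hold the comma-separated 0/1 values the user passed
// for --mmap and --direct-io; an empty vector means the flag was omitted.
// Returns true if the run can proceed, false if the user must resolve a
// conflict. Rewrites mmap_vals when only --direct-io was given.
static bool validate_load_flags(std::vector<int> & mmap_vals,
                                const std::vector<int> & dio_vals) {
    if (mmap_vals.empty() && !dio_vals.empty()) {
        // only --direct-io given: disable mmap across the whole run
        mmap_vals.assign(dio_vals.size(), 0);
        return true;
    }
    if (!dio_vals.empty() && mmap_vals.size() != dio_vals.size()) {
        // both flags must list the same number of configurations
        return false;
    }
    for (size_t i = 0; i < dio_vals.size(); ++i) {
        if (mmap_vals[i] == 1 && dio_vals[i] == 1) {
            // mmap and direct I/O cannot be enabled together
            return false;
        }
        // both off at the same index is fine, so no == comparison here
    }
    return true;
}
```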

taronaeo (Contributor Author) replied:

Created a PR for this: #20461. PTAL
