llama-bench: introduce `-hf` and `-hff` flags & use `--mmap 1` by default #20211
taronaeo merged 3 commits into ggml-org:master
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2 CI failures seem unrelated to this PR. Merging.
```diff
 /* tensor_split          */ { std::vector<float>(llama_max_devices(), 0.0f) },
 /* tensor_buft_overrides */ { std::vector<llama_model_tensor_buft_override>{ { nullptr, nullptr } } },
-/* use_mmap              */ { false },
+/* use_mmap              */ { true },
```
This default was intentionally set to avoid the situation where people run `--direct-io 0,1` and see no difference because the parameter gets overridden by mmap.
Sorry, I missed this. In that case, would it be better if we automatically disabled one of the two and warned the user about it?

For example, with `--mmap 1` (the default) and the user specifying `--direct-io 1`, we disable mmap automatically and show the change in the benchmark markdown table.

Or, we could check whether mmap and direct-io are turned on together. If they are, we fail early and make the user choose by setting `--mmap 0` or removing the `--direct-io` flag.
WDYT?
Something like that would be better, yes. Overall I think this is confusing because these are two flags that can't coexist. That means there are 4 configurations they can be set in, but only 3 are valid. It might be worth overhauling this into something like `--loading-mode none|mmap|direct-io`.

The difficulty with llama-bench is that it behaves differently with these values: what do you do if the user runs `--direct-io 0,1`? Handle both 0 and 1 the same and disable mmap for both? Keep mmap for dio 0, but disable it for 1?
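To illustrate the idea, a rough sketch of such a combined flag (hypothetical names; nothing here exists in llama-bench today):

```cpp
#include <string>

// Hypothetical --loading-mode flag with three mutually exclusive values,
// so the invalid mmap + direct-io combination cannot be expressed at all.
enum class loading_mode { none, mmap, direct_io };

// Parse one value of the flag; returns false for an unrecognized string.
static bool parse_loading_mode(const std::string & s, loading_mode & out) {
    if (s == "none")      { out = loading_mode::none;      return true; }
    if (s == "mmap")      { out = loading_mode::mmap;      return true; }
    if (s == "direct-io") { out = loading_mode::direct_io; return true; }
    return false;
}
```

With a single enum, comma-separated values like `--loading-mode mmap,direct-io` would compare the two modes directly instead of juggling two boolean lists.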
> It might be worth overhauling this into something like `--loading-mode none|mmap|direct-io`.
I agree. Although I'd assume that introducing this flag would also mean overhauling the other binaries (e.g., `llama-cli`) that have `--mmap` and `-dio` as well.
> The difficulty with llama-bench is just that it behaves differently with these values, what do you do if the user runs `--direct-io 0,1`? Handle both 0 and 1 the same and disable mmap for both? Keep mmap for dio 0, but disable for 1?
I was thinking that if only the `--direct-io` flag is specified, we completely disable mmap.

If both `--mmap` and `--direct-io` are specified, my thinking in pseudocode:

```
if mmap.empty() && !directio.empty():
    disable mmap
elif mmap.size() != directio.size():
    exit early  // both flags need the same number of values
else:
    for index in 0..mmap.size():
        if mmap[index] == 1 && directio[index] == 1:
            exit  // mmap and direct-io cannot both be enabled
        // note: we can't just check mmap[index] == directio[index],
        // because both can be turned off together without an issue
    continue with benchmark
```
Hope my thought process is clear. I may be missing something. Let me know if I should create a PR to get started on this :)
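For what it's worth, the checks above could look roughly like this in C++ (function and variable names are illustrative, not actual llama-bench code):

```cpp
#include <cstdio>
#include <vector>

// Hypothetical validation of the --mmap / --direct-io combinations
// discussed above. The vectors hold the comma-separated values the user
// passed (e.g. --mmap 0,1 -> {false, true}); empty means "not specified".
static bool validate_load_flags(std::vector<bool> & mmap, const std::vector<bool> & directio) {
    if (mmap.empty() && !directio.empty()) {
        // only --direct-io given: disable mmap entirely
        mmap.assign(directio.size(), false);
        return true;
    }
    if (directio.empty()) {
        // only --mmap (or neither): nothing to cross-check
        return true;
    }
    if (mmap.size() != directio.size()) {
        fprintf(stderr, "error: --mmap and --direct-io need the same number of values\n");
        return false;
    }
    for (size_t i = 0; i < mmap.size(); i++) {
        if (mmap[i] && directio[i]) {
            fprintf(stderr, "error: --mmap and --direct-io cannot both be enabled\n");
            return false;
        }
        // both being false together is fine, so no equality check here
    }
    return true;
}
```

This fails early on conflicting values while still allowing sweeps like `--direct-io 0,1` with `--mmap 0,0`.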
This PR introduces the `-hf` and `-hff` flags for downloading models remotely from Hugging Face, exactly the same as `llama-cli` does it. It also turns `mmap` on by default.

I've been using `llama-cli` and `llama-completions` with the `-hf` flags for quite a while now, and when I have to run benchmarks, the `-hf` and `-hff` flags were not available with `llama-bench`, forcing me to manually find the model cache path, which isn't ideal.

Test