Description
Prerequisites
I am running the latest code (tested with v0.3.23 and v0.3.24).
I carefully followed the README.md in JamePeng's repository.
I searched existing issues and found no duplicate.
Expected Behavior
After each inference run using Qwen3-VL GGUF models via llama-cpp-python (JamePeng's fork) in ComfyUI, all GPU memory (Windows) or unified memory (macOS) allocated by the library should be fully released, returning to baseline levels without accumulation over multiple runs.
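For reference, this is the teardown pattern the expectation assumes (a minimal sketch; the model path is a placeholder and the vision/mmproj wiring is omitted for brevity):

```python
import gc

from llama_cpp import Llama

# Placeholder path; any Qwen3-VL GGUF file applies.
llm = Llama(model_path="Qwen3-VL-4B-Instruct-Q6_K.gguf", n_gpu_layers=-1, n_ctx=4096)

llm.create_chat_completion(
    messages=[{"role": "user", "content": "Describe a cat sitting on a red couch."}]
)

# Explicitly free the llama.cpp model/context; after this plus a GC pass,
# GPU (Windows) or unified (macOS) memory should be back at baseline.
llm.close()
del llm
gc.collect()
```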
Current Behavior
On both Windows and macOS, memory usage increases by roughly 1 GB after each inference run. This memory is never released and accumulates until an out-of-memory (OOM) crash occurs. The issue is reproducible in both versions 0.3.23 and 0.3.24.
Important version note:
v0.3.23: Normal inference speed (~20-30 tok/s), but leaks memory.
v0.3.24: Severe performance regression (>7x slower, ~4 tok/s), and also leaks memory. The slowdown and the leak appear to be independent issues.
Environment
Hardware:
Windows: NVIDIA GeForce RTX 5070 Ti (16GB VRAM)
macOS: Apple M3 (16GB unified memory)
OS:
Windows 11
macOS 26.2
Software:
Python: 3.13.11
llama-cpp-python: 0.3.23 and 0.3.24
ComfyUI: v0.11.1 (the issue is backend-related and independent of the front end)
Models Tested
Qwen3-VL-30B-A3B-Instruct-IQ4_XS.gguf (~16.4GB)
Qwen3-VL-4B-Instruct-Q6_K.gguf (~3.3GB)
Steps to Reproduce
Install llama-cpp-python (JamePeng's fork) with CUDA support on Windows or via pip on macOS.
Download any Qwen3-VL GGUF model (e.g., the 4B model for macOS testing).
Use any ComfyUI front-end node that supports llama-cpp-python (e.g., ComfyUI-QwenVL or ComfyUI-llama-cpp).
Run a single inference (e.g., image/video captioning) and record baseline memory.
Run 5 consecutive inferences without restarting ComfyUI (a standalone loop that reproduces the same pattern outside ComfyUI is sketched after these steps).
Monitor memory:
Windows: nvidia-smi -l 1
macOS: Activity Monitor (watch Python process memory)
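To take ComfyUI out of the equation, a standalone loop along these lines can be used to watch for the same growth. This is only a sketch: it assumes the front-end node loads and releases the model on every run (if the node keeps the model cached, drop the per-iteration teardown and keep only the loop), and the model path is a placeholder.

```python
import gc
import os

import psutil
from llama_cpp import Llama

MODEL_PATH = "Qwen3-VL-4B-Instruct-Q6_K.gguf"  # placeholder

def rss_gb() -> float:
    """Resident set size of this process in GB (what Activity Monitor reports)."""
    return psutil.Process(os.getpid()).memory_info().rss / 1024**3

print(f"baseline: {rss_gb():.2f} GB")

for i in range(5):
    # Mirror one ComfyUI run: load, run a single completion, tear down.
    llm = Llama(model_path=MODEL_PATH, n_gpu_layers=-1, n_ctx=4096, verbose=False)
    llm.create_chat_completion(
        messages=[{"role": "user", "content": "Describe a cat sitting on a red couch."}]
    )
    llm.close()
    del llm
    gc.collect()
    print(f"after run {i + 1}: {rss_gb():.2f} GB")
```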
Evidence (macOS with 4B model)
| Stage | Python process memory (GB) |
| --- | --- |
| After ComfyUI startup | 0.65 |
| After 1st run | 1.4 |
| After 2nd run | 2.3 |
| After 3rd run | 3.1 |
| After 4th run | 4.3 |
| After 5th run | 4.7 |
On Windows, nvidia-smi shows identical stepwise growth (screenshots available upon request).
Additional Notes
The issue occurs with different front-end nodes, confirming the leak is in the backend (llama-cpp-python) itself.
The problem is specific to GGUF models loaded through llama-cpp-python; models run through other backends (e.g., diffusers) do not exhibit this leak.
Cross-platform reproducibility (Windows + macOS) strongly suggests a fundamental issue in the framework, not a driver quirk.
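A quick way to narrow down where the retained memory lives is to compare the Python heap (tracemalloc) with total process RSS across runs; a small helper sketch:

```python
import tracemalloc

import psutil

tracemalloc.start()

def report(label: str) -> None:
    """Print Python-heap usage (tracemalloc) next to total process RSS (psutil)."""
    py_current, _ = tracemalloc.get_traced_memory()
    rss = psutil.Process().memory_info().rss
    print(f"{label}: python heap {py_current / 2**20:.1f} MB, rss {rss / 2**30:.2f} GB")

# Call report() before and after each inference run. If the Python heap stays
# roughly flat while RSS climbs by ~1 GB per run, the leaked memory is held by
# native (llama.cpp/ggml) allocations rather than by Python objects.
```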