native QLoRA training with reward-weighted SFT and GRPO #20453
srossitto79 wants to merge 10 commits into ggml-org:master from
Conversation
According to the llama.cpp AI usage policy:
Sorry, but given that you are a new contributor and that the PR description is clearly machine generated, I do not, in a vacuum, have high confidence that the implementation in this PR is correct. Unfortunately, as of right now I also lack the technical expertise to judge the correctness of the implementation; I would have to read up on this first. As such, I'm giving the review of this PR a low priority until I am less busy with other matters, in a month or so.
No problem, take your time. I deliberately added this note at the end of the PR to be clear: I made the changes on a fork that had seriously diverged from upstream, so to prepare the PR I cloned upstream again and used Claude Code with Opus to transfer my changes onto the fresh repo code. I had to do some fixes manually; at first it forgot a few things here and there (a declaration, an assert), but in the end I aligned everything with my working dev version, and for the models I tested, this branch gave the same results as my dev repo. The only thing I don't like much is the way I interfaced Python with the GRPO finetuner: I used stdin/stdout/stderr to deliver the prompts, the responses, and the computed rewards. It's a rather rudimentary way to do IPC, but it works, it was fast to implement, and I provided an example of its usage in Python. I am open to better solutions if you have other ideas. I hope to see some comments about the training results that can help me improve the code or validate my observations.
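The line-based stdin/stdout exchange described above could look roughly like the sketch below. This is an illustration only, not the PR's actual protocol: a tiny inline echo process stands in for the real C++ trainer, and all message field names (`id`, `prompt`, `completion`) are hypothetical.

```python
import json
import subprocess
import sys

# Stand-in "trainer": reads one JSON object per line on stdin, writes one
# JSON reply per line on stdout (here it just upper-cases the prompt).
FAKE_TRAINER = r"""
import sys, json
for line in sys.stdin:
    msg = json.loads(line)
    reply = {"id": msg["id"], "completion": msg["prompt"].upper()}
    print(json.dumps(reply), flush=True)
"""

def run_step(proc, sample_id, prompt):
    # Driver side: send one request line, block on one reply line.
    proc.stdin.write(json.dumps({"id": sample_id, "prompt": prompt}) + "\n")
    proc.stdin.flush()
    return json.loads(proc.stdout.readline())

proc = subprocess.Popen(
    [sys.executable, "-c", FAKE_TRAINER],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)
reply = run_step(proc, 0, "hello grpo")
print(reply["completion"])
# In the real loop the driver would now score this completion and send the
# reward back on the next line.
proc.stdin.close()
proc.wait()
```

The appeal of this design, as noted above, is that it needs no extra dependencies on either side; the cost is that framing and error handling are entirely manual.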
Suboptimally worded PR descriptions are OK; autogenerated ones are not. My opinion is that carefully checked machine-generated code can be OK, but I want proof, in the form of a writeup, that the person clicking the submit button actually understands it. Otherwise I will end up wasting too much time reviewing broken code that will never get merged.
Summary
Native QLoRA + Reward-Weighted SFT + GRPO training pipeline for quantized GGUF models.
- `llama_adapter_lora_init` and `llama-export-lora`.
- `reward`/`score` field: cross-entropy loss is scaled by the normalized reward before backprop — no extra flags needed.
- `--grpo-mode` implements a full GRPO training loop via a line-based IPC protocol between a Python driver (prompt sampling + reward scoring) and the C++ process (model state + generation + gradient updates).

Key features
- `--freeze-layers N` — skip LoRA on the first N layers; backward pass auto-pruned
- `--grad-checkpoint N` — mark every Nth forward node persistent to reduce peak activation VRAM
- `--train-on-prompt` / `--shuffle-dataset` / `--save-every N` / `--val-split`
- `OUT_PROD` CUDA kernel — dequantize on GPU + cuBLAS for the backward matmul
- `OUT_PROD_ID` for MoE backward — enables LoRA on dense FFN layers in Mixtral/Nemotron-MoE

Files changed
- `examples/qlora_training/` (C++ trainer, Python GRPO driver, README, sample data)
- `ggml/src/ggml.c`, `ggml/src/ggml-opt.cpp`, `ggml/include/ggml-opt.h`, `ggml/include/ggml.h`
- `ggml/src/ggml-cuda/out-prod.cu` (new), `ggml/src/ggml-cuda/ggml-cuda.cu`, `ggml/src/ggml-cuda/opt-step-adamw.cu`
- `ggml/src/ggml-cpu/ops.cpp`
- `include/llama.h`, `src/llama-context.cpp`, `src/llama-context.h`
- `common/common.h`, `common/arg.cpp`
- `examples/training/finetune.cpp` (added `shuffle` + `grad_checkpoint_interval` params)

Tested on
Test plan
- Builds with `-DGGML_CUDA=ON` on Linux/Windows
- `llama-finetune-qlora` with sample data: `examples/qlora_training/sample_data.jsonl`
- Loads with `llama-cli --lora` and merges with `llama-export-lora`
- `examples/training/finetune` (existing example) to verify no regression
- `python3 examples/qlora_training/grpo_example.py --model <model>`

🤖 PR Generated with Claude Code
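To make the reward-weighted SFT idea from the summary concrete, here is a minimal sketch of scaling per-sample cross-entropy by a normalized reward before the backward pass. The normalization scheme used by this PR is not spelled out here, so the choice below (divide by the batch-mean reward, so weights average 1) is an assumption for illustration only.

```python
def reward_weighted_loss(ce_losses, rewards):
    """Scale each sample's cross-entropy loss by its normalized reward.

    ce_losses: per-sample cross-entropy values for one batch
    rewards:   per-sample rewards (e.g. from a `reward`/`score` field)
    Normalization is hypothetical: each reward is divided by the batch mean,
    so a batch of equal rewards reduces to plain mean cross-entropy.
    """
    mean_r = sum(rewards) / len(rewards)
    weights = [r / mean_r for r in rewards]
    return sum(l * w for l, w in zip(ce_losses, weights)) / len(ce_losses)

# Samples with above-average reward contribute more to the loss:
loss = reward_weighted_loss([2.0, 1.0], [3.0, 1.0])  # weights become [1.5, 0.5]
```

With equal rewards this degenerates to ordinary SFT, which matches the "no extra flags needed" behavior described above: the weighting only kicks in when samples actually carry a reward field.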
Disclosure: AI tools were used in an assistive capacity for mechanical porting
of already-written and tested code from a development fork to a clean upstream base.
All code was authored and extensively tested by the contributor.
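For context on the GRPO side of the pipeline: the "group-relative" part of the algorithm normalizes rewards within a group of completions sampled for the same prompt and uses the z-scores as advantages, avoiding a learned value function. The sketch below illustrates that general idea and is not taken from this PR's code.

```python
def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: z-score of each reward within its group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Rewards for, say, 3 completions of one prompt; advantages sum to ~0,
# so the group's average completion is neither reinforced nor penalized.
advs = group_advantages([1.0, 2.0, 3.0])
```

In a full GRPO loop these advantages would weight the policy-gradient update for each completion's tokens; in this PR's design the Python driver supplies the rewards over the IPC channel and the C++ process applies the resulting updates.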