fix: generated caption no longer gets truncated#19
fix: generated caption no longer gets truncated#19ServeurpersoCom merged 1 commit intoServeurpersoCom:masterfrom
Conversation
📝 WalkthroughWalkthroughThis pull request refines the generation logic in the fill step by conditionally disabling Classifier-Free Guidance (CFG) during text expansion and introducing selective FSM activation based on lyrics needs and CoT reasoning flags, preventing CFG distortion in textual scenarios. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs). Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tools/ace-qwen3.cpp`:
- Around line 794-797: The current logic only sets fill_cfg = 1.0f when
need_lyrics or req.use_cot_caption are true, but every path that calls
generate_phase1_batch() should have CFG disabled; change the assignment so
fill_cfg is always 1.0f for phase-1 fills (regardless of need_lyrics or
req.use_cot_caption) prior to calling generate_phase1_batch(), leaving
fill_top_p as-is and still using cfg_scale elsewhere; update references to
fill_cfg in the phase-1 generation code (e.g., where generate_phase1_batch() is
invoked) to ensure metadata-only fills also use fill_cfg = 1.0f.
| // Disable CFG for ANY textual expansion (lyrics OR CoT reasoning), | ||
| // as CFG distorts text logits and forces premature newlines. | ||
| float fill_cfg = (need_lyrics || req.use_cot_caption) ? 1.0f : cfg_scale; | ||
| float fill_top_p = top_p; |
There was a problem hiding this comment.
Keep CFG off for every phase-1 fill path.
Lines 794-797 only force fill_cfg = 1.0f for lyrics or use_cot_caption, but this branch always calls generate_phase1_batch() to produce text that is later parsed back into AcePrompt. Metadata-only fill therefore still runs textual sampling with CFG, so one fill mode keeps the same truncation/garbling risk this change is trying to remove.
Suggested fix
- // Disable CFG for ANY textual expansion (lyrics OR CoT reasoning),
- // as CFG distorts text logits and forces premature newlines.
- float fill_cfg = (need_lyrics || req.use_cot_caption) ? 1.0f : cfg_scale;
+ // Phase 1 always generates text in the fill path, so keep CFG off here.
+ float fill_cfg = 1.0f;📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Disable CFG for ANY textual expansion (lyrics OR CoT reasoning), | |
| // as CFG distorts text logits and forces premature newlines. | |
| float fill_cfg = (need_lyrics || req.use_cot_caption) ? 1.0f : cfg_scale; | |
| float fill_top_p = top_p; | |
| // Phase 1 always generates text in the fill path, so keep CFG off here. | |
| float fill_cfg = 1.0f; | |
| float fill_top_p = top_p; |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tools/ace-qwen3.cpp` around lines 794 - 797, The current logic only sets
fill_cfg = 1.0f when need_lyrics or req.use_cot_caption are true, but every path
that calls generate_phase1_batch() should have CFG disabled; change the
assignment so fill_cfg is always 1.0f for phase-1 fills (regardless of
need_lyrics or req.use_cot_caption) prior to calling generate_phase1_batch(),
leaving fill_top_p as-is and still using cfg_scale elsewhere; update references
to fill_cfg in the phase-1 generation code (e.g., where generate_phase1_batch()
is invoked) to ensure metadata-only fills also use fill_cfg = 1.0f.
There was a problem hiding this comment.
A 1 sample test with instrumental works, but a non-instrumental test when generating lyrics truncated the caption.
There was a problem hiding this comment.
@jdluzen, thanks for testing! That's a very useful data point.
Since the PR already sets fill_cfg = 1.0f when need_lyrics = true, CFG is not the active issue for the lyrics path. The likely culprit is that the old code also reset fill_top_p and fill_top_k for lyrics specifically, and those lines were removed in this PR:
// Old code (before this PR) — these two lines are now gone:
fill_top_p = 1.0f;
fill_top_k = 0;Without those overrides, lyrics generation now uses whatever top_p and top_k values come from the request. A restrictive top_p < 1.0 (or a non-zero top_k) shrinks the candidate pool during free-text sampling, which can cause the sampler to pick TOKEN_IM_END early and truncate mid-sentence.
A secondary suspect is the FSM: with vocal_language = "en" (the example payload in your PR), fsm.force_language("en") is active during lyrics fill. If the language-forcing masks too aggressively, it can also block valid continuation tokens.
Suggested fix — restore the sampling overrides for the lyrics path alongside the CFG change:
// Disable CFG for ANY textual expansion (lyrics OR CoT reasoning),
// as CFG distorts text logits and forces premature newlines.
float fill_cfg = (need_lyrics || req.use_cot_caption) ? 1.0f : cfg_scale;
float fill_top_p = top_p;
int fill_top_k = top_k;
if (need_lyrics) {
// Free-text caption expansion: unrestricted nucleus sampling.
// CFG is already off (fill_cfg = 1.0f above); also disable top_k
// and open top_p so the sampler doesn't prematurely hit EOS.
fill_top_p = 1.0f;
fill_top_k = 0;
}This restores the old behavior for the lyrics path while keeping the CFG fix for use_cot_caption. If you're still seeing truncation after this, the FSM language-forcing during caption expansion is worth investigating next (it may make sense to skip force_language when the fill step is generating a free-form caption rather than structured metadata).
|
Great work ! |
* Add LEGO mode: generate instrumental stems over references (#19) * Add LEGO mode: --lego <track> flag for dit-vae, example files, README docs * Remove base model check from lego.sh Removed the echo statement for ensuring the base model. * Refactor lego.sh by removing echo statements Removed echo statements for steps in the script. * Implement error check for --lego with DiT model Add error handling for --lego option requiring base DiT model * Move lego mode from `--lego <track>` CLI flag to `"lego"` JSON request field (#21) * apply requested changes --------- Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
I've been attempting to track down an issue where the generated caption is essentially cut off might sentence.
Occasionally, the lyrics also do not generate or are garbled, which I am wondering if it is the same issue.
I fully expect that there is a better solution, though this is what works for me right now for both regular and instrumental.
Summary by CodeRabbit