fix: generated caption no longer gets truncated #19

Merged
ServeurpersoCom merged 1 commit into ServeurpersoCom:master from jdluzen:fix/captiontruncate
Mar 10, 2026

Conversation

@jdluzen (Contributor) commented Mar 9, 2026

I've been attempting to track down an issue where the generated caption is essentially cut off mid-sentence.
Occasionally, the lyrics also fail to generate or come out garbled, and I wonder whether that is the same issue.

{
    "caption": "epic rock and roll about ramen noodles",
    "duration": 0,
    "lyrics": "",
    "inference_steps": 8,
    "vocal_language": "en"
}

I fully expect that there is a better solution, though this is what works for me right now for both regular and instrumental.

Summary by CodeRabbit

  • Bug Fixes
    • Improved text generation quality for lyrics and reasoning modes by optimizing sampling behavior.
    • Enhanced language constraint application to activate selectively, preventing unnecessary restrictions during appropriate generation scenarios.

coderabbitai bot commented Mar 9, 2026

📝 Walkthrough

This pull request refines the generation logic in the fill step by conditionally disabling Classifier-Free Guidance (CFG) during text expansion and introducing selective FSM activation based on lyrics needs and CoT reasoning flags, preventing CFG distortion in textual scenarios.

Changes

Cohort: Selective FSM and CFG Control
File(s): tools/ace-qwen3.cpp
Summary: Introduces an active_fsm pointer to conditionally enable the FSM for generation: the FSM is enabled for lyrics only when a valid vocal_language exists, and for non-lyrics when CoT reasoning is disabled. CFG is disabled (set to 1.0) during text expansion in lyrics generation or CoT reasoning to avoid sampling distortion. Updates the phase1_batch invocation to use the new active_fsm pointer instead of a direct FSM boolean indicator.
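
The selection rule in the summary above can be sketched roughly as follows. This is a minimal illustration, not the code from tools/ace-qwen3.cpp: the Fsm struct, the select_active_fsm helper, and the "non-empty string" test for a valid vocal_language are all assumptions made for the example.

```cpp
#include <string>

// Hypothetical stand-in for the real FSM type; only the enabled flag matters here.
struct Fsm {
    bool enabled;
};

// Returns the FSM to pass to phase-1 generation, or nullptr to run unconstrained.
// Assumption: a "valid" vocal_language is simply a non-empty string.
const Fsm* select_active_fsm(const Fsm& fsm,
                             bool need_lyrics,
                             bool use_cot_caption,
                             const std::string& vocal_language) {
    if (!fsm.enabled) {
        return nullptr;  // FSM was never set up for this request
    }
    if (need_lyrics) {
        // Lyrics path: constrain only when there is a language to force.
        return vocal_language.empty() ? nullptr : &fsm;
    }
    // Non-lyrics path: constrain only when CoT reasoning is disabled.
    return use_cot_caption ? nullptr : &fsm;
}
```

Passing a pointer rather than a boolean lets the caller express "no constraint" as nullptr while still handing over the FSM object itself when it is wanted.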

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • fix: pass generated caption to output json file #14: Complements this PR by propagating generated captions and vocal_language information, while this PR controls when FSM is enabled and manages CFG behavior during text expansion in the same caption/CoT flow.

Suggested reviewers

  • ServeurpersoCom

Poem

🐰 With careful paws, I toggle FSM's gate,
CFG dims when lyrics must generate,
CoT thoughts flow free from sampling's bend,
Each flag aligned, the right conditions blend!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name: Docstring Coverage
Status: ⚠️ Warning
Explanation: Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.
Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name: Description Check
Status: ✅ Passed
Explanation: Check skipped - CodeRabbit’s high-level summary is enabled.

Check name: Title check
Status: ✅ Passed
Explanation: The PR title 'fix: generated caption no longer gets truncated' is specific and directly addresses the main issue resolved by the changeset, which involves modifications to CFG and FSM logic to prevent caption truncation during text expansion.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tools/ace-qwen3.cpp`:
- Around line 794-797: The current logic only sets fill_cfg = 1.0f when
need_lyrics or req.use_cot_caption are true, but every path that calls
generate_phase1_batch() should have CFG disabled; change the assignment so
fill_cfg is always 1.0f for phase-1 fills (regardless of need_lyrics or
req.use_cot_caption) prior to calling generate_phase1_batch(), leaving
fill_top_p as-is and still using cfg_scale elsewhere; update references to
fill_cfg in the phase-1 generation code (e.g., where generate_phase1_batch() is
invoked) to ensure metadata-only fills also use fill_cfg = 1.0f.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1fda06bd-7129-42d4-8d56-66b53b0dc435

📥 Commits

Reviewing files that changed from the base of the PR and between 1d57065 and 90e0325.

📒 Files selected for processing (1)
  • tools/ace-qwen3.cpp

Comment on lines +794 to +797
// Disable CFG for ANY textual expansion (lyrics OR CoT reasoning),
// as CFG distorts text logits and forces premature newlines.
float fill_cfg = (need_lyrics || req.use_cot_caption) ? 1.0f : cfg_scale;
float fill_top_p = top_p;

@coderabbitai coderabbitai bot Mar 9, 2026

⚠️ Potential issue | 🟠 Major

Keep CFG off for every phase-1 fill path.

Lines 794-797 only force fill_cfg = 1.0f for lyrics or use_cot_caption, but this branch always calls generate_phase1_batch() to produce text that is later parsed back into AcePrompt. Metadata-only fill therefore still runs textual sampling with CFG, so one fill mode keeps the same truncation/garbling risk this change is trying to remove.
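
For context on why 1.0f means "CFG off": in the usual classifier-free guidance blend, the output logits are uncond + scale * (cond - uncond), so a scale of 1.0 reduces to the conditional logits alone. The sketch below illustrates that standard formula; apply_cfg is a hypothetical helper and may not match how tools/ace-qwen3.cpp combines its logits.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper illustrating the standard CFG blend on logits.
// scale == 1.0f yields the conditional logits; larger values push the
// distribution further away from the unconditional one.
std::vector<float> apply_cfg(const std::vector<float>& cond,
                             const std::vector<float>& uncond,
                             float scale) {
    assert(cond.size() == uncond.size());
    std::vector<float> out(cond.size());
    for (std::size_t i = 0; i < cond.size(); ++i) {
        out[i] = uncond[i] + scale * (cond[i] - uncond[i]);
    }
    return out;
}
```

With cond = {2.0, -1.0} and uncond = {0.5, 1.0}, a scale of 1.0 returns {2.0, -1.0} unchanged, while a scale of 2.0 returns {3.5, -3.0}: the gap between the two tokens widens, which is the kind of distortion the review comment is concerned about for free-text sampling.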

Suggested fix
-        // Disable CFG for ANY textual expansion (lyrics OR CoT reasoning),
-        // as CFG distorts text logits and forces premature newlines.
-        float fill_cfg   = (need_lyrics || req.use_cot_caption) ? 1.0f : cfg_scale;
+        // Phase 1 always generates text in the fill path, so keep CFG off here.
+        float fill_cfg   = 1.0f;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- // Disable CFG for ANY textual expansion (lyrics OR CoT reasoning),
- // as CFG distorts text logits and forces premature newlines.
- float fill_cfg = (need_lyrics || req.use_cot_caption) ? 1.0f : cfg_scale;
- float fill_top_p = top_p;
+ // Phase 1 always generates text in the fill path, so keep CFG off here.
+ float fill_cfg = 1.0f;
+ float fill_top_p = top_p;

Contributor Author

A single-sample test with an instrumental works, but a non-instrumental test that generates lyrics truncated the caption.

@jdluzen, thanks for testing! That's a very useful data point.

Since the PR already sets fill_cfg = 1.0f when need_lyrics = true, CFG is not the active issue for the lyrics path. The likely culprit is that the old code also reset fill_top_p and fill_top_k for lyrics specifically, and those lines were removed in this PR:

// Old code (before this PR) — these two lines are now gone:
fill_top_p = 1.0f;
fill_top_k = 0;

Without those overrides, lyrics generation now uses whatever top_p and top_k values come from the request. A restrictive top_p < 1.0 (or a non-zero top_k) shrinks the candidate pool during free-text sampling, which can cause the sampler to pick TOKEN_IM_END early and truncate mid-sentence.
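
To make the "shrinking candidate pool" point concrete, here is a simplified sketch that counts how many tokens survive top-k followed by top-p filtering. surviving_candidates is a hypothetical helper written for this illustration; it assumes top_k = 0 and top_p = 1.0 both mean "no restriction" (matching the overrides quoted above) and, unlike many real samplers, it does not renormalize probabilities after the top-k cut.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: number of candidate tokens left after top-k, then top-p.
// probs is taken by value because it is sorted in place.
std::size_t surviving_candidates(std::vector<float> probs, int top_k, float top_p) {
    std::sort(probs.begin(), probs.end(), std::greater<float>());
    std::size_t n = probs.size();

    // top-k: keep at most k of the most likely tokens (0 disables the cut).
    if (top_k > 0 && static_cast<std::size_t>(top_k) < n) {
        n = static_cast<std::size_t>(top_k);
    }

    // top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    if (top_p < 1.0f) {
        float cumulative = 0.0f;
        for (std::size_t i = 0; i < n; ++i) {
            cumulative += probs[i];
            if (cumulative >= top_p) {
                n = i + 1;
                break;
            }
        }
    }
    return n;
}
```

For probabilities {0.5, 0.25, 0.125, 0.125}, the unrestricted settings (top_k = 0, top_p = 1.0) keep all four candidates, top_p = 0.7 keeps two, and top_p = 0.5 keeps only the single most likely token.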

A secondary suspect is the FSM: with vocal_language = "en" (the example payload in your PR), fsm.force_language("en") is active during lyrics fill. If the language-forcing masks too aggressively, it can also block valid continuation tokens.

Suggested fix — restore the sampling overrides for the lyrics path alongside the CFG change:

// Disable CFG for ANY textual expansion (lyrics OR CoT reasoning),
// as CFG distorts text logits and forces premature newlines.
float fill_cfg   = (need_lyrics || req.use_cot_caption) ? 1.0f : cfg_scale;
float fill_top_p = top_p;
int   fill_top_k = top_k;

if (need_lyrics) {
    // Free-text caption expansion: unrestricted nucleus sampling.
    // CFG is already off (fill_cfg = 1.0f above); also disable top_k
    // and open top_p so the sampler doesn't prematurely hit EOS.
    fill_top_p = 1.0f;
    fill_top_k = 0;
}

This restores the old behavior for the lyrics path while keeping the CFG fix for use_cot_caption. If you're still seeing truncation after this, the FSM language-forcing during caption expansion is worth investigating next (it may make sense to skip force_language when the fill step is generating a free-form caption rather than structured metadata).
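
As a rough picture of how a constraint such as language forcing could block continuation tokens, the sketch below masks every disallowed token to negative infinity before sampling. mask_disallowed and the allowed vector are hypothetical names for this illustration; the actual FSM in tools/ace-qwen3.cpp may apply its constraints differently.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: tokens not marked as allowed get a logit of -inf,
// so they can never be sampled. Token ids beyond the end of `allowed`
// are treated as disallowed.
void mask_disallowed(std::vector<float>& logits, const std::vector<bool>& allowed) {
    const float neg_inf = -std::numeric_limits<float>::infinity();
    for (std::size_t i = 0; i < logits.size(); ++i) {
        if (i >= allowed.size() || !allowed[i]) {
            logits[i] = neg_inf;
        }
    }
}
```

If the allowed set is too narrow during free-form text, the only tokens left may be structural ones such as an end-of-turn marker, which would look like truncation from the outside.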

@ServeurpersoCom ServeurpersoCom merged commit 0a91260 into ServeurpersoCom:master Mar 10, 2026
3 of 4 checks passed

@ServeurpersoCom (Owner) commented Mar 10, 2026

Great work!

                          Python ref           acestep.cpp (before)  acestep.cpp (PR #19)
CFG during CoT            yes (bug)            yes (bug)             no (fixed)
FSM during CoT caption    yes (via processor)  yes                   no (disabled)
Default cfg_scale         1.0 (hides bug)      2.0 (exposes bug)     2.0 (bug fixed)
CoT caption tokens        ~180 (cfg=1.0)       47 (cfg=2.0, trunc)   179 (cfg=1.0, full)
Caption quality           full (by luck)       truncated             full (by design)
Audio codes impact        clean                degraded prompt       clean
DiT conditioning          full embedding       partial embedding     full embedding

ServeurpersoCom pushed a commit that referenced this pull request Mar 10, 2026
* Add LEGO mode: generate instrumental stems over references (#19)

* Add LEGO mode: --lego <track> flag for dit-vae, example files, README docs

* Remove base model check from lego.sh

Removed the echo statement for ensuring the base model.

* Refactor lego.sh by removing echo statements

Removed echo statements for steps in the script.

* Implement error check for --lego with DiT model

Add error handling for --lego option requiring base DiT model

* Move lego mode from `--lego <track>` CLI flag to `"lego"` JSON request field (#21)

* apply requested changes

---------

Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>

2 participants