Updated script to add sub-headings to markdown code snippets #3182
Walkthrough
Adds four exported utilities to the MDX→Markdown transformer to generate language display names and inject language subheadings for multi-language code blocks; integrates subheading insertion into the conversion pipeline and extends tests to cover the new behavior.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks: ✅ 5 passed
4215876 to 4e92bf7
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@data/onPostBuild/transpileMdxToMarkdown.ts`:
- Around line 13-15: In function getLanguageDisplayName, change the early return
to use braces to satisfy the eslint "curly" rule: replace the line "if (!lang)
return '';" with a block-style conditional "if (!lang) { return ''; }" so the
function (and its parameter lang) uses a braced return before continuing to the
split/capitalize logic.
- Around line 27-31: The current codeTagRegex only matches bare
"<Code>...</Code>" so tags with attributes like "<Code fixed="true">...</Code>"
are skipped; update the codeTagRegex definition (the variable named codeTagRegex
used in the replace call) to allow an opening <Code> tag followed by optional
attributes by requiring a word boundary after "Code" and permitting any
characters except ">" until the closing ">", keep the case-insensitive and
global flags and keep the inner capture non-greedy so the replace callback (the
function handling innerContent) still receives the correct content.
- Around line 33-46: The regex in codeBlockRegex only uses \w+ and thus misses
language identifiers with hyphens, plus signs, dots, or other non-word chars;
update codeBlockRegex to capture any non-newline/backtick language token (e.g.,
/```([^\n`]+)\n[\s\S]*?```/g) and change the innerContent.replace callback
signature to accept the match and the lang (remove the redundant fullMatch
parameter) so getLanguageDisplayName(lang) receives the full language token;
modify occurrences of codeBlockRegex, innerContent.replace, and the callback
used to build transformedContent accordingly.
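The two regex changes requested above can be tried in isolation. The following is a minimal sketch, not the PR's code: `listFenceLanguages` is a hypothetical helper introduced only for illustration, while the two patterns are the ones proposed in the review comments (an opening `<Code>` tag with optional attributes, and a fence language token that allows hyphens, plus signs, and dots).

```typescript
// Hypothetical helper (not part of the PR): lists the language token of every
// fenced block found inside <Code ...> wrappers, using the reviewed patterns.
function listFenceLanguages(content: string): string[] {
  // Opening tag may carry attributes; \b stops <CodeBlock> from matching.
  const codeTagRegex = /<Code\b[^>]*>([\s\S]*?)<\/Code>/gi;
  const languages: string[] = [];
  let tagMatch: RegExpExecArray | null;

  while ((tagMatch = codeTagRegex.exec(content)) !== null) {
    // Any run of characters other than newline or backtick counts as the language.
    const codeBlockRegex = /```([^\n`]+)\n[\s\S]*?```/g;
    let fenceMatch: RegExpExecArray | null;
    while ((fenceMatch = codeBlockRegex.exec(tagMatch[1])) !== null) {
      languages.push(fenceMatch[1].trim());
    }
  }
  return languages;
}
```

With the earlier `\w+` pattern, tokens such as `objective-c` or `c++` would not be captured in full; with the pattern above they are returned as written in the fence.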
Pull request overview
This PR adds language-specific subheadings to code snippets within <Code> tags in markdown documentation to make it easier for LLMs to identify which language each code snippet belongs to. The transformation converts code blocks from having implicit language identifiers in the fence syntax to explicit markdown headings.
Changes:
- Added `getLanguageDisplayName()` function to convert language identifiers to display names (e.g., `realtime_javascript` → `Realtime Javascript`)
- Added `addLanguageSubheadingsToCodeBlocks()` function to transform code blocks within `<Code>` tags by prepending language headings
- Integrated the new transformation as Stage 12 in the MDX-to-Markdown transpilation pipeline
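For readers who want to see the display-name conversion concretely, here is a minimal sketch. The function body is not shown in this thread, so the split-on-underscore and capitalize-first-letter logic below is an assumption inferred from the `realtime_javascript` → `Realtime Javascript` example and from the reviewer's mention of "split/capitalize logic".

```typescript
// Assumed behavior, reconstructed from the example in the PR overview;
// the real implementation in transpileMdxToMarkdown.ts may differ.
function getLanguageDisplayName(lang: string): string {
  if (!lang) {
    return ''; // braced early return, as the eslint "curly" rule requires
  }
  return lang
    .trim()
    .split('_')
    .filter((part) => part.length > 0) // ignore empty segments such as in "a__b"
    .map((part) => part.charAt(0).toUpperCase() + part.slice(1))
    .join(' ');
}
```

Under these assumptions, `getLanguageDisplayName('realtime_javascript')` returns `'Realtime Javascript'` and an empty identifier returns an empty string.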
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| data/onPostBuild/transpileMdxToMarkdown.ts | Implements two new functions for language display name formatting and code block transformation; integrates the transformation into the pipeline after template variable replacement |
| data/onPostBuild/transpileMdxToMarkdown.test.ts | Adds comprehensive test coverage for the new functions including various language formats, multiple code blocks, and content preservation |
| data/onPostBuild/snapshots/transpileMdxToMarkdown.test.ts.snap | Updates snapshot to reflect the new heading structure in transformed code blocks |
@coderabbitai review
✅ Actions performed: Review triggered.
kennethkalmer left a comment
I like this @sacOO7, good stuff! I have some questions though:
- Should we remove the language from the fenced code block for token efficiency?
- [nitpick] Do you think it is possible that we dynamically determine what header level the markdown should use? In `messages.md` we jump from `##` to `####` in the first block; it would be nice to go to `###` instead. Likewise, if some other page already has `####` with code snippets inside, we should then be going to `#####`.
@kennethkalmer good question. I had thought about both points.
I’ll definitely double-check and update accordingly 👍
Claude said:
Also, For Pure LLM Consumption: Skip the Language Identifier
Accordingly updated the script 👍
@coderabbitai review
✅ Actions performed: Review triggered.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
    const precedingLevel = findPrecedingHeadingLevel(content, offset);
    const headingPrefix = '#'.repeat(precedingLevel + 1);
findPrecedingHeadingLevel(content, offset) scans from the start of the document up to offset for every <Code> match, which can become O(n²) on long docs with many <Code> blocks (e.g., large API reference pages). Consider precomputing heading positions/levels once (or scanning incrementally while iterating matches) to avoid repeated full-prefix scans and substring allocations.
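One possible shape for the suggested precomputation is sketched below. `collectHeadings` and `makeHeadingLevelLookup` are hypothetical names, not functions from the PR. The sketch assumes lookups arrive in ascending position order, which holds for `String.prototype.replace` callbacks, and it keeps the default level of 3 used by `findPrecedingHeadingLevel`. It does not address `#` lines inside fenced code blocks.

```typescript
interface HeadingMark {
  index: number; // offset of the heading line in the document
  level: number; // number of leading '#' characters
}

// Scan the document once and record every heading's position and level.
function collectHeadings(content: string): HeadingMark[] {
  const headingRegex = /^(#+)\s+/gm;
  const marks: HeadingMark[] = [];
  let match: RegExpExecArray | null;
  while ((match = headingRegex.exec(content)) !== null) {
    marks.push({ index: match.index, level: match[1].length });
  }
  return marks;
}

// Returns a lookup that advances a cursor through the precomputed headings,
// so the total work over all lookups is linear in the number of headings.
// Assumption: positions are passed in non-decreasing order.
function makeHeadingLevelLookup(content: string, defaultLevel = 3): (position: number) => number {
  const marks = collectHeadings(content);
  let cursor = 0;
  let currentLevel = defaultLevel;
  return (position: number): number => {
    while (cursor < marks.length && marks[cursor].index < position) {
      currentLevel = marks[cursor].level;
      cursor += 1;
    }
    return currentLevel;
  };
}
```

Inside the `replace` callback, the lookup would be created once before the call and invoked with each `offset`, replacing the repeated `substring` and full-prefix scan.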
    const output = addLanguageSubheadingsToCodeBlocks(input);
    expect(output).toContain('#### Javascript');
    // Language should be removed from fenced code blocks
    expect(output).not.toContain('```javascript');
    expect(output).toContain('```\nconst x = 1;');
The <Code fixed="true"> test asserts that the language subheading is added, but it doesn't assert that the fixed="true" attribute is preserved in the output. Adding an expectation for the attribute would catch regressions where the transform accidentally drops <Code> tag attributes.
    }

    // Ensure proper newline after <Code> tag for markdown formatting
    return `<Code>\n\n${transformedContent.trimStart()}</Code>`;
When transforming a <Code ...> block that contains language-tagged fences, the replacement hard-codes the opening tag to <Code>, which drops any existing attributes like fixed="true". This will change semantics for many docs that rely on these attributes. Preserve the original opening <Code...> tag (and ideally its exact casing/attributes) when returning the transformed block.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@data/onPostBuild/transpileMdxToMarkdown.test.ts`:
- Around line 764-775: The test for addLanguageSubheadingsToCodeBlocks doesn't
assert that attributes on the <Code> tag (e.g., fixed="true") are preserved;
update the test to assert the transformed output contains '<Code fixed="true">'
and ensure the transformer (addLanguageSubheadingsToCodeBlocks / any helper used
by transpileMdxToMarkdown) is changed to retain attributes from the original
<Code> wrapper when emitting the transformed markup rather than stripping them
out; locate the logic that replaces or serializes the <Code> node in
addLanguageSubheadingsToCodeBlocks and propagate the original node.attributes
into the output so the new assertion passes.
In `@data/onPostBuild/transpileMdxToMarkdown.ts`:
- Around line 70-89: The replacement currently drops any attributes on the
original <Code> tag; update addLanguageSubheadingsToCodeBlocks to capture the
opening tag's attributes in codeTagRegex and reuse them in the returned string
so attributes like fixed="true" are preserved. Specifically, change codeTagRegex
to capture the full opening tag or its attributes (e.g.,
/<Code\b([^>]*)>([\s\S]*?)<\/Code>/gi), adjust the replacement callback
parameters to receive that attributes capture, and return
`<Code${attributes}>\n\n${transformedContent.trimStart()}</Code>` (or
equivalent) so transformCodeBlocksWithSubheadings and findPrecedingHeadingLevel
logic remain unchanged. Ensure you handle cases where there are no attributes
(attributes string may be empty) to avoid introducing extra spaces.
- Around line 28-39: findPrecedingHeadingLevel currently scans the raw content
and its headingRegex (^(#+)\s+ with /gm) can match # lines inside fenced code
blocks; use the existing splitByCodeBlocks utility to strip out fenced code
blocks before running the regex so comments inside code fences aren't treated as
headings. Concretely, keep the initial contentBefore = content.substring(0,
position) but call splitByCodeBlocks(contentBefore) (assign to
contentWithoutCodeBlocks) and run headingRegex.exec against
contentWithoutCodeBlocks instead of contentBefore; ensure you still default
lastHeadingLevel = 3 and update lastHeadingLevel from match[1].length as before.
🧹 Nitpick comments (1)
data/onPostBuild/transpileMdxToMarkdown.ts (1)
46-62: Consider trimming the captured language identifier.
The regex `([^\n]+)` can capture trailing whitespace if a fence line has it (e.g., a `javascript` fence followed by spaces). This would propagate into the heading as `#### Javascript` with trailing whitespace. Adding a `.trim()` is a low-cost defensive measure.
Suggested tweak
   return innerContent.replace(codeBlockRegex, (_codeBlock, lang, codeContent) => {
-    const displayName = getLanguageDisplayName(lang);
+    const displayName = getLanguageDisplayName(lang.trim());
     return `${headingPrefix} ${displayName}\n\n\`\`\`\n${codeContent}\`\`\``;
   });
  it('should handle <Code> tags with attributes like fixed="true"', () => {
    const input = `<Code fixed="true">
\`\`\`javascript
const x = 1;
\`\`\`
</Code>`;
    const output = addLanguageSubheadingsToCodeBlocks(input);
    expect(output).toContain('#### Javascript');
    // Language should be removed from fenced code blocks
    expect(output).not.toContain('```javascript');
    expect(output).toContain('```\nconst x = 1;');
  });
Test doesn't verify that fixed="true" is preserved in the output.
The test is titled "should handle <Code> tags with attributes" but only asserts on the heading and language removal — it never checks whether the attribute survives the transformation. This aligns with the attribute-stripping bug flagged in the source file. Once that's fixed, add an assertion:
   const output = addLanguageSubheadingsToCodeBlocks(input);
   expect(output).toContain('#### Javascript');
   // Language should be removed from fenced code blocks
   expect(output).not.toContain('```javascript');
   expect(output).toContain('```\nconst x = 1;');
+  // Attributes on <Code> tag should be preserved
+  expect(output).toContain('<Code fixed="true">');
function findPrecedingHeadingLevel(content: string, position: number): number {
  const contentBefore = content.substring(0, position);
  const headingRegex = /^(#+)\s+/gm;
  let lastHeadingLevel = 3; // Defaults to 3 when no heading is found
  let match;

  while ((match = headingRegex.exec(contentBefore)) !== null) {
    lastHeadingLevel = match[1].length;
  }

  return lastHeadingLevel;
}
findPrecedingHeadingLevel can match # lines inside fenced code blocks.
The regex ^(#+)\s+ with the gm flag will match comment lines inside fenced code blocks that start with # (e.g., Python # some comment, shell #!/bin/bash). Since this function is called on the full document content, a # comment line in a preceding code block could be misidentified as an h1 heading, yielding an h2 subheading instead of the expected h4.
The existing splitByCodeBlocks utility could be leveraged to strip fenced code block content before scanning for headings.
Suggested approach
 function findPrecedingHeadingLevel(content: string, position: number): number {
   const contentBefore = content.substring(0, position);
-  const headingRegex = /^(#+)\s+/gm;
+  // Strip fenced code block contents so # comment lines inside them aren't misidentified as headings
+  const contentWithoutCodeBlocks = contentBefore.replace(/```[\s\S]*?```/g, '');
+  const headingRegex = /^(#+)\s+/gm;
   let lastHeadingLevel = 3; // Defaults to 3 when no heading is found
   let match;
-  while ((match = headingRegex.exec(contentBefore)) !== null) {
+  while ((match = headingRegex.exec(contentWithoutCodeBlocks)) !== null) {
     lastHeadingLevel = match[1].length;
   }
   return lastHeadingLevel;
 }
function addLanguageSubheadingsToCodeBlocks(content: string): string {
  // Match <Code> blocks with optional attributes (case-insensitive for the tag)
  // Handles both <Code> and <Code fixed="true"> etc.
  const codeTagRegex = /<Code\b[^>]*>([\s\S]*?)<\/Code>/gi;

  return content.replace(codeTagRegex, (fullMatch, innerContent: string, offset: number) => {
    // Calculate the appropriate heading level based on preceding headings
    const precedingLevel = findPrecedingHeadingLevel(content, offset);
    const headingPrefix = '#'.repeat(precedingLevel + 1);

    // Transform code blocks with subheadings
    const transformedContent = transformCodeBlocksWithSubheadings(innerContent, headingPrefix);
    if (transformedContent === null) {
      return fullMatch; // No code blocks with language - return unchanged
    }

    // Ensure proper newline after <Code> tag for markdown formatting
    return `<Code>\n\n${transformedContent.trimStart()}</Code>`;
  });
}
<Code> tag attributes (e.g., fixed="true") are silently stripped.
Line 87 always emits a bare <Code> tag, discarding any attributes from the original opening tag. For example, <Code fixed="true"> becomes <Code> after transformation. This alters the semantics of the output.
Capture the attributes in the regex and reuse them in the replacement:
Proposed fix
 function addLanguageSubheadingsToCodeBlocks(content: string): string {
-  const codeTagRegex = /<Code\b[^>]*>([\s\S]*?)<\/Code>/gi;
+  const codeTagRegex = /<Code\b([^>]*)>([\s\S]*?)<\/Code>/gi;
-  return content.replace(codeTagRegex, (fullMatch, innerContent: string, offset: number) => {
+  return content.replace(codeTagRegex, (fullMatch, attributes: string, innerContent: string, offset: number) => {
     const precedingLevel = findPrecedingHeadingLevel(content, offset);
     const headingPrefix = '#'.repeat(precedingLevel + 1);
     const transformedContent = transformCodeBlocksWithSubheadings(innerContent, headingPrefix);
     if (transformedContent === null) {
       return fullMatch;
     }
-    return `<Code>\n\n${transformedContent.trimStart()}</Code>`;
+    return `<Code${attributes}>\n\n${transformedContent.trimStart()}</Code>`;
   });
 }
Compared `old` and `new` markdown for `docs/chat/rooms/messages.md` using Claude; it said:
Old markdown:
⚠️ Relies on language identifiers in code fences (```javascript), which can be less prominent
⚠️ No explicit labels - must parse the code fence syntax
⚠️ Easier to miss - languages blend together without clear separators
✅ Cleaner, more concise
✅ Standard markdown syntax
New markdown:
✅ Explicit language headers (#### Javascript, #### React) make it immediately clear what each snippet is
✅ Hierarchical structure is unambiguous - headers act as labels
✅ Easy to extract - can search for #### pattern followed by language name
✅ Self-documenting - no need to infer from context
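To make the old-versus-new comparison concrete, the sketch below applies the fence-to-heading rewrite to a small input. `addSubheadings` is a hypothetical, simplified stand-in for the PR's `transformCodeBlocksWithSubheadings`: it takes a fixed heading prefix, capitalizes only the first letter of the language token, and omits the `<Code>` wrapper handling and dynamic heading level discussed earlier in the thread.

```typescript
// Simplified illustration only; not the PR's implementation.
function addSubheadings(inner: string, headingPrefix: string): string {
  const codeBlockRegex = /```([^\n`]+)\n([\s\S]*?)```/g;
  return inner.replace(codeBlockRegex, (_block: string, lang: string, code: string) => {
    const name = lang.trim();
    const displayName = name.charAt(0).toUpperCase() + name.slice(1);
    // Emit a heading, then the same code in a fence without a language tag.
    return `${headingPrefix} ${displayName}\n\n\`\`\`\n${code}\`\`\``;
  });
}

// Old markdown: language lives only in the fence info string.
const oldMarkdown = '```javascript\nconst x = 1;\n```\n```react\nuseX();\n```';

// New markdown: each snippet is preceded by an explicit language heading.
const newMarkdown = addSubheadings(oldMarkdown, '####');
console.log(newMarkdown);
```

The output places `#### Javascript` and `#### React` above their respective snippets and leaves the fences untagged, which is the structure the comparison above describes.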