
WIP mg-utils #1560 (Open)

PaulAdamDavis wants to merge 4 commits into main from dev-jsdom

Conversation

@PaulAdamDavis (Member) commented Feb 20, 2026

Note

Medium Risk
Large refactor of WordPress HTML/XML parsing and serialization (switching away from cheerio) can subtly change output HTML and metadata extraction; new utility package and parser changes affect core migration correctness.

Overview
Adds new @tryghost/mg-utils package providing jsdom-based DOM parsing/manipulation helpers (parseFragment, HTML5-compliant serialization, and common DOM ops) with comprehensive unit tests and build/lint config.

Updates mg-wp-api HTML processing (and users-html-to-json) to replace cheerio usage with mg-utils DOM helpers, which changes serialization behavior (e.g., HTML entities are decoded) and refactors many DOM mutations to use native DOM APIs.

Refactors mg-wp-xml to drop cheerio XML parsing in favor of fast-xml-parser, introducing helpers for array/text normalization and adjusting file reading/stats/tests accordingly; also adds Parsely author extraction in mg-wp-api author assignment logic.
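The array/text normalization helpers mentioned above can be sketched in plain JavaScript (the names `ensureArray` and `getText` match the review comments in this thread; the exact signatures are assumptions). fast-xml-parser returns a single object for a lone element but an array for repeated elements, and text nodes may arrive either as plain strings or as objects keyed by `#text`:

```javascript
// Hypothetical sketch of the normalization helpers, not the package's actual code.

// fast-xml-parser yields one object for a single <item> and an array for many;
// normalize both shapes (and missing values) to an array.
function ensureArray(value) {
    if (value === undefined || value === null) {
        return [];
    }
    return Array.isArray(value) ? value : [value];
}

// Text nodes may be plain strings, or objects like {'#text': '...'} when
// attributes are present; normalize both (and missing values) to a string.
function getText(node) {
    if (node === undefined || node === null) {
        return '';
    }
    if (typeof node === 'object') {
        return node['#text'] !== undefined ? String(node['#text']) : '';
    }
    return String(node);
}

console.log(ensureArray({title: 'one'}).length); // 1
console.log(getText({'#text': 'Hello'})); // 'Hello'
console.log(getText('plain')); // 'plain'
```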

Written by Cursor Bugbot for commit 5c43346. This will update automatically on new commits.

@coderabbitai bot (Contributor) commented Feb 20, 2026

Walkthrough

This PR introduces a new DOM utilities package (mg-utils) built on jsdom to replace the cheerio dependency across the codebase. The mg-utils package exports 14 utility functions for HTML fragment parsing, DOM manipulation, serialization, attribute handling, and element traversal. Building on this, mg-wp-api is refactored to use the new domUtils library instead of cheerio for HTML processing in the processor module. Additionally, mg-wp-xml transitions from cheerio to fast-xml-parser for XML processing, rewriting extraction logic to use explicit object access patterns. All changes include corresponding test adjustments and dependency updates in affected packages.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 inconclusive)
  • Title check: ❓ Inconclusive. The title 'WIP mg-utils' is vague and uses a generic term ('WIP') that doesn't convey meaningful information about the actual changes in the changeset. Resolution: replace 'WIP' with a more descriptive title that explains the main purpose, such as 'Add mg-utils package with DOM utilities' or 'Create DOM utility library to replace Cheerio'.
  • Description check: ❓ Inconclusive. No pull request description was provided by the author, making it impossible to assess whether it relates to the changeset. Resolution: add a description that explains the purpose of this PR, including the creation of the mg-utils package and its integration across multiple packages.

✅ Passed checks (1 passed)
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.




@coderabbitai bot left a comment

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/mg-wp-xml/lib/process.js (1)

479-481: ⚠️ Potential issue | 🟡 Minor

return new errors.NoContentError(...) silently swallows the error — should be throw.

Returning an error object from an async function resolves the Promise with that object as a success value rather than rejecting it. Callers using await receive the error object as a regular return value and will continue processing silently. The index.js consumer does await process.all(input, ctx) and returns the result directly without checking for errors, which would propagate the error object downstream. Throwing properly rejects the Promise and allows error handling up the call stack.

🐛 Proposed fix
-    return new errors.NoContentError({message: 'Input file is empty'});
+    throw new errors.NoContentError({message: 'Input file is empty'});
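The difference is easy to demonstrate in isolation. This standalone sketch (not the project's actual code) shows that an async function resolves successfully when it returns an Error object, but rejects when it throws one:

```javascript
// Minimal sketch, assuming nothing beyond plain Node.js semantics.
async function returnsError() {
    // Resolves the Promise successfully, with the Error object as the value
    return new Error('Input file is empty');
}

async function throwsError() {
    // Rejects the Promise; `await` callers land in their catch block
    throw new Error('Input file is empty');
}

async function main() {
    const result = await returnsError();
    console.log(result instanceof Error); // true (the caller keeps running, treating the Error as data)

    try {
        await throwsError();
    } catch (err) {
        console.log('caught:', err.message); // caught: Input file is empty
    }
}

main();
```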
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/mg-wp-xml/lib/process.js` around lines 479 - 481, In the async
function process.all (in packages/mg-wp-xml/lib/process.js) you must throw the
error instead of returning it: replace the `return new
errors.NoContentError({message: 'Input file is empty'})` with `throw new
errors.NoContentError(...)` so the Promise is rejected and upstream await
callers receive an exception; ensure you reference the errors.NoContentError
constructor when creating the thrown error so existing catch logic continues to
work.
🧹 Nitpick comments (7)
packages/mg-wp-xml/lib/read-xml.js (1)

4-11: parserOptions is duplicated verbatim from process.js — extract to a shared module.

The identical 7-property parserOptions object is defined in both read-xml.js (lines 4–11) and process.js (lines 12–19). If one is updated independently the two files will silently diverge, producing inconsistent parse results between contentStats and the main processing pipeline.

♻️ Proposed refactor: extract to a shared constants module

Create packages/mg-wp-xml/lib/parser-options.js:

+const parserOptions = {
+    ignoreAttributes: false,
+    attributeNamePrefix: '@_',
+    textNodeName: '#text',
+    parseTagValue: false,
+    parseAttributeValue: false,
+    trimValues: false
+};
+
+export {parserOptions};

In read-xml.js:

 import {XMLParser} from 'fast-xml-parser';
 import {readFile} from './read-file.js';
+import {parserOptions} from './parser-options.js';

-const parserOptions = {
-    ignoreAttributes: false,
-    attributeNamePrefix: '@_',
-    textNodeName: '#text',
-    parseTagValue: false,
-    parseAttributeValue: false,
-    trimValues: false
-};

Apply the same import in process.js, removing its local definition.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/mg-wp-xml/lib/read-xml.js` around lines 4 - 11, The parserOptions
object is duplicated in read-xml.js and process.js; extract it into a single
shared module (e.g., create lib/parser-options.js that exports parserOptions)
and replace the local definitions by importing that export in both read-xml.js
and process.js; ensure the exported object preserves the same seven properties
(ignoreAttributes, attributeNamePrefix, textNodeName, parseTagValue,
parseAttributeValue, trimValues) and remove the duplicate declarations so both
modules use the single source of truth.
packages/mg-utils/src/lib/dom-utils.ts (2)

123-181: Consider extracting the repeated string-to-fragment conversion into a shared helper.

replaceWith, insertBefore, and insertAfter all contain the same 4-line pattern to convert a string into a DocumentFragment via a <template> element. Extracting this reduces duplication and centralizes any future changes (e.g., sanitization).

♻️ Proposed helper extraction
+/**
+ * Convert an HTML string to a DocumentFragment
+ */
+function toFragment(ownerDocument: Document, html: string): DocumentFragment {
+    const temp = ownerDocument.createElement('template');
+    temp.innerHTML = html;
+    const fragment = ownerDocument.createDocumentFragment();
+    while (temp.content.firstChild) {
+        fragment.appendChild(temp.content.firstChild);
+    }
+    return fragment;
+}
+
 export function replaceWith(el: Element | null, content: string | Node): void {
     if (!el || !el.parentNode) {
         return;
     }
 
     if (typeof content === 'string') {
-        const temp = el.ownerDocument.createElement('template');
-        temp.innerHTML = content;
-        const fragment = el.ownerDocument.createDocumentFragment();
-        while (temp.content.firstChild) {
-            fragment.appendChild(temp.content.firstChild);
-        }
+        const fragment = toFragment(el.ownerDocument, content);
         el.parentNode.replaceChild(fragment, el);
     } else if (content && content.nodeType) {
         el.parentNode.replaceChild(content, el);
     }
 }

Apply the same pattern in insertBefore, insertAfter, and wrap.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/mg-utils/src/lib/dom-utils.ts` around lines 123 - 181, Extract the
repeated "string → DocumentFragment" logic into a single helper (e.g.,
stringToFragment or createFragmentFromString) that takes an Element (or its
ownerDocument) and the HTML string, creates a template, sets temp.innerHTML,
moves its child nodes into a DocumentFragment and returns that fragment; then
replace the duplicated blocks in replaceWith, insertBefore, and insertAfter to
call this helper when content is a string (also update wrap if present to use
it), preserving existing null/parentNode checks and Node-type handling.

223-232: Mixed getter/setter return type could benefit from overload signatures.

attr returns string when getting and undefined when setting, but callers must always handle string | undefined. TypeScript overloads would make this contract explicit and eliminate unnecessary undefined checks for callers who are only reading.

♻️ Overloaded signatures
-export function attr(el: Element | null, name: string, value?: string): string | undefined {
+export function attr(el: Element | null, name: string): string;
+export function attr(el: Element | null, name: string, value: string): void;
+export function attr(el: Element | null, name: string, value?: string): string | void {
     if (!el) {
-        return '';
+        return value !== undefined ? undefined : '';
     }
     if (value !== undefined) {
         el.setAttribute(name, value);
         return;
     }
     return el.getAttribute(name) || '';
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/mg-utils/src/lib/dom-utils.ts` around lines 223 - 232, The attr
function mixes getter and setter returns; add TypeScript overloads for attr to
make the getter return string and the setter return void (or undefined) so
callers reading attributes don't have to handle undefined. Specifically, add
overload signatures like: attr(el: Element | null, name: string): string; and
attr(el: Element | null, name: string, value: string): void; then keep a single
implementation signature (e.g., attr(el: Element | null, name: string, value?:
string): string | void) and ensure the implementation of attr (the existing
function) conforms to those overloads (return '' for missing element/getter and
return nothing for setter). Reference: function name attr in this file.
packages/mg-wp-xml/test/process.test.js (1)

6-13: parserOptions duplicated from process.js and read-xml.js — import the shared constant.

This is the third copy of the same object. Once parserOptions is exported from read-xml.js (or a shared module), this block can be removed and replaced with an import.

♻️ Proposed fix
 import {XMLParser} from 'fast-xml-parser';
+import {parserOptions} from '../lib/read-xml.js';
 import process, {processWPMeta} from '../lib/process.js';
 
-const parserOptions = {
-    ignoreAttributes: false,
-    attributeNamePrefix: '@_',
-    textNodeName: '#text',
-    parseTagValue: false,
-    parseAttributeValue: false,
-    trimValues: false
-};
-
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/mg-wp-xml/test/process.test.js` around lines 6 - 13, The test
defines a duplicate parserOptions object already present in process.js and
read-xml.js; remove the local parserOptions block from process.test.js and
instead import the shared constant exported from read-xml.js (or the new shared
module) so all modules use the single source of truth; ensure read-xml.js
exports parserOptions (e.g., export const parserOptions) and update
process.test.js to import { parserOptions } from that module.
packages/mg-wp-xml/lib/process.js (2)

414-431: processAttachment traverses wp:postmeta twice — second loop is redundant.

processWPMeta(post) (line 421) already extracts every meta key→value pair (including _wp_attachment_image_alt) into meta. The explicit loop at lines 423–431 re-iterates the same postmeta array to recover the same value.

♻️ Proposed simplification
 const processAttachment = async (post) => {
     let attachmentKey = getText(post['wp:post_id']);
     let attachmentUrl = getText(post['wp:attachment_url']) || null;
     let attachmentDesc = getText(post['content:encoded']) || null;
     let attachmentTitle = getText(post.title) || null;
-    let attachmentAlt = null;
 
     let meta = await processWPMeta(post);
-
-    const postMetas = ensureArray(post['wp:postmeta']);
-    for (const row of postMetas) {
-        const metaKey = getText(row['wp:meta_key']);
-        const metaVal = getText(row['wp:meta_value']);
-
-        if (metaKey === '_wp_attachment_image_alt') {
-            attachmentAlt = metaVal;
-        }
-    }
+    let attachmentAlt = meta['_wp_attachment_image_alt'] || null;
 
     return {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/mg-wp-xml/lib/process.js` around lines 414 - 431, processAttachment
currently calls processWPMeta(post) which already returns a map of post meta,
then redundantly iterates ensureArray(post['wp:postmeta']) to find
'_wp_attachment_image_alt'; remove that second loop and instead read
attachmentAlt directly from the meta object returned by processWPMeta (e.g.
attachmentAlt = meta['_wp_attachment_image_alt'] || null) inside
processAttachment, keeping the existing variables attachmentKey, attachmentUrl,
attachmentDesc and attachmentTitle intact and deleting the ensureArray/postmeta
loop and its local metaKey/metaVal logic.

12-20: parserOptions is duplicated across three files — extract to a shared constant.

The identical parserOptions object is defined in read-xml.js, re-defined here (lines 12–20), and again in the test file. Any future option change must be applied in all three places. Export it from read-xml.js (or a dedicated config) and import it here and in the test.

♻️ Suggested refactor

In read-xml.js, export the constant:

-const parserOptions = {
+export const parserOptions = {
     ignoreAttributes: false,
     ...
 };

In process.js:

-import {XMLParser} from 'fast-xml-parser';
+import {XMLParser} from 'fast-xml-parser';
+import {parserOptions} from './read-xml.js';
 
-// XML Parser configuration
-const parserOptions = {
-    ignoreAttributes: false,
-    attributeNamePrefix: '@_',
-    textNodeName: '#text',
-    parseTagValue: false,
-    parseAttributeValue: false,
-    trimValues: false
-};
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/mg-wp-xml/lib/process.js` around lines 12 - 20, The parserOptions
object is duplicated; extract it into a single exported constant (parserOptions)
in read-xml.js (or a new shared config module), replace the inline parserOptions
in process.js and the test with an import/require of that exported
parserOptions, and remove the local definitions so all files reference the same
shared constant; ensure import style matches the repo (CommonJS require or ES
import) and update any tests that referenced the old local variable to use the
imported parserOptions.
packages/mg-wp-api/lib/processor.js (1)

1066-1092: Dead code: srcset is already stripped from all img elements before the img.full.size-full check.

The first loop (lines 1066–1072) calls img.removeAttribute('srcset') on every img. By the time the second loop (lines 1076–1092) runs, getAttribute('srcset') will always return null, making the inner if block (lines 1083–1088) unreachable.

♻️ Proposed fix
     for (const img of parsed.$('img.full.size-full')) {
         const parent = img.parentElement;
         const parentTag = parent ? parent.tagName.toLowerCase() : '';
         if (parentTag !== 'figure' && parentTag !== 'a') {
             img.classList.add('kg-image');
-
-            if (img.getAttribute('srcset')) {
-                img.removeAttribute('width');
-                img.removeAttribute('height');
-                img.removeAttribute('srcset');
-                img.removeAttribute('sizes');
-            }
 
             wrap(img, '<figure class="kg-card kg-image-card kg-width-wide"></figure>');
         }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/mg-wp-api/lib/processor.js` around lines 1066 - 1092, The initial
loop unconditionally strips srcset/sizes from every <img> (parsed.$('img'))
which makes the later check for img.getAttribute('srcset') in the full-size
handler (parsed.$('img.full.size-full')) dead code; fix by moving/removing the
unconditional img.removeAttribute('srcset') and img.removeAttribute('sizes')
from the generic loop and instead only remove those attributes where intended
(inside the full-size handling block before wrap or in any other specific
branch), so parsed.$('img') still runs largerSrc(imageSrc) and sets src but does
not strip srcset/sizes globally; update references to the generic loop and the
full-size loop (parsed.$('img') and parsed.$('img.full.size-full')) and keep
wrap(...) and largerSrc(...) behavior intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/mg-utils/package.json`:
- Around line 22-33: Update the devDependency for `@types/jsdom` from 21.1.7 to
the exact version 27.0.0 in package.json to match TypeScript 5.8.3 and jsdom
26.0.0; edit the entry "@types/jsdom": "21.1.7" -> "@types/jsdom": "27.0.0" and
run your install step to refresh lockfile so type definitions align with
TypeScript and jsdom.

In `@packages/mg-utils/README.md`:
- Around line 5-13: The README install instructions currently suggest npm and a
plain yarn add; update the text to prefer yarn v1 only and use exact version
pinning by replacing the npm example and the plain yarn example with a single
yarn command using the -E flag (e.g., "yarn add -E `@tryghost/mg-utils`") so the
README's Install section reflects the project's policy to always use yarn (v1)
and exact versions.

In `@packages/mg-wp-api/lib/processor.js`:
- Around line 638-646: In the libsyn iframe processing callback (the map
assigned to libsynPodcasts), guard against a missing match by null-checking the
result of iframeSrc.match(libsynIdRegex) before accessing matches[1]; if
iframeSrc is falsy or matches is null, bail out early (return/continue) to avoid
a TypeError, then only extract showId and proceed with the existing logic that
uses showId; apply this check inside the anonymous async function used by
parsed.$('iframe[src*="libsyn.com/embed/"]').map so allowRemoteScraping behavior
and downstream code remain unchanged.

In `@packages/mg-wp-api/utils/users-html-to-json.js`:
- Around line 44-45: The code calls getAttribute('src') and immediately .trim()
which throws if getAttribute returns null; update the logic around
usernameImg/usernameCell (variables usernameImg and image) to safely handle a
missing src by first retrieving the rawSrc (e.g., const rawSrc = usernameImg ?
usernameImg.getAttribute('src') : null) and then only calling .trim() and
.replace('s=64','s=500') when rawSrc is a non-empty string; otherwise set image
to an empty string — ensure the final assignment for image uses a
guarded/conditional check or optional chaining with a fallback.

In `@packages/mg-wp-xml/lib/read-file.js`:
- Line 43: The function currently returns the local variable output without
assignment when detectType(...) yields a value other than 'folder' or 'file',
causing an unexpected undefined; update the logic in the function (the code
around detectType and the output variable in read-file.js) to add a guard clause
that throws a descriptive Error for unexpected types (e.g., include the actual
type returned by detectType) before the final return, or alternatively ensure
output is assigned a sane default for all branches so return output never yields
undefined.
- Around line 24-26: readFolder/readFileOrFolder currently concatenates raw XML
files (out.join('')) producing multiple XML declarations and root <rss>
elements; change readFolder to produce a single valid RSS envelope by extracting
each file's <channel> inner content (strip the <?xml ...?> and outer <rss>
wrapper/attributes) and concatenating channel children (especially <item>,
<wp:...> entries), then wrap the combined children inside one <rss
...><channel>...</channel></rss> before returning, or alternatively parse each
file separately in readFolder (using the same XML parser used by
process.all/XMLParser.parse), merge the resulting objects' channel/item arrays,
and return the serialized merged XML/object; update readFileOrFolder to use this
merged output so XMLParser.parse no longer sees multiple root documents.

In `@packages/mg-wp-xml/lib/read-xml.js`:
- Around line 19-27: The loop that counts posts/pages uses item['wp:post_type']
directly, but because fast-xml-parser may preserve surrounding whitespace
(trimValues: false), comparisons like === 'post' or === 'page' can miss matches;
update the code in the for-loop (where itemsArray is iterated and postType is
read) to normalize postType by calling String(item['wp:post_type'] || '').trim()
(or equivalent) before comparing, then use that trimmed value when incrementing
postsCount and pagesCount.


Comment on lines +22 to +33
"@types/jsdom": "21.1.7",
"@types/node": "22.15.21",
"@typescript-eslint/eslint-plugin": "8.32.1",
"@typescript-eslint/parser": "8.32.1",
"c8": "10.1.3",
"eslint": "8.57.0",
"eslint-plugin-ghost": "3.4.4",
"typescript": "5.8.3"
},
"dependencies": {
"jsdom": "26.0.0"
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check jsdom versions on npm
echo "=== Latest jsdom versions ==="
curl -s https://registry.npmjs.org/jsdom | jq -r '.versions | keys | .[-5:]' | head -20

echo ""
echo "=== Check jsdom 26.0.0 specifically ==="
curl -s https://registry.npmjs.org/jsdom/26.0.0 | jq '{version: .version, types: (.types // .typings // "none"), main: .main}' 2>/dev/null || echo "jsdom 26.0.0 not found"

echo ""
echo "=== Latest @types/jsdom versions ==="
curl -s https://registry.npmjs.org/@types/jsdom | jq -r '.versions | keys | .[-5:]' | head -20

echo ""
echo "=== Check @types/jsdom compatibility info ==="
curl -s https://registry.npmjs.org/@types/jsdom | jq '.["dist-tags"]'

Repository: TryGhost/migrate

Length of output: 1303


Update @types/jsdom to 27.0.0 for TypeScript 5.8.3 compatibility.

The package.json specifies TypeScript 5.8.3, but @types/jsdom is pinned to 21.1.7. The npm registry indicates that for TypeScript 5.8, @types/jsdom should be 27.0.0. This version mismatch will cause type inaccuracies, as jsdom 26.0.0 does not ship bundled types and relies on external type definitions. Update to @types/jsdom@27.0.0 (exact version).


Comment on lines +5 to +13
## Install

To use this package in your own project:

`npm install @tryghost/mg-utils --save`

or

`yarn add @tryghost/mg-utils`

⚠️ Potential issue | 🟡 Minor

Install instructions should prefer yarn with exact version pinning.

Line 9 recommends npm install first, but the project uses yarn exclusively. Line 13 is missing the -E flag for exact versions.

📝 Suggested fix
 ## Install
 
 To use this package in your own project:
 
-`npm install @tryghost/mg-utils --save`
-
-or
-
-`yarn add @tryghost/mg-utils`
+`yarn add -E @tryghost/mg-utils`

As per coding guidelines, "Always use yarn (v1) for package management - not npm. Use yarn add -E for exact versions".


Comment on lines +638 to 646
     let libsynPodcasts = parsed.$('iframe[src*="libsyn.com/embed/"]').map(async (el) => {
         if (!allowRemoteScraping) {
             return;
         }

-        let iframeSrc = $html(el).attr('src');
+        let iframeSrc = el.getAttribute('src');
         let libsynIdRegex = new RegExp('/id/([0-9]{1,})/');
         let matches = iframeSrc.match(libsynIdRegex);
         let showId = matches[1];

⚠️ Potential issue | 🟡 Minor

matches is never null-checked before accessing matches[1].

If iframeSrc.match(libsynIdRegex) returns null (URL does not contain /id/<number>/), the subsequent matches[1] throws a TypeError. The selector iframe[src*="libsyn.com/embed/"] ensures the host is Libsyn, but not that the path always contains /id/.

🐛 Proposed fix
         let matches = iframeSrc.match(libsynIdRegex);
+        if (!matches) {
+            return;
+        }
         let showId = matches[1];
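For reference, a standalone sketch of the failure mode. The embed URLs below are hypothetical examples, not taken from real migration data:

```javascript
// String.prototype.match returns null when the pattern is absent, so an
// unguarded matches[1] throws a TypeError. Both URLs are illustrative only.
const libsynIdRegex = new RegExp('/id/([0-9]{1,})/');

const withId = 'https://html5-player.libsyn.com/embed/episode/id/12345/'.match(libsynIdRegex);
console.log(withId ? withId[1] : 'no match'); // 12345

const withoutId = 'https://html5-player.libsyn.com/embed/destination/'.match(libsynIdRegex);
console.log(withoutId ? withoutId[1] : 'no match'); // no match
```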
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/mg-wp-api/lib/processor.js` around lines 638 - 646, In the libsyn
iframe processing callback (the map assigned to libsynPodcasts), guard against a
missing match by null-checking the result of iframeSrc.match(libsynIdRegex)
before accessing matches[1]; if iframeSrc is falsy or matches is null, bail out
early (return/continue) to avoid a TypeError, then only extract showId and
proceed with the existing logic that uses showId; apply this check inside the
anonymous async function used by
parsed.$('iframe[src*="libsyn.com/embed/"]').map so allowRemoteScraping behavior
and downstream code remain unchanged.

Comment on lines +44 to +45
const usernameImg = usernameCell ? usernameCell.querySelector('img') : null;
let image = usernameImg ? usernameImg.getAttribute('src').trim().replace('s=64', 's=500') : '';
⚠️ Potential issue | 🔴 Critical

getAttribute('src') returns null when the attribute is absent — .trim() will throw a TypeError.

Unlike Cheerio's .attr() which returns undefined, the DOM getAttribute() method returns null for missing attributes. If any <img> in the parsed table lacks a src attribute, null.trim() throws at runtime.

🐛 Proposed fix
-    let image = usernameImg ? usernameImg.getAttribute('src').trim().replace('s=64', 's=500') : '';
+    let image = usernameImg ? (usernameImg.getAttribute('src') || '').trim().replace('s=64', 's=500') : '';
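A minimal sketch of the difference, using a stub object in place of a real DOM element (no jsdom required):

```javascript
// Stub standing in for an <img> element with no src attribute. Per the DOM
// spec, getAttribute() returns null (not undefined) for a missing attribute,
// so calling .trim() directly on the result would throw a TypeError.
const usernameImg = {
    getAttribute() {
        return null; // attribute absent
    }
};

// Guarded version, as in the proposed fix: fall back to '' before chaining.
const image = usernameImg ? (usernameImg.getAttribute('src') || '').trim().replace('s=64', 's=500') : '';
console.log(JSON.stringify(image)); // ""
```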
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/mg-wp-api/utils/users-html-to-json.js` around lines 44 - 45, The
code calls getAttribute('src') and immediately .trim() which throws if
getAttribute returns null; update the logic around usernameImg/usernameCell
(variables usernameImg and image) to safely handle a missing src by first
retrieving the rawSrc (e.g., const rawSrc = usernameImg ?
usernameImg.getAttribute('src') : null) and then only calling .trim() and
.replace('s=64','s=500') when rawSrc is a non-empty string; otherwise set image
to an empty string — ensure the final assignment for image uses a
guarded/conditional check or optional chaining with a fallback.

Comment on lines +24 to +26
// When reading multiple files, we need to merge them
// Extract channel content from each and combine
return out.join('');
⚠️ Potential issue | 🟠 Major


Multi-file folder processing is broken: concatenating raw XML produces invalid XML

The comment ("Extract channel content from each and combine") is a leftover from removed cheerio logic—no extraction happens, just out.join('').

More critically, this concatenation is fundamentally broken. Each WordPress WXR file contains a complete XML document with its own <?xml version="1.0"?> declaration and root <rss> element. Joining them raw produces a byte stream with multiple XML declarations and multiple root elements—malformed XML that any standard parser (including fast-xml-parser) rejects with "junk after document element".

The README documents this as a supported feature: "Path to XML file, or folder of XML files". When a user provides a folder path:

  1. index.js calls readFileOrFolder(pathToFile)
  2. readFolder() returns concatenated malformed XML
  3. process.all() calls XMLParser.parse() on it
  4. Parsing fails, breaking the entire migration for folder inputs

This gap is not caught by tests—read-file.test.js only regex-matches the raw string (counting tags), not validating XML; process.test.js has no folder tests.

Fix: Extract and merge channel content from each file into a single valid RSS envelope, or parse files individually and merge at the object level.
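One possible shape for the string-level merge, sketched here with a simplified envelope. mergeWxrFiles is a hypothetical helper; the real prepForOutput wrapper also carries the WXR namespace declarations, which this sketch omits:

```javascript
// Pull the inner <channel> content out of each WXR document and re-wrap the
// combined result in a single RSS envelope, so the parser sees one XML
// declaration and one root element.
const mergeWxrFiles = (xmlStrings) => {
    const channelInner = xmlStrings.map((xml) => {
        const match = xml.match(/<channel>([\s\S]*?)<\/channel>/);
        return match ? match[1] : '';
    }).join('');
    return `<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"><channel>${channelInner}</channel></rss>`;
};

const fileA = '<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"><channel><item>First</item></channel></rss>';
const fileB = '<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"><channel><item>Second</item></channel></rss>';

const merged = mergeWxrFiles([fileA, fileB]);
console.log((merged.match(/<\?xml/g) || []).length); // 1
console.log((merged.match(/<item>/g) || []).length); // 2
```

Regex-based extraction is fragile for CDATA-heavy real exports, so parsing each file separately and merging the item arrays at the object level is the more robust of the two options.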

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/mg-wp-xml/lib/read-file.js` around lines 24 - 26,
readFolder/readFileOrFolder currently concatenates raw XML files (out.join(''))
producing multiple XML declarations and root <rss> elements; change readFolder
to produce a single valid RSS envelope by extracting each file's <channel> inner
content (strip the <?xml ...?> and outer <rss> wrapper/attributes) and
concatenating channel children (especially <item>, <wp:...> entries), then wrap
the combined children inside one <rss ...><channel>...</channel></rss> before
returning, or alternatively parse each file separately in readFolder (using the
same XML parser used by process.all/XMLParser.parse), merge the resulting
objects' channel/item arrays, and return the serialized merged XML/object;
update readFileOrFolder to use this merged output so XMLParser.parse no longer
sees multiple root documents.

}

-    return prepForOutput(output);
+    return output;
⚠️ Potential issue | 🟡 Minor

Silent undefined return when type is neither 'folder' nor 'file'

output is declared but never assigned when detectType returns any value other than 'folder' or 'file'. The function silently returns undefined instead of throwing a descriptive error, which will surface as a confusing downstream failure.

🛡️ Proposed fix – add a guard clause
     if (type === 'folder') {
         output = await readFolder(path);
     } else if (type === 'file') {
         output = await readFile(path);
+    } else {
+        throw new Error(`Unsupported path type '${type}' for path: ${path}`);
     }
 
     return output;
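A self-contained sketch of why the guard matters. detectType is replaced here by a plain type parameter, and the function name is illustrative only:

```javascript
// Without the else branch, a type of e.g. 'symlink' would fall through every
// branch and the function would quietly resolve to undefined.
const readFileOrFolderSketch = async (path, type) => {
    let output;
    if (type === 'folder') {
        output = `folder contents of ${path}`;
    } else if (type === 'file') {
        output = `file contents of ${path}`;
    } else {
        throw new Error(`Unsupported path type '${type}' for path: ${path}`);
    }
    return output;
};

readFileOrFolderSketch('/tmp/example.xml', 'symlink')
    .then(() => console.log('resolved'))
    .catch((err) => console.log(err.message)); // Unsupported path type 'symlink' for path: /tmp/example.xml
```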
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/mg-wp-xml/lib/read-file.js` at line 43, The function currently
returns the local variable output without assignment when detectType(...) yields
a value other than 'folder' or 'file', causing an unexpected undefined; update
the logic in the function (the code around detectType and the output variable in
read-file.js) to add a guard clause that throws a descriptive Error for
unexpected types (e.g., include the actual type returned by detectType) before
the final return, or alternatively ensure output is assigned a sane default for
all branches so return output never yields undefined.

Comment on lines +19 to +27
const items = xml?.rss?.channel?.item || [];
// Ensure items is always an array (single item becomes object)
const itemsArray = Array.isArray(items) ? items : [items];

let postsOutput = [];
let pagesOutput = [];
let postsCount = 0;
let pagesCount = 0;

-    $xml('item').each((i, post) => {
-        const postType = $xml(post).children('wp\\:post_type').text();
-        const postLink = $xml(post).children('link').text();
+    for (const item of itemsArray) {
+        const postType = item['wp:post_type'];

⚠️ Potential issue | 🟡 Minor

trimValues: false risks silent miscount if <wp:post_type> values contain surrounding whitespace.

trimValues defaults to true in fast-xml-parser. With it explicitly set to false, item['wp:post_type'] will preserve any surrounding whitespace (e.g. '\npost\n'), causing the strict === 'post' / === 'page' comparisons to silently miss items and undercount.

WordPress exports are generally clean, but defensively calling .trim() before the comparison eliminates this edge case without cost.

🛡️ Proposed defensive fix
-        const postType = item['wp:post_type'];
+        const postType = item['wp:post_type']?.trim();
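The miscount is easy to reproduce with a hand-built item object (hypothetical data, standing in for what the parser would return):

```javascript
// With trimValues: false, a parsed value can carry surrounding whitespace,
// which a strict equality check silently misses.
const item = {'wp:post_type': '\npost\n'}; // hypothetical parsed WXR item

console.log(item['wp:post_type'] === 'post');         // false: item would be skipped
console.log(item['wp:post_type']?.trim() === 'post'); // true: item is counted
```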
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/mg-wp-xml/lib/read-xml.js` around lines 19 - 27, The loop that
counts posts/pages uses item['wp:post_type'] directly, but because
fast-xml-parser may preserve surrounding whitespace (trimValues: false),
comparisons like === 'post' or === 'page' can miss matches; update the code in
the for-loop (where itemsArray is iterated and postType is read) to normalize
postType by calling String(item['wp:post_type'] || '').trim() (or equivalent)
before comparing, then use that trimmed value when incrementing postsCount and
pagesCount.

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 3 potential issues.


// Text node
if (node.nodeType === 3) {
/* c8 ignore next -- defensive fallback for null textContent */
return node.textContent || '';

Text nodes not HTML-escaped during serialization

High Severity

serializeNode returns raw textContent for text nodes without HTML-encoding special characters like &, <, and >. JSDOM decodes HTML entities during parsing (e.g., &amp; becomes &, and &lt; becomes <), but this serializer never re-encodes them. This causes parsed.html() to produce invalid/corrupt HTML. For example, <p>A &amp; B</p> would round-trip to <p>A & B</p>, and <p>5 &lt; 10</p> would become <p>5 < 10</p>, which breaks HTML structure. An escapeAttr function exists for attributes, but no equivalent escaping is applied to text content.
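A minimal counterpart to the existing escapeAttr helper could look like this. escapeText is a hypothetical name, and a real serializer may need extra handling (e.g. not escaping inside <script> and <style> elements):

```javascript
// Re-encode the characters that jsdom decoded during parsing so text nodes
// round-trip to valid HTML. Order matters: '&' must be escaped first, or the
// '&' introduced by the other replacements would be double-escaped.
const escapeText = (text) => {
    return text
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;');
};

console.log(escapeText('A & B'));  // A &amp; B
console.log(escapeText('5 < 10')); // 5 &lt; 10
```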


-    return `<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:wp="http://wordpress.org/export/1.2/"><channel>${input}</channel></rss>`;
+    // When reading multiple files, we need to merge them
+    // Extract channel content from each and combine
+    return out.join('');

Folder reading produces invalid concatenated XML

High Severity

readFolder concatenates raw XML file contents, producing invalid XML when multiple files exist. Each file is a complete XML document with <?xml>, <rss>, and <channel> wrappers. The old code extracted channel content via cheerio and re-wrapped it, but the new code removed that extraction (despite the comment saying "Extract channel content from each and combine"). The fast-xml-parser in all() expects a single valid XML document, so parsing this concatenated string will fail or silently drop data.


if (wpPost?.parsely?.meta?.author && wpPost.parsely.meta.author.length > 0) {
wpPost.parsely.meta.author.forEach((author) => {
post.data.author = processAuthor({name: author.name});
});

forEach loop overwrites author with last value only

Medium Severity

The forEach loop over wpPost.parsely.meta.author repeatedly overwrites post.data.author with each iteration, so only the last author survives. When multiple parsely authors exist, all but the final one are silently discarded. Contrast this with the coAuthors branch below it, which properly uses post.data.authors (array) for multiple authors.
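A small repro of the overwrite, with processAuthor stubbed and illustrative author names:

```javascript
// Each forEach iteration replaces post.data.author, so only the final Parsely
// author survives. Mapping into post.data.authors keeps all of them, matching
// what the coAuthors branch does.
const processAuthor = ({name}) => ({name});
const parselyAuthors = [{name: 'Ada'}, {name: 'Grace'}];

const post = {data: {}};
parselyAuthors.forEach((author) => {
    post.data.author = processAuthor({name: author.name});
});
console.log(post.data.author.name); // Grace (Ada was discarded)

post.data.authors = parselyAuthors.map((author) => processAuthor({name: author.name}));
console.log(post.data.authors.map(a => a.name).join(', ')); // Ada, Grace
```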

