Skip to content

Add kb_reference sheet and KB-only analysis mode#257

Merged
soimkim merged 1 commit intomainfrom
slow
Mar 12, 2026
Merged

Add kb_reference sheet and KB-only analysis mode#257
soimkim merged 1 commit intomainfrom
slow

Conversation

@soimkim
Copy link
Contributor

@soimkim soimkim commented Mar 12, 2026

Description

  • New Features

    • Reports now include KB reference data and inject KB references into external report sheets when available.
    • CLI now defaults to running all scanners.
  • Refactor

    • Improved KB server reachability with automatic retries and better transient error handling.

@soimkim soimkim marked this pull request as draft March 12, 2026 04:52
@coderabbitai
Copy link

coderabbitai bot commented Mar 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds an ALL_MODE scanner constant and propagates it through scanner selection, KB activation, merging, and reporting; implements KB reachability retries and HTTPError handling; introduces KB reference assembly/injection into reports; extends merge_results with excluded_files; adds kb_origin_url and kb_evidence to SourceItem.

Changes

Cohort / File(s) Summary
CLI / Scanner Flow & Reporting
src/fosslight_source/cli.py
Add ALL_MODE and KB_REFERENCE_HEADER; update SCANNER_TYPE; default scanner now ALL_MODE; adjust run_scanners and display-mode handling; include KB reference data in report sheet lists and inject into external sheets when KB is active.
KB Reachability & Merge Changes
src/fosslight_source/cli.py
check_kb_server_reachable now retries up to 3 times with 1s backoff and treats HTTPError as reachable; merge_results gains excluded_files: set = None and can add KB-derived items for files absent from scancode results; get_kb_reference_to_print added.
Source Item KB fields & lookup
src/fosslight_source/_scan_item.py
Add kb_origin_url and kb_evidence attributes to SourceItem; _get_origin_url_from_md5_hash docstring updated; set_oss_item sets KB fields and uses origin_url for subsequent OSS info extraction when available.

Sequence Diagram(s)

mermaid
sequenceDiagram
rect rgba(60,120,200,0.5)
Client->>CLI: start scan (selected_scanner = ALL_MODE)
end
rect rgba(40,160,100,0.5)
CLI->>ScannerManager: run_scanners(selected_scanner)
ScannerManager->>Scanners: perform scans (source, oss; conditional KB)
Scanners-->>ScannerManager: scan results
end
rect rgba(200,120,60,0.5)
ScannerManager->>KBService: check_kb_server_reachable() (retries on transient errors)
KBService-->>ScannerManager: reachable / responses
ScannerManager->>MergeService: merge_results(scancode_result, excluded_files)
MergeService->>KBService: query KB for missing files (md5)
KBService-->>MergeService: origin_url, evidence
MergeService->>ReportGenerator: create_report_file(merged_results, kb_reference)
ReportGenerator->>Storage: write report with KB reference sheet
Storage-->>Client: report ready
end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 45.45% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: addition of KB reference sheet functionality and KB-only analysis mode, which are clearly reflected in the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch slow

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@soimkim soimkim self-assigned this Mar 12, 2026
@soimkim soimkim added the chore [PR/Issue] Refactoring, maintenance the code label Mar 12, 2026
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/fosslight_source/cli.py (1)

265-287: Unreachable code at line 287.

The return False at line 287 is dead code. All control flow paths within the loop either return True (on success or HTTPError) or return False inside the except blocks when attempt == 2. The loop can never complete naturally without returning.

🧹 Remove unreachable return statement
         except Exception as e:
             logger.debug(f"Unexpected error checking KB server: {e}")
             if attempt < 2:
                 time.sleep(1)
             else:
                 return False
-    return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/fosslight_source/cli.py` around lines 265 - 287, The final "return False"
in check_kb_server_reachable is unreachable because the except blocks return on
the last attempt; change those inner "else: return False" branches to "break"
(or set a flag) so the loop can exit naturally, then keep a single "return
False" after the loop to handle failure; update references within
check_kb_server_reachable (the except blocks handling urllib.error.URLError and
generic Exception, and the attempt retry logic) accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/fosslight_source/cli.py`:
- Around line 265-287: The final "return False" in check_kb_server_reachable is
unreachable because the except blocks return on the last attempt; change those
inner "else: return False" branches to "break" (or set a flag) so the loop can
exit naturally, then keep a single "return False" after the loop to handle
failure; update references within check_kb_server_reachable (the except blocks
handling urllib.error.URLError and generic Exception, and the attempt retry
logic) accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ce86a3dc-c565-4d15-b7bb-0940d95597b4

📥 Commits

Reviewing files that changed from the base of the PR and between d13d337 and 6801ebd.

📒 Files selected for processing (1)
  • src/fosslight_source/cli.py

@soimkim soimkim marked this pull request as ready for review March 12, 2026 05:59
@soimkim soimkim changed the title Try to ping kb server 3 times Add kb_reference sheet and KB-only analysis mode Mar 12, 2026
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/fosslight_source/cli.py (1)

229-241: ⚠️ Potential issue | 🟠 Major

kb_reference is incorrectly gated by -m.

The new sheet is built only under if need_license, but need_license is tied to matched-text output. A default --scanner kb / --scanner all run without -m will never emit the KB reference sheet even when merged_result contains KB URLs.

Proposed fix
         if need_license:
             if selected_scanner == 'scancode':
                 sheet_list["scancode_reference"] = get_license_list_to_print(license_list)
             elif selected_scanner == 'scanoss':
                 sheet_list["scanoss_reference"] = get_scanoss_extra_info(scanoss_result)
-            elif selected_scanner == 'kb':
-                kb_ref = get_kb_reference_to_print(merged_result)
-                sheet_list["kb_reference"] = kb_ref
             else:
                 sheet_list["scancode_reference"] = get_license_list_to_print(license_list)
                 sheet_list["scanoss_reference"] = get_scanoss_extra_info(scanoss_result)
-                kb_ref = get_kb_reference_to_print(merged_result)
-                sheet_list["kb_reference"] = kb_ref
+
+        if selected_scanner in ['kb', ALL_MODE]:
+            sheet_list["kb_reference"] = get_kb_reference_to_print(merged_result)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/fosslight_source/cli.py` around lines 229 - 241, The KB reference sheet
is only being created inside the need_license branch, so runs with --scanner kb
or --scanner all without -m never get kb_reference even when merged_result has
KB URLs; update the logic around selected_scanner so that
get_kb_reference_to_print(merged_result) is called and
sheet_list["kb_reference"] is set whenever selected_scanner == 'kb' or
selected_scanner == 'all' (or the default branch for all scanners), independent
of need_license, while leaving the existing license/scanoss logic under
need_license unchanged.
🧹 Nitpick comments (1)
src/fosslight_source/cli.py (1)

311-315: Avoid mutable defaults in merge_results().

The function declares mutable defaults ([] and {}), which is a Python anti-pattern. Although the current call site at line 469 explicitly passes all arguments, these defaults should be replaced with None to follow best practices and prevent accidental state bleeding if the function is called differently in the future.

Proposed fix
def merge_results(
-    scancode_result: list = [], scanoss_result: list = [], spdx_downloads: dict = {},
+    scancode_result: list = None, scanoss_result: list = None, spdx_downloads: dict = None,
-    path_to_scan: str = "", run_kb: bool = False, manifest_licenses: dict = {},
+    path_to_scan: str = "", run_kb: bool = False, manifest_licenses: dict = None,
     excluded_files: set = None
 ) -> list:
+    if scancode_result is None:
+        scancode_result = []
+    if scanoss_result is None:
+        scanoss_result = []
+    if spdx_downloads is None:
+        spdx_downloads = {}
+    if manifest_licenses is None:
+        manifest_licenses = {}
     if excluded_files is None:
         excluded_files = set()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/fosslight_source/cli.py` around lines 311 - 315, Replace the mutable
default arguments in merge_results (scancode_result, scanoss_result,
spdx_downloads, manifest_licenses) with None and initialize them inside the
function (e.g., if scancode_result is None: scancode_result = []) to avoid
shared-state bugs; keep path_to_scan and run_kb as-is and preserve the
excluded_files default if intended, but ensure all mutable parameters are
reliably initialized inside merge_results before use.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/fosslight_source/_scan_item.py`:
- Around line 187-190: The code currently overwrites a missing API score by
setting kb_match_pct to "100"; instead, preserve the API's original response by
assigning kb_match_pct directly from _get_origin_url_from_md5_hash without
defaulting to "100". Update the branch that sets self.kb_origin_url and
self.kb_match_pct (references: _get_origin_url_from_md5_hash, kb_origin_url,
kb_match_pct) so that kb_match_pct is set to match_pct as-is (allowing
empty/None) or left unset, rather than substituting "100".

In `@src/fosslight_source/cli.py`:
- Around line 212-214: The cover sheet is using a fresh
check_kb_server_reachable() call instead of the actual decision made earlier (so
it can mismatch merge_results()); stop re-pinging and instead thread the boolean
that indicates whether KB lookup actually ran (e.g., kb_used) from the code that
resolved KB reachability and call site that invokes create_report_file(), then
inside create_report_file() use that kb_used value to set
scan_item.set_cover_comment("KB Enabled" if kb_used else "KB Unreachable")
rather than calling check_kb_server_reachable(); update the call sites of
create_report_file() to accept the new kb_used parameter and remove the
redundant reachability call near run_kb/scan_item.set_cover_comment.

---

Outside diff comments:
In `@src/fosslight_source/cli.py`:
- Around line 229-241: The KB reference sheet is only being created inside the
need_license branch, so runs with --scanner kb or --scanner all without -m never
get kb_reference even when merged_result has KB URLs; update the logic around
selected_scanner so that get_kb_reference_to_print(merged_result) is called and
sheet_list["kb_reference"] is set whenever selected_scanner == 'kb' or
selected_scanner == 'all' (or the default branch for all scanners), independent
of need_license, while leaving the existing license/scanoss logic under
need_license unchanged.

---

Nitpick comments:
In `@src/fosslight_source/cli.py`:
- Around line 311-315: Replace the mutable default arguments in merge_results
(scancode_result, scanoss_result, spdx_downloads, manifest_licenses) with None
and initialize them inside the function (e.g., if scancode_result is None:
scancode_result = []) to avoid shared-state bugs; keep path_to_scan and run_kb
as-is and preserve the excluded_files default if intended, but ensure all
mutable parameters are reliably initialized inside merge_results before use.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b8092cdf-4bff-4aa8-9f03-9ca3d11ddc93

📥 Commits

Reviewing files that changed from the base of the PR and between 6801ebd and 1a24382.

📒 Files selected for processing (2)
  • src/fosslight_source/_scan_item.py
  • src/fosslight_source/cli.py

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (3)
src/fosslight_source/cli.py (3)

318-322: Avoid mutable default arguments.

Per static analysis (B006), using mutable defaults like [] and {} can lead to unexpected behavior if the defaults are modified. While the current code appears safe, this is a maintenance hazard.

Proposed fix
 def merge_results(
-    scancode_result: list = [], scanoss_result: list = [], spdx_downloads: dict = {},
-    path_to_scan: str = "", run_kb: bool = False, manifest_licenses: dict = {},
-    excluded_files: set = None
+    scancode_result: list = None, scanoss_result: list = None, spdx_downloads: dict = None,
+    path_to_scan: str = "", run_kb: bool = False, manifest_licenses: dict = None,
+    excluded_files: set = None
 ) -> list:
+    if scancode_result is None:
+        scancode_result = []
+    if scanoss_result is None:
+        scanoss_result = []
+    if spdx_downloads is None:
+        spdx_downloads = {}
+    if manifest_licenses is None:
+        manifest_licenses = {}
     if excluded_files is None:
         excluded_files = set()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/fosslight_source/cli.py` around lines 318 - 322, The function
merge_results currently uses mutable default arguments (scancode_result,
scanoss_result, spdx_downloads, manifest_licenses) which can cause state
leakage; change their signatures to use None defaults and inside merge_results
initialize them to empty structures if None (e.g., if scancode_result is None:
scancode_result = []; same for scanoss_result and manifest_licenses with dicts
for spdx_downloads and manifest_licenses), leaving excluded_files as None or
converting to an empty set inside if needed; update references to these
variables in merge_results accordingly to avoid mutating shared defaults.

400-400: Mutable default argument path_to_exclude=[].

Same concern as merge_results() - mutable default arguments should be avoided.

Proposed fix
-    selected_scanner: str = ALL_MODE, path_to_exclude: list = [],
+    selected_scanner: str = ALL_MODE, path_to_exclude: list = None,

Then add at the start of the function:

if path_to_exclude is None:
    path_to_exclude = []
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/fosslight_source/cli.py` at line 400, The function signature currently
uses a mutable default for path_to_exclude (path_to_exclude: list = []); change
the default to None (e.g., path_to_exclude: Optional[list] = None) and at the
start of the function (the same function that declares selected_scanner and
path_to_exclude) add a guard: if path_to_exclude is None: path_to_exclude = []
to avoid shared mutable state; follow the same pattern used for merge_results
and update any type hints/imports as needed.

287-292: Catching bare Exception may hide unexpected errors.

Per static analysis (BLE001), catching blind Exception can mask programming errors. Consider catching more specific exceptions or at minimum logging the exception type.

Proposed improvement
-        except Exception as e:
-            logger.debug(f"Unexpected error checking KB server: {e}")
+        except (OSError, ValueError) as e:
+            logger.debug(f"Error checking KB server: {type(e).__name__}: {e}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/fosslight_source/cli.py` around lines 287 - 292, Replace the broad
"except Exception as e" in the KB server check block with more specific
exception handling: catch network/IO related exceptions (e.g.,
requests.exceptions.RequestException, socket.error, TimeoutError) and handle
them the same way (use logger to record full exception type/message), and only
fall back to a generic except Exception as e that re-raises after logging if
truly needed; update the import list to include the specific exception classes
and use logger.exception or logger.debug to include exception type and stack
trace while preserving the existing retry logic that uses attempt, time.sleep,
and return False.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/fosslight_source/cli.py`:
- Around line 216-217: When selected_scanner == ALL_MODE the display string is
built from SCANNER_TYPE and incorrectly includes the 'all' entry; update the
logic that sets display_mode (the block using selected_scanner, ALL_MODE,
display_mode, and SCANNER_TYPE) to exclude the ALL_MODE value from the list
before joining (e.g., filter SCANNER_TYPE to skip entries equal to ALL_MODE) so
the output lists only actual scanner names.
- Around line 296-315: get_kb_reference_to_print builds rows with 3 columns but
KB_REFERENCE_HEADER has 4 columns, causing misalignment; update
get_kb_reference_to_print to include the missing ID column for each data row
(e.g., use an explicit identifier from the SourceItem like item.id or generate a
stable index with enumerate) so each row matches KB_REFERENCE_HEADER's ['ID',
'Source Path', 'KB Origin URL', 'Match %']; ensure the ID value is added as the
first element in the list comprehension that builds data and maintain sorting by
item.source_name_or_path.
- Around line 271-293: The final "return False" at the end of
check_kb_server_reachable is dead code because every path inside the
three-attempt loop already returns; remove that unreachable return (or
alternatively convert the loop to while/for and place a single return False
after the loop), updating only the check_kb_server_reachable function and
leaving KB_URL/logging logic unchanged.

---

Nitpick comments:
In `@src/fosslight_source/cli.py`:
- Around line 318-322: The function merge_results currently uses mutable default
arguments (scancode_result, scanoss_result, spdx_downloads, manifest_licenses)
which can cause state leakage; change their signatures to use None defaults and
inside merge_results initialize them to empty structures if None (e.g., if
scancode_result is None: scancode_result = []; same for scanoss_result and
manifest_licenses with dicts for spdx_downloads and manifest_licenses), leaving
excluded_files as None or converting to an empty set inside if needed; update
references to these variables in merge_results accordingly to avoid mutating
shared defaults.
- Line 400: The function signature currently uses a mutable default for
path_to_exclude (path_to_exclude: list = []); change the default to None (e.g.,
path_to_exclude: Optional[list] = None) and at the start of the function (the
same function that declares selected_scanner and path_to_exclude) add a guard:
if path_to_exclude is None: path_to_exclude = [] to avoid shared mutable state;
follow the same pattern used for merge_results and update any type hints/imports
as needed.
- Around line 287-292: Replace the broad "except Exception as e" in the KB
server check block with more specific exception handling: catch network/IO
related exceptions (e.g., requests.exceptions.RequestException, socket.error,
TimeoutError) and handle them the same way (use logger to record full exception
type/message), and only fall back to a generic except Exception as e that
re-raises after logging if truly needed; update the import list to include the
specific exception classes and use logger.exception or logger.debug to include
exception type and stack trace while preserving the existing retry logic that
uses attempt, time.sleep, and return False.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d996ecbd-675a-4ba9-b3f5-352adedab326

📥 Commits

Reviewing files that changed from the base of the PR and between 1a24382 and cbf16ca.

📒 Files selected for processing (2)
  • src/fosslight_source/_scan_item.py
  • src/fosslight_source/cli.py

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/fosslight_source/_scan_item.py (1)

108-125: ⚠️ Potential issue | 🟠 Major

Return KB evidence with the URL.

This helper only returns the origin URL, but the new kb_reference sheet also exports kb_evidence. Since the request can include wfp_base64, set_oss_item() currently has to hardcode "exact_match" for every hit, which mislabels snippet/WFP-based matches. Please return the evidence from the KB response here and propagate that value instead of synthesizing it later.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/fosslight_source/_scan_item.py` around lines 108 - 125, The helper
_get_origin_url_from_md5_hash currently only extracts and returns the origin URL
from the KB response; update it to also extract the KB evidence field from the
response (e.g., from data['output'] or the returned object) and return both
values (e.g., as a tuple or small dict) so callers can use the actual evidence
instead of synthesizing "exact_match"; then update set_oss_item() to accept and
propagate the returned evidence value into the kb_evidence/kb_reference export
path rather than hardcoding "exact_match".
src/fosslight_source/cli.py (1)

229-244: ⚠️ Potential issue | 🟠 Major

Don't hide kb_reference behind -m.

kb_reference is unrelated to matched-license text, but this block only adds it when need_license is true. In --scanner kb or all runs without -m, the new sheet never gets emitted. Move the KB sheet generation outside the need_license gate and keep that flag only for scancode_reference / scanoss_reference.

💡 Possible split
         if need_license:
             if selected_scanner == 'scancode':
                 sheet_list["scancode_reference"] = get_license_list_to_print(license_list)
             elif selected_scanner == 'scanoss':
                 sheet_list["scanoss_reference"] = get_scanoss_extra_info(scanoss_result)
-            elif selected_scanner == 'kb':
-                kb_ref = get_kb_reference_to_print(merged_result)
-                sheet_list["kb_reference"] = kb_ref
             else:
                 sheet_list["scancode_reference"] = get_license_list_to_print(license_list)
                 sheet_list["scanoss_reference"] = get_scanoss_extra_info(scanoss_result)
-                kb_ref = get_kb_reference_to_print(merged_result)
-                sheet_list["kb_reference"] = kb_ref
+
+        if selected_scanner in ['kb', ALL_MODE]:
+            kb_ref = get_kb_reference_to_print(merged_result)
+            if len(kb_ref) > 1:
+                sheet_list["kb_reference"] = kb_ref
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/fosslight_source/cli.py` around lines 229 - 244, The current logic only
builds kb_reference when need_license is true, causing KB-only runs
(selected_scanner == 'kb' or 'all' without -m) to omit the KB sheet; change the
block so that get_kb_reference_to_print(merged_result) and
sheet_list["kb_reference"] = kb_ref are executed outside/after the need_license
conditional while keeping get_license_list_to_print(license_list) and
get_scanoss_extra_info(scanoss_result) generation inside the need_license
branch; ensure you still respect selected_scanner checks (selected_scanner ==
'kb' should produce only kb_reference when appropriate), and leave the final if
sheet_list: scan_item.external_sheets = sheet_list check as-is.
♻️ Duplicate comments (2)
src/fosslight_source/cli.py (2)

212-214: ⚠️ Potential issue | 🟡 Minor

Use the resolved KB state here, not a second health check.

merge_results() already decides whether KB lookups ran. Re-probing here can make the cover sheet say KB Enabled / KB Unreachable independently of what was actually used to build merged_result. Please thread that final kb_used boolean into create_report_file() and use it here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/fosslight_source/cli.py` around lines 212 - 214, Thread the authoritative
kb_used boolean (determined by merge_results()) into create_report_file() and
use that value when setting the cover comment instead of re-checking the server;
specifically, stop calling check_kb_server_reachable() in the block that
currently computes run_kb and calls scan_item.set_cover_comment, accept a
kb_used parameter in create_report_file (or add it to the call site that
produces the report), and replace the cover comment logic with
scan_item.set_cover_comment("KB Enabled" if kb_used else "KB Unreachable") so
the cover sheet reflects the actual merged_result usage.

306-314: ⚠️ Potential issue | 🟠 Major

Add the missing ID value to each KB row.

KB_REFERENCE_HEADER has four columns, but each row here has only three values. That shifts every exported column left and produces a malformed kb_reference sheet.

🧩 Minimal fix
-    data = [
-        [
-            item.source_name_or_path,
-            item.kb_origin_url,
-            str(getattr(item, 'kb_evidence', '') or '')
-        ]
-        for item in rows
-    ]
+    data = [
+        [
+            idx,
+            item.source_name_or_path,
+            item.kb_origin_url,
+            str(getattr(item, 'kb_evidence', '') or '')
+        ]
+        for idx, item in enumerate(rows, start=1)
+    ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/fosslight_source/cli.py` around lines 306 - 314, The KB rows list is
missing the ID column so each row has three values while KB_REFERENCE_HEADER has
four; update the list comprehension that builds data (the block using rows and
fields item.source_name_or_path, item.kb_origin_url, getattr(item,
'kb_evidence', '')) to include the ID value as the first (or appropriate)
element for each item—e.g., use getattr(item, 'id', '') or the correct
identifier attribute on the item—so each row contains four entries matching
KB_REFERENCE_HEADER before inserting the header.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@src/fosslight_source/_scan_item.py`:
- Around line 108-125: The helper _get_origin_url_from_md5_hash currently only
extracts and returns the origin URL from the KB response; update it to also
extract the KB evidence field from the response (e.g., from data['output'] or
the returned object) and return both values (e.g., as a tuple or small dict) so
callers can use the actual evidence instead of synthesizing "exact_match"; then
update set_oss_item() to accept and propagate the returned evidence value into
the kb_evidence/kb_reference export path rather than hardcoding "exact_match".

In `@src/fosslight_source/cli.py`:
- Around line 229-244: The current logic only builds kb_reference when
need_license is true, causing KB-only runs (selected_scanner == 'kb' or 'all'
without -m) to omit the KB sheet; change the block so that
get_kb_reference_to_print(merged_result) and sheet_list["kb_reference"] = kb_ref
are executed outside/after the need_license conditional while keeping
get_license_list_to_print(license_list) and
get_scanoss_extra_info(scanoss_result) generation inside the need_license
branch; ensure you still respect selected_scanner checks (selected_scanner ==
'kb' should produce only kb_reference when appropriate), and leave the final if
sheet_list: scan_item.external_sheets = sheet_list check as-is.

---

Duplicate comments:
In `@src/fosslight_source/cli.py`:
- Around line 212-214: Thread the authoritative kb_used boolean (determined by
merge_results()) into create_report_file() and use that value when setting the
cover comment instead of re-checking the server; specifically, stop calling
check_kb_server_reachable() in the block that currently computes run_kb and
calls scan_item.set_cover_comment, accept a kb_used parameter in
create_report_file (or add it to the call site that produces the report), and
replace the cover comment logic with scan_item.set_cover_comment("KB Enabled" if
kb_used else "KB Unreachable") so the cover sheet reflects the actual
merged_result usage.
- Around line 306-314: The KB rows list is missing the ID column so each row has
three values while KB_REFERENCE_HEADER has four; update the list comprehension
that builds data (the block using rows and fields item.source_name_or_path,
item.kb_origin_url, getattr(item, 'kb_evidence', '')) to include the ID value as
the first (or appropriate) element for each item—e.g., use getattr(item, 'id',
'') or the correct identifier attribute on the item—so each row contains four
entries matching KB_REFERENCE_HEADER before inserting the header.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 77a083ad-9036-4a1b-b9f5-0ed66750a9ac

📥 Commits

Reviewing files that changed from the base of the PR and between cbf16ca and 456439b.

📒 Files selected for processing (2)
  • src/fosslight_source/_scan_item.py
  • src/fosslight_source/cli.py

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/fosslight_source/cli.py (1)

318-336: ⚠️ Potential issue | 🟠 Major

Mutable default arguments are mutated in function body - this is a bug.

The function uses mutable defaults (scancode_result: list = [], etc.) and then mutates them via .extend() (line 337) and .append() (lines 349, 362, 381). This means subsequent calls without explicit arguments will see leftover data from previous calls, causing data corruption.

Proposed fix
 def merge_results(
-    scancode_result: list = [], scanoss_result: list = [], spdx_downloads: dict = {},
-    path_to_scan: str = "", run_kb: bool = False, manifest_licenses: dict = {},
-    excluded_files: set = None
+    scancode_result: list = None, scanoss_result: list = None, spdx_downloads: dict = None,
+    path_to_scan: str = "", run_kb: bool = False, manifest_licenses: dict = None,
+    excluded_files: set | None = None
 ) -> list:
     ...
     if excluded_files is None:
         excluded_files = set()
+    if scancode_result is None:
+        scancode_result = []
+    if scanoss_result is None:
+        scanoss_result = []
+    if spdx_downloads is None:
+        spdx_downloads = {}
+    if manifest_licenses is None:
+        manifest_licenses = {}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/fosslight_source/cli.py` around lines 318 - 336, The merge_results
function currently uses mutable default arguments (scancode_result,
scanoss_result, spdx_downloads, manifest_licenses) which are mutated inside (via
.extend()/.append()), causing state leakage across calls; change each of these
default params to None and initialize them at the top of merge_results (e.g., if
scancode_result is None: scancode_result = []; same for scanoss_result,
spdx_downloads = {}, manifest_licenses = {}) so every call gets fresh
containers, keeping the existing logic that already handles excluded_files being
None.
♻️ Duplicate comments (1)
src/fosslight_source/cli.py (1)

296-315: ⚠️ Potential issue | 🟠 Major

Column count mismatch between header and data rows.

KB_REFERENCE_HEADER has 4 columns (['ID', 'Source Path', 'KB Origin URL', 'Evidence']) but the data rows only have 3 elements ([source_name_or_path, kb_origin_url, kb_evidence]). This will cause misaligned columns in the output spreadsheet.

Proposed fix - add ID column
     data = [
         [
+            idx + 1,  # ID column
             item.source_name_or_path,
             item.kb_origin_url,
             str(getattr(item, 'kb_evidence', '') or '')
         ]
-        for item in rows
+        for idx, item in enumerate(rows)
     ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/fosslight_source/cli.py` around lines 296 - 315,
get_kb_reference_to_print builds rows with only 3 values but KB_REFERENCE_HEADER
expects 4 columns; update the data row construction to include an ID as the
first column so each row matches KB_REFERENCE_HEADER. In
get_kb_reference_to_print add an ID value (prefer item.id via getattr(item,
'id', None) if present, otherwise a generated sequential index) and include it
before item.source_name_or_path; keep using item.source_name_or_path,
item.kb_origin_url and getattr(item, 'kb_evidence', '') for the remaining
columns so all rows align with KB_REFERENCE_HEADER.
🧹 Nitpick comments (3)
src/fosslight_source/cli.py (3)

271-293: Consider removing unreachable code at line 293.

The return False on line 293 is unreachable. The loop runs for attempt in [0, 1, 2], and on the final attempt (2), both exception handlers explicitly return False. The loop cannot complete without returning.

Proposed fix
         except Exception as e:
             logger.debug(f"Unexpected error checking KB server: {e}")
             if attempt < 2:
                 time.sleep(1)
             else:
                 return False
-    return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/fosslight_source/cli.py` around lines 271 - 293, The final `return False`
in check_kb_server_reachable is unreachable because the URLError and generic
Exception handlers return False on the last attempt; instead remove the inner
`else: return False` branches in the urllib.error.URLError and generic `except
Exception as e` blocks so the loop can naturally retry and fall through to the
single final `return False` at the end of check_kb_server_reachable; keep the
debug logs and the delay logic (time.sleep(1)) intact and reference the function
name check_kb_server_reachable and the exception handlers for URLError and
Exception when making the change.

391-394: Mutable default arguments in function signature.

formats: list = [] and path_to_exclude: list = [] use mutable defaults. While these aren't mutated directly in this function, it's still a Python anti-pattern that can lead to subtle bugs.

Proposed fix
     print_matched_text: bool = False,
-    formats: list = [], time_out: int = 120,
+    formats: list = None, time_out: int = 120,
     correct_mode: bool = True, correct_filepath: str = "",
-    selected_scanner: str = ALL_MODE, path_to_exclude: list = [],
+    selected_scanner: str = ALL_MODE, path_to_exclude: list = None,
     all_exclude_mode: tuple = ()

Then at the start of the function:

if formats is None:
    formats = []
if path_to_exclude is None:
    path_to_exclude = []
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/fosslight_source/cli.py` around lines 391 - 394, The function signature
currently uses mutable defaults formats: list = [] and path_to_exclude: list =
[], which is an anti-pattern; change those defaults to None (e.g., formats:
Optional[list] = None, path_to_exclude: Optional[list] = None) and at the start
of the same function initialize them with empty lists if None (if formats is
None: formats = []; if path_to_exclude is None: path_to_exclude = []); keep the
rest of the parameters (time_out, correct_mode, correct_filepath,
selected_scanner/ALL_MODE, all_exclude_mode) unchanged and ensure any type
hints/imports are updated accordingly.

143-143: Avoid mutable default argument.

Using formats: list = [] as a default can cause subtle bugs if the list is ever mutated. Although the current implementation doesn't modify it in place, it's safer to use None as the default.

Proposed fix
     correct_filepath: str = "", path_to_scan: str = "", path_to_exclude: list = [],
-    formats: list = [], api_limit_exceed: bool = False, files_count: int = 0, final_output_path: str = "",
+    formats: list = None, api_limit_exceed: bool = False, files_count: int = 0, final_output_path: str = "",
     run_kb_msg: str = ""
 ) -> 'ScannerItem':

Then at the start of the function body:

if formats is None:
    formats = []
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/fosslight_source/cli.py` at line 143, Change the function parameter
default for formats from a mutable list to None (i.e., replace "formats: list =
[]" with "formats: list | None = None") and at the start of the function body
add a guard that checks if formats is None and assigns an empty list if so
(e.g., if formats is None: formats = []). This ensures the parameter "formats"
is not a shared mutable default; update any type hints or usages in the function
accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@src/fosslight_source/cli.py`:
- Around line 318-336: The merge_results function currently uses mutable default
arguments (scancode_result, scanoss_result, spdx_downloads, manifest_licenses)
which are mutated inside (via .extend()/.append()), causing state leakage across
calls; change each of these default params to None and initialize them at the
top of merge_results (e.g., if scancode_result is None: scancode_result = [];
same for scanoss_result, spdx_downloads = {}, manifest_licenses = {}) so every
call gets fresh containers, keeping the existing logic that already handles
excluded_files being None.

---

Duplicate comments:
In `@src/fosslight_source/cli.py`:
- Around line 296-315: get_kb_reference_to_print builds rows with only 3 values
but KB_REFERENCE_HEADER expects 4 columns; update the data row construction to
include an ID as the first column so each row matches KB_REFERENCE_HEADER. In
get_kb_reference_to_print add an ID value (prefer item.id via getattr(item,
'id', None) if present, otherwise a generated sequential index) and include it
before item.source_name_or_path; keep using item.source_name_or_path,
item.kb_origin_url and getattr(item, 'kb_evidence', '') for the remaining
columns so all rows align with KB_REFERENCE_HEADER.

---

Nitpick comments:
In `@src/fosslight_source/cli.py`:
- Around line 271-293: The final `return False` in check_kb_server_reachable is
unreachable because the URLError and generic Exception handlers return False on
the last attempt; instead remove the inner `else: return False` branches in the
urllib.error.URLError and generic `except Exception as e` blocks so the loop can
naturally retry and fall through to the single final `return False` at the end
of check_kb_server_reachable; keep the debug logs and the delay logic
(time.sleep(1)) intact and reference the function name check_kb_server_reachable
and the exception handlers for URLError and Exception when making the change.
- Around line 391-394: The function signature currently uses mutable defaults
formats: list = [] and path_to_exclude: list = [], which is an anti-pattern;
change those defaults to None (e.g., formats: Optional[list] = None,
path_to_exclude: Optional[list] = None) and at the start of the same function
initialize them with empty lists if None (if formats is None: formats = []; if
path_to_exclude is None: path_to_exclude = []); keep the rest of the parameters
(time_out, correct_mode, correct_filepath, selected_scanner/ALL_MODE,
all_exclude_mode) unchanged and ensure any type hints/imports are updated
accordingly.
- Line 143: Change the function parameter default for formats from a mutable
list to None (i.e., replace "formats: list = []" with "formats: list | None =
None") and at the start of the function body add a guard that checks if formats
is None and assigns an empty list if so (e.g., if formats is None: formats =
[]). This ensures the parameter "formats" is not a shared mutable default;
update any type hints or usages in the function accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 26b9de3e-c64f-4336-a54d-8a1677e123c5

📥 Commits

Reviewing files that changed from the base of the PR and between 456439b and 620fe4c.

📒 Files selected for processing (2)
  • src/fosslight_source/_scan_item.py
  • src/fosslight_source/cli.py

@soimkim soimkim added enhancement [PR/Issue] New feature or request and removed chore [PR/Issue] Refactoring, maintenance the code labels Mar 12, 2026
@soimkim soimkim merged commit 816e546 into main Mar 12, 2026
7 checks passed
@soimkim soimkim deleted the slow branch March 12, 2026 12:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement [PR/Issue] New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant