feat: apply multiple testing correction (p-adj) for statistical significance #12

hjn0415a wants to merge 7 commits into OpenMS:main from
Conversation
📝 Walkthrough

This PR adds GO enrichment analysis capabilities to a proteomics workflow application. It introduces a new Proteomics LFQ results page, implements GO enrichment analysis with MyGeneInfo integration and Fisher's exact testing, migrates statistical filtering from raw p-values to FDR-adjusted p-values across multiple result views, and adds Windows compatibility to workflow management.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant WT as WorkflowTest
    participant AD as get_abundance_data
    participant MGI as MyGeneInfo API
    participant SA as Statistical Analysis
    participant RS as Results Storage
    WT->>AD: request abundance data
    AD-->>WT: return pivot_df with proteins
    WT->>WT: extract UniProt IDs from pivot_df
    WT->>WT: filter genes by p-value & log2FC<br/>(foreground & background sets)
    loop for each GO type (BP, CC, MF)
        WT->>MGI: query GO terms for genes
        MGI-->>WT: return GO term mappings
        WT->>SA: compute Fisher's exact test<br/>per GO term
        SA-->>WT: return p-values & odds ratios
        WT->>WT: generate Plotly bar chart
    end
    WT->>RS: save figures (JSON) & DataFrames
    WT->>WT: store results in session state
    WT-->>WT: log completion
```
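The per-GO-term statistic in the loop above reduces to Fisher's exact test on a 2x2 contingency table. A minimal sketch with `scipy.stats.fisher_exact`; the counts are invented for illustration, and it assumes the foreground set is a subset of the background, as the diagram implies:

```python
from scipy.stats import fisher_exact

# Illustrative counts for a single GO term (not taken from the PR):
fg_with, fg_total = 12, 80      # foreground genes annotated with the term / total foreground
bg_with, bg_total = 40, 1200    # background genes annotated with the term / total background

# 2x2 table: rows = foreground vs. rest of background, cols = with vs. without the term.
# The foreground is subtracted out of the second row so the rows are disjoint.
table = [
    [fg_with, fg_total - fg_with],
    [bg_with - fg_with, (bg_total - fg_total) - (bg_with - fg_with)],
]
odds_ratio, p_value = fisher_exact(table, alternative="greater")
```

With 15% of the foreground annotated versus roughly 2.5% of the rest of the background, the one-sided test reports enrichment.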
🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed (2 warnings)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
content/results_pca.py (1)

46-46: ⚠️ Potential issue | 🟡 Minor

Inconsistent user-facing message: still says "p-value" instead of "p-adj".

Line 46 reads "Not enough proteins after p-value filtering for PCA" but filtering now uses p-adj.

Proposed fix

- st.info("Not enough proteins after p-value filtering for PCA.")
+ st.info("Not enough proteins after p-adj filtering for PCA.")
🤖 Fix all issues with AI agents
In `@content/results_proteomicslfq.py`:
- Line 56: The table is being sorted by "p-value" but should use adjusted
p-values; update the call that renders the dataframe (st.dataframe(...)) to sort
pivot_df by "p-adj" instead of "p-value" (i.e., change
pivot_df.sort_values("p-value") to pivot_df.sort_values("p-adj") when
rendering), and ensure pivot_df contains the "p-adj" column or fall back
gracefully if missing (check pivot_df.columns or use
pivot_df.sort_values(by=["p-adj"] if "p-adj" in pivot_df.columns else
["p-value"])).
In `@src/common/results_helpers.py`:
- Around line 264-270: The issue is that when stats_df is empty the "p-adj"
column is never created and later access raises a KeyError; fix it by ensuring
"p-adj" exists regardless of emptiness—before the existing if not stats_df.empty
block (or in an else branch for the empty case) initialize stats_df["p-adj"] =
np.nan (using the same numpy alias) so that subsequent code referencing
stats_df["p-adj"] (and functions using stats_df, mask, multipletests) always
finds the column.
In `@src/WorkflowTest.py`:
- Around line 815-828: The inline GO enrichment block and the subsequent final
report comment both use the "5️⃣" section number; update the final report
section marker to "6️⃣" to avoid duplicate numbering—locate the comment
preceding the final report (near the block that follows
workspace_path/get_abundance_data and the call to self._run_go_enrichment) and
change "5️⃣ Final report" to "6️⃣ Final report" so section numbers are unique.
- Around line 856-859: The foreground selection for GO enrichment still filters
on analysis_df["p-value"] when the PR intends to use FDR-adjusted p-values;
update the filter in the fg_ids assignment to use the adjusted p-value column
(e.g., "p-adj") instead of "p-value" while still applying the log2FC cutoff
(fc_cutoff) and p_cutoff threshold, and also ensure the earlier dropna call that
constructs pivot_df includes "p-adj" in its subset so p-adj values are present
for filtering; locate and modify the fg_ids computation and the prior dropna
call (references: variable fg_ids, dataframe analysis_df, thresholds p_cutoff
and fc_cutoff, and the "p-adj" column) accordingly.
- Around line 870-872: The mygene query mg.querymany(bg_ids, scopes="uniprot",
fields="go", as_dataframe=False) is a network call with no timeout or error
handling; wrap this call in a try/except around the mg.querymany invocation (the
call that uses bg_ids and returns res_list) and add a timeout parameter if
supported by the client (or call via a requests.Session with timeout) so the
call fails fast; on exception, log the error (including exception details) and
implement a clear fallback (e.g., return an empty res_list or re-raise a custom
exception) so the workflow does not hang indefinitely.
🧹 Nitpick comments (8)
requirements.txt (1)
148-152: Consider pinning versions for mygene and statsmodels.

Both new dependencies are unpinned, which is consistent with other deps at the bottom of this file (e.g., scipy, scikit-learn). However, for reproducibility and to avoid surprise breakage, consider adding version constraints — especially for statsmodels, whose multipletests API you depend on.

src/workflow/WorkflowManager.py (1)
207-212: Prefer subprocess.run over os.system for process termination.

While the int() cast on line 206 ensures pid is always a safe integer (mitigating the shell injection flagged by Ruff S605), os.system is still a suboptimal choice: it spawns a shell, doesn't raise on failure, and returns only an opaque exit code.

Proposed fix

+import subprocess
 ...
         # Windows
         if platform.system() == "Windows":
-            os.system(f"taskkill /F /T /PID {pid}")
+            subprocess.run(
+                ["taskkill", "/F", "/T", "/PID", str(pid)],
+                check=False,
+            )
         else:
             # Linux/macOS
             os.kill(pid, signal.SIGTERM)

src/WorkflowTest.py (4)
875-875: Use != True → ne(True) or invert with ~ for pandas boolean filtering.

res_go["notfound"] != True works but triggers Ruff E712. The notfound column may contain NaN for found entries, so a safe alternative:

Proposed fix

- res_go = res_go[res_go["notfound"] != True]
+ res_go = res_go[~res_go["notfound"].fillna(False).astype(bool)]
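The difference matters because `!= True` on an object column quietly relies on NaN comparing unequal to True, while the suggested form makes the coercion explicit. A toy frame with invented values, mimicking how mygene marks misses:

```python
import pandas as pd

# Hypothetical mygene-style output: "notfound" is True for misses
# and NaN (here None) for hits, making a bare `!= True` fragile.
res_go = pd.DataFrame({
    "query": ["P12345", "Q99999", "P67890"],
    "notfound": [None, True, None],
})

# Coerce NaN to False, cast to bool, then invert: keeps only found rows.
found = res_go[~res_go["notfound"].fillna(False).astype(bool)]
```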
885-888: Lambda captures loop variable go_type — but safe here since .apply() executes immediately.

Ruff B023 flags this, but since the lambda is consumed immediately by .apply() within the same iteration, go_type has the correct value. No functional bug. Optionally, you can bind the variable as a default argument to silence the lint:

Optional fix to silence lint

 res_go[f"{go_type}_terms"] = res_go["go"].apply(
-    lambda x: extract_go_terms(x, go_type)
+    lambda x, _gt=go_type: extract_go_terms(x, _gt)
 )
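For context, the late-binding pitfall Ruff B023 guards against, and the default-argument fix, can be seen in isolation:

```python
# Closures created in a loop all share one cell and see the *final*
# value of the loop variable once the loop has finished...
funcs = [lambda: go_type for go_type in ("BP", "CC", "MF")]
late = [f() for f in funcs]          # every call returns "MF"

# ...whereas a default argument is evaluated at definition time,
# freezing the value separately for each iteration.
funcs_bound = [lambda _gt=go_type: _gt for go_type in ("BP", "CC", "MF")]
bound = [f() for f in funcs_bound]   # one value per iteration
```

In the review's case the lambda never outlives its iteration, so the binding is cosmetic, but it does make the intent explicit.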
893-893: Remove extraneous f prefix from strings without placeholders (lines 893, 942, 961).

Ruff F541 flags these. They're harmless but noisy.

Proposed fix

- self.logger.log(f"✅ fg_set bg_set are set")
+ self.logger.log("✅ fg_set bg_set are set")

- self.logger.log(f"✅ Plotly Figure generated")
+ self.logger.log("✅ Plotly Figure generated")

- self.logger.log(f"✅ go_type generated")
+ self.logger.log("✅ go_type generated")
1013-1037: idxml_to_df is duplicated from src/common/results_helpers.py.

This local function (lines 1013-1037) is identical to the one in results_helpers.py (lines 18-43). Consider importing and reusing it to avoid drift. Note that the module is already imported at line 16 via parse_idxml, so adding idxml_to_df to the import would be straightforward.

Proposed fix

-from src.common.results_helpers import parse_idxml, build_spectra_cache
+from src.common.results_helpers import parse_idxml, build_spectra_cache, idxml_to_df

Then remove the local idxml_to_df definition at lines 1013-1037.

content/results_proteomicslfq.py (2)
39-39: Single-element st.tabs — intentional placeholder?

st.tabs(["🧬 Protein Table"]) with a single tab adds visual overhead. If more tabs are planned (e.g., a GO enrichment tab to complement the section below), this makes sense as a placeholder. Otherwise, the tab wrapper can be removed.

67-74: Consider handling malformed go_results.json gracefully.

If the JSON file is corrupted or has an unexpected schema, the page will crash. A try/except around the load would improve resilience.
Proposed fix

 import json
 import plotly.io as pio

-with open(go_json_file, "r") as f:
-    go_data = json.load(f)
+try:
+    with open(go_json_file, "r") as f:
+        go_data = json.load(f)
+except (json.JSONDecodeError, KeyError) as e:
+    st.error(f"Failed to load GO results: {e}")
+    st.stop()
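The same resilience idea can be packaged as a small loader; `load_go_results` is a hypothetical helper name (the page itself loads inline), shown here framework-free so the fallback is testable:

```python
import json
import logging

logger = logging.getLogger(__name__)

def load_go_results(path):
    """Load a GO-results JSON file, returning None instead of crashing
    when the file is missing or corrupt (sketch, not the PR's code)."""
    try:
        with open(path, "r") as f:
            return json.load(f)
    except FileNotFoundError:
        logger.warning("GO results file not found: %s", path)
        return None
    except json.JSONDecodeError as exc:
        logger.error("GO results file is corrupt: %s", exc)
        return None
```

A Streamlit page can then branch on the None result with `st.error` / `st.stop` as the diff above does.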
    st.info("No protein-level data available.")
else:
    st.session_state["pivot_df"] = pivot_df
    st.dataframe(pivot_df.sort_values("p-value"), use_container_width=True)
Sorting by p-value instead of p-adj — inconsistent with the PR's goal.
All other result pages (PCA, Volcano) have been migrated to use p-adj. This page still sorts the protein table by raw p-value.
Proposed fix
- st.dataframe(pivot_df.sort_values("p-value"), use_container_width=True)
+ st.dataframe(pivot_df.sort_values("p-adj"), use_container_width=True)
if not stats_df.empty:
    mask = stats_df["p-value"].notna()
    if mask.any():
        _, p_adj, _, _ = multipletests(stats_df.loc[mask, "p-value"], method="fdr_bh")
        stats_df.loc[mask, "p-adj"] = p_adj
    else:
        stats_df["p-adj"] = np.nan
FDR correction logic is well-implemented.
Proper NaN handling with the mask, correct multipletests unpacking, and the fdr_bh method is the standard Benjamini–Hochberg choice.
One edge case: if stats_df is empty (no protein groups), the p-adj column is never created, and line 296 will raise a KeyError. While unlikely given the earlier guards, consider adding a fallback:
Proposed defensive addition
 if not stats_df.empty:
     mask = stats_df["p-value"].notna()
     if mask.any():
         _, p_adj, _, _ = multipletests(stats_df.loc[mask, "p-value"], method="fdr_bh")
         stats_df.loc[mask, "p-adj"] = p_adj
     else:
         stats_df["p-adj"] = np.nan
+else:
+    stats_df["p-adj"] = pd.Series(dtype=float)
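A self-contained sketch of the masked Benjamini-Hochberg adjustment plus the defensive column initialization described above (the p-values are invented):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multitest import multipletests

stats_df = pd.DataFrame({"p-value": [0.01, 0.04, np.nan, 0.03, 0.20]})

# Create the column up front so downstream code can always reference
# stats_df["p-adj"], even for an empty or all-NaN frame.
stats_df["p-adj"] = np.nan

# Mask out NaNs: multipletests must only see valid p-values.
mask = stats_df["p-value"].notna()
if mask.any():
    # fdr_bh = Benjamini-Hochberg; returns (reject, p_adj, alphacSidak, alphacBonf).
    _, p_adj, _, _ = multipletests(stats_df.loc[mask, "p-value"], method="fdr_bh")
    stats_df.loc[mask, "p-adj"] = p_adj
```

For four valid tests, the smallest p-value 0.01 becomes 0.01 x 4 / 1 = 0.04, while the NaN row keeps NaN in "p-adj".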
# ======================================================
# ⚠️ 5️⃣ GO Enrichment Analysis (INLINE IN EXECUTION)
# ======================================================
workspace_path = Path(self.workflow_dir).parent
res = get_abundance_data(workspace_path)
if res is not None:
    pivot_df, _, _ = res
    self.logger.log("✅ pivot_df loaded, starting GO enrichment...")
    self._run_go_enrichment(pivot_df, results_dir)
else:
    st.warning("GO enrichment skipped: abundance data not available.")

# ================================
# 5️⃣ Final report
Duplicate section numbering: two sections labeled "5️⃣".
Line 816 labels GO enrichment as step 5️⃣, and line 828 labels the final report as step 5️⃣ as well. Consider renumbering the final report to 6️⃣.
fg_ids = analysis_df[
    (analysis_df["p-value"] < p_cutoff) &
    (analysis_df["log2FC"].abs() >= fc_cutoff)
]["UniProt"].dropna().astype(str).unique().tolist()
GO enrichment foreground selection still uses raw p-value — should this use p-adj?
The PR's stated goal is to upgrade from raw p-values to FDR-adjusted p-values. However, the foreground gene set for GO enrichment at line 857 still filters on analysis_df["p-value"] < p_cutoff. Since pivot_df now contains p-adj, consider using it here for consistency:
Proposed fix
- fg_ids = analysis_df[
- (analysis_df["p-value"] < p_cutoff) &
- (analysis_df["log2FC"].abs() >= fc_cutoff)
- ]["UniProt"].dropna().astype(str).unique().tolist()
+ fg_ids = analysis_df[
+ (analysis_df["p-adj"] < p_cutoff) &
+ (analysis_df["log2FC"].abs() >= fc_cutoff)
+ ]["UniProt"].dropna().astype(str).unique().tolist()

Also update line 840 to include p-adj in the dropna subset:
- analysis_df = pivot_df.dropna(subset=["p-value", "log2FC"]).copy()
+ analysis_df = pivot_df.dropna(subset=["p-adj", "log2FC"]).copy()
res_list = mg.querymany(
    bg_ids, scopes="uniprot", fields="go", as_dataframe=False
)
External API call to MyGene.info has no timeout or error handling.
mg.querymany(...) is a network call. If the service is slow or unreachable, this will hang the workflow indefinitely. Wrap it in a try/except and consider setting a timeout.
Proposed fix
- res_list = mg.querymany(
- bg_ids, scopes="uniprot", fields="go", as_dataframe=False
- )
+ try:
+ res_list = mg.querymany(
+ bg_ids, scopes="uniprot", fields="go", as_dataframe=False
+ )
+ except Exception as e:
+ self.logger.log(f"❗ MyGene.info API error: {e}")
+ st.error("Failed to fetch GO terms from MyGene.info. Please try again later.")
+         return
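One hedged way to implement the fallback is a thin wrapper around whichever callable performs the request; `safe_querymany` is a hypothetical helper, and whether mygene's client accepts a timeout parameter needs checking against its documentation:

```python
import logging

logger = logging.getLogger(__name__)

def safe_querymany(query_fn, ids, **kwargs):
    """Run a network query, returning [] instead of crashing the workflow.

    query_fn is whatever callable performs the request (e.g. mg.querymany).
    A hard timeout should additionally be configured at the HTTP-session
    level if the client library supports it.
    """
    try:
        return query_fn(ids, **kwargs)
    except Exception as exc:  # deliberately broad at a network boundary
        logger.error("GO term lookup failed for %d ids: %s", len(ids), exc)
        return []
```

The caller can then treat an empty list as "skip enrichment" rather than hanging or raising mid-run.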
Integration of statsmodels: Added statsmodels.stats.multitest to perform robust p-value adjustment.
Updated load_abundance_data: