Introduction

Compute group-level bias metrics from a cohort built from demographics.csv (plus optional commitments tables).

This repo avoids hard-coding: you can compare any two values in your group column (e.g., any two ethnicities) and define outcomes from any demographics column.

Features

Build a cohort from demographics (optionally restricted to a list of CDC IDs)
Optional row filtering via filters.json
Two outcome modes:
- Categorical: outcome = 1 if outcome_col == outcome_positive
- Numeric threshold: outcome = 1 if outcome_col (op) threshold
2×2 contingency table (a,b,c,d)
Metrics:
- Odds Ratio (OR) + log CI
- Relative Risk (RR) + log CI
- Rate Ratio scaffold (requires person-time inputs; not wired into CLI yet)
Optional continuity correction (only if explicitly provided)
Chi-square test (χ² statistic, p-value, df; Yates on by default)
- Chi-square is computed via scipy.stats.chi2_contingency with optional Yates continuity correction.
- Chi-square is reported alongside OR/RR to assess independence of group and outcome.
Logistic regression mode (--mode logit)
- Output includes coefficient, odds ratio, confidence interval, p-value, and number of observations used.
- (adjusted odds ratios) for adjusted odds ratios controlling for user-specified covariates (auto one-hot encoding for categorical covariates)

Install

From repo root:

pip install -e .

CLI entrypoint:

bias-analysis --help

Or run directly:

python -m bias_analysis.cli --help

Inputs

Required

--demographics : path to a CSV (or XLSX if your read_table() supports it)

Optional

--current : current_commitments table (loaded and passed through; used later)
--prior : prior_commitments table (loaded and passed through; used later)
--cdc-ids : optional list of CDC IDs to restrict the cohort

CLI Arguments (Summary)

Core inputs

--demographics (required): CSV/XLSX with demographic data
--group-col: column defining groups (e.g., ethnicity)
--exposed: value treated as exposed group
--unexposed: value treated as unexposed group

Outcome definition (one required)

--outcome-col: column used to derive outcome
--outcome-positive: categorical positive value
--outcome-threshold: numeric threshold
--threshold-op: comparison operator (ge, gt, le, lt, eq, ne)

Analysis mode

--mode {2x2,logit} (default: 2x2)
--covariates: list of covariate columns (logit mode only)
--covariates-file: JSON file listing covariates
--drop-missing {any,outcome,covariates,none}

Statistical options

--alpha: significance level (default 0.05)
--continuity-correction: optional float (e.g., 0.5)
--no-chi2-yates: disable Yates correction for chi-square

Filtering

--filters-file: JSON filter file
--filters-json: inline JSON filters (advanced)

Data restriction

--cdc-ids: restrict analysis to specific CDC IDs
--min-cases: minimum rows after filtering

Filters

Filters are optional. If you don’t want filters, either omit --filters-file or use:

filters.json

[]

Filter objects look like:

[
  {"col": "sex", "op": "eq", "value": "Male"},
  {"col": "facility", "op": "in", "value": ["A", "B", "C"]}
]

Supported ops in the current CLI:

in (value must be a list)
eq
neq

Important: if you filter ethnicity to a subset (e.g., only ["Black","White"]), then trying to run --exposed "American Indian" will fail because those rows were filtered out. For arbitrary group comparisons, keep filters empty or avoid filtering the group column.

Outcome modes

1. Categorical outcome

Outcome = 1 when:

demographics[outcome_col] == outcome_positive

PowerShell example:

python -m bias_analysis.cli `
  --demographics demographics.csv `
  --group-col ethnicity `
  --exposed "White" `
  --unexposed "Black" `
  --outcome-col "controlling offense" `
  --outcome-positive "PC187 2nd" `
  --filters-file filters.json `
  --continuity-correction 0.5

2. Numeric threshold outcome

Outcome = 1 when:

to_numeric(demographics[outcome_col]) (op) threshold

Supported --threshold-op:

ge (>=)
gt (>)
le (<=)
lt (<)
eq (==)
ne (!=)

Example (sentence length ≥ 10 years):

python -m bias_analysis.cli `
  --demographics demographics.csv `
  --group-col ethnicity `
  --exposed "White" `
  --unexposed "Black" `
  --outcome-col "aggregate sentence in years" `
  --outcome-threshold 10 `
  --threshold-op ge

Logistic regression mode (adjusted OR)

In addition to the default 2×2 mode, the CLI supports:

--mode logit to fit a logistic regression:
- Outcome: the derived 0/1 outcome (from --outcome-* flags)
- Exposure: a group indicator (1 = exposed, 0 = unexposed)
- Covariates: user-provided columns, automatically encoded:
  - numeric → kept numeric (to_numeric(errors="coerce"))
  - categorical → one-hot encoded (get_dummies(drop_first=True))

Example: adjusted OR controlling for age + county + offense category

python -m bias_analysis.cli `
  --mode logit `
  --demographics demographics.csv `
  --group-col ethnicity `
  --exposed "White" `
  --unexposed "Black" `
  --outcome-col "aggregate sentence in years" `
  --outcome-threshold 10 `
  --threshold-op ge `
  --covariates "age" "sentencing_county" "offense_category" `
  --drop-missing any
  | Out-File -Encoding utf8 ./logit_white_vs_black_ge10.json


## Continuity correction (zero cells)

By default, this repo does not silently “fix” zero cells.

- If any cell in the 2×2 table is zero and you do **not** pass `--continuity-correction`, ratio metrics return **NaN**.
- To enable a continuity correction (common choice: 0.5):

```bash
--continuity-correction 0.5

The correction is only applied when a zero cell is present.

Output

The CLI prints JSON with:

inputs : all inputs used (including filters and whether commitments were loaded)
table : the 2×2 table {a,b,c,d}
metrics : computed metrics + CIs

Statistical notes & references

Continuity correction

When one or more cells in a 2×2 contingency table are zero, odds ratios and relative risks are mathematically undefined.
By default, this repo returns NaN in those cases rather than silently imputing values.

If the user explicitly passes --continuity-correction (e.g. 0.5), a standard continuity correction is applied to enable finite estimation of log-odds–based measures and confidence intervals. This behavior is opt-in and the correction value is echoed in the output for transparency.

This approach follows established statistical methodology for sparse contingency tables and small-sample binomial models.

References

Anscombe, F. J. (1956).
On estimating binomial response relations.
Biometrika, 43(3–4), 461–464.
(Primary methodological reference for continuity correction in binomial/logistic settings.)
Agresti, A. (2013).
Categorical Data Analysis (3rd ed.). Wiley.
(Modern textbook treatment of sparse contingency tables, odds ratios, and continuity corrections.)
Szumilas, M. (2010).
Explaining Odds Ratios.
Journal of the Canadian Academy of Child and Adolescent Psychiatry, 19(3), 227–229.
(Clear explanation of odds ratios and their interpretation.)

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
bias_analysis		bias_analysis
tests		tests
.gitignore		.gitignore
README.md		README.md
covariates.json		covariates.json
filters.json		filters.json
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Introduction

Features

Install

Inputs

Required

Optional

CLI Arguments (Summary)

Core inputs

Outcome definition (one required)

Analysis mode

Statistical options

Filtering

Data restriction

Filters

Outcome modes

1. Categorical outcome

2. Numeric threshold outcome

Logistic regression mode (adjusted OR)

Example: adjusted OR controlling for age + county + offense category

Output

Statistical notes & references

Continuity correction

References

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Introduction

Features

Install

Inputs

Required

Optional

CLI Arguments (Summary)

Core inputs

Outcome definition (one required)

Analysis mode

Statistical options

Filtering

Data restriction

Filters

Outcome modes

1. Categorical outcome

2. Numeric threshold outcome

Logistic regression mode (adjusted OR)

Example: adjusted OR controlling for age + county + offense category

Output

Statistical notes & references

Continuity correction

References

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages