Pipeline-agnostic SNP comparison tool for bacterial genomics.
CoreGuard compares SNP calls from multiple pipelines (Snippy, CFSAN, GATK, etc.), helping identify discrepancies and artifacts in variant calling. Optionally, a reference alignment can be used to detect coverage gaps and validate SNP calls.
Live Viewer: https://genpat-it.github.io/coreguard/
The viewer includes 5 pre-loaded demo datasets. Click the demo buttons to explore:
| Demo | Organism | Samples | Pipelines | Description |
|---|---|---|---|---|
| Demo 1 | Listeria monocytogenes | 4 | Snippy, CFSAN, SPANDx | Small dataset, clone group + outlier (EGD-e ref) |
| Demo 2 | Listeria monocytogenes | 53 | Snippy, CFSAN, SPANDx | Large outbreak dataset |
| Demo 3 | Brucella melitensis | 17 | Snippy, CFSAN, SPANDx | Different organism |
| Demo 4 | West Nile Virus | 10 | Snippy, CFSAN, SPANDx | Viral genome (small reference) |
| Demo 5 | Listeria monocytogenes | 4 | Snippy, CFSAN, SPANDx | Same as Demo 1 with F2365 ref (different serovar) |
All demos use minimap2 as the reference alignment (BAM pileup without variant calling) for ground truth comparison, plus Snippy v4.6.0, CFSAN SNP Pipeline v2.2.1, and SPANDx v4.0.5 as variant calling pipelines.
The demos illustrate how different pipelines can produce varying SNP calls on the same data, and how CoreGuard helps identify these discrepancies. Demo 5 specifically demonstrates the critical impact of reference genome choice on SNP calling resolution.
Different SNP pipelines can produce different results on the same data. This matters because:
- Outbreak investigation: SNP distances determine if isolates belong to the same outbreak. A difference of 1-2 SNPs can change epidemiological conclusions.
- Pipeline validation: When adopting a new pipeline, you need to verify it produces consistent results with established methods.
- Troubleshooting: When results seem wrong, CoreGuard helps visualize exactly where and why pipelines disagree.
- Quality control: Identify problematic samples or genomic regions where pipelines consistently fail.
- Multi-pipeline comparison: Compare SNP calls from any number of pipelines side-by-side
- Reference alignment support (optional): Designate a baseline BAM alignment for gap detection and validation
- Dashboard viewer: KPI dashboard with pre-computed statistics, runs entirely in the browser via WebAssembly
- Pre-computed statistics: All KPIs computed at CLI time — the viewer loads instantly with no recomputation
- MNP decomposition: Multi-nucleotide polymorphisms (e.g., TTGGCG→CCGGCT) are automatically decomposed into individual SNPs
- Pre-computed distance matrices: Display pairwise SNP distance matrices from each pipeline's native output
- Reference discriminating SNPs vs pipelines: Compare reference alignment discriminating positions against each pipeline's core SNP output, with per-position drill-down detail
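As a rough illustration of the MNP decomposition feature listed above, the sketch below splits an equal-length REF/ALT pair into single-base substitutions. The function name `decompose_mnp` and the choice to drop unchanged bases are assumptions made for this example, not a description of CoreGuard's internal code.

```python
from typing import List, Tuple


def decompose_mnp(pos: int, ref: str, alt: str) -> List[Tuple[int, str, str]]:
    """Split an MNP into per-base SNPs, skipping bases that do not change.

    `pos` is the 1-based position of the first base of REF (assumed convention).
    """
    if len(ref) != len(alt):
        raise ValueError("MNP decomposition needs REF and ALT of equal length")
    return [
        (pos + offset, r, a)
        for offset, (r, a) in enumerate(zip(ref, alt))
        if r != a
    ]


# The example from the feature list: only bases 1, 2 and 6 actually differ.
print(decompose_mnp(100, "TTGGCG", "CCGGCT"))
# [(100, 'T', 'C'), (101, 'T', 'C'), (105, 'G', 'T')]
```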
For each pipeline, CoreGuard computes:
- Gap-Intersect / Gap-Union: Two gap-exclusion strategies
  - Gap-Intersect: exclude positions where ALL samples have a gap (permissive)
  - Gap-Union: exclude positions where ANY sample has a gap (restrictive)
- Usable Space: Reference length minus excluded gap positions
- Total SNPs: Positions in usable space where at least one sample has an alt allele
- Consensus SNPs: All samples agree on the same alt allele (non-discriminating)
- Discriminating SNPs: Samples differ — these contribute to Hamming distance
- Missing Calls: Positions where some samples have a call but others don't (variant calling inconsistency)
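The metrics above can be pictured with plain set arithmetic. The following is a minimal sketch, assuming gaps are available as one set of positions per sample; the names `usable_space` and `classify_site` are hypothetical and only mirror the definitions in the list.

```python
from typing import Dict, Set


def usable_space(ref_len: int, gaps: Dict[str, Set[int]]) -> Dict[str, int]:
    """Reference length minus excluded gap positions, for both strategies."""
    per_sample = list(gaps.values())
    # Gap-Intersect: a position is excluded only if ALL samples have a gap there.
    gap_intersect = set.intersection(*per_sample) if per_sample else set()
    # Gap-Union: a position is excluded if ANY sample has a gap there.
    gap_union = set.union(*per_sample) if per_sample else set()
    return {
        "gap_intersect": ref_len - len(gap_intersect),
        "gap_union": ref_len - len(gap_union),
    }


def classify_site(alleles: Dict[str, str], ref: str) -> str:
    """Label one usable position as 'invariant', 'consensus' or 'discriminating'."""
    observed = set(alleles.values())
    if observed == {ref}:
        return "invariant"       # no sample carries an alt allele
    if len(observed) == 1:
        return "consensus"       # every sample carries the same alt allele
    return "discriminating"      # samples differ


gaps = {"s1": {10, 11, 12}, "s2": {11, 12, 13}, "s3": {12, 50}}
print(usable_space(1000, gaps))  # {'gap_intersect': 999, 'gap_union': 995}
```

In this toy case only position 12 is a gap in every sample, so Gap-Intersect removes one position while Gap-Union removes five. Total SNPs would be the count of positions labelled either consensus or discriminating.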
For non-reference pipelines, discriminating SNPs are classified by heuristic:
- Gap-affected: At least one sample was skipped due to gap — partial comparison
- Ref-consensus: Reference alignment shows all samples agree — likely variant calling artifact
- Majority-rule: All but one sample agree — likely single-sample VC miss
- Confirmed: Genuine disagreement between samples
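One way to read the four categories is as a first-match-wins decision chain. The sketch below assumes the rules are tested in the order listed and that "all but one sample agree" only applies when there are more than two samples; both are assumptions of this example, as are the function and argument names. CoreGuard's actual precedence may differ.

```python
from collections import Counter
from typing import Dict, Optional


def classify_discriminating(
    pipeline: Dict[str, Optional[str]],
    reference_alignment: Dict[str, str],
) -> str:
    """Classify one discriminating position of a non-reference pipeline.

    `pipeline` maps sample -> allele, with None for a sample skipped due to a gap.
    `reference_alignment` maps sample -> allele seen in the reference pileup.
    """
    if any(allele is None for allele in pipeline.values()):
        return "gap-affected"
    if len(set(reference_alignment.values())) == 1:
        return "ref-consensus"   # baseline sees no difference between samples
    top_count = Counter(pipeline.values()).most_common(1)[0][1]
    if len(pipeline) > 2 and top_count == len(pipeline) - 1:
        return "majority-rule"   # a single sample disagrees with the rest
    return "confirmed"


calls = {"s1": "T", "s2": "C", "s3": "C"}
print(classify_discriminating(calls, {"s1": "C", "s2": "C", "s3": "C"}))  # ref-consensus
print(classify_discriminating(calls, {"s1": "T", "s2": "C", "s3": "C"}))  # majority-rule
```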
- Avg Usable Space / Discriminating SNPs: Averaged across all sample pairs (Gap-Union)
- Min / Median / Max: Distribution of pairwise discriminating SNPs
- Per-Sample Table: For each sample, average usable space and discriminating SNPs across all pairs involving that sample
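To make the pairwise summaries concrete, here is a small sketch that derives the min/median/max distribution and the per-sample averages from a table of pairwise discriminating SNP counts. The input layout (a dict keyed by sample pairs) and the function names are assumptions for illustration only.

```python
from statistics import mean, median
from typing import Dict, Tuple

PairCounts = Dict[Tuple[str, str], int]


def summarize_pairs(pair_snps: PairCounts) -> Dict[str, float]:
    """Distribution of discriminating SNPs across all sample pairs."""
    values = list(pair_snps.values())
    return {
        "min": min(values),
        "median": median(values),
        "max": max(values),
        "avg": mean(values),
    }


def per_sample_average(pair_snps: PairCounts) -> Dict[str, float]:
    """Average discriminating SNPs over all pairs involving each sample."""
    by_sample: Dict[str, list] = {}
    for (a, b), count in pair_snps.items():
        by_sample.setdefault(a, []).append(count)
        by_sample.setdefault(b, []).append(count)
    return {sample: mean(counts) for sample, counts in by_sample.items()}


pairs = {("s1", "s2"): 0, ("s1", "s3"): 4, ("s2", "s3"): 5}
print(summarize_pairs(pairs))
print(per_sample_average(pairs))
```

Here s3 stands out with the highest per-sample average, which is the kind of outlier the per-sample table is meant to surface.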
The reference pipeline calls SNPs from BAM pileup using majority vote:
- Count reads at each position (A, C, G, T)
- If total depth < min_depth → Gap
- If majority base fraction < min_consensus (default 80%) → Ambiguous (skipped)
- If majority base ≠ reference → SNP
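The majority-vote rules above can be sketched for a single position as follows. This assumes per-base read counts have already been extracted from the pileup; the function name `call_position` and the returned labels are hypothetical, and the defaults simply mirror the documented `min_depth` and `min_consensus` options.

```python
from typing import Dict


def call_position(
    counts: Dict[str, int],
    ref_base: str,
    min_depth: int = 1,
    min_consensus: float = 0.8,
) -> str:
    """Return 'gap', 'ambiguous', 'snp' or 'ref' for one pileup column."""
    depth = sum(counts.get(base, 0) for base in "ACGT")
    # Too few reads (or none at all): the position is treated as a gap.
    if depth == 0 or depth < min_depth:
        return "gap"
    # Majority base and the number of reads supporting it.
    base, support = max(
        ((b, counts.get(b, 0)) for b in "ACGT"), key=lambda item: item[1]
    )
    if support / depth < min_consensus:
        return "ambiguous"
    return "snp" if base != ref_base else "ref"


print(call_position({"A": 1, "G": 9}, ref_base="A"))  # snp
print(call_position({"A": 6, "G": 4}, ref_base="A"))  # ambiguous
print(call_position({}, ref_base="A"))                # gap
```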
- Description: Project description in Markdown (from config)
- Pipelines: Pipeline metadata (labels, commands, data types)
- Statistics: KPI dashboard with reference metrics and reference discriminating SNPs vs pipelines
- SNP Distance Matrix: Pre-computed distance matrices from each pipeline
- Dark/Light theme: Toggle between themes
- Collapsible panels: Organize information in expandable sections
- Info icons: Hover/click for detailed explanations of each metric
# Clone the repository
git clone https://github.com/genpat-it/coreguard.git
cd coreguard
# Build with Rust
cargo build --release
# Binary will be at ./target/release/coreguard
# project.yaml
reference:
  path: reference.fasta
  label: "My Reference Genome"
# Optional: project description (inline markdown or path to .md file)
description: "## My Study\nComparison of Snippy vs CFSAN on *Listeria* dataset."
# Or: description: docs/study_description.md
samples:
  sample1:
    label: "Sample 1" # optional display label
  sample2: {}
  sample3: {}
pipelines:
  # Reference alignment (BAM only: gap detection and SNP validation baseline)
  minimap2:
    label: "Reference (minimap2)"
    command: "minimap2 -ax sr -t 8 ref.fa reads_1.fq.gz reads_2.fq.gz | samtools sort -o out.bam"
    reference: true
    samples:
      sample1:
        bam: alignments/sample1.bam
      sample2:
        bam: alignments/sample2.bam
      sample3:
        bam: alignments/sample3.bam
  # SNP pipeline with core_snps + BAM (recommended)
  snippy:
    label: "Snippy v4.6"
    command: "snippy --ref reference.fa --R1 reads_1.fq.gz --R2 reads_2.fq.gz --outdir out"
    distance_matrix: snippy/core.distances.tsv # pre-computed distance matrix (optional)
    core_snps: snippy/core.tab # core SNP output file (recommended)
    samples:
      sample1:
        bam: snippy/sample1/snps.bam
      sample2:
        bam: snippy/sample2/snps.bam
      sample3:
        bam: snippy/sample3/snps.bam
  # Another pipeline
  cfsan:
    label: "CFSAN SNP Pipeline"
    command: "cfsan_snp_pipeline run -m soft -o output reference.fasta"
    distance_matrix: cfsan/snp_distance_matrix.tsv # pre-computed distance matrix (optional)
    core_snps: cfsan/snplist.txt # core SNP positions (recommended)
    samples:
      sample1:
        bam: cfsan/sample1/reads.sorted.bam
      sample2:
        bam: cfsan/sample2/reads.sorted.bam
      sample3:
        bam: cfsan/sample3/reads.sorted.bam
options:
  min_depth: 1
  min_qual: 0
  include_indels: false
# Dashboard mode (recommended) - lightweight output for web viewing
coreguard compare --config project.yaml -o report.json.gz --dashboard
# Full mode - includes raw data for advanced analysis
coreguard compare --config project.yaml -o report.json.gz
# JSON format (human-readable, for debugging)
coreguard compare --config project.yaml -o report.json
Dashboard mode (--dashboard) omits raw gap/SNP data and keeps only pre-computed statistics. This reduces file size by 10-100x while maintaining full dashboard functionality.
Open https://genpat-it.github.io/coreguard/ and drag & drop your report file.
Supported formats: .json, .json.gz, .bin, .bin.gz
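Outside the viewer, a JSON report can also be inspected from a script. The helper below is a minimal sketch that handles only the `.json` and `.json.gz` variants; the `.bin` formats are not covered, and because the report schema is not described here, the example makes no assumption about which keys the file contains. The name `load_report` is hypothetical.

```python
import gzip
import json
from pathlib import Path
from typing import Any


def load_report(path: str) -> Any:
    """Load a CoreGuard report written as plain or gzip-compressed JSON."""
    report_path = Path(path)
    # Pick a gzip-aware opener when the file name ends in .gz.
    opener = gzip.open if report_path.suffix == ".gz" else open
    with opener(report_path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `sorted(load_report("report.json.gz"))` would list the top-level keys of a report produced by one of the commands above.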
| Option | Description | Default |
|---|---|---|
| min_depth | Minimum read depth to consider position covered | 1 |
| min_qual | Minimum VCF QUAL score (0 = no filtering; recommended since variant callers already apply their own filters) | 0 |
| min_consensus | Minimum fraction of reads agreeing on a base for reference pileup (0.0-1.0) | 0.8 |
| include_indels | Include insertions/deletions | false |
| Option | Description | Default |
|---|---|---|
| label | Display name in the viewer | pipeline ID |
| command | Command line used (shown in viewer) | - |
| reference | Mark as reference alignment (BAM-only baseline) | false |
| gaps_only | Only load gaps from BAM, skip SNP pileup (reduces output size) | false |
| distance_matrix | Path to pre-computed SNP distance matrix (TSV) | - |
| core_snps | Path to core SNP output (snippy core.tab or CFSAN snplist.txt) | - |
| Option | Description |
|---|---|
| bam | Path to BAM file for coverage/gap detection (required) |
| vcf | Path to VCF file with SNP calls (optional, only needed if no core_snps) |
| label | Display name for the sample (optional) |
The description field accepts either inline markdown or a path to a .md/.txt file:
# Inline
description: "## My Study\nComparing pipelines on *Listeria* outbreak data."
# File reference
description: docs/study_description.md
The core_snps field points to a CoreGuard TSV file, a simple tab-separated format with genomic positions and per-sample alleles:
CHR POS REF sample1 sample2 sample3 sample4
CP014790.1 16686 C C C T C
CP014790.1 17153 C C T C C
CP014790.1 23349 T T T A T
- CHR: Chromosome/contig name
- POS: Genomic position (1-based)
- REF: Reference allele
- <sample>: Sample allele (- for gap, N for no call)
Snippy core.tab output is already in this format — use it directly.
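For readers who want to work with this format directly, the sketch below reads a CoreGuard-style TSV into Python structures. It assumes the first three columns are CHR, POS and REF and that every remaining column is a sample, as in the layout above; the function name `read_core_snps` is hypothetical.

```python
import csv
import io
from typing import Dict, List, TextIO, Tuple


def read_core_snps(handle: TextIO) -> Tuple[List[str], List[Dict]]:
    """Parse a tab-separated core SNP table into (sample names, rows)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[3:]          # everything after CHR, POS, REF
    rows = []
    for fields in reader:
        if not fields:            # tolerate blank lines
            continue
        chrom, pos, ref = fields[:3]
        rows.append({
            "chr": chrom,
            "pos": int(pos),      # 1-based genomic position
            "ref": ref,
            "alleles": dict(zip(samples, fields[3:])),
        })
    return samples, rows


example = (
    "CHR\tPOS\tREF\tsample1\tsample2\n"
    "CP014790.1\t16686\tC\tC\tT\n"
)
samples, rows = read_core_snps(io.StringIO(example))
print(samples, rows[0]["alleles"])
```

With a real file, pass `open("core.tab", newline="")` instead of the in-memory example.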
CFSAN requires conversion:
coreguard convert --from cfsan-snpma \
-i snpma.fasta \
--snplist snplist.txt \
--reference-snp referenceSNP.fasta \
-o cfsan_core_snps.tsv
For SPANDx, convert the GATK VariantsToTable output:
coreguard convert --from spandx-vcf-table \
-i out.vcf.table \
-o spandx_core_snps.tsv
Then reference the converted file in your YAML:
pipelines:
  cfsan:
    label: "CFSAN 2.2.1"
    core_snps: cfsan_core_snps.tsv
  spandx:
    label: "SPANDx v4.0.5"
    core_snps: spandx_core_snps.tsv
The distance_matrix field points to a TSV file with pairwise SNP distances between samples. CoreGuard can compute this automatically from various sources:
From Nexus SNP matrix (e.g., SPANDx Ortho_SNP_matrix.nex):
coreguard convert --from nexus \
-i Ortho_SNP_matrix.nex \
-o distances.tsv
From CoreGuard TSV core_snps:
coreguard convert --from core-snps \
-i core_snps.tsv \
-o distances.tsv
These computed distances are displayed in the viewer's "SNP Distance Matrix" panel.
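Conceptually, a distance matrix derived from core SNPs is a pairwise Hamming distance over the per-sample allele columns. The sketch below illustrates that idea on the three example rows shown earlier. Skipping positions where either sample has `-` or `N` is an assumption of this example; how `coreguard convert` treats such positions is not stated here.

```python
from itertools import combinations
from typing import Dict, List, Tuple

MISSING = {"-", "N"}  # gap and no-call symbols from the core_snps format


def pairwise_distances(alleles: Dict[str, List[str]]) -> Dict[Tuple[str, str], int]:
    """Count differing positions for every pair of samples."""
    distances = {}
    for a, b in combinations(sorted(alleles), 2):
        distances[(a, b)] = sum(
            1
            for x, y in zip(alleles[a], alleles[b])
            if x not in MISSING and y not in MISSING and x != y
        )
    return distances


# Columns of the three example positions (16686, 17153, 23349).
alleles = {
    "sample1": ["C", "C", "T"],
    "sample2": ["C", "T", "T"],
    "sample3": ["T", "C", "A"],
    "sample4": ["C", "C", "T"],
}
for pair, dist in pairwise_distances(alleles).items():
    print(pair, dist)
```

On this toy input sample1 and sample4 are identical (distance 0), while sample2 and sample3 differ at all three positions.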
| Format | Extension | Description |
|---|---|---|
| JSON gzip | .json.gz | Recommended for production |
| JSON | .json | Human-readable, for debugging |
| Mode | Flag | Output Size | Use Case |
|---|---|---|---|
| Dashboard | --dashboard | 10-100x smaller | Web viewing, sharing |
| Full | (default) | Large | Advanced analysis, raw data access |
Recommendation: Use --dashboard for web viewing. Use full mode only if you need raw gap/SNP data for custom analysis.
- CLI: Rust with rayon parallelization (fast BAM/core_snps processing, pre-computes all KPIs)
- Viewer: Vanilla JavaScript dashboard with pre-computed statistics
- No backend required: All processing happens at CLI time; viewer loads instantly
The CLI requires samtools to be installed and available in PATH for BAM depth calculation.
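When CoreGuard is driven from a wrapper script, it can help to check for samtools before launching a run. This is a small sketch using only the Python standard library; the helper name `find_tool` is hypothetical.

```python
import shutil
from typing import Optional


def find_tool(name: str = "samtools") -> Optional[str]:
    """Return the full path of an executable found in PATH, or None."""
    return shutil.which(name)


location = find_tool()
if location is None:
    print("samtools not found in PATH; install it before running coreguard")
else:
    print(f"samtools found at {location}")
```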
Paper in preparation
MIT License - see LICENSE
GenPat Team
Contact: genpat@izs.it