BacDive API Assay Metadata Extractor

Extract API (Analytical Profile Index) assay metadata from BacDive JSON data, with identifier mappings to CHEBI, EC, RHEA, and PubChem.

Overview

This project analyzes the BacDive bacterial database JSON file to extract:

API Assay Kits - All 17 API assay kit types found in the data
Well Metadata - Individual wells/tests with human-readable labels
Identifier Mappings - Links to CHEBI, PubChem, EC numbers, and RHEA reactions
Enzyme Information - Enzyme activities with EC classifications

Features

✅ Parses 99,392 bacterial strain records from BacDive
✅ Extracts 17 unique API kit types (API zym, API 50CHac, etc.)
✅ Maps substrate codes to CHEBI and PubChem identifiers
✅ Maps enzyme EC numbers to RHEA reaction identifiers
✅ Generates consolidated JSON metadata files
✅ Optional split output for individual API kits
✅ Comprehensive statistics and summaries

Installation

This project uses uv for fast, reliable Python package management.

Prerequisites

Python 3.12+
uv package manager

Install uv

# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

Install Dependencies

# Create virtual environment and install dependencies
uv sync

# Or install in development mode
uv pip install -e .

Usage

Basic Usage

# Extract metadata from default location (bacdive_strains.json)
uv run extract-metadata

# Or activate the virtual environment first
source .venv/bin/activate  # macOS/Linux
# .venv\Scripts\activate     # Windows
extract-metadata

Advanced Options

# Specify custom input file
uv run extract-metadata --input path/to/bacdive_strains.json

# Specify custom output directory
uv run extract-metadata --output-dir processed_data/

# Generate individual files for each API kit
uv run extract-metadata --split-kits

# Pretty-print JSON output (indented)
uv run extract-metadata --pretty

# Combine options
uv run extract-metadata --input bacdive_strains.json \
                        --output-dir data/ \
                        --split-kits \
                        --pretty
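
The flags above could be wired up roughly as follows. This is a minimal sketch using the standard-library argparse module, not the project's actual main.py: the function names build_parser and dump_json are hypothetical, the defaults are taken from the documented default input (bacdive_strains.json) and output directory (data/), and compact output when --pretty is absent is an assumption.

```python
import argparse
import json
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (hypothetical internals)."""
    parser = argparse.ArgumentParser(prog="extract-metadata")
    parser.add_argument("--input", type=Path, default=Path("bacdive_strains.json"),
                        help="BacDive JSON input file")
    parser.add_argument("--output-dir", type=Path, default=Path("data"),
                        help="directory for generated JSON files")
    parser.add_argument("--split-kits", action="store_true",
                        help="also write one JSON file per API kit")
    parser.add_argument("--pretty", action="store_true",
                        help="indent the JSON output")
    return parser


def dump_json(payload: dict, pretty: bool) -> str:
    """Serialize compactly by default (an assumption), indented when pretty."""
    if pretty:
        return json.dumps(payload, indent=2, ensure_ascii=False)
    return json.dumps(payload, separators=(",", ":"), ensure_ascii=False)


# Parse an explicit argument list so the sketch runs outside a real shell.
args = build_parser().parse_args(["--output-dir", "processed_data", "--pretty"])
print(args.input, args.output_dir, args.split_kits, args.pretty)
print(dump_json({"kit_name": "API zym"}, args.pretty))
```

Note that argparse exposes --output-dir and --split-kits as args.output_dir and args.split_kits, since dashes in option names become underscores.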

Validation

All curated identifier mappings are validated against authoritative sources and actual extracted data. See VALIDATION.md and API_WELL_CODE_SOURCES.md for complete details.

Quick Validation

# Fast validation using ontology files (~5 seconds)
make validate

# Validate API kit well code mappings against official docs
make validate-api

# Validate mappings against actual extracted data
make validate-data

# Full validation, including online database API calls (~20 minutes)
make validate-full

# Track ontology file versions
make track-files

Validation Sources

| Database | Source | Method |
| --- | --- | --- |
| CHEBI | KG-Microbe ontology TSV | Offline lookup |
| EC | KG-Microbe ontology TSV | Offline lookup |
| GO | KG-Microbe ontology TSV | Offline lookup |
| PubChem | PubChem API | Online validation |
| KEGG | KEGG API | Online validation |
| bioMérieux Docs | Official API kit documentation | Manual curation |
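
An offline lookup of the kind listed above amounts to a set-membership check against the identifiers in a TSV file. The sketch below is illustrative only: the column name "id", the helper names, and the two-row sample are assumptions, and the real KG-Microbe ontology files may be laid out differently.

```python
import csv
import io


def load_known_ids(tsv_text: str, id_column: str = "id") -> set[str]:
    """Collect every identifier from one column of a tab-separated file."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[id_column] for row in reader if row.get(id_column)}


def find_invalid(curated: list[str], known: set[str]) -> list[str]:
    """Return the curated identifiers that are absent from the ontology set."""
    return [identifier for identifier in curated if identifier not in known]


# Tiny stand-in for an ontology TSV (assumed columns: id, name).
SAMPLE_TSV = "id\tname\nCHEBI:17234\tD-Glucose\nEC:3.2.1.23\tbeta-galactosidase\n"

known_ids = load_known_ids(SAMPLE_TSV)
# "CHEBI:0000000" is a deliberately fake identifier used to trigger a miss.
print(find_invalid(["CHEBI:17234", "CHEBI:0000000"], known_ids))
```

For a file on disk, the same reader can be fed an open file handle instead of io.StringIO.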

Validation Coverage

Ontology Identifiers: 81/84 CHEBI valid (96.4%), 39/39 EC valid (100%), 55 GO terms valid

API Kit Well Code Mappings:

100% coverage across all 17 API kits
503/503 well codes mapped (data-driven validation)
59/59 wells validated against official bioMérieux documentation
All kits: API zym, API 50CHac, API biotype100, API 20E, API 20NE, API rID32STR, API coryne, API rID32A, API ID32E, API NH, API ID32STA, API CAM, API 20STR, API LIST, API STA, API 20A, API 50CHas
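
The coverage figures in this section are plain valid/total ratios rounded to one decimal place. A small helper (hypothetical, not part of the project) reproduces them:

```python
def coverage(valid: int, total: int) -> str:
    """Format a valid/total ratio as 'valid/total (xx.x%)'."""
    if total <= 0:
        raise ValueError("total must be positive")
    return f"{valid}/{total} ({valid / total:.1%})"


print(coverage(81, 84))    # CHEBI identifiers
print(coverage(39, 39))    # EC numbers
print(coverage(503, 503))  # API kit well codes
```

The helper always prints one decimal place, so complete coverage appears as 100.0% rather than the 100% written above.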

See VALIDATION.md for:

Detailed validation results
Error and warning details
Instructions for fixing invalid IDs
Version control strategy for ontology files

See API_WELL_CODE_SOURCES.md for:

How well codes are verified against official sources
Kit-specific context for ambiguous codes
Cross-kit consistency analysis

Output Files

Default Output (`data/`)

assay_metadata.json - Consolidated metadata for all API kits, wells, and enzymes
api_kits_list.json - Summary list of all 17 API kit types
statistics.json - Dataset statistics

With `--split-kits` Option

Additional directory data/kits/ containing individual JSON files for each kit:

API_zym.json
API_50CHac.json
API_20NE.json
... (one file per kit)
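
Judging from the three file names listed, each kit file appears to be named after the kit with spaces replaced by underscores. The following sketch encodes that inferred rule; the helper names are hypothetical and the extractor may treat other characters differently.

```python
from pathlib import Path


def kit_filename(kit_name: str) -> str:
    """Turn a kit name such as 'API 20NE' into 'API_20NE.json'."""
    return kit_name.strip().replace(" ", "_") + ".json"


def kit_path(output_dir: Path, kit_name: str) -> Path:
    """Location of a kit file under <output_dir>/kits/."""
    return output_dir / "kits" / kit_filename(kit_name)


for name in ["API zym", "API 50CHac", "API 20NE"]:
    print(kit_path(Path("data"), name))
```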

Output Schema

API Kit Metadata

{
  "kit_name": "API zym",
  "description": "Enzyme activity testing for 19 different enzymes",
  "category": "Enzyme profiling",
  "well_count": 20,
  "wells": ["Control", "Alkaline phosphatase", "Esterase", ...],
  "occurrence_count": 11747
}
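
A consumer of this output might load a kit record into a typed structure. The project itself uses Pydantic models (see models.py); the sketch below substitutes a standard-library dataclass so that it runs without third-party packages. The class name ApiKit and the function parse_kit are hypothetical, and the wells list is shortened to three entries.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiKit:
    """Stdlib stand-in for a kit record; field names follow the JSON example."""
    kit_name: str
    description: str
    category: str
    well_count: int
    wells: list[str]
    occurrence_count: int


def parse_kit(raw: str) -> ApiKit:
    """Parse one kit object; missing or unexpected keys raise TypeError."""
    return ApiKit(**json.loads(raw))


# Shortened copy of the documented example (only three of the wells shown).
RAW_KIT = """
{
  "kit_name": "API zym",
  "description": "Enzyme activity testing for 19 different enzymes",
  "category": "Enzyme profiling",
  "well_count": 20,
  "wells": ["Control", "Alkaline phosphatase", "Esterase"],
  "occurrence_count": 11747
}
"""

kit = parse_kit(RAW_KIT)
print(kit.kit_name, kit.well_count, kit.occurrence_count)
```

Unlike Pydantic, a dataclass does not validate field types at runtime, so this stand-in only checks that the expected keys are present.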

Well Metadata

{
  "code": "GLU",
  "label": "D-Glucose",
  "well_type": "substrate",
  "description": "Tests for utilization/fermentation of D-Glucose",
  "chemical_ids": {
    "chebi_id": "CHEBI:17234",
    "chebi_name": "D-Glucose",
    "pubchem_cid": "5793",
    "pubchem_name": "D-Glucose"
  },
  "used_in_kits": ["API 50CHac", "API biotype100", "API 20E"]
}
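
Because each well lists the kits it is used in, well records can be inverted into a per-kit index, and wells without a CHEBI identifier can be flagged. This is a sketch over in-memory records: the helper names are hypothetical, and the second record ("XXX") is invented purely to show the unmapped case.

```python
from collections import defaultdict

# Two illustrative records; only the first follows the documented example.
wells = [
    {
        "code": "GLU",
        "label": "D-Glucose",
        "chemical_ids": {"chebi_id": "CHEBI:17234", "pubchem_cid": "5793"},
        "used_in_kits": ["API 50CHac", "API biotype100", "API 20E"],
    },
    {
        "code": "XXX",  # hypothetical well with no chemical mapping
        "label": "Placeholder",
        "chemical_ids": {},
        "used_in_kits": ["API 20E"],
    },
]


def wells_by_kit(records: list[dict]) -> dict[str, list[str]]:
    """Invert 'used_in_kits' into a kit name -> well codes index."""
    index: dict[str, list[str]] = defaultdict(list)
    for record in records:
        for kit in record.get("used_in_kits", []):
            index[kit].append(record["code"])
    return dict(index)


def unmapped_codes(records: list[dict]) -> list[str]:
    """Codes of wells that lack a CHEBI identifier."""
    return [r["code"] for r in records
            if not r.get("chemical_ids", {}).get("chebi_id")]


print(wells_by_kit(wells))
print(unmapped_codes(wells))
```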

Enzyme Metadata

{
  "enzyme_name": "beta-galactosidase",
  "ec_number": "3.2.1.23",
  "ec_name": null,
  "rhea_ids": ["10079", "10080", "10081"]
}
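
The EC number and RHEA IDs are stored without prefixes, so downstream code may want to check the EC syntax and build prefixed identifiers. The sketch below assumes "EC:" and "RHEA:" as prefixes and uses a deliberately loose pattern; both helper names are hypothetical, and the check is syntactic only.

```python
import re

# Four dot-separated fields; fields after the first may be "-" for partial
# classes, and the last may carry an "n" for preliminary entries.
EC_PATTERN = re.compile(r"^[1-7]\.(\d+|-)\.(\d+|-)\.(n?\d+|-)$")


def is_valid_ec(ec_number: str) -> bool:
    """Syntactic check only; it does not confirm the entry exists."""
    return bool(EC_PATTERN.match(ec_number))


def to_curies(enzyme: dict) -> list[str]:
    """Prefix the raw EC number and RHEA IDs to form compact identifiers."""
    curies = []
    ec_number = enzyme.get("ec_number")
    if ec_number and is_valid_ec(ec_number):
        curies.append(f"EC:{ec_number}")
    curies.extend(f"RHEA:{rhea_id}" for rhea_id in enzyme.get("rhea_ids", []))
    return curies


enzyme = {
    "enzyme_name": "beta-galactosidase",
    "ec_number": "3.2.1.23",
    "ec_name": None,
    "rhea_ids": ["10079", "10080", "10081"],
}
print(to_curies(enzyme))
```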

API Kit Types Found

The extractor identifies 17 different API assay kits:

| Kit Name | Type | Well Count | Occurrences |
| --- | --- | --- | --- |
| API zym | Enzyme profiling | 20 | 11,747 |
| API 50CHac | Carbohydrate fermentation | 50 | 6,853 |
| API 20NE | Bacterial identification | 21 | 3,833 |
| API rID32STR | Bacterial identification | 32 | 3,666 |
| API biotype100 | Biochemical profiling | 99 | 3,599 |
| API 20E | Bacterial identification | 26 | 3,452 |
| API coryne | Bacterial identification | varies | 3,287 |
| ... | ... | ... | ... |

Project Structure

assay-metadata/
├── src/bacdive_assay_metadata/
│   ├── __init__.py           # Package initialization
│   ├── models.py             # Pydantic data models
│   ├── parser.py             # BacDive JSON parser
│   ├── mappers.py            # Identifier mapping utilities
│   ├── metadata_builder.py   # Metadata construction
│   └── main.py               # CLI entry point
├── data/                     # Output directory (generated)
├── bacdive_strains.json      # Input data file
├── pyproject.toml            # Project configuration
├── .python-version           # Python version specification
└── README.md                 # This file

Development

Running Tests

# Install dev dependencies
uv sync --dev

# Run tests (when implemented)
uv run pytest

Code Structure

models.py - Pydantic models for type-safe data structures
parser.py - Extracts API assay data from BacDive JSON
mappers.py - Maps codes to biological database identifiers
metadata_builder.py - Orchestrates parsing and mapping
main.py - Command-line interface
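
To make the parsing step concrete, here is a minimal sketch of how kit occurrences could be counted across strain records. It is not the code in parser.py. The record layout is an assumption not confirmed by this README: API results are taken to sit under a "Physiology and metabolism" section, keyed by kit name, as either a single dict or a list of dicts. The function names and sample records are hypothetical.

```python
from collections import Counter
from collections.abc import Iterator

# Assumed record layout (not confirmed by this README): API results sit under
# this section, keyed by kit name, as a dict or a list of dicts.
SECTION = "Physiology and metabolism"


def kit_entries(strain: dict) -> Iterator[tuple[str, dict]]:
    """Yield (kit_name, result) pairs for every API kit entry in one strain."""
    section = strain.get(SECTION) or {}
    for key, value in section.items():
        if not key.startswith("API "):
            continue
        # Normalize a single result dict to a one-element list.
        results = value if isinstance(value, list) else [value]
        for result in results:
            yield key, result


def count_kits(strains: list[dict]) -> Counter:
    """Count how many result entries exist for each kit."""
    counts: Counter = Counter()
    for strain in strains:
        for kit_name, _ in kit_entries(strain):
            counts[kit_name] += 1
    return counts


# Hypothetical miniature input with two strains.
strains = [
    {SECTION: {"API zym": {"Esterase": "+"},
               "API 20E": [{"GLU": "+"}, {"GLU": "-"}]}},
    {SECTION: {"API zym": {"Esterase": "-"}}},
]
print(count_kits(strains).most_common())
```

Whether the project counts result entries or distinct strains per kit is not stated here; this sketch counts entries.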

Identifier Mapping Coverage

Chemical Identifiers (CHEBI/PubChem)

✅ Monosaccharides: glucose, fructose, galactose, mannose, ribose, xylose, arabinose
✅ Disaccharides: maltose, lactose, sucrose, trehalose, cellobiose, melibiose
✅ Sugar alcohols: sorbitol, mannitol, inositol, dulcitol, xylitol
✅ Organic acids: citrate, lactate, pyruvate, succinate, fumarate
✅ Amino acids: tryptophan, glutamine, proline, alanine, serine, tyrosine
✅ 100+ substrate mappings total

Enzyme Identifiers (EC/RHEA)

✅ EC numbers: 129/158 enzyme wells (81.6% coverage)
- 10 EC numbers added via deterministic lookup (ExPASy ENZYME, BRENDA)
- Glycosidases: alpha-arabinofuranosidase, alpha-fucosidase, alpha-glucosidase, alpha-mannosidase, beta-glucosidase, beta-mannosidase, beta-N-acetylhexosaminidase, beta-galactosidase
- Other enzymes: tryptophanase (indole production)
✅ RHEA reaction IDs fetched via API
✅ 175 unique enzymes cataloged
✅ GO terms for arylamidases and other enzyme activities
✅ All EC numbers validated against KG-Microbe EC ontology (249,191 terms)

Data Sources

BacDive: Bacterial Diversity Metadatabase (99,392 strains)
CHEBI: Chemical Entities of Biological Interest
PubChem: Public chemistry database
EC: Enzyme Commission classification
RHEA: Expert-curated biochemical reactions

Performance

Processes 99,392 bacterial strains
Extracts 17 API kit types
Identifies 150+ unique wells/tests
Runtime: ~2-5 minutes (depending on system)

License

This project is part of the KG-Microbe knowledge graph initiative.

Contributing

Contributions welcome! Areas for improvement:

Add more substrate/chemical mappings
Integrate EC name lookups
Add InChI/SMILES for chemicals
Implement caching for API calls
Add unit tests
Support additional output formats (CSV, TSV)
