35 commits
e2b03d2
CU-869c36ruk: Add initial README
mart-r Feb 10, 2026
69a8d5b
CU-869c36ruk: Move embedding linker to separate project / package
mart-r Feb 10, 2026
6760ce4
CU-869c36ruk: Add initial pyproject.toml
mart-r Feb 10, 2026
5451559
CU-869c36ruk: Move embedding linker config to relevant project
mart-r Feb 10, 2026
70d25b6
CU-869c36ruk: Add entry point based plugin registration
mart-r Feb 10, 2026
0c2107c
CU-869c36ruk: Add component registration for embedding linker plugin
mart-r Feb 10, 2026
843f6b0
CU-869c36ruk: Remove embedding linker registration from core lib
mart-r Feb 10, 2026
e6460ab
CU-869c36ruk: Centralise plugin / project name
mart-r Feb 10, 2026
d357bd2
CU-869c36ruk: Centralise name again
mart-r Feb 10, 2026
fb21570
CU-869c36ruk: Standardise / fix license format in pyproject.toml
mart-r Feb 10, 2026
73ae1ed
CU-869c36ruk: Add missing dep for embedding linker
mart-r Feb 10, 2026
0d2fffb
CU-869c36ruk: Move embedding linker tests to new project
mart-r Feb 10, 2026
10075a9
CU-869c36ruk: Fix typo for lazy registration method name
mart-r Feb 10, 2026
4535954
CU-869c36ruk: Fix import paths for tests
mart-r Feb 10, 2026
34da02a
CU-869c36ruk: Add helper module for tests
mart-r Feb 10, 2026
4176cb0
CU-869c36ruk: Use correct (local) imports for embedding linker config
mart-r Feb 10, 2026
1418fbc
CU-869c36ruk: Use correct (local) import within tests; add a simple i…
mart-r Feb 10, 2026
f151c3e
CU-869c36ruk: Remove non-existant core lib import of config
mart-r Feb 10, 2026
cf97c33
CU-869c36ruk: Make sure the core lib is marked as typed
mart-r Feb 10, 2026
9e1adc5
CU-869c36ruk: Rename tag (embedding instead of embed)
mart-r Feb 10, 2026
7a28a7b
CU-869c36ruk: Rename embedding linker folder
mart-r Feb 10, 2026
e6aa16f
CU-869c36ruk: Add initial workflows for medcat-embedding-linker
mart-r Feb 10, 2026
de28b5f
CU-869c36ruk: Fix issue with component registartion (NER/linker)
mart-r Feb 10, 2026
abeacaf
CU-869c36ruk: Fix linker name in tests
mart-r Feb 10, 2026
f888575
CU-869c36ruk: Unify component naming
mart-r Feb 10, 2026
f99ccb3
CU-869c36ruk: Fix issue with test PyPI push
mart-r Feb 10, 2026
f0ff04c
CU-869c36ruk: Fix workflow typo
mart-r Feb 10, 2026
a08b37c
CU-869c36ruk: Bump medcat dependency to 2.5 (for lazy registration)
mart-r Feb 10, 2026
0076416
CU-869c36ruk: Update plugin catalog with new entry for medcat-embeddi…
mart-r Feb 11, 2026
52c2240
CU-869c36ruk: Remove license section from README
mart-r Feb 11, 2026
2cec2d5
CU-869c25ux2: Rename workflow
mart-r Feb 11, 2026
bc5a1e7
CU-869c25ux2: Moved publishing to the joint workflow
mart-r Feb 11, 2026
4211afe
CU-869c25ux2: Fix typo in release workflow job
mart-r Feb 11, 2026
4186c8d
CU-869c36ruk: Move workflow to uv
mart-r Feb 11, 2026
49e9acf
CU-869c36ruk: Remove unnecessary step
mart-r Feb 11, 2026
109 changes: 109 additions & 0 deletions .github/workflows/medcat-embedding-linker_ci.yml
Member

@alhendrickson - recently merged all the trainer workflows into one with final pushes of images / libs etc. predicated on a tagged build. Saves the extra workflow? We should likely do this for all the dirs?

https://github.com/CogStack/cogstack-nlp/blob/main/.github/workflows/medcat-trainer_ci.yml

Collaborator Author

I've done that now

@@ -0,0 +1,109 @@
name: medcat-embedding-linker - CI (test | publish)

on:
  push:
    branches: [ main ]
    tags:
      - 'medcat-embedding-linker/v*.*.*'
  pull_request:
    paths:
      - 'medcat-embedding-linker/**'
      - '.github/workflows/medcat-embedding-linker**'

permissions:
  id-token: write

defaults:
  run:
    working-directory: ./medcat-plugins/embedding-linker

jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: [ '3.10', '3.11', '3.12' ]
      max-parallel: 4
    steps:
      - uses: actions/checkout@v6
      - name: Install uv for Python ${{ matrix.python-version }}
        uses: astral-sh/setup-uv@v7
        with:
          python-version: ${{ matrix.python-version }}
          enable-cache: true
      - name: Install the project
        run: |
          uv sync --all-extras --dev
          uv run python -m ensurepip
          uv run python -m pip install --upgrade pip
      - name: Check types
        run: |
          uv run python -m mypy --follow-imports=normal src/medcat_embedding_linker
      - name: Ruff linting
        run: |
          uv run ruff check src/medcat_embedding_linker --preview
      - name: Test
        run: |
          uv run python -m unittest discover

  publish-to-test-PyPI:
    runs-on: ubuntu-latest
    needs: build
    steps:
      - name: Checkout main
        uses: actions/checkout@v6
        with:
          fetch-depth: 0 # fetch all history
          fetch-tags: true # fetch tags explicitly

      - name: Install uv for Python 3.10
        uses: astral-sh/setup-uv@v7
        with:
          python-version: '3.10'
          enable-cache: true

      - name: Install dependencies
        run: |
          uv run python -m ensurepip

      - name: Set timestamp-based dev version
        run: |
          TS=$(date -u +"%Y%m%d%H%M%S")
          echo "SETUPTOOLS_SCM_PRETEND_VERSION_FOR_MEDCAT_EMBEDDING_LINKER=0.2.2.dev${TS}" >> $GITHUB_ENV

      - name: Build package
        run: |
          uv build

      - name: Publish distribution to TestPyPI
        uses: pypa/gh-action-pypi-publish@release/v1
        with:
          repository_url: https://test.pypi.org/legacy/
          packages_dir: medcat-plugins/embedding-linker/dist

  publish-to-PyPI:
    runs-on: ubuntu-latest
    if: startsWith(github.ref, 'refs/tags/')
    needs: build
    steps:
      - name: Checkout main
        uses: actions/checkout@v6

      - name: Install uv for Python 3.10
        uses: astral-sh/setup-uv@v7
        with:
          python-version: '3.10'
          enable-cache: true

      - name: Install dependencies
        run: |
          uv run python -m ensurepip

      - name: Build client package
        run: |
          uv build

      - name: Publish production distribution to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1
        with:
          packages_dir: medcat-plugins/embedding-linker/dist
192 changes: 192 additions & 0 deletions medcat-plugins/embedding-linker/README.md
@@ -0,0 +1,192 @@
# MedCAT Embedding Linker

A MedCAT plugin that provides an embedding-based entity linking component using transformer models from HuggingFace.

## Overview

This plugin replaces MedCAT's default linking component with a transformer-based approach that uses semantic similarity between entity contexts and concept embeddings to perform entity disambiguation.

**Key features:**
- Semantic similarity-based linking using transformer embeddings
- Support for any HuggingFace sentence-transformer model
- Efficient batch processing with GPU acceleration
- Configurable similarity thresholds and context windows
- CUI-based filtering (include/exclude lists)

## Requirements

- **MedCAT**: 2.5+, needed for lazy component registration ([PyPI](https://pypi.org/project/medcat/) | [GitHub](https://github.com/CogStack/MedCAT))
- Python 3.10+
- PyTorch
- Transformers

## Installation

```bash
pip install medcat-embedding-linker
```

## Quick Start

```python
from medcat.cat import CAT
from medcat.config import Config
from medcat.components.types import CoreComponentType

from medcat_embedding_linker import EmbeddingLinking

# Load your MedCAT model
cat = CAT.load_model_pack("path/to/model_pack")

# Configure the embedding linker
cat.config.components.linking = EmbeddingLinking()
cat.config.components.linking.embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"

# Recreate the pipeline to register the new linker
cat._recreate_pipe()

# Generate embeddings for your concept database
linker = cat.get_component(CoreComponentType.linking)
linker.create_embeddings()

# Use as normal
entities = cat.get_entities("Patient presents with chest pain and dyspnea.")
```

## How It Works

### Component Registration

The embedding linker automatically registers itself as `embedding_linker` when `EmbeddingLinking` config is detected. It implements MedCAT's `AbstractEntityProvidingComponent` interface and is lazily loaded when the pipeline is created.
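
As a rough illustration of what lazy registration means in general, the sketch below keeps a mapping from component names to loader callables and only runs a loader the first time the component is requested. It is not MedCAT's actual registry: `register_lazy`, `get_component_class`, and `DummyLinker` are hypothetical names used purely for illustration.

```python
from typing import Callable, Dict

# Hypothetical registry: component name -> zero-argument loader.
_LAZY_COMPONENTS: Dict[str, Callable[[], type]] = {}
# Cache of classes whose loader has already been run.
_LOADED: Dict[str, type] = {}


def register_lazy(name: str, loader: Callable[[], type]) -> None:
    """Record how to load a component class without loading it yet."""
    _LAZY_COMPONENTS[name] = loader


def get_component_class(name: str) -> type:
    """Resolve a registered component, running its loader at most once."""
    if name not in _LOADED:
        _LOADED[name] = _LAZY_COMPONENTS[name]()
    return _LOADED[name]


class DummyLinker:
    """Stand-in for the real linker class (illustration only)."""


load_calls = []  # tracks when the loader actually runs


def _load_dummy() -> type:
    # In a real plugin this is where the heavy import would happen.
    load_calls.append("embedding_linker")
    return DummyLinker


# Registration is cheap: nothing is loaded at this point.
register_lazy("embedding_linker", _load_dummy)
```

Calling `get_component_class("embedding_linker")` triggers the loader once; later calls return the cached class. The appeal of this pattern is that heavy dependencies such as PyTorch are only imported when the component is actually used.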

### Embedding Generation

The linker operates on two types of embeddings:

**1. Concept Embeddings** (pre-computed)
- Each CUI is represented by its longest name's embedding
- Stored in `cdb.addl_info["cui_embeddings"]`
- Used for final disambiguation between candidate CUIs

**2. Name Embeddings** (pre-computed)
- Each concept name in the CDB gets its own embedding
- Stored in `cdb.addl_info["name_embeddings"]`
- Used for initial candidate retrieval

Both are generated via `linker.create_embeddings()` and cached for inference.
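
The relationship between the two embedding sets can be sketched with a toy encoder. This is a minimal illustration under stated assumptions, not the plugin's implementation: `toy_embed` is a hypothetical character-hashing stand-in for the transformer model, and the two-concept `cui2names` mapping is illustrative data rather than a real CDB.

```python
import math


def toy_embed(text, dim=8):
    """Hypothetical stand-in for a transformer encoder: hashes characters
    into a fixed-size vector and L2-normalises it."""
    vec = [0.0] * dim
    for i, ch in enumerate(text.lower()):
        vec[(ord(ch) + i) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


# Illustrative miniature concept database: CUI -> list of names.
cui2names = {
    "C0008031": ["chest pain", "pain in the chest"],
    "C0013404": ["dyspnea", "shortness of breath"],
}

# Name embeddings: one vector per distinct concept name.
name_embeddings = {
    name: toy_embed(name)
    for names in cui2names.values()
    for name in names
}

# Concept embeddings: each CUI is represented by its longest name.
longest_name = {cui: max(names, key=len) for cui, names in cui2names.items()}
cui_embeddings = {cui: toy_embed(name) for cui, name in longest_name.items()}
```

Here `cui_embeddings["C0008031"]` is the same vector as `name_embeddings["pain in the chest"]`, because that is the longest name for the concept.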

### Inference Process

For each detected entity:

1. **Context Vector Calculation**: Extract a text snippet around the entity (size controlled by `context_window_size`) and embed it
2. **Candidate Retrieval**: Compare context embedding against all name embeddings to find top matches above `short_similarity_threshold`
3. **Disambiguation**: If multiple CUIs are associated with the best-matching name, compare against CUI embeddings to select the final concept
4. **Filtering**: Apply CUI include/exclude filters and check against `long_similarity_threshold`
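
The steps above can be sketched in plain Python with cosine similarity. This is a simplified illustration, not the plugin's code: the function names, the hand-written vectors, and the placeholder identifiers `CUI_A`, `CUI_B`, and `CUI_C` are all hypothetical. Two further assumptions are made here: when only one candidate remains, its name-match score is checked against the long threshold, and the include filter is applied before disambiguation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def link(context_vec, name_vecs, name2cuis, cui_vecs,
         short_threshold=0.3, long_threshold=0.5, allowed_cuis=None):
    """Return the linked CUI for a context vector, or None if nothing fits."""
    # Candidate retrieval: keep names scoring above the short threshold.
    scored = [(cosine(context_vec, vec), name)
              for name, vec in name_vecs.items()]
    scored = [(s, n) for s, n in scored if s >= short_threshold]
    if not scored:
        return None
    best_score, best_name = max(scored)

    # Filtering: drop candidates outside the optional include list.
    candidates = [c for c in name2cuis[best_name]
                  if allowed_cuis is None or c in allowed_cuis]
    if not candidates:
        return None

    # Disambiguation: only needed when several CUIs share the best name.
    if len(candidates) == 1:
        cui, score = candidates[0], best_score
    else:
        score, cui = max((cosine(context_vec, cui_vecs[c]), c)
                         for c in candidates)

    # Final check against the long threshold.
    return cui if score >= long_threshold else None


# Hypothetical toy data: "cold" is ambiguous between two concepts.
name_vecs = {"cold": [1.0, 0.0, 0.2], "fever": [0.0, 1.0, 0.0]}
name2cuis = {"cold": ["CUI_A", "CUI_B"], "fever": ["CUI_C"]}
cui_vecs = {
    "CUI_A": [1.0, 0.0, 0.0],
    "CUI_B": [0.6, 0.0, 0.8],
    "CUI_C": [0.0, 1.0, 0.0],
}

result = link([0.9, 0.0, 0.1], name_vecs, name2cuis, cui_vecs)
```

With this data, the context vector is closest to the name "cold", and the CUI embeddings then favour `CUI_A` over `CUI_B`. A context vector that matches no name above `short_threshold` yields `None`.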

## Configuration

### Key Parameters

```python
config.components.linking = EmbeddingLinking(
    # Model settings
    embedding_model_name="sentence-transformers/all-MiniLM-L6-v2",
    max_token_length=128,

    # Context settings
    context_window_size=10,  # tokens on each side of entity

    # Similarity thresholds
    short_similarity_threshold=0.3,  # for candidate retrieval
    long_similarity_threshold=0.5,  # for final linking

    # Batch sizes
    embedding_batch_size=4096,
    linking_batch_size=512,

    # Filtering
    filters=Filters(
        cuis={"C0018802", "C0011849"},  # include only these
        cuis_exclude={"C0000001"}  # or exclude these
    ),

    # Advanced options
    use_ner_link_candidates=True,
    always_calculate_similarity=False,
    filter_before_disamb=True,
    gpu_device="cuda:0"  # or None for auto-detect
)
```

### Embedding Models

Any HuggingFace model compatible with sentence transformers will work. Popular options:

- `sentence-transformers/all-MiniLM-L6-v2` (default, fast and lightweight)
- `sentence-transformers/all-mpnet-base-v2` (higher quality)
- `UFNLP/gatortron-medium` (biomedical domain)
- `microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext`

## Advanced Usage

### Re-generating Embeddings

If you modify your CDB or want to try a different model:

```python
linker = cat.get_component("embedding_linker")
linker.create_embeddings(
    embedding_model_name="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
    max_length=256
)
```

### GPU Configuration

```python
# Use specific GPU
cat.config.components.linking.gpu_device = "cuda:1"

# Force CPU
cat.config.components.linking.gpu_device = "cpu"
```

### Filtering

```python
# Include only specific CUIs
cat.config.components.linking.filters.cuis = {"C0011849", "C0018802"}

# Exclude specific CUIs
cat.config.components.linking.filters.cuis_exclude = {"C0000001"}

# Note: If both are set, only include filters are applied
```
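
The precedence rule in the note above (the include list wins when both are set) can be expressed as a small predicate. This is a sketch of the documented behaviour only; `passes_filters` is a hypothetical helper, not a function exported by the plugin.

```python
def passes_filters(cui, include=frozenset(), exclude=frozenset()):
    """Return True if a CUI survives filtering.

    If an include list is set, it alone decides the outcome and the
    exclude list is ignored; otherwise the exclude list is applied.
    """
    if include:
        return cui in include
    return cui not in exclude


# Both lists set: the include list takes precedence, so the CUI is kept.
kept = passes_filters(
    "C0011849", include={"C0011849"}, exclude={"C0011849"}
)
```

With empty include and exclude lists, every CUI passes.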

## Performance Considerations

- **First-time embedding generation**: Can take several minutes for large CDBs (millions of concepts)
- **GPU recommended**: 10-50x faster inference with CUDA
- **Batch sizes**: Increase if you have GPU memory available
- **Model selection**: Smaller models (e.g., MiniLM) are faster but may be less accurate than larger domain-specific models

## Limitations

- Does not support `prefer_frequent_concepts` or `prefer_primary_name` from the default linker (logs warnings if set)
- Training mode is not applicable (logs warning if enabled)
- Requires pre-computed embeddings before inference

## Citation

If you use this plugin, please cite MedCAT:

```bibtex
@article{medcat2021,
  title={Medical Concept Annotation Tool (MedCAT)},
  author={Kraljevic, Zeljko and others},
  journal={arXiv preprint arXiv:2010.01165},
  year={2021}
}
```