This repository contains the official code for COSMIC, a bidirectional generative model that links single-cell transcriptomics and nuclear morphology images. It includes data preprocessing pipelines, training scripts for both directions (seq2img and img2seq), and evaluation utilities.
COSMIC is a bidirectional generative framework that links single-cell morphology with gene expression. Built on a foundation model trained on over 21 million segmented nuclei and coupled to transcriptomic embeddings, COSMIC quantitatively decomposes transcriptional variance reflected in morphology and morphological variance explained by gene expression. Using a new IRIS-based multimodal dataset that captures high-resolution images and transcriptomes from the same cells, COSMIC accurately models cell type identity, continuous dynamics such as cell cycle progression, and treatment response in prostate cancer. This framework establishes a quantitative bridge between cellular form and gene expression, enabling mechanistic discovery and predictive modeling in basic and translational cell biology.
Estimated installation time: around 5 minutes.
- Clone the repository

  ```shell
  git clone https://github.com/mlbio-epfl/COSMIC.git
  cd COSMIC
  ```

- Create and activate a Python environment (conda, mamba, or venv)

  ```shell
  conda create -n cosmic python=3.9
  conda activate cosmic
  ```

- Install dependencies

  ```shell
  pip install imagen-pytorch
  pip install scanpy
  pip install scvi-tools
  pip install "lightly[timm]"
  pip install POT
  ```
Below is a high-level workflow. You can adapt paths, configs, and script names to your local setup.
- Download the datasets:
  - Processed single-cell RNA-seq (`IRIS_mouse.h5ad` and `IRIS_human.h5ad`).
  - Single-cell nuclear images (`mouse_IRIS_images.zip`, `human_IRIS_images.zip`).
- Place them under a common root: `./data/`
- Unzip the image archives. You will get images named `[cell_id].jpg` in the folders `./data/images_mouse` and `./data/images_human`.
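Given the layout above, pairing cells in the `.h5ad` with their image files reduces to matching `obs` names against `[cell_id].jpg` filenames. A minimal stdlib sketch (the helper name `pair_cells_with_images` is our own, not part of the repo):

```python
from pathlib import Path

def pair_cells_with_images(cell_ids, image_dir):
    # Map each cell id to its image path ([cell_id].jpg) and collect ids
    # that have no matching image on disk.
    image_dir = Path(image_dir)
    pairs, missing = {}, []
    for cid in cell_ids:
        path = image_dir / f"{cid}.jpg"
        if path.exists():
            pairs[cid] = path
        else:
            missing.append(cid)
    return pairs, missing

# In practice the ids would come from the AnnData object, e.g.
# adata.obs_names after scanpy.read_h5ad("./data/IRIS_mouse.h5ad").
```

Checking for missing images up front avoids silent mismatches later when building paired training batches.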
If you have paired RNA-seq and image data, you can evaluate COSMIC on your own dataset: follow the same data preparation steps and place your files in the same directory structure (i.e., the same paths) used in this repo.
You can train COSMIC and run inference in one or both directions:
- seq2img: generate nuclear images from gene expression.
- img2seq: predict gene expression from nuclear images.
2.1. Seq2img
2.1.1. Here, we use the mouse data as an example. First, extract features from the gene expression using scVI:
```shell
python ./seq2img/scVI_mouse.py
```

After running this, you will get `feature_mouse_scvi.pt` in `./seq2img/feature`. If you want to skip this step, you can download the features directly from here.
2.1.2. Then, you can train the diffusion model conditioned on the gene expression features:
```shell
python ./seq2img/train_mouse.py
```

During training, you can monitor intermediate results in `./seq2img/result/mouse`. Full training takes around 36 hours on a single GPU with 24 GB of memory. After training, you will have a checkpoint `seq2img_mouse.pt` in `./seq2img/ckpt`. If you want to skip this step, you can download the checkpoint directly from here.
2.1.3. Finally, you can run the inference to generate nuclear images:
```shell
python ./seq2img/inference_mouse.py
```

The generated images will be saved in the folder `./seq2img/inference/mouse`.
2.2. Img2seq
First, download the morphology foundation model checkpoint `ckpt_morphFM.pt` here and put it into `./ckpt`. Then, run

```shell
python ./img2seq/mouse.py
```

After running, you will get both the model checkpoint and the predicted genes in the file `img2seq_mouse.pt` in `./img2seq/inference`, stored under the keys `"model_ckpt"` and `"pred_genes"`, respectively.
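Since both outputs live in one `.pt` file, they can be read back with a single `torch.load`. A small round-trip sketch of the stated layout (dummy tensors and the file name `demo_img2seq.pt` are placeholders, not repo artifacts):

```python
import torch

# Dummy checkpoint mimicking the described layout of img2seq_mouse.pt.
ckpt = {
    "model_ckpt": {"linear.weight": torch.zeros(4, 8)},  # model state dict
    "pred_genes": torch.ones(2, 4),                      # predicted expression
}
torch.save(ckpt, "demo_img2seq.pt")

loaded = torch.load("demo_img2seq.pt", map_location="cpu")
state, pred = loaded["model_ckpt"], loaded["pred_genes"]
```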
Once models are trained, you can evaluate fidelity, diversity, and cross-modal consistency.
1. Evaluate generated nuclear images
First, we evaluate the generated images in our morphology FM embedding space, so let's compute the embeddings:

```shell
python ./seq2img/get_embedding.py --species mouse
```

After running this, you will get `feature_mouse_gen_morphFM.pt` in `./seq2img/feature`. If you want to skip this step, you can download the features directly from here. Please also download the features of the ground-truth images, `feature_mouse_gt_morphFM.pt`, from the same place.
Then, we can start evaluating:
- Coverage (COV), measuring how well the generated embeddings cover the real embeddings (the higher the better):

  ```shell
  python ./seq2img/eval_cov.py --species mouse
  ```

- Sliced Wasserstein Distance (SWD) between real and generated embeddings (the lower the better):

  ```shell
  python ./seq2img/eval_swd.py --species mouse
  ```

- k-Nearest-Neighbour Accuracy (k-NNA), assessing the overlap between real and generated distributions (the closer to 0.5 the better):

  ```shell
  python ./seq2img/eval_knna.py --species mouse
  ```

- Cell type classification on generated nuclear images (the higher the better):

  ```shell
  python ./seq2img/eval_ct_cls.py --species mouse
  ```

2. Evaluate generated gene expression
- Per-gene Pearson correlation between predicted and ground-truth expression across different cell types (the higher the better):

  ```shell
  python ./img2seq/eval_pcorr_acrossCT.py --species mouse
  ```

- Per-gene Pearson correlation between predicted and ground-truth expression within each cell type (the higher the better):

  ```shell
  python ./img2seq/eval_pcorr_withinCT.py --species mouse
  ```

- Cell type classification on generated gene expression (the higher the better):

  ```shell
  python ./img2seq/eval_ct_cls.py --species mouse
  ```
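For intuition, the distribution metrics and the per-gene correlation above can be sketched in plain NumPy. These are illustrative re-implementations under simplifying assumptions (Euclidean distances, equal sample sizes for SWD), not the repo's evaluation scripts:

```python
import numpy as np

def coverage(real, gen):
    # COV: match each generated embedding to its nearest real embedding;
    # report the fraction of real embeddings matched at least once.
    d = np.linalg.norm(gen[:, None, :] - real[None, :, :], axis=-1)
    return len(np.unique(d.argmin(axis=1))) / len(real)

def sliced_wasserstein(real, gen, n_proj=128, seed=0):
    # SWD: average 1-D Wasserstein-1 distance over random unit projections.
    # Assumes real and gen contain the same number of samples.
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=real.shape[1])
        theta /= np.linalg.norm(theta)
        total += np.abs(np.sort(real @ theta) - np.sort(gen @ theta)).mean()
    return total / n_proj

def knn_accuracy(real, gen, k=5):
    # k-NNA: leave-one-out k-NN real-vs-generated classification accuracy;
    # ~0.5 means the two distributions are indistinguishable.
    x = np.vstack([real, gen])
    y = np.array([0] * len(real) + [1] * len(gen))
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]
    pred = (y[nn].mean(axis=1) > 0.5).astype(int)
    return float((pred == y).mean())

def per_gene_pearson(pred, true):
    # Pearson correlation per gene across cells; arrays are (cells, genes).
    p = pred - pred.mean(axis=0)
    t = true - true.mean(axis=0)
    den = np.sqrt((p ** 2).sum(axis=0) * (t ** 2).sum(axis=0))
    return (p * t).sum(axis=0) / np.maximum(den, 1e-12)
```

The "within each cell type" variant applies `per_gene_pearson` to the subset of cells of each type and averages the resulting per-gene vectors.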