This repository contains the official code for COSMIC, a bidirectional generative model that links single-cell transcriptomics and nuclear morphology images. It includes data preprocessing pipelines, training scripts for both directions (seq2img and img2seq), and evaluation utilities.
COSMIC is a bidirectional generative framework that links single-cell morphology with gene expression. Built on a foundation model trained on over 21 million segmented nuclei and coupled to transcriptomic embeddings, COSMIC quantitatively decomposes transcriptional variance reflected in morphology and morphological variance explained by gene expression. Using a new IRIS-based multimodal dataset that captures high-resolution images and transcriptomes from the same cells, COSMIC accurately models cell type identity, continuous dynamics such as cell cycle progression, and treatment response in prostate cancer. This framework establishes a quantitative bridge between cellular form and gene expression, enabling mechanistic discovery and predictive modeling in basic and translational cell biology.
Estimated installation time: around 5 minutes.
- Clone the repository

  ```shell
  git clone https://github.com/mlbio-epfl/COSMIC.git
  cd COSMIC
  ```

- Create and activate a Python environment (conda, mamba, or venv)

  ```shell
  conda create -n cosmic python=3.9
  conda activate cosmic
  ```

- Install dependencies

  ```shell
  pip install imagen-pytorch
  pip install scanpy
  pip install scvi-tools
  pip install "lightly[timm]"
  pip install POT
  ```
Below is a high-level workflow. You can adapt paths, configs, and script names to your local setup.
- Download the datasets:
  - Processed single-cell RNA-seq (`IRIS_mouse.h5ad` and `IRIS_human.h5ad`).
  - Single-cell nuclear images (`mouse_IRIS_images.zip`, `human_IRIS_images.zip`).
- Place them under a common root: `./data/`
- Unzip the image archives. You will get images named `[cell_id].jpg` in the folders `./data/images_mouse` and `./data/images_human`.
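Given the layout above, pairing cells in the `.h5ad` with their image files reduces to matching `obs` names against `[cell_id].jpg` filenames. A minimal stdlib sketch (the helper name `pair_cells_with_images` is our own, not part of the repo):

```python
from pathlib import Path

def pair_cells_with_images(cell_ids, image_dir):
    # Map each cell id to its image path ([cell_id].jpg) and collect ids
    # that have no matching image on disk.
    image_dir = Path(image_dir)
    pairs, missing = {}, []
    for cid in cell_ids:
        path = image_dir / f"{cid}.jpg"
        if path.exists():
            pairs[cid] = path
        else:
            missing.append(cid)
    return pairs, missing

# In practice the ids would come from the AnnData object, e.g.
# adata.obs_names after scanpy.read_h5ad("./data/IRIS_mouse.h5ad").
```

Checking for missing images up front avoids silent mismatches later when building paired training batches.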
If you have paired RNA-seq and image data, you can evaluate COSMIC on your own dataset: follow the same data preparation steps and place your files in the same directory structure (i.e., the same paths) used in this repo.
You can train COSMIC and run inference in one or both directions:
- seq2img: generate nuclear images from gene expression.
- img2seq: predict gene expression from nuclear images.
2.1. Seq2img
2.1.1. Here, we use the mouse data as an example. First, extract features from the gene expression using scVI:
```shell
python ./seq2img/scVI_mouse.py
```

After running this, you will get `feature_mouse_scvi.pt` in `./seq2img/feature`. If you want to skip this step, you can download the features directly from here.
2.1.2. Then, you can train the diffusion model conditioned on the gene expression features:
```shell
python ./seq2img/train_mouse.py
```

During training, you can monitor intermediate results in `./seq2img/result/mouse`. Full training takes around 36 hours on a single GPU with 24 GB of memory. After training, you will have a checkpoint `seq2img_mouse.pt` in `./seq2img/ckpt`. If you want to skip this step, you can download the checkpoint directly from here.
2.1.3. Finally, you can run the inference to generate nuclear images:
```shell
python ./seq2img/inference_mouse.py
```

The generated images will be saved in the folder `./seq2img/inference/mouse`.
2.2. Img2seq
First, download the morphology foundation model checkpoint `ckpt_morphFM.pt` here and put it into `./ckpt`. Then, run

```shell
python ./img2seq/mouse.py
```

After running, you will get both the model checkpoint and the predicted genes in the file `img2seq_mouse.pt` in `./img2seq/inference`, stored under the keys `"model_ckpt"` and `"pred_genes"`, respectively.
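Since both outputs live in one `.pt` file, they can be read back with a single `torch.load`. A small round-trip sketch of the stated layout (dummy tensors and the file name `demo_img2seq.pt` are placeholders, not repo artifacts):

```python
import torch

# Dummy checkpoint mimicking the described layout of img2seq_mouse.pt.
ckpt = {
    "model_ckpt": {"linear.weight": torch.zeros(4, 8)},  # model state dict
    "pred_genes": torch.ones(2, 4),                      # predicted expression
}
torch.save(ckpt, "demo_img2seq.pt")

loaded = torch.load("demo_img2seq.pt", map_location="cpu")
state, pred = loaded["model_ckpt"], loaded["pred_genes"]
```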
Once models are trained, you can evaluate fidelity, diversity, and cross-modal consistency.
1. Evaluate generated nuclear images
First, we evaluate the generated images in our morphology FM embedding space, so let's compute the embeddings:

```shell
python ./seq2img/get_embedding.py --species mouse
```

After running this, you will get `feature_mouse_gen_morphFM.pt` in `./seq2img/feature`. If you want to skip this step, you can download the features directly from here. Please also download the features of the ground-truth images, `feature_mouse_gt_morphFM.pt`, from the same place.
Then, we can start evaluating:
- Coverage (COV), measuring how well the generated embeddings cover the real embeddings (the higher the better):

  ```shell
  python ./seq2img/eval_cov.py --species mouse
  ```

- Sliced Wasserstein Distance (SWD) between real and generated embeddings (the lower the better):

  ```shell
  python ./seq2img/eval_swd.py --species mouse
  ```

- k-Nearest-Neighbour Accuracy (k-NNA), assessing the overlap between real and generated distributions (the closer to 0.5 the better):

  ```shell
  python ./seq2img/eval_knna.py --species mouse
  ```

- Cell type classification on generated nuclear images (the higher the better):

  ```shell
  python ./seq2img/eval_ct_cls.py --species mouse
  ```

2. Evaluate generated gene expression
- Per-gene Pearson correlation between predicted and ground-truth expression across different cell types (the higher the better):

  ```shell
  python ./img2seq/eval_pcorr_acrossCT.py --species mouse
  ```

- Per-gene Pearson correlation between predicted and ground-truth expression within each cell type (the higher the better):

  ```shell
  python ./img2seq/eval_pcorr_withinCT.py --species mouse
  ```

- Cell type classification on generated gene expression (the higher the better):

  ```shell
  python ./img2seq/eval_ct_cls.py --species mouse
  ```
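For intuition, the distribution metrics and the per-gene correlation above can be sketched in plain NumPy. These are illustrative re-implementations under simplifying assumptions (Euclidean distances, equal sample sizes for SWD), not the repo's evaluation scripts:

```python
import numpy as np

def coverage(real, gen):
    # COV: match each generated embedding to its nearest real embedding;
    # report the fraction of real embeddings matched at least once.
    d = np.linalg.norm(gen[:, None, :] - real[None, :, :], axis=-1)
    return len(np.unique(d.argmin(axis=1))) / len(real)

def sliced_wasserstein(real, gen, n_proj=128, seed=0):
    # SWD: average 1-D Wasserstein-1 distance over random unit projections.
    # Assumes real and gen contain the same number of samples.
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=real.shape[1])
        theta /= np.linalg.norm(theta)
        total += np.abs(np.sort(real @ theta) - np.sort(gen @ theta)).mean()
    return total / n_proj

def knn_accuracy(real, gen, k=5):
    # k-NNA: leave-one-out k-NN real-vs-generated classification accuracy;
    # ~0.5 means the two distributions are indistinguishable.
    x = np.vstack([real, gen])
    y = np.array([0] * len(real) + [1] * len(gen))
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]
    pred = (y[nn].mean(axis=1) > 0.5).astype(int)
    return float((pred == y).mean())

def per_gene_pearson(pred, true):
    # Pearson correlation per gene across cells; arrays are (cells, genes).
    p = pred - pred.mean(axis=0)
    t = true - true.mean(axis=0)
    den = np.sqrt((p ** 2).sum(axis=0) * (t ** 2).sum(axis=0))
    return (p * t).sum(axis=0) / np.maximum(den, 1e-12)
```

The "within each cell type" variant applies `per_gene_pearson` to the subset of cells of each type and averages the resulting per-gene vectors.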