ConsistCompose is a novel unified multimodal framework designed for layout-controllable multi-instance image composition. It addresses a critical gap in existing multimodal models—while most systems excel at visual grounding (aligning language with image regions), they lack precise control over spatial layout in generative tasks.
Built upon the unified understanding and generation architecture of BAGEL and enhanced by SenseNova-SI's spatial intelligence, ConsistCompose introduces Linguistic-Embedded Layout-Grounded Generation (LELG). This paradigm embeds layout coordinates directly into language prompts as textual tokens, eliminating the need for specialized spatial encoders or task-specific branches. To enable large-scale training, we construct ConsistCompose3M (3.4M samples), a high-quality dataset with layout and identity annotations that provides structured spatial-semantic supervision.
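As a concrete illustration of LELG's coordinate serialization, the sketch below builds a layout-grounded prompt from (description, box) pairs and parses the boxes back out. The `<bbox>[x1, y1, x2, y2]</bbox>` token format follows the repository's example prompts; the helper functions themselves are hypothetical illustrations, not part of the official API.

```python
import re

# Illustrative sketch of LELG-style coordinate serialization (helper names
# are our own, not the official API). Boxes are normalized [x1, y1, x2, y2]
# corners embedded as plain text tokens inside the prompt.

def bbox_token(box):
    """Serialize a normalized box as an inline <bbox>...</bbox> token."""
    return "<bbox>[{:.3f}, {:.3f}, {:.3f}, {:.3f}]</bbox>".format(*box)

def build_layout_prompt(entities):
    """Join (description, box) pairs into one layout-grounded prompt."""
    return ", ".join(f"{desc} {bbox_token(box)}" for desc, box in entities)

def extract_boxes(prompt):
    """Recover every embedded box from a layout-grounded prompt."""
    return [tuple(float(v) for v in m.split(","))
            for m in re.findall(r"<bbox>\[([^\]]+)\]</bbox>", prompt)]

prompt = build_layout_prompt([
    ("a powerful dragon", (0.380, 0.086, 0.768, 0.673)),
    ("a brave man", (0.155, 0.231, 0.439, 0.717)),
])
# Round-trip check: the layout survives serialization into plain text.
assert extract_boxes(prompt) == [(0.380, 0.086, 0.768, 0.673),
                                 (0.155, 0.231, 0.439, 0.717)]
```

Because the layout lives entirely in the text stream, any tokenizer that handles ordinary prompts can carry it — this is what removes the need for a separate spatial encoder.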
ConsistCompose achieves state-of-the-art performance on layout control benchmarks while preserving strong general multimodal capabilities, establishing a principled solution for precise, flexible image composition.
- [2026-02-27] Official release of the ConsistCompose code repository, ConsistCompose-BAGEL-7B-MoT model, and ConsistCompose3M dataset on Hugging Face
- [2026-02-22] Our work is accepted to CVPR 2026
- [2025-11-23] Initial submission of our paper to arXiv (2511.18333)
| Methods | Instance Success Ratio (Avg) | Image Success Ratio (Avg) | mIoU | AP | AP50 | AP75 |
|---|---|---|---|---|---|---|
| GLIGEN | 82.6% | 52.1% | 69.0 | 40.5 | 75.9 | 39.1 |
| InstanceDiffusion | 87.8% | 65.5% | 78.1 | 57.2 | 83.6 | 65.5 |
| MIGC++ | 86.8% | 63.4% | 74.9 | 48.3 | 79.2 | 52.6 |
| CreatiLayout | 74.0% | 42.5% | 64.9 | 32.4 | 61.1 | 31.6 |
| PlanGen | 82.5% | 50.3% | 66.2 | 31.9 | 74.0 | 21.5 |
| Ours (ConsistCompose) | 92.6% | 76.1% | 85.3 | 70.9 | 89.1 | 76.9 |
+7.2 mIoU and +13.7 AP (absolute) over the strongest prior baseline, InstanceDiffusion
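The mIoU column measures agreement between the instance boxes detected in the generated image and the requested layout. A minimal IoU sketch (our own reimplementation for illustration, assuming normalized `[x1, y1, x2, y2]` boxes — not the benchmark's official evaluation code):

```python
# Minimal box-IoU sketch, assuming normalized [x1, y1, x2, y2] corner boxes.
# This is an illustrative reimplementation, not the benchmark's code.

def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])  # intersection top-left
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])  # intersection bottom-right
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(box_iou((0.1, 0.1, 0.5, 0.5), (0.2, 0.2, 0.6, 0.6)))  # ≈ 0.39
```

mIoU then averages this score over all requested instances; AP additionally thresholds it (e.g. 0.5 for AP50, 0.75 for AP75) in the usual detection-style protocol.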
| Methods | MS-Bench | | | | MS-Bench-Random | | | |
|---|---|---|---|---|---|---|---|---|
| | CLIP-T | DINO | mIoU | AP | CLIP-T | DINO | mIoU | AP |
| GLIGEN | 0.309 | 0.454 | 0.868 | 0.751 | 0.312 | 0.431 | 0.858 | 0.722 |
| MS-Diffusion | 0.336 | 0.555 | 0.466 | 0.108 | 0.334 | 0.544 | 0.464 | 0.105 |
| MUSE | 0.320 | 0.619 | 0.698 | 0.352 | 0.321 | 0.607 | 0.673 | 0.303 |
| Ours | 0.333 | 0.660 | 0.889 | 0.789 | 0.334 | 0.630 | 0.878 | 0.756 |
| Model | MMBench | MMMU | GenEval | GEdit |
|---|---|---|---|---|
| Bagel Base | 81.4 | 46.4 | 0.86 | 6.68 |
| Ours (w/o Coord) | 81.5 | 39.4 | 0.88 | 6.23 |
| Ours (w/ Coord) | 81.4 | 42.3 | 0.88 | 6.31 |
| Method | Single | | | Multi | | |
|---|---|---|---|---|---|---|
| | DINO | CLIP-I | CLIP-T | DINO | CLIP-I | CLIP-T |
| UNO | 0.661 | 0.796 | 0.304 | 0.491 | 0.715 | 0.323 |
| OmniGen | 0.554 | 0.746 | 0.322 | 0.441 | 0.692 | 0.341 |
| OmniGen2 | 0.671 | 0.791 | 0.312 | 0.459 | 0.698 | 0.333 |
| Ours | 0.677 | 0.792 | 0.314 | 0.506 | 0.703 | 0.335 |
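The DINO, CLIP-I, and CLIP-T scores above are embedding cosine similarities (image-to-image for DINO and CLIP-I, image-to-text for CLIP-T). A generic sketch on precomputed feature vectors — the official feature extractors and preprocessing are not reproduced here:

```python
import math

# Generic cosine-similarity sketch of how DINO / CLIP-style fidelity scores
# are computed from precomputed embeddings; the benchmarks' official
# extractors and preprocessing are not reproduced here.

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

In practice each score averages this similarity over the benchmark's image (or image-text) pairs, so higher values indicate better subject or prompt fidelity.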
```bash
# Clone repository
git clone git@github.com:OpenSenseNova/ConsistCompose.git
cd ConsistCompose/

# Create and activate the environment
conda create -n cc python=3.10 -y
conda activate cc

# Install dependencies
pip install -r requirements.txt
pip install flash_attn==2.5.8 --no-build-isolation
```

This example performs layout-grounded text-to-image generation by embedding normalized bounding-box coordinates directly into the prompt, enabling precise spatial control over the position and scale of each object in the output image.
```bash
python example_text2image.py \
    --prompt "In a dimly lit cavern, a powerful dragon <bbox>[0.380, 0.086, 0.768, 0.673]</bbox> stands majestically, its textured scales glistening in the flickering firelight. Beside it, a brave man <bbox>[0.155, 0.231, 0.439, 0.717]</bbox> clad in armor, stands poised with determination, his hand gripping the hilt of a gleaming sword <bbox>[0.166, 0.401, 0.577, 0.663]</bbox> that reflects the dancing flames. The air is tense with anticipation as sparks rise from the crackling fire, illuminating the rocky surroundings and casting intricate shadows on the cavern walls. This scene paints a vivid picture of medieval fantasy, capturing a moment that is both dramatic and full of potential action." \
    --mode layout_t2i \
    --model_path sensenova/ConsistCompose-BAGEL-7B-MoT
```

This example supports identity-preserving image composition with multiple references, where the model maintains the visual characteristics of subjects from reference images while arranging them according to the given layout constraints.
```bash
python example_subject_driven.py \
    --model_path sensenova/ConsistCompose-BAGEL-7B-MoT \
    --jsonl_path examples/layout_subject_driven.jsonl \
    --mode layout_subject_driven
```

Load ConsistCompose3M via Hugging Face Datasets:
```python
from datasets import load_dataset

# Load the full dataset (webdataset format)
dataset = load_dataset("sensenova/ConsistCompose3M", split="train")

# Load a task-specific subset (e.g., layout-aware text-to-image)
t2i_dataset = load_dataset("sensenova/ConsistCompose3M", data_files="jsonl_extended/layout_t2i/*.jsonl")
```

If you use ConsistCompose, ConsistCompose3M, or related resources in your research, please cite:
```bibtex
@article{shi2025consistcompose,
  title={ConsistCompose: Unified Multimodal Layout Control for Image Composition},
  author={Shi, Xuanke and Li, Boxuan and Han, Xiaoyang and Cai, Zhongang and Yang, Lei and Lin, Dahua and Wang, Quan},
  journal={arXiv preprint arXiv:2511.18333},
  year={2025}
}
```

- The ConsistCompose framework and models are released under the Apache-2.0 License.
- The ConsistCompose3M dataset is licensed under Apache-2.0, with additional terms for derived data (see dataset README for details).
