ConsistCompose is a novel unified multimodal framework designed for layout-controllable multi-instance image composition. It addresses a critical gap in existing multimodal models—while most systems excel at visual grounding (aligning language with image regions), they lack precise control over spatial layout in generative tasks.
Built upon the unified understanding and generation architecture of BAGEL and enhanced by SenseNova-SI's spatial intelligence, ConsistCompose introduces Linguistic-Embedded Layout-Grounded Generation (LELG). This paradigm embeds layout coordinates directly into language prompts as textual tokens, eliminating the need for specialized spatial encoders or task-specific branches. To enable large-scale training, we construct ConsistCompose3M (3.4M samples), a high-quality dataset with layout and identity annotations that provides structured spatial-semantic supervision.
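As a concrete illustration of LELG's coordinate serialization, the sketch below builds a layout-grounded prompt from (description, box) pairs and parses the boxes back out. The `<bbox>[x1, y1, x2, y2]</bbox>` token format follows the repository's example prompts; the helper functions themselves are hypothetical illustrations, not part of the official API.

```python
import re

# Illustrative sketch of LELG-style coordinate serialization (helper names
# are our own, not the official API). Boxes are normalized [x1, y1, x2, y2]
# corners embedded as plain text tokens inside the prompt.

def bbox_token(box):
    """Serialize a normalized box as an inline <bbox>...</bbox> token."""
    return "<bbox>[{:.3f}, {:.3f}, {:.3f}, {:.3f}]</bbox>".format(*box)

def build_layout_prompt(entities):
    """Join (description, box) pairs into one layout-grounded prompt."""
    return ", ".join(f"{desc} {bbox_token(box)}" for desc, box in entities)

def extract_boxes(prompt):
    """Recover every embedded box from a layout-grounded prompt."""
    return [tuple(float(v) for v in m.split(","))
            for m in re.findall(r"<bbox>\[([^\]]+)\]</bbox>", prompt)]

prompt = build_layout_prompt([
    ("a powerful dragon", (0.380, 0.086, 0.768, 0.673)),
    ("a brave man", (0.155, 0.231, 0.439, 0.717)),
])
# Round-trip check: the layout survives serialization into plain text.
assert extract_boxes(prompt) == [(0.380, 0.086, 0.768, 0.673),
                                 (0.155, 0.231, 0.439, 0.717)]
```

Because the layout lives entirely in the text stream, any tokenizer that handles ordinary prompts can carry it — this is what removes the need for a separate spatial encoder.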
ConsistCompose achieves state-of-the-art performance on layout control benchmarks while preserving strong general multimodal capabilities, establishing a principled solution for precise, flexible image composition.
- [2026-02-27] Official release of the ConsistCompose code repository, ConsistCompose-BAGEL-7B-MoT model, and ConsistCompose3M dataset on Hugging Face
- [2026-02-22] Our work is accepted to CVPR 2026
- [2025-11-23] Initial submission of our paper to arXiv (2511.18333)
| Methods | Instance Success Ratio (Avg) | Image Success Ratio (Avg) | mIoU | AP | AP50 | AP75 |
|---|---|---|---|---|---|---|
| GLIGEN | 82.6% | 52.1% | 69.0 | 40.5 | 75.9 | 39.1 |
| InstanceDiffusion | 87.8% | 65.5% | 78.1 | 57.2 | 83.6 | 65.5 |
| MIGC++ | 86.8% | 63.4% | 74.9 | 48.3 | 79.2 | 52.6 |
| CreatiLayout | 74.0% | 42.5% | 64.9 | 32.4 | 61.1 | 31.6 |
| PlanGen | 82.5% | 50.3% | 66.2 | 31.9 | 74.0 | 21.5 |
| Ours (ConsistCompose) | 92.6% | 76.1% | 85.3 | 70.9 | 89.1 | 76.9 |
+7.2 mIoU and +13.7 AP (absolute) over the strongest prior baseline, InstanceDiffusion
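The mIoU column measures agreement between the instance boxes detected in the generated image and the requested layout. A minimal IoU sketch (our own reimplementation for illustration, assuming normalized `[x1, y1, x2, y2]` boxes — not the benchmark's official evaluation code):

```python
# Minimal box-IoU sketch, assuming normalized [x1, y1, x2, y2] corner boxes.
# This is an illustrative reimplementation, not the benchmark's code.

def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])  # intersection top-left
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])  # intersection bottom-right
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(box_iou((0.1, 0.1, 0.5, 0.5), (0.2, 0.2, 0.6, 0.6)))  # ≈ 0.39
```

mIoU then averages this score over all requested instances; AP additionally thresholds it (e.g. 0.5 for AP50, 0.75 for AP75) in the usual detection-style protocol.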
| Methods | MS-Bench | | | | MS-Bench-Random | | | |
|---|---|---|---|---|---|---|---|---|
| | CLIP-T | DINO | mIoU | AP | CLIP-T | DINO | mIoU | AP |
| GLIGEN | 0.309 | 0.454 | 0.868 | 0.751 | 0.312 | 0.431 | 0.858 | 0.722 |
| MS-Diffusion | 0.336 | 0.555 | 0.466 | 0.108 | 0.334 | 0.544 | 0.464 | 0.105 |
| MUSE | 0.320 | 0.619 | 0.698 | 0.352 | 0.321 | 0.607 | 0.673 | 0.303 |
| Ours | 0.333 | 0.660 | 0.889 | 0.789 | 0.334 | 0.630 | 0.878 | 0.756 |
| Model | MMBench | MMMU | GenEval | GEdit |
|---|---|---|---|---|
| Bagel Base | 81.4 | 46.4 | 0.86 | 6.68 |
| Ours (w/o Coord) | 81.5 | 39.4 | 0.88 | 6.23 |
| Ours (w/ Coord) | 81.4 | 42.3 | 0.88 | 6.31 |
| Method | Single | | | Multi | | |
|---|---|---|---|---|---|---|
| | DINO | CLIP-I | CLIP-T | DINO | CLIP-I | CLIP-T |
| UNO | 0.661 | 0.796 | 0.304 | 0.491 | 0.715 | 0.323 |
| OmniGen | 0.554 | 0.746 | 0.322 | 0.441 | 0.692 | 0.341 |
| OmniGen2 | 0.671 | 0.791 | 0.312 | 0.459 | 0.698 | 0.333 |
| Ours | 0.677 | 0.792 | 0.314 | 0.506 | 0.703 | 0.335 |
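The DINO, CLIP-I, and CLIP-T scores above are embedding cosine similarities (image-to-image for DINO and CLIP-I, image-to-text for CLIP-T). A generic sketch on precomputed feature vectors — the official feature extractors and preprocessing are not reproduced here:

```python
import math

# Generic cosine-similarity sketch of how DINO / CLIP-style fidelity scores
# are computed from precomputed embeddings; the benchmarks' official
# extractors and preprocessing are not reproduced here.

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

In practice each score averages this similarity over the benchmark's image (or image-text) pairs, so higher values indicate better subject or prompt fidelity.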
```bash
# Clone repository
git clone git@github.com:OpenSenseNova/ConsistCompose.git
cd ConsistCompose/

# Create and activate the environment
conda create -n cc python=3.10 -y
conda activate cc

# Install dependencies
pip install -r requirements.txt
pip install flash_attn==2.5.8 --no-build-isolation
```

This example performs layout-grounded text-to-image generation by embedding normalized bounding-box coordinates directly into the prompt, enabling precise spatial control over the position and scale of each object in the output image.
```bash
python example_text2image.py \
    --prompt "In a dimly lit cavern, a powerful dragon <bbox>[0.380, 0.086, 0.768, 0.673]</bbox> stands majestically, its textured scales glistening in the flickering firelight. Beside it, a brave man <bbox>[0.155, 0.231, 0.439, 0.717]</bbox> clad in armor, stands poised with determination, his hand gripping the hilt of a gleaming sword <bbox>[0.166, 0.401, 0.577, 0.663]</bbox> that reflects the dancing flames. The air is tense with anticipation as sparks rise from the crackling fire, illuminating the rocky surroundings and casting intricate shadows on the cavern walls. This scene paints a vivid picture of medieval fantasy, capturing a moment that is both dramatic and full of potential action." \
    --mode layout_t2i \
    --model_path sensenova/ConsistCompose-BAGEL-7B-MoT
```

This example supports identity-preserving image composition with multiple references, where the model maintains the visual characteristics of subjects from reference images while arranging them according to the given layout constraints.
```bash
python example_subject_driven.py \
    --model_path sensenova/ConsistCompose-BAGEL-7B-MoT \
    --jsonl_path examples/layout_subject_driven.jsonl \
    --mode layout_subject_driven
```

Load ConsistCompose3M via Hugging Face Datasets:
```python
from datasets import load_dataset

# Load the full dataset (webdataset format)
dataset = load_dataset("sensenova/ConsistCompose3M", split="train")

# Load a task-specific subset (e.g., layout-aware text-to-image)
t2i_dataset = load_dataset("sensenova/ConsistCompose3M", data_files="jsonl_extended/layout_t2i/*.jsonl")
```

If you use ConsistCompose, ConsistCompose3M, or related resources in your research, please cite:
```bibtex
@article{shi2025consistcompose,
  title={ConsistCompose: Unified Multimodal Layout Control for Image Composition},
  author={Shi, Xuanke and Li, Boxuan and Han, Xiaoyang and Cai, Zhongang and Yang, Lei and Lin, Dahua and Wang, Quan},
  journal={arXiv preprint arXiv:2511.18333},
  year={2025}
}
```

- The ConsistCompose framework and models are released under the Apache-2.0 License.
- The ConsistCompose3M dataset is licensed under Apache-2.0, with additional terms for derived data (see dataset README for details).
