PLaID++: A Preference Aligned Language Model for Targeted Inorganic Materials Design

This repository contains the official implementation for our paper, PLaID++: A Preference Aligned Language Model for Targeted Inorganic Materials Design, by Andy Xu, Rohan Desai, Larry Wang, Gabriel Hope, and Ethan Ritz.

Summary

PLaID++ introduces an LLM fine-tuned for stable and property-targeted inorganic crystal generation. PLaID++ achieves a ~50% higher S.U.N. (Stable, Unique, Novel) rate than prior work and robust space-group conditioned generation through:

Leveraging a novel Wyckoff-based text encoding
Aligning the model using Direct Preference Optimization (DPO), an RL method guided by machine-learned interatomic potentials
Unified training across conditional and unconditional generation tasks
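
As a rough illustration of the alignment objective named above, the standard DPO loss for a single preference pair can be written in a few lines. This is a minimal sketch of the published DPO formula, not the code in DPO_train.py; the function name, the argument names, and the default beta of 0.1 are assumptions made for the example. Each log-probability is the summed token log-probability of a generated crystal string under either the model being trained (policy) or the frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    beta=0.1 is an illustrative default, not a value taken from this repo.
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # sample over the rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)), computed in a numerically stable way.
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# The loss shrinks as the policy shifts probability toward the chosen sample.
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
aligned = dpo_loss(-8.0, -14.0, -10.0, -10.0)
print(neutral, aligned)
```

When the policy and reference agree on both samples, the margin is zero and the loss equals log 2; it falls toward zero as the policy prefers the chosen structure more strongly than the reference does.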

Setup

First, create an environment and install the dependencies using uv:

uv venv --python 3.12 plaid
source plaid/bin/activate
uv sync

Usage

Below are the main entry points for running the core workflows. For detailed options, please consult the script comments.

Supervised Fine-Tuning: To fine-tune Qwen-2.5 7B on crystal text representations (coordinate or Wyckoff):

python llm_finetune.py --run-name 7b-wyckoff-run-qwen --model 7b --batch-size 16 --fp4 --lr 5e-4 --qwen

Direct Preference Optimization (DPO) Fine-Tuning: To run DPO for preference-aligned RL across 7 iterations (see scripts/plaid_dpo.sh for the full script):

bash scripts/plaid_dpo.sh | tee logs/plaid_dpo.log
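
Each DPO iteration needs preference pairs built from evaluated samples. The sketch below shows one plausible way to form them and is not the logic of DPO_preprocess.py: the record fields (prompt, text, e_above_hull), the two thresholds, and the pairing rule (every stable sample against every unstable sample for the same prompt) are all assumptions for illustration. The output uses the prompt/chosen/rejected layout commonly expected by DPO trainers.

```python
from collections import defaultdict
from itertools import product


def build_preference_pairs(samples, stable_max=0.0, unstable_min=0.1):
    """Pair low-energy samples (chosen) with high-energy ones (rejected).

    samples: list of dicts with hypothetical keys
             "prompt", "text", and "e_above_hull" (eV/atom).
    stable_max / unstable_min: illustrative thresholds, not repo values.
    """
    by_prompt = defaultdict(list)
    for sample in samples:
        by_prompt[sample["prompt"]].append(sample)

    pairs = []
    for prompt, group in by_prompt.items():
        chosen = [s for s in group if s["e_above_hull"] <= stable_max]
        rejected = [s for s in group if s["e_above_hull"] >= unstable_min]
        # Samples between the two thresholds are ambiguous and are skipped.
        for good, bad in product(chosen, rejected):
            pairs.append({"prompt": prompt,
                          "chosen": good["text"],
                          "rejected": bad["text"]})
    return pairs


example = [
    {"prompt": "p1", "text": "A", "e_above_hull": 0.0},
    {"prompt": "p1", "text": "B", "e_above_hull": 0.3},
    {"prompt": "p1", "text": "C", "e_above_hull": 0.05},
    {"prompt": "p2", "text": "D", "e_above_hull": -0.01},
]
print(build_preference_pairs(example))
```

Leaving a gap between the two thresholds keeps near-ties out of the training signal; prompts without both a stable and an unstable sample contribute no pairs.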

MLIP Evals: To evaluate our crystals, run:

python evals/novelty.py "${results_csv}" test.json "${model_name}" --sun_out "${sun_output_csv}"  # unconditional eval
bash run_novelty.sh qwen_7b_dpo_wyckoff_sg_combined_t2_novelty_v2_dt_v2_it7_temp1.1 0.1 esen  # conditional eval
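
For readers unfamiliar with the S.U.N. metric, the sketch below shows how a rate of this kind can be computed once each sample has a stability value and a structure identity. It is an illustration only and does not reproduce evals/novelty.py: the structure_key field is a hypothetical stand-in for real structure matching, the threshold is passed in by the caller, and counting only the first occurrence of a duplicated structure is one of several possible conventions.

```python
def sun_rate(samples, train_keys, ehull_threshold):
    """Fraction of samples that are Stable, Unique, and Novel.

    samples: list of dicts with hypothetical keys
             "structure_key" (canonical fingerprint) and "e_above_hull".
    train_keys: set of fingerprints for structures in the training data.
    ehull_threshold: stability cutoff in eV/atom, chosen by the caller.
    """
    if not samples:
        return 0.0
    seen = set()
    sun_count = 0
    for sample in samples:
        key = sample["structure_key"]
        stable = sample["e_above_hull"] <= ehull_threshold
        unique = key not in seen        # first time this structure appears
        novel = key not in train_keys   # not present in the training data
        seen.add(key)
        if stable and unique and novel:
            sun_count += 1
    return sun_count / len(samples)


generated = [
    {"structure_key": "a", "e_above_hull": 0.0},   # stable, unique, novel
    {"structure_key": "a", "e_above_hull": 0.0},   # duplicate of the first
    {"structure_key": "b", "e_above_hull": 0.5},   # unstable
    {"structure_key": "c", "e_above_hull": 0.02},  # already in training data
]
print(sun_rate(generated, train_keys={"c"}, ehull_threshold=0.1))  # 0.25
```

In practice the equality test on a fingerprint would be replaced by a tolerance-based structure comparison, for example pymatgen's StructureMatcher.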

Visualizations: To plot space group, stability, and other metrics:

python visualizations/histogram_ehull.py
python visualizations/conditional_sg_bar_graph.py

DFT Pipeline: We run DFT on 1000 crystal structures sampled from the final PLaID++ flagship model. We include scripts that prepare the configuration files needed to run DFT with VASP, in a separate directory for each crystal.

python dft/LLM_dft_create_inputs.py

To compute the corrected energy above hull from DFT outputs (read from vasprun.xml files):

python dft/ehull_correction_newest.py
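
To make the quantity concrete, here is a minimal pure-Python sketch of energy above hull for a two-element system. It is not what dft/ehull_correction_newest.py does: it ignores energy corrections, handles only binary compositions, and assumes the reference set includes both pure elements. All function names are hypothetical. Each reference is a pair (x, E), where x is the atomic fraction of the second element and E is the formation energy per atom.

```python
def _cross(o, a, b):
    """2D cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def lower_hull(points):
    """Lower convex hull of (x, energy) points (Andrew's monotone chain)."""
    # Keep only the lowest energy at each composition.
    best = {}
    for x, energy in points:
        best[x] = min(energy, best.get(x, energy))
    hull = []
    for p in sorted(best.items()):
        # Drop the previous point if it lies on or above the new segment.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull


def e_above_hull(x, energy, hull):
    """Energy of a candidate at composition x relative to the hull."""
    for (x0, e0), (x1, e1) in zip(hull, hull[1:]):
        if x0 <= x <= x1:
            hull_energy = e0 + (e1 - e0) * (x - x0) / (x1 - x0)
            return energy - hull_energy
    raise ValueError("composition lies outside the hull's range")


# Pure elements at 0 eV/atom plus one stable compound at x = 0.5.
references = [(0.0, 0.0), (0.5, -1.0), (1.0, 0.0), (0.25, -0.2)]
hull = lower_hull(references)
print(hull)
print(e_above_hull(0.25, -0.2, hull))
```

For general multi-element systems, pymatgen's PhaseDiagram class provides the same quantity through get_e_above_hull.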

Data

The data/ directory contains the primary dataset splits.

For energy-above-hull (Ehull) evaluation or DFT validation, follow the instructions and links in the paper for any required external data.

Samples for unconditional and conditional generation used for the final model results in the main paper are available in the results/ directory.

Project Structure

PLaID/
├─ DPO_preprocess.py                  - Build DPO preference datasets from eval results
├─ DPO_train.py                       - Train models with DPO
├─ llm_finetune.py                    - Supervised fine-tuning/LoRA utilities for LLMs
├─ llm_sample.py                      - Sample generations (conditional/unconditional) from trained models
├─ scripts/
│  └─ plaid_dpo.sh                    - DPO pipeline for flagship PLaID++
├─ cond_gen/
│  ├─ csv_gen.py                      - Create condition CSVs for conditional generation
│  └─ alex_mp_analysis.py             - MP data analysis helpers (condition curation)
├─ evals/
│  ├─ e_above_hull.py                 - Compute E_above_hull; batched relaxations
│  ├─ basic_eval.py                   - Core evaluation routines for generations
│  ├─ relaxations.py                  - Relaxation backends and wrappers
│  └─ novelty.py                      - Novelty metrics for generated structures
├─ dft/
│  ├─ LLM_dft_create_inputs.py        - Prepare DFT input files for follow-up calculations
│  └─ ehull_correction_newest.py      - Convex hull energy correction utilities
├─ data/                              - Datasets (raw/processed) for training/eval
├─ results/                           - Evaluation outputs (CSV summaries, metrics)
└─ exp/                               - Experiment outputs (checkpoints, trained models)

Model

The full PLaID++ model is available on Hugging Face.

Citation

Paper: arXiv:2509.07150

@article{xu2025plaid++,
  title={PLaID++: A Preference Aligned Language Model for Targeted Inorganic Materials Design},
  author={Xu, Andy and Desai, Rohan and Wang, Larry and Hope, Gabriel and Ritz, Ethan},
  journal={arXiv preprint arXiv:2509.07150},
  year={2025}
}

License

Most of PLaID++ is distributed under the CC BY 4.0 license. However, some components of the project are governed by different licenses: pymatgen is licensed under MIT, Hugging Face Transformers under Apache 2.0, and ASE under the GNU Lesser General Public License.
