andaero/PLaID

PLaID++: A Preference Aligned Language Model for Targeted Inorganic Materials Design

This repository contains the official implementation of our paper, PLaID++: A Preference Aligned Language Model for Targeted Inorganic Materials Design, by Andy Xu, Rohan Desai, Larry Wang, Gabriel Hope, and Ethan Ritz.

Summary

PLaID++ is an LLM fine-tuned for stable and property-targeted inorganic crystal generation. It achieves a ~50% higher S.U.N. (Stable, Unique, Novel) rate than prior work, along with robust space-group-conditioned generation, through:

  1. Leveraging a novel Wyckoff-based text encoding
  2. Aligning the model using Direct Preference Optimization (DPO), an RL method guided by machine-learned interatomic potentials
  3. Unified training across conditional and unconditional generation tasks
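
As a rough illustration of the DPO objective named in item 2, the sketch below computes the standard per-pair DPO loss from summed sequence log-probabilities. It is not taken from this repository: the function names, the argument names, and the default `beta=0.1` are assumptions chosen for the example, and the actual training code may differ.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full completion under
    either the policy being trained or the frozen reference model.
    beta is a hypothetical default, not a value from the paper.
    """
    # How much more the policy likes each completion than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) is the same as softplus(-logits).
    return softplus(-logits)
```

The loss equals log(2) when the policy and the reference agree, and it falls below that value as the policy shifts probability toward the chosen completion relative to the rejected one.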

[Figure: PLaID++ architecture diagram]

Setup

First, create an environment and install the dependencies using uv:

uv venv --python 3.12 plaid
source plaid/bin/activate
uv sync

Usage

Below are the main entry points for running the core workflows. For detailed options, please consult the script comments.

  1. Supervised Fine-Tuning: To fine-tune Qwen-2.5 7B on crystal text representations (coordinate or Wyckoff):
python llm_finetune.py --run-name 7b-wyckoff-run-qwen --model 7b --batch-size 16 --fp4 --lr 5e-4 --qwen
  2. Direct Preference Optimization (DPO) Fine-Tuning: To run DPO for preference-aligned RL across 7 iterations (see scripts/plaid_dpo.sh for the full script):
bash scripts/plaid_dpo.sh | tee logs/plaid_dpo.log
  3. MLIP Evals: To evaluate our crystals, run:
python evals/novelty.py "${results.csv}" test.json "${model_name}" --sun_out "${sun_output.csv}"  # unconditional eval
bash run_novelty.sh qwen_7b_dpo_wyckoff_sg_combined_t2_novelty_v2_dt_v2_it7_temp1.1 0.1 esen  # conditional eval
  4. Visualizations: To plot space group, stability, and other metrics:
python visualizations/histogram_ehull.py
python visualizations/conditional_sg_bar_graph.py
  5. DFT Pipeline: We run DFT on 1000 crystal structures sampled from the final PLaID++ flagship model. We include scripts that prepare the configuration files needed to run DFT with VASP, with one directory per crystal:
python dft/LLM_dft_create_inputs.py

To compute corrected energy above hull from DFT outputs (namely through vasprun.xml files):

python dft/ehull_correction_newest.py
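
To make the link between the MLIP evals and the DPO step more concrete, here is a minimal sketch of one way preference pairs could be built from evaluation results. It is an illustration under stated assumptions, not the logic of DPO_preprocess.py: the record fields (`prompt`, `completion`, `e_hull`), the `min_gap` threshold, and the rule of pairing the most stable sample against the least stable one are all hypothetical.

```python
from collections import defaultdict


def build_preference_pairs(records, min_gap=0.05):
    """Pair the lowest-Ehull and highest-Ehull completion for each prompt.

    records: iterable of dicts with hypothetical keys
        'prompt', 'completion', and 'e_hull' (eV/atom, lower is more stable).
    min_gap: assumed minimum Ehull difference for a pair to be kept.
    """
    by_prompt = defaultdict(list)
    for record in records:
        by_prompt[record["prompt"]].append(record)

    pairs = []
    for prompt, group in by_prompt.items():
        ranked = sorted(group, key=lambda r: r["e_hull"])
        best, worst = ranked[0], ranked[-1]
        # Skip prompts whose samples are too close to give a clear preference.
        if worst["e_hull"] - best["e_hull"] >= min_gap:
            pairs.append({
                "prompt": prompt,
                "chosen": best["completion"],
                "rejected": worst["completion"],
            })
    return pairs
```

The prompt/chosen/rejected layout is a common convention for DPO datasets. Whether this repository stores its pairs that way is not confirmed here.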

Data

The data/ directory contains the primary dataset splits.

For energy-above-hull (Ehull) evaluation or DFT validation, follow the instructions and links in the paper for any required external data.

Samples for unconditional and conditional generation used for the final model results in the main paper are available in the results/ directory.
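
For readers working with these samples, the sketch below shows how a S.U.N. rate could be computed once each sample has a stability value and uniqueness/novelty flags. The field names (`e_hull`, `is_unique`, `is_novel`) and the default stability threshold are assumptions made for illustration; the CSV columns and cutoffs actually used by evals/novelty.py may differ. The rate is taken here as the fraction of all generated samples that pass all three checks.

```python
def sun_rate(samples, stability_threshold=0.1):
    """Fraction of samples that are Stable, Unique, and Novel.

    samples: list of dicts with hypothetical keys
        'e_hull' (eV/atom), 'is_unique' (bool), 'is_novel' (bool).
    stability_threshold: assumed Ehull cutoff (eV/atom) for "stable".
    """
    if not samples:
        return 0.0
    hits = sum(
        1
        for s in samples
        if s["e_hull"] <= stability_threshold and s["is_unique"] and s["is_novel"]
    )
    return hits / len(samples)


# Usage with made-up values: only the first sample passes all three checks.
example = [
    {"e_hull": 0.01, "is_unique": True, "is_novel": True},
    {"e_hull": 0.50, "is_unique": True, "is_novel": True},
    {"e_hull": 0.02, "is_unique": False, "is_novel": True},
    {"e_hull": 0.03, "is_unique": True, "is_novel": False},
]
print(sun_rate(example))
```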

Project Structure

PLaID/
├─ DPO_preprocess.py                  # Build DPO preference datasets from eval results
├─ DPO_train.py                       # Train models with DPO
├─ llm_finetune.py                    # Supervised fine-tuning/LoRA utilities for LLMs
├─ llm_sample.py                      # Sample generations (conditional/unconstrained) from trained models
├─ scripts/
│  └─ plaid_dpo.sh                    # DPO pipeline for flagship PLaID++
├─ cond_gen/
│  ├─ csv_gen.py                      # Create condition CSVs for conditional generation
│  └─ alex_mp_analysis.py             # MP data analysis helpers (condition curation)
├─ evals/
│  ├─ e_above_hull.py                 # Compute E_above_hull; batched relaxations
│  ├─ basic_eval.py                   # Core evaluation routines for generations
│  ├─ relaxations.py                  # Relaxation backends and wrappers
│  └─ novelty.py                      # Novelty metrics for generated structures
├─ dft/
│  ├─ LLM_dft_create_inputs.py        # Prepare DFT input files for follow-up calculations
│  └─ ehull_correction_newest.py      # Convex hull energy correction utilities
├─ data/                              # Datasets (raw/processed) for training/eval
├─ results/                           # Evaluation outputs (CSV summaries, metrics)
└─ exp/                               # Experiment outputs (checkpoints, trained models)

Model

The full PLaID++ model is available on Hugging Face.

Citation

arXiv: 2509.07150

@article{xu2025plaid++,
  title={PLaID++: A Preference Aligned Language Model for Targeted Inorganic Materials Design},
  author={Xu, Andy and Desai, Rohan and Wang, Larry and Hope, Gabriel and Ritz, Ethan},
  journal={arXiv preprint arXiv:2509.07150},
  year={2025}
}

License

Most of PLaID++ is distributed under the CC BY 4.0 license. However, some components of the project are governed by different licenses: pymatgen is licensed under MIT, Hugging Face Transformers under Apache 2.0, and ASE under the GNU Lesser General Public License.
