
[TMLR] Uncovering the Redundancy in Transformers via a Unified Study of Layer Dropping


Shwai He*, Guoheng Sun*, Zheyu Shen, Ang Li


This is the official implementation for the paper Uncovering the Redundancy in Transformers via a Unified Study of Layer Dropping (TMLR).

📖 Introduction

This project studies architectural redundancy in Transformer-based LLMs and provides practical pipelines for:

  • Block Drop
  • Layer Drop (Attention/MLP)
  • Joint Layer Drop
  • Post-training quantization (AWQ/GPTQ)

The dropping pipeline is built on LLaMA-Factory. Quantization support is built on AutoAWQ and AutoGPTQ.

(Figure: Layer-Drop.svg, an overview of the layer-dropping approach.)

📰 News

  • Feb 2026: The paper was published in Transactions on Machine Learning Research (TMLR).
  • May 2025: 🏆 Awarded the Qualcomm Innovation Fellowship (QIF) North America for the proposal “Less Attention, Much Faster: Toward a Future of Efficiency-Optimized Transformer Architectures.”
  • Nov 2024: Added support for more model families (Gemma2, Baichuan, DeepSeek, Yi, Solar).
  • Sep 2024: Released dropped-model checkpoints in this Hugging Face collection.
  • Jun 2024: Released arXiv preprint and code.

⚙️ Installation

conda create -n llm-drop python=3.10 -y
conda activate llm-drop

git clone https://github.com/CASE-Lab-UMD/LLM-Drop.git
cd LLM-Drop

# Core dropping pipeline
pip install -e .

# Quantization dependencies (optional)
cd src/llmtuner/compression/quantization/AutoAWQ
pip install -e .

cd AutoAWQ_kernels
pip install -e .

cd ../../AutoGPTQ
pip install -vvv --no-build-isolation -e .

# Return to the repository root
cd ../../../../../..

📦 Repository Layout

  • src/compress.py: main entry point for the dropping/compression workflow.
  • scripts/dropping/*.sh: example scripts for block/layer dropping.
  • scripts/benchmark/benchmark_lm_eval.sh: LM-Eval benchmark script.
  • scripts/benchmark/benchmark_speed.sh: speed benchmark wrapper.
  • src/benchmark_speed.py: speed benchmarking implementation.
  • scripts/quantization/*.sh: AWQ/GPTQ quantization examples.

🧰 Prepare Models

  1. Download a base model from Hugging Face (for example mistralai/Mistral-7B-v0.1).
  2. Add an auto_map entry to the model's config.json so that Transformers can load the custom dropped-model classes.
  3. Set the drop lists in config.json. For example:
     • Drop attention layers only:
       "drop_mlp_list": [],
       "drop_attn_list": [25, 26, 24, 22]
     • Drop MLP layers only:
       "drop_mlp_list": [26, 27, 25, 24],
       "drop_attn_list": []
     • Drop full blocks (attention and MLP together):
       "drop_mlp_list": [26, 25, 24, 27],
       "drop_attn_list": [26, 25, 24, 27]
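These edits can also be applied programmatically. A minimal sketch (the helper name and layer indices are illustrative, not part of this repo's API):

```python
import json

def set_drop_lists(config_path, drop_attn, drop_mlp):
    """Write drop_attn_list / drop_mlp_list into a model's config.json.
    Leaves all other config fields untouched."""
    with open(config_path) as f:
        config = json.load(f)
    config["drop_attn_list"] = drop_attn
    config["drop_mlp_list"] = drop_mlp
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)
    return config

# Example: drop four attention layers, keep all MLP layers.
# set_drop_lists("Mistral-7B-v0.1/config.json", [25, 26, 24, 22], [])
```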

Example auto_map for Mistral:

"auto_map": {
  "AutoConfig": "configuration_dropped_mistral.MistralConfig",
  "AutoModelForCausalLM": "modeling_dropped_mistral.MistralForCausalLM"
}

See model files under src/llmtuner/compression/prune/models.

🚀 Run Dropping

# Block Drop
bash scripts/dropping/block_drop.sh

# Layer Drop
bash scripts/dropping/layer_drop.sh

# Joint Layer Drop
bash scripts/dropping/layer_drop_joint.sh

These scripts estimate module importance, select layers/blocks to drop, and generate updated model configs/checkpoints.
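The importance estimate rests on the observation that a redundant module barely changes its input. A minimal numpy sketch of similarity-based selection, assuming you have collected each module's input and output hidden states (function names are hypothetical; the actual implementation lives in the scripts above):

```python
import numpy as np

def module_importance(h_in, h_out, eps=1e-8):
    """Importance of one module: 1 - mean cosine similarity between its
    input and output hidden states. A module whose output is nearly
    identical to its input is redundant and a candidate for dropping."""
    num = (h_in * h_out).sum(axis=-1)
    denom = np.linalg.norm(h_in, axis=-1) * np.linalg.norm(h_out, axis=-1) + eps
    return float(1.0 - (num / denom).mean())

def select_drop_list(importances, n_drop):
    """Indices of the n_drop least important modules."""
    return sorted(np.argsort(importances)[:n_drop].tolist())
```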

📊 Benchmark

🧪 1) Task Performance

bash scripts/benchmark/benchmark_lm_eval.sh

⚡ 2) Inference Speed

bash scripts/benchmark/benchmark_speed.sh

Before running, edit placeholders in scripts/benchmark/benchmark_speed.sh:

  • model_path
  • save_file
  • model_type
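The core measurement in a script like this reduces to counting generated tokens against wall-clock time. A minimal sketch (generate_fn is a placeholder for the model's generation call; see src/benchmark_speed.py for the real implementation):

```python
import time

def measure_throughput(generate_fn, prompts, warmup=1):
    """Tokens per second over a list of prompts.
    generate_fn(prompt) -> number of tokens produced (placeholder)."""
    for p in prompts[:warmup]:
        generate_fn(p)  # warm up caches and kernels, excluded from timing
    start = time.perf_counter()
    total_tokens = sum(generate_fn(p) for p in prompts)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed
```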

🧊 3) Quantization

bash scripts/quantization/awq.sh
bash scripts/quantization/gptq.sh

Before running, edit placeholders in those scripts (model_path, quant_path) and ensure CUDA-compatible package versions.
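The scripts wrap AutoAWQ and AutoGPTQ. For orientation, a typical AWQ configuration in the AutoAWQ style looks like the fragment below; the values here are illustrative, and the ones actually used live in scripts/quantization/awq.sh:

```python
# Illustrative AWQ settings; check scripts/quantization/awq.sh for the real ones.
quant_config = {
    "zero_point": True,   # asymmetric quantization with zero points
    "q_group_size": 128,  # per-group weight quantization granularity
    "w_bit": 4,           # 4-bit weights
    "version": "GEMM",    # AWQ kernel variant
}
```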

📄 Citation

@misc{he2024uncoveringredundancytransformers,
  title={Uncovering the Redundancy in Transformers via a Unified Study of Layer Dropping},
  author={Shwai He and Guoheng Sun and Zheyu Shen and Ang Li},
  year={2024},
  howpublished={OpenReview},
  url={https://openreview.net/forum?id=1I7PCbOPfe}
}

📬 Contact

  • Shwai He: shwaihe@umd.edu
  • Guoheng Sun: ghsun@umd.edu
