Mohammed Naseif, Islam Mesabah, Dalia Hajjaj, Abdulrahman Hassan, Anis Koubaa, Ahmed Elhayek
University of Prince Mugrin, Medina, Saudi Arabia
HAFITH (حافظ, Arabic for "preserver") is a vision-language model for historical Arabic manuscript text recognition. It achieves 5.10% Character Error Rate (CER) on a combined benchmark of MUHARAF, KHATT, and RASAM — a 36% relative improvement over the previous state of the art.
HAFITH addresses three fundamental limitations of existing Arabic OCR systems:
| Problem | Existing Approach | HAFITH Solution |
|---|---|---|
| Aspect ratio distortion | Fixed 384×384 resizing | SigLIP V2 NaFlex — native-resolution encoding |
| Inefficient Arabic tokenization | English RoBERTa tokenizer (~4.1 tokens/word) | Aranizer-PBE-64k (~1.0 tokens/word) |
| Training data scarcity | Fine-tuning on small real datasets | 1M synthetic samples across 350 Arabic fonts |
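To make the aspect-ratio point concrete, here is a minimal sketch of how a NaFlex-style preprocessor can keep a line image's native shape under a fixed patch budget. The function name and scaling rule are illustrative assumptions, not the actual SigLIP V2 implementation:

```python
import math

def naflex_patch_grid(width, height, patch=16, max_patches=512):
    """Pick a patch grid that preserves the image's aspect ratio
    while staying within a fixed patch budget (illustrative sketch)."""
    # Patches needed at native resolution.
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    if cols * rows > max_patches:
        # Scale both sides by the same factor so the aspect ratio is kept.
        scale = math.sqrt(max_patches / (cols * rows))
        cols = max(1, math.floor(cols * scale))
        rows = max(1, math.floor(rows * scale))
    return rows, cols

# A 1280x160 manuscript line (8:1) needs 80x10 = 800 patches natively,
# so it is rescaled to an 8x64 = 512-patch grid instead of being
# squashed into a square.
```

Contrast this with fixed 384×384 resizing, which would stretch the same 8:1 line vertically by a factor of eight and destroy stroke geometry.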
```
Input Image (any aspect ratio)
               │
               ▼
┌─────────────────────────────┐
│    NaFlex Preprocessing     │  Patchify at native aspect ratio
│    (up to 512 patches)      │  Pad unused positions to 512
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│  SigLIP V2 NaFlex Encoder   │  400M params, 1152-dim embeddings
│  (google/siglip2-so400m)    │  All patch embeddings (no CLS-only)
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│   Modality Projection MLP   │  1152 → 1024 dim
│ + Learnable Pos Embeddings  │  LayerNorm, GELU
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│    RoBERTa-Large Decoder    │  24 layers, 16 heads, 1024-dim
│   (trained from scratch)    │  Aranizer-PBE-64k vocabulary (64K tokens)
└──────────────┬──────────────┘
               │
               ▼
       Arabic Text Output
```
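The projection stage between encoder and decoder can be sketched in PyTorch from the dimensions above. This is a hedged illustration: the class name, layer ordering, and attribute names are assumptions, not the repo's actual API:

```python
import torch
import torch.nn as nn

class ModalityProjection(nn.Module):
    """Illustrative sketch: map 1152-dim SigLIP patch embeddings into
    the decoder's 1024-dim space and add learnable positions."""
    def __init__(self, enc_dim=1152, dec_dim=1024, max_patches=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(enc_dim),
            nn.Linear(enc_dim, dec_dim),
            nn.GELU(),
            nn.Linear(dec_dim, dec_dim),
        )
        # One learnable positional embedding per patch slot.
        self.pos = nn.Parameter(torch.zeros(1, max_patches, dec_dim))

    def forward(self, patch_embeds):          # (B, N, 1152)
        x = self.proj(patch_embeds)           # (B, N, 1024)
        return x + self.pos[:, : x.size(1)]   # add positions for N slots

patches = torch.randn(2, 512, 1152)          # batch of encoded patch sequences
projected = ModalityProjection()(patches)    # shape (2, 512, 1024)
```

Because all patch embeddings are passed through (not a single pooled vector), the decoder can cross-attend to individual character regions along the line.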
| Model | CER ↓ | WER ↓ |
|---|---|---|
| CRNN + CTC | 14.82% | — |
| TrOCR-Base | 13.41% | — |
| TrOCR-Large | 11.73% | 31.82% |
| HATFormer | 8.60% | — |
| HAFITH (Ours) | 5.10% | 18.05% |
| Dataset | Model | CER ↓ | WER ↓ |
|---|---|---|---|
| MUHARAF | HATFormer | 8.60% | — |
| MUHARAF | HAFITH | 8.35% | 24.76% |
| KHATT | HATFormer | 15.40% | — |
| KHATT | HAFITH | 11.21% | 37.36% |
| RASAM | TrOCR-Large | 35.26% | 50.92% |
| RASAM | HAFITH | 4.95% | 18.94% |
| Configuration | CER | ∆rel |
|---|---|---|
| TrOCR-Large Baseline | 11.73% | — |
| + Quality Filtering | 9.77% | −16.7% |
| + SigLIP2 NaFlex | 6.60% | −32.4% |
| + Aranizer (no synthetic) | 8.47% | +28.3% |
| + Aranizer + Synthetic | 5.10% | −22.7% |
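The CER figures above are character-level Levenshtein distance normalized by reference length (WER is the same dynamic program over whitespace-split word tokens). A minimal reference implementation, which the repo's `metrics.py` may differ from in detail:

```python
def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, 1)

print(cer("حافظ", "حافط"))  # one substitution over four characters -> 0.25
```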
```bash
git clone https://github.com/mdnaseif/hafith.git
cd hafith
pip install -e .
```

Or install dependencies directly:

```bash
pip install -r requirements.txt
```

Requirements: Python 3.9+, CUDA 11.8+, PyTorch 2.0+
```python
from hafith import HAFITHRecognizer

recognizer = HAFITHRecognizer.from_pretrained("mdnaseif/hafith")

# From file path
text = recognizer.recognize("path/to/manuscript_line.png")
print(text)

# From PIL Image
from PIL import Image

image = Image.open("manuscript.png")
text = recognizer.recognize(image)
print(text)
```

Batch inference:

```python
from hafith import HAFITHRecognizer

recognizer = HAFITHRecognizer.from_pretrained("mdnaseif/hafith")

images = ["line1.png", "line2.png", "line3.png"]
results = recognizer.recognize_batch(images, batch_size=8)
for img, text in zip(images, results):
    print(f"{img}: {text}")
```

Download the benchmark datasets:
```bash
python scripts/download_datasets.py --output-dir data/
```

Or download manually:
After downloading, run quality filtering:

```bash
python scripts/filter_data.py \
    --input-dir data/raw/ \
    --output-dir data/filtered/ \
    --max-token-length 64
```

Generate synthetic pretraining data:

```bash
python scripts/generate_synthetic.py \
    --corpus-path data/ArabicText-Large/ \
    --fonts-dir data/fonts/ \
    --backgrounds-dir data/backgrounds/ \
    --output-dir data/synthetic/ \
    --num-samples 1000000 \
    --num-workers 8
```

Training from scratch (recommended):
```bash
python scripts/train.py \
    --config configs/hafith_default.yaml \
    --data-dir data/filtered/ \
    --output-dir outputs/hafith_run1/
```

Resume from checkpoint:

```bash
python scripts/train.py \
    --config configs/hafith_default.yaml \
    --checkpoint outputs/hafith_run1/stage2/checkpoint-113100 \
    --resume
```

Fine-tune from pretrained HAFITH:

```bash
python scripts/train.py \
    --config configs/hafith_finetune.yaml \
    --checkpoint mdnaseif/hafith \
    --start-stage 2
```

Evaluate a trained model:

```bash
python scripts/evaluate.py \
    --model-path outputs/hafith_run1/final/ \
    --data-dir data/filtered/ \
    --split test \
    --output-file results/evaluation.json
```

Project structure:

```
hafith/
├── hafith/                    # Core library
│   ├── __init__.py
│   ├── model.py               # SigLIP2TrOCRModel architecture
│   ├── encoder.py             # SigLIP2 NaFlex vision encoder wrapper
│   ├── dataset.py             # NativeResolutionDataset + collator
│   ├── tokenizer_utils.py     # Aranizer setup utilities
│   ├── quality_filter.py      # DataQualityChecker
│   ├── metrics.py             # CER/WER with statistical sampling
│   └── recognizer.py          # High-level inference API
├── scripts/
│   ├── train.py               # Main training script
│   ├── evaluate.py            # Evaluation script
│   ├── inference.py           # Single/batch inference CLI
│   ├── generate_synthetic.py  # Synthetic data generation
│   ├── filter_data.py         # Data quality filtering
│   └── download_datasets.py   # Dataset download helper
├── configs/
│   ├── hafith_default.yaml    # Default training config
│   └── hafith_finetune.yaml   # Fine-tuning config
├── data/                      # (gitignored) Datasets
├── outputs/                   # (gitignored) Training outputs
├── tests/
│   ├── test_model.py
│   ├── test_dataset.py
│   └── test_metrics.py
├── docs/
│   └── architecture.md
├── requirements.txt
├── setup.py
└── README.md
```
| Dataset | Domain | Train | Val | Test | Avg Aspect Ratio |
|---|---|---|---|---|---|
| MUHARAF | Historical archival (19th–20th c.) | 21,129 | 1,021 | 1,278 | ~8:1 |
| KHATT | Contemporary handwriting (1,000 writers) | 15,886 | 962 | 1,197 | ~6:1 |
| RASAM | Maghrebi manuscripts (10th–20th c.) | 2,739 | 915 | 906 | 15.1:1 |
- 1 million manuscript-style line images
- 350 Arabic fonts (Naskh, Ruq'ah, Thuluth, Maghrebi)
- 50 aged parchment background textures
- Stochastic degradations: paper texture, baseline warp, noise, ink effects, aging
Download: mdnaseif/hafith-synthetic-1m
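The stochastic degradations listed above can be sketched roughly as follows. This is an illustrative NumPy pipeline with made-up probabilities and magnitudes; the actual generator is `scripts/generate_synthetic.py` and differs in detail:

```python
import numpy as np

def degrade(line_img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply a random subset of degradations to a grayscale line image
    (uint8, H x W). Probabilities and strengths are illustrative."""
    img = line_img.astype(np.float32)
    if rng.random() < 0.7:
        # Additive sensor-style noise
        img += rng.normal(0.0, 8.0, img.shape)
    if rng.random() < 0.5:
        # Parchment-like brightness/contrast shift
        img = img * rng.uniform(0.85, 1.0) + rng.uniform(0.0, 20.0)
    if rng.random() < 0.5:
        # Sinusoidal baseline warp: shift each column vertically
        h, w = img.shape
        shift = (2 * np.sin(np.linspace(0, 3 * np.pi, w))).astype(int)
        img = np.stack([np.roll(img[:, c], shift[c]) for c in range(w)],
                       axis=1)
    return np.clip(img, 0, 255).astype(np.uint8)
```

Sampling a fresh combination of degradations per line is what lets 1M synthetic images cover far more visual variation than the underlying 350 fonts alone.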
See `configs/hafith_default.yaml` for all options. Key parameters:

```yaml
model:
  encoder: google/siglip2-so400m-patch16-naflex
  decoder: microsoft/trocr-large-handwritten  # architecture reference only
  tokenizer: riotu-lab/Aranizer-PBE-64k
  encoder_from_scratch: false
  decoder_from_scratch: true                  # required when using Aranizer
  max_num_patches: 512
  max_target_length: 64

training:
  stage1_epochs: 5                 # encoder frozen, decoder + projection
  stage2_epochs: 50                # full fine-tuning
  batch_size: 4
  gradient_accumulation_steps: 8   # effective batch = 32
  stage1_lr: 5.0e-5
  stage2_lr: 1.0e-5
  fp16: true
  label_smoothing: 0.1

data:
  use_synthetic_pretraining: true
  use_quality_filtering: true
  max_token_length: 64
```

| Stage | Hardware | Time |
|---|---|---|
| Synthetic pretraining | 1× RTX 4090 (24GB) | ~7 days |
| Real data fine-tuning | 1× RTX 4090 (24GB) | ~3 days |
| Inference | 1× RTX 4090 (24GB) | 12.5 samples/sec |
| Inference (INT8) | 1× RTX 4090 (24GB) | 19.8 samples/sec |
Total training: ~10 days on a single RTX 4090.
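The two training stages in the config (frozen encoder, then full fine-tuning) amount to toggling `requires_grad` on the encoder's parameters. A minimal sketch, assuming the model exposes `encoder` and `decoder` submodules (attribute names are hypothetical):

```python
import torch

def set_stage(model: torch.nn.Module, stage: int) -> None:
    """Stage 1: freeze the vision encoder, train decoder + projection.
    Stage 2: unfreeze everything for full fine-tuning."""
    for p in model.encoder.parameters():
        p.requires_grad = stage >= 2
    for p in model.decoder.parameters():
        p.requires_grad = True
```

Freezing the 400M-parameter encoder in stage 1 lets the randomly initialized decoder and projection converge without destroying the pretrained visual features; stage 2 then adapts the encoder to manuscript imagery at a lower learning rate.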
```bibtex
@article{naseif2026hafith,
  title={{HAFITH}: Aspect-Ratio Preserving {VLM} for Historical Arabic Manuscript Text Recognition},
  author={Naseif, Mohammed and Mesabah, Islam and Hajjaj, Dalia and Hassan, Abdulrahman and Koubaa, Anis and Elhayek, Ahmed},
  journal={Information Processing \& Management},
  year={2026}
}
```

This project is licensed under the MIT License; see LICENSE for details.