Mohammed Naseif, Islam Mesabah, Dalia Hajjaj, Abdulrahman Hassan, Anis Koubaa, Ahmed Elhayek
University of Prince Mugrin, Medina, Saudi Arabia
HAFITH (حافظ, Arabic for "preserver") is a vision-language model for historical Arabic manuscript text recognition. It achieves 5.10% Character Error Rate (CER) on a combined benchmark of MUHARAF, KHATT, and RASAM — a 36% relative improvement over the previous state of the art.
HAFITH addresses three fundamental limitations of existing Arabic OCR systems:
| Problem | Existing Approach | HAFITH Solution |
|---|---|---|
| Aspect ratio distortion | Fixed 384×384 resizing | SigLIP V2 NaFlex — native-resolution encoding |
| Inefficient Arabic tokenization | English RoBERTa tokenizer (~4.1 tokens/word) | Aranizer-PBE-64k (~1.0 tokens/word) |
| Training data scarcity | Fine-tuning on small real datasets | 1M synthetic samples across 350 Arabic fonts |
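To make the aspect-ratio point concrete, here is a minimal sketch of how a NaFlex-style preprocessor can keep a line image's native shape under a fixed patch budget. The function name and scaling rule are illustrative assumptions, not the actual SigLIP V2 implementation:

```python
import math

def naflex_patch_grid(width, height, patch=16, max_patches=512):
    """Pick a patch grid that preserves the image's aspect ratio
    while staying within a fixed patch budget (illustrative sketch)."""
    # Patches needed at native resolution.
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    if cols * rows > max_patches:
        # Scale both sides by the same factor so the aspect ratio is kept.
        scale = math.sqrt(max_patches / (cols * rows))
        cols = max(1, math.floor(cols * scale))
        rows = max(1, math.floor(rows * scale))
    return rows, cols

# A 1280x160 manuscript line (8:1) needs 80x10 = 800 patches natively,
# so it is rescaled to an 8x64 = 512-patch grid instead of being
# squashed into a square.
```

Contrast this with fixed 384×384 resizing, which would stretch the same 8:1 line vertically by a factor of eight and destroy stroke geometry.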
```
Input Image (any aspect ratio)
               │
               ▼
┌─────────────────────────────┐
│    NaFlex Preprocessing     │  Patchify at native aspect ratio
│    (up to 512 patches)      │  Pad unused positions to 512
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│  SigLIP V2 NaFlex Encoder   │  400M params, 1152-dim embeddings
│  (google/siglip2-so400m)    │  All patch embeddings (no CLS-only)
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│   Modality Projection MLP   │  1152 → 1024 dim
│ + Learnable Pos Embeddings  │  LayerNorm, GELU
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│    RoBERTa-Large Decoder    │  24 layers, 16 heads, 1024-dim
│   (trained from scratch)    │  Aranizer-PBE-64k vocabulary (64K tokens)
└──────────────┬──────────────┘
               │
               ▼
       Arabic Text Output
```
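The projection stage between encoder and decoder can be sketched in PyTorch from the dimensions above. This is a hedged illustration: the class name, layer ordering, and attribute names are assumptions, not the repo's actual API:

```python
import torch
import torch.nn as nn

class ModalityProjection(nn.Module):
    """Illustrative sketch: map 1152-dim SigLIP patch embeddings into
    the decoder's 1024-dim space and add learnable positions."""
    def __init__(self, enc_dim=1152, dec_dim=1024, max_patches=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(enc_dim),
            nn.Linear(enc_dim, dec_dim),
            nn.GELU(),
            nn.Linear(dec_dim, dec_dim),
        )
        # One learnable positional embedding per patch slot.
        self.pos = nn.Parameter(torch.zeros(1, max_patches, dec_dim))

    def forward(self, patch_embeds):          # (B, N, 1152)
        x = self.proj(patch_embeds)           # (B, N, 1024)
        return x + self.pos[:, : x.size(1)]   # add positions for N slots

patches = torch.randn(2, 512, 1152)          # batch of encoded patch sequences
projected = ModalityProjection()(patches)    # shape (2, 512, 1024)
```

Because all patch embeddings are passed through (not a single pooled vector), the decoder can cross-attend to individual character regions along the line.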
| Model | CER ↓ | WER ↓ |
|---|---|---|
| CRNN + CTC | 14.82% | — |
| TrOCR-Base | 13.41% | — |
| TrOCR-Large | 11.73% | 31.82% |
| HATFormer | 8.60% | — |
| HAFITH (Ours) | 5.10% | 18.05% |
| Dataset | Model | CER ↓ | WER ↓ |
|---|---|---|---|
| MUHARAF | HATFormer | 8.60% | — |
| MUHARAF | HAFITH | 8.35% | 24.76% |
| KHATT | HATFormer | 15.40% | — |
| KHATT | HAFITH | 11.21% | 37.36% |
| RASAM | TrOCR-Large | 35.26% | 50.92% |
| RASAM | HAFITH | 4.95% | 18.94% |
| Configuration | CER | ∆rel |
|---|---|---|
| TrOCR-Large Baseline | 11.73% | — |
| + Quality Filtering | 9.77% | −16.7% |
| + SigLIP2 NaFlex | 6.60% | −32.4% |
| + Aranizer (no synthetic) | 8.47% | +28.3% |
| + Aranizer + Synthetic | 5.10% | −22.7% |
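The CER figures above are character-level Levenshtein distance normalized by reference length (WER is the same dynamic program over whitespace-split word tokens). A minimal reference implementation, which the repo's `metrics.py` may differ from in detail:

```python
def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, 1)

print(cer("حافظ", "حافط"))  # one substitution over four characters -> 0.25
```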
```bash
git clone https://github.com/mdnaseif/hafith.git
cd hafith
pip install -e .
```

Or install dependencies directly:

```bash
pip install -r requirements.txt
```

Requirements: Python 3.9+, CUDA 11.8+, PyTorch 2.0+
```python
from hafith import HAFITHRecognizer

recognizer = HAFITHRecognizer.from_pretrained("mdnaseif/hafith")

# From file path
text = recognizer.recognize("path/to/manuscript_line.png")
print(text)

# From PIL Image
from PIL import Image

image = Image.open("manuscript.png")
text = recognizer.recognize(image)
print(text)
```

Batch inference:

```python
from hafith import HAFITHRecognizer

recognizer = HAFITHRecognizer.from_pretrained("mdnaseif/hafith")

images = ["line1.png", "line2.png", "line3.png"]
results = recognizer.recognize_batch(images, batch_size=8)
for img, text in zip(images, results):
    print(f"{img}: {text}")
```

Download the benchmark datasets:
```bash
python scripts/download_datasets.py --output-dir data/
```

Or download manually:
After downloading, run quality filtering:

```bash
python scripts/filter_data.py \
    --input-dir data/raw/ \
    --output-dir data/filtered/ \
    --max-token-length 64
```

Generate synthetic pretraining data:

```bash
python scripts/generate_synthetic.py \
    --corpus-path data/ArabicText-Large/ \
    --fonts-dir data/fonts/ \
    --backgrounds-dir data/backgrounds/ \
    --output-dir data/synthetic/ \
    --num-samples 1000000 \
    --num-workers 8
```

Training from scratch (recommended):
```bash
python scripts/train.py \
    --config configs/hafith_default.yaml \
    --data-dir data/filtered/ \
    --output-dir outputs/hafith_run1/
```

Resume from checkpoint:

```bash
python scripts/train.py \
    --config configs/hafith_default.yaml \
    --checkpoint outputs/hafith_run1/stage2/checkpoint-113100 \
    --resume
```

Fine-tune from pretrained HAFITH:

```bash
python scripts/train.py \
    --config configs/hafith_finetune.yaml \
    --checkpoint mdnaseif/hafith \
    --start-stage 2
```

Evaluate a trained model:

```bash
python scripts/evaluate.py \
    --model-path outputs/hafith_run1/final/ \
    --data-dir data/filtered/ \
    --split test \
    --output-file results/evaluation.json
```

Project structure:

```
hafith/
├── hafith/                    # Core library
│   ├── __init__.py
│   ├── model.py               # SigLIP2TrOCRModel architecture
│   ├── encoder.py             # SigLIP2 NaFlex vision encoder wrapper
│   ├── dataset.py             # NativeResolutionDataset + collator
│   ├── tokenizer_utils.py     # Aranizer setup utilities
│   ├── quality_filter.py      # DataQualityChecker
│   ├── metrics.py             # CER/WER with statistical sampling
│   └── recognizer.py          # High-level inference API
├── scripts/
│   ├── train.py               # Main training script
│   ├── evaluate.py            # Evaluation script
│   ├── inference.py           # Single/batch inference CLI
│   ├── generate_synthetic.py  # Synthetic data generation
│   ├── filter_data.py         # Data quality filtering
│   └── download_datasets.py   # Dataset download helper
├── configs/
│   ├── hafith_default.yaml    # Default training config
│   └── hafith_finetune.yaml   # Fine-tuning config
├── data/                      # (gitignored) Datasets
├── outputs/                   # (gitignored) Training outputs
├── tests/
│   ├── test_model.py
│   ├── test_dataset.py
│   └── test_metrics.py
├── docs/
│   └── architecture.md
├── requirements.txt
├── setup.py
└── README.md
```
| Dataset | Domain | Train | Val | Test | Avg Aspect Ratio |
|---|---|---|---|---|---|
| MUHARAF | Historical archival (19th–20th c.) | 21,129 | 1,021 | 1,278 | ~8:1 |
| KHATT | Contemporary handwriting (1,000 writers) | 15,886 | 962 | 1,197 | ~6:1 |
| RASAM | Maghrebi manuscripts (10th–20th c.) | 2,739 | 915 | 906 | 15.1:1 |
- 1 million manuscript-style line images
- 350 Arabic fonts (Naskh, Ruq'ah, Thuluth, Maghrebi)
- 50 aged parchment background textures
- Stochastic degradations: paper texture, baseline warp, noise, ink effects, aging
Download: mdnaseif/hafith-synthetic-1m
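The stochastic degradations listed above can be sketched roughly as follows. This is an illustrative NumPy pipeline with made-up probabilities and magnitudes; the actual generator is `scripts/generate_synthetic.py` and differs in detail:

```python
import numpy as np

def degrade(line_img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply a random subset of degradations to a grayscale line image
    (uint8, H x W). Probabilities and strengths are illustrative."""
    img = line_img.astype(np.float32)
    if rng.random() < 0.7:
        # Additive sensor-style noise
        img += rng.normal(0.0, 8.0, img.shape)
    if rng.random() < 0.5:
        # Parchment-like brightness/contrast shift
        img = img * rng.uniform(0.85, 1.0) + rng.uniform(0.0, 20.0)
    if rng.random() < 0.5:
        # Sinusoidal baseline warp: shift each column vertically
        h, w = img.shape
        shift = (2 * np.sin(np.linspace(0, 3 * np.pi, w))).astype(int)
        img = np.stack([np.roll(img[:, c], shift[c]) for c in range(w)],
                       axis=1)
    return np.clip(img, 0, 255).astype(np.uint8)
```

Sampling a fresh combination of degradations per line is what lets 1M synthetic images cover far more visual variation than the underlying 350 fonts alone.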
See `configs/hafith_default.yaml` for all options. Key parameters:

```yaml
model:
  encoder: google/siglip2-so400m-patch16-naflex
  decoder: microsoft/trocr-large-handwritten  # architecture reference only
  tokenizer: riotu-lab/Aranizer-PBE-64k
  encoder_from_scratch: false
  decoder_from_scratch: true                  # required when using Aranizer
  max_num_patches: 512
  max_target_length: 64

training:
  stage1_epochs: 5                 # encoder frozen, decoder + projection
  stage2_epochs: 50                # full fine-tuning
  batch_size: 4
  gradient_accumulation_steps: 8   # effective batch = 32
  stage1_lr: 5.0e-5
  stage2_lr: 1.0e-5
  fp16: true
  label_smoothing: 0.1

data:
  use_synthetic_pretraining: true
  use_quality_filtering: true
  max_token_length: 64
```

| Stage | Hardware | Time |
|---|---|---|
| Synthetic pretraining | 1× RTX 4090 (24GB) | ~7 days |
| Real data fine-tuning | 1× RTX 4090 (24GB) | ~3 days |
| Inference | 1× RTX 4090 (24GB) | 12.5 samples/sec |
| Inference (INT8) | 1× RTX 4090 (24GB) | 19.8 samples/sec |
Total training: ~10 days on a single RTX 4090.
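The two training stages in the config (frozen encoder, then full fine-tuning) amount to toggling `requires_grad` on the encoder's parameters. A minimal sketch, assuming the model exposes `encoder` and `decoder` submodules (attribute names are hypothetical):

```python
import torch

def set_stage(model: torch.nn.Module, stage: int) -> None:
    """Stage 1: freeze the vision encoder, train decoder + projection.
    Stage 2: unfreeze everything for full fine-tuning."""
    for p in model.encoder.parameters():
        p.requires_grad = stage >= 2
    for p in model.decoder.parameters():
        p.requires_grad = True
```

Freezing the 400M-parameter encoder in stage 1 lets the randomly initialized decoder and projection converge without destroying the pretrained visual features; stage 2 then adapts the encoder to manuscript imagery at a lower learning rate.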
```bibtex
@article{naseif2026hafith,
  title={{HAFITH}: Aspect-Ratio Preserving {VLM} for Historical Arabic Manuscript Text Recognition},
  author={Naseif, Mohammed and Mesabah, Islam and Hajjaj, Dalia and Hassan, Abdulrahman and Koubaa, Anis and Elhayek, Ahmed},
  journal={Information Processing \& Management},
  year={2026}
}
```

This project is licensed under the MIT License; see LICENSE for details.