# HAFITH: Aspect-Ratio Preserving VLM for Historical Arabic Manuscript Text Recognition

Paper · HuggingFace Model · HuggingFace Dataset · License: MIT

Mohammed Naseif, Islam Mesabah, Dalia Hajjaj, Abdulrahman Hassan, Anis Koubaa, Ahmed Elhayek

University of Prince Mugrin, Medina, Saudi Arabia


## Overview

HAFITH (حافظ, Arabic for "preserver") is a vision-language model for historical Arabic manuscript text recognition. It achieves 5.10% Character Error Rate (CER) on a combined benchmark of MUHARAF, KHATT, and RASAM — a 36% relative improvement over the previous state of the art.

HAFITH addresses three fundamental limitations of existing Arabic OCR systems:

| Problem | Existing Approach | HAFITH Solution |
|---|---|---|
| Aspect ratio distortion | Fixed 384×384 resizing | SigLIP V2 NaFlex — native-resolution encoding |
| Inefficient Arabic tokenization | English RoBERTa tokenizer (~4.1 tokens/word) | Aranizer-PBE-64k (~1.0 tokens/word) |
| Training data scarcity | Fine-tuning on small real datasets | 1M synthetic samples across 350 Arabic fonts |

## Architecture

```
Input Image (any aspect ratio)
       │
       ▼
┌─────────────────────────────┐
│  NaFlex Preprocessing       │  Patchify at native aspect ratio
│  (up to 512 patches)        │  Pad unused positions to 512
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│  SigLIP V2 NaFlex Encoder   │  400M params, 1152-dim embeddings
│  (google/siglip2-so400m)    │  All patch embeddings (no CLS-only)
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│  Modality Projection MLP    │  1152 → 1024 dim
│  + Learnable Pos Embeddings │  LayerNorm, GELU
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│  RoBERTa-Large Decoder      │  24 layers, 16 heads, 1024-dim
│  (trained from scratch)     │  Aranizer-PBE-64k vocabulary (64K tokens)
└──────────────┬──────────────┘
               │
               ▼
       Arabic Text Output
```
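The NaFlex stage is what lets an elongated manuscript line keep its shape. As a rough illustration (a simplified sketch, not the repo's actual preprocessing code; the 16-pixel patch size follows the `patch16` encoder name), the patch grid can be chosen at the image's native aspect ratio and scaled down uniformly only when it exceeds the 512-patch budget:

```python
import math

def naflex_grid(width, height, patch=16, max_patches=512):
    """Pick a patch grid that keeps the image's aspect ratio
    while staying within the patch budget (simplified sketch)."""
    cols = max(1, round(width / patch))
    rows = max(1, round(height / patch))
    if cols * rows > max_patches:
        # Scale both dimensions by the same factor so the aspect
        # ratio survives, unlike fixed square resizing.
        scale = math.sqrt(max_patches / (cols * rows))
        cols = max(1, math.floor(cols * scale))
        rows = max(1, math.floor(rows * scale))
    return rows, cols

# A ~15:1 manuscript line (e.g. RASAM) keeps its elongated grid:
rows, cols = naflex_grid(width=2400, height=160)
```

Contrast this with fixed 384×384 resizing, which would squeeze the same line into a square grid regardless of its shape.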

## Results

### Combined Benchmark (MUHARAF + KHATT + RASAM)

| Model | CER ↓ | WER ↓ |
|---|---|---|
| CRNN + CTC | 14.82% | – |
| TrOCR-Base | 13.41% | – |
| TrOCR-Large | 11.73% | 31.82% |
| HATFormer | 8.60% | – |
| **HAFITH (Ours)** | **5.10%** | **18.05%** |
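The CER and WER figures above are standard normalized edit distances. A minimal sketch of their computation (illustrative only; the repo's metrics.py additionally applies statistical sampling):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edits / reference length."""
    return edit_distance(ref, hyp) / max(1, len(ref))

def wer(ref, hyp):
    """Word error rate: same distance over whitespace tokens."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(1, len(ref_words))
```

Both functions normalize by the reference length, so a CER of 5.10% means roughly one character edit per twenty reference characters.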

### Per-Dataset Results

| Dataset | Model | CER ↓ | WER ↓ |
|---|---|---|---|
| MUHARAF | HATFormer | 8.60% | – |
| MUHARAF | HAFITH | 8.35% | 24.76% |
| KHATT | HATFormer | 15.40% | – |
| KHATT | HAFITH | 11.21% | 37.36% |
| RASAM | TrOCR-Large | 35.26% | 50.92% |
| RASAM | HAFITH | 4.95% | 18.94% |

### Ablation Study

| Configuration | CER | ∆rel |
|---|---|---|
| TrOCR-Large Baseline | 11.73% | – |
| + Quality Filtering | 9.77% | −16.7% |
| + SigLIP2 NaFlex | 6.60% | −32.4% |
| + Aranizer (no synthetic) | 8.47% | +28.3% |
| + Aranizer + Synthetic | 5.10% | −22.7% |

∆rel is measured against the preceding cumulative configuration; the final row is measured against + SigLIP2 NaFlex (6.60%), since the no-synthetic row is an ablation detour rather than part of the cumulative stack.

## Installation

```bash
git clone https://github.com/mdnaseif/hafith.git
cd hafith
pip install -e .
```

Or install dependencies directly:

```bash
pip install -r requirements.txt
```

Requirements: Python 3.9+, CUDA 11.8+, PyTorch 2.0+


## Quick Start

### Inference

```python
from hafith import HAFITHRecognizer

recognizer = HAFITHRecognizer.from_pretrained("mdnaseif/hafith")

# From a file path
text = recognizer.recognize("path/to/manuscript_line.png")
print(text)

# From a PIL Image
from PIL import Image

image = Image.open("manuscript.png")
text = recognizer.recognize(image)
print(text)
```

### Batch Inference

```python
from hafith import HAFITHRecognizer

recognizer = HAFITHRecognizer.from_pretrained("mdnaseif/hafith")

images = ["line1.png", "line2.png", "line3.png"]
results = recognizer.recognize_batch(images, batch_size=8)
for img, text in zip(images, results):
    print(f"{img}: {text}")
```

## Training

### Step 1: Prepare Data

Download the benchmark datasets:

```bash
python scripts/download_datasets.py --output-dir data/
```

Or download the MUHARAF, KHATT, and RASAM datasets manually from their respective distribution pages.

After downloading, run quality filtering:

```bash
python scripts/filter_data.py \
    --input-dir data/raw/ \
    --output-dir data/filtered/ \
    --max-token-length 64
```
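Conceptually, the token-length part of this filter drops lines whose transcription would overflow the decoder's 64-token target budget. A hedged sketch, where `tokenize` stands in for the real Aranizer tokenizer (a whitespace stub is used below purely for illustration):

```python
def filter_by_token_length(samples, tokenize, max_token_length=64):
    """Keep only samples whose target text fits the token budget,
    mirroring the --max-token-length flag above (simplified)."""
    kept = []
    for sample in samples:
        if len(tokenize(sample["text"])) <= max_token_length:
            kept.append(sample)
    return kept

# Whitespace stub in place of Aranizer-PBE-64k, for illustration:
samples = [{"text": "سطر قصير"}, {"text": "كلمة " * 100}]
kept = filter_by_token_length(samples, tokenize=str.split)
```

With the real tokenizer at ~1.0 tokens per word, the 64-token cap comfortably covers typical manuscript lines.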

### Step 2: Generate Synthetic Data (Optional but Recommended)

```bash
python scripts/generate_synthetic.py \
    --corpus-path data/ArabicText-Large/ \
    --fonts-dir data/fonts/ \
    --backgrounds-dir data/backgrounds/ \
    --output-dir data/synthetic/ \
    --num-samples 1000000 \
    --num-workers 8
```
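The generation pipeline can be pictured as sampling a per-line spec of font, background, and degradations. The sketch below is purely illustrative; names such as `sample_spec` and `DEGRADATIONS` are hypothetical and not the script's actual API:

```python
import random

# Degradation families listed in the Datasets section below.
DEGRADATIONS = ["paper_texture", "baseline_warp", "noise", "ink_bleed", "aging"]

def sample_spec(fonts, backgrounds, rng):
    """Sample one synthetic-line spec: a font, a background texture,
    and a random subset of degradations (illustrative only)."""
    return {
        "font": rng.choice(fonts),
        "background": rng.choice(backgrounds),
        "degradations": [d for d in DEGRADATIONS if rng.random() < 0.5],
    }

rng = random.Random(0)                            # seeded for reproducibility
fonts = [f"font_{i}" for i in range(350)]         # 350 Arabic fonts
backgrounds = [f"bg_{i}" for i in range(50)]      # 50 parchment textures
spec = sample_spec(fonts, backgrounds, rng)
```

Each sampled spec would then drive the actual rendering of one line image from the text corpus.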

### Step 3: Train

From scratch (recommended):

```bash
python scripts/train.py \
    --config configs/hafith_default.yaml \
    --data-dir data/filtered/ \
    --output-dir outputs/hafith_run1/
```

Resume from a checkpoint:

```bash
python scripts/train.py \
    --config configs/hafith_default.yaml \
    --checkpoint outputs/hafith_run1/stage2/checkpoint-113100 \
    --resume
```

Fine-tune from pretrained HAFITH:

```bash
python scripts/train.py \
    --config configs/hafith_finetune.yaml \
    --checkpoint mdnaseif/hafith \
    --start-stage 2
```

### Step 4: Evaluate

```bash
python scripts/evaluate.py \
    --model-path outputs/hafith_run1/final/ \
    --data-dir data/filtered/ \
    --split test \
    --output-file results/evaluation.json
```

## Project Structure

```
hafith/
├── hafith/                     # Core library
│   ├── __init__.py
│   ├── model.py                # SigLIP2TrOCRModel architecture
│   ├── encoder.py              # SigLIP2 NaFlex vision encoder wrapper
│   ├── dataset.py              # NativeResolutionDataset + collator
│   ├── tokenizer_utils.py      # Aranizer setup utilities
│   ├── quality_filter.py       # DataQualityChecker
│   ├── metrics.py              # CER/WER with statistical sampling
│   └── recognizer.py           # High-level inference API
├── scripts/
│   ├── train.py                # Main training script
│   ├── evaluate.py             # Evaluation script
│   ├── inference.py            # Single/batch inference CLI
│   ├── generate_synthetic.py   # Synthetic data generation
│   ├── filter_data.py          # Data quality filtering
│   └── download_datasets.py    # Dataset download helper
├── configs/
│   ├── hafith_default.yaml     # Default training config
│   └── hafith_finetune.yaml    # Fine-tuning config
├── data/                       # (gitignored) Datasets
├── outputs/                    # (gitignored) Training outputs
├── tests/
│   ├── test_model.py
│   ├── test_dataset.py
│   └── test_metrics.py
├── docs/
│   └── architecture.md
├── requirements.txt
├── setup.py
└── README.md
```

## Datasets

### Real Benchmark Datasets

| Dataset | Domain | Train | Val | Test | Avg. Aspect Ratio |
|---|---|---|---|---|---|
| MUHARAF | Historical archival (19th–20th c.) | 21,129 | 1,021 | 1,278 | ~8:1 |
| KHATT | Contemporary handwriting (1,000 writers) | 15,886 | 962 | 1,197 | ~6:1 |
| RASAM | Maghrebi manuscripts (10th–20th c.) | 2,739 | 915 | 906 | 15.1:1 |

### Synthetic Dataset

- 1 million manuscript-style line images
- 350 Arabic fonts (Naskh, Ruq'ah, Thuluth, Maghrebi)
- 50 aged parchment background textures
- Stochastic degradations: paper texture, baseline warp, noise, ink effects, aging

Download: `mdnaseif/hafith-synthetic-1m`


## Configuration

See `configs/hafith_default.yaml` for all options. Key parameters:

```yaml
model:
  encoder: google/siglip2-so400m-patch16-naflex
  decoder: microsoft/trocr-large-handwritten  # architecture reference only
  tokenizer: riotu-lab/Aranizer-PBE-64k
  encoder_from_scratch: false
  decoder_from_scratch: true          # required when using Aranizer
  max_num_patches: 512
  max_target_length: 64

training:
  stage1_epochs: 5                    # encoder frozen, decoder + projection
  stage2_epochs: 50                   # full fine-tuning
  batch_size: 4
  gradient_accumulation_steps: 8      # effective batch = 32
  stage1_lr: 5.0e-5
  stage2_lr: 1.0e-5
  fp16: true
  label_smoothing: 0.1

data:
  use_synthetic_pretraining: true
  use_quality_filtering: true
  max_token_length: 64
```
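Two invariants are worth checking before a run: the effective batch size is batch_size × gradient_accumulation_steps (4 × 8 = 32 here), and switching to the Aranizer vocabulary requires training the decoder from scratch. A small illustrative check (plain dicts standing in for the parsed YAML):

```python
# Plain dicts standing in for the parsed YAML config above.
config = {
    "model": {
        "tokenizer": "riotu-lab/Aranizer-PBE-64k",
        "decoder_from_scratch": True,
    },
    "training": {"batch_size": 4, "gradient_accumulation_steps": 8},
}

def effective_batch(training):
    """Gradients accumulate over several micro-batches, so each
    optimizer step sees batch_size * accumulation_steps samples."""
    return training["batch_size"] * training["gradient_accumulation_steps"]

# Aranizer's 64K vocabulary differs from TrOCR's, so pretrained
# decoder embeddings cannot be reused.
if "Aranizer" in config["model"]["tokenizer"]:
    assert config["model"]["decoder_from_scratch"], (
        "decoder_from_scratch must be true when using Aranizer"
    )
```

Gradient accumulation is what lets the 24GB card reach an effective batch of 32 despite a per-step batch of 4.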

## Computational Requirements

| Stage | Hardware | Time / Throughput |
|---|---|---|
| Synthetic pretraining | 1× RTX 4090 (24GB) | ~7 days |
| Real data fine-tuning | 1× RTX 4090 (24GB) | ~3 days |
| Inference | 1× RTX 4090 (24GB) | 12.5 samples/sec |
| Inference (INT8) | 1× RTX 4090 (24GB) | 19.8 samples/sec |

Total training: ~10 days on a single RTX 4090.


## Citation

```bibtex
@article{naseif2026hafith,
  title={{HAFITH}: Aspect-Ratio Preserving {VLM} for Historical Arabic Manuscript Text Recognition},
  author={Naseif, Mohammed and Mesabah, Islam and Hajjaj, Dalia and Hassan, Abdulrahman and Koubaa, Anis and Elhayek, Ahmed},
  journal={Information Processing \& Management},
  year={2026}
}
```

## License

This project is licensed under the MIT License — see LICENSE for details.

## Acknowledgments

- SigLIP V2 by Google DeepMind for the NaFlex vision encoder
- Aranizer by RIOTU Lab for the Arabic tokenizer
- TrOCR by Microsoft for the decoder architecture
- The MUHARAF, KHATT, and RASAM dataset creators for benchmark data
