Image OCR + AI Caption Service

A unified web service that combines multilingual OCR text extraction with AI-powered image captioning using Gemini Flash.

🌟 Features

  • 🔍 Automatic Text Detection - Intelligently detects whether an image contains text
  • 🌏 Multilingual OCR - Supports 10 Indian languages + English
    • Hindi, Bengali, Gujarati, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, Telugu
  • 🤖 AI Image Captioning - Gemini 1.5 Flash powered scene descriptions
  • 🔄 Smart Fusion - Combines OCR text + AI captions for complete context
  • 🌐 Translation - Automatically translates non-English text to English
  • ⚡ Parallel Processing - OCR and captioning run simultaneously
  • 🔑 Multi-API Key Support - Automatic fallback between multiple OpenRouter keys
  • 📱 Web Interface - Drag-and-drop UI in the browser

📋 Requirements

System Requirements

  • Python 3.11+ (tested with Python 3.13.7)
  • Tesseract OCR 5.5+ installed on Windows
  • OpenRouter API Key (free tier available at https://openrouter.ai/keys)

Python Dependencies

fastapi
uvicorn[standard]
python-multipart
aiofiles
Pillow
pytesseract
deep-translator
langdetect
httpx
tenacity
python-dotenv
openai

🚀 Quick Start

1. Install Tesseract OCR

Windows (using Winget):

winget install UB-Mannheim.TesseractOCR

2. Download Indian Language Packs

Download the language packs from https://github.com/tesseract-ocr/tessdata and place them in C:\Program Files\Tesseract-OCR\tessdata\ (a scripted download sketch follows the file list below).

Required files:

  • hin.traineddata (Hindi)
  • ben.traineddata (Bengali)
  • guj.traineddata (Gujarati)
  • kan.traineddata (Kannada)
  • mal.traineddata (Malayalam)
  • mar.traineddata (Marathi)
  • ori.traineddata (Odia)
  • pan.traineddata (Punjabi)
  • tam.traineddata (Tamil)
  • tel.traineddata (Telugu)
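
If you prefer to script the download, here is a minimal sketch. It assumes the packs can be fetched from the tesseract-ocr/tessdata repository's main branch via raw.githubusercontent.com and that Tesseract sits at the default path above; run it from an elevated prompt, since it writes into Program Files.

# download_tessdata.py - fetch the Indian language packs listed above (sketch, not part of the repo)
import urllib.request
from pathlib import Path

LANGS = ["hin", "ben", "guj", "kan", "mal", "mar", "ori", "pan", "tam", "tel"]
TESSDATA_DIR = Path(r"C:\Program Files\Tesseract-OCR\tessdata")
BASE_URL = "https://raw.githubusercontent.com/tesseract-ocr/tessdata/main"  # assumed raw-file URL pattern

for lang in LANGS:
    target = TESSDATA_DIR / f"{lang}.traineddata"
    if target.exists():
        print(f"{target.name} already present, skipping")
        continue
    print(f"Downloading {lang}.traineddata ...")
    urllib.request.urlretrieve(f"{BASE_URL}/{lang}.traineddata", str(target))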

3. Clone Repository

git clone https://github.com/YOUR_USERNAME/image-ocr.git
cd image-ocr

4. Create Virtual Environment

python -m venv .venv
.venv\Scripts\activate

5. Install Dependencies

pip install -r requirements.txt

6. Configure API Keys

Create a .env file in the root directory:

# OpenRouter API Configuration
OPENROUTER_API_KEY=sk-or-v1-YOUR_KEY_HERE

# Optional: Add multiple keys separated by comma for fallback
# OPENROUTER_API_KEY=sk-or-v1-key1,sk-or-v1-key2

OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
OPENROUTER_MODEL=google/gemini-flash-1.5:free

Get your free API key at: https://openrouter.ai/keys

7. Run the Service

python -m uvicorn main:app --host 0.0.0.0 --port 8000 --reload

The service will be available at http://localhost:8000, with interactive API docs at http://localhost:8000/docs.

📖 Usage

Web Interface

  1. Open http://localhost:8000 in your browser
  2. Drag and drop an image or click to browse
  3. Click "Process Image"
  4. View results:
    • Combined output (OCR + Caption)
    • Original text (if detected)
    • Detected language
    • English translation (if needed)

API Usage

Upload and Process Image:

curl -X POST "http://localhost:8000/upload" \
  -F "file=@image.jpg"

Response:

{
  "success": true,
  "filename": "image.jpg",
  "processing_mode": "with_text",
  "has_text": true,
  "results": {
    "combined_output": "AI caption with OCR text...",
    "caption": "AI generated scene description",
    "ocr": {
      "original_text": "Original text in source language",
      "detected_language": "hi",
      "translated_text": "Translated English text"
    }
  }
}
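
The same request from Python, as a minimal sketch using httpx (already in the dependency list); the endpoint and response fields are the ones shown above.

# upload_example.py - call the /upload endpoint and print the interesting fields (sketch)
import httpx

with open("image.jpg", "rb") as f:
    response = httpx.post(
        "http://localhost:8000/upload",
        files={"file": ("image.jpg", f, "image/jpeg")},
        timeout=60.0,  # OCR + captioning can take a few seconds
    )

data = response.json()
if data.get("success"):
    print("Mode:", data["processing_mode"])
    print("Combined output:", data["results"]["combined_output"])
    if data.get("has_text"):
        ocr = data["results"]["ocr"]
        print("Detected language:", ocr["detected_language"])
        print("Translation:", ocr["translated_text"])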

🏗️ Project Structure

image-ocr/
├── main.py                      # FastAPI application
├── .env                         # Configuration (not in repo)
├── src/cas/
│   ├── __init__.py
│   ├── captioner.py            # Gemini caption generation
│   ├── config.py               # Configuration management
│   ├── fusion.py               # Text deduplication
│   ├── io_utils.py             # Image preprocessing
│   ├── ocr_service.py          # OCR text extraction
│   ├── provider_openrouter.py  # OpenRouter API client
│   ├── quality.py              # Post-processing
│   └── unified_processor.py    # Main processing orchestrator
├── static/
│   └── index.html              # Web interface
├── scripts/
│   ├── run_caption.py          # CLI caption generator
│   └── run_fusion.py           # CLI fusion tester
├── data/samples/               # Test images
├── requirements.txt
├── README.md
└── SETUP_GUIDE.md              # GitHub setup instructions

🔧 Configuration

Environment Variables

Variable            | Description                            | Default
OPENROUTER_API_KEY  | OpenRouter API key(s), comma-separated | Required
OPENROUTER_BASE_URL | API base URL                           | https://openrouter.ai/api/v1
OPENROUTER_MODEL    | Model to use                           | google/gemini-flash-1.5:free
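
As a rough illustration of how these variables might be consumed (the real logic lives in src/cas/config.py), loading them with python-dotenv could look like this:

# Illustrative only - not a copy of src/cas/config.py
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory

API_KEYS = [k.strip() for k in os.getenv("OPENROUTER_API_KEY", "").split(",") if k.strip()]
BASE_URL = os.getenv("OPENROUTER_BASE_URL", "https://openrouter.ai/api/v1")
MODEL = os.getenv("OPENROUTER_MODEL", "google/gemini-flash-1.5:free")

if not API_KEYS:
    raise RuntimeError("OPENROUTER_API_KEY is required (see the .env example above)")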

Multi-API Key Fallback

To avoid rate limits, add multiple API keys:

OPENROUTER_API_KEY=sk-or-v1-key1,sk-or-v1-key2,sk-or-v1-key3

The service automatically switches to the next key when one hits a rate limit.
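
One way such a fallback can be implemented is sketched below, using the OpenAI-compatible client pointed at OpenRouter; this illustrates the idea and is not the exact logic in provider_openrouter.py.

# Sketch: try each key in order and move on when a request is rejected with HTTP 429
import os
from openai import OpenAI, RateLimitError

KEYS = [k.strip() for k in os.getenv("OPENROUTER_API_KEY", "").split(",") if k.strip()]

def complete_with_fallback(messages):
    last_error = None
    for key in KEYS:
        client = OpenAI(
            base_url=os.getenv("OPENROUTER_BASE_URL", "https://openrouter.ai/api/v1"),
            api_key=key,
        )
        try:
            return client.chat.completions.create(
                model=os.getenv("OPENROUTER_MODEL", "google/gemini-flash-1.5:free"),
                messages=messages,
            )
        except RateLimitError as err:  # current key is rate-limited, fall through to the next one
            last_error = err
    raise last_error or RuntimeError("no OpenRouter API keys configured")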

🎯 Processing Modes

Mode 1: Textless Image

  • Detection: No text found
  • Processing: AI caption only
  • Output: Scene description

Mode 2: Image with Text

  • Detection: Text found via OCR
  • Processing: OCR + AI caption in parallel (see the sketch after this list)
  • Output: Fused result combining both
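
A conceptual sketch of how the two modes can be orchestrated with asyncio is shown below; run_ocr and generate_caption are placeholder stubs standing in for the project's real helpers in unified_processor.py.

# Sketch of the mode selection and parallel processing described above
import asyncio

async def run_ocr(image_bytes: bytes) -> dict:
    # Placeholder: the real version extracts text with Tesseract/pytesseract
    return {"text": "", "language": None}

async def generate_caption(image_bytes: bytes) -> str:
    # Placeholder: the real version asks Gemini Flash via OpenRouter
    return "a scene description"

async def process_image(image_bytes: bytes) -> dict:
    # OCR and captioning run concurrently (Mode 2); if OCR finds nothing, fall back to Mode 1
    ocr_result, caption = await asyncio.gather(run_ocr(image_bytes), generate_caption(image_bytes))
    if not ocr_result["text"].strip():
        return {"processing_mode": "textless", "combined_output": caption}
    return {
        "processing_mode": "with_text",
        "combined_output": f"{caption}. Text found: {ocr_result['text']}",
    }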

📝 API Endpoints

GET /

Serves the web interface

POST /upload

Upload and process an image

Parameters:

  • file: Image file (multipart/form-data)

Response: JSON with processing results

GET /health

Health check endpoint

GET /api/info

Get API information and capabilities

GET /docs

Interactive API documentation (Swagger UI)
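
For a quick smoke test of the read-only endpoints once the server is running, something like the following works (a minimal sketch; it just prints the raw JSON):

# Smoke test for the read-only endpoints
import httpx

base = "http://localhost:8000"
print(httpx.get(f"{base}/health").json())    # health check
print(httpx.get(f"{base}/api/info").json())  # API information and capabilities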

🐛 Troubleshooting

Rate Limit Errors (429)

  • Solution 1: Wait 1-2 minutes for free tier reset
  • Solution 2: Add multiple API keys for automatic fallback
  • Solution 3: Add credits to OpenRouter account

Tesseract Not Found

  • Verify installation: tesseract --version
  • Check the Tesseract path set in ocr_service.py (line 15); see the sketch below
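
With pytesseract, that path override typically looks like the snippet below; the exact line in ocr_service.py may differ.

# Point pytesseract at the Tesseract binary explicitly if it is not on PATH
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
print(pytesseract.get_tesseract_version())  # should report 5.5+ if the path is correct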

Language Pack Missing

  • Download the missing .traineddata file from https://github.com/tesseract-ocr/tessdata (see step 2)
  • Place it in C:\Program Files\Tesseract-OCR\tessdata\ and retry

Unicode Decode Error

  • Ensure .env file is UTF-8 encoded
  • Check for special characters in API keys

🚒 Deployment

Local Development

uvicorn main:app --reload

Production

uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4

Docker (Future)

Docker support is planned for containerized deployment.

📊 Performance

  • OCR Speed: ~1-2 seconds per image
  • AI Caption: ~2-4 seconds per image
  • Total (parallel): ~2-4 seconds per image
  • Rate Limit (free): ~10 requests/minute per API key

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

MIT License - feel free to use in your projects!

🙏 Acknowledgments

  • Tesseract OCR - Google's open-source OCR engine
  • OpenRouter - Unified API for LLM access
  • Google Gemini - AI image understanding
  • Deep Translator - Multi-language translation

📞 Support

For issues and questions:

  1. Check the troubleshooting section
  2. Review API documentation at /docs
  3. Open an issue on GitHub

🗺️ Roadmap

  • [x] Multilingual OCR support
  • [x] AI caption generation
  • [x] Web interface
  • [x] Multi-API key fallback
  • [ ] Docker containerization
  • [ ] Batch processing
  • [ ] REST API authentication
  • [ ] WebSocket support for real-time processing
  • [ ] Support for more languages
  • [ ] PDF document processing

Built with ❤️ for multilingual image understanding
