A unified web service that combines multilingual OCR text extraction with AI-powered image captioning using Gemini Flash.
- Automatic Text Detection - Intelligently detects whether an image contains text
- Multilingual OCR - Supports English + 10 Indian languages:
  - Hindi, Bengali, Gujarati, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, Telugu
- AI Image Captioning - Scene descriptions powered by Gemini 1.5 Flash
- Smart Fusion - Combines OCR text + AI captions for complete context
- Translation - Automatically translates non-English text to English
- Parallel Processing - OCR and captioning run simultaneously
- Multi-API Key Support - Automatic fallback between multiple OpenRouter keys
- Web Interface - Beautiful drag-and-drop UI
- Python 3.11+ (tested with Python 3.13.7)
- Tesseract OCR 5.5+ installed on Windows
- OpenRouter API Key (free tier available at https://openrouter.ai/keys)
fastapi
uvicorn[standard]
python-multipart
aiofiles
Pillow
pytesseract
deep-translator
langdetect
httpx
tenacity
python-dotenv
openai
Windows (using Winget):
winget install UB-Mannheim.TesseractOCR

Language packs should be downloaded from https://github.com/tesseract-ocr/tessdata and placed in C:\Program Files\Tesseract-OCR\tessdata\
Required files:
- hin.traineddata (Hindi)
- ben.traineddata (Bengali)
- guj.traineddata (Gujarati)
- kan.traineddata (Kannada)
- mal.traineddata (Malayalam)
- mar.traineddata (Marathi)
- ori.traineddata (Odia)
- pan.traineddata (Punjabi)
- tam.traineddata (Tamil)
- tel.traineddata (Telugu)
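To confirm the language packs above are visible to Tesseract, pytesseract (already in requirements.txt) can list the installed languages. A minimal check, assuming the eleven language codes below match your installation:

```python
# check_langs.py - verify that the required Tesseract language packs are installed
import pytesseract

# Language codes the OCR service expects (assumption based on the traineddata list above)
REQUIRED = {"eng", "hin", "ben", "guj", "kan", "mal", "mar", "ori", "pan", "tam", "tel"}

installed = set(pytesseract.get_languages(config=""))
missing = REQUIRED - installed
print("Missing language packs:", ", ".join(sorted(missing)) if missing else "none")
```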
git clone https://github.com/YOUR_USERNAME/image-ocr.git
cd image-ocr

python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt

Create a .env file in the root directory:
# OpenRouter API Configuration
OPENROUTER_API_KEY=sk-or-v1-YOUR_KEY_HERE
# Optional: Add multiple keys separated by comma for fallback
# OPENROUTER_API_KEY=sk-or-v1-key1,sk-or-v1-key2
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
OPENROUTER_MODEL=google/gemini-flash-1.5:free

Get your free API key at: https://openrouter.ai/keys
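These variables are consumed by src/cas/config.py; the repo's actual implementation may differ, but a minimal sketch of how they could be read with python-dotenv (note the comma-split that enables multiple keys) looks like this:

```python
# config_sketch.py - illustrative .env loader (not the repo's config.py)
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

# A comma-separated key list enables fallback between OpenRouter accounts
API_KEYS = [k.strip() for k in os.getenv("OPENROUTER_API_KEY", "").split(",") if k.strip()]
BASE_URL = os.getenv("OPENROUTER_BASE_URL", "https://openrouter.ai/api/v1")
MODEL = os.getenv("OPENROUTER_MODEL", "google/gemini-flash-1.5:free")

if not API_KEYS:
    raise RuntimeError("OPENROUTER_API_KEY is not set - see https://openrouter.ai/keys")
```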
python -m uvicorn main:app --host 0.0.0.0 --port 8000 --reload

The service will be available at:
- Web Interface: http://localhost:8000
- API Documentation: http://localhost:8000/docs
- API Info: http://localhost:8000/api/info
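Once the server is running, a quick smoke test against /api/info (using httpx, which is already in requirements.txt) confirms the service is reachable:

```python
# smoke_test.py - confirm the running service answers on /api/info
import httpx

resp = httpx.get("http://localhost:8000/api/info", timeout=10)
resp.raise_for_status()
print(resp.json())  # API capabilities reported by the service
```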
- Open http://localhost:8000 in your browser
- Drag and drop an image or click to browse
- Click "Process Image"
- View results:
- Combined output (OCR + Caption)
- Original text (if detected)
- Detected language
- English translation (if needed)
Upload and Process Image:
curl -X POST "http://localhost:8000/upload" \
-F "file=@image.jpg"Response:
{
"success": true,
"filename": "image.jpg",
"processing_mode": "with_text",
"has_text": true,
"results": {
"combined_output": "AI caption with OCR text...",
"caption": "AI generated scene description",
"ocr": {
"original_text": "Original text in source language",
"detected_language": "hi",
"translated_text": "Translated English text"
}
}
}
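The same upload can be scripted with httpx; this is a client-side sketch that relies only on the endpoint and response fields shown above:

```python
# upload_client.py - send an image to /upload and print the fused result
import httpx

with open("image.jpg", "rb") as f:
    files = {"file": ("image.jpg", f, "image/jpeg")}
    resp = httpx.post("http://localhost:8000/upload", files=files, timeout=60)

resp.raise_for_status()
data = resp.json()
print(data["results"]["combined_output"])
if data["has_text"]:
    print("Detected language:", data["results"]["ocr"]["detected_language"])
```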
image-ocr/
├── main.py                    # FastAPI application
├── .env                       # Configuration (not in repo)
├── src/cas/
│   ├── __init__.py
│   ├── captioner.py           # Gemini caption generation
│   ├── config.py              # Configuration management
│   ├── fusion.py              # Text deduplication
│   ├── io_utils.py            # Image preprocessing
│   ├── ocr_service.py         # OCR text extraction
│   ├── provider_openrouter.py # OpenRouter API client
│   ├── quality.py             # Post-processing
│   └── unified_processor.py   # Main processing orchestrator
├── static/
│   └── index.html             # Web interface
├── scripts/
│   ├── run_caption.py         # CLI caption generator
│   └── run_fusion.py          # CLI fusion tester
├── data/samples/              # Test images
├── requirements.txt
├── README.md
└── SETUP_GUIDE.md             # GitHub setup instructions
| Variable | Description | Default |
|---|---|---|
| OPENROUTER_API_KEY | OpenRouter API key(s), comma-separated | Required |
| OPENROUTER_BASE_URL | API base URL | https://openrouter.ai/api/v1 |
| OPENROUTER_MODEL | Model to use | google/gemini-flash-1.5:free |
To avoid rate limits, add multiple API keys:
OPENROUTER_API_KEY=sk-or-v1-key1,sk-or-v1-key2,sk-or-v1-key3

The service will automatically switch to the next key if one hits a rate limit.
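The fallback logic lives in src/cas/provider_openrouter.py. As an illustration only (the repo's client may be structured differently), rotating through the configured keys on an HTTP 429 could look like this:

```python
# key_fallback_sketch.py - illustrative key rotation on rate limits (not the repo's client)
import httpx

def chat_with_fallback(api_keys: list[str], base_url: str, model: str, messages: list[dict]) -> dict:
    """Try each OpenRouter key in turn, moving on when a key is rate limited."""
    last_error = None
    for key in api_keys:
        resp = httpx.post(
            f"{base_url}/chat/completions",
            headers={"Authorization": f"Bearer {key}"},
            json={"model": model, "messages": messages},
            timeout=60,
        )
        if resp.status_code == 429:  # this key hit its rate limit, try the next one
            last_error = resp.text
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError(f"All API keys are rate limited: {last_error}")
```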
Image without text:

- Detection: No text found
- Processing: AI caption only
- Output: Scene description

Image with text:

- Detection: Text found via OCR
- Processing: OCR + AI caption in parallel (see the sketch after this list)
- Output: Fused result combining both
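These two paths are orchestrated by src/cas/unified_processor.py. The sketch below is a simplified stand-in, with a hypothetical caption_image() helper in place of the real OpenRouter call, to show how OCR, language detection, translation, and captioning can run in parallel:

```python
# pipeline_sketch.py - simplified parallel OCR + caption flow (the repo's code may differ)
import asyncio

import pytesseract
from PIL import Image
from langdetect import detect
from deep_translator import GoogleTranslator

LANGS = "eng+hin+ben+guj+kan+mal+mar+ori+pan+tam+tel"

def run_ocr(path: str) -> dict:
    """Extract text, detect its language, and translate to English if needed."""
    text = pytesseract.image_to_string(Image.open(path), lang=LANGS).strip()
    if not text:
        return {}
    lang = detect(text)
    translated = text if lang == "en" else GoogleTranslator(source="auto", target="en").translate(text)
    return {"original_text": text, "detected_language": lang, "translated_text": translated}

async def caption_image(path: str) -> str:
    # Placeholder: the real service calls Gemini Flash via OpenRouter here
    return "AI generated scene description"

async def process(path: str) -> dict:
    # OCR is blocking, so it runs in a thread while the caption request is in flight
    ocr, caption = await asyncio.gather(asyncio.to_thread(run_ocr, path), caption_image(path))
    combined = f"{caption}\n\nText in image: {ocr['translated_text']}" if ocr else caption
    return {"has_text": bool(ocr), "results": {"combined_output": combined, "caption": caption, "ocr": ocr}}

if __name__ == "__main__":
    print(asyncio.run(process("image.jpg")))
```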
GET / - Serves the web interface

POST /upload - Upload and process an image

Parameters:
- file: Image file (multipart/form-data)

Response: JSON with processing results

A health check endpoint is also available.

GET /api/info - Get API information and capabilities

GET /docs - Interactive API documentation (Swagger UI)
- Solution 1: Wait 1-2 minutes for free tier reset
- Solution 2: Add multiple API keys for automatic fallback
- Solution 3: Add credits to OpenRouter account
- Verify the installation: tesseract --version
- Check the Tesseract path in ocr_service.py (line 15)
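If the binary is installed but not on PATH, pointing pytesseract at it explicitly (which is what the path in ocr_service.py does) resolves the error; the location below is the default Windows install directory:

```python
# Point pytesseract at the Tesseract binary if it is not on PATH
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
print(pytesseract.get_tesseract_version())  # should report 5.5 or later
```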
- Download from: https://github.com/tesseract-ocr/tessdata
- Place in: C:\Program Files\Tesseract-OCR\tessdata\
- Ensure the .env file is UTF-8 encoded
- Check for special characters in API keys
Development:

uvicorn main:app --reload

Production:

uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4

Docker support is planned for containerized deployment.
- OCR Speed: ~1-2 seconds per image
- AI Caption: ~2-4 seconds per image
- Total (parallel): ~2-4 seconds per image
- Rate Limit (free): ~10 requests/minute per API key
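These figures depend on image size, Tesseract configuration, and OpenRouter latency; a rough way to measure end-to-end time on your own images is to time the /upload call:

```python
# latency_check.py - rough end-to-end timing of the /upload endpoint
import time

import httpx

start = time.perf_counter()
with open("image.jpg", "rb") as f:
    resp = httpx.post("http://localhost:8000/upload", files={"file": f}, timeout=120)
resp.raise_for_status()
print(f"Processed in {time.perf_counter() - start:.1f} s")
```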
Contributions are welcome! Please feel free to submit a Pull Request.
MIT License - feel free to use in your projects!
- Tesseract OCR - Google's open-source OCR engine
- OpenRouter - Unified API for LLM access
- Google Gemini - AI image understanding
- Deep Translator - Multi-language translation
For issues and questions:
- Check the troubleshooting section
- Review the API documentation at /docs
- Open an issue on GitHub
- [x] Multilingual OCR support
- [x] AI caption generation
- [x] Web interface
- [x] Multi-API key fallback
- [ ] Docker containerization
- [ ] Batch processing
- [ ] REST API authentication
- [ ] WebSocket support for real-time processing
- [ ] Support for more languages
- [ ] PDF document processing
Built with ❤️ for multilingual image understanding