A unified web service that combines multilingual OCR text extraction with AI-powered image captioning using Gemini Flash.
- Automatic Text Detection - Intelligently detects whether an image contains text
- Multilingual OCR - Supports English + 10 Indian languages:
  - Hindi, Bengali, Gujarati, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, Telugu
- AI Image Captioning - Scene descriptions powered by Gemini 1.5 Flash
- Smart Fusion - Combines OCR text + AI captions for complete context
- Translation - Automatically translates non-English text to English
- Parallel Processing - OCR and captioning run simultaneously
- Multi-API Key Support - Automatic fallback between multiple OpenRouter keys
- Web Interface - Beautiful drag-and-drop UI
- Python 3.11+ (tested with Python 3.13.7)
- Tesseract OCR 5.5+ installed on Windows
- OpenRouter API Key (free tier available at https://openrouter.ai/keys)
fastapi
uvicorn[standard]
python-multipart
aiofiles
Pillow
pytesseract
deep-translator
langdetect
httpx
tenacity
python-dotenv
openai
Windows (using Winget):
winget install UB-Mannheim.TesseractOCR

Language packs should be downloaded from https://github.com/tesseract-ocr/tessdata and placed in C:\Program Files\Tesseract-OCR\tessdata\
Required files:
- hin.traineddata (Hindi)
- ben.traineddata (Bengali)
- guj.traineddata (Gujarati)
- kan.traineddata (Kannada)
- mal.traineddata (Malayalam)
- mar.traineddata (Marathi)
- ori.traineddata (Odia)
- pan.traineddata (Punjabi)
- tam.traineddata (Tamil)
- tel.traineddata (Telugu)
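To confirm the language packs above are visible to Tesseract, pytesseract (already in requirements.txt) can list the installed languages. A minimal check, assuming the eleven language codes below match your installation:

```python
# check_langs.py - verify that the required Tesseract language packs are installed
import pytesseract

# Language codes the OCR service expects (assumption based on the traineddata list above)
REQUIRED = {"eng", "hin", "ben", "guj", "kan", "mal", "mar", "ori", "pan", "tam", "tel"}

installed = set(pytesseract.get_languages(config=""))
missing = REQUIRED - installed
print("Missing language packs:", ", ".join(sorted(missing)) if missing else "none")
```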
git clone https://github.com/YOUR_USERNAME/image-ocr.git
cd image-ocr

python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt

Create a .env file in the root directory:
# OpenRouter API Configuration
OPENROUTER_API_KEY=sk-or-v1-YOUR_KEY_HERE
# Optional: Add multiple keys separated by comma for fallback
# OPENROUTER_API_KEY=sk-or-v1-key1,sk-or-v1-key2
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
OPENROUTER_MODEL=google/gemini-flash-1.5:free

Get your free API key at: https://openrouter.ai/keys
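These variables are consumed by src/cas/config.py; the repo's actual implementation may differ, but a minimal sketch of how they could be read with python-dotenv (note the comma-split that enables multiple keys) looks like this:

```python
# config_sketch.py - illustrative .env loader (not the repo's config.py)
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

# A comma-separated key list enables fallback between OpenRouter accounts
API_KEYS = [k.strip() for k in os.getenv("OPENROUTER_API_KEY", "").split(",") if k.strip()]
BASE_URL = os.getenv("OPENROUTER_BASE_URL", "https://openrouter.ai/api/v1")
MODEL = os.getenv("OPENROUTER_MODEL", "google/gemini-flash-1.5:free")

if not API_KEYS:
    raise RuntimeError("OPENROUTER_API_KEY is not set - see https://openrouter.ai/keys")
```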
python -m uvicorn main:app --host 0.0.0.0 --port 8000 --reload

The service will be available at:
- Web Interface: http://localhost:8000
- API Documentation: http://localhost:8000/docs
- API Info: http://localhost:8000/api/info
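Once the server is running, a quick smoke test against /api/info (using httpx, which is already in requirements.txt) confirms the service is reachable:

```python
# smoke_test.py - confirm the running service answers on /api/info
import httpx

resp = httpx.get("http://localhost:8000/api/info", timeout=10)
resp.raise_for_status()
print(resp.json())  # API capabilities reported by the service
```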
- Open http://localhost:8000 in your browser
- Drag and drop an image or click to browse
- Click "Process Image"
- View results:
- Combined output (OCR + Caption)
- Original text (if detected)
- Detected language
- English translation (if needed)
Upload and Process Image:
curl -X POST "http://localhost:8000/upload" \
-F "file=@image.jpg"Response:
{
"success": true,
"filename": "image.jpg",
"processing_mode": "with_text",
"has_text": true,
"results": {
"combined_output": "AI caption with OCR text...",
"caption": "AI generated scene description",
"ocr": {
"original_text": "Original text in source language",
"detected_language": "hi",
"translated_text": "Translated English text"
}
}
}
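The same upload can be scripted with httpx; this is a client-side sketch that relies only on the endpoint and response fields shown above:

```python
# upload_client.py - send an image to /upload and print the fused result
import httpx

with open("image.jpg", "rb") as f:
    files = {"file": ("image.jpg", f, "image/jpeg")}
    resp = httpx.post("http://localhost:8000/upload", files=files, timeout=60)

resp.raise_for_status()
data = resp.json()
print(data["results"]["combined_output"])
if data["has_text"]:
    print("Detected language:", data["results"]["ocr"]["detected_language"])
```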
image-ocr/
├── main.py                    # FastAPI application
├── .env                       # Configuration (not in repo)
├── src/cas/
│   ├── __init__.py
│   ├── captioner.py           # Gemini caption generation
│   ├── config.py              # Configuration management
│   ├── fusion.py              # Text deduplication
│   ├── io_utils.py            # Image preprocessing
│   ├── ocr_service.py         # OCR text extraction
│   ├── provider_openrouter.py # OpenRouter API client
│   ├── quality.py             # Post-processing
│   └── unified_processor.py   # Main processing orchestrator
├── static/
│   └── index.html             # Web interface
├── scripts/
│   ├── run_caption.py         # CLI caption generator
│   └── run_fusion.py          # CLI fusion tester
├── data/samples/              # Test images
├── requirements.txt
├── README.md
└── SETUP_GUIDE.md             # GitHub setup instructions
| Variable | Description | Default |
|---|---|---|
| OPENROUTER_API_KEY | OpenRouter API key(s), comma-separated | Required |
| OPENROUTER_BASE_URL | API base URL | https://openrouter.ai/api/v1 |
| OPENROUTER_MODEL | Model to use | google/gemini-flash-1.5:free |
To avoid rate limits, add multiple API keys:
OPENROUTER_API_KEY=sk-or-v1-key1,sk-or-v1-key2,sk-or-v1-key3

The service will automatically switch to the next key if one hits a rate limit.
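The fallback logic lives in src/cas/provider_openrouter.py. As an illustration only (the repo's client may be structured differently), rotating through the configured keys on an HTTP 429 could look like this:

```python
# key_fallback_sketch.py - illustrative key rotation on rate limits (not the repo's client)
import httpx

def chat_with_fallback(api_keys: list[str], base_url: str, model: str, messages: list[dict]) -> dict:
    """Try each OpenRouter key in turn, moving on when a key is rate limited."""
    last_error = None
    for key in api_keys:
        resp = httpx.post(
            f"{base_url}/chat/completions",
            headers={"Authorization": f"Bearer {key}"},
            json={"model": model, "messages": messages},
            timeout=60,
        )
        if resp.status_code == 429:  # this key hit its rate limit, try the next one
            last_error = resp.text
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError(f"All API keys are rate limited: {last_error}")
```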
Image without text:

- Detection: No text found
- Processing: AI caption only
- Output: Scene description

Image with text:

- Detection: Text found via OCR
- Processing: OCR + AI caption in parallel (see the sketch after this list)
- Output: Fused result combining both
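These two paths are orchestrated by src/cas/unified_processor.py. The sketch below is a simplified stand-in, with a hypothetical caption_image() helper in place of the real OpenRouter call, to show how OCR, language detection, translation, and captioning can run in parallel:

```python
# pipeline_sketch.py - simplified parallel OCR + caption flow (the repo's code may differ)
import asyncio

import pytesseract
from PIL import Image
from langdetect import detect
from deep_translator import GoogleTranslator

LANGS = "eng+hin+ben+guj+kan+mal+mar+ori+pan+tam+tel"

def run_ocr(path: str) -> dict:
    """Extract text, detect its language, and translate to English if needed."""
    text = pytesseract.image_to_string(Image.open(path), lang=LANGS).strip()
    if not text:
        return {}
    lang = detect(text)
    translated = text if lang == "en" else GoogleTranslator(source="auto", target="en").translate(text)
    return {"original_text": text, "detected_language": lang, "translated_text": translated}

async def caption_image(path: str) -> str:
    # Placeholder: the real service calls Gemini Flash via OpenRouter here
    return "AI generated scene description"

async def process(path: str) -> dict:
    # OCR is blocking, so it runs in a thread while the caption request is in flight
    ocr, caption = await asyncio.gather(asyncio.to_thread(run_ocr, path), caption_image(path))
    combined = f"{caption}\n\nText in image: {ocr['translated_text']}" if ocr else caption
    return {"has_text": bool(ocr), "results": {"combined_output": combined, "caption": caption, "ocr": ocr}}

if __name__ == "__main__":
    print(asyncio.run(process("image.jpg")))
```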
GET / - Serves the web interface

POST /upload - Upload and process an image

Parameters:
- file: Image file (multipart/form-data)

Response: JSON with processing results

A health check endpoint is also available.

GET /api/info - Get API information and capabilities

GET /docs - Interactive API documentation (Swagger UI)
- Solution 1: Wait 1-2 minutes for free tier reset
- Solution 2: Add multiple API keys for automatic fallback
- Solution 3: Add credits to OpenRouter account
- Verify the installation: tesseract --version
- Check the Tesseract path in ocr_service.py (line 15)
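If the binary is installed but not on PATH, pointing pytesseract at it explicitly (which is what the path in ocr_service.py does) resolves the error; the location below is the default Windows install directory:

```python
# Point pytesseract at the Tesseract binary if it is not on PATH
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
print(pytesseract.get_tesseract_version())  # should report 5.5 or later
```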
- Download from: https://github.com/tesseract-ocr/tessdata
- Place in: C:\Program Files\Tesseract-OCR\tessdata\
- Ensure the .env file is UTF-8 encoded
- Check for special characters in API keys
Development:

uvicorn main:app --reload

Production:

uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4

Docker support is planned for containerized deployment.
- OCR Speed: ~1-2 seconds per image
- AI Caption: ~2-4 seconds per image
- Total (parallel): ~2-4 seconds per image
- Rate Limit (free): ~10 requests/minute per API key
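These figures depend on image size, Tesseract configuration, and OpenRouter latency; a rough way to measure end-to-end time on your own images is to time the /upload call:

```python
# latency_check.py - rough end-to-end timing of the /upload endpoint
import time

import httpx

start = time.perf_counter()
with open("image.jpg", "rb") as f:
    resp = httpx.post("http://localhost:8000/upload", files={"file": f}, timeout=120)
resp.raise_for_status()
print(f"Processed in {time.perf_counter() - start:.1f} s")
```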
Contributions are welcome! Please feel free to submit a Pull Request.
MIT License - feel free to use in your projects!
- Tesseract OCR - Google's open-source OCR engine
- OpenRouter - Unified API for LLM access
- Google Gemini - AI image understanding
- Deep Translator - Multi-language translation
For issues and questions:
- Check the troubleshooting section
- Review the API documentation at /docs
- Open an issue on GitHub
- [x] Multilingual OCR support
- [x] AI caption generation
- [x] Web interface
- [x] Multi-API key fallback
- [ ] Docker containerization
- [ ] Batch processing
- [ ] REST API authentication
- [ ] WebSocket support for real-time processing
- [ ] Support for more languages
- [ ] PDF document processing
Built with ❤️ for multilingual image understanding