An educational RAG (Retrieval-Augmented Generation) system with a FastAPI backend, React frontend, Qdrant vector database, and vLLM for inference.
- Document Management: Upload and process PDF, DOCX, TXT, MD, HTML, and XML files
- Vector Search: Semantic search using Qdrant and SentenceTransformers
- RAG Query: Answer questions based on document content
- Streaming Responses: Real-time token streaming using Server-Sent Events
- Chat History: Persistent chat sessions with conversation context
- Multiple Interfaces: Web UI with 6 specialized tabs
- Flexible Deployment: Docker or native installation
┌─────────────┐ ┌──────────────┐ ┌─────────────┐
│ Frontend │────▶│ Backend │────▶│ Qdrant │
│ (React) │ │ (FastAPI) │ │ (Vectors) │
│ Port 3000 │ │ Port 8000 │ │ Port 6333 │
└─────────────┘ └──────┬───────┘ └─────────────┘
│
▼
┌──────────────┐
│ vLLM │
│ (Llama 3.2) │
│ Port 8001 │
└──────────────┘
- Python 3.10+
- Node.js 18+
- Docker (optional, for containerized deployment)
- 16GB+ RAM recommended
- GPU recommended (for vLLM)
# Clone/navigate to repository
cd workshop-rag
# Run setup script
./scripts/setup_all.sh
# Start all services
./scripts/start_all.sh

1. Backend Setup
cd backend
./setup.sh
source .venv/bin/activate

2. Download Model
cd ..
./scripts/download_model.sh

3. Start Services
Terminal 1 - Qdrant:
./scripts/start_qdrant.sh

Terminal 2 - vLLM:
./scripts/start_vllm.sh

Terminal 3 - Backend:
cd backend
source .venv/bin/activate
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

4. Frontend Setup
cd frontend
npm install
npm run dev

Visit http://localhost:3000
On Apple Silicon, install vLLM using vllm-metal.
- Navigate to Upload Documents tab
- Select files (PDF, DOCX, TXT, MD, HTML, XML)
- Click Upload
- Documents are automatically chunked and embedded
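Under the hood, the chunk-and-embed step works roughly like the sketch below. This is an illustration only (word-based splitting, a hypothetical input file), not the backend's actual code; the real logic lives in backend/app/services and uses the CHUNK_SIZE/CHUNK_OVERLAP settings.

```python
# Illustrative chunk-and-embed sketch, not the backend's actual implementation.
from sentence_transformers import SentenceTransformer

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 128) -> list[str]:
    """Split text into overlapping chunks (word-based here; the backend works on tokens)."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

model = SentenceTransformer("all-MiniLM-L6-v2")       # 384-dimensional embeddings
chunks = chunk_text(open("data/example.txt").read())  # hypothetical example file
embeddings = model.encode(chunks)                     # one vector per chunk, ready for Qdrant
```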
- Navigate to Query Documents tab
- Enter your question
- Adjust parameters (temperature, top-k, etc.)
- View streaming response and retrieved sources
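The streaming endpoint can also be consumed from a script as Server-Sent Events. The request field names below (question, top_k, temperature) and the event format are assumptions; check backend/app/schemas for the actual request model.

```python
# Minimal SSE client for the streaming query endpoint (field names are assumptions).
import requests

payload = {"question": "What is retrieval-augmented generation?", "top_k": 5, "temperature": 0.7}
with requests.post(
    "http://localhost:8000/api/v1/query/query/stream",
    json=payload,
    stream=True,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if line and line.startswith("data:"):
            # Each event carries a token or a JSON chunk, depending on the backend's format
            print(line[len("data:"):].strip(), end="", flush=True)
```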
- Navigate to Chat History tab
- Create new chat session
- Ask questions with conversation context
- View and manage chat history
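The same flow can be driven from a script using the chat endpoints listed later in this README; the response field "session_id" below is an assumption, so inspect the actual responses.

```python
# Sketch of managing chat sessions over the documented chat endpoints.
import requests

BASE = "http://localhost:8000/api/v1/chat"

session = requests.post(f"{BASE}/new").json()               # create a new chat session
session_id = session.get("session_id") or session.get("id") # field name is an assumption

print(requests.get(f"{BASE}/list").json())                  # list all sessions
print(requests.get(f"{BASE}/{session_id}").json())          # fetch this session's history
requests.delete(f"{BASE}/{session_id}")                     # delete the session
```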
Edit backend/.env:
# LLM Settings
LLM_MODEL=meta-llama/Llama-3.2-3B-Instruct
LLM_TEMPERATURE=0.7
LLM_MAX_TOKENS=512
# Document Processing
CHUNK_SIZE=512
CHUNK_OVERLAP=128
# Qdrant
QDRANT_HOST=localhost
QDRANT_PORT=6333
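For reference, FastAPI backends typically load values like these with pydantic-settings. The sketch below only illustrates the idea; the project's actual settings class lives in backend/app/core and its names may differ.

```python
# Illustrative settings loader (the real one lives in backend/app/core and may differ).
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # Env var names (LLM_MODEL, CHUNK_SIZE, ...) match these fields case-insensitively.
    model_config = SettingsConfigDict(env_file=".env")

    llm_model: str = "meta-llama/Llama-3.2-3B-Instruct"
    llm_temperature: float = 0.7
    llm_max_tokens: int = 512
    chunk_size: int = 512
    chunk_overlap: int = 128
    qdrant_host: str = "localhost"
    qdrant_port: int = 6333

settings = Settings()
print(settings.qdrant_host, settings.qdrant_port)
```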
Edit frontend/.env:

VITE_API_URL=http://localhost:8000

- POST /api/v1/documents/upload - Upload document
- GET /api/v1/documents/list - List all documents
- DELETE /api/v1/documents/{id} - Delete document
- POST /api/v1/documents/sync - Sync from data folder
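A quick way to exercise the upload endpoint from a script; the multipart field name "file" is an assumption, so check the route in backend/app/api if the request is rejected.

```python
# Upload a local file to the documents endpoint, then list documents.
# The multipart field name "file" and the example path are assumptions.
import requests

with open("data/example.pdf", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/api/v1/documents/upload",
        files={"file": ("example.pdf", f, "application/pdf")},
    )
resp.raise_for_status()
print(resp.json())

# List what the backend now has indexed
print(requests.get("http://localhost:8000/api/v1/documents/list").json())
```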
- POST /api/v1/query/query - Non-streaming query
- POST /api/v1/query/query/stream - Streaming query (SSE)
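A minimal non-streaming query call; the request and response field names are assumptions, so check backend/app/schemas for the actual models.

```python
# Non-streaming RAG query (field names are assumptions).
import requests

resp = requests.post(
    "http://localhost:8000/api/v1/query/query",
    json={"question": "Summarize the uploaded documents.", "top_k": 5},
)
resp.raise_for_status()
print(resp.json())  # typically the generated answer plus the retrieved source chunks
```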
- POST /api/v1/chat/new - Create session
- GET /api/v1/chat/list - List sessions
- GET /api/v1/chat/{id} - Get history
- DELETE /api/v1/chat/{id} - Delete session
workshop-rag/
├── backend/
│ ├── app/
│ │ ├── api/ # API routes
│ │ ├── core/ # Configuration
│ │ ├── models/ # Data models
│ │ ├── schemas/ # Pydantic schemas
│ │ ├── services/ # Business logic
│ │ └── main.py # FastAPI app
│ ├── pyproject.toml
│ └── setup.sh
├── frontend/
│ ├── src/
│ │ ├── components/ # React components
│ │ ├── services/ # API client
│ │ └── App.tsx
│ └── package.json
├── data/ # Document storage
├── chat_history/ # Chat sessions
├── qdrant_storage/ # Vector DB
├── models/ # Downloaded models
└── scripts/ # Setup scripts
cd backend
source .venv/bin/activate
# Run with auto-reload
uvicorn app.main:app --reload
# Run tests
pytest
# Format code
black app/
isort app/

cd frontend
# Development server
npm run dev
# Build for production
npm run build
# Type checking
npm run type-check

- Check if Qdrant is running: curl http://localhost:6333
- Check if vLLM is running: curl http://localhost:8001/v1/models
- Verify .env configuration
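The same checks can be scripted. The sketch below assumes the default ports from the configuration above and FastAPI's default /docs route for the backend.

```python
# Quick connectivity check for the three services (default ports assumed).
import requests

checks = {
    "Qdrant": "http://localhost:6333",
    "vLLM": "http://localhost:8001/v1/models",
    "Backend": "http://localhost:8000/docs",  # FastAPI serves interactive docs here by default
}

for name, url in checks.items():
    try:
        status = requests.get(url, timeout=5).status_code
        print(f"{name}: HTTP {status}")
    except requests.RequestException as exc:
        print(f"{name}: unreachable ({exc})")
```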
- Login to HuggingFace: huggingface-cli login
- Check disk space (need ~6.5GB)
- Verify internet connection

- Reduce MAX_MODEL_LEN in vLLM config
- Use smaller batch sizes
- Consider using CPU-only mode

- Enable GPU support for vLLM
- Reduce LLM_MAX_TOKENS
- Use tensor parallelism for multi-GPU
- Embedding Model: all-MiniLM-L6-v2 (384 dimensions)
- LLM: Llama 3.2 3B Instruct (8-bit quantization)
- Chunking: 512 tokens with 128 token overlap
- Vector Distance: Cosine similarity
- Context Window: 8192 tokens
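Putting the retrieval pieces together, embedding a query with all-MiniLM-L6-v2 and searching Qdrant by cosine similarity looks roughly like this; the collection name "documents" is an assumption.

```python
# Sketch of the retrieval step: embed a query and search Qdrant by cosine similarity.
# The collection name "documents" is an assumption; the backend's actual name may differ.
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim embeddings
client = QdrantClient(host="localhost", port=6333)

query_vector = model.encode("What does the workshop cover?").tolist()
hits = client.search(
    collection_name="documents",
    query_vector=query_vector,
    limit=5,                                      # top-k chunks
)
for hit in hits:
    print(hit.score, hit.payload)                 # cosine score + stored chunk metadata
```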
See LICENSE file for details.
- Fork the repository
- Create feature branch
- Commit changes
- Push to branch
- Open pull request
For issues and questions, please open a GitHub issue.
