A Python middleware service that ingests community data from various sources (Reddit, Discord, etc.) about development platforms (Lovable, Replit, etc.) and stores it in the Chroma Cloud vector database.
- FastAPI REST API with bearer token authentication
- Chroma Cloud integration with automatic collection management
- Smart content chunking for optimal vector embedding
- Duplicate handling with upsert operations
- Parent-child relationships for comments and posts
- Comprehensive error handling and logging
- Health monitoring endpoint
pip install -r requirements.txt

Now with automated Azure deployment via GitHub Actions!
Copy the example environment file and update with your values:
cp .env.example .env

Edit the .env file:
BEARER_TOKEN=your-secure-bearer-token-here
CHROMA_API_KEY=your-chroma-api-key
CHROMA_TENANT=your-chroma-tenant-id
CHROMA_DATABASE=weavy_community_intelligence
python main.py

The service will start on http://localhost:8000.
Ingest community content into Chroma database.
Query Parameters:
test (optional, boolean): use the test collection if true
Headers:
Authorization: Bearer YOUR_TOKEN
Content-Type: application/json
Request Body:
{
"platform": "Lovable",
"source": "Reddit",
"id": "abc123",
"timestamp": "2024-01-15T10:30:00Z",
"deeplink": "https://reddit.com/r/programming/comments/abc123",
"author": "https://reddit.com/u/testuser",
"title": "Test Post Title",
"body": "This is the main content of the post that will be embedded",
"isComment": false
}

Response:

{"status": "success"}

Check service health status.
Response:
{"status": "healthy", "chroma": "connected"}

curl -X POST "http://localhost:8000/ingest?test=true" \
-H "Authorization: Bearer your-token" \
-H "Content-Type: application/json" \
-d '{
"platform": "Lovable",
"source": "Reddit",
"id": "abc123",
"timestamp": "2024-01-15T10:30:00Z",
"deeplink": "https://reddit.com/r/programming/comments/abc123",
"author": "https://reddit.com/u/testuser",
"title": "Test Post Title",
"body": "This is the main content of the post that will be embedded",
"isComment": false
}'

curl -X POST "http://localhost:8000/ingest" \
-H "Authorization: Bearer your-token" \
-H "Content-Type: application/json" \
-d '{
"platform": "Replit",
"source": "Discord",
"id": "xyz789",
"timestamp": "2024-01-15T10:35:00Z",
"deeplink": "https://discord.com/channels/123/456/789",
"author": "https://discord.com/users/123456789",
"title": "",
"body": "This is a comment on the original post",
"isComment": true
}'

curl http://localhost:8000/health

- Posts: {platform}_{source}_post_{id}
- Comments: {platform}_{source}_comment_{id}_{unique_hash}
- Chunks: base ID + _chunk_{index} (if content is chunked)
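The ID scheme above can be sketched as a small helper. This is illustrative only: the exact hashing choice for the comment suffix (here, a truncated SHA-1 of the deeplink) is an assumption, not confirmed by the source.

```python
import hashlib

def make_doc_id(platform: str, source: str, item_id: str,
                is_comment: bool, deeplink: str = "") -> str:
    """Build a document ID following the {platform}_{source}_... scheme."""
    base = f"{platform}_{source}"
    if is_comment:
        # Hypothetical uniqueness suffix; the real service may derive it differently.
        unique_hash = hashlib.sha1(deeplink.encode()).hexdigest()[:8]
        return f"{base}_comment_{item_id}_{unique_hash}"
    return f"{base}_post_{item_id}"

def chunk_id(base_id: str, index: int) -> str:
    # Appended only when content is split into multiple chunks.
    return f"{base_id}_chunk_{index}"
```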
- Maximum ~512 tokens per chunk
- Sentence boundary preservation
- Context overlap between chunks (~50 tokens)
- Title included with each chunk for posts
- Comments are typically not chunked unless very long
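The chunking rules above can be sketched roughly as follows. This is a simplified stand-in: it approximates token counts with whitespace-separated words (a real embedding tokenizer would count differently), and the exact overlap mechanics of the actual `utils/chunking.py` may differ.

```python
import re

def chunk_text(text: str, max_tokens: int = 512, overlap: int = 50) -> list[str]:
    """Split text on sentence boundaries into ~max_tokens chunks with overlap."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry roughly `overlap` trailing tokens into the next chunk for context.
            tail = " ".join(" ".join(current).split()[-overlap:])
            current, count = [tail], len(tail.split())
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Short content (e.g. most comments) falls under the budget and comes back as a single chunk.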
- Production: community_content
- Test: community_content-test
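Collection routing and duplicate handling can be sketched as below, assuming the chromadb package's client API; treat the wiring as illustrative rather than the service's actual code.

```python
def collection_name(test: bool = False) -> str:
    # The ?test=true query parameter routes writes to the test collection.
    return "community_content-test" if test else "community_content"

def upsert_document(collection, doc_id: str, text: str, metadata: dict) -> None:
    # Chroma's upsert makes re-ingesting the same ID idempotent,
    # which is how duplicate submissions are absorbed.
    collection.upsert(ids=[doc_id], documents=[text], metadatas=[metadata])
```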
Each document includes comprehensive metadata:
{
"platform": str, # Lovable, Replit, etc.
"source": str, # Reddit, Discord, etc.
"original_id": str, # Original ID from request
"timestamp": str, # ISO 8601 timestamp
"deeplink": str, # Direct URL to content
"author": str, # Author URL
"title": str, # Content title
"is_comment": bool, # Comment flag
"parent_post_id": str, # Post ID if comment
"chunk_index": int, # 0-based chunk index
"total_chunks": int, # Total chunks for content
"ingested_at": str # Server ingestion timestamp
}

- 400: Invalid request format or missing fields
- 403: Invalid authentication token
- 500: Internal server error (embedding failures, etc.)
- 503: Service unavailable (Chroma connection issues)
The service provides comprehensive logging including:
- Request processing status
- Chunking information
- Collection operations
- Error details
- Performance metrics
Logs are written to stdout in structured format suitable for log aggregation.
- Bearer token authentication for all endpoints
- Input validation on all fields
- No sensitive data logged
- Environment-based configuration
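The bearer check can be sketched as a small helper; the parsing shown here is a generic illustration of what an auth dependency would do, not the service's exact code.

```python
import hmac
from typing import Optional

def check_bearer(authorization: Optional[str], expected: str) -> bool:
    """Return True if the Authorization header carries the expected bearer token."""
    if not authorization or not authorization.startswith("Bearer "):
        return False
    token = authorization[len("Bearer "):]
    # compare_digest is constant-time, so timing doesn't leak token prefixes.
    return hmac.compare_digest(token, expected)
```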
project/
├── main.py              # Main FastAPI application
├── .env                 # Environment variables (not in git)
├── .env.example         # Example environment file
├── requirements.txt     # Python dependencies
├── README.md            # This file
└── utils/
    ├── __init__.py
    ├── chunking.py      # Content chunking logic
    └── chroma_client.py # Chroma connection management
- Install dependencies: pip install -r requirements.txt
- Set up environment variables in .env
- Run the service: python main.py
- View API docs: http://localhost:8000/docs
The service includes automatic API documentation via FastAPI's built-in Swagger UI.
- Use a proper secret management system instead of .env files
- Implement rate limiting for production use
- Monitor Chroma storage usage and implement cleanup strategies
- Consider adding retry logic with exponential backoff
- Set up proper log aggregation and monitoring
- Use a process manager like systemd or docker for deployment
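The retry suggestion above can be sketched as a generic wrapper with exponential backoff and jitter; the attempt counts and delays are illustrative defaults, not values from the service.

```python
import random
import time

def with_retries(fn, attempts: int = 4, base_delay: float = 0.5):
    """Call fn(), retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # 0.5s, 1s, 2s, ... plus up to 25% jitter to avoid thundering herds.
            delay = base_delay * (2 ** attempt)
            time.sleep(delay * (1 + random.random() * 0.25))
```

A Chroma upsert, for example, could be wrapped as `with_retries(lambda: collection.upsert(...))`.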