Muxite/webRAG

Euglena

A scalable Graph-of-Thoughts (GoT) agent service for web research using task decomposition and parallel reasoning to increase accuracy and reduce cost.

Euglena is an agent that combines web crawling with retrieval-augmented generation. Each task decomposes into parallel subproblems (search, visit, save, think) whose results merge into structured deliverables. Context persists in ChromaDB. A dynamic beam width and a token-efficient workflow keep costs down, and the structured reasoning lets cheaper models perform well.
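The decompose-then-merge idea can be illustrated with a minimal sketch. It is not the project's engine: `Node`, `expand`, `merge`, and the `propose` callback are hypothetical names, the scores are hard-coded instead of coming from an LLM, and parallel execution is left out for brevity.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Node:
    """One thought in the DAG. All names here are hypothetical."""
    action: str   # one of: search, visit, save, think
    detail: str
    score: float = 0.0
    children: list["Node"] = field(default_factory=list)
    result: str = ""


def expand(node, propose, beam_width):
    """Ask `propose` for candidate children and keep the best `beam_width`."""
    candidates = propose(node)
    node.children = heapq.nlargest(beam_width, candidates, key=lambda n: n.score)
    return node.children


def merge(node):
    """Merge results upward: a parent's result combines its children's results."""
    if not node.children:
        return node.result or f"[{node.action}] {node.detail}"
    parts = [merge(child) for child in node.children]
    node.result = f"{node.detail}: " + "; ".join(parts)
    return node.result


def propose(node):
    # Stand-in for an LLM call that suggests and scores sub-steps.
    return [
        Node("search", "find sources", score=0.9),
        Node("think", "guess from memory", score=0.2),
        Node("visit", "read top result", score=0.7),
    ]


root = Node("think", "answer the question")
kept = expand(root, propose, beam_width=2)
summary = merge(root)
```

With a beam width of 2, the low-scoring "guess from memory" branch is dropped before it consumes any budget, and only the two surviving branches contribute to the merged result.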

Live: https://euglena.vercel.app/

Note: The visit system does not work on deployment due to website anti-bot measures that detect AWS datacenter IPs. The agent demonstrates resilience by working with search results and other available data sources. This website will be available until 2026-03-16 00:00 PST.

Benchmark Results

64 runs across 16 tests x 2 models x 2 execution variants.

Graph vs Sequential

Graph-of-Thought scores 26.8% higher than sequential and uses 29.3% fewer tokens, but takes about twice as long per run (688s vs 349s).

| Metric | Graph | Sequential | Delta | % Change |
| --- | --- | --- | --- | --- |
| Avg Score | 0.921 | 0.727 | +0.195 | +26.8% |
| Pass Rate | 90.6% | 46.9% | +43.8pp | |
| Avg Cost | $0.04 | $0.06 | -$0.01 | -22.5% |
| Avg Tokens | 22.6k | 32.0k | -9.4k | -29.3% |
| Avg Duration | 688s | 349s | +339s | +97.2% |
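The % Change column appears to be the relative difference of Graph against the Sequential baseline. The sketch below recomputes it from the rounded values shown in the table. That reading of the column is an assumption, and because the inputs are already rounded, the results land within about 0.1 percentage points of the published figures rather than matching them exactly.

```python
def percent_change(graph, sequential):
    """Relative change of the graph variant against the sequential baseline, in percent."""
    return (graph - sequential) / sequential * 100


score_change = percent_change(0.921, 0.727)    # close to the reported +26.8%
token_change = percent_change(22.6, 32.0)      # close to the reported -29.3%
duration_change = percent_change(688, 349)     # close to the reported +97.2%

# Pass rate is reported as a difference in percentage points, not a relative change.
pass_rate_delta_pp = 90.6 - 46.9
```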

[Figure: Executive Summary dashboard]

Overall Leaderboard

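| Rank | System | Avg Score | Median | Std | Pass % | $/run | Time |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | gpt-5.2 [graph] | 0.930 | 0.979 | 0.094 | 93.8% | $0.07 | 471s |
| 2 | gpt-5-mini [graph] | 0.912 | 0.931 | 0.097 | 87.5% | $0.01 | 904s |
| 3 | gpt-5.2 [sequential] | 0.729 | 0.746 | 0.172 | 50.0% | $0.09 | 200s |
| 4 | gpt-5-mini [sequential] | 0.724 | 0.697 | 0.164 | 43.8% | $0.03 | 497s |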

Efficiency

[Figure: Efficiency Dashboard]

Features

  • Graph-of-Thought reasoning: Tasks decompose into parallel subproblems (search, visit, think, save), then merge results upward through the DAG into structured deliverables
  • Dual execution modes: graph (parallel branching with best-first selection) and sequential (generate candidates, pick one, and follow a single depth-first path) for A/B comparison
  • Bot-resistant web access: Primary aiohttp connector with automatic undetected-chromedriver fallback on 403/401
  • Long-term memory (RAG): Crawled content is chunked and embedded into ChromaDB, queryable across tasks and reasoning steps
  • Dynamic beam width: Branching factor adapts to score quality. Expands exploration when scores are low, narrows when confident
  • Deduplication and pruning: Candidate thoughts are deduplicated by embedding similarity. Low-scoring nodes are pruned to save budget
  • Elastic worker fleet: ECS autoscaling matches demand via CloudWatch queue-depth metrics, winds down when idle
  • User-scoped quotas: Supabase enforces per-user daily usage limits with JWT authentication
  • Comprehensive test suite: 39 priority-ordered tests with programmatic and LLM-based validation
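Two of the features above, dynamic beam width and deduplication by embedding similarity, can be sketched in a few lines. This is an illustration under stated assumptions, not the project's implementation: the linear widening rule, the `confident` cutoff of 0.8, the similarity `threshold` of 0.95, and all function names are hypothetical, and the vectors are tiny hand-written stand-ins for real embeddings.

```python
import math


def dynamic_beam_width(scores, min_width=1, max_width=5, confident=0.8):
    """Widen the beam when the best score is low, narrow it when it is high."""
    if not scores:
        return max_width          # nothing scored yet: explore broadly
    best = max(scores)
    if best >= confident:
        return min_width          # confident: follow the best path only
    # Interpolate linearly from max_width (best = 0) down to min_width.
    frac = best / confident
    return max(min_width, round(max_width - frac * (max_width - min_width)))


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def deduplicate(candidates, threshold=0.95):
    """Keep a (text, vector) candidate only if it is not near-identical to a kept one."""
    kept = []
    for text, vec in candidates:
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((text, vec))
    return kept


candidates = [
    ("search for sources", [1.0, 0.0]),
    ("look for sources", [0.999, 0.01]),   # nearly the same direction as the first
    ("summarize findings", [0.0, 1.0]),
]
unique = deduplicate(candidates)
```

Running deduplication before scoring avoids paying for an LLM evaluation of thoughts that differ only in wording.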

Observability

Structured telemetry at every layer without cluttering business logic.

| Layer | What Is Tracked | Where |
| --- | --- | --- |
| Connectors | Every HTTP request, LLM call, search query, browser fetch. Timing, status, payload size | ConnectorBase._record_timing, _record_io |
| AgentIO | Unified interface telemetry. Visit/search/store/retrieve with fallback tracking | AgentIO methods |
| Engine | Step-by-step DAG traversal. Expansion, evaluation, selection, merge, pruning events | IdeaDagEngine logger |
| GoT Operations | Embedding, deduplication hits, dynamic beam decisions, prune events | GoTOperations |
| Memory | Chunk storage, retrieval counts, namespace isolation | MemoryManager |
| Test Runner | Per-test scores, pass/fail, cost, tokens, duration, graph structure metrics | idea_test_runner.py |
| Visualization | 4-page core dashboard, heatmaps, efficiency frontiers, difficulty rankings | testing/visualization_* |

Connector base classes handle I/O logging so action classes stay focused on logic (see OOP conventions).
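One way to keep timing out of action code is a context manager on the base class. The sketch below shows that pattern only. `SketchConnectorBase`, `_timed`, and `EchoConnector` are hypothetical names, and nothing here reflects the actual signatures of `ConnectorBase._record_timing` or `_record_io`.

```python
import time
from contextlib import contextmanager


class SketchConnectorBase:
    """Hypothetical base class that records timing and status for each operation."""

    def __init__(self):
        self.events = []

    @contextmanager
    def _timed(self, operation):
        start = time.perf_counter()
        status = "ok"
        try:
            yield
        except Exception:
            status = "error"
            raise
        finally:
            # Runs on success and on failure, so every call leaves one event.
            self.events.append({
                "operation": operation,
                "status": status,
                "seconds": time.perf_counter() - start,
            })


class EchoConnector(SketchConnectorBase):
    """Subclass contains only the action logic; telemetry lives in the base."""

    def fetch(self, url):
        with self._timed("fetch"):
            return f"content of {url}"


connector = EchoConnector()
body = connector.fetch("https://example.com")
```

The subclass never touches a clock or a log call, which is the separation the table above describes.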

Test and Visualization Pipeline

idea_test_runner  >  JSON results  >  visualization_summary  >  terminal report
                                   >  visualization_core     >  4-page PNG dashboard
                                   >  visualization_plots    >  detailed plot gallery

Results are written to agent/idea_test_results/ as timestamped JSON. The visualizer can filter by run ID (--latest, --run-id) and generates executive dashboards, heatmaps, efficiency frontiers, and per-test breakdowns.

Regenerating Visualizations:

# From services/ directory, run visualization in Docker
docker compose run --rm agent python -m app.testing.idea_test_visualize --latest --core-only

# Or generate all plots (including detailed gallery)
docker compose run --rm agent python -m app.testing.idea_test_visualize --latest

# List available test runs
docker compose run --rm agent python -m app.testing.idea_test_visualize --list-runs

# Generate and copy benchmark plots to docs/benchmark/ (from project root)
python scripts/generate_benchmark_plots.py

Visualization Improvements:

  • Executive Summary: Score heatmap (test × system) replaces model leaderboard table for better visual insight
  • Efficiency Dashboard: Violin plots with all datapoints replace cramped tables, showing full score distributions
  • Larger fonts: All text increased for better readability (titles 32-48pt, labels 18-22pt)
  • All datapoints visible: Individual test runs shown as scatter points overlaid on distributions
  • Clear trends: Graph vs Sequential advantage highlighted with annotations and visual comparisons

Visualizations are automatically generated after test runs and saved to agent/idea_test_results/plots_<run_id>/.

Tech Stack

| Layer | Technology |
| --- | --- |
| Frontend | React, Vite, Supabase Auth |
| Backend | FastAPI, RabbitMQ, Redis, ChromaDB, Supabase |
| Agent | Graph-of-Thought engine, OpenAI LLMs, Brave Search, undetected-chromedriver |
| Infra | AWS ECS, ECR, CloudWatch, Lambda autoscaling |

Quick Start

Local Development

cd services
cp keys.env.example keys.env   # add OPENAI_API_KEY + SEARCH_API_KEY
docker compose up -d            # starts gateway, agent, rabbitmq, redis, chroma

The system will be available at:

  • Frontend: http://localhost:5173 (Vite dev server)
  • Gateway API: http://localhost:8000
  • RabbitMQ Management: http://localhost:15672 (guest/guest)
  • ChromaDB: http://localhost:8001

Running Tests

# Run specific tests (--profile is a top-level compose option, so it goes before "run")
IDEA_TEST_IDS=019,025 docker compose --profile test run visit-test

# Run full test suite
docker compose --profile test run idea-test

# Benchmark mode (top 8 tests, 3 models, 3 runs each)
IDEA_TEST_MODE=benchmark docker compose --profile test run idea-test

Environment Variables

Key environment variables for testing:

  • IDEA_TEST_IDS: Comma-separated test IDs (e.g., "019,025,033")
  • IDEA_TEST_MODE: "default" or "benchmark"
  • IDEA_TEST_RUNS: Number of runs per test/model pair
  • IDEA_TEST_CONCURRENCY: Max parallel executions
  • IDEA_TEST_MODELS: Comma-separated models (e.g., "gpt-5.2,gpt-5-mini")
  • IDEA_TEST_EXECUTION_VARIANTS: "graph", "sequential", or both

Repo Layout

services/
  agent/          Agent service (GoT engine, connectors, tests)
  gateway/        FastAPI gateway, task intake, Supabase sync
  shared/         Connector configs, models, storage helpers
  metrics/        CloudWatch queue-depth publisher
  lambda_autoscaling/  ECS autoscaler
frontend/         React web UI
scripts/          Deployment, diagnostics, audits
docs/             Architecture, security, benchmark plots

Documentation

About

General-purpose agentic RAG web crawler
