Euglena

A scalable Graph-of-Thoughts (GoT) agent service for web research using task decomposition and parallel reasoning to increase accuracy and reduce cost.

Euglena is an agent with web crawling and retrieval-augmented generation. Tasks decompose into parallel subproblems (search, visit, save, think) and merge into structured deliverables. Context persists in ChromaDB. Cost efficiency is maximized through dynamic beam-width and a token-efficient workflow that benefits cheaper models through structured reasoning.

Live: https://euglena.vercel.app/

Note: The visit system does not work on deployment due to website anti-bot measures that detect AWS datacenter IPs. The agent demonstrates resilience by working with search results and other available data sources. This website will be available until 2026-03-16 00:00 PST.

Benchmark Results

64 runs across 16 tests x 2 models x 2 execution variants.

Graph vs Sequential

Graph-of-Thought scores 26.8% higher than sequential and uses 29% fewer tokens.

Metric	Graph	Sequential	Delta	% Change
Avg Score	0.921	0.727	+0.195	+26.8%
Pass Rate	90.6%	46.9%	+43.8pp
Avg Cost	$0.04	$0.06	-$0.01	-22.5%
Avg Tokens	22.6k	32.0k	-9.4k	-29.3%
Avg Duration	688s	349s	+339s	+97.2%

Overall Leaderboard

Rank	System	Avg Score	Median	Std	Pass %	$/run	Time
1	gpt-5.2 [graph]	0.930	0.979	0.094	93.8%	$0.07	471s
2	gpt-5-mini [graph]	0.912	0.931	0.097	87.5%	$0.01	904s
3	gpt-5.2 [sequential]	0.729	0.746	0.172	50.0%	$0.09	200s
4	gpt-5-mini [sequential]	0.724	0.697	0.164	43.8%	$0.03	497s

Efficiency

Features

Graph-of-Thought reasoning: Tasks decompose into parallel subproblems (search, visit, think, save), then merge results upward through the DAG into structured deliverables
Dual execution modes: graph (parallel branching with best-first selection) and sequential (generate then pick, single path depth first) for A/B comparison
Bot-resistant web access: Primary aiohttp connector with automatic undetected-chromedriver fallback on 403/401
Long-term memory (RAG): Crawled content is chunked and embedded into ChromaDB, queryable across tasks and reasoning steps
Dynamic beam width: Branching factor adapts to score quality. Expands exploration when scores are low, narrows when confident
Deduplication and pruning: Candidate thoughts are deduplicated by embedding similarity. Low-scoring nodes are pruned to save budget
Elastic worker fleet: ECS autoscaling matches demand via CloudWatch queue-depth metrics, winds down when idle
User-scoped quotas: Supabase enforces per-user daily usage limits with JWT authentication
Comprehensive test suite: 39 priority-ordered tests with programmatic and LLM-based validation

Observability

Structured telemetry at every layer without cluttering business logic.

Layer	What Is Tracked	Where
Connectors	Every HTTP request, LLM call, search query, browser fetch. Timing, status, payload size	`ConnectorBase._record_timing`, `_record_io`
AgentIO	Unified interface telemetry. Visit/search/store/retrieve with fallback tracking	`AgentIO` methods
Engine	Step-by-step DAG traversal. Expansion, evaluation, selection, merge, pruning events	`IdeaDagEngine` logger
GoT Operations	Embedding, deduplication hits, dynamic beam decisions, prune events	`GoTOperations`
Memory	Chunk storage, retrieval counts, namespace isolation	`MemoryManager`
Test Runner	Per-test scores, pass/fail, cost, tokens, duration, graph structure metrics	`idea_test_runner.py`
Visualization	4-page core dashboard, heatmaps, efficiency frontiers, difficulty rankings	`testing/visualization_*`

Connector base classes handle I/O logging so action classes stay focused on logic (see OOP conventions).

Test and Visualization Pipeline

idea_test_runner  >  JSON results  >  visualization_summary  >  terminal report
                                   >  visualization_core     >  4-page PNG dashboard
                                   >  visualization_plots    >  detailed plot gallery

Results are written to agent/idea_test_results/ as timestamped JSON. The visualizer can filter by run ID (--latest, --run-id) and generates executive dashboards, heatmaps, efficiency frontiers, and per-test breakdowns.

Regenerating Visualizations:

# From services/ directory, run visualization in Docker
docker compose run --rm agent python -m app.testing.idea_test_visualize --latest --core-only

# Or generate all plots (including detailed gallery)
docker compose run --rm agent python -m app.testing.idea_test_visualize --latest

# List available test runs
docker compose run --rm agent python -m app.testing.idea_test_visualize --list-runs

# Generate and copy benchmark plots to docs/benchmark/ (from project root)
python scripts/generate_benchmark_plots.py

Visualization Improvements:

Executive Summary: Score heatmap (test × system) replaces model leaderboard table for better visual insight
Efficiency Dashboard: Violin plots with all datapoints replace cramped tables, showing full score distributions
Larger fonts: All text increased for better readability (titles 32-48pt, labels 18-22pt)
All datapoints visible: Individual test runs shown as scatter points overlaid on distributions
Clear trends: Graph vs Sequential advantage highlighted with annotations and visual comparisons

Visualizations are automatically generated after test runs and saved to agent/idea_test_results/plots_<run_id>/.

Tech Stack

Layer	Technology
Frontend	React, Vite, Supabase Auth
Backend	FastAPI, RabbitMQ, Redis, ChromaDB, Supabase
Agent	Graph-of-Thought engine, OpenAI LLMs, Brave Search, undetected-chromedriver
Infra	AWS ECS, ECR, CloudWatch, Lambda autoscaling

Quick Start

Local Development

cd services
cp keys.env.example keys.env   # add OPENAI_API_KEY + SEARCH_API_KEY
docker compose up -d            # starts gateway, agent, rabbitmq, redis, chroma

The system will be available at:

Frontend: http://localhost:5173 (Vite dev server)
Gateway API: http://localhost:8000
RabbitMQ Management: http://localhost:15672 (guest/guest)
ChromaDB: http://localhost:8001

Running Tests

# Run specific tests
IDEA_TEST_IDS=019,025 docker compose run --profile test visit-test

# Run full test suite
docker compose run --profile test idea-test

# Benchmark mode (top 8 tests, 3 models, 3 runs each)
IDEA_TEST_MODE=benchmark docker compose run --profile test idea-test

Environment Variables

Key environment variables for testing:

IDEA_TEST_IDS: Comma-separated test IDs (e.g., "019,025,033")
IDEA_TEST_MODE: "default" or "benchmark"
IDEA_TEST_RUNS: Number of runs per test/model pair
IDEA_TEST_CONCURRENCY: Max parallel executions
IDEA_TEST_MODELS: Comma-separated models (e.g., "gpt-5.2,gpt-5-mini")
IDEA_TEST_EXECUTION_VARIANTS: "graph", "sequential", or both

Repo Layout

services/
  agent/          Agent service (GoT engine, connectors, tests)
  gateway/        FastAPI gateway, task intake, Supabase sync
  shared/         Connector configs, models, storage helpers
  metrics/        CloudWatch queue-depth publisher
  lambda_autoscaling/  ECS autoscaler
frontend/         React web UI
scripts/          Deployment, diagnostics, audits
docs/             Architecture, security, benchmark plots

Documentation

System Architecture - Overall system design and message flow
Agent Architecture - Graph-of-Thought engine internals
Test Suite - Test structure and validation
Deployment - Deployment guide
Debugger - Debugging tools and techniques
Scripts - Deployment and diagnostic scripts

Name		Name	Last commit message	Last commit date
Latest commit History 64 Commits
.cursor/rules		.cursor/rules
.idea		.idea
docs		docs
frontend		frontend
scripts		scripts
services		services
shared		shared
snapshots		snapshots
.cursorignore		.cursorignore
.gitignore		.gitignore
README.md		README.md
desktop.ini		desktop.ini
pytest.ini		pytest.ini

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Euglena

Benchmark Results

Graph vs Sequential

Overall Leaderboard

Efficiency

Features

Observability

Test and Visualization Pipeline

Tech Stack

Quick Start

Local Development

Running Tests

Environment Variables

Repo Layout

Documentation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Euglena

Benchmark Results

Graph vs Sequential

Overall Leaderboard

Efficiency

Features

Observability

Test and Visualization Pipeline

Tech Stack

Quick Start

Local Development

Running Tests

Environment Variables

Repo Layout

Documentation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages