Epic: AI-Assisted Debugging Infrastructure for ZigCraft
Summary
This epic implements a comprehensive debugging infrastructure that enables AI agents to effectively debug, analyze, and interact with the ZigCraft voxel engine. The system includes crash detection and reporting, screenshot capture, a debug console, a remote HTTP/WebSocket API, metrics collection, input injection, state snapshotting, and OpenCode MCP integration.
Primary Goals:
- Enable AI agents to visually inspect game state
- Allow remote control and inspection via HTTP/WebSocket APIs
- Automatically capture diagnostic information on crashes
- Provide reproducible debugging through state snapshots
- Support both local development and CI/CD environments
Background
Currently, debugging ZigCraft requires manual reproduction and inspection. With a growing codebase (Vulkan RHI, PBR rendering, multi-threaded world generation, complex shaders), debugging visual bugs, crashes, and performance issues is time-consuming and often requires running the game locally.
AI agents (via OpenCode) have limited ability to debug graphics applications because they cannot:
- See the rendered output
- Interact with the game interface
- Query internal state
- Replay actions to reproduce bugs
This epic builds a debugging infrastructure that exposes game state and control surfaces to AI agents through multiple interfaces, making automated debugging feasible.
Scope
In Scope:
- Crash detection, reporting, and recovery
- Screenshot capture system (manual, auto, crash-triggered)
- In-game debug console with extensible command system
- RESTful HTTP API and WebSocket telemetry server
- Performance metrics collection and export
- Input injection system for remote control
- State snapshot system for reproducible debugging
- OpenCode MCP server integration for AI tooling
- CI/CD integration with headless screenshot testing
Out of Scope:
- Full-blown game editor tools
- Multiplayer networking for remote debugging
- GPU profiling tools (existing tools like RenderDoc recommended)
- Automated bug fixing (agent only assists in diagnosis)
Phases & Dependencies
This epic is organized into 9 phases with clear dependency chains. Some phases can be developed in parallel, and work is merged in 3 major milestones.
Phase 1: Crash Detection & Reporting
Status: 🔴 Blocking (Priority - Critical)
Duration: 3-4 days
Dependencies: None
Deliverables:
- `src/engine/core/crash_handler.zig` - Crash handler module
- Signal handler (SIGSEGV, SIGABRT, SIGILL)
- Crash directory structure: `crashes/<timestamp>/`
- Crash dump files:
  - `stacktrace.txt` - Zig stack trace
  - `last_100_lines.log` - Log buffer capture
  - `state.json` - Game state snapshot
- Crash reporting UI dialog on startup (using existing UISystem)
- Heartbeat system with watchdog thread
Key Technical Details:
- Use `std.os.sigaction` for signal handling
- Async-signal-safe operations in signal handler
- Store last N log lines in circular buffer (alloc-free in crash path)
- Heartbeat file every 5 seconds, watchdog after 15s timeout
- Graceful shutdown before crash dump if possible
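The alloc-free "last N log lines" buffer above can be sketched as a fixed-size ring (shown in TypeScript for brevity; the real implementation would be Zig with pre-allocated storage, and the class name is illustrative):

```typescript
// Sketch of the crash handler's log buffer: the last N lines live in a
// fixed-size ring, so the crash path never allocates.
class CircularLogBuffer {
  private lines: string[];
  private next = 0;  // slot to overwrite next
  private count = 0; // how many slots are filled

  constructor(private capacity: number) {
    this.lines = new Array(capacity);
  }

  push(line: string): void {
    this.lines[this.next] = line;
    this.next = (this.next + 1) % this.capacity;
    this.count = Math.min(this.count + 1, this.capacity);
  }

  // Oldest-to-newest view, e.g. for writing last_100_lines.log.
  snapshot(): string[] {
    const out: string[] = [];
    for (let i = 0; i < this.count; i++) {
      out.push(this.lines[(this.next - this.count + i + this.capacity * 2) % this.capacity]);
    }
    return out;
  }
}
```

The fixed capacity is what makes the crash path safe: `push` only overwrites existing slots, so no allocation can fail inside the signal handler.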
Acceptance Criteria:
- ✅ Intentional crash (`SIGSEGV`) generates complete crash dump
- ✅ Crash reports displayed on next game launch
- ✅ Can view crash details and optionally upload
- ✅ "Continue from safe state" recovers pre-crash world
- ✅ Frozen process (no heartbeat) triggers graceful shutdown
Phase 2: Screenshot Capture System
Status: 🟡 Ready to Start (No dependencies)
Duration: 2-3 days
Dependencies: None (can be parallel with Phase 1)
Deliverables:
- `src/engine/graphics/screenshot.zig` - Screenshot capture module
- Vulkan swapchain image readback logic
- STB PNG encoding (leverage existing `libs/stb/stb_image_impl.c`)
- Screenshot organization: `screenshots/session_<timestamp>/<timestamp>.png`
- Capture modes:
  - Manual: `F11` hotkey
  - Console: `screenshot [name]` command
  - Auto-timer: `screenshot start/stop [interval_seconds]`
  - Crash-triggered: capture last frame before crash
Key Technical Details:
- Async capture (queue request, render thread processes next frame)
- `VkImage` → `stbi_write_png` pipeline
- Maximum screenshot directory size (e.g., 500MB) with oldest-first deletion
- Thumbnail generation using `stb_image_resize` (optional)
- Crash hook calls screenshot handler in signal-safe path (if possible)
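The directory size cap can be sketched as an oldest-first pruning pass (TypeScript for illustration; `FileInfo` and `filesToPrune` are hypothetical names, not engine API):

```typescript
// Sketch of the screenshot-directory cap: given file metadata, pick the
// oldest files to delete until the total size fits under maxBytes.
interface FileInfo { path: string; bytes: number; mtimeMs: number; }

function filesToPrune(files: FileInfo[], maxBytes: number): string[] {
  const byAge = [...files].sort((a, b) => a.mtimeMs - b.mtimeMs); // oldest first
  let total = files.reduce((sum, f) => sum + f.bytes, 0);
  const doomed: string[] = [];
  for (const f of byAge) {
    if (total <= maxBytes) break;
    doomed.push(f.path); // delete oldest screenshots first
    total -= f.bytes;
  }
  return doomed;
}
```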
Acceptance Criteria:
- ✅ F11 saves screenshot to correct directory
- ✅ Console command `screenshot debug_view` works
- ✅ Auto-timer `screenshot start 5` captures every 5 seconds
- ✅ Crash-triggered screenshot captured (last frame)
- ✅ Screenshots are valid PNG files
- ✅ Directory size limit respected
Phase 3: Debug Console
Status: 🟢 Ready to Start (Depends on Phase 2)
Duration: 2-3 days
Dependencies: Phase 2 (screenshot command)
Deliverables:
- `src/engine/ui/debug_console.zig` - Console UI overlay
- `src/engine/core/command_registry.zig` - Command plugin system
- Toggle key: `` ` `` (backtick)
- Console features:
- Text input field
- Command history (up/down arrows)
- Command autocomplete (tab completion)
- Output buffer with scroll
- Core commands:

  ```
  help [command]        - Show command list or details
  screenshot [name]     - Take screenshot (Phase 2)
  stats                 - Print frame metrics
  teleport <x> <y> <z>  - Teleport player
  set_time <0-23>       - Set time of day
  toggle_wireframe      - Toggle wireframe mode
  toggle_debug_shadows  - Toggle shadow debug visualization
  save_state [name]     - Save state snapshot (Phase 7)
  load_state [name]     - Load state snapshot (Phase 7)
  crash                 - Trigger intentional crash (Phase 1)
  ```

- Command registration API for extensibility
Key Technical Details:
- Leverage existing `UISystem` for rendering
- Input handling via existing `Input` system
- Command parsing: `command arg1 arg2 --flag value`
- Autocomplete: fuzzy match on registered commands
- Command context: access to `EngineContext` for game state
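The `command arg1 arg2 --flag value` grammar can be sketched as a small tokenizer (TypeScript for illustration; the console itself would be Zig, and the types here are assumptions):

```typescript
// Sketch of the console command grammar: positional args plus `--flag value`
// pairs; a trailing bare `--flag` becomes a boolean.
interface ParsedCommand {
  name: string;
  args: string[];
  flags: Record<string, string | boolean>;
}

function parseCommand(input: string): ParsedCommand {
  const tokens = input.trim().split(/\s+/).filter(t => t.length > 0);
  const name = tokens.shift() ?? "";
  const args: string[] = [];
  const flags: Record<string, string | boolean> = {};
  for (let i = 0; i < tokens.length; i++) {
    const t = tokens[i];
    if (t.startsWith("--")) {
      const next = tokens[i + 1];
      if (next !== undefined && !next.startsWith("--")) {
        flags[t.slice(2)] = next; // --flag value
        i++;
      } else {
        flags[t.slice(2)] = true; // bare --flag
      }
    } else {
      args.push(t);
    }
  }
  return { name, args, flags };
}
```

For example, `parseCommand("teleport 0 100 0")` yields `{ name: "teleport", args: ["0", "100", "0"], flags: {} }`.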
Acceptance Criteria:
- ✅ Backtick opens/closes console
- ✅ Can type commands and see output
- ✅ Arrow keys navigate history
- ✅ Tab completes commands
- ✅ All core commands work correctly
- ✅ Custom commands can be registered
Phase 4: Remote HTTP/WebSocket API
Status: 🔴 Critical Path (Depends on Phase 3)
Duration: 4-5 days
Dependencies: Phase 3 (console integration), Phase 2 (screenshot endpoint)
Deliverables:
- `src/engine/remote/api_server.zig` - HTTP server using `std.http.Server`
- `src/engine/remote/websocket.zig` - WebSocket implementation
- `src/engine/remote/protocol.zig` - Request/response type definitions
- REST API endpoints:

  ```
  GET  /api/status               - Server status, version, uptime
  GET  /api/stats                - Current metrics (Phase 5)
  GET  /api/screenshot           - Get last screenshot path/URL
  POST /api/screenshot           - Trigger screenshot
  POST /api/input/inject         - Inject input event (Phase 6)
  POST /api/console/execute      - Execute console command
  GET  /api/state                - Get current game state
  POST /api/state/snapshot       - Create state snapshot (Phase 7)
  POST /api/state/restore        - Restore state snapshot (Phase 7)
  GET  /api/crashes              - List crash dumps (Phase 1)
  GET  /api/logs?level=X&lines=N - Fetch log lines
  GET  /api/metrics              - Get metrics history (Phase 5)
  ```

- WebSocket endpoint `/api/ws`:
  - Subscribe to telemetry stream (FPS, frame time, memory, entities)
  - Bidirectional commands
  - Heartbeat ping/pong
- Security:
  - Configurable API key via `Authorization` header
  - Default: localhost-only mode
  - Rate limiting per IP
Key Technical Details:
- HTTP server runs on background thread (non-blocking)
- Use `std.http.Server` (Zig stdlib, no external deps)
- WebSocket frame parsing (RFC 6455)
- JSON serialization via custom writer (avoid heavy JSON lib)
- CORS headers for cross-origin requests
- Graceful shutdown: flush pending requests before exit
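The per-IP rate limiting above can be sketched as a fixed-window counter (TypeScript for illustration; the real server would be Zig, and the limit/window values are assumptions):

```typescript
// Sketch of per-IP rate limiting for the debug API: a fixed-window counter.
// When a window expires, the counter for that IP resets.
class RateLimiter {
  private windows = new Map<string, { start: number; count: number }>();

  constructor(private limit: number, private windowMs: number) {}

  // Returns true if the request from `ip` at time `nowMs` is allowed.
  allow(ip: string, nowMs: number): boolean {
    const w = this.windows.get(ip);
    if (!w || nowMs - w.start >= this.windowMs) {
      this.windows.set(ip, { start: nowMs, count: 1 }); // fresh window
      return true;
    }
    w.count++;
    return w.count <= this.limit;
  }
}
```

Taking the clock as a parameter keeps the limiter deterministic and easy to test; the server would pass the request arrival time.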
Settings Integration:

```zig
// src/game/settings.zig
remote_debug: struct {
    enabled: bool = false,
    port: u16 = 8080,
    api_key: []const u8 = "",
    allow_remote: bool = false, // localhost only by default
}
```

Acceptance Criteria:
- ✅ Server starts on configured port
- ✅ All REST endpoints return valid JSON
- ✅ WebSocket connects and receives telemetry
- ✅ CORS headers present for browser clients
- ✅ API key validation works
- ✅ Rate limiting blocks abusive requests
- ✅ Server gracefully shuts down on game quit
Phase 5: Metrics & Telemetry
Status: 🟢 Ready to Start (No blocking dependencies)
Duration: 2-3 days
Dependencies: None (can be parallel with Phases 1-3)
Deliverables:
- `src/engine/core/metrics.zig` - Metrics collection system
- Collected metrics:
- Frame time: min, max, avg (last 60 frames)
- FPS: current, 1s avg, 10s avg
- Render pass timings (shadow, GPass, SSAO, etc.)
- Memory: allocated, freed, current usage (heap allocator stats)
- World: loaded chunks, meshing queue size, entities count
- Graphics: draw calls, triangles, texture memory
- Job system: active jobs, thread utilization
- Export formats:
- JSON: Single snapshot
- CSV: Time series history
- Live stream: WebSocket (Phase 4)
- Console commands:

  ```
  metrics            - Print current metrics
  metrics export csv - Export to file
  metrics history N  - Print last N frames
  ```
Key Technical Details:
- Rolling buffer for history (configurable size, e.g., 3600 frames = 1 min at 60fps)
- Update metrics in `App.runSingleFrame()` after all systems have updated
- Frame timing via `std.time.nanoTimestamp`
- Memory stats via `std.heap.GeneralPurposeAllocator` in debug mode
- Zero-allocation hot path (use pre-allocated buffers)
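The rolling min/max/avg window can be sketched with a pre-allocated ring buffer, mirroring the zero-allocation hot path (TypeScript for illustration; the engine code would be Zig):

```typescript
// Sketch of the rolling frame-time window: min/max/avg over the last N
// frames, stored in a ring buffer that is allocated once up front.
class FrameStats {
  private samples: number[];
  private next = 0;
  private filled = 0;

  constructor(private window: number) {
    this.samples = new Array(window).fill(0);
  }

  record(frameTimeMs: number): void {
    this.samples[this.next] = frameTimeMs;
    this.next = (this.next + 1) % this.window;
    this.filled = Math.min(this.filled + 1, this.window);
  }

  // Order doesn't matter for min/max/avg, so we scan all filled slots.
  summary(): { min: number; max: number; avg: number } {
    if (this.filled === 0) return { min: 0, max: 0, avg: 0 };
    let min = Infinity, max = -Infinity, sum = 0;
    for (let i = 0; i < this.filled; i++) {
      const s = this.samples[i];
      min = Math.min(min, s);
      max = Math.max(max, s);
      sum += s;
    }
    return { min, max, avg: sum / this.filled };
  }
}
```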
Acceptance Criteria:
- ✅ Metrics updated every frame with minimal overhead
- ✅ Console `metrics` prints readable output
- ✅ CSV export works: `metrics export stats.csv`
- ✅ WebSocket telemetry stream updates at 10Hz
- ✅ Metrics accurate (verified vs manual measurements)
Phase 6: Input Injection
Status: 🟢 Ready to Start (Depends on Phase 4)
Duration: 3-4 days
Dependencies: Phase 4 (API endpoint needs injection system)
Deliverables:
- `src/engine/input/injector.zig` - Input queue and injection system
- Input injection types:

  ```jsonc
  // Key press
  { "type": "key", "key": "W|A|S|D|Space|Shift|F1|Escape|...", "action": "press|release|tap" }

  // Mouse movement
  { "type": "mouse_move", "x": 640, "y": 480 }

  // Mouse button
  { "type": "mouse_button", "button": "left|right|middle", "action": "press|release|click" }

  // Mouse scroll
  { "type": "mouse_scroll", "scroll_y": 1.0 }
  ```
- Thread-safe input queue (producer: API/consumer: game loop)
- Input recording system:
- Record real input to JSON
- Playback recorded sequences
- Speed control (0.5x, 1x, 2x, etc.)
- Console command: `input_playback <file.json> [speed]`
Key Technical Details:
- Queue uses `std.ArrayList` protected by `std.Thread.Mutex`
- Game thread processes queue in `Input.pollEvents()` (after SDL events)
- Injected events marked as synthetic (for debugging/avoiding feedback loops)
- Input validation: reject unknown keys, out-of-bounds positions
- Timestamp-based playback (maintain original timing)
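Event validation and timestamp-based playback can be sketched as follows (TypeScript for illustration; the key table, event shape, and window bounds are simplified assumptions, not the engine's real tables):

```typescript
// Sketch of injected-event validation (reject unknown keys and out-of-bounds
// positions) plus speed-scaled playback scheduling.
const KNOWN_KEYS = new Set(["W", "A", "S", "D", "Space", "Shift", "F1", "Escape"]);

interface InputEvent {
  type: "key" | "mouse_move";
  key?: string;
  x?: number;
  y?: number;
  timestampMs: number;
}

function validateEvent(e: InputEvent, width: number, height: number): boolean {
  if (e.type === "key") return e.key !== undefined && KNOWN_KEYS.has(e.key);
  if (e.type === "mouse_move") {
    return e.x !== undefined && e.y !== undefined &&
           e.x >= 0 && e.x < width && e.y >= 0 && e.y < height;
  }
  return false;
}

// Delay (ms) before each event relative to playback start, scaled by speed
// (2.0 = twice as fast). Preserves the recording's relative timing.
function playbackDelays(events: InputEvent[], speed: number): number[] {
  if (events.length === 0) return [];
  const t0 = events[0].timestampMs;
  return events.map(e => (e.timestampMs - t0) / speed);
}
```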
Acceptance Criteria:
- ✅ API `POST /api/input/inject` successfully presses keys
- ✅ Multiple input events processed in correct order
- ✅ Recording creates valid JSON
- ✅ Playback reproduces recorded input
- ✅ Speed control changes playback rate
- ✅ Queue doesn't cause frame drops
Phase 7: State Snapshots
Status: 🟢 Ready to Start (Depends on Phase 4)
Duration: 3-4 days
Dependencies: Phase 4 (API endpoint)
Deliverables:
- `src/game/snapshot.zig` - State serialization system
- Snapshot contents:
- Player: position, rotation, velocity, inventory
- World: seed, loaded chunks (positions), modified blocks
- Time: game time, day/night cycle
- Settings: graphics options, input bindings
- Session metadata: timestamp, version, frame count
- Snapshot format:
  - Binary: compact, fast write/read
  - Optional JSON: human-readable (for debugging)
  - Compression: `std.compress.zlib` (optional)
- Snapshot management:
  - List: `snapshots/` directory scan
  - Delete: `rm snapshots/<name>.snap`
  - Compare: `diff snapshots/A.snap B.snap`
  - Auto-snapshot on crash (Phase 1)
- Console commands:

  ```
  save_state [name] - Save snapshot
  load_state [name] - Load snapshot
  list_states       - List snapshots
  ```

- API endpoints:

  ```
  POST /api/state/snapshot - Create snapshot
  POST /api/state/restore  - Restore snapshot
  GET  /api/state/list     - List snapshots
  ```
Key Technical Details:
- Serialization: `std.io.BufferedWriter` + custom binary format
- Chunk data: only store modified blocks (diff from seed)
- Validate snapshot on load (checksum via `std.hash.Fnv1a`)
- Loading: hot-reload world chunks, recreate player state
- Fallback: if load fails, continue with current state (don't crash)
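The checksum validation can be sketched with 32-bit FNV-1a, using the standard offset basis and prime (TypeScript for illustration; the Zig side would use `std.hash.Fnv1a`, and `snapshotValid` is a hypothetical helper):

```typescript
// Sketch of the snapshot checksum: 32-bit FNV-1a over the snapshot bytes.
function fnv1a32(data: Uint8Array): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (const byte of data) {
    hash ^= byte;
    hash = Math.imul(hash, 0x01000193); // FNV prime, wraps at 32 bits
  }
  return hash >>> 0;
}

// On load: recompute and compare against the stored checksum. On mismatch,
// fall back to the current state instead of crashing.
function snapshotValid(payload: Uint8Array, storedChecksum: number): boolean {
  return fnv1a32(payload) === storedChecksum;
}
```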
Acceptance Criteria:
- ✅ `save_state test` creates valid snapshot
- ✅ `load_state test` restores state exactly
- ✅ Snapshots capture modified terrain
- ✅ Load fails gracefully on corrupted snapshot
- ✅ API endpoint creates and loads snapshots
- ✅ Crash auto-snapshot captures pre-crash state
Phase 8: OpenCode MCP Integration
Status: 🟡 Ready to Start (Depends on Phases 4, 6, 7)
Duration: 3-4 days
Dependencies: Phase 4 (API), Phase 6 (input), Phase 7 (snapshots)
Deliverables:
- `opencode/mcp-zigcraft/` - MCP server implementation (Bun/TypeScript)
- MCP tools (following OpenCode tool format):
  ```ts
  // tool/take-screenshot.ts
  { description: "Capture game screenshot",
    args: { name: string }, returns: { path: string, url: string } }

  // tool/inject-input.ts
  { description: "Inject keyboard/mouse input",
    args: { type: string, key?: string, x?: number, ... },
    returns: { success: boolean } }

  // tool/query-state.ts
  { description: "Get current game state",
    args: { }, returns: { player: {...}, world: {...}, time: ... } }

  // tool/console-command.ts
  { description: "Execute debug console command",
    args: { command: string }, returns: { output: string, success: boolean } }

  // tool/get-metrics.ts
  { description: "Get performance metrics",
    args: { format: "json|csv" }, returns: { metrics: {...} } }

  // tool/list-crashes.ts
  { description: "List recent crash reports",
    args: { }, returns: { crashes: [...] } }

  // tool/save-snapshot.ts
  { description: "Create state snapshot",
    args: { name: string }, returns: { path: string } }

  // tool/load-snapshot.ts
  { description: "Load state snapshot",
    args: { name: string }, returns: { success: boolean } }
  ```
- MCP server implementation:
  - Use `@opencode-ai/plugin` SDK
  - Connect to game via HTTP/WebSocket (Phase 4)
  - Error handling and retry logic
  - Discovery of game instance (configurable host/port)
File Structure:
```
opencode/mcp-zigcraft/
├── index.ts              # MCP server entry point
├── client.ts             # HTTP/WebSocket client to game
├── types.ts              # TypeScript types for API
└── tool/
    ├── take-screenshot.ts
    ├── inject-input.ts
    ├── query-state.ts
    ├── console-command.ts
    ├── get-metrics.ts
    ├── list-crashes.ts
    ├── save-snapshot.ts
    └── load-snapshot.ts
```
Configuration:
```jsonc
// opencode.json
{
  "mcp": {
    "zigcraft": {
      "type": "local",
      "command": ["bun", "run", "opencode/mcp-zigcraft/index.ts"],
      "env": {
        "ZIGCRAFT_API_HOST": "http://localhost:8080"
      },
      "enabled": true
    }
  }
}
```

Acceptance Criteria:
- ✅ MCP server starts without errors
- ✅ All 8 tools are discoverable by OpenCode
- ✅ Tools successfully call game API
- ✅ Error responses handled gracefully
- ✅ Retry logic handles game restart
- ✅ AI agent can use tools end-to-end
Phase 9: CI/CD Integration
Status: 🟢 Final Phase (Depends on all previous)
Duration: 2-3 days
Dependencies: All phases (especially Phase 2, 4, 7)
Deliverables:
- Visual regression test workflow
- Crash recovery test workflow
- API integration test suite
- Headless screenshot automation
Key Components:
A. Headless Screenshot Testing:
- Build flag: `-Dheadless=true` (render to offscreen buffer)
- Automated scenario script:

  ```jsonc
  // test_scenarios/basic_render.json
  {
    "name": "Basic Render Test",
    "steps": [
      { "command": "teleport 0 100 0" },
      { "wait": 60 },               // wait 60 frames
      { "screenshot": "frame_001" },
      { "command": "set_time 12" },
      { "wait": 60 },
      { "screenshot": "frame_002" }
    ]
  }
  ```

- Compare screenshots against baseline (`assets/baselines/`)
- Fail CI if pixel difference > threshold (e.g., 1%)
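The baseline comparison can be sketched as a per-pixel diff ratio over same-size RGBA buffers (TypeScript sketch; the function names, per-channel tolerance, and 1% threshold default are illustrative):

```typescript
// Sketch of the visual-regression check: fraction of pixels whose RGBA
// channels differ from the baseline by more than `tolerance`.
function pixelDiffRatio(a: Uint8Array, b: Uint8Array, tolerance = 0): number {
  if (a.length !== b.length || a.length % 4 !== 0) {
    throw new Error("buffers must be same-size RGBA");
  }
  const pixels = a.length / 4;
  let differing = 0;
  for (let p = 0; p < pixels; p++) {
    for (let c = 0; c < 4; c++) {
      if (Math.abs(a[p * 4 + c] - b[p * 4 + c]) > tolerance) {
        differing++;
        break; // count each pixel at most once
      }
    }
  }
  return pixels === 0 ? 0 : differing / pixels;
}

// CI gate: fail when the differing-pixel fraction exceeds the threshold.
function regressionFails(a: Uint8Array, b: Uint8Array, threshold = 0.01): boolean {
  return pixelDiffRatio(a, b) > threshold;
}
```

A small per-channel tolerance absorbs harmless driver-level rounding differences between machines; the threshold then catches genuinely changed output.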
B. Crash Recovery Testing:
- Test script that triggers crashes:

  ```jsonc
  // test_scenarios/crash_test.json
  {
    "steps": [
      { "command": "crash" },
      { "restart": true },                  // auto-restart game
      { "verify": "crash_report_exists" },
      { "load_snapshot": "auto_recovery" },
      { "verify": "state_restored" }
    ]
  }
  ```

- Verify crash handler creates dump
- Verify auto-recovery loads correct state
C. API Integration Tests:
- Node.js test suite using `supertest` or `axios`
- Test all REST endpoints with valid/invalid inputs
- Test WebSocket connection and telemetry
- Test rate limiting and auth
GitHub Workflow Addition:

```yaml
# .github/workflows/debugging-tools.yml
name: Debugging Tools Test
on: [push, pull_request]
jobs:
  visual-regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Nix
        uses: DeterminateSystems/nix-installer-action@v16
      - name: Run screenshot tests
        run: nix develop --command zig build test-screenshots
  crash-recovery:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Nix
        uses: DeterminateSystems/nix-installer-action@v16
      - name: Run crash tests
        run: nix develop --command zig build test-crash-recovery
  api-integration:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Nix
        uses: DeterminateSystems/nix-installer-action@v16
      - name: Run API tests
        run: |
          nix develop --command zig build run -Dremote_debug=true &
          sleep 10
          bun run opencode/tests/api.ts
```

Acceptance Criteria:
- ✅ Visual regression passes on main branch
- ✅ Intentional rendering changes trigger test failure
- ✅ Crash recovery test passes
- ✅ API integration test suite passes (all endpoints)
- ✅ Tests run in CI without display server
Parallelization Strategy
This epic is organized to maximize parallel development while respecting dependencies:
Track A (Core Infrastructure) - Can run in parallel
- Phase 1: Crash Detection - Independent
- Phase 2: Screenshots - Independent
- Phase 5: Metrics - Independent
These three phases have no dependencies on each other and can be developed simultaneously by different developers.
Track B (User Interfaces) - Depends on Track A
- Phase 3: Debug Console - Depends on Phase 2 (screenshot command)
- Phase 4: Remote API - Depends on Phase 3 (console integration), Phase 2 (screenshot endpoint)
- Phase 6: Input Injection - Depends on Phase 4 (API endpoint)
- Phase 7: State Snapshots - Depends on Phase 4 (API endpoint)
These phases build on Track A but can proceed in parallel once dependencies are met:
- Phase 4 can start once Phase 2 is complete
- Phase 6 can start once Phase 4 is partially complete (just need API server)
- Phase 7 can start once Phase 4 is partially complete (just need API server)
Track C (Integration) - Depends on Track B
- Phase 8: MCP Integration - Depends on Phase 4, 6, 7 (needs stable API)
- Phase 9: CI/CD - Depends on all previous phases
Critical Path (Sequential)
The minimum sequential work required:
- Phase 2 (Screenshots) → Phase 3 (Console) → Phase 4 (API) → Phase 8 (MCP)
- Phase 1 (Crash) can happen anytime
- Phase 5 (Metrics) can happen anytime
- Phase 6/7 can start once Phase 4 is ready
- Phase 9 requires everything
Recommended Parallel Work:
| Week | Developer A | Developer B | Developer C |
|---|---|---|---|
| 1 | Phase 1 (Crash) | Phase 2 (Screenshots) | Phase 5 (Metrics) |
| 2 | Phase 3 (Console) | Phase 4 (API) | Phase 5 (Metrics) |
| 3 | Phase 4 (API) | Phase 6 (Input) | Phase 7 (Snapshots) |
| 4 | Phase 8 (MCP) | Phase 7 (Snapshots) | Documentation |
| 5 | Phase 9 (CI/CD) | Testing | Testing |
Merge Strategy
This epic will be merged in 3 major milestones to reduce integration risk and provide intermediate value:
Milestone 1: Core Diagnostics (Week 1-2)
Merge Target: Feature branch debug-diagnostics
Includes:
- Phase 1: Crash Detection & Reporting
- Phase 2: Screenshot Capture System
- Phase 5: Metrics & Telemetry
Value:
- Crash dumps now captured automatically
- Developers can take screenshots
- Metrics available via console
Branch: feature/debug-core-diagnostics
Milestone 2: Remote Control (Week 3-4)
Merge Target: Feature branch debug-remote-control
Includes:
- Phase 3: Debug Console
- Phase 4: Remote HTTP/WebSocket API
- Phase 6: Input Injection
- Phase 7: State Snapshots
Value:
- Full remote debugging capability
- AI agents can now interact with game
- Reproducible debugging via snapshots
Branch: feature/debug-remote-control
Depends on: Milestone 1 (merged to main)
Milestone 3: AI Integration & CI (Week 5)
Merge Target: Feature branch debug-ai-integration
Includes:
- Phase 8: OpenCode MCP Integration
- Phase 9: CI/CD Integration
Value:
- Full AI agent debugging workflow
- Automated visual regression testing
- Crash recovery verification
Branch: feature/debug-ai-integration
Depends on: Milestone 2 (merged to main)
File Structure Summary
New Files:
```
src/
├── engine/
│   ├── core/
│   │   ├── crash_handler.zig      [Phase 1]
│   │   ├── metrics.zig            [Phase 5]
│   │   └── command_registry.zig   [Phase 3]
│   ├── graphics/
│   │   └── screenshot.zig         [Phase 2]
│   ├── input/
│   │   └── injector.zig           [Phase 6]
│   ├── remote/
│   │   ├── api_server.zig         [Phase 4]
│   │   ├── websocket.zig          [Phase 4]
│   │   └── protocol.zig           [Phase 4]
│   └── ui/
│       └── debug_console.zig      [Phase 3]
└── game/
    └── snapshot.zig               [Phase 7]

opencode/
└── mcp-zigcraft/                  [Phase 8]
    ├── index.ts
    ├── client.ts
    ├── types.ts
    └── tool/
        ├── take-screenshot.ts
        ├── inject-input.ts
        ├── query-state.ts
        ├── console-command.ts
        ├── get-metrics.ts
        ├── list-crashes.ts
        ├── save-snapshot.ts
        └── load-snapshot.ts

test_scenarios/                    [Phase 9]
├── basic_render.json
└── crash_test.json

assets/baselines/                  [Phase 9]
└── screenshots/
    ├── frame_001.png
    └── frame_002.png
```
Modified Files:
```
src/engine/core/log.zig        [Phase 1 - add crash logger]
src/game/app.zig               [All phases - integrate systems]
src/game/settings.zig          [Phase 4 - add remote_debug config]
src/game/screens/world.zig     [Phase 3 - integrate console]
build.zig                      [Phase 9 - add test targets]
opencode.json                  [Phase 8 - add MCP config]
.github/workflows/build.yml    [Phase 9 - add debug tests]
```
Technical Decisions & Rationale
1. HTTP Server Choice
Decision: Use Zig's `std.http.Server` (no external dependencies)
Rationale:
- Keeps Nix environment clean (no additional packages)
- Already available in stdlib (Zig 0.14)
- Sufficient performance for debugging API
- Avoids the security exposure of third-party HTTP libraries
- Single binary distribution (no runtime deps)
2. Screenshot Format
Decision: PNG via STB (existing `libs/stb/stb_image_write.h`)
Rationale:
- STB already in project for texture loading
- Lossless compression (important for visual regression)
- Easy for AI agents to analyze
- Standard format with broad tooling support
- No additional dependencies
3. WebSocket vs Pure REST
Decision: Both - REST for commands, WebSocket for telemetry
Rationale:
- REST: Simple, stateless, idempotent operations (perfect for commands)
- WebSocket: Real-time bidirectional stream (ideal for telemetry)
- AI can choose optimal interface based on use case
- Reduces HTTP polling overhead for metrics
4. Input Injection: IPC vs In-Process Queue
Decision: In-process queue (no external IPC)
Rationale:
- Simpler implementation (no socket/pipe complexity)
- No threading issues with external processes
- Still works with remote API (API pushes to in-process queue)
- Lower latency for immediate debugging
5. State Snapshot Format
Decision: Binary with optional JSON, optional compression
Rationale:
- Binary: Fast serialization/deserialization, compact size
- JSON: Human-readable for manual inspection (debug tool output)
- Compression: Optional (trade size vs. CPU)
- Custom format (no heavy serialization lib like MessagePack)
6. MCP Implementation Language
Decision: TypeScript/Bun (matches OpenCode config)
Rationale:
- OpenCode already uses Bun/TypeScript for tools
- `@opencode-ai/plugin` SDK designed for TypeScript
- Excellent HTTP/WebSocket libraries in npm ecosystem
- Type safety for API contracts
- Easy to test and debug
Risk Assessment
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| Vulkan screenshot capture fails | High (Phase 2 blocked) | Medium | Fallback to `glReadPixels` if `VkImage` readback fails |
| Signal handler crashes | Critical (system instability) | Low | Use async-signal-safe operations only, test extensively |
| HTTP server blocks render thread | Medium (frame drops) | Low | Run API server on background thread |
| WebSocket connection instability | Medium (agent loses connection) | Medium | Auto-reconnect with exponential backoff |
| Snapshot corruption | High (data loss) | Low | Checksum validation, keep N backups |
| Performance overhead | Medium (degraded gameplay) | Low | Optional enable/disable (default off), profile hot paths |
| CI display requirement | Medium (tests can't run) | High | Use headless Wayland (already in CI) + virtual display |
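The auto-reconnect mitigation in the table can be sketched as an exponential backoff schedule (TypeScript sketch; the base, factor, and cap values are illustrative defaults, not decided parameters):

```typescript
// Sketch of reconnect backoff: delays grow geometrically, capped so a
// long-dead game instance doesn't push waits toward infinity.
function backoffDelaysMs(
  attempts: number,
  baseMs = 500,
  factor = 2,
  capMs = 30_000,
): number[] {
  const delays: number[] = [];
  for (let i = 0; i < attempts; i++) {
    delays.push(Math.min(baseMs * Math.pow(factor, i), capMs));
  }
  return delays;
}
```

The MCP client would sleep for `delays[i]` before reconnect attempt `i`, resetting the counter once a connection succeeds.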
Acceptance Criteria (Epic Level)
Definition of Done:
- ✅ All 9 phases completed and merged in their respective milestones
- ✅ All unit tests pass (`zig build test`)
- ✅ Integration tests pass (`zig build test-integration`)
- ✅ New CI tests pass (visual regression, crash recovery, API tests)
- ✅ AI agent can successfully:
- Take screenshots and analyze them
- Query game state and metrics
- Inject input to control game
- Save/restore snapshots to reproduce bugs
- Execute debug console commands
- ✅ Crash handler successfully captures diagnostics on SIGSEGV
- ✅ Remote API accessible via MCP tools
- ✅ Documentation updated (AGENTS.md, README.md)
- ✅ Code formatted (`zig fmt src/`)
- ✅ No regressions in existing functionality
Success Metrics
Quantitative:
- Crash detection: 100% of crashes captured with stacktrace
- Screenshot capture time: <100ms (non-blocking)
- API response time: <50ms for status/metrics queries
- WebSocket telemetry latency: <10ms
- Snapshot save time: <1s for typical world state
- Snapshot load time: <2s for typical world state
- Memory overhead: <50MB with all debug systems enabled
Qualitative:
- AI agents can diagnose visual bugs without human intervention
- Developers can reproduce bugs from crash reports in <5 minutes
- Remote debugging reduces time to root cause by 50%
- CI catches visual regressions before merge
Open Questions
1. Screenshot resolution: Should screenshots capture at native resolution or match window size? (Recommend: native for better debugging detail)
2. Metrics retention: How long should metrics history be kept? (Recommend: 3600 frames = 1 minute @ 60fps, configurable)
3. API authentication: Should we support token-based auth for remote access? (Recommend: API key header, configurable)
4. Snapshot compression: Enable compression by default? (Recommend: yes for worlds with >100MB of chunk data)
5. WebSocket rate limit: What's a reasonable telemetry update rate? (Recommend: 10Hz, configurable)
Related Issues
None yet (will be created as sub-issues)
Notes for Sub-Issue Creation
When splitting this epic into sub-issues:
- One phase per sub-issue (9 sub-issues total)
- Reference this epic in all sub-issues: `Part of #[EPIC_NUMBER]`
- Include acceptance criteria from this epic in each sub-issue
- Tag sub-issues with milestone labels:
  - `milestone-1-core-diagnostics` (Phases 1, 2, 5)
  - `milestone-2-remote-control` (Phases 3, 4, 6, 7)
  - `milestone-3-ai-integration` (Phases 8, 9)
- Add dependency links between sub-issues based on phase dependencies
- Use this epic's file structure section for reference
Summary
This epic implements a complete debugging infrastructure that transforms ZigCraft from a "black box" to a transparent, observable system. By building crash detection, screenshot capture, remote API, and AI integration in a phased approach, we minimize risk while delivering incremental value at each milestone.
The parallelization strategy allows 3 developers to work simultaneously, and the 3-milestone merge strategy ensures integration risks are caught early. The final result will enable AI agents to effectively debug the game, dramatically reducing the time to identify and fix visual bugs, crashes, and performance issues.