Epic: AI-Assisted Debugging Infrastructure for ZigCraft #220

@MichaelFisher1997

Summary

This epic implements a comprehensive debugging infrastructure that lets AI agents debug, analyze, and interact with the ZigCraft voxel engine. The system includes crash detection and reporting, screenshot capture, a debug console, a remote HTTP/WebSocket API, metrics collection, input injection, state snapshotting, and OpenCode MCP integration.

Primary Goals:

  • Enable AI agents to visually inspect game state
  • Allow remote control and inspection via HTTP/WebSocket APIs
  • Automatically capture diagnostic information on crashes
  • Provide reproducible debugging through state snapshots
  • Support both local development and CI/CD environments

Background

Currently, debugging ZigCraft requires manual reproduction and inspection. As the codebase grows (Vulkan RHI, PBR rendering, multi-threaded world generation, complex shaders), debugging visual bugs, crashes, and performance issues is time-consuming and often requires running the game locally.

AI agents (via OpenCode) have limited ability to debug graphics applications because they cannot:

  • See rendered output
  • Interact with the game interface
  • Query internal state
  • Replay actions to reproduce bugs

This epic builds a debugging infrastructure that exposes game state and control surfaces to AI agents through multiple interfaces, making automated debugging feasible.

Scope

In Scope:

  • Crash detection, reporting, and recovery
  • Screenshot capture system (manual, auto, crash-triggered)
  • In-game debug console with extensible command system
  • RESTful HTTP API and WebSocket telemetry server
  • Performance metrics collection and export
  • Input injection system for remote control
  • State snapshot system for reproducible debugging
  • OpenCode MCP server integration for AI tooling
  • CI/CD integration with headless screenshot testing

Out of Scope:

  • Full-blown game editor tools
  • Multiplayer networking for remote debugging
  • GPU profiling tools (existing tools like RenderDoc recommended)
  • Automated bug fixing (agent only assists in diagnosis)

Phases & Dependencies

This epic is organized into 9 phases with clear dependency chains. Some phases can be developed in parallel, and the work is merged in 3 major milestones.


Phase 1: Crash Detection & Reporting

Status: 🔴 Blocking (Priority - Critical)
Duration: 3-4 days
Dependencies: None

Deliverables:

  1. src/engine/core/crash_handler.zig - Signal handler (SIGSEGV, SIGABRT, SIGILL)
  2. Crash directory structure: crashes/<timestamp>/
  3. Crash dump files:
    • stacktrace.txt - Zig stack trace
    • last_100_lines.log - Log buffer capture
    • state.json - Game state snapshot
  4. Crash reporting UI dialog on startup (using existing UISystem)
  5. Heartbeat system with watchdog thread

Key Technical Details:

  • Use std.posix.sigaction for signal handling (named std.os.sigaction in Zig releases before 0.12)
  • Async-signal-safe operations in signal handler
  • Store last N log lines in circular buffer (alloc-free in crash path)
  • Heartbeat file every 5 seconds, watchdog after 15s timeout
  • Graceful shutdown before crash dump if possible
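
The circular log buffer described above can be sketched as follows. This is a minimal illustration in TypeScript (the language used for the MCP tooling in Phase 8), not the engine implementation: the class name LogRing and its methods are hypothetical, and a Zig version would presumably use fixed-size byte slots so that the crash path never allocates.

```typescript
// Fixed-capacity ring buffer: storage is allocated once, then overwritten in place.
// LogRing is a hypothetical name used only for this sketch.
class LogRing {
  private readonly slots: string[];
  private readonly capacity: number;
  private next = 0; // index the next push writes to
  private count = 0; // number of valid entries (<= capacity)

  constructor(capacity: number) {
    if (!Number.isInteger(capacity) || capacity <= 0) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.capacity = capacity;
    this.slots = new Array<string>(capacity).fill("");
  }

  // Overwrites the oldest entry once the buffer is full.
  push(line: string): void {
    this.slots[this.next] = line;
    this.next = (this.next + 1) % this.capacity;
    if (this.count < this.capacity) this.count++;
  }

  // Returns the retained lines oldest-first, the order a crash dump would write them.
  snapshot(): string[] {
    const start = (this.next - this.count + this.capacity) % this.capacity;
    const out: string[] = [];
    for (let i = 0; i < this.count; i++) {
      out.push(this.slots[(start + i) % this.capacity]);
    }
    return out;
  }
}
```

With a capacity of 100, snapshot() would yield the content for last_100_lines.log.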

Acceptance Criteria:

  • ✅ Intentional crash (SIGSEGV) generates complete crash dump
  • ✅ Crash reports displayed on next game launch
  • ✅ Can view crash details and optionally upload
  • ✅ "Continue from safe state" recovers pre-crash world
  • ✅ Frozen process (no heartbeat) triggers graceful shutdown

Phase 2: Screenshot Capture System

Status: 🟡 Ready to Start (No dependencies)
Duration: 2-3 days
Dependencies: None (can be parallel with Phase 1)

Deliverables:

  1. src/engine/graphics/screenshot.zig - Screenshot capture module
  2. Vulkan swapchain image readback logic
  3. STB PNG encoding (leverage existing libs/stb/stb_image_impl.c)
  4. Screenshot organization: screenshots/session_<timestamp>/<timestamp>.png
  5. Capture modes:
    • Manual: F11 hotkey
    • Console: screenshot [name] command
    • Auto-timer: screenshot start/stop [interval_seconds]
    • Crash-triggered: Capture last frame before crash

Key Technical Details:

  • Async capture (queue request, render thread processes next frame)
  • VkImage → stbi_write_png pipeline (read the image back to CPU memory, then encode as PNG)
  • Maximum screenshot directory size (e.g., 500MB) with oldest deletion
  • Thumbnail generation using stb_image_resize (optional)
  • Crash hook calls screenshot handler in signal-safe path (if possible)
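
The directory size cap with oldest-first deletion could work roughly like the sketch below. It is written in TypeScript for illustration only. The ShotFile shape and the function name selectForDeletion are assumptions, and the function only decides which files to remove; it does not touch the filesystem.

```typescript
// Hypothetical description of one screenshot file on disk.
interface ShotFile {
  path: string;
  bytes: number;
  mtimeMs: number; // last-modified time in milliseconds
}

// Returns the paths to delete, oldest first, so that the remaining
// files fit within maxBytes. Returns an empty list if already under the cap.
function selectForDeletion(files: ShotFile[], maxBytes: number): string[] {
  let total = files.reduce((sum, f) => sum + f.bytes, 0);
  const oldestFirst = [...files].sort((a, b) => a.mtimeMs - b.mtimeMs);
  const doomed: string[] = [];
  for (const f of oldestFirst) {
    if (total <= maxBytes) break;
    doomed.push(f.path);
    total -= f.bytes;
  }
  return doomed;
}
```

Running this check after each capture keeps the cleanup cost proportional to the number of files rather than to their contents.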

Acceptance Criteria:

  • ✅ F11 saves screenshot to correct directory
  • ✅ Console command screenshot debug_view works
  • ✅ Auto-timer screenshot start 5 captures every 5 seconds
  • ✅ Crash-triggered screenshot captured (last frame)
  • ✅ Screenshots are valid PNG files
  • ✅ Directory size limit respected

Phase 3: Debug Console

Status: 🟢 Ready to Start (Depends on Phase 2)
Duration: 2-3 days
Dependencies: Phase 2 (screenshot command)

Deliverables:

  1. src/engine/ui/debug_console.zig - Console UI overlay
  2. src/engine/core/command_registry.zig - Command plugin system
  3. Toggle key: ` (backtick)
  4. Console features:
    • Text input field
    • Command history (up/down arrows)
    • Command autocomplete (tab completion)
    • Output buffer with scroll
  5. Core commands:
    help [command]           - Show command list or details
    screenshot [name]        - Take screenshot (Phase 2)
    stats                    - Print frame metrics
    teleport <x> <y> <z>     - Teleport player
    set_time <0-23>          - Set time of day
    toggle_wireframe         - Toggle wireframe mode
    toggle_debug_shadows     - Toggle shadow debug visualization
    save_state [name]        - Save state snapshot (Phase 7)
    load_state [name]        - Load state snapshot (Phase 7)
    crash                    - Trigger intentional crash (Phase 1)
    
  6. Command registration API for extensibility

Key Technical Details:

  • Leverage existing UISystem for rendering
  • Input handling via existing Input system
  • Command parsing: command arg1 arg2 --flag value
  • Autocomplete: fuzzy match on registered commands
  • Command context: access to EngineContext for game state
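
A possible reading of the command arg1 arg2 --flag value grammar is sketched below in TypeScript. The names ParsedCommand and parseCommandLine are hypothetical. The sketch also assumes that a flag with no following value is treated as a boolean, which the epic does not specify.

```typescript
interface ParsedCommand {
  command: string;
  args: string[];
  flags: Record<string, string>;
}

// Splits on whitespace; tokens starting with "--" are flags that consume
// the next token as their value. Returns null for an empty line.
function parseCommandLine(line: string): ParsedCommand | null {
  const tokens = line.trim().split(/\s+/).filter((t) => t.length > 0);
  if (tokens.length === 0) return null;
  const parsed: ParsedCommand = { command: tokens[0], args: [], flags: {} };
  for (let i = 1; i < tokens.length; i++) {
    const tok = tokens[i];
    if (tok.startsWith("--") && tok.length > 2) {
      const value = tokens[i + 1];
      if (value !== undefined && !value.startsWith("--")) {
        parsed.flags[tok.slice(2)] = value;
        i++; // consume the flag's value
      } else {
        parsed.flags[tok.slice(2)] = "true"; // assumption: bare flags are booleans
      }
    } else {
      parsed.args.push(tok); // includes negative numbers such as "-3"
    }
  }
  return parsed;
}
```

Because only a double dash marks a flag, negative coordinates in teleport <x> <y> <z> still parse as positional arguments.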

Acceptance Criteria:

  • ✅ Backtick opens/closes console
  • ✅ Can type commands and see output
  • ✅ Arrow keys navigate history
  • ✅ Tab completes commands
  • ✅ All core commands work correctly
  • ✅ Custom commands can be registered

Phase 4: Remote HTTP/WebSocket API

Status: 🔴 Critical Path (Depends on Phase 3)
Duration: 4-5 days
Dependencies: Phase 3 (console integration), Phase 2 (screenshot endpoint)

Deliverables:

  1. src/engine/remote/api_server.zig - HTTP server using std.http.Server
  2. src/engine/remote/websocket.zig - WebSocket implementation
  3. src/engine/remote/protocol.zig - Request/response type definitions
  4. REST API endpoints:
    GET    /api/status                    - Server status, version, uptime
    GET    /api/stats                     - Current metrics (Phase 5)
    GET    /api/screenshot                - Get last screenshot path/URL
    POST   /api/screenshot                - Trigger screenshot
    POST   /api/input/inject              - Inject input event (Phase 6)
    POST   /api/console/execute           - Execute console command
    GET    /api/state                     - Get current game state
    POST   /api/state/snapshot            - Create state snapshot (Phase 7)
    POST   /api/state/restore             - Restore state snapshot (Phase 7)
    GET    /api/crashes                   - List crash dumps (Phase 1)
    GET    /api/logs?level=X&lines=N      - Fetch log lines
    GET    /api/metrics                   - Get metrics history (Phase 5)
    
  5. WebSocket endpoint: /api/ws
    • Subscribe to telemetry stream (FPS, frame time, memory, entities)
    • Bidirectional commands
    • Heartbeat ping/pong
  6. Security:
    • Configurable API key via Authorization header
    • Default: localhost-only mode
    • Rate limiting per IP

Key Technical Details:

  • HTTP server runs on background thread (non-blocking)
  • Use std.http.Server (Zig stdlib, no external deps)
  • WebSocket frame parsing (RFC 6455)
  • JSON serialization via custom writer (avoid heavy JSON lib)
  • CORS headers for cross-origin requests
  • Graceful shutdown: flush pending requests before exit
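
To make the RFC 6455 frame parsing step concrete, here is a hedged TypeScript sketch of decoding a single frame. The function name decodeWsFrame and the WsFrame shape are hypothetical. It covers the header bits, the three payload-length encodings, and client masking, but not fragmentation or control-frame rules.

```typescript
interface WsFrame {
  fin: boolean;
  opcode: number; // 0x1 = text, 0x2 = binary, 0x8 = close, 0x9 = ping, 0xA = pong
  payload: Uint8Array; // unmasked payload bytes
  frameLength: number; // total bytes consumed from the input
}

// Decodes one frame from the start of `buf`; returns null if more bytes are needed.
function decodeWsFrame(buf: Uint8Array): WsFrame | null {
  if (buf.length < 2) return null;
  const fin = (buf[0] & 0x80) !== 0;
  const opcode = buf[0] & 0x0f;
  const masked = (buf[1] & 0x80) !== 0;
  let len = buf[1] & 0x7f;
  let offset = 2;
  if (len === 126) {
    // 16-bit extended length, network byte order
    if (buf.length < 4) return null;
    len = (buf[2] << 8) | buf[3];
    offset = 4;
  } else if (len === 127) {
    // 64-bit extended length, network byte order
    if (buf.length < 10) return null;
    len = 0;
    for (let i = 2; i < 10; i++) len = len * 256 + buf[i];
    if (len > Number.MAX_SAFE_INTEGER) throw new RangeError("frame too large");
    offset = 10;
  }
  const maskLen = masked ? 4 : 0;
  const start = offset + maskLen;
  if (buf.length < start + len) return null;
  const payload = buf.slice(start, start + len); // copy, so unmasking leaves buf intact
  if (masked) {
    const key = buf.subarray(offset, offset + 4);
    for (let i = 0; i < payload.length; i++) payload[i] ^= key[i % 4];
  }
  return { fin, opcode, payload, frameLength: start + len };
}
```

Frames sent by clients are always masked under RFC 6455, so a server-side parser needs the unmasking branch even for a localhost-only debug API.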

Settings Integration:

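// src/game/settings.zig (new field inside the existing settings struct)
remote_debug: struct {
    enabled: bool = false, // debug server is off unless explicitly enabled
    port: u16 = 8080,
    api_key: []const u8 = "", // empty string: no API key configured
    allow_remote: bool = false, // localhost only by default
} = .{},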

Acceptance Criteria:

  • ✅ Server starts on configured port
  • ✅ All REST endpoints return valid JSON
  • ✅ WebSocket connects and receives telemetry
  • ✅ CORS headers present for browser clients
  • ✅ API key validation works
  • ✅ Rate limiting blocks abusive requests
  • ✅ Server gracefully shuts down on game quit

Phase 5: Metrics & Telemetry

Status: 🟢 Ready to Start (No blocking dependencies)
Duration: 2-3 days
Dependencies: None (can be parallel with Phases 1-3)

Deliverables:

  1. src/engine/core/metrics.zig - Metrics collection system
  2. Collected metrics:
    • Frame time: min, max, avg (last 60 frames)
    • FPS: current, 1s avg, 10s avg
    • Render pass timings (shadow, GPass, SSAO, etc.)
    • Memory: allocated, freed, current usage (heap allocator stats)
    • World: loaded chunks, meshing queue size, entities count
    • Graphics: draw calls, triangles, texture memory
    • Job system: active jobs, thread utilization
  3. Export formats:
    • JSON: Single snapshot
    • CSV: Time series history
    • Live stream: WebSocket (Phase 4)
  4. Console commands:
    metrics              - Print current metrics
    metrics export csv   - Export to file
    metrics history N    - Print last N frames
    

Key Technical Details:

  • Rolling buffer for history (configurable size, e.g., 3600 frames = 1 min at 60fps)
  • Update metrics in App.runSingleFrame() after all systems updated
  • Frame timing via std.time.nanoTimestamp
  • Memory stats via std.heap.GeneralPurposeAllocator in debug mode
  • Zero-allocation hot path (use pre-allocated buffers)
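
The rolling buffer and the min/max/avg frame-time statistics can be sketched as below. This is illustrative TypeScript with a hypothetical class name, FrameTimeWindow. It preallocates its storage once, mirroring the zero-allocation goal, though the engine version would be written in Zig.

```typescript
// Fixed-size window over the most recent frame times (in milliseconds).
class FrameTimeWindow {
  private readonly samples: Float64Array; // preallocated once, reused every frame
  private next = 0;
  private count = 0;

  constructor(size: number) {
    if (!Number.isInteger(size) || size <= 0) {
      throw new RangeError("size must be a positive integer");
    }
    this.samples = new Float64Array(size);
  }

  // Records one frame time, overwriting the oldest sample when full.
  record(frameMs: number): void {
    this.samples[this.next] = frameMs;
    this.next = (this.next + 1) % this.samples.length;
    if (this.count < this.samples.length) this.count++;
  }

  // Min, max, and average over the retained samples; null before the first frame.
  stats(): { min: number; max: number; avg: number } | null {
    if (this.count === 0) return null;
    let min = Infinity;
    let max = -Infinity;
    let sum = 0;
    // Order does not matter for these aggregates, so scan the valid slots directly.
    for (let i = 0; i < this.count; i++) {
      const v = this.samples[i];
      if (v < min) min = v;
      if (v > max) max = v;
      sum += v;
    }
    return { min, max, avg: sum / this.count };
  }
}
```

A window of 60 matches the "last 60 frames" metric, while a window of 3600 matches the one-minute history at 60fps.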

Acceptance Criteria:

  • ✅ Metrics updated every frame with minimal overhead
  • ✅ Console metrics prints readable output
  • ✅ CSV export works: metrics export csv
  • ✅ WebSocket telemetry stream updates at 10Hz
  • ✅ Metrics accurate (verified vs manual measurements)

Phase 6: Input Injection

Status: 🟢 Ready to Start (Depends on Phase 4)
Duration: 3-4 days
Dependencies: Phase 4 (API endpoint needs injection system)

Deliverables:

  1. src/engine/input/injector.zig - Input queue and injection system
  2. Input injection types:
    // Key press
    {
      "type": "key",
      "key": "W|A|S|D|Space|Shift|F1|Escape|...",
      "action": "press|release|tap"
    }
    
    // Mouse movement
    {
      "type": "mouse_move",
      "x": 640,
      "y": 480
    }
    
    // Mouse button
    {
      "type": "mouse_button",
      "button": "left|right|middle",
      "action": "press|release|click"
    }
    
    // Mouse scroll
    {
      "type": "mouse_scroll",
      "scroll_y": 1.0
    }
  3. Thread-safe input queue (producer: API; consumer: game loop)
  4. Input recording system:
    • Record real input to JSON
    • Playback recorded sequences
    • Speed control (0.5x, 1x, 2x, etc.)
  5. Console command: input_playback <file.json> [speed]

Key Technical Details:

  • Queue uses std.ArrayList protected by std.Thread.Mutex
  • Game thread processes queue in Input.pollEvents() (after SDL events)
  • Injected events marked as synthetic (for debugging/avoiding feedback loops)
  • Input validation: reject unknown keys, out-of-bounds positions
  • Timestamp-based playback (maintain original timing)
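
Timestamp-based playback with speed control amounts to rescaling the gaps between recorded events. The TypeScript sketch below shows one way to compute the schedule. The field names timeMs and payload and the function schedulePlayback are assumptions, since the epic does not define the recording format.

```typescript
interface RecordedInput {
  timeMs: number; // hypothetical field: capture time in milliseconds
  payload: string; // the injected event, e.g. its serialized JSON
}

interface ScheduledInput {
  delayMs: number; // when to inject, relative to playback start
  payload: string;
}

// Offsets are measured from the first event and divided by the speed factor,
// so 2x halves every gap and 0.5x doubles it while preserving relative timing.
function schedulePlayback(events: RecordedInput[], speed: number): ScheduledInput[] {
  if (!Number.isFinite(speed) || speed <= 0) {
    throw new RangeError("speed must be a positive, finite number");
  }
  if (events.length === 0) return [];
  const startMs = events[0].timeMs;
  return events.map((e) => ({
    delayMs: (e.timeMs - startMs) / speed,
    payload: e.payload,
  }));
}
```

The game loop can then release each event once the elapsed playback time reaches its delayMs.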

Acceptance Criteria:

  • ✅ API POST /api/input/inject successfully presses keys
  • ✅ Multiple input events processed in correct order
  • ✅ Recording creates valid JSON
  • ✅ Playback reproduces recorded input
  • ✅ Speed control changes playback rate
  • ✅ Queue doesn't cause frame drops

Phase 7: State Snapshots

Status: 🟢 Ready to Start (Depends on Phase 4)
Duration: 3-4 days
Dependencies: Phase 4 (API endpoint)

Deliverables:

  1. src/game/snapshot.zig - State serialization system
  2. Snapshot contents:
    • Player: position, rotation, velocity, inventory
    • World: seed, loaded chunks (positions), modified blocks
    • Time: game time, day/night cycle
    • Settings: graphics options, input bindings
    • Session metadata: timestamp, version, frame count
  3. Snapshot format:
    • Binary: compact, fast write/read
    • Optional JSON: human-readable (for debugging)
    • Compression: std.compress.zlib (optional)
  4. Snapshot management:
    • List: snapshots/ directory scan
    • Delete: rm snapshots/<name>.snap
    • Compare: diff snapshots/A.snap snapshots/B.snap
    • Auto-snapshot on crash (Phase 1)
  5. Console commands:
    save_state [name]        - Save snapshot
    load_state [name]        - Load snapshot
    list_states              - List snapshots
    
  6. API endpoints:
    POST /api/state/snapshot - Create snapshot
    POST /api/state/restore  - Restore snapshot
    GET  /api/state/list     - List snapshots
    

Key Technical Details:

  • Serialization: std.io.BufferedWriter + custom binary format
  • Chunk data: only store modified blocks (diff from seed)
  • Validate snapshot on load (checksum via FNV-1a, e.g., std.hash.Fnv1a_64)
  • Loading: hot-reload world chunks, recreate player state
  • Fallback: if load fails, continue with current state (don't crash)
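
The checksum validation step can be illustrated with the 32-bit variant of FNV-1a, sketched here in TypeScript. The function names are hypothetical, and the choice of the 32-bit width is an assumption made to keep the arithmetic within JavaScript integers; the engine may well use a wider variant.

```typescript
// 32-bit FNV-1a: XOR each byte into the hash, then multiply by the FNV prime.
function fnv1a32(bytes: Uint8Array): number {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < bytes.length; i++) {
    hash ^= bytes[i];
    hash = Math.imul(hash, 0x01000193); // FNV prime, with 32-bit wraparound
  }
  return hash >>> 0; // reinterpret as unsigned
}

// Returns true when the stored checksum matches the payload.
// On a mismatch the caller would keep the current state instead of loading.
function verifySnapshot(payload: Uint8Array, storedChecksum: number): boolean {
  return fnv1a32(payload) === storedChecksum >>> 0;
}
```

FNV-1a detects accidental corruption cheaply, but it is not a cryptographic hash and does not protect against deliberate tampering.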

Acceptance Criteria:

  • ✅ save_state test creates valid snapshot
  • ✅ load_state test restores state exactly
  • ✅ Snapshots capture modified terrain
  • ✅ Load fails gracefully on corrupted snapshot
  • ✅ API endpoint creates and loads snapshots
  • ✅ Crash auto-snapshot captures pre-crash state

Phase 8: OpenCode MCP Integration

Status: 🟡 Ready to Start (Depends on Phases 4, 6, 7)
Duration: 3-4 days
Dependencies: Phase 4 (API), Phase 6 (input), Phase 7 (snapshots)

Deliverables:

  1. opencode/mcp-zigcraft/ - MCP server implementation (Bun/TypeScript)
  2. MCP tools (following OpenCode tool format):
    // tool/take-screenshot.ts
    {
      description: "Capture game screenshot",
      args: { name: string },
      returns: { path: string, url: string }
    }
    
    // tool/inject-input.ts
    {
      description: "Inject keyboard/mouse input",
      args: { type: string, key?: string, x?: number, ... },
      returns: { success: boolean }
    }
    
    // tool/query-state.ts
    {
      description: "Get current game state",
      args: { },
      returns: { player: {...}, world: {...}, time: ... }
    }
    
    // tool/console-command.ts
    {
      description: "Execute debug console command",
      args: { command: string },
      returns: { output: string, success: boolean }
    }
    
    // tool/get-metrics.ts
    {
      description: "Get performance metrics",
      args: { format: "json|csv" },
      returns: { metrics: {...} }
    }
    
    // tool/list-crashes.ts
    {
      description: "List recent crash reports",
      args: { },
      returns: { crashes: [...] }
    }
    
    // tool/save-snapshot.ts
    {
      description: "Create state snapshot",
      args: { name: string },
      returns: { path: string }
    }
    
    // tool/load-snapshot.ts
    {
      description: "Load state snapshot",
      args: { name: string },
      returns: { success: boolean }
    }
  3. MCP server implementation:
    • Use @opencode-ai/plugin SDK
    • Connect to game via HTTP/WebSocket (Phase 4)
    • Error handling and retry logic
    • Discovery of game instance (configurable host/port)
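
The retry logic in the MCP client could look like the following sketch, which assumes exponential backoff with a cap. The helper names backoffDelaysMs and withRetry are hypothetical, and the sleep function is passed in so the sketch does not depend on any particular runtime timer.

```typescript
// Delay before each retry: baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
function backoffDelaysMs(retries: number, baseMs: number, maxMs: number): number[] {
  const delays: number[] = [];
  for (let i = 0; i < retries; i++) {
    delays.push(Math.min(maxMs, baseMs * Math.pow(2, i)));
  }
  return delays;
}

// Runs `op`, retrying after each delay in `delaysMs` if it throws.
// Rethrows the last error once all retries are exhausted.
async function withRetry<T>(
  op: () => Promise<T>,
  delaysMs: number[],
  sleep: (ms: number) => Promise<void>,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= delaysMs.length; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      if (attempt < delaysMs.length) await sleep(delaysMs[attempt]);
    }
  }
  throw lastError;
}
```

A tool handler could wrap each HTTP call in withRetry so that a game restart shows up as a short delay rather than a tool failure.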

File Structure:

opencode/mcp-zigcraft/
├── index.ts              # MCP server entry point
├── client.ts             # HTTP/WebSocket client to game
├── types.ts              # TypeScript types for API
└── tool/
    ├── take-screenshot.ts
    ├── inject-input.ts
    ├── query-state.ts
    ├── console-command.ts
    ├── get-metrics.ts
    ├── list-crashes.ts
    ├── save-snapshot.ts
    └── load-snapshot.ts

Configuration:

// opencode.json
{
  "mcp": {
    "zigcraft": {
      "type": "local",
      "command": ["bun", "run", "opencode/mcp-zigcraft/index.ts"],
      "env": {
        "ZIGCRAFT_API_HOST": "http://localhost:8080"
      },
      "enabled": true
    }
  }
}

Acceptance Criteria:

  • ✅ MCP server starts without errors
  • ✅ All 8 tools are discoverable by OpenCode
  • ✅ Tools successfully call game API
  • ✅ Error responses handled gracefully
  • ✅ Retry logic handles game restart
  • ✅ AI agent can use tools end-to-end

Phase 9: CI/CD Integration

Status: 🟢 Final Phase (Depends on all previous)
Duration: 2-3 days
Dependencies: All phases (especially Phase 2, 4, 7)

Deliverables:

  1. Visual regression test workflow
  2. Crash recovery test workflow
  3. API integration test suite
  4. Headless screenshot automation

Key Components:

A. Headless Screenshot Testing:

  • Build flag: -Dheadless=true (render to offscreen buffer)
  • Automated scenario script:
    // test_scenarios/basic_render.json
    {
      "name": "Basic Render Test",
      "steps": [
        { "command": "teleport 0 100 0" },
        { "wait": 60 }, // wait 60 frames
        { "screenshot": "frame_001" },
        { "command": "set_time 12" },
        { "wait": 60 },
        { "screenshot": "frame_002" }
      ]
    }
  • Compare screenshots against baseline (assets/baselines/)
  • Fail CI if pixel difference > threshold (e.g., 1%)
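
The baseline comparison can be sketched as a per-pixel diff over raw RGBA buffers. This TypeScript example is illustrative only: the function name pixelDiffRatio and the per-channel tolerance parameter are assumptions, and it expects images already decoded from PNG into same-sized buffers.

```typescript
// Fraction of pixels (0..1) where any RGBA channel differs by more than `tolerance`.
function pixelDiffRatio(a: Uint8Array, b: Uint8Array, tolerance: number = 0): number {
  if (a.length !== b.length || a.length % 4 !== 0) {
    throw new Error("images must be same-size RGBA buffers");
  }
  const pixels = a.length / 4;
  if (pixels === 0) return 0;
  let differing = 0;
  for (let p = 0; p < pixels; p++) {
    for (let c = 0; c < 4; c++) {
      if (Math.abs(a[p * 4 + c] - b[p * 4 + c]) > tolerance) {
        differing++;
        break; // count each pixel at most once
      }
    }
  }
  return differing / pixels;
}
```

With the 1% threshold mentioned above, CI would fail when pixelDiffRatio(baseline, current) exceeds 0.01.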

B. Crash Recovery Testing:

  • Test script that triggers crashes:
    // test_scenarios/crash_test.json
    {
      "steps": [
        { "command": "crash" },
        { "restart": true }, // auto-restart game
        { "verify": "crash_report_exists" },
        { "load_snapshot": "auto_recovery" },
        { "verify": "state_restored" }
      ]
    }
    
  • Verify crash handler creates dump
  • Verify auto-recovery loads correct state

C. API Integration Tests:

  • TypeScript test suite run with Bun (opencode/tests/api.ts), using an HTTP client such as supertest or axios
  • Test all REST endpoints with valid/invalid inputs
  • Test WebSocket connection and telemetry
  • Test rate limiting and auth

GitHub Workflow Addition:

# .github/workflows/debugging-tools.yml
name: Debugging Tools Test
on: [push, pull_request]
jobs:
  visual-regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Nix
        uses: DeterminateSystems/nix-installer-action@v16
      - name: Run screenshot tests
        run: nix develop --command zig build test-screenshots
  
  crash-recovery:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Nix
        uses: DeterminateSystems/nix-installer-action@v16
      - name: Run crash tests
        run: nix develop --command zig build test-crash-recovery
  
  api-integration:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Nix
        uses: DeterminateSystems/nix-installer-action@v16
      - name: Run API tests
        run: |
          nix develop --command zig build run -Dremote_debug=true &
          sleep 10
          bun run opencode/tests/api.ts

Acceptance Criteria:

  • ✅ Visual regression passes on main branch
  • ✅ Intentional rendering changes trigger test failure
  • ✅ Crash recovery test passes
  • ✅ API integration test suite passes (all endpoints)
  • ✅ Tests run in CI without display server

Parallelization Strategy

This epic is organized to maximize parallel development while respecting dependencies:

Track A (Core Infrastructure) - Can run in parallel

  • Phase 1: Crash Detection - Independent
  • Phase 2: Screenshots - Independent
  • Phase 5: Metrics - Independent

These three phases have no dependencies on each other and can be developed simultaneously by different developers.

Track B (User Interfaces) - Depends on Track A

  • Phase 3: Debug Console - Depends on Phase 2 (screenshot command)
  • Phase 4: Remote API - Depends on Phase 3 (console integration), Phase 2 (screenshot endpoint)
  • Phase 6: Input Injection - Depends on Phase 4 (API endpoint)
  • Phase 7: State Snapshots - Depends on Phase 4 (API endpoint)

These phases build on Track A but can proceed in parallel once dependencies are met:

  • Phase 4 can start once Phase 2 is complete; console integration is wired in when Phase 3 lands
  • Phase 6 can start once Phase 4 is partially complete (just need API server)
  • Phase 7 can start once Phase 4 is partially complete (just need API server)

Track C (Integration) - Depends on Track B

  • Phase 8: MCP Integration - Depends on Phase 4, 6, 7 (needs stable API)
  • Phase 9: CI/CD - Depends on all previous phases

Critical Path (Sequential)

The minimum sequential work required:

  1. Phase 2 (Screenshots) → Phase 3 (Console) → Phase 4 (API) → Phase 8 (MCP)
  2. Phase 1 (Crash) can happen anytime
  3. Phase 5 (Metrics) can happen anytime
  4. Phase 6/7 can start once Phase 4 is ready
  5. Phase 9 requires everything

Recommended Parallel Work:

| Week | Developer A       | Developer B           | Developer C         |
|------|-------------------|-----------------------|---------------------|
| 1    | Phase 1 (Crash)   | Phase 2 (Screenshots) | Phase 5 (Metrics)   |
| 2    | Phase 3 (Console) | Phase 4 (API)         | Phase 5 (Metrics)   |
| 3    | Phase 4 (API)     | Phase 6 (Input)       | Phase 7 (Snapshots) |
| 4    | Phase 8 (MCP)     | Phase 7 (Snapshots)   | Documentation       |
| 5    | Phase 9 (CI/CD)   | Testing               | Testing             |

Merge Strategy

This epic will be merged in 3 major milestones to reduce integration risk and provide intermediate value:

Milestone 1: Core Diagnostics (Week 1-2)

Merge Target: Feature branch debug-core-diagnostics

Includes:

  • Phase 1: Crash Detection & Reporting
  • Phase 2: Screenshot Capture System
  • Phase 5: Metrics & Telemetry

Value:

  • Crash dumps now captured automatically
  • Developers can take screenshots
  • Metrics available via console

Branch: feature/debug-core-diagnostics

Milestone 2: Remote Control (Week 3-4)

Merge Target: Feature branch debug-remote-control

Includes:

  • Phase 3: Debug Console
  • Phase 4: Remote HTTP/WebSocket API
  • Phase 6: Input Injection
  • Phase 7: State Snapshots

Value:

  • Full remote debugging capability
  • AI agents can now interact with game
  • Reproducible debugging via snapshots

Branch: feature/debug-remote-control

Depends on: Milestone 1 (merged to main)

Milestone 3: AI Integration & CI (Week 5)

Merge Target: Feature branch debug-ai-integration

Includes:

  • Phase 8: OpenCode MCP Integration
  • Phase 9: CI/CD Integration

Value:

  • Full AI agent debugging workflow
  • Automated visual regression testing
  • Crash recovery verification

Branch: feature/debug-ai-integration

Depends on: Milestone 2 (merged to main)

File Structure Summary

New Files:

src/
├── engine/
│   ├── core/
│   │   ├── crash_handler.zig       [Phase 1]
│   │   ├── metrics.zig             [Phase 5]
│   │   └── command_registry.zig    [Phase 3]
│   ├── graphics/
│   │   └── screenshot.zig          [Phase 2]
│   ├── input/
│   │   └── injector.zig            [Phase 6]
│   ├── remote/
│   │   ├── api_server.zig          [Phase 4]
│   │   ├── websocket.zig           [Phase 4]
│   │   └── protocol.zig            [Phase 4]
│   └── ui/
│       └── debug_console.zig       [Phase 3]
└── game/
    └── snapshot.zig                 [Phase 7]

opencode/
└── mcp-zigcraft/                   [Phase 8]
    ├── index.ts
    ├── client.ts
    ├── types.ts
    └── tool/
        ├── take-screenshot.ts
        ├── inject-input.ts
        ├── query-state.ts
        ├── console-command.ts
        ├── get-metrics.ts
        ├── list-crashes.ts
        ├── save-snapshot.ts
        └── load-snapshot.ts

test_scenarios/                     [Phase 9]
├── basic_render.json
└── crash_test.json

assets/baselines/                    [Phase 9]
└── screenshots/
    ├── frame_001.png
    └── frame_002.png

Modified Files:

src/engine/core/log.zig             [Phase 1 - add crash logger]
src/game/app.zig                    [All phases - integrate systems]
src/game/settings.zig               [Phase 4 - add remote_debug config]
src/game/screens/world.zig          [Phase 3 - integrate console]
build.zig                           [Phase 9 - add test targets]
opencode.json                       [Phase 8 - add MCP config]
.github/workflows/build.yml         [Phase 9 - add debug tests]

Technical Decisions & Rationale

1. HTTP Server Choice

Decision: Use Zig's std.http.Server (no external dependencies)

Rationale:

  • Keeps Nix environment clean (no additional packages)
  • Already available in stdlib (Zig 0.14)
  • Sufficient performance for debugging API
  • Avoids the security exposure of pulling in a third-party HTTP library
  • Single binary distribution (no runtime deps)

2. Screenshot Format

Decision: PNG via STB (existing libs/stb/stb_image_write.h)

Rationale:

  • STB already in project for texture loading
  • Lossless compression (important for visual regression)
  • Easy for AI agents to analyze
  • Standard format with broad tooling support
  • No additional dependencies

3. WebSocket vs Pure REST

Decision: Both - REST for commands, WebSocket for telemetry

Rationale:

  • REST: Simple, stateless request/response operations (well suited to one-off commands and queries)
  • WebSocket: Real-time bidirectional stream (ideal for telemetry)
  • AI can choose optimal interface based on use case
  • Reduces HTTP polling overhead for metrics

4. Input Injection: IPC vs In-Process Queue

Decision: In-process queue (no external IPC)

Rationale:

  • Simpler implementation (no socket/pipe complexity)
  • No threading issues with external processes
  • Still works with remote API (API pushes to in-process queue)
  • Lower latency for immediate debugging

5. State Snapshot Format

Decision: Binary with optional JSON, optional compression

Rationale:

  • Binary: Fast serialization/deserialization, compact size
  • JSON: Human-readable for manual inspection (debug tool output)
  • Compression: Optional (trade size vs. CPU)
  • Custom format (no heavy serialization lib like MessagePack)

6. MCP Implementation Language

Decision: TypeScript/Bun (matches OpenCode config)

Rationale:

  • OpenCode already uses Bun/TypeScript for tools
  • @opencode-ai/plugin SDK designed for TypeScript
  • Excellent HTTP/WebSocket libraries in npm ecosystem
  • Type safety for API contracts
  • Easy to test and debug

Risk Assessment

| Risk | Impact | Likelihood | Mitigation |
|------|--------|------------|------------|
| Vulkan screenshot capture fails | High (Phase 2 blocked) | Medium | Fallback to glReadPixels if VkImage readback fails |
| Signal handler crashes | Critical (system instability) | Low | Use async-signal-safe operations only, test extensively |
| HTTP server blocks render thread | Medium (frame drops) | Low | Run API server on background thread |
| WebSocket connection instability | Medium (agent loses connection) | Medium | Auto-reconnect with exponential backoff |
| Snapshot corruption | High (data loss) | Low | Checksum validation, keep N backups |
| Performance overhead | Medium (degraded gameplay) | Low | Optional enable/disable (default off), profile hot paths |
| CI display requirement | Medium (tests can't run) | High | Use headless Wayland (already in CI) + virtual display |

Acceptance Criteria (Epic Level)

Definition of Done:

  1. ✅ All 9 phases completed and merged in their respective milestones
  2. ✅ All unit tests pass (zig build test)
  3. ✅ Integration tests pass (zig build test-integration)
  4. ✅ New CI tests pass (visual regression, crash recovery, API tests)
  5. ✅ AI agent can successfully:
    • Take screenshots and analyze them
    • Query game state and metrics
    • Inject input to control game
    • Save/restore snapshots to reproduce bugs
    • Execute debug console commands
  6. ✅ Crash handler successfully captures diagnostics on SIGSEGV
  7. ✅ Remote API accessible via MCP tools
  8. ✅ Documentation updated (AGENTS.md, README.md)
  9. ✅ Code formatted (zig fmt src/)
  10. ✅ No regressions in existing functionality

Success Metrics

Quantitative:

  • Crash detection: 100% of crashes captured with stacktrace
  • Screenshot capture time: <100ms (non-blocking)
  • API response time: <50ms for status/metrics queries
  • WebSocket telemetry latency: <10ms
  • Snapshot save time: <1s for typical world state
  • Snapshot load time: <2s for typical world state
  • Memory overhead: <50MB with all debug systems enabled

Qualitative:

  • AI agents can diagnose visual bugs without human intervention
  • Developers can reproduce bugs from crash reports in <5 minutes
  • Remote debugging reduces time to root cause by 50%
  • CI catches visual regressions before merge

Open Questions

  1. Screenshot resolution: Should screenshots capture at native resolution or match window size? (Recommend: native for better debugging detail)

  2. Metrics retention: How long should metrics history be kept? (Recommend: 3600 frames = 1 minute @ 60fps, configurable)

  3. API authentication: Should we support token-based auth for remote access? (Recommend: API key header, configurable)

  4. Snapshot compression: Enable compression by default? (Recommend: Yes for worlds with more than 100MB of chunk data)

  5. WebSocket rate limit: What's a reasonable telemetry update rate? (Recommend: 10Hz, configurable)

Related Issues

None yet (will be created as sub-issues)

Notes for Sub-Issue Creation

When splitting this epic into sub-issues:

  1. One phase per sub-issue (9 sub-issues total)
  2. Reference this epic in all sub-issues: Part of #[EPIC_NUMBER]
  3. Include acceptance criteria from this epic in each sub-issue
  4. Tag sub-issues with milestone labels:
    • milestone-1-core-diagnostics (Phases 1, 2, 5)
    • milestone-2-remote-control (Phases 3, 4, 6, 7)
    • milestone-3-ai-integration (Phases 8, 9)
  5. Add dependency links between sub-issues based on phase dependencies
  6. Use this epic's file structure section for reference

Summary

This epic implements a complete debugging infrastructure that transforms ZigCraft from a "black box" to a transparent, observable system. By building crash detection, screenshot capture, remote API, and AI integration in a phased approach, we minimize risk while delivering incremental value at each milestone.

The parallelization strategy allows 3 developers to work simultaneously, and the 3-milestone merge strategy ensures integration risks are caught early. The final result will enable AI agents to debug the game effectively, dramatically reducing the time needed to identify and fix visual bugs, crashes, and performance issues.
