Epic: AI-Assisted Debugging Infrastructure for ZigCraft
Summary
This epic implements a comprehensive debugging infrastructure that enables AI agents to effectively debug, analyze, and interact with the ZigCraft voxel engine. The system includes crash detection and reporting, screenshot capture, a debug console, a remote HTTP/WebSocket API, metrics collection, input injection, state snapshotting, and OpenCode MCP integration.
Primary Goals:
- Enable AI agents to visually inspect game state
- Allow remote control and inspection via HTTP/WebSocket APIs
- Automatically capture diagnostic information on crashes
- Provide reproducible debugging through state snapshots
- Support both local development and CI/CD environments
Background
Currently, debugging ZigCraft requires manual reproduction and inspection. With a growing codebase (Vulkan RHI, PBR rendering, multi-threaded world generation, complex shaders), debugging visual bugs, crashes, and performance issues is time-consuming and often requires running the game locally.
AI agents (via OpenCode) have limited ability to debug graphics applications because they cannot:
- See the rendered output
- Interact with the game interface
- Query internal state
- Replay actions to reproduce bugs
This epic builds a debugging infrastructure that exposes game state and control surfaces to AI agents through multiple interfaces, making automated debugging feasible.
Scope
In Scope:
- Crash detection, reporting, and recovery
- Screenshot capture system (manual, auto, crash-triggered)
- In-game debug console with extensible command system
- RESTful HTTP API and WebSocket telemetry server
- Performance metrics collection and export
- Input injection system for remote control
- State snapshot system for reproducible debugging
- OpenCode MCP server integration for AI tooling
- CI/CD integration with headless screenshot testing
Out of Scope:
- Full-blown game editor tools
- Multiplayer networking for remote debugging
- GPU profiling tools (existing tools like RenderDoc recommended)
- Automated bug fixing (agent only assists in diagnosis)
Phases & Dependencies
This epic is organized into 9 phases with clear dependency chains. Some phases can be developed in parallel, and work is merged in 3 major milestones.
Phase 1: Crash Detection & Reporting
Status: 🔴 Blocking (Priority - Critical)
Duration: 3-4 days
Dependencies: None
Deliverables:
- `src/engine/core/crash_handler.zig` - Crash handler module
- Signal handler (SIGSEGV, SIGABRT, SIGILL)
- Crash directory structure: `crashes/<timestamp>/`
- Crash dump files:
  - `stacktrace.txt` - Zig stack trace
  - `last_100_lines.log` - Log buffer capture
  - `state.json` - Game state snapshot
- Crash reporting UI dialog on startup (using existing UISystem)
- Heartbeat system with watchdog thread
Key Technical Details:
- Use `std.os.sigaction` for signal handling
- Async-signal-safe operations in signal handler
- Store last N log lines in circular buffer (alloc-free in crash path)
- Heartbeat file every 5 seconds, watchdog after 15s timeout
- Graceful shutdown before crash dump if possible
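The alloc-free "last N log lines" buffer above can be sketched as a fixed-size ring (shown in TypeScript for brevity; the real implementation would be Zig with pre-allocated storage, and the class name is illustrative):

```typescript
// Sketch of the crash handler's log buffer: the last N lines live in a
// fixed-size ring, so the crash path never allocates.
class CircularLogBuffer {
  private lines: string[];
  private next = 0;  // slot to overwrite next
  private count = 0; // how many slots are filled

  constructor(private capacity: number) {
    this.lines = new Array(capacity);
  }

  push(line: string): void {
    this.lines[this.next] = line;
    this.next = (this.next + 1) % this.capacity;
    this.count = Math.min(this.count + 1, this.capacity);
  }

  // Oldest-to-newest view, e.g. for writing last_100_lines.log.
  snapshot(): string[] {
    const out: string[] = [];
    for (let i = 0; i < this.count; i++) {
      out.push(this.lines[(this.next - this.count + i + this.capacity * 2) % this.capacity]);
    }
    return out;
  }
}
```

The fixed capacity is what makes the crash path safe: `push` only overwrites existing slots, so no allocation can fail inside the signal handler.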
Acceptance Criteria:
- ✅ Intentional crash (`SIGSEGV`) generates complete crash dump
- ✅ Crash reports displayed on next game launch
- ✅ Can view crash details and optionally upload
- ✅ "Continue from safe state" recovers pre-crash world
- ✅ Frozen process (no heartbeat) triggers graceful shutdown
Phase 2: Screenshot Capture System
Status: 🟡 Ready to Start (No dependencies)
Duration: 2-3 days
Dependencies: None (can be parallel with Phase 1)
Deliverables:
- `src/engine/graphics/screenshot.zig` - Screenshot capture module
- Vulkan swapchain image readback logic
- STB PNG encoding (leverage existing `libs/stb/stb_image_impl.c`)
- Screenshot organization: `screenshots/session_<timestamp>/<timestamp>.png`
- Capture modes:
  - Manual: `F11` hotkey
  - Console: `screenshot [name]` command
  - Auto-timer: `screenshot start/stop [interval_seconds]`
  - Crash-triggered: capture last frame before crash
Key Technical Details:
- Async capture (queue request, render thread processes next frame)
- `VkImage` → `stbi_write_png` pipeline
- Maximum screenshot directory size (e.g., 500MB) with oldest-first deletion
- Thumbnail generation using `stb_image_resize` (optional)
- Crash hook calls screenshot handler in signal-safe path (if possible)
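The directory size cap can be sketched as an oldest-first pruning pass (TypeScript for illustration; `FileInfo` and `filesToPrune` are hypothetical names, not engine API):

```typescript
// Sketch of the screenshot-directory cap: given file metadata, pick the
// oldest files to delete until the total size fits under maxBytes.
interface FileInfo { path: string; bytes: number; mtimeMs: number; }

function filesToPrune(files: FileInfo[], maxBytes: number): string[] {
  const byAge = [...files].sort((a, b) => a.mtimeMs - b.mtimeMs); // oldest first
  let total = files.reduce((sum, f) => sum + f.bytes, 0);
  const doomed: string[] = [];
  for (const f of byAge) {
    if (total <= maxBytes) break;
    doomed.push(f.path); // delete oldest screenshots first
    total -= f.bytes;
  }
  return doomed;
}
```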
Acceptance Criteria:
- ✅ F11 saves screenshot to correct directory
- ✅ Console command `screenshot debug_view` works
- ✅ Auto-timer `screenshot start 5` captures every 5 seconds
- ✅ Crash-triggered screenshot captured (last frame)
- ✅ Screenshots are valid PNG files
- ✅ Directory size limit respected
Phase 3: Debug Console
Status: 🟢 Ready to Start (Depends on Phase 2)
Duration: 2-3 days
Dependencies: Phase 2 (screenshot command)
Deliverables:
- `src/engine/ui/debug_console.zig` - Console UI overlay
- `src/engine/core/command_registry.zig` - Command plugin system
- Toggle key: `` ` `` (backtick)
- Console features:
- Text input field
- Command history (up/down arrows)
- Command autocomplete (tab completion)
- Output buffer with scroll
- Core commands:

  ```
  help [command]        - Show command list or details
  screenshot [name]     - Take screenshot (Phase 2)
  stats                 - Print frame metrics
  teleport <x> <y> <z>  - Teleport player
  set_time <0-23>       - Set time of day
  toggle_wireframe      - Toggle wireframe mode
  toggle_debug_shadows  - Toggle shadow debug visualization
  save_state [name]     - Save state snapshot (Phase 7)
  load_state [name]     - Load state snapshot (Phase 7)
  crash                 - Trigger intentional crash (Phase 1)
  ```

- Command registration API for extensibility
Key Technical Details:
- Leverage existing `UISystem` for rendering
- Input handling via existing `Input` system
- Command parsing: `command arg1 arg2 --flag value`
- Autocomplete: fuzzy match on registered commands
- Command context: access to `EngineContext` for game state
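The `command arg1 arg2 --flag value` grammar can be sketched as a small tokenizer (TypeScript for illustration; the console itself would be Zig, and the types here are assumptions):

```typescript
// Sketch of the console command grammar: positional args plus `--flag value`
// pairs; a trailing bare `--flag` becomes a boolean.
interface ParsedCommand {
  name: string;
  args: string[];
  flags: Record<string, string | boolean>;
}

function parseCommand(input: string): ParsedCommand {
  const tokens = input.trim().split(/\s+/).filter(t => t.length > 0);
  const name = tokens.shift() ?? "";
  const args: string[] = [];
  const flags: Record<string, string | boolean> = {};
  for (let i = 0; i < tokens.length; i++) {
    const t = tokens[i];
    if (t.startsWith("--")) {
      const next = tokens[i + 1];
      if (next !== undefined && !next.startsWith("--")) {
        flags[t.slice(2)] = next; // --flag value
        i++;
      } else {
        flags[t.slice(2)] = true; // bare --flag
      }
    } else {
      args.push(t);
    }
  }
  return { name, args, flags };
}
```

For example, `parseCommand("teleport 0 100 0")` yields `{ name: "teleport", args: ["0", "100", "0"], flags: {} }`.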
Acceptance Criteria:
- ✅ Backtick opens/closes console
- ✅ Can type commands and see output
- ✅ Arrow keys navigate history
- ✅ Tab completes commands
- ✅ All core commands work correctly
- ✅ Custom commands can be registered
Phase 4: Remote HTTP/WebSocket API
Status: 🔴 Critical Path (Depends on Phase 3)
Duration: 4-5 days
Dependencies: Phase 3 (console integration), Phase 2 (screenshot endpoint)
Deliverables:
- `src/engine/remote/api_server.zig` - HTTP server using `std.http.Server`
- `src/engine/remote/websocket.zig` - WebSocket implementation
- `src/engine/remote/protocol.zig` - Request/response type definitions
- REST API endpoints:

  ```
  GET  /api/status               - Server status, version, uptime
  GET  /api/stats                - Current metrics (Phase 5)
  GET  /api/screenshot           - Get last screenshot path/URL
  POST /api/screenshot           - Trigger screenshot
  POST /api/input/inject         - Inject input event (Phase 6)
  POST /api/console/execute      - Execute console command
  GET  /api/state                - Get current game state
  POST /api/state/snapshot       - Create state snapshot (Phase 7)
  POST /api/state/restore        - Restore state snapshot (Phase 7)
  GET  /api/crashes              - List crash dumps (Phase 1)
  GET  /api/logs?level=X&lines=N - Fetch log lines
  GET  /api/metrics              - Get metrics history (Phase 5)
  ```

- WebSocket endpoint `/api/ws`:
  - Subscribe to telemetry stream (FPS, frame time, memory, entities)
  - Bidirectional commands
  - Heartbeat ping/pong
- Security:
  - Configurable API key via `Authorization` header
  - Default: localhost-only mode
  - Rate limiting per IP
Key Technical Details:
- HTTP server runs on background thread (non-blocking)
- Use `std.http.Server` (Zig stdlib, no external deps)
- WebSocket frame parsing (RFC 6455)
- JSON serialization via custom writer (avoid heavy JSON lib)
- CORS headers for cross-origin requests
- Graceful shutdown: flush pending requests before exit
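The per-IP rate limiting above can be sketched as a fixed-window counter (TypeScript for illustration; the real server would be Zig, and the limit/window values are assumptions):

```typescript
// Sketch of per-IP rate limiting for the debug API: a fixed-window counter.
// When a window expires, the counter for that IP resets.
class RateLimiter {
  private windows = new Map<string, { start: number; count: number }>();

  constructor(private limit: number, private windowMs: number) {}

  // Returns true if the request from `ip` at time `nowMs` is allowed.
  allow(ip: string, nowMs: number): boolean {
    const w = this.windows.get(ip);
    if (!w || nowMs - w.start >= this.windowMs) {
      this.windows.set(ip, { start: nowMs, count: 1 }); // fresh window
      return true;
    }
    w.count++;
    return w.count <= this.limit;
  }
}
```

Taking the clock as a parameter keeps the limiter deterministic and easy to test; the server would pass the request arrival time.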
Settings Integration:

```zig
// src/game/settings.zig
remote_debug: struct {
    enabled: bool = false,
    port: u16 = 8080,
    api_key: []const u8 = "",
    allow_remote: bool = false, // localhost only by default
}
```

Acceptance Criteria:
- ✅ Server starts on configured port
- ✅ All REST endpoints return valid JSON
- ✅ WebSocket connects and receives telemetry
- ✅ CORS headers present for browser clients
- ✅ API key validation works
- ✅ Rate limiting blocks abusive requests
- ✅ Server gracefully shuts down on game quit
Phase 5: Metrics & Telemetry
Status: 🟢 Ready to Start (No blocking dependencies)
Duration: 2-3 days
Dependencies: None (can be parallel with Phases 1-3)
Deliverables:
- `src/engine/core/metrics.zig` - Metrics collection system
- Collected metrics:
- Frame time: min, max, avg (last 60 frames)
- FPS: current, 1s avg, 10s avg
- Render pass timings (shadow, GPass, SSAO, etc.)
- Memory: allocated, freed, current usage (heap allocator stats)
- World: loaded chunks, meshing queue size, entities count
- Graphics: draw calls, triangles, texture memory
- Job system: active jobs, thread utilization
- Export formats:
- JSON: Single snapshot
- CSV: Time series history
- Live stream: WebSocket (Phase 4)
- Console commands:

  ```
  metrics            - Print current metrics
  metrics export csv - Export to file
  metrics history N  - Print last N frames
  ```
Key Technical Details:
- Rolling buffer for history (configurable size, e.g., 3600 frames = 1 min at 60fps)
- Update metrics in `App.runSingleFrame()` after all systems have updated
- Frame timing via `std.time.nanoTimestamp`
- Memory stats via `std.heap.GeneralPurposeAllocator` in debug mode
- Zero-allocation hot path (use pre-allocated buffers)
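The rolling min/max/avg window can be sketched with a pre-allocated ring buffer, mirroring the zero-allocation hot path (TypeScript for illustration; the engine code would be Zig):

```typescript
// Sketch of the rolling frame-time window: min/max/avg over the last N
// frames, stored in a ring buffer that is allocated once up front.
class FrameStats {
  private samples: number[];
  private next = 0;
  private filled = 0;

  constructor(private window: number) {
    this.samples = new Array(window).fill(0);
  }

  record(frameTimeMs: number): void {
    this.samples[this.next] = frameTimeMs;
    this.next = (this.next + 1) % this.window;
    this.filled = Math.min(this.filled + 1, this.window);
  }

  // Order doesn't matter for min/max/avg, so we scan all filled slots.
  summary(): { min: number; max: number; avg: number } {
    if (this.filled === 0) return { min: 0, max: 0, avg: 0 };
    let min = Infinity, max = -Infinity, sum = 0;
    for (let i = 0; i < this.filled; i++) {
      const s = this.samples[i];
      min = Math.min(min, s);
      max = Math.max(max, s);
      sum += s;
    }
    return { min, max, avg: sum / this.filled };
  }
}
```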
Acceptance Criteria:
- ✅ Metrics updated every frame with minimal overhead
- ✅ Console `metrics` prints readable output
- ✅ CSV export works: `metrics export stats.csv`
- ✅ WebSocket telemetry stream updates at 10Hz
- ✅ Metrics accurate (verified vs manual measurements)
Phase 6: Input Injection
Status: 🟢 Ready to Start (Depends on Phase 4)
Duration: 3-4 days
Dependencies: Phase 4 (API endpoint needs injection system)
Deliverables:
- `src/engine/input/injector.zig` - Input queue and injection system
- Input injection types:

  ```jsonc
  // Key press
  { "type": "key", "key": "W|A|S|D|Space|Shift|F1|Escape|...", "action": "press|release|tap" }

  // Mouse movement
  { "type": "mouse_move", "x": 640, "y": 480 }

  // Mouse button
  { "type": "mouse_button", "button": "left|right|middle", "action": "press|release|click" }

  // Mouse scroll
  { "type": "mouse_scroll", "scroll_y": 1.0 }
  ```
- Thread-safe input queue (producer: API/consumer: game loop)
- Input recording system:
- Record real input to JSON
- Playback recorded sequences
- Speed control (0.5x, 1x, 2x, etc.)
- Console command: `input_playback <file.json> [speed]`
Key Technical Details:
- Queue uses `std.ArrayList` protected by `std.Thread.Mutex`
- Game thread processes queue in `Input.pollEvents()` (after SDL events)
- Injected events marked as synthetic (for debugging/avoiding feedback loops)
- Input validation: reject unknown keys, out-of-bounds positions
- Timestamp-based playback (maintain original timing)
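Event validation and timestamp-based playback can be sketched as follows (TypeScript for illustration; the key table, event shape, and window bounds are simplified assumptions, not the engine's real tables):

```typescript
// Sketch of injected-event validation (reject unknown keys and out-of-bounds
// positions) plus speed-scaled playback scheduling.
const KNOWN_KEYS = new Set(["W", "A", "S", "D", "Space", "Shift", "F1", "Escape"]);

interface InputEvent {
  type: "key" | "mouse_move";
  key?: string;
  x?: number;
  y?: number;
  timestampMs: number;
}

function validateEvent(e: InputEvent, width: number, height: number): boolean {
  if (e.type === "key") return e.key !== undefined && KNOWN_KEYS.has(e.key);
  if (e.type === "mouse_move") {
    return e.x !== undefined && e.y !== undefined &&
           e.x >= 0 && e.x < width && e.y >= 0 && e.y < height;
  }
  return false;
}

// Delay (ms) before each event relative to playback start, scaled by speed
// (2.0 = twice as fast). Preserves the recording's relative timing.
function playbackDelays(events: InputEvent[], speed: number): number[] {
  if (events.length === 0) return [];
  const t0 = events[0].timestampMs;
  return events.map(e => (e.timestampMs - t0) / speed);
}
```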
Acceptance Criteria:
- ✅ API `POST /api/input/inject` successfully presses keys
- ✅ Multiple input events processed in correct order
- ✅ Recording creates valid JSON
- ✅ Playback reproduces recorded input
- ✅ Speed control changes playback rate
- ✅ Queue doesn't cause frame drops
Phase 7: State Snapshots
Status: 🟢 Ready to Start (Depends on Phase 4)
Duration: 3-4 days
Dependencies: Phase 4 (API endpoint)
Deliverables:
- `src/game/snapshot.zig` - State serialization system
- Snapshot contents:
- Player: position, rotation, velocity, inventory
- World: seed, loaded chunks (positions), modified blocks
- Time: game time, day/night cycle
- Settings: graphics options, input bindings
- Session metadata: timestamp, version, frame count
- Snapshot format:
  - Binary: compact, fast write/read
  - Optional JSON: human-readable (for debugging)
  - Compression: `std.compress.zlib` (optional)
- Snapshot management:
  - List: `snapshots/` directory scan
  - Delete: `rm snapshots/<name>.snap`
  - Compare: `diff snapshots/A.snap B.snap`
  - Auto-snapshot on crash (Phase 1)
- Console commands:

  ```
  save_state [name] - Save snapshot
  load_state [name] - Load snapshot
  list_states       - List snapshots
  ```

- API endpoints:

  ```
  POST /api/state/snapshot - Create snapshot
  POST /api/state/restore  - Restore snapshot
  GET  /api/state/list     - List snapshots
  ```
Key Technical Details:
- Serialization: `std.io.BufferedWriter` + custom binary format
- Chunk data: only store modified blocks (diff from seed)
- Validate snapshot on load (checksum via `std.hash.Fnv1a`)
- Loading: hot-reload world chunks, recreate player state
- Fallback: if load fails, continue with current state (don't crash)
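The checksum validation can be sketched with 32-bit FNV-1a, using the standard offset basis and prime (TypeScript for illustration; the Zig side would use `std.hash.Fnv1a`, and `snapshotValid` is a hypothetical helper):

```typescript
// Sketch of the snapshot checksum: 32-bit FNV-1a over the snapshot bytes.
function fnv1a32(data: Uint8Array): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (const byte of data) {
    hash ^= byte;
    hash = Math.imul(hash, 0x01000193); // FNV prime, wraps at 32 bits
  }
  return hash >>> 0;
}

// On load: recompute and compare against the stored checksum. On mismatch,
// fall back to the current state instead of crashing.
function snapshotValid(payload: Uint8Array, storedChecksum: number): boolean {
  return fnv1a32(payload) === storedChecksum;
}
```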
Acceptance Criteria:
- ✅ `save_state test` creates valid snapshot
- ✅ `load_state test` restores state exactly
- ✅ Snapshots capture modified terrain
- ✅ Load fails gracefully on corrupted snapshot
- ✅ API endpoint creates and loads snapshots
- ✅ Crash auto-snapshot captures pre-crash state
Phase 8: OpenCode MCP Integration
Status: 🟡 Ready to Start (Depends on Phases 4, 6, 7)
Duration: 3-4 days
Dependencies: Phase 4 (API), Phase 6 (input), Phase 7 (snapshots)
Deliverables:
- `opencode/mcp-zigcraft/` - MCP server implementation (Bun/TypeScript)
- MCP tools (following OpenCode tool format):
  ```ts
  // tool/take-screenshot.ts
  { description: "Capture game screenshot",
    args: { name: string }, returns: { path: string, url: string } }

  // tool/inject-input.ts
  { description: "Inject keyboard/mouse input",
    args: { type: string, key?: string, x?: number, ... },
    returns: { success: boolean } }

  // tool/query-state.ts
  { description: "Get current game state",
    args: { }, returns: { player: {...}, world: {...}, time: ... } }

  // tool/console-command.ts
  { description: "Execute debug console command",
    args: { command: string }, returns: { output: string, success: boolean } }

  // tool/get-metrics.ts
  { description: "Get performance metrics",
    args: { format: "json|csv" }, returns: { metrics: {...} } }

  // tool/list-crashes.ts
  { description: "List recent crash reports",
    args: { }, returns: { crashes: [...] } }

  // tool/save-snapshot.ts
  { description: "Create state snapshot",
    args: { name: string }, returns: { path: string } }

  // tool/load-snapshot.ts
  { description: "Load state snapshot",
    args: { name: string }, returns: { success: boolean } }
  ```
- MCP server implementation:
  - Use `@opencode-ai/plugin` SDK
  - Connect to game via HTTP/WebSocket (Phase 4)
  - Error handling and retry logic
  - Discovery of game instance (configurable host/port)
File Structure:
```
opencode/mcp-zigcraft/
├── index.ts              # MCP server entry point
├── client.ts             # HTTP/WebSocket client to game
├── types.ts              # TypeScript types for API
└── tool/
    ├── take-screenshot.ts
    ├── inject-input.ts
    ├── query-state.ts
    ├── console-command.ts
    ├── get-metrics.ts
    ├── list-crashes.ts
    ├── save-snapshot.ts
    └── load-snapshot.ts
```
Configuration:
```jsonc
// opencode.json
{
  "mcp": {
    "zigcraft": {
      "type": "local",
      "command": ["bun", "run", "opencode/mcp-zigcraft/index.ts"],
      "env": {
        "ZIGCRAFT_API_HOST": "http://localhost:8080"
      },
      "enabled": true
    }
  }
}
```

Acceptance Criteria:
- ✅ MCP server starts without errors
- ✅ All 8 tools are discoverable by OpenCode
- ✅ Tools successfully call game API
- ✅ Error responses handled gracefully
- ✅ Retry logic handles game restart
- ✅ AI agent can use tools end-to-end
Phase 9: CI/CD Integration
Status: 🟢 Final Phase (Depends on all previous)
Duration: 2-3 days
Dependencies: All phases (especially Phase 2, 4, 7)
Deliverables:
- Visual regression test workflow
- Crash recovery test workflow
- API integration test suite
- Headless screenshot automation
Key Components:
A. Headless Screenshot Testing:
- Build flag: `-Dheadless=true` (render to offscreen buffer)
- Automated scenario script:

  ```jsonc
  // test_scenarios/basic_render.json
  {
    "name": "Basic Render Test",
    "steps": [
      { "command": "teleport 0 100 0" },
      { "wait": 60 },               // wait 60 frames
      { "screenshot": "frame_001" },
      { "command": "set_time 12" },
      { "wait": 60 },
      { "screenshot": "frame_002" }
    ]
  }
  ```

- Compare screenshots against baseline (`assets/baselines/`)
- Fail CI if pixel difference > threshold (e.g., 1%)
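The baseline comparison can be sketched as a per-pixel diff ratio over same-size RGBA buffers (TypeScript sketch; the function names, per-channel tolerance, and 1% threshold default are illustrative):

```typescript
// Sketch of the visual-regression check: fraction of pixels whose RGBA
// channels differ from the baseline by more than `tolerance`.
function pixelDiffRatio(a: Uint8Array, b: Uint8Array, tolerance = 0): number {
  if (a.length !== b.length || a.length % 4 !== 0) {
    throw new Error("buffers must be same-size RGBA");
  }
  const pixels = a.length / 4;
  let differing = 0;
  for (let p = 0; p < pixels; p++) {
    for (let c = 0; c < 4; c++) {
      if (Math.abs(a[p * 4 + c] - b[p * 4 + c]) > tolerance) {
        differing++;
        break; // count each pixel at most once
      }
    }
  }
  return pixels === 0 ? 0 : differing / pixels;
}

// CI gate: fail when the differing-pixel fraction exceeds the threshold.
function regressionFails(a: Uint8Array, b: Uint8Array, threshold = 0.01): boolean {
  return pixelDiffRatio(a, b) > threshold;
}
```

A small per-channel tolerance absorbs harmless driver-level rounding differences between machines; the threshold then catches genuinely changed output.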
B. Crash Recovery Testing:
- Test script that triggers crashes:

  ```jsonc
  // test_scenarios/crash_test.json
  {
    "steps": [
      { "command": "crash" },
      { "restart": true },                  // auto-restart game
      { "verify": "crash_report_exists" },
      { "load_snapshot": "auto_recovery" },
      { "verify": "state_restored" }
    ]
  }
  ```

- Verify crash handler creates dump
- Verify auto-recovery loads correct state
C. API Integration Tests:
- Node.js test suite using `supertest` or `axios`
- Test all REST endpoints with valid/invalid inputs
- Test WebSocket connection and telemetry
- Test rate limiting and auth
GitHub Workflow Addition:

```yaml
# .github/workflows/debugging-tools.yml
name: Debugging Tools Test
on: [push, pull_request]
jobs:
  visual-regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Nix
        uses: DeterminateSystems/nix-installer-action@v16
      - name: Run screenshot tests
        run: nix develop --command zig build test-screenshots
  crash-recovery:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Nix
        uses: DeterminateSystems/nix-installer-action@v16
      - name: Run crash tests
        run: nix develop --command zig build test-crash-recovery
  api-integration:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Nix
        uses: DeterminateSystems/nix-installer-action@v16
      - name: Run API tests
        run: |
          nix develop --command zig build run -Dremote_debug=true &
          sleep 10
          bun run opencode/tests/api.ts
```

Acceptance Criteria:
- ✅ Visual regression passes on main branch
- ✅ Intentional rendering changes trigger test failure
- ✅ Crash recovery test passes
- ✅ API integration test suite passes (all endpoints)
- ✅ Tests run in CI without display server
Parallelization Strategy
This epic is organized to maximize parallel development while respecting dependencies:
Track A (Core Infrastructure) - Can run in parallel
- Phase 1: Crash Detection - Independent
- Phase 2: Screenshots - Independent
- Phase 5: Metrics - Independent
These three phases have no dependencies on each other and can be developed simultaneously by different developers.
Track B (User Interfaces) - Depends on Track A
- Phase 3: Debug Console - Depends on Phase 2 (screenshot command)
- Phase 4: Remote API - Depends on Phase 3 (console integration), Phase 2 (screenshot endpoint)
- Phase 6: Input Injection - Depends on Phase 4 (API endpoint)
- Phase 7: State Snapshots - Depends on Phase 4 (API endpoint)
These phases build on Track A but can proceed in parallel once dependencies are met:
- Phase 4 can start once Phase 2 is complete
- Phase 6 can start once Phase 4 is partially complete (just need API server)
- Phase 7 can start once Phase 4 is partially complete (just need API server)
Track C (Integration) - Depends on Track B
- Phase 8: MCP Integration - Depends on Phase 4, 6, 7 (needs stable API)
- Phase 9: CI/CD - Depends on all previous phases
Critical Path (Sequential)
The minimum sequential work required:
- Phase 2 (Screenshots) → Phase 3 (Console) → Phase 4 (API) → Phase 8 (MCP)
- Phase 1 (Crash) can happen anytime
- Phase 5 (Metrics) can happen anytime
- Phase 6/7 can start once Phase 4 is ready
- Phase 9 requires everything
Recommended Parallel Work:
| Week | Developer A | Developer B | Developer C |
|---|---|---|---|
| 1 | Phase 1 (Crash) | Phase 2 (Screenshots) | Phase 5 (Metrics) |
| 2 | Phase 3 (Console) | Phase 4 (API) | Phase 5 (Metrics) |
| 3 | Phase 4 (API) | Phase 6 (Input) | Phase 7 (Snapshots) |
| 4 | Phase 8 (MCP) | Phase 7 (Snapshots) | Documentation |
| 5 | Phase 9 (CI/CD) | Testing | Testing |
Merge Strategy
This epic will be merged in 3 major milestones to reduce integration risk and provide intermediate value:
Milestone 1: Core Diagnostics (Week 1-2)
Merge Target: Feature branch debug-diagnostics
Includes:
- Phase 1: Crash Detection & Reporting
- Phase 2: Screenshot Capture System
- Phase 5: Metrics & Telemetry
Value:
- Crash dumps now captured automatically
- Developers can take screenshots
- Metrics available via console
Branch: feature/debug-core-diagnostics
Milestone 2: Remote Control (Week 3-4)
Merge Target: Feature branch debug-remote-control
Includes:
- Phase 3: Debug Console
- Phase 4: Remote HTTP/WebSocket API
- Phase 6: Input Injection
- Phase 7: State Snapshots
Value:
- Full remote debugging capability
- AI agents can now interact with game
- Reproducible debugging via snapshots
Branch: feature/debug-remote-control
Depends on: Milestone 1 (merged to main)
Milestone 3: AI Integration & CI (Week 5)
Merge Target: Feature branch debug-ai-integration
Includes:
- Phase 8: OpenCode MCP Integration
- Phase 9: CI/CD Integration
Value:
- Full AI agent debugging workflow
- Automated visual regression testing
- Crash recovery verification
Branch: feature/debug-ai-integration
Depends on: Milestone 2 (merged to main)
File Structure Summary
New Files:
```
src/
├── engine/
│   ├── core/
│   │   ├── crash_handler.zig      [Phase 1]
│   │   ├── metrics.zig            [Phase 5]
│   │   └── command_registry.zig   [Phase 3]
│   ├── graphics/
│   │   └── screenshot.zig         [Phase 2]
│   ├── input/
│   │   └── injector.zig           [Phase 6]
│   ├── remote/
│   │   ├── api_server.zig         [Phase 4]
│   │   ├── websocket.zig          [Phase 4]
│   │   └── protocol.zig           [Phase 4]
│   └── ui/
│       └── debug_console.zig      [Phase 3]
└── game/
    └── snapshot.zig               [Phase 7]

opencode/
└── mcp-zigcraft/                  [Phase 8]
    ├── index.ts
    ├── client.ts
    ├── types.ts
    └── tool/
        ├── take-screenshot.ts
        ├── inject-input.ts
        ├── query-state.ts
        ├── console-command.ts
        ├── get-metrics.ts
        ├── list-crashes.ts
        ├── save-snapshot.ts
        └── load-snapshot.ts

test_scenarios/                    [Phase 9]
├── basic_render.json
└── crash_test.json

assets/baselines/                  [Phase 9]
└── screenshots/
    ├── frame_001.png
    └── frame_002.png
```
Modified Files:
```
src/engine/core/log.zig        [Phase 1 - add crash logger]
src/game/app.zig               [All phases - integrate systems]
src/game/settings.zig          [Phase 4 - add remote_debug config]
src/game/screens/world.zig     [Phase 3 - integrate console]
build.zig                      [Phase 9 - add test targets]
opencode.json                  [Phase 8 - add MCP config]
.github/workflows/build.yml    [Phase 9 - add debug tests]
```
Technical Decisions & Rationale
1. HTTP Server Choice
Decision: Use Zig's `std.http.Server` (no external dependencies)
Rationale:
- Keeps Nix environment clean (no additional packages)
- Already available in stdlib (Zig 0.14)
- Sufficient performance for debugging API
- Avoids the security exposure of third-party HTTP libraries
- Single binary distribution (no runtime deps)
2. Screenshot Format
Decision: PNG via STB (existing `libs/stb/stb_image_write.h`)
Rationale:
- STB already in project for texture loading
- Lossless compression (important for visual regression)
- Easy for AI agents to analyze
- Standard format with broad tooling support
- No additional dependencies
3. WebSocket vs Pure REST
Decision: Both - REST for commands, WebSocket for telemetry
Rationale:
- REST: Simple, stateless, idempotent operations (perfect for commands)
- WebSocket: Real-time bidirectional stream (ideal for telemetry)
- AI can choose optimal interface based on use case
- Reduces HTTP polling overhead for metrics
4. Input Injection: IPC vs In-Process Queue
Decision: In-process queue (no external IPC)
Rationale:
- Simpler implementation (no socket/pipe complexity)
- No threading issues with external processes
- Still works with remote API (API pushes to in-process queue)
- Lower latency for immediate debugging
5. State Snapshot Format
Decision: Binary with optional JSON, optional compression
Rationale:
- Binary: Fast serialization/deserialization, compact size
- JSON: Human-readable for manual inspection (debug tool output)
- Compression: Optional (trade size vs. CPU)
- Custom format (no heavy serialization lib like MessagePack)
6. MCP Implementation Language
Decision: TypeScript/Bun (matches OpenCode config)
Rationale:
- OpenCode already uses Bun/TypeScript for tools
- `@opencode-ai/plugin` SDK designed for TypeScript
- Excellent HTTP/WebSocket libraries in npm ecosystem
- Type safety for API contracts
- Easy to test and debug
Risk Assessment
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| Vulkan screenshot capture fails | High (Phase 2 blocked) | Medium | Fallback to `glReadPixels` if `VkImage` readback fails |
| Signal handler crashes | Critical (system instability) | Low | Use async-signal-safe operations only, test extensively |
| HTTP server blocks render thread | Medium (frame drops) | Low | Run API server on background thread |
| WebSocket connection instability | Medium (agent loses connection) | Medium | Auto-reconnect with exponential backoff |
| Snapshot corruption | High (data loss) | Low | Checksum validation, keep N backups |
| Performance overhead | Medium (degraded gameplay) | Low | Optional enable/disable (default off), profile hot paths |
| CI display requirement | Medium (tests can't run) | High | Use headless Wayland (already in CI) + virtual display |
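The auto-reconnect mitigation in the table can be sketched as an exponential backoff schedule (TypeScript sketch; the base, factor, and cap values are illustrative defaults, not decided parameters):

```typescript
// Sketch of reconnect backoff: delays grow geometrically, capped so a
// long-dead game instance doesn't push waits toward infinity.
function backoffDelaysMs(
  attempts: number,
  baseMs = 500,
  factor = 2,
  capMs = 30_000,
): number[] {
  const delays: number[] = [];
  for (let i = 0; i < attempts; i++) {
    delays.push(Math.min(baseMs * Math.pow(factor, i), capMs));
  }
  return delays;
}
```

The MCP client would sleep for `delays[i]` before reconnect attempt `i`, resetting the counter once a connection succeeds.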
Acceptance Criteria (Epic Level)
Definition of Done:
- ✅ All 9 phases completed and merged in their respective milestones
- ✅ All unit tests pass (`zig build test`)
- ✅ Integration tests pass (`zig build test-integration`)
- ✅ New CI tests pass (visual regression, crash recovery, API tests)
- ✅ AI agent can successfully:
- Take screenshots and analyze them
- Query game state and metrics
- Inject input to control game
- Save/restore snapshots to reproduce bugs
- Execute debug console commands
- ✅ Crash handler successfully captures diagnostics on SIGSEGV
- ✅ Remote API accessible via MCP tools
- ✅ Documentation updated (AGENTS.md, README.md)
- ✅ Code formatted (`zig fmt src/`)
- ✅ No regressions in existing functionality
Success Metrics
Quantitative:
- Crash detection: 100% of crashes captured with stacktrace
- Screenshot capture time: <100ms (non-blocking)
- API response time: <50ms for status/metrics queries
- WebSocket telemetry latency: <10ms
- Snapshot save time: <1s for typical world state
- Snapshot load time: <2s for typical world state
- Memory overhead: <50MB with all debug systems enabled
Qualitative:
- AI agents can diagnose visual bugs without human intervention
- Developers can reproduce bugs from crash reports in <5 minutes
- Remote debugging reduces time to root cause by 50%
- CI catches visual regressions before merge
Open Questions
1. Screenshot resolution: Should screenshots capture at native resolution or match window size? (Recommend: native for better debugging detail)
2. Metrics retention: How long should metrics history be kept? (Recommend: 3600 frames = 1 minute @ 60fps, configurable)
3. API authentication: Should we support token-based auth for remote access? (Recommend: API key header, configurable)
4. Snapshot compression: Enable compression by default? (Recommend: yes for worlds with >100MB of chunk data)
5. WebSocket rate limit: What's a reasonable telemetry update rate? (Recommend: 10Hz, configurable)
Related Issues
None yet (will be created as sub-issues)
Notes for Sub-Issue Creation
When splitting this epic into sub-issues:
- One phase per sub-issue (9 sub-issues total)
- Reference this epic in all sub-issues: `Part of #[EPIC_NUMBER]`
- Include acceptance criteria from this epic in each sub-issue
- Tag sub-issues with milestone labels:
  - `milestone-1-core-diagnostics` (Phases 1, 2, 5)
  - `milestone-2-remote-control` (Phases 3, 4, 6, 7)
  - `milestone-3-ai-integration` (Phases 8, 9)
- Add dependency links between sub-issues based on phase dependencies
- Use this epic's file structure section for reference
Summary
This epic implements a complete debugging infrastructure that transforms ZigCraft from a "black box" to a transparent, observable system. By building crash detection, screenshot capture, remote API, and AI integration in a phased approach, we minimize risk while delivering incremental value at each milestone.
The parallelization strategy allows 3 developers to work simultaneously, and the 3-milestone merge strategy ensures integration risks are caught early. The final result will enable AI agents to effectively debug the game, dramatically reducing the time to identify and fix visual bugs, crashes, and performance issues.