tanoaks14/prodhelp

ProdHelp - Multi-Platform Observability MCP Server

A Model Context Protocol (MCP) server that provides unified access to multiple observability and infrastructure tools with natural language query generation.

Now with 55+ production templates for debugging, incident response, deployment analysis, capacity planning and business metrics.

Supported Platforms

| Platform | Query Language | Use Cases |
|----------|----------------|-----------|
| New Relic | NRQL via NerdGraph | APM metrics, error rates, throughput, infrastructure, deployment analysis |
| Splunk | SPL | Log search, event analysis, error investigation, root cause |
| Kubernetes | kubectl | Pod management, logs, cluster operations |

Key Features

Auto-detect Platform - Automatically routes queries to the appropriate platform based on natural language

55+ Production Templates - Pre-built templates for common production scenarios:

  • 14 Debug templates (current failures, errors, latency)
  • 9 P1 Incident templates (critical metrics, spike analysis)
  • 6 Deployment templates (version comparison, rollback validation)
  • 8 Capacity templates (memory leaks, resource saturation)
  • 7 Business templates (revenue impact, SLA compliance)
  • 11+ Splunk log templates

Execute or Preview - Generate queries for review or execute directly against your systems

Natural Language - Just describe what you want: "deployment comparison for payment-api" or "memory leak detection"

Schema Reference - Built-in documentation for each query language and template

Installation

Quick Setup (Recommended)

Windows (PowerShell):

.\setup.ps1

Windows (CMD):

setup.bat

Linux/macOS:

chmod +x setup.sh
./setup.sh

Manual Setup

# Create virtual environment
python -m venv .venv

# Activate (Windows PowerShell)
.venv\Scripts\Activate.ps1

# Activate (Windows CMD)
.venv\Scripts\activate.bat

# Activate (Linux/macOS)
source .venv/bin/activate

# Install dependencies
pip install -e ".[dev]"

Configuration

Set environment variables for the platforms you want to use:

New Relic

NEW_RELIC_API_KEY=your-api-key
NEW_RELIC_ACCOUNT_ID=your-account-id

Splunk

SPLUNK_HOST=splunk.example.com
SPLUNK_TOKEN=your-token
# Or use username/password
SPLUNK_USERNAME=admin
SPLUNK_PASSWORD=password

Kubernetes

KUBECONFIG=/path/to/.kube/config
KUBE_CONTEXT=my-cluster
KUBE_NAMESPACE=default

Custom APIs

# Path to custom API definitions
CUSTOM_APIS_PATH=apis/custom_apis.json

# API-specific tokens (used by custom_apis.json)
GITHUB_TOKEN=ghp_xxxxxxxxxxxx
PAGERDUTY_API_KEY=u+xxxxxxxxxxxxxxxx
DATADOG_API_KEY=xxxxxxxxxxxxxxxxxxxxx
DATADOG_APP_KEY=xxxxxxxxxxxxxxxxxxxxx
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/xxx/xxx/xxx

You can also create a .env file in the project root. Copy .env.example to get started:

cp .env.example .env
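
As a rough illustration of how these variables could be checked before starting the server, the sketch below reports which platforms look configured. The function name `configured_platforms` and the rule for what counts as "configured" are assumptions made for this example; the project may validate its settings differently (kubectl, for instance, can fall back to its default kubeconfig location when KUBECONFIG is unset).

```python
from typing import Mapping


def configured_platforms(env: Mapping[str, str]) -> list[str]:
    """Return the platforms whose variables appear to be set (hypothetical helper)."""

    def has(*names: str) -> bool:
        # A variable counts as set only if it is present and non-empty.
        return all(env.get(name) for name in names)

    platforms = []
    if has("NEW_RELIC_API_KEY", "NEW_RELIC_ACCOUNT_ID"):
        platforms.append("newrelic")
    # Splunk needs a host plus either a token or a username/password pair.
    if has("SPLUNK_HOST") and (
        has("SPLUNK_TOKEN") or has("SPLUNK_USERNAME", "SPLUNK_PASSWORD")
    ):
        platforms.append("splunk")
    # Assumption: treat an explicit KUBECONFIG as the signal for Kubernetes.
    if has("KUBECONFIG"):
        platforms.append("kubectl")
    return platforms
```

Calling `configured_platforms(os.environ)` after `import os` checks the current process environment.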

Usage

Running the MCP Server

# Run directly
python -m src.server

# Or using the MCP CLI
mcp run src/server.py

The server will start and listen for MCP client connections. Use it with Claude Desktop, Cline, or any MCP-compatible client.

MCP Client Configuration

Add to your MCP client configuration (e.g., Claude Desktop config file):

{
  "mcpServers": {
    "prodhelp": {
      "command": "python",
      "args": ["-m", "src.server"],
      "cwd": "d:/apps/prodhelp",
      "env": {
        "NEW_RELIC_API_KEY": "your-key",
        "NEW_RELIC_ACCOUNT_ID": "your-account",
        "SPLUNK_HOST": "splunk.example.com",
        "SPLUNK_TOKEN": "your-token"
      }
    }
  }
}

Real-World Usage Examples

Scenario 1: Just deployed and want to validate

User: "deployment comparison for payment-api before 14:30 and after 14:30"

Server analyzes:
- Before: 0.5% error rate, 200ms latency
- After: 5.0% error rate, 450ms latency
- Conclusion: Deployment caused issues, consider rollback

Scenario 2: P1 alert fired for high error rate

User: "comprehensive debug for payment-api"
Then: "error reasons for payment-api"
Then: "failed endpoints for payment-api"

Result: Found that /api/v1/checkout endpoint has 95% errors
Root cause identified in 2 minutes

Scenario 3: Application getting slower over time

User: "memory leak detection"

Server shows:
- payment-api-01: Memory growing 0.8% per minute
- payment-api-02: Memory growing 0.9% per minute
- Memory leak detected, restart recommended

Scenario 4: Need to show business impact to executives

User: "revenue impact for payment-api with 150 avg transaction"

Server calculates:
- 500 failed transactions in last hour
- Estimated revenue loss: $75,000
- Critical: Escalate immediately

Common Workflows

Daily Health Check (2 minutes):

1. newrelic: comprehensive debug for <your-app>
2. newrelic: critical transactions for <your-app>
3. newrelic: resource saturation

After Every Deployment (30 seconds):

1. newrelic: deployment comparison for <app> before <time> and after <time>
2. newrelic: version errors for <app>

P1 Incident Response (5-10 minutes):

1. newrelic: comprehensive debug for <app>
2. newrelic: current failures for <app>
3. newrelic: error reasons for <app>
4. splunk: p1 root cause in production
5. newrelic: failed endpoints for <app>

Capacity Planning (5 minutes):

1. newrelic: capacity forecast for <app>
2. newrelic: resource saturation
3. newrelic: memory leak detection
4. newrelic: connection pool status for <app>

Tools

query

Universal query tool with automatic platform detection.

query(
    text="What's the error rate for my-service?",
    execute=False  # Set True to execute
)

newrelic_query

Generate or execute New Relic NRQL queries.

newrelic_query(
    intent="error rate",
    app_name="my-service",
    time_range="1 hour ago"
)
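
To give a feel for what "generate" might produce, here is a minimal sketch that turns the arguments above into an NRQL string. The helper name `error_rate_nrql` is hypothetical, and the query text is one common way to express an error rate in NRQL, not necessarily the template this server ships with.

```python
def error_rate_nrql(app_name: str, time_range: str = "1 hour ago") -> str:
    """Build an NRQL error-rate query for one application (illustrative only)."""
    if "'" in app_name or "\\" in app_name:
        # Refuse unusual names rather than guess at NRQL escaping rules.
        raise ValueError("app_name must not contain quotes or backslashes")
    return (
        "SELECT percentage(count(*), WHERE error IS true) "
        f"FROM Transaction WHERE appName = '{app_name}' SINCE {time_range}"
    )
```

The sketch does not validate `time_range`; a real implementation would want to restrict it to known forms before interpolating it into a query.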

splunk_query

Generate or execute Splunk SPL queries.

splunk_query(
    intent="top errors",
    index="production",
    time_range="-24h"
)

kubectl_command

Generate or execute kubectl commands.

kubectl_command(
    intent="get logs",
    namespace="production",
    resource_name="my-pod"
)
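
The sketch below shows one way an intent could be mapped to a kubectl argument list. The function `build_kubectl_args` and its keyword matching are assumptions for illustration; only the resulting `kubectl logs` and `kubectl get pods` invocations are standard kubectl usage.

```python
def build_kubectl_args(intent: str, namespace: str, resource_name: str = "") -> list[str]:
    """Translate a small set of intents into a kubectl argument list (hypothetical)."""
    lowered = intent.lower()
    if "log" in lowered:
        if not resource_name:
            raise ValueError("a pod name is required to fetch logs")
        args = ["kubectl", "logs", resource_name]
    elif "pod" in lowered:
        args = ["kubectl", "get", "pods"]
    else:
        raise ValueError(f"unsupported intent: {intent!r}")
    # Scope every command to the requested namespace.
    return args + ["--namespace", namespace]
```

Returning a list rather than a single string means the command can be handed to `subprocess.run` without a shell, which sidesteps quoting problems in pod or namespace names.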

list_templates

View available query templates.

list_templates(platform="newrelic")  # or "splunk", "kubectl", or None for all

get_schema

Get reference documentation for a query language.

get_schema(platform="splunk")

Custom APIs

Define any REST API in JSON format and it automatically becomes an MCP tool.

Adding a Custom API

  1. Edit apis/custom_apis.json
  2. Add your API definition with endpoints
  3. Call reload_custom_apis() or restart the server

Custom API Tools

# List available custom APIs
list_custom_apis()

# Get endpoint details with sample request/response
get_api_endpoint_info(api_id="github_issues", endpoint_id="list_issues")

# Call an API
call_custom_api(
    api_id="github_issues",
    endpoint_id="list_issues",
    params={"owner": "microsoft", "repo": "vscode", "state": "open"}
)

# Reload after editing custom_apis.json
reload_custom_apis()

Custom API JSON Format

{
  "apis": [
    {
      "id": "my_api",
      "name": "My API",
      "description": "Description here",
      "enabled": true,
      "base_url": "https://api.example.com",
      "auth": {
        "type": "bearer",
        "token_env": "MY_API_TOKEN"
      },
      "endpoints": [
        {
          "id": "get_data",
          "name": "Get Data",
          "method": "GET",
          "path": "/data/{id}",
          "parameters": {
            "path": {
              "id": {"type": "string", "required": true}
            },
            "query": {
              "limit": {"type": "integer", "default": 10}
            }
          },
          "sample_request": {"id": "123", "limit": 5},
          "sample_response": {"status": 200, "body": {...}}
        }
      ]
    }
  ]
}

Supported Auth Types:

  • none - No authentication
  • bearer - Bearer token from env var
  • basic - Username/password from env vars
  • header - Custom header with template
  • headers - Multiple custom headers
  • api_key - API key as query parameter
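
To make the JSON format more concrete, the following sketch resolves an endpoint definition and caller parameters into a method, URL, and headers. The function `build_request` is hypothetical and handles only the `bearer` and `none` auth types; how the server itself interprets these definitions is not shown in this README.

```python
from urllib.parse import quote, urlencode


def build_request(api: dict, endpoint: dict, params: dict, env: dict) -> tuple[str, str, dict]:
    """Resolve path/query parameters and auth for one endpoint (illustrative only)."""
    spec = endpoint.get("parameters", {})

    # Substitute {name} placeholders in the path; this sketch requires all of them.
    path = endpoint["path"]
    for name in spec.get("path", {}):
        if name not in params:
            raise ValueError(f"missing path parameter: {name}")
        path = path.replace("{" + name + "}", quote(str(params[name]), safe=""))

    # Use the caller's value for each query parameter, else its declared default.
    query = {}
    for name, rule in spec.get("query", {}).items():
        if name in params:
            query[name] = params[name]
        elif "default" in rule:
            query[name] = rule["default"]

    url = api["base_url"].rstrip("/") + path
    if query:
        url += "?" + urlencode(query)

    headers = {}
    auth = api.get("auth", {"type": "none"})
    if auth.get("type") == "bearer":
        # The token is read from the environment variable named by token_env.
        headers["Authorization"] = f"Bearer {env[auth['token_env']]}"
    return endpoint["method"], url, headers
```

With the `my_api` definition above and `{"id": "123", "limit": 5}`, this yields a GET request to `https://api.example.com/data/123?limit=5`.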

Example Queries

The server now includes 55+ production-ready templates across debugging, incident response, deployment analysis, capacity planning, and business metrics.

Quick Start Examples

Just deployed? Check the impact:

newrelic: deployment comparison for payment-api before 14:30 and after 14:30
newrelic: version errors for payment-api
newrelic: rollback validation for payment-api

Got a P1 incident? Start here:

newrelic: comprehensive debug for payment-api
newrelic: current failures for payment-api
newrelic: error reasons for payment-api
splunk: p1 root cause in production

Memory or capacity issues:

newrelic: memory leak detection
newrelic: connection pool status for payment-api
newrelic: capacity forecast for payment-api
splunk: memory leak logs

Need to show business impact:

newrelic: revenue impact for payment-api with 150 avg transaction
newrelic: checkout funnel for ecommerce-app
newrelic: sla compliance for payment-api with 1000ms threshold

Debug Templates (14 templates)

These help you debug production issues quickly - usually within 5 minutes you'll know what's happening.

Current state snapshot:

newrelic: current failures for payment-api
# Returns: failure count, failure rate %, P50/P95/P99 latency

Find what's broken:

newrelic: failed endpoints for payment-api
# Shows which endpoints have errors, sorted by worst first

Understand the errors:

newrelic: error types for payment-api
newrelic: error reasons for payment-api
# Get error classes, messages, HTTP status codes

Check performance:

newrelic: latency metrics for payment-api
newrelic: endpoint performance for payment-api
# P50/P95/P99 latency per endpoint with timeline

See overall health:

newrelic: availability for payment-api
newrelic: debug all for payment-api
# Uptime %, success rate, comprehensive metrics

Check dependencies:

newrelic: service calls for payment-api
newrelic: database queries for payment-api
# External service health, slow database queries

P1 Incident Templates (9 templates)

For critical production incidents - these get you to root cause in under 10 mins.

Critical metrics dashboard:

newrelic: critical metrics for payment-api
# Error rate, throughput, latency, apdex - all in one

Error spike investigation:

newrelic: error spike analysis for payment-api
splunk: p1 error timeline in production
# When errors started, error types, timeline

Find affected services:

newrelic: affected endpoints for payment-api
splunk: p1 affected services in production
# Which endpoints failing, error rates

Check external dependencies:

newrelic: external services for payment-api
splunk: p1 dependency failures in production
# Upstream/downstream service health

Database bottlenecks:

newrelic: database performance for payment-api
splunk: p1 database issues in production
# Slow queries, connection issues

Infrastructure problems:

newrelic: host resources
splunk: p1 resource exhaustion in production
# CPU, memory, disk usage

Trace a specific request:

newrelic: trace request 550e8400-e29b-41d4-a716-446655440000
splunk: p1 request tracing in production with trace abc123
# Follow request through all services

Deployment & Release Templates (6 templates)

Many P1 incidents trace back to a recent deployment - these help you validate releases and make rollback decisions in seconds.

Compare before/after deployment:

newrelic: deployment comparison for payment-api before 14:30 and after 14:30
# See if error rate or latency increased after deploy

Check which version has issues:

newrelic: version errors for payment-api
# Compare error rates across versions (v2.3.1 vs v2.3.0)

Validate a rollback worked:

newrelic: rollback validation for payment-api
# Confirms metrics returned to normal after rollback

Monitor canary deployment:

newrelic: canary health for payment-api
# Compare canary vs production metrics to decide if safe to promote

Find recent changes:

newrelic: recent changes for payment-api
splunk: deployment logs
# What changed recently, correlate with error spikes

View deployment history:

newrelic: deployment timeline for payment-api
splunk: version error logs
# Timeline of deployments and their impact

Capacity & Resource Templates (8 templates)

Prevent incidents before they happen - detect memory leaks, pool exhaustion, and capacity issues early.

Detect memory leaks:

newrelic: memory leak detection
splunk: memory leak logs
# Growing memory trend, OutOfMemory errors

Thread pool saturation:

newrelic: thread pool for payment-api
# Active threads, queued tasks, utilization %

Connection pool health:

newrelic: connection pool status for payment-api
splunk: connection pool logs
# DB/cache pool utilization, timeouts

Message queue backlog:

newrelic: queue depth
splunk: queue backlog logs
# Queue depth, publish/consume rates

Rate limiting issues:

newrelic: rate limiting for payment-api
splunk: rate limit logs
# 429 errors, which endpoints rate limited

Autoscaling events:

newrelic: autoscale triggers
# When and why instances scaled up/down

Capacity forecasting:

newrelic: capacity forecast for payment-api
# Project capacity needs for next few hours

Find resource bottlenecks:

newrelic: resource saturation
# Which hosts hitting CPU/memory/disk limits

Business & Revenue Templates (7 templates)

Translate technical issues to business impact - what executives actually care about.

Checkout flow health:

newrelic: checkout funnel for ecommerce-app
splunk: checkout errors
# Success rate at each checkout step, where users drop off

Calculate revenue loss:

newrelic: revenue impact for payment-api with 200 avg transaction
# Estimated $ lost from failed transactions

Critical transaction monitoring:

newrelic: critical transactions for payment-api
splunk: critical transaction logs
# Payment, order, purchase success rates

SLA tracking:

newrelic: sla compliance for payment-api with 1000ms threshold
splunk: sla breach logs
# % meeting SLA, breach timeline

User journey analysis:

newrelic: user journey for ecommerce-app
splunk: user journey logs
# Home > Product > Cart > Checkout > Confirm funnel

Cart abandonment:

newrelic: cart abandonment for ecommerce-app
# Abandonment rate correlated with errors

Conversion impact:

newrelic: conversion impact for ecommerce-app
# Conversion rate vs error rate correlation

Natural Language Queries

The router automatically detects which platform to use based on your query:

These route to New Relic:

  • "whats the error rate for my application?"
  • "show me latency metrics"
  • "deployment comparison for payment-api"
  • "memory leak detection"
  • "revenue impact for checkout"

These route to Splunk:

  • "show me error logs from production"
  • "deployment logs in the last 2 hours"
  • "connection pool exhausted errors"
  • "p1 root cause analysis"

These route to Kubernetes:

  • "list all pods in production namespace"
  • "get logs for payment-pod"
  • "describe service api-gateway"
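
The routing behaviour in the lists above can be approximated with a simple keyword heuristic. This is a sketch only: the function `detect_platform` and its hint lists are assumptions chosen to reproduce the listed examples, and the actual intent router may classify queries quite differently.

```python
# Hint lists are illustrative; they are not taken from the project source.
K8S_HINTS = ("kubectl", "pod", "namespace", "describe")
SPLUNK_HINTS = ("logs", "root cause", "exhausted")


def detect_platform(text: str) -> str:
    """Pick a platform for a natural-language query using keyword hints."""
    lowered = text.lower()
    # Kubernetes is checked first so "get logs for payment-pod" is not sent to Splunk.
    if any(hint in lowered for hint in K8S_HINTS):
        return "kubectl"
    if any(hint in lowered for hint in SPLUNK_HINTS):
        return "splunk"
    # Metric-style questions fall through to New Relic.
    return "newrelic"
```

The order of the checks matters here: a query that mentions both a pod and logs is treated as a Kubernetes request, matching the "get logs for payment-pod" example.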

Template Categories Summary

| Category | Templates | Use For | Time to Insight |
|----------|-----------|---------|-----------------|
| Debug | 14 | Current state, errors, latency | 30 seconds - 2 mins |
| P1 Incident | 9 | Critical outages, spike analysis | 5-10 mins to root cause |
| Deployment | 6 | Deploy validation, rollback decisions | 30 seconds |
| Capacity | 8 | Memory leaks, resource planning | 2-5 mins |
| Business | 7 | Revenue impact, SLA tracking | 1-2 mins |
| Splunk Logs | 19 | Log analysis, error context | 1-3 mins |

Total: 55+ production-ready templates

Production Workflows and Decision Trees

P1 Incident Response Workflow

When a critical incident occurs, follow this systematic approach:

Step 1: Immediate Assessment (30 seconds)

newrelic: comprehensive debug for <app>

This gives you the complete picture - error rate, latency, throughput, apdex.

Step 2: Identify What's Broken (1 minute)

newrelic: current failures for <app>
newrelic: failed endpoints for <app>

Now you know which specific endpoints are failing.

Step 3: Understand Why (2 minutes)

newrelic: error reasons for <app>
splunk: p1 root cause in production

Get the specific error messages, HTTP status codes, and stack traces.

Step 4: Check Recent Changes (1 minute)

newrelic: recent changes for <app>
splunk: deployment logs

Was there a recent deployment? Configuration change?

Step 5: Make Decision (1 minute)

  • If deployment-related: Rollback immediately
  • If dependency-related: Check external services
  • If capacity-related: Scale up resources

Total time to root cause: 5-10 minutes

Deployment Validation Workflow

After every deployment, validate health in 30 seconds:

Check impact immediately:

newrelic: deployment comparison for <app> before <deploy-time> and after <deploy-time>

Decision criteria:

  • Error rate increased >2x → Rollback
  • Latency increased >2x → Rollback
  • Error rate increased 1.5-2x → Monitor closely
  • Metrics similar or better → Deploy successful
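
The decision criteria above can be written as a small function. This is a sketch with hypothetical names (`deployment_verdict`, `_ratio`); the handling of a zero baseline is an assumption, since the criteria do not cover that case.

```python
def _ratio(before: float, after: float) -> float:
    """Return after/before, treating a zero baseline as 'no change' or 'infinite'."""
    if before == 0:
        return float("inf") if after > 0 else 1.0
    return after / before


def deployment_verdict(err_before: float, err_after: float,
                       lat_before: float, lat_after: float) -> str:
    """Apply the rollback criteria to before/after error rate and latency."""
    err_ratio = _ratio(err_before, err_after)
    lat_ratio = _ratio(lat_before, lat_after)
    if err_ratio > 2 or lat_ratio > 2:
        return "rollback"
    if err_ratio >= 1.5:
        return "monitor"
    return "ok"
```

For the numbers in Scenario 1 (0.5% to 5.0% errors, 200ms to 450ms latency), `deployment_verdict(0.5, 5.0, 200, 450)` returns `"rollback"`.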

Validate the rollback:

newrelic: rollback validation for <app>

Confirms metrics returned to normal.

Memory Leak Investigation

When application performance degrades over time:

Detect the leak (2 minutes):

newrelic: memory leak detection

Look for memory growth >0.5% per minute.
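
As a worked version of that threshold, the sketch below estimates growth per minute from the first and last memory samples. The helper names and the (minute, megabytes) sample format are assumptions for this example; the template's own calculation is not documented here, and a fit over all samples would be more robust to noise.

```python
def growth_pct_per_minute(samples: list[tuple[float, float]]) -> float:
    """Estimate memory growth in % per minute from (minute, memory_mb) samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, m0), (t1, m1) = samples[0], samples[-1]
    if t1 <= t0 or m0 <= 0:
        raise ValueError("samples must be time-ordered with a positive baseline")
    # Relative growth over the window, spread evenly across the elapsed minutes.
    return (m1 - m0) / m0 * 100 / (t1 - t0)


def looks_like_leak(samples: list[tuple[float, float]], threshold: float = 0.5) -> bool:
    """Flag growth above the threshold (0.5% per minute by default)."""
    return growth_pct_per_minute(samples) > threshold
```

A host that grows from 1000 MB to 1080 MB over 10 minutes is growing at about 0.8% per minute, which exceeds the 0.5% threshold.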

Get error details:

splunk: memory leak logs

Find OutOfMemory errors, heap warnings.

Check related resources:

newrelic: thread pool for <app>
newrelic: connection pool status for <app>

Decision:

  • Memory growing steadily → Memory leak, restart required
  • Thread pool exhausted → Increase thread pool or scale
  • Connection pool saturated → Increase pool size

Business Impact Assessment

When executives ask "how much is this costing us?":

Calculate revenue impact:

newrelic: revenue impact for <app> with <avg-transaction-value> avg transaction

Example output: 500 failed transactions × $150 = $75,000 lost
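
The arithmetic behind that example is a single multiplication, sketched below. The function name `estimated_revenue_loss` is hypothetical; the default of 100 mirrors the documented default for avg_transaction_value.

```python
def estimated_revenue_loss(failed_transactions: int, avg_transaction_value: float = 100.0) -> float:
    """Estimate revenue lost as failed transactions times average transaction value."""
    if failed_transactions < 0 or avg_transaction_value < 0:
        raise ValueError("inputs must be non-negative")
    return failed_transactions * avg_transaction_value


def format_loss(amount: float) -> str:
    """Render a dollar amount with thousands separators and no cents."""
    return f"${amount:,.0f}"
```

This treats every failed transaction as fully lost revenue, so it is an upper-bound style estimate; customers who retry successfully are not accounted for.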

Show conversion impact:

newrelic: conversion impact for <app>

Compare today's conversion rate vs baseline.

Identify broken user journeys:

newrelic: checkout funnel for <app>
newrelic: user journey for <app>

Where are users dropping off?

Escalation thresholds:

  • Revenue loss >$10k/hour → Page senior engineers; >$50k/hour → Page executives
  • Checkout success <95% → High priority; <90% → Critical priority
  • Conversion rate drop >20% → Immediate investigation

Capacity Planning Workflow

Before major events (Black Friday, product launch):

Forecast capacity needs:

newrelic: capacity forecast for <app>

See current throughput trend and projected load.

Check current resource usage:

newrelic: resource saturation

Which resources will hit limits first?

Validate autoscaling:

newrelic: autoscale triggers

Ensure autoscaling is configured properly.

Check connection pools:

newrelic: connection pool status for <app>

Make sure the database can handle increased load.

Pre-emptive actions:

  • Scale up if CPU/memory >70% during normal load
  • Increase connection pools if utilization >60%
  • Add read replicas if database queries are slow

Key Decision Criteria

When to Rollback a Deployment

Rollback immediately if:

  • New version error rate >2x old version
  • New version latency >2x old version
  • New version apdex score <0.5
  • Critical transaction success rate <95%
  • Revenue-impacting errors detected

Monitor closely if:

  • Error rate increased 1.5-2x (prepare for rollback)
  • Latency increased 1.5-2x (investigate)
  • Minor increase in errors but not critical endpoints

Capacity Alert Thresholds

Critical (take action now):

  • CPU >90%
  • Memory >90% or growing >0.5%/minute
  • Thread pool utilization >90%
  • Connection pool utilization >95%
  • Queue depth growing continuously

Warning (plan action):

  • CPU 75-90%
  • Memory 75-90%
  • Thread pool 75-90%
  • Connection pool 85-95%
  • Queue depth stable but high
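
The numeric thresholds above translate directly into a lookup table. In this sketch the resource keys and the function `classify_utilization` are names invented for illustration; the trend-based rules (memory growth rate, queue depth) are left out because they need time-series input.

```python
# (warning, critical) utilization percentages taken from the lists above.
THRESHOLDS = {
    "cpu": (75, 90),
    "memory": (75, 90),
    "thread_pool": (75, 90),
    "connection_pool": (85, 95),
}


def classify_utilization(resource: str, percent: float) -> str:
    """Classify a utilization percentage as ok, warning, or critical."""
    warning, critical = THRESHOLDS[resource]
    if percent > critical:
        return "critical"
    if percent >= warning:
        return "warning"
    return "ok"
```

A value exactly at the critical boundary (for example CPU at 90%) is reported as a warning, since the critical rule is written as "greater than".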

Business Impact Escalation

Critical severity (page executives):

  • Revenue loss >$50k/hour
  • Checkout success <90%
  • Payment success <95%
  • Major customer impact

High severity (page senior engineers):

  • Revenue loss $10k-50k/hour
  • Checkout success 90-95%
  • Payment success 95-98%
  • Multiple customers affected

Medium severity (standard response):

  • Revenue loss <$10k/hour
  • Checkout success >95%
  • Payment success >98%
  • Limited customer impact
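
The quantitative parts of these severity levels can be sketched as a classifier. The function `business_severity` is hypothetical, and the qualitative criteria (how many customers are affected) are deliberately omitted because they need human judgment.

```python
def business_severity(revenue_loss_per_hour: float,
                      checkout_success_pct: float,
                      payment_success_pct: float) -> str:
    """Map business metrics to critical, high, or medium severity."""
    if (revenue_loss_per_hour > 50_000
            or checkout_success_pct < 90
            or payment_success_pct < 95):
        return "critical"
    if (revenue_loss_per_hour >= 10_000
            or checkout_success_pct <= 95
            or payment_success_pct <= 98):
        return "high"
    return "medium"
```

The worst individual metric decides the result, so a $75,000/hour loss is critical even when checkout and payment success rates look healthy.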

Quick Reference Commands

Deployment Commands

# Compare metrics before/after deploy
newrelic: deployment comparison for payment-api before 14:30 and after 14:30

# Check error rates by version
newrelic: version errors for payment-api

# Validate rollback restored health
newrelic: rollback validation for payment-api

# Monitor canary deployment
newrelic: canary health for payment-api

# See what changed recently
newrelic: recent changes for payment-api
splunk: deployment logs

Debug Commands

# Quick health check
newrelic: comprehensive debug for payment-api

# Current failure metrics
newrelic: current failures for payment-api

# Which endpoints are broken
newrelic: failed endpoints for payment-api

# Why they're failing
newrelic: error reasons for payment-api
newrelic: error types for payment-api

# Performance analysis
newrelic: latency metrics for payment-api
newrelic: endpoint performance for payment-api

# Dependency health
newrelic: service calls for payment-api
newrelic: database queries for payment-api

Capacity Commands

# Detect memory leaks
newrelic: memory leak detection
splunk: memory leak logs

# Check resource usage
newrelic: resource saturation
newrelic: thread pool for payment-api
newrelic: connection pool status for payment-api

# Queue analysis
newrelic: queue depth
splunk: queue backlog logs

# Capacity planning
newrelic: capacity forecast for payment-api
newrelic: autoscale triggers

# Rate limiting
newrelic: rate limiting for payment-api
splunk: rate limit logs

Business Impact Commands

# Revenue calculation
newrelic: revenue impact for payment-api with 150 avg transaction

# Funnel analysis
newrelic: checkout funnel for ecommerce-app
newrelic: user journey for ecommerce-app

# Transaction monitoring
newrelic: critical transactions for payment-api
splunk: critical transaction logs

# SLA tracking
newrelic: sla compliance for payment-api with 1000ms threshold
splunk: sla breach logs

# Conversion analysis
newrelic: conversion impact for ecommerce-app
newrelic: cart abandonment for ecommerce-app

P1 Incident Commands

# Initial assessment
newrelic: critical metrics for payment-api
newrelic: comprehensive debug for payment-api

# Error investigation
newrelic: error spike analysis for payment-api
splunk: p1 error timeline in production

# Service health
newrelic: affected endpoints for payment-api
splunk: p1 affected services in production

# Dependency check
newrelic: external services for payment-api
splunk: p1 dependency failures in production

# Infrastructure
newrelic: host resources
splunk: p1 resource exhaustion in production

# Request tracing
newrelic: trace request <trace-id>
splunk: p1 request tracing in production with trace <trace-id>

Common Issues and Solutions

Issue: Application Slow After Deployment

Diagnosis:

newrelic: deployment comparison for <app> before <time> and after <time>
newrelic: latency metrics for <app>
newrelic: database queries for <app>

Common causes:

  • New code introduced N+1 queries
  • Database connection pool too small
  • External API calls not optimized
  • Memory leak causing GC pressure

Solution:

  • Rollback if latency >2x baseline
  • Check database query performance
  • Increase connection pool if saturated
  • Monitor memory growth

Issue: Intermittent Errors

Diagnosis:

newrelic: error timeline for <app>
newrelic: error reasons for <app>
splunk: p1 error timeline in production

Common causes:

  • Connection pool exhaustion (errors every N minutes)
  • Rate limiting from external service
  • Memory pressure causing timeouts
  • Load balancer health check failures

Solution:

  • Check connection pool utilization
  • Review rate limiting logs
  • Verify autoscaling thresholds
  • Check load balancer configuration

Issue: Checkout Failures

Diagnosis:

newrelic: checkout funnel for ecommerce-app
newrelic: revenue impact for payment-api with <avg-value> avg transaction
splunk: checkout errors

Common causes:

  • Payment gateway timeout
  • Session expiration
  • Inventory service unavailable
  • Database connection issues

Solution:

  • Check payment gateway health
  • Verify session timeout settings
  • Test inventory service
  • Scale database connections

Issue: High Memory Usage

Diagnosis:

newrelic: memory leak detection
newrelic: thread pool for <app>
splunk: memory leak logs

Common causes:

  • Memory leak in application code
  • Connection leak (unclosed connections)
  • Large object retention
  • Thread leak

Solution:

  • Restart affected instances
  • Heap dump analysis
  • Review recent code changes
  • Check connection pool settings

Template Parameters Reference

Common parameters used across templates:

app_name (required in most queries)

  • Your application name in New Relic
  • Example: payment-api, ecommerce-app, user-service

time_range (optional, varies by template)

  • New Relic: "5 minutes ago", "1 hour ago", "1 day ago"
  • Splunk: "-5m", "-1h", "-24h"
  • Default varies by template (5min for current state, 1h for analysis)
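
Because the two platforms write time ranges differently, a small converter can be handy when running the same investigation on both. The sketch below is an assumption-laden illustration: `to_splunk_time` is a hypothetical helper, it accepts only the "N minutes/hours/days ago" forms shown above, and it expresses days in hours to match the "-24h" example.

```python
import re

# Maps each supported unit to a Splunk suffix and a multiplier.
_UNITS = {"minute": ("m", 1), "hour": ("h", 1), "day": ("h", 24)}
_PATTERN = re.compile(r"^(\d+) (minute|hour|day)s? ago$")


def to_splunk_time(nrql_range: str) -> str:
    """Convert e.g. '1 hour ago' into a Splunk relative time such as '-1h'."""
    match = _PATTERN.match(nrql_range.strip().lower())
    if not match:
        raise ValueError(f"unsupported time range: {nrql_range!r}")
    amount, unit = int(match.group(1)), match.group(2)
    suffix, factor = _UNITS[unit]
    return f"-{amount * factor}{suffix}"
```

Unsupported inputs raise `ValueError` rather than being passed through, so a typo does not silently widen or narrow the search window.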

deployment_time (required for deployment comparison)

  • Epoch timestamp or ISO format
  • Example: 1642604400 or "14:30"

avg_transaction_value (optional, for revenue impact)

  • Average dollar value per transaction
  • Example: 100, 150, 200
  • Default: 100

sla_threshold (optional, for SLA compliance)

  • Response time threshold in milliseconds
  • Example: 500, 1000, 2000
  • Default: 1000

cpu_threshold / memory_threshold

  • Resource utilization percentage
  • Example: 80, 85, 90
  • Defaults: cpu=80, memory=85

error_threshold (optional, for filtering)

  • Error rate percentage threshold
  • Example: 1, 5, 10
  • Default: 1 (shows endpoints with >1% errors)

slow_threshold (optional, for database queries)

  • Query duration threshold in milliseconds
  • Example: 100, 500, 1000
  • Default: 100

Tips for Effective Usage

Tip 1: Combine New Relic + Splunk

  • New Relic tells you WHAT is happening (metrics)
  • Splunk tells you WHY (logs, errors, stack traces)
  • Always use both for complete root cause analysis

Tip 2: Establish Baselines

  • Know your normal error rate (e.g., 0.5%)
  • Know your normal p95 latency (e.g., 300ms)
  • Know your normal conversion rate (e.g., 3.2%)
  • Deviations become obvious when you have baselines

Tip 3: Run Deployment Checks Automatically

After every deploy:

  1. Wait 5 minutes for metrics to stabilize
  2. Run deployment comparison
  3. Roll back automatically if the error rate is more than 2x the pre-deploy rate

Tip 4: Use Time Ranges Strategically

  • Current state: 5 minutes (what's happening now)
  • Troubleshooting: 30 minutes - 1 hour (enough context)
  • Analysis: 1-4 hours (see patterns)
  • Capacity planning: 4-24 hours (trends)

Tip 5: Start Broad, Then Focus

  1. Start with comprehensive debug (all metrics)
  2. Identify problem area (errors vs latency vs capacity)
  3. Use specific template for deep dive
  4. Get logs from Splunk for details

Tip 6: Revenue First for Executives

When reporting to executives:

  • Lead with business impact ($75k revenue loss)
  • Then technical details (payment gateway timeout)
  • Then action plan (switching to backup gateway)
  • Avoid technical jargon in initial report

Architecture

┌────────────────────────────────────────────────────────┐
│                     MCP Server                         │
│  ┌─────────────────────────────────────────────────┐   │
│  │              Intent Router                      │   │
│  │   (Classifies queries → platforms)              │   │
│  └─────────────────────────────────────────────────┘   │
│                          │                             │
│         ┌────────────────┼────────────────┐            │
│         ▼                ▼                ▼            │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │
│  │  NerdGraph  │  │   Splunk    │  │   kubectl   │     │
│  │   Adapter   │  │   Adapter   │  │   Adapter   │     │
│  └─────────────┘  └─────────────┘  └─────────────┘     │
│         │                │                │            │
│         ▼                ▼                ▼            │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │
│  │  GraphQL    │  │  REST API   │  │    CLI      │     │
│  │    API      │  │             │  │  Subprocess │     │
│  └─────────────┘  └─────────────┘  └─────────────┘     │
└────────────────────────────────────────────────────────┘

Development

Running Tests

pytest tests/

Code Formatting

ruff check src/
ruff format src/

License

MIT
