A Model Context Protocol (MCP) server that provides unified access to multiple observability and infrastructure tools with natural language query generation.
Now with 55+ production templates for debugging, incident response, deployment analysis, capacity planning and business metrics.
| Platform | Query Language | Use Cases |
|---|---|---|
| New Relic | NRQL via NerdGraph | APM metrics, error rates, throughput, infrastructure, deployment analysis |
| Splunk | SPL | Log search, event analysis, error investigation, root cause |
| Kubernetes | kubectl | Pod management, logs, cluster operations |
Auto-detect Platform - Automatically routes queries to the appropriate platform based on natural language
55+ Production Templates - Pre-built templates for common production scenarios:
- 14 Debug templates (current failures, errors, latency)
- 9 P1 Incident templates (critical metrics, spike analysis)
- 6 Deployment templates (version comparison, rollback validation)
- 8 Capacity templates (memory leaks, resource saturation)
- 7 Business templates (revenue impact, SLA compliance)
- 11+ Splunk log templates
Execute or Preview - Generate queries for review or execute directly against your systems
Natural Language - Just describe what you want: "deployment comparison for payment-api" or "memory leak detection"
Schema Reference - Built-in documentation for each query language and template
Windows (PowerShell):
.\setup.ps1
Windows (CMD):
setup.bat
Linux/macOS:
chmod +x setup.sh
./setup.sh
# Create virtual environment
python -m venv .venv
# Activate (Windows PowerShell)
.venv\Scripts\Activate.ps1
# Activate (Windows CMD)
.venv\Scripts\activate.bat
# Activate (Linux/macOS)
source .venv/bin/activate
# Install dependencies
pip install -e ".[dev]"
Set environment variables for the platforms you want to use:
NEW_RELIC_API_KEY=your-api-key
NEW_RELIC_ACCOUNT_ID=your-account-id
SPLUNK_HOST=splunk.example.com
SPLUNK_TOKEN=your-token
# Or use username/password
SPLUNK_USERNAME=admin
SPLUNK_PASSWORD=password
KUBECONFIG=/path/to/.kube/config
KUBE_CONTEXT=my-cluster
KUBE_NAMESPACE=default
# Path to custom API definitions
CUSTOM_APIS_PATH=apis/custom_apis.json
# API-specific tokens (used by custom_apis.json)
GITHUB_TOKEN=ghp_xxxxxxxxxxxx
PAGERDUTY_API_KEY=u+xxxxxxxxxxxxxxxx
DATADOG_API_KEY=xxxxxxxxxxxxxxxxxxxxx
DATADOG_APP_KEY=xxxxxxxxxxxxxxxxxxxxx
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/xxx/xxx/xxx
You can also create a .env file in the project root. Copy .env.example to get started:
cp .env.example .env
# Run directly
python -m src.server
# Or using the MCP CLI
mcp run src/server.py
The server will start and listen for MCP client connections. Use it with Claude Desktop, Cline, or any MCP-compatible client.
Add to your MCP client configuration (e.g., Claude Desktop config file):
{
"mcpServers": {
"prodhelp": {
"command": "python",
"args": ["-m", "src.server"],
"cwd": "d:/apps/prodhelp",
"env": {
"NEW_RELIC_API_KEY": "your-key",
"NEW_RELIC_ACCOUNT_ID": "your-account",
"SPLUNK_HOST": "splunk.example.com",
"SPLUNK_TOKEN": "your-token"
}
}
}
}
Scenario 1: Just deployed and want to validate
User: "deployment comparison for payment-api before 14:30 and after 14:30"
Server analyzes:
- Before: 0.5% error rate, 200ms latency
- After: 5.0% error rate, 450ms latency
- Conclusion: Deployment caused issues, consider rollback
Scenario 2: P1 alert fired for high error rate
User: "comprehensive debug for payment-api"
Then: "error reasons for payment-api"
Then: "failed endpoints for payment-api"
Result: Found that /api/v1/checkout endpoint has 95% errors
Root cause identified in 2 minutes
Scenario 3: Application getting slower over time
User: "memory leak detection"
Server shows:
- payment-api-01: Memory growing 0.8% per minute
- payment-api-02: Memory growing 0.9% per minute
- Memory leak detected, restart recommended
Scenario 4: Need to show business impact to executives
User: "revenue impact for payment-api with 150 avg transaction"
Server calculates:
- 500 failed transactions in last hour
- Estimated revenue loss: $75,000
- Critical: Escalate immediately
Daily Health Check (2 minutes):
1. newrelic: comprehensive debug for <your-app>
2. newrelic: critical transactions for <your-app>
3. newrelic: resource saturation
After Every Deployment (30 seconds):
1. newrelic: deployment comparison for <app> before <time> and after <time>
2. newrelic: version errors for <app>
P1 Incident Response (5-10 minutes):
1. newrelic: comprehensive debug for <app>
2. newrelic: current failures for <app>
3. newrelic: error reasons for <app>
4. splunk: p1 root cause in production
5. newrelic: failed endpoints for <app>
Capacity Planning (5 minutes):
1. newrelic: capacity forecast for <app>
2. newrelic: resource saturation
3. newrelic: memory leak detection
4. newrelic: connection pool status for <app>
Universal query tool with automatic platform detection.
query(
text="What's the error rate for my-service?",
execute=False # Set True to execute
)
Generate or execute New Relic NRQL queries.
newrelic_query(
intent="error rate",
app_name="my-service",
time_range="1 hour ago"
)
Generate or execute Splunk SPL queries.
splunk_query(
intent="top errors",
index="production",
time_range="-24h"
)
Generate or execute kubectl commands.
kubectl_command(
intent="get logs",
namespace="production",
resource_name="my-pod"
)
View available query templates.
list_templates(platform="newrelic") # or "splunk", "kubectl", or None for all
Get reference documentation for a query language.
get_schema(platform="splunk")
Define any REST API in JSON format and it automatically becomes an MCP tool.
- Edit apis/custom_apis.json - Add your API definition with endpoints
- Call reload_custom_apis() or restart the server
# List available custom APIs
list_custom_apis()
# Get endpoint details with sample request/response
get_api_endpoint_info(api_id="github_issues", endpoint_id="list_issues")
# Call an API
call_custom_api(
api_id="github_issues",
endpoint_id="list_issues",
params={"owner": "microsoft", "repo": "vscode", "state": "open"}
)
# Reload after editing custom_apis.json
reload_custom_apis()
{
"apis": [
{
"id": "my_api",
"name": "My API",
"description": "Description here",
"enabled": true,
"base_url": "https://api.example.com",
"auth": {
"type": "bearer",
"token_env": "MY_API_TOKEN"
},
"endpoints": [
{
"id": "get_data",
"name": "Get Data",
"method": "GET",
"path": "/data/{id}",
"parameters": {
"path": {
"id": {"type": "string", "required": true}
},
"query": {
"limit": {"type": "integer", "default": 10}
}
},
"sample_request": {"id": "123", "limit": 5},
"sample_response": {"status": 200, "body": {...}}
}
]
}
]
}
Supported Auth Types:
- none - No authentication
- bearer - Bearer token from env var
- basic - Username/password from env vars
- header - Custom header with template
- headers - Multiple custom headers
- api_key - API key as query parameter
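As an illustration, a basic-auth API could be declared like this. The field names `username_env` and `password_env` are guesses by analogy with the `token_env` field shown in the example above; check the actual schema in apis/custom_apis.json before relying on them:

```json
{
  "auth": {
    "type": "basic",
    "username_env": "MY_API_USER",
    "password_env": "MY_API_PASS"
  }
}
```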
The server now includes 55+ production-ready templates across debugging, incident response, deployment analysis, capacity planning, and business metrics.
Just deployed? Check the impact:
newrelic: deployment comparison for payment-api before 14:30 and after 14:30
newrelic: version errors for payment-api
newrelic: rollback validation for payment-api
Got a P1 incident? Start here:
newrelic: comprehensive debug for payment-api
newrelic: current failures for payment-api
newrelic: error reasons for payment-api
splunk: p1 root cause in production
Memory or capacity issues:
newrelic: memory leak detection
newrelic: connection pool status for payment-api
newrelic: capacity forecast for payment-api
splunk: memory leak logs
Need to show business impact:
newrelic: revenue impact for payment-api with 150 avg transaction
newrelic: checkout funnel for ecommerce-app
newrelic: sla compliance for payment-api with 1000ms threshold
These help you debug production issues quickly - usually within 5 minutes you'll know what's happening.
Current state snapshot:
newrelic: current failures for payment-api
# Returns: failure count, failure rate %, P50/P95/P99 latency
Find what's broken:
newrelic: failed endpoints for payment-api
# Shows which endpoints have errors, sorted by worst first
Understand the errors:
newrelic: error types for payment-api
newrelic: error reasons for payment-api
# Get error classes, messages, HTTP status codes
Check performance:
newrelic: latency metrics for payment-api
newrelic: endpoint performance for payment-api
# P50/P95/P99 latency per endpoint with timeline
See overall health:
newrelic: availability for payment-api
newrelic: debug all for payment-api
# Uptime %, success rate, comprehensive metrics
Check dependencies:
newrelic: service calls for payment-api
newrelic: database queries for payment-api
# External service health, slow database queries
For critical production incidents - these get you to root cause in under 10 mins.
Critical metrics dashboard:
newrelic: critical metrics for payment-api
# Error rate, throughput, latency, apdex - all in one
Error spike investigation:
newrelic: error spike analysis for payment-api
splunk: p1 error timeline in production
# When errors started, error types, timeline
Find affected services:
newrelic: affected endpoints for payment-api
splunk: p1 affected services in production
# Which endpoints failing, error rates
Check external dependencies:
newrelic: external services for payment-api
splunk: p1 dependency failures in production
# Upstream/downstream service health
Database bottlenecks:
newrelic: database performance for payment-api
splunk: p1 database issues in production
# Slow queries, connection issues
Infrastructure problems:
newrelic: host resources
splunk: p1 resource exhaustion in production
# CPU, memory, disk usage
Trace a specific request:
newrelic: trace request 550e8400-e29b-41d4-a716-446655440000
splunk: p1 request tracing in production with trace abc123
# Follow request through all services
Most P1 incidents are caused by deployments - these help you validate releases and make rollback decisions in seconds.
Compare before/after deployment:
newrelic: deployment comparison for payment-api before 14:30 and after 14:30
# See if error rate or latency increased after deploy
Check which version has issues:
newrelic: version errors for payment-api
# Compare error rates across versions (v2.3.1 vs v2.3.0)
Validate a rollback worked:
newrelic: rollback validation for payment-api
# Confirms metrics returned to normal after rollback
Monitor canary deployment:
newrelic: canary health for payment-api
# Compare canary vs production metrics to decide if safe to promote
Find recent changes:
newrelic: recent changes for payment-api
splunk: deployment logs
# What changed recently, correlate with error spikes
View deployment history:
newrelic: deployment timeline for payment-api
splunk: version error logs
# Timeline of deployments and their impact
Prevent incidents before they happen - detect memory leaks, pool exhaustion, and capacity issues early.
Detect memory leaks:
newrelic: memory leak detection
splunk: memory leak logs
# Growing memory trend, OutOfMemory errors
Thread pool saturation:
newrelic: thread pool for payment-api
# Active threads, queued tasks, utilization %
Connection pool health:
newrelic: connection pool status for payment-api
splunk: connection pool logs
# DB/cache pool utilization, timeouts
Message queue backlog:
newrelic: queue depth
splunk: queue backlog logs
# Queue depth, publish/consume rates
Rate limiting issues:
newrelic: rate limiting for payment-api
splunk: rate limit logs
# 429 errors, which endpoints rate limited
Autoscaling events:
newrelic: autoscale triggers
# When and why instances scaled up/down
Capacity forecasting:
newrelic: capacity forecast for payment-api
# Project capacity needs for next few hours
Find resource bottlenecks:
newrelic: resource saturation
# Which hosts hitting CPU/memory/disk limits
Translate technical issues to business impact - what executives actually care about.
Checkout flow health:
newrelic: checkout funnel for ecommerce-app
splunk: checkout errors
# Success rate at each checkout step, where users drop off
Calculate revenue loss:
newrelic: revenue impact for payment-api with 200 avg transaction
# Estimated $ lost from failed transactions
Critical transaction monitoring:
newrelic: critical transactions for payment-api
splunk: critical transaction logs
# Payment, order, purchase success rates
SLA tracking:
newrelic: sla compliance for payment-api with 1000ms threshold
splunk: sla breach logs
# % meeting SLA, breach timeline
User journey analysis:
newrelic: user journey for ecommerce-app
splunk: user journey logs
# Home > Product > Cart > Checkout > Confirm funnel
Cart abandonment:
newrelic: cart abandonment for ecommerce-app
# Abandonment rate correlated with errors
Conversion impact:
newrelic: conversion impact for ecommerce-app
# Conversion rate vs error rate correlation
The router automatically detects which platform to use based on your query:
These route to New Relic:
- "what's the error rate for my application?"
- "show me latency metrics"
- "deployment comparison for payment-api"
- "memory leak detection"
- "revenue impact for checkout"
These route to Splunk:
- "show me error logs from production"
- "deployment logs in the last 2 hours"
- "connection pool exhausted errors"
- "p1 root cause analysis"
These route to Kubernetes:
- "list all pods in production namespace"
- "get logs for payment-pod"
- "describe service api-gateway"
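The routing examples above can be sketched as a keyword classifier. This is a minimal illustration, assuming a hypothetical keyword table; the server's actual router may use a richer classifier:

```python
# Hypothetical keyword table: checked in order, first match wins.
# kubectl goes first so "get logs for payment-pod" is not claimed by Splunk.
KEYWORDS = {
    "kubectl": ("pod", "pods", "namespace", "kubectl", "describe"),
    "splunk": ("logs", "log", "root cause", "p1"),
    "newrelic": ("error rate", "latency", "deployment", "memory leak", "revenue"),
}

def route(text: str) -> str:
    """Return the platform whose keywords appear first in the query."""
    lowered = text.lower()
    for platform, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            return platform
    return "newrelic"  # fall back to metrics when nothing matches
```

For example, `route("get logs for payment-pod")` routes to kubectl because the pod keywords are checked before the log keywords.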
| Category | Templates | Use For | Time to Insight |
|---|---|---|---|
| Debug | 14 | Current state, errors, latency | 30 seconds - 2 mins |
| P1 Incident | 9 | Critical outages, spike analysis | 5-10 mins to root cause |
| Deployment | 6 | Deploy validation, rollback decisions | 30 seconds |
| Capacity | 8 | Memory leaks, resource planning | 2-5 mins |
| Business | 7 | Revenue impact, SLA tracking | 1-2 mins |
| Splunk Logs | 19 | Log analysis, error context | 1-3 mins |
Total: 55+ production-ready templates
When a critical incident occurs, follow this systematic approach:
Step 1: Immediate Assessment (30 seconds)
newrelic: comprehensive debug for <app>
This gives you the complete picture - error rate, latency, throughput, apdex.
Step 2: Identify What's Broken (1 minute)
newrelic: current failures for <app>
newrelic: failed endpoints for <app>
Now you know which specific endpoints are failing.
Step 3: Understand Why (2 minutes)
newrelic: error reasons for <app>
splunk: p1 root cause in production
Get the specific error messages, HTTP status codes, and stack traces.
Step 4: Check Recent Changes (1 minute)
newrelic: recent changes for <app>
splunk: deployment logs
Was there a recent deployment? Configuration change?
Step 5: Make Decision (1 minute)
- If deployment-related: Rollback immediately
- If dependency-related: Check external services
- If capacity-related: Scale up resources
Total time to root cause: 5-10 minutes
After every deployment, validate health in 30 seconds:
Check impact immediately:
newrelic: deployment comparison for <app> before <deploy-time> and after <deploy-time>
Decision criteria:
- Error rate increased >2x → Rollback
- Latency increased >2x → Rollback
- Error rate increased 1.5-2x → Monitor closely
- Metrics similar or better → Deploy successful
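The decision criteria above reduce to comparing before/after ratios. A minimal sketch (function name and shape are illustrative, not part of the server):

```python
def deploy_decision(before_err: float, after_err: float,
                    before_p95: float, after_p95: float) -> str:
    """Apply the rollback criteria: >2x worse -> rollback, 1.5-2x -> monitor."""
    err_ratio = after_err / before_err if before_err else float("inf")
    lat_ratio = after_p95 / before_p95 if before_p95 else float("inf")
    worst = max(err_ratio, lat_ratio)
    if worst > 2.0:
        return "rollback"
    if worst >= 1.5:
        return "monitor closely"
    return "deploy successful"
```

Using the Scenario 1 numbers, `deploy_decision(0.5, 5.0, 200, 450)` returns "rollback".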
Validate the rollback:
newrelic: rollback validation for <app>
Confirms metrics returned to normal.
When application performance degrades over time:
Detect the leak (2 minutes):
newrelic: memory leak detection
Look for memory growth >0.5% per minute.
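The >0.5%/minute rule amounts to fitting a slope through memory samples. A sketch, assuming samples of (minute, memory %) pairs; the template itself computes this server-side:

```python
def memory_growth_per_minute(samples):
    """Least-squares slope of (minute, memory_pct) samples, in %-points/min."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den

def looks_like_leak(samples, threshold=0.5):
    """True when memory grows faster than the threshold per minute."""
    return memory_growth_per_minute(samples) > threshold
```

The Scenario 3 host growing 0.8%/minute would trip this check; a host oscillating around a flat baseline would not.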
Get error details:
splunk: memory leak logs
Find OutOfMemory errors, heap warnings.
Check related resources:
newrelic: thread pool for <app>
newrelic: connection pool status for <app>
Decision:
- Memory growing steadily → Memory leak, restart required
- Thread pool exhausted → Increase thread pool or scale
- Connection pool saturated → Increase pool size
When executives ask "how much is this costing us?":
Calculate revenue impact:
newrelic: revenue impact for <app> with <avg-transaction-value> avg transaction
Example output: 500 failed transactions × $150 = $75,000 lost
Show conversion impact:
newrelic: conversion impact for <app>
Compare today's conversion rate vs baseline.
Identify broken user journeys:
newrelic: checkout funnel for <app>
newrelic: user journey for <app>
Where are users dropping off?
Escalation thresholds:
- Revenue loss >$10k/hour → Page executives
- Checkout success <95% → Critical priority
- Conversion rate drop >20% → Immediate investigation
Before major events (Black Friday, product launch):
Forecast capacity needs:
newrelic: capacity forecast for <app>
See current throughput trend and projected load.
Check current resource usage:
newrelic: resource saturation
Which resources will hit limits first?
Validate autoscaling:
newrelic: autoscale triggers
Ensure autoscaling is configured properly.
Check connection pools:
newrelic: connection pool status for <app>
Make sure database can handle increased load.
Pre-emptive actions:
- Scale up if CPU/memory >70% during normal load
- Increase connection pools if utilization >60%
- Add read replicas if database queries slow
Rollback immediately if:
- New version error rate >2x old version
- New version latency >2x old version
- New version apdex score <0.5
- Critical transaction success rate <95%
- Revenue-impacting errors detected
Monitor closely if:
- Error rate increased 1.5-2x (prepare for rollback)
- Latency increased 1.5-2x (investigate)
- Minor increase in errors but not critical endpoints
Critical (take action now):
- CPU >90%
- Memory >90% or growing >0.5%/minute
- Thread pool utilization >90%
- Connection pool utilization >95%
- Queue depth growing continuously
Warning (plan action):
- CPU 75-90%
- Memory 75-90%
- Thread pool 75-90%
- Connection pool 85-95%
- Queue depth stable but high
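The CPU/memory tiers above can be expressed as a small classifier. A sketch only; thread pools, connection pools, and queue depth would follow the same pattern with their own thresholds:

```python
def resource_status(cpu: float, memory: float,
                    mem_growth_per_min: float = 0.0) -> str:
    """Classify host resource usage against the critical/warning tiers above."""
    if cpu > 90 or memory > 90 or mem_growth_per_min > 0.5:
        return "critical"
    if cpu >= 75 or memory >= 75:
        return "warning"
    return "ok"
```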
Critical severity (page executives):
- Revenue loss >$50k/hour
- Checkout success <90%
- Payment success <95%
- Major customer impact
High severity (page senior engineers):
- Revenue loss $10k-50k/hour
- Checkout success 90-95%
- Payment success 95-98%
- Multiple customers affected
Medium severity (standard response):
- Revenue loss <$10k/hour
- Checkout success >95%
- Payment success >98%
- Limited customer impact
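The three severity tiers map directly to threshold checks. A sketch of that mapping (customer-impact signals are omitted since they are judgment calls, not numbers):

```python
def business_severity(loss_per_hour: float, checkout_pct: float,
                      payment_pct: float) -> str:
    """Map business metrics to the critical/high/medium tiers above."""
    if loss_per_hour > 50_000 or checkout_pct < 90 or payment_pct < 95:
        return "critical"
    if loss_per_hour > 10_000 or checkout_pct < 95 or payment_pct < 98:
        return "high"
    return "medium"
```

The Scenario 4 estimate ($75k/hour) would classify as critical and page executives.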
# Compare metrics before/after deploy
newrelic: deployment comparison for payment-api before 14:30 and after 14:30
# Check error rates by version
newrelic: version errors for payment-api
# Validate rollback restored health
newrelic: rollback validation for payment-api
# Monitor canary deployment
newrelic: canary health for payment-api
# See what changed recently
newrelic: recent changes for payment-api
splunk: deployment logs
# Quick health check
newrelic: comprehensive debug for payment-api
# Current failure metrics
newrelic: current failures for payment-api
# Which endpoints broken
newrelic: failed endpoints for payment-api
# Why they're failing
newrelic: error reasons for payment-api
newrelic: error types for payment-api
# Performance analysis
newrelic: latency metrics for payment-api
newrelic: endpoint performance for payment-api
# Dependency health
newrelic: service calls for payment-api
newrelic: database queries for payment-api
# Detect memory leaks
newrelic: memory leak detection
splunk: memory leak logs
# Check resource usage
newrelic: resource saturation
newrelic: thread pool for payment-api
newrelic: connection pool status for payment-api
# Queue analysis
newrelic: queue depth
splunk: queue backlog logs
# Capacity planning
newrelic: capacity forecast for payment-api
newrelic: autoscale triggers
# Rate limiting
newrelic: rate limiting for payment-api
splunk: rate limit logs
# Revenue calculation
newrelic: revenue impact for payment-api with 150 avg transaction
# Funnel analysis
newrelic: checkout funnel for ecommerce-app
newrelic: user journey for ecommerce-app
# Transaction monitoring
newrelic: critical transactions for payment-api
splunk: critical transaction logs
# SLA tracking
newrelic: sla compliance for payment-api with 1000ms threshold
splunk: sla breach logs
# Conversion analysis
newrelic: conversion impact for ecommerce-app
newrelic: cart abandonment for ecommerce-app
# Initial assessment
newrelic: critical metrics for payment-api
newrelic: comprehensive debug for payment-api
# Error investigation
newrelic: error spike analysis for payment-api
splunk: p1 error timeline in production
# Service health
newrelic: affected endpoints for payment-api
splunk: p1 affected services in production
# Dependency check
newrelic: external services for payment-api
splunk: p1 dependency failures in production
# Infrastructure
newrelic: host resources
splunk: p1 resource exhaustion in production
# Request tracing
newrelic: trace request <trace-id>
splunk: p1 request tracing in production with trace <trace-id>
Diagnosis:
newrelic: deployment comparison for <app> before <time> and after <time>
newrelic: latency metrics for <app>
newrelic: database queries for <app>
Common causes:
- New code introduced N+1 queries
- Database connection pool too small
- External API calls not optimized
- Memory leak causing GC pressure
Solution:
- Rollback if latency >2x baseline
- Check database query performance
- Increase connection pool if saturated
- Monitor memory growth
Diagnosis:
newrelic: error timeline for <app>
newrelic: error reasons for <app>
splunk: p1 error timeline in production
Common causes:
- Connection pool exhaustion (errors every N minutes)
- Rate limiting from external service
- Memory pressure causing timeouts
- Load balancer health check failures
Solution:
- Check connection pool utilization
- Review rate limiting logs
- Verify autoscaling thresholds
- Check load balancer configuration
Diagnosis:
newrelic: checkout funnel for ecommerce-app
newrelic: revenue impact for payment-api with <avg-value> avg transaction
splunk: checkout errors
Common causes:
- Payment gateway timeout
- Session expiration
- Inventory service unavailable
- Database connection issues
Solution:
- Check payment gateway health
- Verify session timeout settings
- Test inventory service
- Scale database connections
Diagnosis:
newrelic: memory leak detection
newrelic: thread pool for <app>
splunk: memory leak logs
Common causes:
- Memory leak in application code
- Connection leak (unclosed connections)
- Large object retention
- Thread leak
Solution:
- Restart affected instances
- Heap dump analysis
- Review recent code changes
- Check connection pool settings
Common parameters used across templates:
app_name (required in most queries)
- Your application name in New Relic
- Example: payment-api, ecommerce-app, user-service
time_range (optional, varies by template)
- New Relic: "5 minutes ago", "1 hour ago", "1 day ago"
- Splunk: "-5m", "-1h", "-24h"
- Default varies by template (5min for current state, 1h for analysis)
deployment_time (required for deployment comparison)
- Epoch timestamp or ISO format
- Example: 1642604400 or "14:30"
avg_transaction_value (optional, for revenue impact)
- Average dollar value per transaction
- Example: 100, 150, 200
- Default: 100
sla_threshold (optional, for SLA compliance)
- Response time threshold in milliseconds
- Example: 500, 1000, 2000
- Default: 1000
cpu_threshold / memory_threshold
- Resource utilization percentage
- Example: 80, 85, 90
- Defaults: cpu=80, memory=85
error_threshold (optional, for filtering)
- Error rate percentage threshold
- Example: 1, 5, 10
- Default: 1 (shows endpoints with >1% errors)
slow_threshold (optional, for database queries)
- Query duration threshold in milliseconds
- Example: 100, 500, 1000
- Default: 100
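To illustrate how these parameters feed a template, here is a minimal rendering sketch. The NRQL text and the `render` helper are hypothetical, not the server's actual implementation:

```python
# Hypothetical error-rate template using the app_name and time_range
# parameters documented above.
TEMPLATE = (
    "SELECT percentage(count(*), WHERE error IS true) AS error_rate "
    "FROM Transaction WHERE appName = '{app_name}' SINCE {time_range}"
)

def render(template: str, **params: str) -> str:
    """Fill a query template, falling back to per-template defaults."""
    defaults = {"time_range": "1 hour ago"}
    return template.format(**{**defaults, **params})
```

For example, `render(TEMPLATE, app_name="payment-api")` produces a query scoped to payment-api over the default last hour.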
Tip 1: Combine New Relic + Splunk
- New Relic tells you WHAT is happening (metrics)
- Splunk tells you WHY (logs, errors, stack traces)
- Always use both for complete root cause analysis
Tip 2: Establish Baselines
- Know your normal error rate (e.g., 0.5%)
- Know your normal p95 latency (e.g., 300ms)
- Know your normal conversion rate (e.g., 3.2%)
- Deviations become obvious when you have baselines
Tip 3: Run Deployment Checks Automatically After every deploy:
- Wait 5 minutes for metrics to stabilize
- Run deployment comparison
- Automatic rollback if error rate >2x
Tip 4: Use Time Ranges Strategically
- Current state: 5 minutes (what's happening now)
- Troubleshooting: 30 minutes - 1 hour (enough context)
- Analysis: 1-4 hours (see patterns)
- Capacity planning: 4-24 hours (trends)
Tip 5: Start Broad, Then Focus
- Start with comprehensive debug (all metrics)
- Identify problem area (errors vs latency vs capacity)
- Use specific template for deep dive
- Get logs from Splunk for details
Tip 6: Revenue First for Executives When reporting to executives:
- Lead with business impact ($75k revenue loss)
- Then technical details (payment gateway timeout)
- Then action plan (switching to backup gateway)
- Avoid technical jargon in initial report
┌────────────────────────────────────────────────────────┐
│ MCP Server │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Intent Router │ │
│ │ (Classifies queries → platforms) │ │
│ └─────────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────────┼────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ NerdGraph │ │ Splunk │ │ kubectl │ │
│ │ Adapter │ │ Adapter │ │ Adapter │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ GraphQL │ │ REST API │ │ CLI │ │
│ │ API │ │ │ │ Subprocess │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└────────────────────────────────────────────────────────┘
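The adapter layer in the diagram above can be sketched as a small interface. Class and method names here are illustrative, not the actual classes in src/:

```python
from abc import ABC, abstractmethod

class Adapter(ABC):
    """Translates a classified intent into a platform-specific query."""

    @abstractmethod
    def generate(self, intent: str, **params) -> str:
        ...

class KubectlAdapter(Adapter):
    """Maps intents to kubectl invocations (run via subprocess in practice)."""

    def generate(self, intent: str, **params) -> str:
        if intent == "get logs":
            return f"kubectl logs {params['resource_name']} -n {params['namespace']}"
        raise ValueError(f"unknown intent: {intent}")
```

The NerdGraph and Splunk adapters would implement the same interface, emitting NRQL and SPL respectively.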
pytest tests/
ruff check src/
ruff format src/
License: MIT