Prometheus Alert Explainer MCP
Name: prometheus.alert_explainer
Problem: On-call load increases when alerts fire without immediate context about their likely cause and impact.
Inputs:
{
  "alert_name": "HighErrorRate",
  "fingerprint": "ab12cd34",
  "lookback_minutes": 60,
  "correlate_labels": ["service", "cluster"],
  "include_runs": 3
}
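
A minimal sketch of how a server might parse and validate this payload, assuming a Python implementation; the class name, defaults, and validation rules are illustrative, not part of the spec.

from dataclasses import dataclass, field
from typing import List

@dataclass
class AlertExplainerInput:
    # Field names mirror the example payload above; defaults are assumptions.
    alert_name: str
    fingerprint: str
    lookback_minutes: int = 60
    correlate_labels: List[str] = field(default_factory=lambda: ["service", "cluster"])
    include_runs: int = 3

    def validate(self) -> None:
        if not (self.alert_name or self.fingerprint):
            raise ValueError("alert_name or fingerprint is required")
        if self.lookback_minutes <= 0 or self.include_runs <= 0:
            raise ValueError("lookback_minutes and include_runs must be positive")
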
Algorithm:
- Fetch the active alert (by name or fingerprint) from the Alertmanager API
- Pull the alerting rule's firing expression and evaluate it over the lookback range
- Compute the deviation of the current value vs a baseline from the last include_runs resolved windows
- Correlate with other active alerts sharing the correlate_labels values
- Surface the top contributing label combinations (smallest set covering ~80% of errors); see the two sketches after this list
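
A sketch of the fetch-and-evaluate steps, assuming a Python implementation against the stock Alertmanager v2 and Prometheus v1 HTTP APIs. The base URLs, bearer-token header, and client-side filtering are assumptions; stripping the trailing threshold comparison from the expression (so the full ratio series comes back) is omitted for brevity.

import time
import requests

def fetch_active_alert(am_url: str, alert_name: str, fingerprint: str, token: str):
    # List active alerts from Alertmanager v2 and match by fingerprint or alertname.
    resp = requests.get(
        f"{am_url}/api/v2/alerts",
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    resp.raise_for_status()
    for alert in resp.json():
        if alert.get("fingerprint") == fingerprint or alert["labels"].get("alertname") == alert_name:
            return alert
    return None

def fetch_rule_expression(prom_url: str, alert_name: str):
    # Look up the alerting rule's expression via Prometheus /api/v1/rules.
    resp = requests.get(f"{prom_url}/api/v1/rules", timeout=10)
    resp.raise_for_status()
    for group in resp.json()["data"]["groups"]:
        for rule in group["rules"]:
            if rule.get("type") == "alerting" and rule.get("name") == alert_name:
                return rule["query"]
    return None

def evaluate_range(prom_url: str, expr: str, lookback_minutes: int, step: str = "60s"):
    # Evaluate the expression over the lookback window; returns (unix_ts, value) pairs.
    end = time.time()
    start = end - lookback_minutes * 60
    resp = requests.get(
        f"{prom_url}/api/v1/query_range",
        params={"query": expr, "start": start, "end": end, "step": step},
        timeout=30,
    )
    resp.raise_for_status()
    series = resp.json()["data"]["result"]
    return [(float(ts), float(v)) for ts, v in series[0]["values"]] if series else []
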
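And a sketch of the scoring steps: the baseline is taken as the p95 of samples from the previous include_runs resolved windows, and the Pareto cut keeps the smallest set of label combinations covering ~80% of errors. Function names and the nearest-rank percentile are simplifying assumptions.

def deviation_vs_baseline(current_value: float, baseline_samples: list):
    # Nearest-rank p95 of the baseline; a real implementation might use numpy.percentile.
    if not baseline_samples:
        return 0.0, float("inf")
    ordered = sorted(baseline_samples)
    p95 = ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]
    ratio = current_value / p95 if p95 > 0 else float("inf")
    return p95, ratio

def top_contributors(error_counts: dict, cutoff: float = 0.80):
    # error_counts maps a label combination, e.g. ("checkout", "prod-a"), to its error count.
    total = sum(error_counts.values()) or 1.0
    contributors, covered = [], 0.0
    for labels, count in sorted(error_counts.items(), key=lambda kv: kv[1], reverse=True):
        share = count / total
        contributors.append({"labels": labels, "share": round(share, 2)})
        covered += share
        if covered >= cutoff:
            break
    return contributors

# Example with hypothetical counts grouped by (service, cluster):
# top_contributors({("checkout", "prod-a"): 610, ("cart", "prod-a"): 430, ("search", "prod-b"): 120})
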
Output:
{
  "alert": {"name": "HighErrorRate", "state": "firing", "severity": "page"},
  "expression": "increase(http_requests_total{code=~'5..'}[5m]) / increase(http_requests_total[5m]) > 0.02",
  "current_value": 0.037,
  "baseline_p95": 0.011,
  "deviation_ratio": 3.36,
  "correlated_alerts": ["LatencySLODegradation"],
  "top_contributors": [
    {"service": "checkout", "cluster": "prod-a", "error_ratio": 0.061, "share": 0.44},
    {"service": "cart", "cluster": "prod-a", "error_ratio": 0.049, "share": 0.31}
  ],
  "impact_assessment": "Approx 3.4x baseline. Two services drive 75% of errors.",
  "recommended_actions": [
    "Check recent deploys for checkout, cart in prod-a",
    "Examine trace error spans correlated to 5xx",
    "Enable temporary adaptive throttling if saturation rises"
  ]
}
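
The impact_assessment string can be rendered directly from the computed fields; a hypothetical helper, assuming the output shape above:

def render_impact(deviation_ratio: float, contributors: list) -> str:
    # e.g. render_impact(3.36, [{"share": 0.44}, {"share": 0.31}])
    #      -> "Approx 3.4x baseline. 2 services drive 75% of errors."
    top_share = sum(c["share"] for c in contributors)
    return (
        f"Approx {deviation_ratio:.1f}x baseline. "
        f"{len(contributors)} services drive {top_share:.0%} of errors."
    )
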
Failure Modes:
- Alert not found -> return a 404-style body with the computed fields set to null (sketch below)
- Expression no longer valid (rule changed or removed) -> mark stale=true in the output
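
A sketch of how these could surface in the response body, keeping the normal output shape; the exact field names (status, error, stale) are assumptions:

def not_found_body(alert_name: str) -> dict:
    # "Alert not found": keep the output shape but null the computed fields.
    return {
        "alert": None,
        "expression": None,
        "current_value": None,
        "status": 404,
        "error": f"alert not found: {alert_name}",
    }

def mark_stale(output: dict) -> dict:
    # "Expression no longer valid": keep whatever was computed, flag it as stale.
    output["stale"] = True
    return output
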
Security:
- Read-only Alertmanager token; optionally redact label values on alerts annotated pii=true (sketch below)
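
A sketch of the optional redaction pass, assuming redaction is keyed off a pii=true annotation on the alert; the placeholder value and helper name are assumptions:

def redact_labels(alert: dict) -> dict:
    # Blank out label values when the alert carries a pii=true annotation.
    if alert.get("annotations", {}).get("pii") == "true":
        alert = dict(alert)
        alert["labels"] = {name: "[REDACTED]" for name in alert.get("labels", {})}
    return alert
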
Extensions:
- Pull recent code changes from the Git provider for the owning teams
- Predict time-to-SLO-breach from the derivative of the error ratio (sketch below)
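
For the time-to-breach extension, linear extrapolation over the lookback series is the simplest starting point: estimate the slope of the error ratio and divide the remaining headroom to the SLO threshold by it. A hypothetical sketch:

def predict_time_to_breach(samples: list, threshold: float):
    # samples: (unix_ts, value) pairs, e.g. the error-ratio series from evaluate_range().
    # Returns estimated seconds until `threshold` is crossed, or None if the series is
    # flat/decreasing or already past the threshold.
    if len(samples) < 2:
        return None
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    slope = (v1 - v0) / (t1 - t0)  # value units per second
    if slope <= 0 or v1 >= threshold:
        return None
    return (threshold - v1) / slope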