Incident runbooks
These pages are linked from Prometheus alert rules. Each one is terse on purpose: symptoms, check, remediation, post-incident.
For live incidents, check `#ops-alerts` in Slack for the full alert payload, then open the relevant runbook and start at the step that matches the current state of the incident.
Rate limiter failing open
The API-key rate limiter has fallen back to allow-all because Redis is unreachable.
High error rate
A service is returning 5xx above the configured threshold.
Disk full on host
The EBS volume on the prod EC2 instance is above 85% full, so containers are at risk of eviction.
DB connection pool exhausted
A service is opening PG connections faster than it releases them.
Alertmanager silenced
Alerts are being suppressed globally. Verify that no stale silence has survived past its window.
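For readers new to the term used in the "Rate limiter failing open" entry, the following is a minimal sketch of what failing open means, not the production limiter. Everything in it is an assumption made for illustration: the class `FailOpenRateLimiter`, the `incr` callable, the `rl:` key format, the fixed-window scheme, and the `fail_open_total` counter are all hypothetical, and the built-in `ConnectionError` stands in for whatever exception the real Redis client raises.

```python
import time
from typing import Callable, Optional


class FailOpenRateLimiter:
    """Fixed-window limiter that allows all traffic when its counter store is down."""

    def __init__(self, incr: Callable[[str], int], limit: int, window_s: int = 60):
        self._incr = incr            # hypothetical: increments a key, returns the new count
        self._limit = limit          # max requests per key per window
        self._window_s = window_s
        self.fail_open_total = 0     # hypothetical counter an alert rule could watch

    def allow(self, api_key: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        window = int(now // self._window_s)
        try:
            count = self._incr(f"rl:{api_key}:{window}")
        except ConnectionError:
            # Store unreachable: fail open (allow-all) and record the fallback.
            self.fail_open_total += 1
            return True
        return count <= self._limit


def make_memory_store() -> Callable[[str], int]:
    """In-memory stand-in for the counter store, for demonstration only."""
    counts = {}

    def incr(key: str) -> int:
        counts[key] = counts.get(key, 0) + 1
        return counts[key]

    return incr


def unreachable_store(key: str) -> int:
    """Stand-in for a store that cannot be reached."""
    raise ConnectionError("counter store unreachable")
```

With a healthy store the limiter rejects requests over the limit; with `unreachable_store` every call to `allow` returns `True` and `fail_open_total` grows, which is the condition this runbook entry describes.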