Runbooks

Incident runbooks

These pages are linked from Prometheus alert rules. Each one is terse on purpose: symptoms, check, remediation, post-incident.

For live incidents: check `#ops-alerts` in Slack for the full alert payload, then follow the relevant runbook from the appropriate step.

Rate limiter failing open

API-key rate limiter fell back to allow-all because Redis is unreachable.

A service is returning 5xx above the configured threshold.

The EBS volume on the prod EC2 instance is above 85% full, putting containers at risk of eviction.

A service is opening PG connections faster than it releases them.

Alerts are being suppressed globally — verify no stale silence survived past its window.
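To make the first entry concrete, here is a minimal sketch of what "failing open" can look like in a rate limiter. It is an illustration under stated assumptions, not the service's actual code: the class name `FailOpenRateLimiter`, the injected `incr` callable, and the `failing_open` flag are all hypothetical, and the built-in `ConnectionError` stands in for whatever exception the real Redis client raises when the server is unreachable.

```python
from typing import Callable


class FailOpenRateLimiter:
    """Per-key request counter that allows all traffic when its store is unreachable.

    Hypothetical sketch: `incr` stands in for a call that increments and returns
    the current-window count for a key (for example, backed by Redis).
    """

    def __init__(self, incr: Callable[[str], int], limit: int) -> None:
        self._incr = incr
        self._limit = limit
        # Assumed signal that an alert rule could watch; not from the source.
        self.failing_open = False

    def allow(self, api_key: str) -> bool:
        try:
            count = self._incr(api_key)
        except ConnectionError:
            # Store unreachable: fall back to allow-all rather than reject traffic.
            self.failing_open = True
            return True
        self.failing_open = False
        return count <= self._limit


if __name__ == "__main__":
    def unreachable(_key: str) -> int:
        # Simulates the counter store being down.
        raise ConnectionError("store unreachable")

    limiter = FailOpenRateLimiter(unreachable, limit=100)
    print(limiter.allow("example-key"), limiter.failing_open)
```

The trade-off in this sketch is availability over protection: while the store is down, every request is admitted, which is why the condition is worth alerting on rather than leaving silent.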
