# Runbook: Alertmanager globally silenced
## Symptoms
- No alerts in `#ops-alerts` for several hours despite normal traffic.
- Prometheus shows series for `ALERTS{alertstate="firing"}`, but Alertmanager is not routing notifications for them.
## Check
**check.sh**

```sh
# Active silences (amtool is inside the container)
docker exec via_prod-alertmanager amtool silence --alertmanager.url=http://localhost:9093 query

# Anything firing but suppressed?
curl -s 'http://localhost:9093/api/v2/alerts?silenced=true' | jq '.[].labels'

# Sanity: is Prometheus seeing rules at all?
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.state=="firing")'
```
## Remediation
- List every silence and expire any whose window has passed but status is still `active`:

**expire.sh**

```sh
# Expire a specific silence by ID
docker exec via_prod-alertmanager amtool silence --alertmanager.url=http://localhost:9093 expire <silence-id>

# Or nuke them all (use sparingly)
for id in $(docker exec via_prod-alertmanager amtool silence --alertmanager.url=http://localhost:9093 query -q); do
  docker exec via_prod-alertmanager amtool silence --alertmanager.url=http://localhost:9093 expire "$id"
done
```
- If notifications still do not arrive, re-verify each hop of the delivery chain: Slack webhook, PagerDuty integration key, SMTP relay.
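The Slack hop can be exercised directly, bypassing Alertmanager entirely. A minimal sketch, assuming the webhook URL is exported as `SLACK_WEBHOOK_URL` (a hypothetical variable name; copy the actual URL from the Slack receiver in your Alertmanager config):

```sh
# Hypothetical: export SLACK_WEBHOOK_URL from your Alertmanager receiver config first
SLACK_WEBHOOK_URL="${SLACK_WEBHOOK_URL:-}"

# Minimal message the Slack incoming-webhook API accepts
PAYLOAD='{"text":"alertmanager runbook delivery check"}'

if [ -n "$SLACK_WEBHOOK_URL" ]; then
  # Slack replies with a plain "ok" body on success
  curl -s -X POST -H 'Content-Type: application/json' -d "$PAYLOAD" "$SLACK_WEBHOOK_URL"
else
  echo "SLACK_WEBHOOK_URL not set; skipping send"
fi
```

If this posts to the channel but Alertmanager notifications still do not, the problem is inside Alertmanager (routing tree or silences), not the webhook.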
- Fire a synthetic test alert:

**test.sh**

```sh
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"Heartbeat","severity":"warn"},"annotations":{"summary":"runbook ping"}}]'
```
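Once posted, the heartbeat should appear in the active-alerts list. A self-contained sketch of that check, run here against a canned sample of the `/api/v2/alerts` response so it works offline (the commented `curl` is what you would pipe into the same `jq` filter on the live instance):

```sh
# Canned sample of GET /api/v2/alerts output, trimmed to the fields we filter on
SAMPLE='[{"labels":{"alertname":"Heartbeat","severity":"warn"},"status":{"state":"active"}}]'

# Against the live instance:
#   curl -s http://localhost:9093/api/v2/alerts | jq -r '.[] | select(.labels.alertname=="Heartbeat") | .status.state'
STATE=$(echo "$SAMPLE" | jq -r '.[] | select(.labels.alertname=="Heartbeat") | .status.state')
echo "$STATE"
```

A state of `suppressed` rather than `active` means a silence or inhibition is still swallowing it.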
## Post-incident
- Agree on a silence policy: every silence must reference a ticket and expire within 24 h.
- Add a `DeadMansSwitch` rule so a missing heartbeat is itself an alert.
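The `DeadMansSwitch` bullet can be sketched as a Prometheus rule file: `vector(1)` always evaluates to 1, so the alert fires constantly and acts as a heartbeat. The file path and group name below are assumptions; adjust to your rules layout, and route this alert to an external dead-man's-switch receiver (not Slack) so a stopped heartbeat is noticed:

```sh
# Hypothetical rules file path - point this at wherever your Prometheus rules live
RULES_FILE="${RULES_FILE:-/tmp/deadmansswitch.yml}"

cat > "$RULES_FILE" <<'EOF'
groups:
  - name: meta
    rules:
      # Always-firing alert; route it to an external dead-man's-switch
      # receiver. If its ping ever stops, the alerting pipeline is broken.
      - alert: DeadMansSwitch
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: Alerting pipeline heartbeat (should always be firing)
EOF

echo "wrote $RULES_FILE"
```

Remember to reload Prometheus (`SIGHUP` or `POST /-/reload`) after adding the file.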