Runbook
High 5xx error rate
Symptoms
- 5xx rate above 5% for 5 minutes.
- Sentry shows a new error burst localised to one service.
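The 5% threshold above can be checked from a shell once the error rate is available as a fraction. The following is a minimal sketch, not part of check.sh: `rate_exceeds` is a hypothetical helper name, and the 0.07 value is made up for illustration. It uses awk because plain shell arithmetic cannot compare floating-point numbers.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of check.sh): succeeds when RATE > THRESHOLD.
# Both arguments are fractions, so 5% is written as 0.05 (the default).
rate_exceeds() {
  local rate="$1" threshold="${2:-0.05}"
  # awk does the floating-point comparison; exit status 0 means "exceeded".
  awk -v r="$rate" -v t="$threshold" 'BEGIN { exit !(r + 0 > t + 0) }'
}

# Example with a made-up rate of 0.07 (7%):
if rate_exceeds 0.07 0.05; then
  echo "5xx rate is above threshold"
fi
```

The Prometheus instant-query API returns the value inside a JSON document, so in practice the rate would first be extracted from the response, for example with `jq -r '.data.result[0].value[1]'` if jq is installed on the host.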
Check
check.sh
SVC=orders-service # adjust to the alerting service
# Last 100 log lines (stdout and stderr), filtered for errors
docker logs --tail=100 "via_prod-$SVC" 2>&1 | grep -E 'ERROR|CRITICAL|Traceback'
# Current error rate from the gateway (last 5 min)
curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=sum(rate(http_requests_total{status=~"5..",service="'"$SVC"'"}[5m])) / sum(rate(http_requests_total{service="'"$SVC"'"}[5m]))'
# Is the DB / Redis / RabbitMQ the real source?
docker exec "via_prod-$SVC" curl -sf http://localhost:808X/health || echo "unhealthy"
Remediation
- Identify the pattern. If it is a DB error → follow db-pool-exhausted. If a recent deploy is suspect:
  - Roll back to the last-known-good image:
rollback.sh
# Abort if SVC is unset: an empty service name would recreate every service
: "${SVC:?set SVC to the service being rolled back}"
cd ~/via-backend/backend/compose
docker tag "ghcr.io/via-logistics/$SVC:last-known-good" "ghcr.io/via-logistics/$SVC:latest"
docker compose -f docker-compose.prod.yml up -d --force-recreate "$SVC"
- If errors persist after rollback, the bug is likely upstream (Paymob, SES, DB). Page the owner of the affected dependency.
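To decide whether errors persist after a rollback, it helps to poll the health check a few times instead of relying on a single probe taken while the container is still starting. The sketch below assumes a hypothetical `retry` helper that is not part of rollback.sh. The attempt count and delay are illustrative values, and the port placeholder in the commented example is kept as it appears in check.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command up to ATTEMPTS times, sleeping DELAY
# seconds between tries. Returns 0 on the first success, 1 if every try fails.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (assumed values: 10 attempts, 3 seconds apart):
#   retry 10 3 docker exec "via_prod-$SVC" curl -sf http://localhost:808X/health \
#     || echo "still unhealthy after rollback"
```

If the health check keeps failing after every attempt, that supports treating the problem as upstream and paging the owner.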
Post-incident
- Attach the Sentry issue link + stack trace to the post-incident report.
- Add a regression test for the edge case that triggered the error.