# Runbook: DB connection pool exhausted
## Symptoms
- `QueuePool limit of size X overflow Y reached` or `too many clients already` in logs.
- One or more services’ `/health` endpoints return 503.
## Check

```shell
# check.sh

# How many connections does Postgres see?
docker exec via_prod-identity-db psql -U "$IDENTITY_POSTGRES_USER" -d "$IDENTITY_POSTGRES_DB" -c \
  "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"

# Long-running queries (> 30s)
docker exec via_prod-identity-db psql -U "$IDENTITY_POSTGRES_USER" -d "$IDENTITY_POSTGRES_DB" -c \
  "SELECT pid, age(clock_timestamp(), query_start) AS age, state, query
   FROM pg_stat_activity
   WHERE state != 'idle' AND query_start < NOW() - INTERVAL '30s'
   ORDER BY age DESC LIMIT 10;"

# Gunicorn worker count vs pool size — pool must be >= workers × request concurrency
docker exec via_prod-orders-service env | grep -E 'GUNICORN_WORKERS|POOL_SIZE|MAX_OVERFLOW'
```
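The last command's rule of thumb can be sanity-checked with quick arithmetic. The numbers below are placeholders; substitute the values the env dump prints:

```shell
# Hypothetical values — replace with the env values printed above.
GUNICORN_WORKERS=4
DB_POOL_SIZE=5
DB_MAX_OVERFLOW=10
REPLICAS=3   # containers of this service sharing the database

# Each worker process has its own engine and can open up to
# pool_size + max_overflow connections, so the worst case is:
PER_CONTAINER=$(( GUNICORN_WORKERS * (DB_POOL_SIZE + DB_MAX_OVERFLOW) ))
WORST_CASE=$(( PER_CONTAINER * REPLICAS ))
echo "worst-case connections: $WORST_CASE"   # keep below Postgres max_connections
```

If the worst case exceeds `max_connections` in Postgres (100 by default), the pool limits are only moving the failure from the application to the database.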
## Remediation
- Cancel hung queries:

  ```sql
  -- cancel.sql
  SELECT pg_cancel_backend(pid)
  FROM pg_stat_activity
  WHERE state != 'idle' AND query_start < NOW() - INTERVAL '5m';
  ```
- Restart the affected service to force it to release leaked connections:
  `docker compose -f docker-compose.prod.yml restart <svc>`
- If it recurs, raise `DB_POOL_SIZE` or lower `GUNICORN_WORKERS` and redeploy.
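If the fix is a pool-size change, it can be applied as a compose edit. This is a sketch only: the service name, env var names, and values are assumptions, so match them to the actual compose file before redeploying.

```yaml
# docker-compose.prod.yml (excerpt, hypothetical values)
services:
  orders-service:
    environment:
      DB_POOL_SIZE: "10"      # was 5
      DB_MAX_OVERFLOW: "10"
      GUNICORN_WORKERS: "2"   # fewer workers means fewer connections per container
```

Re-run the arithmetic in the Check section after any change to confirm the worst case still fits under `max_connections`.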
## Post-incident
- Grep the code for unclosed `AsyncSession`/`get_db` paths that may have leaked.
- Chart `pg_stat_activity` count in Grafana so the next incident shows up earlier.
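For the leak hunt in the first bullet, one starting point is a grep for sessions constructed outside a context manager. This is a rough sketch: the `src/` path and the patterns are assumptions about the codebase, and it will miss leaks routed through helpers.

```shell
# Flag AsyncSession constructions not wrapped in `async with`
# (a common source of leaked connections). src/ is a hypothetical path.
leaks=$(grep -rnE 'AsyncSession\(' --include='*.py' src/ 2>/dev/null \
        | grep -v 'async with' || true)
if [ -n "$leaks" ]; then
  echo "$leaks"
else
  echo "no bare AsyncSession() call sites found"
fi
```

Anything this prints deserves a manual look; a clean run proves nothing about leaks hidden behind dependency-injection helpers like `get_db`.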