Runbook
Rate limiter failing open
VIARateLimitFailOpenSustained · VIARateLimitRedisErrorsHigh
Symptoms
- Prometheus alert: 50+ `fail_open` decisions in 5 minutes.
- Gateway Redis error rate > 5/s.
- Unexpected traffic spike from one or more API keys.
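The two numeric symptoms can be confirmed by hand before digging further. The following is a minimal sketch, not the actual alert rules: the metric names are taken from the queries in `check.sh`, but the PromQL expressions are an assumed reading of the thresholds above, and `active_symptoms` is a hypothetical helper.

```shell
# Assumed PromQL for the two thresholds; the real alert rules may differ.
FAIL_OPEN_5M_QUERY='sum(increase(via_rate_limit_decisions_total{decision="fail_open"}[5m]))'
REDIS_ERROR_RATE_QUERY='sum(rate(via_rate_limit_redis_errors_total[5m]))'

# Hypothetical helper: print one line per symptom that is currently present.
# $1 = fail_open decisions in the last 5 minutes
# $2 = gateway Redis errors per second
active_symptoms() {
  awk -v fail_open="$1" -v err_rate="$2" 'BEGIN {
    if (fail_open + 0 >= 50) print "fail_open_sustained"
    if (err_rate + 0 > 5)    print "redis_errors_high"
  }'
}
```

Feed it the values returned by the two queries, for example `active_symptoms 72 1.5`. Empty output means neither threshold is currently breached.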
Check
check.sh
# 1. Is Redis reachable from the gateway container?
docker exec via_prod-api-gateway redis-cli -h redis -a "$REDIS_PASSWORD" ping
# Expect: PONG
# 2. Current error-type breakdown (last 5 min) via Prometheus:
curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=sum by (error_type) (rate(via_rate_limit_redis_errors_total[5m]))'
# 3. Top keys currently triggering fail_open:
curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=topk(5, sum by (api_key_id) (rate(via_rate_limit_decisions_total{decision="fail_open"}[5m])))'
Remediation
- If Redis is down: restart the container with `docker compose -f compose/docker-compose.prod.yml restart redis`.
- If abuse is suspected: flip fail-mode to closed by setting `API_KEY_RATE_LIMIT_FAIL_MODE=closed` via GitHub repo variables, then redeploy.
- Scale Redis (or upgrade the instance type) if CPU > 80%.
- Revoke the abusive API key from the developer dashboard and notify the customer.
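The branches above can be sketched as a small triage helper. This is an illustration under stated assumptions, not part of the existing tooling: `pick_remediation` and its arguments are hypothetical, and the order in which the conditions are checked (Redis down first, then abuse, then CPU) is an assumption, since the runbook does not rank them.

```shell
# Hypothetical helper: map check results to a remediation step.
# $1 = reply from the redis-cli ping check ("PONG" when healthy)
# $2 = "yes" if the top-keys query suggests abuse, otherwise "no"
# $3 = Redis CPU usage as an integer percentage
pick_remediation() {
  if [ "$1" != "PONG" ]; then
    echo "restart-redis"
  elif [ "$2" = "yes" ]; then
    echo "set-fail-mode-closed"
  elif [ "$3" -gt 80 ]; then
    echo "scale-redis"
  else
    echo "keep-investigating"
  fi
}
```

For example, `pick_remediation PONG yes 35` prints `set-fail-mode-closed`. Revoking the key stays a manual step in the developer dashboard.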
Post-incident
- Record the offending key + tenant in the post-incident report.
- Revert fail-mode to open after Redis has been stable for an hour.
- Review thresholds in `rate-limit.yml` if they misfired.
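The one-hour stability condition can be checked against Prometheus before reverting fail-mode. This is a minimal sketch, assuming "stable" means the gateway Redis error rate stayed at zero for the whole hour; the runbook does not define it. The query and the `can_revert_to_open` helper are hypothetical; the metric name and Prometheus address come from `check.sh`.

```shell
# Assumed PromQL: worst 5-minute Redis error rate observed over the last hour.
STABILITY_QUERY='max_over_time(sum(rate(via_rate_limit_redis_errors_total[5m]))[1h:1m])'

# Hypothetical helper: decide whether fail-mode may go back to open.
# $1 = value returned by STABILITY_QUERY (errors per second)
can_revert_to_open() {
  # Empty, "null", "NaN" or anything else that is not a plain number: stay closed.
  case "$1" in
    ''|*[!0-9.]*) echo "no"; return ;;
  esac
  if awk -v rate="$1" 'BEGIN { exit !(rate + 0 == 0) }'; then
    echo "yes"
  else
    echo "no"
  fi
}

# Usage (needs curl and jq):
#   rate=$(curl -sG 'http://localhost:9090/api/v1/query' \
#     --data-urlencode "query=$STABILITY_QUERY" | jq -r '.data.result[0].value[1]')
#   can_revert_to_open "$rate"
```

The helper is deliberately conservative: if the query returns no series at all, it answers `no` rather than treating missing data as stability.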