Incident response
OpenAI bill shock: bot abuse and rate-limit checklist
When spend spikes within hours, speed matters. This checklist helps you isolate abuse patterns and apply containment quickly.
Full guide: Bot attacks and LLM cost spikes: prevention playbook
Immediate containment (first 30 minutes)
- Rotate exposed API keys.
- Apply burst limits and retry backoff immediately.
- Separate real and test traffic to avoid false diagnosis.
- Identify top endpoint and tenant concentration.
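The first two containment steps above can be sketched in a few lines. This is a minimal illustration, not a production limiter: the class name, rates, and backoff parameters are all assumptions you should tune to your traffic.

```python
import random
import time

class TokenBucket:
    """Burst limiter: allows short bursts up to `burst`, then throttles
    to `rate_per_sec`. Names and defaults are illustrative."""
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter, so client retries spread out
    instead of hammering the provider in sync."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

A client that checks `allow()` before each model call and sleeps `backoff_delay(attempt)` between retries cannot turn one provider timeout into an unbounded spend loop.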
Fast evidence to collect (before you change anything)
- Lock the spike window in UTC and note the exact start time.
- Compare requests/hour and tokens/request to the trailing baseline.
- Check 429/5xx error rate and confirm whether the client is retrying.
- Break down spend by endpointTag and tenant/user to find concentration.
- Confirm whether a promptVersion or routing change shipped inside the same window.
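The baseline comparison in the list above can be computed directly from hourly aggregates. A minimal sketch, assuming each window is a dict with `requests` and `tokens` totals (the field names are hypothetical):

```python
from statistics import mean

def spike_ratios(window: dict, baseline_windows: list[dict]) -> dict:
    """Compare the locked spike window to the trailing baseline.
    Ratios near 1.0 mean 'normal'; large ratios point at the driver."""
    base_req = mean(w["requests"] for w in baseline_windows)
    base_tpr = mean(w["tokens"] / w["requests"] for w in baseline_windows)
    return {
        "requests_vs_baseline": window["requests"] / base_req,
        "tokens_per_request_vs_baseline": (window["tokens"] / window["requests"]) / base_tpr,
    }
```

A requests ratio that jumps while tokens/request stays flat points at volume (abuse, retries, replay); the reverse points at a verbosity regression.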
Root cause patterns to validate
- Abuse traffic from a small actor set
- Client-side retry loop after provider timeout
- Prompt change that inflated output tokens
- Unexpected job or cron replay
Abuse vs retry storm vs verbosity regression (quick distinction)
- Abuse: unknown-user ratio spikes + repeated prompts + concentration on one endpoint.
- Retry storm: error rate spikes + the same request repeats (idempotency key or externalRequestId).
- Verbosity regression: request volume stays flat, but outputTokens/request jumps after a promptVersion deploy.
- Batch replay: volume spikes from internal job identities in a narrow time window.
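The four distinctions above reduce to a small triage heuristic. The thresholds below are illustrative assumptions, not recommendations; the signal names mirror the ratios described in this checklist:

```python
def classify_spike(sig: dict) -> str:
    """Heuristic triage of a spend spike. `sig` holds ratios vs the
    trailing baseline plus share/rate signals; thresholds are examples."""
    if sig["error_rate"] > 0.05 and sig["repeat_request_ratio"] > 0.5:
        return "retry-storm"        # errors + the same request repeating
    if sig["unknown_user_ratio"] > 0.3 and sig["top_endpoint_share"] > 0.7:
        return "abuse"              # unauthenticated burst on one endpoint
    if sig["request_volume_ratio"] < 1.2 and sig["output_tokens_per_request_ratio"] > 1.5:
        return "verbosity-regression"  # flat volume, inflated outputs
    if sig["internal_job_share"] > 0.8:
        return "batch-replay"       # internal job identities dominate
    return "unclassified"
```

Order matters: check retry storms before abuse, because a retry storm can also show endpoint concentration.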
Rate limits are not spend limits (common misconception)
Rate limits protect availability. They do not guarantee a predictable bill.
To prevent bill shock you need cost controls too: max-token caps, endpoint-level quotas, and budget workflows with an owner.
- Cap output tokens on public-facing endpoints and long-form flows.
- Rate-limit by endpointTag (not just globally) to protect expensive routes.
- Use degraded-mode behavior: shorter answers, fewer tool calls, smaller context.
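One way to encode these cost controls is a per-endpoint policy table with a degraded tier. The endpoint names, limits, and field names below are hypothetical examples:

```python
# Per-endpoint cost controls. Rate limits protect availability;
# max-token caps bound the spend of each individual request.
ENDPOINT_POLICIES = {
    "public-chat":   {"max_output_tokens": 512,  "rps_limit": 5,
                      "degraded_max_output_tokens": 128},
    "internal-eval": {"max_output_tokens": 4096, "rps_limit": 50,
                      "degraded_max_output_tokens": 1024},
}

def effective_max_tokens(endpoint_tag: str, degraded: bool) -> int:
    """Pick the cap for this endpointTag; degraded mode shrinks answers
    instead of blocking users outright."""
    policy = ENDPOINT_POLICIES[endpoint_tag]
    return policy["degraded_max_output_tokens"] if degraded else policy["max_output_tokens"]
```

Passing the returned value as the model call's output-token limit makes the degraded mode a one-flag switch during an incident.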
Recovery controls
- Set alert thresholds that fire before spend exceeds the trailing historical average.
- Introduce model tiering for non-critical paths.
- Require explicit externalRequestId for idempotent ingest.
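Idempotent ingest keyed on externalRequestId is what stops a retry loop from multiplying spend. A minimal in-memory sketch (production would use a shared store such as Redis; the function name is hypothetical):

```python
seen_request_ids: set[str] = set()  # shared store in production, e.g. Redis

def ingest(external_request_id: str, payload: dict) -> bool:
    """Accept each externalRequestId exactly once. Duplicate retries of
    the same user action return False instead of triggering a model call."""
    if external_request_id in seen_request_ids:
        return False
    seen_request_ids.add(external_request_id)
    # ... forward payload to the model call here ...
    return True
```

The same key also powers the retry-storm diagnosis above: duplicate IDs in logs mean repeats, not new demand.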
Post-incident review
Record detection time, containment time, and cost impact window.
Convert the incident into one permanent guardrail policy.
What to alert on
- tokens/hour or requests/hour jump vs trailing baseline
- unknown-user ratio spike (new/unauthenticated traffic concentration)
- 429/5xx rate increase + retry multiplier (same request repeating)
- outputTokens/request drift after a promptVersion change
- endpointTag or tenant concentration (one driver dominates spend)
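The alert list above can be wired up as per-signal multipliers over the trailing baseline. The multipliers here are illustrative assumptions to tune, not recommendations:

```python
ALERT_RULES = {
    "tokens_per_hour": 2.0,            # spend-volume jump
    "requests_per_hour": 2.0,          # traffic jump
    "unknown_user_ratio": 1.5,         # unauthenticated concentration
    "error_rate": 3.0,                 # 429/5xx + retry pressure
    "output_tokens_per_request": 1.3,  # verbosity drift after promptVersion
}

def evaluate_alerts(current: dict, baseline: dict) -> list[str]:
    """Return the signals whose current value exceeds the trailing
    baseline by that signal's multiplier."""
    fired = []
    for signal, multiplier in ALERT_RULES.items():
        base = baseline.get(signal, 0.0)
        if base > 0 and current.get(signal, 0.0) > multiplier * base:
            fired.append(signal)
    return fired
```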
Execution checklist
- Rotate compromised keys and invalidate leaked credentials.
- Add per-endpoint rate limits and max output tokens for public endpoints.
- Assign one incident owner and publish update cadence (ETA + containment steps).
- Separate demo/test traffic (dataMode) so baselines and alerts are trustworthy.
- Write one post-incident guardrail change (policy, cap, routing rule, or release gate).
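The tagging items in this checklist (endpointTag, promptVersion, externalRequestId, dataMode) can be enforced at the request boundary so untagged traffic never pollutes baselines. A sketch, with the function name and default value as assumptions:

```python
def tag_request(meta: dict) -> dict:
    """Reject requests missing the attribution fields this checklist
    relies on; default dataMode to 'live' so test traffic must opt in."""
    required = ("endpointTag", "promptVersion", "externalRequestId")
    missing = [k for k in required if not meta.get(k)]
    if missing:
        raise ValueError(f"untagged request, missing: {missing}")
    meta.setdefault("dataMode", "live")  # demo/test traffic must set 'test'
    return meta
```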
FAQ
Do OpenAI rate limits prevent bill shock?
No. Rate limits control throughput, not total spend. You can stay within rate limits and still burn budget via long prompts/outputs, retries, or routing drift to expensive models.
How do I tell a retry storm from abuse traffic quickly?
Retry storms usually correlate with elevated 429/5xx rates and repeated requests for the same user action (often sharing an idempotency key or externalRequestId). Abuse patterns skew toward unknown-user concentration, repeated prompts, and bursts focused on one public endpoint.
Should we block requests when budget is exceeded?
Blocking is a last resort for user experience. Prefer degraded modes first: cap output tokens, reduce context, tier models on low-risk paths, and throttle non-critical endpoints. Use hard blocks for abuse-prone routes when needed.
What is the most effective post-incident guardrail?
Make the driver attributable and actionable: enforce endpointTag + promptVersion + externalRequestId tagging, set alert thresholds with a named owner, and add one permanent cap or routing policy for the endpoint that caused the incident.
Evaluation resources
For security and procurement reviews, use our trust summary before final tool selection.