Fast checklist
15-minute LLM cost spike checklist for on-call teams
Use this page during incidents. For deeper diagnosis patterns and attribution logic, use the root-cause analysis guide.
Full guide: Bot attacks and LLM cost spikes: prevention playbook
Minute 0-5: classify the spike
- Is the change volume-driven, token-driven, or both?
- Did any deploy happen in the same window?
- Is the spike isolated to one endpoint or tenant?
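The first question above can be answered with simple arithmetic: compare request count and tokens per request against a baseline window. The following is a minimal sketch, not a prescribed method. `WindowStats`, `classify_spike`, and the 1.5x threshold are hypothetical names and values chosen for illustration, and both windows are assumed to contain at least one request.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    requests: int      # number of LLM requests in the window (assumed > 0)
    total_tokens: int  # prompt + completion tokens in the window


def classify_spike(baseline: WindowStats, current: WindowStats,
                   threshold: float = 1.5) -> str:
    """Label the spike by which factor grew at least `threshold`x over baseline."""
    volume_ratio = current.requests / baseline.requests
    baseline_tokens_per_request = baseline.total_tokens / baseline.requests
    current_tokens_per_request = current.total_tokens / current.requests
    token_ratio = current_tokens_per_request / baseline_tokens_per_request

    volume_grew = volume_ratio >= threshold
    tokens_grew = token_ratio >= threshold
    if volume_grew and tokens_grew:
        return "both"
    if volume_grew:
        return "volume-driven"
    if tokens_grew:
        return "token-driven"
    return "no clear driver"
```

The threshold is a judgment call. Pick a value that sits above the normal variation between two quiet windows of your own traffic.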
Minute 5-10: identify the dominant driver
- Open Top Endpoints and rank by spend.
- Open Top Users and rank by spend concentration (share of total spend per user or tenant).
- Compare promptVersion cost/request before and after the spike began.
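If the dashboard views are unavailable, the same three rankings can be approximated from raw request logs. This is a rough sketch under assumptions: each log record is a dict, `endpointTag` and `promptVersion` are the fields named on this page, and `userId` and `costUsd` are hypothetical field names that may differ in your export.

```python
from collections import defaultdict


def rank_by_spend(records, key):
    """Return [(value, spend, share_of_total)] sorted by spend, highest first."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record["costUsd"]  # costUsd: hypothetical field
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(value, spend, spend / grand_total) for value, spend in ranked]


def cost_per_request(records, key="promptVersion"):
    """Return {value: average cost per request} for each distinct value of `key`."""
    spend = defaultdict(float)
    count = defaultdict(int)
    for record in records:
        spend[record[key]] += record["costUsd"]
        count[record[key]] += 1
    return {value: spend[value] / count[value] for value in spend}
```

Run `rank_by_spend(records, "endpointTag")` first, then `rank_by_spend(records, "userId")`. Then run `cost_per_request` once on the baseline window and once on the spike window and compare the results per promptVersion.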
Minute 10-15: apply temporary controls
- Cap retries and throttle suspicious traffic.
- Route non-critical paths to a lower-cost model tier.
- Set temporary token limits where acceptable.
- Notify the owner, naming the exact endpoint, tenant, or promptVersion driving the spike.
Containment options by spike type
- Volume spike: rate-limit the dominant endpointTag and throttle unknown identities.
- Token spike: cap output tokens and reduce context (summarize history, shrink retrieval top-k).
- Deploy spike: roll back the last promptVersion or gate traffic to a canary cohort.
- Abuse spike: rotate keys, block sources, and isolate public endpoints.
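For the volume-spike case, a per-endpoint rate limit can be sketched as a token bucket keyed by endpointTag. This is one common technique, not necessarily what your gateway implements. `TokenBucket`, `EndpointRateLimiter`, and the limit values are hypothetical, and the clock is passed in explicitly so the behavior is easy to test.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    rate_per_sec: float  # tokens refilled per second
    capacity: float      # maximum burst size
    tokens: float = field(init=False)
    last: float = 0.0    # timestamp of the previous call, in seconds

    def __post_init__(self):
        self.tokens = self.capacity  # start with a full bucket

    def allow(self, now: float) -> bool:
        """Refill for the elapsed time, then spend one token if available."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class EndpointRateLimiter:
    """Applies a bucket only to endpoints that have an explicit limit."""

    def __init__(self, limits):
        # limits: {endpointTag: (rate_per_sec, capacity)}
        self.buckets = {tag: TokenBucket(rate, cap) for tag, (rate, cap) in limits.items()}

    def allow(self, endpoint_tag: str, now: float) -> bool:
        bucket = self.buckets.get(endpoint_tag)
        return True if bucket is None else bucket.allow(now)
```

Limiting only the dominant endpointTag keeps the control narrow: endpoints without an entry in `limits` pass through untouched, which reduces the risk of the temporary control causing its own incident.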
Incident note template (write this while it is fresh)
- Time window + baseline comparison window.
- Dominant endpointTag driver + cost/request delta.
- Dominant tenant/user driver + concentration percentage.
- promptVersion correlation (what changed and when).
- Action taken (cap, throttle, route, rollback) + owner.
- Follow-up: one permanent guardrail to add.
After-action
Convert this checklist run into a permanent guardrail policy so the next spike is detected earlier.
What to alert on
- Request burst with low identity diversity.
- Token-per-request surge without feature traffic growth.
- Retry ratio increase without an upstream outage explanation.
- New high-cost endpointTag suddenly dominating spend.
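Two of these signals reduce to simple ratios over a window of request records. The sketch below is illustrative only: `userId` and `isRetry` are hypothetical field names, and the burst factor and diversity floor are placeholder thresholds, not recommended values.

```python
def identity_diversity(records):
    """Distinct identities divided by request count (1.0 = all requests from different identities)."""
    if not records:
        return 1.0
    return len({r["userId"] for r in records}) / len(records)


def retry_ratio(records):
    """Fraction of requests in the window that are flagged as retries."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.get("isRetry")) / len(records)


def burst_with_low_diversity(baseline_count, current,
                             burst_factor=3.0, min_diversity=0.1):
    """True when volume jumped by `burst_factor` and few identities account for it."""
    is_burst = len(current) >= burst_factor * baseline_count
    return is_burst and identity_diversity(current) < min_diversity
```

A burst spread across many identities looks like organic growth. The same burst from a handful of identities is the pattern worth paging on.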
Execution checklist
- Confirm the abuse signal: burst, key leak, prompt injection, or scraping.
- Rotate compromised keys and block abusive sources immediately.
- Apply per-endpoint rate limits and output caps to contain spend.
- Document dominant endpointTag, tenant/user concentration, and time window.
- Convert the incident into one permanent guardrail update.
FAQ
What is the fastest way to find the cost spike driver?
Start with Top Endpoints (feature concentration), then Top Users (tenant concentration), then Prompt Versions (deploy correlation). This order finds the dominant driver quickly without guessing.
Should we optimize prompts immediately during an incident?
No. Contain first (caps, throttles, routing, rollback). Optimization comes after spend stabilizes so you do not chase moving targets while the bill keeps growing.
How do we avoid false alarms from demo/test traffic?
Separate demo/test traffic using dataMode and environment. Alerts and burn-rate checks should be scoped to real production traffic so thresholds remain trustworthy.
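As a rough illustration, scoping can be a filter applied before any alert metric is computed. The field names `dataMode` and `environment` come from this page. The values `"live"` and `"production"` are assumptions, so substitute whatever your own records use.

```python
def production_only(records, live_mode="live", prod_env="production"):
    """Keep only records tagged as real production traffic.

    Records missing either field are dropped, so untagged traffic
    cannot inflate alert metrics by accident.
    """
    return [
        r for r in records
        if r.get("dataMode") == live_mode and r.get("environment") == prod_env
    ]
```

Dropping untagged records is a deliberate choice here. If some production traffic is not yet tagged, fix the tagging first or the alerts will undercount.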