Pillar
Bot attacks and LLM cost spikes: prevention playbook
Security and cost operations overlap during bot abuse incidents. This pillar centralizes spike prevention controls.
What this guide answers
- What category of cost or governance problem this topic solves.
- Which request-level signals matter most when diagnosing it.
- Which follow-up guide or control workflow to apply next.
What to send (payload example)
{
  "externalRequestId": "req_01HZXB6MQZ2WQ9D2KCF9M4V2QY",
  "provider": "provider_id",
  "model": "model_id",
  "endpointTag": "public.chat",
  "promptVersion": "public_v1",
  "userId": "anon_ip_hash",
  "inputTokens": 260,
  "outputTokens": 190,
  "latencyMs": 892,
  "status": "success",
  "dataMode": "real",
  "environment": "prod"
}
Common mistakes
- Shipping provider keys to the client or logging them in plaintext.
- No per-endpoint rate limits for high-cost workflows.
- Treating retry storms as "just reliability" while costs multiply.
- Delaying containment while searching for a perfect root cause.
How to verify in the Opsmeter.io dashboard
- Use Overview to confirm spike window and budget posture.
- Use Top Endpoints to find feature-level concentration.
- Use Top Users to find tenant-level concentration.
- Use Prompt Versions to validate deploy-linked cost drift.
Prevention and response stack
- Rate-limit patterns by endpoint criticality
- Retry backoff and duplicate suppression
- Key leak response with rotation timeline
- Concentration-based alerts and owner actions
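As a rough illustration of the "retry backoff and duplicate suppression" item, the sketch below pairs full-jitter exponential backoff with an in-memory cache keyed by externalRequestId. The function and class names, the default delays, and the unbounded in-memory store are all assumptions made for the example; a production system would more likely use a shared store with expiry.

```python
import random


def backoff_delays(max_retries, base=0.5, cap=30.0, rng=random.random):
    """Full-jitter exponential backoff (hypothetical helper).

    Delay for attempt i is drawn uniformly from [0, min(cap, base * 2**i)].
    """
    return [rng() * min(cap, base * (2 ** attempt)) for attempt in range(max_retries)]


class DuplicateSuppressor:
    """Returns the stored result when the same externalRequestId is seen again.

    Assumption: an in-process dict is enough for the sketch. It never evicts
    entries, so a real deployment would need a TTL or a shared cache.
    """

    def __init__(self):
        self._results = {}

    def run(self, external_request_id, call):
        if external_request_id in self._results:
            # A retried or duplicated request does not trigger a second provider call.
            return self._results[external_request_id]
        result = call()
        self._results[external_request_id] = result
        return result
```

The jitter spreads retries out so that many clients failing at once do not retry in lockstep, and the suppressor keeps a client-side retry from being billed twice.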
Use this workflow
Turn diagnosis into action
Identify the cost driver, validate it with attribution, then apply one durable control before the next billing cycle.
Detection signals to monitor continuously
- Sudden request burst with low identity diversity
- Retry ratio increase without corresponding provider outage
- Tenant or endpoint concentration jump in short windows
- Fast rise in token-per-request with unchanged feature traffic
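The signals above can be approximated from the request payload fields shown earlier. The following is a minimal sketch, assuming records are plain dicts for one time window; the `isRetry` flag is a hypothetical field that is not part of the example payload, and "identity diversity" is defined here simply as unique userId values divided by request count.

```python
from collections import Counter


def window_signals(records):
    """Summarize one time window of request records (list of dicts)."""
    total = len(records)
    if total == 0:
        return None
    users = {r["userId"] for r in records}
    endpoint_counts = Counter(r["endpointTag"] for r in records)
    # isRetry is an assumed field; records without it count as first attempts.
    retries = sum(1 for r in records if r.get("isRetry", False))
    tokens = sum(r["inputTokens"] + r["outputTokens"] for r in records)
    return {
        "requests": total,
        "identity_diversity": len(users) / total,
        "retry_ratio": retries / total,
        "top_endpoint_share": endpoint_counts.most_common(1)[0][1] / total,
        "tokens_per_request": tokens / total,
    }
```

Comparing these values against the same window on a normal day is what turns them into alerts; the absolute numbers on their own say little.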
Incident ownership model
- Security owner handles key rotation and abuse source blocking.
- Platform owner applies retry and rate-limit containment.
- Product owner evaluates model and token guardrails by feature.
- Finance owner logs cost impact and post-incident actions.
Containment first: stop the financial bleeding
When abuse hits, speed matters more than perfect diagnosis. Containment reduces the blast radius so you can investigate safely.
Treat cost spikes as incidents: identify the driver, contain, then harden.
- Throttle public endpoints and non-critical features first.
- Cap output tokens to prevent long abusive completions.
- Block obvious automation patterns (IP ranges, user agents, failed auth bursts).
- Rotate compromised keys and revoke leaked credentials immediately.
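One way to apply the output-token cap during containment is to clamp the requested limit per endpointTag before the call reaches the provider. This is a sketch only: the cap values, the `internal.summarize` tag, and the function name are illustrative assumptions, not recommended settings.

```python
# Illustrative containment caps; real values depend on each endpoint's cost profile.
CONTAINMENT_CAPS = {"public.chat": 256}
DEFAULT_CAP = 1024


def capped_max_tokens(endpoint_tag, requested_max_tokens,
                      caps=CONTAINMENT_CAPS, default_cap=DEFAULT_CAP):
    """Return the output-token limit to send to the provider during an incident."""
    cap = caps.get(endpoint_tag, default_cap)
    return min(requested_max_tokens, cap)


# Usage: a public endpoint asking for 4000 output tokens is clamped to its cap,
# while a request already under the default cap passes through unchanged.
public_limit = capped_max_tokens("public.chat", 4000)
internal_limit = capped_max_tokens("internal.summarize", 500)
```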
Rate-limit patterns that work for LLM endpoints
Global rate limits are rarely enough. LLM costs vary by endpointTag, so enforcement must also vary by endpoint criticality and cost profile.
- Per-endpointTag limits (high-cost endpoints get tighter limits).
- Per-tenant limits (one customer should not drain shared margin).
- Burst limits + sustained limits (stop spikes and slow drains).
- Identity-aware limits (unknown-user traffic is higher risk).
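The burst-plus-sustained and per-endpointTag, per-tenant patterns can be combined in a token bucket: the bucket capacity bounds the burst and the refill rate bounds sustained traffic. The sketch below is one possible shape, assuming a single process and in-memory state; class names, the `(rate, burst)` tuple layout, and the injectable clock are choices made for the example.

```python
import time


class TokenBucket:
    """Bucket capacity limits bursts; refill rate limits sustained traffic."""

    def __init__(self, rate_per_sec, burst, now=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.now = now
        self.last = now()

    def allow(self, cost=1.0):
        current = self.now()
        elapsed = current - self.last
        self.last = current
        # Refill for the elapsed time, never beyond the burst capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class EndpointLimiter:
    """One bucket per (endpointTag, tenant) pair, with limits chosen per endpointTag."""

    def __init__(self, limits, default, now=time.monotonic):
        self.limits = limits    # endpointTag -> (rate_per_sec, burst)
        self.default = default  # (rate_per_sec, burst) for unlisted endpoints
        self.now = now
        self.buckets = {}

    def allow(self, endpoint_tag, tenant_id):
        key = (endpoint_tag, tenant_id)
        if key not in self.buckets:
            rate, burst = self.limits.get(endpoint_tag, self.default)
            self.buckets[key] = TokenBucket(rate, burst, self.now)
        return self.buckets[key].allow()
```

Identity-aware limits fit the same structure: unknown-user traffic can be mapped to a shared tenant key with a tighter `(rate, burst)` pair.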
Key leak response checklist (first hour)
- Rotate keys and invalidate all leaked credentials.
- Audit recent request logs for new endpoints and new traffic sources.
- Identify the top endpointTag and tenant/user concentration during the spike.
- Add temporary strict caps (tokens, requests) until stable.
- Create a permanent control: secret scanning, least privilege, and rotation policy.
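The audit and concentration steps in this checklist amount to two small queries over request logs: which values appear only in the spike window, and which single value dominates it. A minimal sketch, assuming logs are already loaded as lists of dicts with the payload fields shown earlier (the helper names are hypothetical):

```python
from collections import Counter


def new_values(baseline_records, spike_records, field):
    """Values of `field` seen during the spike but not in the baseline window."""
    known = {r[field] for r in baseline_records}
    return sorted({r[field] for r in spike_records} - known)


def top_concentration(spike_records, field):
    """Most frequent value of `field` in the spike and its share of requests."""
    if not spike_records:
        return None, 0.0
    counts = Counter(r[field] for r in spike_records)
    value, count = counts.most_common(1)[0]
    return value, count / len(spike_records)
```

Running `new_values` on endpointTag and userId gives the "new endpoints and new traffic sources" list, and `top_concentration` points to where temporary caps should go first.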
Signals that distinguish abuse from regressions
- Abuse: burst traffic with low identity diversity and high error variance.
- Regressions: cost/request drift after deploy with stable traffic volume.
- Retry storms: higher retry ratio and longer tail latency, often with upstream errors.
- Pricing drift: unknown-model ratio rises or cost snapshots look inconsistent.
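These distinctions can be encoded as a first-pass triage heuristic. The sketch below is illustrative only: every threshold is an assumption to be tuned against real traffic, and the input fields (`identity_diversity`, `retry_ratio`, `cost_per_request`, `unknown_model_ratio`, `deployed_recently`) are hypothetical aggregates rather than fields from the example payload. It returns every label that matches, since real incidents often combine causes.

```python
def classify_spike(baseline, current):
    """Return the list of spike patterns whose (assumed) thresholds are met."""
    traffic_ratio = current["requests"] / max(baseline["requests"], 1)
    cost_ratio = current["cost_per_request"] / max(baseline["cost_per_request"], 1e-9)
    labels = []
    # Abuse: traffic burst coming from very few identities.
    if traffic_ratio > 3 and current["identity_diversity"] < 0.1:
        labels.append("abuse")
    # Regression: stable traffic, higher cost per request, recent deploy.
    if traffic_ratio < 1.5 and cost_ratio > 1.3 and current["deployed_recently"]:
        labels.append("regression")
    # Retry storm: retry ratio well above baseline and above a floor.
    if current["retry_ratio"] > 0.1 and current["retry_ratio"] > 2 * baseline["retry_ratio"]:
        labels.append("retry storm")
    # Pricing drift: a noticeable share of requests on models with unknown pricing.
    if current["unknown_model_ratio"] > 0.05:
        labels.append("pricing drift")
    return labels
```

An empty result means none of the patterns matched and the spike needs manual review; it does not mean the spike is benign.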
Post-incident hardening (make the next incident cheaper)
- Add alerting on unknown-user concentration and token-per-request spikes.
- Require per-endpoint output caps and max tool calls for agent workflows.
- Add per-tenant budgets for high-variance accounts.
- Document owner actions and update the response runbook.
- Run a weekly review of top endpoints/users to catch slow-drain abuse.
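A per-tenant budget from the list above can be as simple as a running total with an alert threshold and a hard limit. The following is a sketch under stated assumptions: the class name, the 80% alert fraction, and the three status strings are invented for illustration, and spend is tracked in memory rather than in a durable store.

```python
class TenantBudget:
    """Tracks spend for one tenant against a period limit (illustrative only)."""

    def __init__(self, limit_usd, alert_fraction=0.8):
        self.limit_usd = limit_usd
        self.alert_fraction = alert_fraction
        self.spent_usd = 0.0

    def record(self, cost_usd):
        """Add a request's cost and return 'ok', 'alert', or 'blocked'."""
        self.spent_usd += cost_usd
        if self.spent_usd >= self.limit_usd:
            return "blocked"
        if self.spent_usd >= self.alert_fraction * self.limit_usd:
            return "alert"
        return "ok"
```

The "alert" state is where an owner action belongs (notify, tighten caps), so that the hard block is rarely the first signal anyone sees.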