Bot attacks and LLM cost spikes: prevention playbook
Security and cost operations overlap during bot abuse incidents. This pillar centralizes spike prevention controls.
Prevention and response stack
- Rate-limit patterns by endpoint criticality
- Retry backoff and duplicate suppression
- Key leak response with rotation timeline
- Concentration-based alerts and owner actions
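The retry backoff and duplicate suppression items above can be sketched as follows. This is a minimal illustration, not a production implementation: the delay values are placeholder assumptions, and the dedupe key reuses the `externalRequestId` field from the payload example later in this guide.

```python
import random

def backoff_delays(attempts, base=0.5, cap=30.0, rng=random.random):
    """Capped exponential backoff with full jitter.

    Jitter spreads retries out so a provider blip does not turn into a
    synchronized retry storm; the cap bounds the worst-case wait.
    """
    return [rng() * min(cap, base * (2 ** i)) for i in range(attempts)]

# Seen-request ids for duplicate suppression (in production this would be
# a shared store with a TTL, not an in-process set).
seen = set()

def suppress_duplicate(external_request_id):
    """Return True when a retry repeats a request we already accepted."""
    if external_request_id in seen:
        return True
    seen.add(external_request_id)
    return False
```

Passing `rng=lambda: 1.0` shows the undithered delay ceiling (0.5s, 1s, 2s, ... up to the cap); in normal use each retry waits a random fraction of that ceiling.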
Detection signals to monitor continuously
- Sudden request burst with low identity diversity
- Retry ratio increase without corresponding provider outage
- Tenant or endpoint concentration jump in short windows
- Fast rise in token-per-request with unchanged feature traffic
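The signals above can be computed from a window of request events. A minimal sketch, assuming each event carries the `userId`, `endpointTag`, `inputTokens`, and `outputTokens` fields from the payload example later in this guide:

```python
from collections import Counter

def spike_signals(events):
    """Compute simple detection signals over a window of request events."""
    total = len(events)
    if total == 0:
        return {}
    users = Counter(e["userId"] for e in events)
    endpoints = Counter(e["endpointTag"] for e in events)
    tokens = sum(e["inputTokens"] + e["outputTokens"] for e in events)
    return {
        # Low value => many requests from few identities (burst risk).
        "identity_diversity": len(users) / total,
        # Share of traffic held by the single largest user / endpoint.
        "top_user_share": users.most_common(1)[0][1] / total,
        "top_endpoint_share": endpoints.most_common(1)[0][1] / total,
        # Fast rises here with flat feature traffic suggest abuse.
        "tokens_per_request": tokens / total,
    }
```

Alert thresholds are deliberately omitted: baselines differ per endpoint, so compare each signal against its own recent history rather than a fixed constant.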
Incident ownership model
- Security owner handles key rotation and abuse source blocking.
- Platform owner applies retry and rate-limit containment.
- Product owner evaluates model and token guardrails by feature.
- Finance owner logs cost impact and post-incident actions.
Containment first: stop the financial bleeding
When abuse hits, speed matters more than perfect diagnosis. Containment reduces the blast radius so you can investigate safely.
Treat cost spikes as incidents: identify the driver, contain, then harden.
- Throttle public endpoints and non-critical features first.
- Cap output tokens to prevent long abusive completions.
- Block obvious automation patterns (IP ranges, user agents, failed auth bursts).
- Rotate compromised keys and revoke leaked credentials immediately.
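Output-token capping from the containment list above can be expressed as a clamp applied just before each provider call. A sketch under assumptions: `EMERGENCY_CAPS` is a hypothetical containment config keyed by `endpointTag`, and the numeric limits are illustrative, not recommendations.

```python
# Hypothetical emergency containment config, keyed by endpointTag.
EMERGENCY_CAPS = {
    "public.chat": {"max_output_tokens": 256, "requests_per_minute": 30},
}

def apply_containment(endpoint_tag, requested_max_tokens):
    """Clamp a request's max output tokens to the emergency cap, if any.

    Uncapped endpoints pass through unchanged, so containment can be
    rolled out per endpoint without touching the rest of the traffic.
    """
    cap = EMERGENCY_CAPS.get(endpoint_tag)
    if cap is None:
        return requested_max_tokens
    return min(requested_max_tokens, cap["max_output_tokens"])
```

Keeping the caps in a config that can be changed without a deploy is the point: during an incident you tighten one dict entry instead of shipping code.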
Rate-limit patterns that work for LLM endpoints
Global rate limits are rarely enough. LLM cost varies widely by endpointTag, so enforcement must also vary with endpoint criticality and cost profile.
- Per-endpointTag limits (high-cost endpoints get tighter limits).
- Per-tenant limits (one customer should not drain shared margin).
- Burst limits + sustained limits (stop spikes and slow drains).
- Identity-aware limits (unknown-user traffic is higher risk).
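Burst plus sustained limiting, keyed per endpoint and tenant, can be sketched with a token bucket: capacity bounds the burst, refill rate bounds the sustained throughput. The limit values and the `(endpointTag, tenantId)` key shape are assumptions for illustration.

```python
import time

class TokenBucket:
    """Burst + sustained limiter: `capacity` bounds the burst,
    `refill_per_sec` bounds the sustained request rate."""

    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.updated = time.monotonic()

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill proportionally to elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# One bucket per (endpointTag, tenantId) so a single tenant on a
# high-cost endpoint cannot drain the shared margin.
buckets = {}

def allowed(endpoint_tag, tenant_id, limits):
    key = (endpoint_tag, tenant_id)
    if key not in buckets:
        capacity, rate = limits[endpoint_tag]  # tighter for high-cost endpoints
        buckets[key] = TokenBucket(capacity, rate)
    return buckets[key].allow()
```

The same structure covers identity-aware limits: route unknown-user traffic to a key with a smaller capacity and refill rate than authenticated traffic.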
Key leak response checklist (first hour)
- Rotate keys and invalidate all leaked credentials.
- Audit recent request logs for new endpoints and new traffic sources.
- Identify the top endpointTag and tenant/user concentration during the spike.
- Add temporary strict caps (tokens, requests) until stable.
- Create a permanent control: secret scanning, least privilege, and rotation policy.
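The log-audit step in the checklist (new endpoints and new traffic sources) reduces to a set difference between a baseline window and the spike window. A sketch assuming events carry the `endpointTag` and `userId` fields from the payload example below:

```python
def new_sources(baseline_events, spike_events):
    """Return endpointTags and users seen during the spike but absent
    from the baseline window - the first places to look after a leak."""
    base_endpoints = {e["endpointTag"] for e in baseline_events}
    base_users = {e["userId"] for e in baseline_events}
    return {
        "new_endpoints": sorted({e["endpointTag"] for e in spike_events} - base_endpoints),
        "new_users": sorted({e["userId"] for e in spike_events} - base_users),
    }
```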
Signals that distinguish abuse from regressions
- Abuse: burst traffic with low identity diversity and high error variance.
- Regressions: cost/request drift after deploy with stable traffic volume.
- Retry storms: higher retry ratio and longer tail latency, often with upstream errors.
- Pricing drift: unknown-model ratio rises or cost snapshots look inconsistent.
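The distinctions above can be encoded as a first-pass triage heuristic. The metric names and every threshold below are assumptions for illustration; real cutoffs should come from your own baselines, and the function only narrows the search, it does not replace diagnosis.

```python
def classify_spike(m):
    """Rough triage from window metrics (all names/thresholds illustrative):
    retry_ratio, upstream_error_rate, identity_diversity,
    deploy_in_window, cost_per_request_delta."""
    if m["retry_ratio"] > 0.3 and m["upstream_error_rate"] > 0.1:
        return "retry-storm"       # retries tracking upstream errors
    if m["identity_diversity"] < 0.2:
        return "abuse"             # burst traffic, few identities
    if m["deploy_in_window"] and m["cost_per_request_delta"] > 0.2:
        return "regression"        # cost/request drift after a deploy
    return "investigate"
```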
Post-incident hardening (make the next incident cheaper)
- Add alerting on unknown-user concentration and token-per-request spikes.
- Require per-endpoint output caps and max tool calls for agent workflows.
- Add per-tenant budgets for high-variance accounts.
- Document owner actions and update the response runbook.
- Run a weekly review of top endpoints/users to catch slow-drain abuse.
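The token-per-request alert from the hardening list can be as simple as comparing the current window against the recent median. A minimal sketch; the factor and sample count are illustrative assumptions to tune against your traffic.

```python
def token_spike_alert(history, current, factor=2.0, min_samples=7):
    """Fire when current tokens-per-request exceeds `factor` x the
    recent median. The median resists being dragged up by the spike
    itself, unlike a mean over the same window."""
    if len(history) < min_samples:
        return False  # not enough baseline to judge
    ordered = sorted(history)
    median = ordered[len(ordered) // 2]
    return current > factor * median
```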
What to send (payload example)
{
"externalRequestId": "req_01HZXB6MQZ2WQ9D2KCF9M4V2QY",
"provider": "provider_id",
"model": "model_id",
"endpointTag": "public.chat",
"promptVersion": "public_v1",
"userId": "anon_ip_hash",
"inputTokens": 260,
"outputTokens": 190,
"latencyMs": 892,
"status": "success",
"dataMode": "real",
"environment": "prod"
}
Common mistakes
- Shipping provider keys to the client or logging them in plaintext.
- No per-endpoint rate limits for high-cost workflows.
- Treating retry storms as "just reliability" while costs multiply.
- Delaying containment while searching for perfect root-cause.
How to verify in Opsmeter Dashboard
- Use Overview to confirm spike window and budget posture.
- Use Top Endpoints to find feature-level concentration.
- Use Top Users to find tenant-level concentration.
- Use Prompt Versions to validate deploy-linked cost drift.
Evaluation resources
For security and procurement reviews, use our trust summary before final tool selection.