Alert design
AI cost anomaly detection: practical thresholds that actually work
Most alert systems fail because their thresholds are noisy. Better thresholds use trend context and clear owner workflows.
Full guide: LLM budget alert policy: thresholds and escalation
Threshold model that scales
- Budget warning threshold (example: 80 percent)
- Budget exceeded threshold (100 percent)
- Burn-rate threshold versus trailing baseline
- Endpoint concentration threshold for dominant drivers
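The burn-rate check above is the one most teams get wrong, so here is a minimal sketch. It assumes you already collect hourly spend samples; the function name and the 2x default multiplier are illustrative, not part of any specific product API.

```python
from statistics import mean

def burn_rate_alert(trailing_hourly_spend, current_hour_spend, multiplier=2.0):
    """Flag when the current hour's spend exceeds a multiple of the
    trailing baseline (e.g. the mean of the last 24 hourly samples).

    multiplier: burn-rate threshold vs baseline; 2-3x is a common start.
    """
    baseline = mean(trailing_hourly_spend)
    if baseline == 0:
        # Any spend on a previously silent baseline is worth a look.
        return current_hour_spend > 0
    return current_hour_spend / baseline > multiplier
```

Compared with a static budget percentage, this fires early in the billing period, when a regression is still cheap to contain.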
Threshold templates
- Low-volume workspace: prioritize budget warning + endpoint concentration.
- Growing workspace: add burn-rate > 2-3x baseline checks.
- High-volume workspace: add promptVersion drift checks after deploy.
- Critical workspace: require owner acknowledgement on exceeded state.
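The four templates above can be written down as policy tiers so each workspace picks one explicitly. This is a hypothetical sketch; the keys and values are illustrative, not a real configuration schema.

```python
# Hypothetical policy tiers mirroring the workspace templates above.
# Each tier is a superset of the previous one.
ALERT_TIERS = {
    "low_volume": {
        "budget_warning_pct": 80,
        "endpoint_concentration": True,
    },
    "growing": {
        "budget_warning_pct": 80,
        "endpoint_concentration": True,
        "burn_rate_multiplier": 2.5,  # fires on 2-3x the trailing baseline
    },
    "high_volume": {
        "budget_warning_pct": 80,
        "endpoint_concentration": True,
        "burn_rate_multiplier": 2.5,
        "prompt_version_drift": True,  # check after each deploy
    },
    "critical": {
        "budget_warning_pct": 80,
        "endpoint_concentration": True,
        "burn_rate_multiplier": 2.5,
        "prompt_version_drift": True,
        "require_owner_ack_on_exceeded": True,
    },
}
```

Keeping tiers additive makes it obvious what a workspace gains when it is promoted from one tier to the next.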
Reduce false positives
- Separate demo/test traffic with dataMode and environment.
- Correlate spend jump with request-volume jump before paging.
- Mute known migration windows with short maintenance policy.
- Keep one action owner per alert channel.
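The volume-correlation rule above is the single biggest false-positive filter, so here is a minimal sketch of it. The function name and the tolerance factor are assumptions for illustration.

```python
def should_page(spend_ratio, volume_ratio, spend_threshold=2.0, tolerance=1.3):
    """Page only when spend grows meaningfully faster than request volume.

    spend_ratio:  current spend / baseline spend.
    volume_ratio: current request volume / baseline request volume.
    A spend jump that roughly tracks a volume jump is usually growth,
    not an efficiency regression, so it should not page anyone.
    """
    if spend_ratio < spend_threshold:
        return False  # no spend anomaly in the first place
    return spend_ratio > volume_ratio * tolerance
```

A 3x spend jump on 3x traffic stays quiet; the same spend jump on flat traffic pages the owner.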
Developer-friendly signals (tokens/hour and cost/request)
Budget thresholds are necessary, but they can be slow to react during fast-moving incidents. Engineers often prefer rate-based signals they can reason about quickly.
Add one tokens/hour or requests/hour check, plus cost/request drift, so a regression is visible even when absolute spend is still small.
- tokens/hour or requests/hour vs trailing baseline (detect volume bursts)
- cost/request vs baseline (detect efficiency regressions)
- endpointTag concentration change (detect one feature going wild)
- promptVersion correlation (detect deploy-linked drift)
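The cost/request drift signal from the list above can be sketched in a few lines. This assumes you can aggregate cost and request counts per window; the function name and the 25 percent default are illustrative.

```python
def cost_per_request_drift(baseline_cost, baseline_requests,
                           current_cost, current_requests,
                           drift_pct=25.0):
    """Flag an efficiency regression: cost/request rising vs baseline.

    Catches regressions even while absolute spend is still small,
    e.g. a prompt change that doubles output tokens at low traffic.
    """
    baseline = baseline_cost / baseline_requests
    current = current_cost / current_requests
    return (current - baseline) / baseline * 100 > drift_pct
```

Because it normalizes by volume, this check stays meaningful at any traffic level, which is exactly why it complements the absolute budget thresholds.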
Alerts vs spending caps (set expectations)
Some teams expect a budget system to behave like a hard spending cap on API keys. In most stacks, alerts are an operations workflow, while hard caps require runtime enforcement.
Design your thresholds around the control you actually have: alerts and playbooks first, then enforcement where it is safe for user experience.
- Define what happens at warning (human workflow) vs exceeded (incident decision).
- Attach top endpointTag + tenant/user drivers to every alert.
- Decide how to degrade safely (smaller context, shorter outputs, fewer tools).
- Only hard-block non-critical endpoints when you have clear messaging.
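The degrade-before-block decision above can be made explicit in code. This is a hypothetical sketch; the tier names and the budget cutoffs are assumptions, and real systems would key this off the alert state, not a raw percentage.

```python
def degrade_policy(budget_pct_used, endpoint_critical):
    """Pick a containment action as budget pressure grows (hypothetical tiers).

    budget_pct_used:   percent of the period budget consumed (100+ = exceeded).
    endpoint_critical: whether blocking this endpoint would break user flows.
    """
    if budget_pct_used < 80:
        return "normal"
    if budget_pct_used < 100:
        # Degrade first: smaller context, shorter outputs, fewer tools.
        return "reduce"
    # Only hard-block non-critical endpoints, and only with clear messaging.
    return "normal" if endpoint_critical else "block"
```

Separating "reduce" from "block" keeps the exceeded state an incident decision rather than an automatic outage.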
What to do when anomaly fires
- Classify anomaly: traffic, token, deploy, or abuse.
- Open Top Endpoints and Top Users immediately.
- Apply temporary containment and log decision.
- Convert repeated anomalies into permanent guardrails.
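The classification step above can be sketched as a first-pass triage function. The ordering and signal names here are hypothetical heuristics, not a definitive taxonomy; a human still confirms the class.

```python
def classify_anomaly(volume_up, cost_per_request_up,
                     deploy_recent, single_tenant_dominant):
    """Rough first-pass triage for the four classes above."""
    if single_tenant_dominant:
        return "abuse"    # one tenant/user dominates the spike
    if deploy_recent and cost_per_request_up:
        return "deploy"   # deploy-linked drift; correlate with promptVersion
    if cost_per_request_up:
        return "token"    # efficiency regression without a recent deploy
    if volume_up:
        return "traffic"  # volume burst; may be organic growth
    return "unclassified"
```

Checking tenant concentration first matters: abuse usually needs containment before any cost tuning does.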
What to send (payload example)
{
  "externalRequestId": "req_01HZXB6MQZ2WQ9D2KCF9M4V2QY",
  "provider": "provider_id",
  "model": "model_id",
  "endpointTag": "billing.guardrail_check",
  "promptVersion": "budget_v1",
  "userId": "tenant_acme_hash",
  "inputTokens": 240,
  "outputTokens": 80,
  "latencyMs": 892,
  "status": "success",
  "dataMode": "real",
  "environment": "prod"
}
Common mistakes
- Setting static thresholds without burn-rate checks.
- No single owner or escalation path for warning/exceeded states.
- Alerting on totals only (missing endpoint and tenant concentration context).
- Including demo/staging traffic in production spend policy decisions.
How to verify in Opsmeter Dashboard
- Use Overview to confirm spike window and budget posture.
- Use Top Endpoints to find feature-level concentration.
- Use Top Users to find tenant-level concentration.
- Use Prompt Versions to validate deploy-linked cost drift.
Evaluation resources
For security and procurement reviews, use our trust summary before final tool selection.