LLM budget alert policy: thresholds and escalation
This page defines the operating policy behind alerts: who owns each threshold, when escalation starts, and how to avoid alert noise.
Core controls
- Warning threshold before hard budget cap
- Exceeded threshold with explicit owner action
- Burn-rate checks against historical baseline
- Endpoint and tenant concentration checks when alerts fire
Operational policy
Alerts are workflow triggers, not passive notifications.
Each threshold must have one owner and one documented response path.
Escalation model that avoids alert fatigue
- Assign one primary owner and one backup owner per workspace.
- Escalate warning alerts only after burn-rate confirmation.
- Escalate exceeded alerts immediately with endpoint/tenant context.
- Review false positives weekly and retune thresholds.
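The escalation rules above can be sketched as a small routing function. This is an illustration only, not part of any product API: the function name `should_escalate` and its arguments are hypothetical.

```python
def should_escalate(state: str, burn_rate_confirmed: bool) -> bool:
    """Decide whether an alert should be escalated to its owner.

    Hypothetical helper: `state` is expected to be "warning" or "exceeded".
    """
    if state == "exceeded":
        # Exceeded alerts escalate immediately.
        return True
    if state == "warning":
        # Warning alerts escalate only after burn-rate confirmation.
        return burn_rate_confirmed
    # Any other state (for example "ok") never escalates.
    return False
```

Keeping the rule in one place makes the weekly false-positive review easier, because there is a single function to retune.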
Policy outputs your team should keep
- Threshold table with rationale and owner names
- Incident response path for warning and exceeded states
- Monthly threshold review cadence
- Post-incident policy change log
Example thresholds (start simple)
A good first version is small: warning + exceeded + one burn-rate confirmation check.
Attach context (top endpointTag + tenant) so the owner can act without opening five dashboards.
- Budget warning (80% of monthly budget) - Owner: primary on-call. First action: confirm burn-rate, identify top driver, apply safe caps.
- Budget exceeded (100% of monthly budget) - Owner: incident owner. First action: contain spend, decide degrade/pause, publish ETA.
- Burn-rate confirm (spend/day above 2-3x baseline) - Owner: primary on-call. First action: validate window + dataMode; escalate only if confirmed.
- Concentration check (top endpointTag > 50% of spend) - Owner: feature owner. First action: contain that endpointTag first; add a durable guardrail.
- Unknown model ratio (unknown models > 1-2%) - Owner: platform owner. First action: triage catalog/pricing mapping so cost math stays credible.
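One way to keep the threshold table reviewable is to store it as data. The sketch below mirrors the example thresholds above; the `Threshold` class, the `THRESHOLDS` list, and `owner_for` are hypothetical names chosen for illustration, and the trigger strings are descriptive text rather than evaluated rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    name: str          # short identifier for the threshold
    trigger: str       # human-readable trigger condition
    owner: str         # role that owns the first response
    first_action: str  # documented first action


THRESHOLDS = [
    Threshold("budget_warning", "spend >= 80% of monthly budget",
              "primary on-call",
              "confirm burn-rate, identify top driver, apply safe caps"),
    Threshold("budget_exceeded", "spend >= 100% of monthly budget",
              "incident owner",
              "contain spend, decide degrade/pause, publish ETA"),
    Threshold("burn_rate_confirm", "spend/day above 2-3x baseline",
              "primary on-call",
              "validate window + dataMode; escalate only if confirmed"),
    Threshold("concentration_check", "top endpointTag > 50% of spend",
              "feature owner",
              "contain that endpointTag first; add a durable guardrail"),
    Threshold("unknown_model_ratio", "unknown models > 1-2%",
              "platform owner",
              "triage catalog/pricing mapping"),
]


def owner_for(name: str) -> str:
    """Look up the owner of a threshold by name; raises KeyError if missing."""
    for threshold in THRESHOLDS:
        if threshold.name == name:
            return threshold.owner
    raise KeyError(name)
```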
How to choose thresholds (warning vs exceeded)
A good alert policy detects problems early without paging the team for normal variance. The most reliable thresholds combine a static budget with a burn-rate check.
Start simple: one workspace budget, one warning threshold, one exceeded threshold, and one owner. Then add endpoint and tenant granularity where it matters.
- Warning: early signal when you still have time to act (containment, routing, caps).
- Exceeded: incident trigger that requires an owner decision (pause, degrade, or approve overrun).
- Burn-rate: “are we spending faster than normal?” even if totals are still small.
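As a minimal sketch of the warning and exceeded states, the function below classifies spend against a monthly budget. The name `budget_state` and the default `warning_ratio` of 0.80 are assumptions for illustration (the 80% figure follows the example thresholds on this page).

```python
def budget_state(spend: float, monthly_budget: float,
                 warning_ratio: float = 0.80) -> str:
    """Classify current spend as "ok", "warning", or "exceeded"."""
    if monthly_budget <= 0:
        raise ValueError("monthly_budget must be positive")
    used = spend / monthly_budget
    if used >= 1.0:
        # At or above 100% of budget: incident trigger.
        return "exceeded"
    if used >= warning_ratio:
        # Early signal while there is still time to act.
        return "warning"
    return "ok"
```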
Burn-rate checks that reduce false alarms
Burn-rate is the easiest way to detect “something changed” without relying on perfect forecasts.
Compare the current period to your baseline (last 7/14/30 days) and require confirmation before escalation.
- Compute spend/day (normalized from the window length) and cost/request for the last 1h and 24h windows.
- Compare to baseline spend/day and baseline cost/request.
- Confirm dataMode/environment (real vs demo/test, prod vs staging).
- Escalate a warning only when both the budget total and the burn-rate indicate drift.
- Attach the top endpointTag and top tenant/user drivers to the alert.
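The confirmation steps above could look roughly like this. It is a sketch under stated assumptions: the function names, the `multiplier` default of 2.0, and the choice to reject non-production data outright are illustrative, not a prescribed implementation.

```python
def spend_per_day(window_spend: float, window_hours: float) -> float:
    """Normalize spend in a window (for example 1h or 24h) to spend/day."""
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    return window_spend * 24.0 / window_hours


def confirm_burn_rate(window_spend: float, window_hours: float,
                      baseline_spend_per_day: float,
                      data_mode: str, environment: str,
                      multiplier: float = 2.0) -> bool:
    """Return True only when real production spend runs above baseline."""
    # Demo/test or staging traffic never confirms a production burn-rate alert.
    if data_mode != "real" or environment != "prod":
        return False
    # Without a usable baseline there is nothing to compare against.
    if baseline_spend_per_day <= 0:
        return False
    current = spend_per_day(window_spend, window_hours)
    return current > multiplier * baseline_spend_per_day
```

For example, $5 spent in the last hour is $120/day when normalized, which is above 2x a $40/day baseline, so the check confirms.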
Alert payload fields that make response fast
- budget state (warning/exceeded) + current burn-rate vs baseline
- top endpointTag contributors and their cost/request deltas
- top tenants/users and concentration %
- promptVersion changes in the same window (deploy correlation)
- retry ratio and status distribution (multiplier detection)
- unknown-model ratio (pricing/coverage issues)
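A few of these fields can be derived from raw request events, as in the sketch below. It assumes each event is a dict with `endpointTag`, `userId`, and `status` keys, plus a `costUsd` field; `costUsd` and the output key names are hypothetical and used here only for illustration.

```python
from collections import Counter, defaultdict


def concentration(events, key):
    """Return (top value, share of total cost) for the given event key."""
    totals = defaultdict(float)
    for event in events:
        totals[event[key]] += event["costUsd"]
    total = sum(totals.values())
    if total == 0:
        # No spend in the window: nothing to attribute.
        return None, 0.0
    top = max(totals, key=totals.get)
    return top, totals[top] / total


def alert_context(events):
    """Summarize the dominant drivers to attach to an alert."""
    top_endpoint, endpoint_share = concentration(events, "endpointTag")
    top_user, user_share = concentration(events, "userId")
    return {
        "topEndpointTag": top_endpoint,
        "topEndpointShare": endpoint_share,
        "topUser": top_user,
        "topUserShare": user_share,
        "statusCounts": dict(Counter(e["status"] for e in events)),
    }
```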
Response playbooks (what owners should do)
Budgets are not a billing feature; they are an operations workflow. Every threshold should have a documented response.
Treat the first response as containment. Optimization comes after the incident is stable.
- Identify the dominant driver (endpointTag + tenant/user).
- Contain: cap output tokens, throttle non-critical endpoints, or route to cheaper models.
- Stop multipliers: reduce retries, fix timeouts, and block abuse patterns.
- Decide: accept overrun, pause non-critical flows, or apply temporary limits.
- Write one post-incident change (policy, cap, or rollout gate).
Hard caps vs soft caps (hybrid model)
Soft caps warn without blocking, which leaves owners room to coordinate a response; hard caps enforce strict boundaries. Most teams need both, scoped by feature criticality.
Use soft caps for core user paths to avoid outages, and hard caps for abuse-prone or non-critical workflows.
- Degrade gracefully before blocking: smaller context, shorter outputs, cheaper models.
- Apply hard caps to batch jobs and optional features first.
- Review outcomes weekly and retune thresholds to reduce alert fatigue.
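The hybrid model can be sketched as a single decision function. The name `cap_action`, the `degrade_at` default of 0.8, and the three action labels are assumptions for illustration; real enforcement would depend on how your gateway applies caps.

```python
def cap_action(cap_type: str, used_ratio: float,
               degrade_at: float = 0.8) -> str:
    """Pick an action for a request given cap type and usage (spend / cap).

    cap_type is "soft" (core user paths) or "hard" (batch or optional work).
    """
    if cap_type not in ("soft", "hard"):
        raise ValueError(f"unknown cap_type: {cap_type}")
    if used_ratio < degrade_at:
        return "allow"
    if used_ratio < 1.0:
        # Degrade before blocking: smaller context, shorter outputs,
        # or cheaper models.
        return "degrade"
    # At or above the cap: hard caps block, soft caps keep serving degraded.
    return "block" if cap_type == "hard" else "degrade"
```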
Maintenance cadence (keep policy healthy)
- Weekly: review false positives and adjust burn-rate confirmation logic.
- Weekly: review top 3 endpoints and tenants by spend (ownership clarity).
- Monthly: review budgets against traffic growth and model pricing changes.
- After incidents: add exactly one durable control to prevent recurrence.
What to send (payload example)
{
  "externalRequestId": "req_01HZXB6MQZ2WQ9D2KCF9M4V2QY",
  "provider": "provider_id",
  "model": "model_id",
  "endpointTag": "billing.guardrail_check",
  "promptVersion": "budget_v1",
  "userId": "tenant_acme_hash",
  "inputTokens": 240,
  "outputTokens": 80,
  "latencyMs": 892,
  "status": "success",
  "dataMode": "real",
  "environment": "prod"
}
Common mistakes
- Setting static thresholds without burn-rate checks.
- No single owner or escalation path for warning/exceeded states.
- Alerting on totals only (missing endpoint and tenant concentration context).
- Including demo/staging traffic in production spend policy decisions.
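To avoid the last mistake, events can be filtered on the `dataMode` and `environment` fields from the payload before any budget math runs. The helper name `production_events` is hypothetical; this is a minimal sketch assuming events are dicts shaped like the payload example.

```python
def production_events(events):
    """Keep only real production traffic for spend policy decisions."""
    return [
        event for event in events
        # Events missing either field are excluded rather than guessed at.
        if event.get("dataMode") == "real"
        and event.get("environment") == "prod"
    ]
```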
How to verify in Opsmeter Dashboard
- Use Overview to confirm spike window and budget posture.
- Use Top Endpoints to find feature-level concentration.
- Use Top Users to find tenant-level concentration.
- Use Prompt Versions to validate deploy-linked cost drift.
Templates
Budget warning alert (message template)
# Budget warning
Workspace:
Window (UTC):
Budget: $ / month
Current spend: $ (X%)
Burn-rate vs baseline:
Top driver:
- endpointTag:
- cost/request delta:
- top tenant/user:
Action taken:
Next update ETA:
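If alerts are posted by a bot, the warning template above can be filled in programmatically. The sketch below is one possible rendering; the function name and the keys of the `alert` dict are hypothetical.

```python
def render_budget_warning(alert: dict) -> str:
    """Fill the budget warning template from a dict of alert fields."""
    pct = 100.0 * alert["spend_usd"] / alert["budget_usd"]
    lines = [
        "# Budget warning",
        f"Workspace: {alert['workspace']}",
        f"Window (UTC): {alert['window_utc']}",
        f"Budget: ${alert['budget_usd']:.2f} / month",
        f"Current spend: ${alert['spend_usd']:.2f} ({pct:.0f}%)",
        f"Burn-rate vs baseline: {alert['burn_rate_multiple']:.1f}x",
        "Top driver:",
        f"- endpointTag: {alert['endpoint_tag']}",
        f"- cost/request delta: {alert['cost_per_request_delta']}",
        f"- top tenant/user: {alert['top_user']}",
        f"Action taken: {alert['action_taken']}",
        f"Next update ETA: {alert['next_update_eta']}",
    ]
    return "\n".join(lines)
```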
Budget exceeded incident update (message template)
# Budget exceeded (incident)
Workspace:
Start time (UTC):
Projected impact if unchanged:
Root cause hypothesis:
Dominant driver:
- endpointTag:
- tenant/user:
- promptVersion correlation:
Containment applied:
- caps/throttles/routing:
- user impact:
Next verification step:
Owner + ETA:
Evaluation resources
For security and procurement reviews, use our trust summary before final tool selection.