Retry storms: how retries can multiply your LLM bill
A retry storm is one of the fastest ways to inflate spend while still seeing mostly valid responses.
Full guide: Prompt deploy cost regressions: catch silent cost spikes
How retry storms start
- Aggressive client retries on timeout
- Shared retry policy across user-facing and batch paths
- Missing jitter and max-attempt caps
- No idempotency key on retried requests
Detection signals
- Request count rises faster than successful user actions.
- Latency and timeout ratio increase together with spend.
- Same endpointTag dominates both errors and spend.
- Duplicate externalRequestId patterns appear in telemetry.
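The signals above can be computed directly from request telemetry. A minimal sketch, assuming each event carries the `endpointTag`, `status`, and `externalRequestId` fields from the payload example later in this guide (the function name and event shape are illustrative, not a fixed API):

```python
from collections import Counter, defaultdict

def retry_signals(events):
    """Per-endpointTag attempts-per-success and duplicate
    externalRequestId counts, the two core retry-storm signals."""
    attempts = Counter()
    successes = Counter()
    request_ids = defaultdict(Counter)
    for e in events:
        tag = e["endpointTag"]
        attempts[tag] += 1
        if e["status"] == "success":
            successes[tag] += 1
        request_ids[tag][e["externalRequestId"]] += 1
    report = {}
    for tag in attempts:
        # An externalRequestId seen more than once means retried attempts.
        dupes = sum(1 for c in request_ids[tag].values() if c > 1)
        report[tag] = {
            "attemptsPerSuccess": attempts[tag] / max(successes[tag], 1),
            "duplicateRequestIds": dupes,
        }
    return report
```

An endpointTag whose attempts-per-success climbs while its duplicate-ID count rises is the one to contain first.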
Why retries multiply cost (simple math)
Retries are a cost multiplier because the provider bills per attempt, not per successful outcome. Even small increases in attempts-per-success can double effective cost.
Track attempts-per-success per endpointTag so you can contain the right feature path instead of guessing.
- attemptCost = average cost of one attempt
- attemptsPerSuccess = attempts / successful requests
- effectiveCostPerSuccess = attemptCost * attemptsPerSuccess
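The formulas above make the multiplier concrete. A worked example with illustrative numbers (the $0.004 attempt cost is hypothetical):

```python
def effective_cost_per_success(attempt_cost, attempts, successes):
    # effectiveCostPerSuccess = attemptCost * attemptsPerSuccess
    return attempt_cost * (attempts / successes)

# Baseline: 1.1 attempts per success. Storm: 2.2 attempts per success.
baseline = effective_cost_per_success(0.004, 1100, 1000)
storm = effective_cost_per_success(0.004, 2200, 1000)
```

Doubling attempts-per-success doubles the effective cost of every success even though the success count, and the user-visible outcome, is unchanged.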
Containment
- Cap max retries by endpoint criticality.
- Use exponential backoff with jitter.
- Introduce circuit-breaker behavior for known provider failures.
- Separate batch retry policy from interactive traffic.
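The first three containment steps can be sketched in one retry wrapper. A minimal example, assuming a per-criticality attempt cap (the `MAX_ATTEMPTS` values and `TimeoutError` trigger are illustrative assumptions, not prescribed settings):

```python
import random
import time

# Cap max retries by endpoint criticality: interactive paths fail fast,
# batch paths may retry more. These numbers are illustrative.
MAX_ATTEMPTS = {"interactive": 2, "batch": 5}

def call_with_retries(fn, criticality="interactive", base=0.5, cap=30.0):
    """Retry fn on timeout with full-jitter exponential backoff:
    delay for attempt n is uniform in [0, min(cap, base * 2**n)]."""
    last_exc = None
    for attempt in range(MAX_ATTEMPTS[criticality]):
        try:
            return fn()
        except TimeoutError as exc:
            last_exc = exc
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    raise last_exc
```

Keeping separate `MAX_ATTEMPTS` entries per traffic class is what prevents a batch-tuned policy from amplifying user-facing spend.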
Idempotency and request IDs (avoid double-billing patterns)
- Reuse externalRequestId for one logical user action across retries.
- Track attempt number and final status so effective cost is explainable.
- Avoid layered retries (proxy + app) without one owner and one policy.
- Disable automatic retries on non-idempotent endpoints unless you can reconcile duplicates.
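A minimal sketch of the first two points, reusing one `externalRequestId` across all attempts of a logical action so duplicates can be reconciled downstream (the `req_` prefix and payload shape mirror the example below; the helper itself is hypothetical):

```python
import uuid

def build_attempts(user_action_payload, max_attempts):
    """One logical user action keeps one externalRequestId across all
    retries; only attemptNumber changes, so effective cost per action
    remains explainable in telemetry."""
    request_id = f"req_{uuid.uuid4().hex}"
    return [
        {**user_action_payload,
         "externalRequestId": request_id,
         "attemptNumber": n}
        for n in range(1, max_attempts + 1)
    ]
```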
Long-term guardrail
Track retry ratio as a cost-control metric, not only a reliability metric.
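One way to operationalize this guardrail, with an illustrative threshold (the 0.15 budget is an assumption, not a recommendation):

```python
def retry_ratio(attempts, logical_requests):
    """retryRatio = (attempts - logicalRequests) / logicalRequests.
    0.0 means no retries; 1.0 means one retry per request on average."""
    return (attempts - logical_requests) / logical_requests

def breaches_budget(attempts, logical_requests, threshold=0.15):
    # Alert on this like a spend budget, not only an error budget.
    return retry_ratio(attempts, logical_requests) > threshold
```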
What to send (payload example)
{
"externalRequestId": "req_01HZXB6MQZ2WQ9D2KCF9M4V2QY",
"provider": "provider_id",
"model": "model_id",
"endpointTag": "checkout.ai_summary",
"promptVersion": "summary_v3",
"userId": "tenant_acme_hash",
"inputTokens": 540,
"outputTokens": 180,
"latencyMs": 892,
"status": "success",
"dataMode": "real",
"environment": "prod"
}
Common mistakes
- Comparing totals only instead of cost/request and token deltas by promptVersion.
- Skipping long-tail outlier review (p95/p99) where regressions hide.
- Letting retrieval config drift (top-k/chunk overlap) without a token budget.
- Not capping output tokens on low-risk endpoints after a deploy.
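The last mistake is cheap to prevent with a per-endpoint cap on requested output tokens. A minimal sketch; the endpoint tags and cap values are hypothetical:

```python
# Hypothetical per-endpoint output-token caps enforced after a deploy.
OUTPUT_TOKEN_CAPS = {
    "checkout.ai_summary": 256,
    "support.draft_reply": 512,
}
DEFAULT_CAP = 1024

def capped_max_tokens(endpoint_tag, requested):
    """Clamp the requested output-token limit to the endpoint's budget."""
    return min(requested, OUTPUT_TOKEN_CAPS.get(endpoint_tag, DEFAULT_CAP))
```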
How to verify in Opsmeter Dashboard
- Use Overview to confirm spike window and budget posture.
- Use Top Endpoints to find feature-level concentration.
- Use Top Users to find tenant-level concentration.
- Use Prompt Versions to validate deploy-linked cost drift.
Evaluation resources
For security and procurement reviews, use our trust summary before final tool selection.