Opsmeter
AI Cost & Inference Control


Retry storms: how retries can multiply your LLM bill

A retry storm is one of the fastest ways to inflate spend while still seeing mostly valid responses.

Tags: Retries · Cost spikes · Reliability

Full guide: Prompt deploy cost regressions: catch silent cost spikes

How retry storms start

  • Aggressive client retries on timeout
  • Shared retry policy across user-facing and batch paths
  • Missing jitter and max-attempt caps
  • No idempotency key on retried requests

Detection signals

  1. Request count rises faster than successful user actions.
  2. Latency and timeout ratio increase together with spend.
  3. The same endpointTag dominates both errors and spend.
  4. Duplicate externalRequestId patterns appear in telemetry.
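Signal 4 can be checked directly from telemetry. The sketch below groups events by endpointTag and flags tags where one externalRequestId repeats more than a threshold number of times; field names follow the payload example later in this guide, and the threshold of 3 is an illustrative choice, not a recommendation.

```python
from collections import Counter, defaultdict

def detect_retry_storm(events, dup_threshold=3):
    """Flag endpointTags where duplicate externalRequestIds suggest a retry storm.

    `events` is a list of telemetry dicts using the field names from the
    payload example (externalRequestId, endpointTag, status).
    """
    dup_counts = defaultdict(Counter)
    for e in events:
        dup_counts[e["endpointTag"]][e["externalRequestId"]] += 1

    flagged = {}
    for tag, counter in dup_counts.items():
        # The most-repeated request ID under this tag is the storm candidate.
        worst_id, worst = counter.most_common(1)[0]
        if worst >= dup_threshold:
            flagged[tag] = (worst_id, worst)
    return flagged
```

Run this per spike window so a long-lived tag does not accumulate duplicates across unrelated incidents.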

Why retries multiply cost (simple math)

Retries are a cost multiplier because the provider bills per attempt, not per successful outcome. Even small increases in attempts-per-success can double effective cost.

Track attempts-per-success per endpointTag so you can contain the right feature path instead of guessing.

  • attemptCost = average cost of one attempt
  • attemptsPerSuccess = attempts / successful requests
  • effectiveCostPerSuccess = attemptCost * attemptsPerSuccess
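The formula above is a one-liner; the sketch below spells it out with illustrative numbers (the $0.002 attempt cost and the attempt counts are made up for the example).

```python
def effective_cost_per_success(attempt_cost, attempts, successes):
    """effectiveCostPerSuccess = attemptCost * attemptsPerSuccess."""
    attempts_per_success = attempts / successes
    return attempt_cost * attempts_per_success

# 1000 successes that took 2300 attempts at $0.002 per attempt
# cost $0.0046 per success, not $0.002: a 2.3x multiplier.
cost = effective_cost_per_success(0.002, 2300, 1000)
```

Computing this per endpointTag, as the text suggests, shows which feature path carries the multiplier.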

Containment

  • Cap max retries by endpoint criticality.
  • Use exponential backoff with jitter.
  • Introduce circuit-breaker behavior for known provider failures.
  • Separate batch retry policy from interactive traffic.
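The first two containment points can be sketched together: exponential backoff with full jitter, capped by a per-endpoint max-attempt budget. The defaults below are illustrative, not recommendations.

```python
import random

def backoff_delays(max_attempts=4, base=0.5, cap=8.0):
    """Exponential backoff with full jitter.

    Each retry waits a random amount between 0 and
    min(cap, base * 2**attempt) seconds; max_attempts caps
    total attempts, with no delay after the final one.
    """
    delays = []
    for attempt in range(max_attempts - 1):
        delays.append(random.uniform(0, min(cap, base * 2 ** attempt)))
    return delays
```

Interactive endpoints would pass a small max_attempts (often 1 or 2), batch paths a larger one, which keeps the two policies separate as the last bullet advises.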

Idempotency and request IDs (avoid double-billing patterns)

  • Reuse externalRequestId for one logical user action across retries.
  • Track attempt number and final status so effective cost is explainable.
  • Avoid layered retries (proxy + app) without one owner and one policy.
  • Disable automatic retries on non-idempotent endpoints unless you can reconcile duplicates.
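A minimal sketch of the first two bullets: mint one externalRequestId per logical action, reuse it across every attempt, and record the attempt number. The `send` callable and the `attempt` field are placeholders for your own transport and schema, not an Opsmeter API.

```python
import uuid

def call_with_retries(send, payload, max_attempts=3):
    """Reuse one externalRequestId across all retries of a logical action.

    `send` is a hypothetical transport function; `attempt` is recorded
    so effective cost per success stays explainable later.
    """
    # One ID for the whole logical action, not one per attempt.
    payload = dict(payload, externalRequestId=f"req_{uuid.uuid4().hex}")
    result = {"status": "error"}
    for attempt in range(1, max_attempts + 1):
        result = send({**payload, "attempt": attempt})
        if result.get("status") == "success":
            return result
    return result
```

Because the ID is stable, duplicate attempts reconcile to one user action in telemetry instead of looking like organic traffic growth.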

Long-term guardrail

Track the retry ratio as a cost-control metric, not only a reliability metric.

What to send (payload example)

{
  "externalRequestId": "req_01HZXB6MQZ2WQ9D2KCF9M4V2QY",
  "provider": "provider_id",
  "model": "model_id",
  "endpointTag": "checkout.ai_summary",
  "promptVersion": "summary_v3",
  "userId": "tenant_acme_hash",
  "inputTokens": 540,
  "outputTokens": 180,
  "latencyMs": 892,
  "status": "success",
  "dataMode": "real",
  "environment": "prod"
}

Common mistakes

  • Comparing totals only instead of cost/request and token deltas by promptVersion.
  • Skipping long-tail outlier review (p95/p99) where regressions hide.
  • Letting retrieval config drift (top-k/chunk overlap) without a token budget.
  • Not capping output tokens on low-risk endpoints after a deploy.

How to verify in Opsmeter Dashboard

  1. Use Overview to confirm spike window and budget posture.
  2. Use Top Endpoints to find feature-level concentration.
  3. Use Top Users to find tenant-level concentration.
  4. Use Prompt Versions to validate deploy-linked cost drift.

Related guides

  • Open operations docs
  • Read AI cost spike guide
  • Compare alternatives

Evaluation resources

For security and procurement reviews, use our trust summary before final tool selection.

Open trust proof pack