
Root cause workflow

Root cause an LLM cost spike: endpoint, tenant, deploy

Use this guide for diagnosis logic and evidence order. For on-call execution, use the 15-minute checklist article.


Full guide: Bot attacks and LLM cost spikes: prevention playbook

Use totals as an entry point, not a final answer

Totals tell you when spend changed, but not which endpoint, tenant, or deploy created the change.

Root cause requires request-level tags and a deterministic investigation order.

Root-cause sequence

  1. Validate the spike window in Overview.
  2. Rank spend by endpointTag.
  3. Rank spend by userId/tenant mapping.
  4. Compare promptVersion cost/request before and after the deploy.
  5. Isolate one dominant driver and assign an owner.
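Steps 2 and 3 are the same aggregation over different tags. A minimal sketch, assuming each request record carries the tags from the payload schema plus a computed costUsd field (an assumption for illustration; Opsmeter's actual export format may differ):

```python
from collections import defaultdict

def rank_spend_by(records, key="endpointTag"):
    """Aggregate spend per tag and return tags sorted by total cost, descending."""
    totals = defaultdict(float)
    for r in records:
        totals[r[key]] += r["costUsd"]  # costUsd is an assumed per-request field
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

spike_window = [
    {"endpointTag": "public.chat", "costUsd": 0.012},
    {"endpointTag": "public.chat", "costUsd": 0.015},
    {"endpointTag": "internal.summarize", "costUsd": 0.004},
]
ranking = rank_spend_by(spike_window)
# ranking[0] is the dominant driver: public.chat
```

Passing `key="userId"` reuses the same function for the tenant ranking in step 3.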

Classify the spike type first (so you do not chase the wrong thing)

Spikes feel mysterious when you treat them as one category. Start by classifying: volume, efficiency, routing, retries, or abuse.

Each category has a different fix. Prompt tuning does not solve retry storms, and model swaps do not solve context bloat.

  • Volume spike: requests/hour up (same cost/request).
  • Efficiency regression: cost/request up (volume flat) after deploy.
  • Routing drift: a higher-cost model share increases on one endpoint.
  • Retry storm: errors + retries increase calls per user action.
  • Abuse/bot: unknown-user concentration + burst traffic on public endpoints.
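The categories above can be encoded as a first-pass heuristic over window-level metrics. The thresholds and metric names below are illustrative assumptions, not Opsmeter logic; note that separating an efficiency regression from routing drift still requires comparing model mix per endpoint:

```python
def classify_spike(before, after):
    """Heuristic spike classifier over before/after window metrics.

    Expects dicts with: requests_per_hour, cost_per_request, error_rate,
    retry_multiplier, unknown_user_share. Thresholds are illustrative.
    """
    def ratio(key):
        return after[key] / before[key] if before[key] else float("inf")

    if after["unknown_user_share"] > 0.5 and ratio("requests_per_hour") > 2:
        return "abuse/bot"
    if after["retry_multiplier"] > 1.5 or ratio("error_rate") > 3:
        return "retry storm"
    if ratio("cost_per_request") > 1.3 and ratio("requests_per_hour") < 1.2:
        return "efficiency regression or routing drift"
    if ratio("requests_per_hour") > 1.5:
        return "volume spike"
    return "unclassified"

baseline = {"requests_per_hour": 1000, "cost_per_request": 0.01,
            "error_rate": 0.01, "retry_multiplier": 1.0,
            "unknown_user_share": 0.05}
spike = dict(baseline, requests_per_hour=3200)
classify_spike(baseline, spike)  # → "volume spike"
```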

Fields required for reliable attribution

  • externalRequestId (stable across retries)
  • endpointTag and promptVersion; userId (optional but recommended)
  • provider/model and token counts
  • latency and status for retry/noise correlation

Evidence checklist (what to compare before vs after)

  1. Requests/hour and tokens/request (input + output) per endpointTag.
  2. Cost/request and model mix per endpointTag.
  3. Error rate (429/5xx) and retry multiplier (same externalRequestId repeating).
  4. Tenant/user concentration change (top 1 tenant share and top 5 share).
  5. PromptVersion changes in the same window (deploy correlation).
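The retry multiplier in check 3 can be computed directly from raw samples, assuming retried calls reuse the same externalRequestId (which is why the field must be stable across retries):

```python
def retry_multiplier(request_ids):
    """Calls per logical request: total provider calls / distinct externalRequestId.

    A value near 1.0 is normal; well above 1 indicates a retry storm,
    because retried calls repeat the same externalRequestId.
    """
    if not request_ids:
        return 0.0
    return len(request_ids) / len(set(request_ids))

ids = ["req_a", "req_a", "req_a", "req_b", "req_c", "req_c"]
retry_multiplier(ids)  # → 2.0 (6 calls over 3 logical requests)
```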

Validation checklist before closing incident

  1. Cost/request returned to baseline.
  2. Dominant endpoint share normalized.
  3. No abnormal retry ratio in the same window.
  4. Alert threshold tuned to catch recurrence earlier.
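Closeout checks 1-3 can be expressed as a single comparison against the baseline window. The 10% tolerance and the metric names here are assumptions for illustration, not Opsmeter defaults:

```python
def spike_resolved(baseline, current, tolerance=0.10):
    """Return (all_passed, per-check results) for the incident closeout criteria.

    baseline/current: dicts with cost_per_request, top_endpoint_share,
    retry_multiplier. tolerance is the allowed relative drift (assumed 10%).
    """
    def within(key):
        return current[key] <= baseline[key] * (1 + tolerance)

    checks = {k: within(k) for k in
              ("cost_per_request", "top_endpoint_share", "retry_multiplier")}
    return all(checks.values()), checks

baseline = {"cost_per_request": 0.010, "top_endpoint_share": 0.42, "retry_multiplier": 1.05}
current = {"cost_per_request": 0.021, "top_endpoint_share": 0.61, "retry_multiplier": 1.02}
ok, checks = spike_resolved(baseline, current)
# ok is False while cost/request and endpoint share remain elevated
```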

What to send (payload example)

{
  "externalRequestId": "req_01HZXB6MQZ2WQ9D2KCF9M4V2QY",
  "provider": "provider_id",
  "model": "model_id",
  "endpointTag": "public.chat",
  "promptVersion": "public_v1",
  "userId": "anon_ip_hash",
  "inputTokens": 260,
  "outputTokens": 190,
  "latencyMs": 892,
  "status": "success",
  "dataMode": "real",
  "environment": "prod"
}
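A small pre-send check can catch the missing-tag problems listed under common mistakes before they reach production. The field set below mirrors the example payload above (userId is excluded because the guide marks it optional); it is a sketch, not Opsmeter's own validation:

```python
REQUIRED = {
    "externalRequestId", "provider", "model", "endpointTag",
    "promptVersion", "inputTokens", "outputTokens", "status",
}

def missing_attribution_fields(payload):
    """Return the required attribution fields absent from a usage payload."""
    return sorted(REQUIRED - payload.keys())

payload = {
    "externalRequestId": "req_01HZXB6MQZ2WQ9D2KCF9M4V2QY",
    "provider": "provider_id",
    "model": "model_id",
    "endpointTag": "public.chat",
    "inputTokens": 260,
    "outputTokens": 190,
    "status": "success",
}
missing_attribution_fields(payload)  # → ["promptVersion"]
```

Without promptVersion, the deploy-correlation step of the sequence degrades to guesswork, which is why this check is worth running at ingest time.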

Common mistakes

  • Looking only at monthly totals instead of isolating the spike time window (UTC).
  • Changing models globally instead of scoping fixes to one endpointTag.
  • Ignoring retry storms (the cost multiplier is calls per action, not token price).
  • Missing promptVersion tagging, so deploy correlation becomes guesswork.
  • Mixing demo/test traffic with production, then “fixing” a false spike.

How to verify in Opsmeter Dashboard

  1. In Overview, set the exact spike window and compare it to the prior baseline window.
  2. In Top Endpoints, rank spend and cost/request to find the dominant feature driver.
  3. In Top Users, check tenant/user concentration and unknown-user bursts.
  4. In Prompt Versions, compare before/after cost/request for versions shipped in the spike window.
  5. Validate with raw samples: retries, model mix, and token deltas match the hypothesized driver.

Related guides

  • Use the 15-minute checklist
  • Open compare hub
  • Compare alternatives

Evaluation resources

For security and procurement reviews, use our trust summary before final tool selection.

Open trust proof pack