Model strategy
Model swap regressions: cheaper models can cost more
A lower token price does not guarantee a lower total cost: reliability and retry behavior can erase the apparent savings.
Full guide: "Prompt deploy cost regressions: catch silent cost spikes"
Failure modes
- A cheaper model increases failure or retry rate.
- Output quality drift increases downstream rework calls.
- Latency degradation forces larger buffers and duplicate requests.
Cheaper token price vs cheaper outcome
Teams often optimize for list price per token and miss the true metric: cost per successful outcome. If success rate drops, the retry multiplier can erase savings.
Model swaps can also change behavior: longer answers, more tool calls, or more user follow-up questions. All of those increase total request cost.
- Track effective cost per successful request (include retries and fallbacks).
- Watch attempts-per-success and rework loops (re-asks, edits, escalations).
- Compare before/after windows by endpointTag so ownership is clear.
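The metrics above can be computed from raw request events. A minimal sketch, assuming each attempt (retries and fallbacks included) is logged as its own record carrying an `endpointTag`, a `status`, and a hypothetical pre-computed `costUsd` field:

```python
from collections import defaultdict

def effective_cost_per_success(records):
    """Effective cost per successful request, grouped by endpointTag.

    Every attempt contributes to cost, but only successes count in the
    denominator, so retries and fallbacks inflate the result.
    """
    totals = defaultdict(lambda: {"cost": 0.0, "successes": 0})
    for r in records:
        bucket = totals[r["endpointTag"]]
        bucket["cost"] += r["costUsd"]
        if r["status"] == "success":
            bucket["successes"] += 1
    return {
        tag: (t["cost"] / t["successes"]) if t["successes"] else float("inf")
        for tag, t in totals.items()
    }

records = [
    {"endpointTag": "checkout.ai_summary", "status": "success", "costUsd": 0.002},
    {"endpointTag": "checkout.ai_summary", "status": "error", "costUsd": 0.002},
    {"endpointTag": "checkout.ai_summary", "status": "success", "costUsd": 0.002},
]
# Three attempts, two successes: 0.006 total / 2 successes = 0.003 per success
print(effective_cost_per_success(records))
```

Running the same computation over before and after windows gives a like-for-like comparison per endpointTag.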
A simple success-adjusted cost model
- attemptCost = cost per attempt (tokens * per-token price from the pricing table)
- attemptsPerSuccess = attempts / successful requests
- successAdjustedCost = attemptCost * attemptsPerSuccess
- Add downstream multipliers when applicable (human review, extra tool steps, re-asks)
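Plugging hypothetical numbers into this model shows how a 40% cheaper attempt price can still lose once attempts-per-success is factored in:

```python
def success_adjusted_cost(attempt_cost, attempts, successes):
    # successAdjustedCost = attemptCost * attemptsPerSuccess
    return attempt_cost * (attempts / successes)

# Illustrative numbers only: model B lists 40% cheaper per attempt,
# but its success rate is much lower.
model_a = success_adjusted_cost(attempt_cost=0.010, attempts=100, successes=96)
model_b = success_adjusted_cost(attempt_cost=0.006, attempts=100, successes=50)
print(model_a, model_b)  # model B ends up more expensive per success
```

Downstream multipliers (human review, extra tool steps) would widen the gap further.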
Evaluation workflow
- Measure success-adjusted cost per request before rollout.
- Track retry and fallback rate by model.
- Compare endpoint-level margin impact after model swap.
Rollout guardrails that prevent surprise regressions
- Canary by endpointTag (start with low-risk, low-variance endpoints).
- Set a maximum retry policy and a fallback path for failures.
- Cap output tokens on endpoints where verbosity drift is likely.
- Gate rollout on tail risk (p95/p99 tokens and latency), not only averages.
- Add budget alerts for the rollout window (burn-rate + endpoint concentration).
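The tail-risk gate can be sketched as below; the nearest-rank percentile and the 1.2x p95 threshold are illustrative assumptions, not recommended defaults:

```python
def percentile(values, pct):
    """Nearest-rank percentile; enough precision for a rollout gate sketch."""
    ordered = sorted(values)
    rank = max(1, min(len(ordered), round(pct / 100 * len(ordered))))
    return ordered[rank - 1]

def gate_canary(baseline_ms, canary_ms, max_p95_ratio=1.2):
    """Pass the canary only if its p95 latency stays within
    max_p95_ratio of the baseline p95 (tail risk, not averages)."""
    return percentile(canary_ms, 95) <= percentile(baseline_ms, 95) * max_p95_ratio

baseline = [100] * 95 + [500] * 5   # p95 = 100 ms
canary = [100] * 90 + [500] * 10    # p95 = 500 ms: tail regressed
print(gate_canary(baseline, canary))  # False: block the rollout
```

Note that both samples have similar averages; only the tail comparison catches the regression, which is the point of gating on p95/p99.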
Hybrid routing patterns (keep savings without losing quality)
A full swap is rarely required. Many teams get most of the savings by routing low-risk endpoints to a cheaper tier and keeping high-risk endpoints on a stronger model.
The routing unit should be endpointTag or intent class, not a global switch.
- Route by endpointTag risk: summaries and drafts on cheaper tier, critical decisions on stronger tier.
- Fallback only on failures (avoid always-on double calls).
- Use a degraded mode when budgets are exceeded (shorter outputs, fewer tools).
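A routing sketch along these lines, where the tag-to-risk mapping, tier names, and token caps are all hypothetical:

```python
# Assumed low-risk tags; in practice this mapping is owned per endpointTag.
LOW_RISK_TAGS = {"checkout.ai_summary", "email.draft"}

def pick_model(endpoint_tag, budget_exceeded=False):
    """Route low-risk endpoints to the cheap tier; degrade when over budget."""
    if budget_exceeded:
        # Degraded mode: cheap tier, shorter outputs, no tools.
        return {"model": "cheap_tier", "maxOutputTokens": 256, "tools": False}
    if endpoint_tag in LOW_RISK_TAGS:
        return {"model": "cheap_tier", "maxOutputTokens": 512, "tools": True}
    return {"model": "strong_tier", "maxOutputTokens": 1024, "tools": True}
```

Because the routing unit is the endpointTag, a regression on one endpoint can be rolled back without touching the rest of the traffic.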
Post-rollout monitors that catch regression fast
- success rate and retry ratio by endpointTag
- cost/request (not just token price) by promptVersion
- latency shifts that trigger timeouts and duplicate requests
- top tenants affected by quality drift and rework loops
- outputTokens growth when users ask for re-answers
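A retry-ratio monitor for the first bullet might look like this; the `isRetry` flag and the 10% threshold are assumptions for illustration:

```python
from collections import Counter

def retry_alerts(events, max_retry_ratio=0.10):
    """Flag endpointTags whose retry share of attempts exceeds the
    threshold; a jump after a model swap is an early regression signal."""
    attempts, retries = Counter(), Counter()
    for e in events:
        attempts[e["endpointTag"]] += 1
        if e.get("isRetry"):
            retries[e["endpointTag"]] += 1
    return sorted(t for t in attempts if retries[t] / attempts[t] > max_retry_ratio)

events = (
    [{"endpointTag": "checkout.ai_summary", "isRetry": i < 2} for i in range(10)]
    + [{"endpointTag": "search.rerank", "isRetry": False} for _ in range(10)]
)
print(retry_alerts(events))  # ['checkout.ai_summary'] (2/10 retries > 10%)
```

The same grouping pattern extends to cost/request by promptVersion and to tenant-level drift.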
What to send (payload example)
{
"externalRequestId": "req_01HZXB6MQZ2WQ9D2KCF9M4V2QY",
"provider": "provider_id",
"model": "model_id",
"endpointTag": "checkout.ai_summary",
"promptVersion": "summary_v3",
"userId": "tenant_acme_hash",
"inputTokens": 540,
"outputTokens": 180,
"latencyMs": 892,
"status": "success",
"dataMode": "real",
"environment": "prod"
}Common mistakes
- Comparing totals only instead of cost/request and token deltas by promptVersion.
- Skipping long-tail outlier review (p95/p99) where regressions hide.
- Letting retrieval config drift (top-k/chunk overlap) without a token budget.
- Not capping output tokens on low-risk endpoints after a deploy.
How to verify in Opsmeter Dashboard
- Use Overview to confirm spike window and budget posture.
- Use Top Endpoints to find feature-level concentration.
- Use Top Users to find tenant-level concentration.
- Use Prompt Versions to validate deploy-linked cost drift.
Evaluation resources
For security and procurement reviews, use our trust summary before final tool selection.