Pillar
Prompt deploy cost regressions: catch silent cost spikes
Prompt changes can improve output quality while harming margin. This pillar centralizes deploy-time cost regression controls.
Regression patterns teams miss
- System prompt growth and context creep
- Output verbosity drift after release
- Tool output inflation in multi-step chains
- Retry/fallback behavior that multiplies requests
Deploy-time review order
- Tag every release with promptVersion.
- Compare cost/request and token deltas against baseline.
- Check endpoint concentration shifts in the same window.
- Rollback or constrain rollout when threshold is exceeded.
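The review order above can be sketched as a small gate function. This is a minimal sketch, not an Opsmeter API: the record fields, the 20% threshold, and the decision labels are illustrative assumptions.

```python
# Sketch of a deploy-time cost regression gate. The record shape and the
# 20% default threshold are illustrative assumptions, not a fixed API.

def gate_rollout(baseline: dict, current: dict, threshold: float = 0.20) -> dict:
    """Compare cost/request and token deltas against the baseline window."""
    deltas = {}
    for metric in ("cost_per_request", "avg_input_tokens", "avg_output_tokens"):
        before, after = baseline[metric], current[metric]
        deltas[metric] = (after - before) / before if before else 0.0
    breached = [m for m, d in deltas.items() if d > threshold]
    return {
        "promptVersion": current["promptVersion"],
        "deltas": deltas,
        "breached": breached,
        "decision": "rollback_or_constrain" if breached else "accept",
    }

baseline = {"cost_per_request": 0.010, "avg_input_tokens": 800, "avg_output_tokens": 200}
current = {"promptVersion": "summary_v3", "cost_per_request": 0.014,
           "avg_input_tokens": 1300, "avg_output_tokens": 210}
result = gate_rollout(baseline, current)
```

Keeping the gate as a pure function makes the threshold a reviewable policy value rather than an ad-hoc judgment during an incident.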
Baseline metrics that catch regressions early
- avgInputTokens and avgOutputTokens by promptVersion
- cost/request by endpointTag and promptVersion
- retry ratio and fallback frequency (hidden multipliers)
- latency deltas that correlate with context growth
- top outliers (long-tail requests) rather than averages only
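The per-version baselines above can be computed from raw request records. A minimal sketch follows; field names mirror the payload example later in this guide, except `costUsd`, which is an assumed field for illustration.

```python
# Sketch: avgInputTokens / avgOutputTokens and cost/request grouped by
# promptVersion. The costUsd field is an illustrative assumption.
from collections import defaultdict
from statistics import mean

def baseline_by_prompt_version(requests: list[dict]) -> dict:
    grouped = defaultdict(list)
    for r in requests:
        grouped[r["promptVersion"]].append(r)
    return {
        version: {
            "avgInputTokens": mean(r["inputTokens"] for r in rows),
            "avgOutputTokens": mean(r["outputTokens"] for r in rows),
            "costPerRequest": mean(r["costUsd"] for r in rows),
        }
        for version, rows in grouped.items()
    }

requests = [
    {"promptVersion": "summary_v2", "inputTokens": 500, "outputTokens": 150, "costUsd": 0.008},
    {"promptVersion": "summary_v3", "inputTokens": 900, "outputTokens": 160, "costUsd": 0.013},
    {"promptVersion": "summary_v3", "inputTokens": 1100, "outputTokens": 180, "costUsd": 0.015},
]
stats = baseline_by_prompt_version(requests)
```

Averages alone hide tail regressions, so pair this with a p95/p99 view per version.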
Regression types (symptoms and fastest containment)
Most regressions fit a repeatable pattern. The fastest response is to identify the pattern and apply the matching containment control.
Use this list as a deploy checklist and as an on-call reference when a cost spike is deploy-linked.
- Context creep: input tokens grow after deploy. Watch avgInputTokens + cost/request. Contain by reducing topK, compressing context, trimming the system prompt.
- Verbosity drift: outputs get longer (quality may improve). Watch avgOutputTokens + p95 outputs. Contain by capping output tokens and enforcing a response contract.
- Tool output inflation: tool steps or tool output size increases. Watch tool calls/request + tokens/request. Contain by limiting tool calls and summarizing tool output.
- Retry multiplier: errors cause repeated calls. Watch error rate + retries by externalRequestId. Contain with backoff/jitter, respecting Retry-After, and fixing timeouts.
- Routing drift: more traffic hits expensive models. Watch model mix + cost/request. Contain by pinning models per endpointTag and versioning routing policy.
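The checklist above can double as an on-call lookup table. The sketch below is a rough triage heuristic under assumed signal names and thresholds; the containment strings paraphrase this guide, and none of this is a built-in feature.

```python
# Sketch of the regression patterns above as an on-call lookup table.
# Signal names and thresholds are illustrative assumptions.
CONTAINMENT = {
    "context_creep": "reduce topK, compress context, trim system prompt",
    "verbosity_drift": "cap output tokens, enforce response contract",
    "tool_output_inflation": "limit tool calls, summarize tool output",
    "retry_multiplier": "backoff with jitter, respect Retry-After, fix timeouts",
    "routing_drift": "pin models per endpointTag, version routing policy",
}

def classify(signals: dict) -> str:
    """Very rough triage: map the dominant deploy-linked delta to a pattern."""
    if signals.get("retry_rate_delta", 0) > 0.05:
        return "retry_multiplier"
    if signals.get("model_mix_shift", 0) > 0.10:
        return "routing_drift"
    if signals.get("tool_calls_delta", 0) > 0.10:
        return "tool_output_inflation"
    if signals.get("input_tokens_delta", 0) > signals.get("output_tokens_delta", 0):
        return "context_creep"
    return "verbosity_drift"

pattern = classify({"input_tokens_delta": 0.45, "output_tokens_delta": 0.02})
action = CONTAINMENT[pattern]
```

Real incidents often show more than one signal; the value of the table is forcing a named hypothesis before containment, not perfect classification.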
Guardrails that prevent margin drift
- Set max output tokens by endpoint criticality.
- Gate rollouts when token deltas exceed your policy threshold.
- Treat RAG retrieval config (top-k, chunk size) as a deploy surface.
- Separate demo/test dataMode from production dashboards.
- Document every regression as one permanent control update.
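The first guardrail above, max output tokens by endpoint criticality, can live as versioned config. The tiers and numbers below are illustrative defaults, not recommendations.

```python
# Sketch: output token caps by endpoint criticality. Tier names and values
# are illustrative assumptions; tune them per endpoint.
MAX_OUTPUT_TOKENS = {
    "critical": 1024,   # revenue-path endpoints, e.g. checkout.ai_summary
    "standard": 512,
    "low_risk": 256,
}

def output_cap(endpoint_criticality: str) -> int:
    # Fall back to the tightest cap when an endpoint is untagged, so
    # unclassified endpoints cannot silently spend at the critical tier.
    return MAX_OUTPUT_TOKENS.get(endpoint_criticality, MAX_OUTPUT_TOKENS["low_risk"])
```

Putting the cap table in version control makes a cap change reviewable like any other deploy surface.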
Root causes (and how to fix them)
Cost regressions usually come from a small set of mechanisms. The fix is rarely “optimize tokens” in the abstract; it is almost always one concrete constraint.
Use the root cause category to choose the fastest containment action, then follow with a durable change (a rule, a cap, or a rollout gate).
- Context creep: reduce retrieved chunks, compress context, or move rarely-used instructions into on-demand retrieval.
- Verbosity drift: add a response contract (format + length) and enforce max output tokens.
- Tool output bloat: summarize tool payloads before reinjection and cap tool output size.
- Retry multiplier: cap attempts, add jitter/backoff, and keep externalRequestId stable across retries.
- Routing drift: pin model tiers per endpointTag and alert on unexpected tier changes.
Deploy controls for RAG and retrieval configuration
Retrieval settings behave like a “hidden prompt” that can double inputTokens without anyone touching the prompt text.
Treat top-k, chunk overlap, and reranking as deploy surfaces: version them, review deltas, and gate when costs jump.
- Log retrieval parameters per request (top-k, chunk size, overlap, reranker version).
- Track avgInputTokens deltas by promptVersion and by endpointTag.
- Alert when retrieval hit-rate drops while tokens increase.
- Prefer fewer, higher-quality chunks over larger context payloads.
- Create a rollback path for retrieval config separate from prompt text.
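Logging retrieval parameters per request, the first control above, can be enforced at ingest time. A minimal sketch follows; the field names are assumptions modeled on the payload example later in this guide.

```python
# Sketch: attach retrieval config to each request record so top-k, chunk,
# and reranker changes are auditable like prompt versions. Field names are
# illustrative assumptions.
def with_retrieval_params(record: dict, retrieval: dict) -> dict:
    required = {"topK", "chunkSize", "chunkOverlap", "rerankerVersion"}
    missing = required - retrieval.keys()
    if missing:
        # Fail closed: an unlogged retrieval config is an unreviewable one.
        raise ValueError(f"retrieval config missing: {sorted(missing)}")
    return {**record, "retrieval": retrieval}

record = with_retrieval_params(
    {"externalRequestId": "req_x", "promptVersion": "summary_v3"},
    {"topK": 6, "chunkSize": 512, "chunkOverlap": 64, "rerankerVersion": "rr_v2"},
)
```

Keeping the retrieval config on the record (rather than only in app config) is what makes a separate rollback path practical: you can see exactly which values were live during the spike.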
Agent and tool-chain regressions
Multi-step chains hide spend in intermediate steps. A small change to a tool schema or intermediate output format can multiply downstream token usage.
The safest control is per-step attribution: measure cost per workflow stage and cap the stages most likely to loop.
- Measure cost per workflow step and require an owner for the top outlier stage each week.
- Cap tool call count per request and alert on loop patterns.
- Summarize long tool outputs (logs, traces, tables) before reinjection.
- Gate rollouts when tool output size or step count increases.
- Keep a “degraded mode” path (skip optional tools) for budgetExceeded incidents.
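The cap-and-degrade controls above can be combined in a per-request budget: a hard cap on tool calls plus a degraded mode that sheds optional tools as the cap approaches. The class and thresholds below are illustrative, not an agent-framework API.

```python
# Sketch: per-request tool budget with a hard cap and a degraded mode that
# skips optional tools near the cap. Names and margins are illustrative.
class ToolBudget:
    def __init__(self, max_tool_calls: int = 8):
        self.max_tool_calls = max_tool_calls
        self.calls = 0

    def allow(self, tool_name: str, optional: bool) -> bool:
        if self.calls >= self.max_tool_calls:
            return False  # hard cap reached: budgetExceeded for this request
        if optional and self.calls >= self.max_tool_calls - 2:
            return False  # degraded mode: shed optional tools near the cap
        self.calls += 1
        return True

budget = ToolBudget(max_tool_calls=3)
decisions = [budget.allow("search", optional=False),
             budget.allow("enrich", optional=True),
             budget.allow("search", optional=False),
             budget.allow("search", optional=False)]
```

The hard cap also doubles as a loop detector: a request that exhausts its budget is worth inspecting for a tool-call cycle.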
Success-adjusted cost (the number that matters)
A lower token price does not guarantee lower cost if success rate drops and retries increase. The true metric is cost per successful outcome.
Track success-adjusted cost at the endpointTag level so teams can trade quality and cost with real evidence.
- Include retries and fallbacks in effective cost per successful request.
- Track rework loops (users asking for re-answers) as a hidden multiplier.
- Compare before/after windows per promptVersion to attribute changes to deploys.
- Watch p95/p99 token outliers; regressions often live in the tail.
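Success-adjusted cost as described above reduces to a short calculation: all spend, including failed attempts and fallbacks, divided by successful outcomes. The record fields below are illustrative assumptions.

```python
# Sketch: cost per successful request, with retries and fallbacks counted
# in the numerator. Record field names are illustrative assumptions.
def cost_per_successful_request(records: list[dict]) -> float:
    total_cost = sum(r["costUsd"] for r in records)  # failed attempts included
    successes = sum(1 for r in records if r["status"] == "success")
    if successes == 0:
        return float("inf")  # all spend, no outcomes: worst possible unit cost
    return total_cost / successes

records = [
    {"externalRequestId": "req_a", "status": "success", "costUsd": 0.010},
    {"externalRequestId": "req_b", "status": "error",   "costUsd": 0.010},  # timed out
    {"externalRequestId": "req_b", "status": "success", "costUsd": 0.012},  # retry succeeded
]
effective = cost_per_successful_request(records)
```

Here the naive average is about $0.0107 per request, but the success-adjusted cost is $0.016 per successful outcome, which is the number a cheaper-but-flakier model has to beat.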
Post-deploy report template (keep it lightweight)
- What changed (promptVersion + summary of intent).
- Token deltas (avgInputTokens, avgOutputTokens, tail outliers).
- Cost/request delta by endpointTag (before vs after).
- Retry/fallback changes (rates and drivers).
- Decision: accept, rollback, or mitigate (and the one permanent control to add).
What to send (payload example)
{
"externalRequestId": "req_01HZXB6MQZ2WQ9D2KCF9M4V2QY",
"provider": "provider_id",
"model": "model_id",
"endpointTag": "checkout.ai_summary",
"promptVersion": "summary_v3",
"userId": "tenant_acme_hash",
"inputTokens": 540,
"outputTokens": 180,
"latencyMs": 892,
"status": "success",
"dataMode": "real",
"environment": "prod"
}
Common mistakes
- Comparing totals only instead of cost/request and token deltas by promptVersion.
- Skipping long-tail outlier review (p95/p99) where regressions hide.
- Letting retrieval config drift (top-k/chunk overlap) without a token budget.
- Not capping output tokens on low-risk endpoints after a deploy.
How to verify in Opsmeter Dashboard
- Use Overview to confirm spike window and budget posture.
- Use Top Endpoints to find feature-level concentration.
- Use Top Users to find tenant-level concentration.
- Use Prompt Versions to validate deploy-linked cost drift.
Templates
Post-deploy regression report (template)
# Prompt deploy regression report
Window (UTC):
EndpointTag:
PromptVersion:
Baseline (before):
- requests/hour:
- avg input/output tokens:
- cost/request:
- error+retry rate:
Current (after):
- requests/hour:
- avg input/output tokens:
- cost/request:
- error+retry rate:
Hypothesis:
Containment (what we changed):
Decision: accept / rollback / mitigate
One durable control to add:
Owner + ETA:
Evaluation resources
For security and procurement reviews, use our trust summary before final tool selection.