Prompt Impact compare A vs B: catch regressions before rollout
Compare prompt versions with confidence rules so expensive regressions are blocked before they reach full traffic.
Full guide: Prompt deploy cost regressions: catch silent cost spikes
What A-vs-B should answer
- Did cost/request move materially after the new prompt version?
- Are input or output tokens driving the delta?
- Is latency shifting enough to create timeout/retry risk?
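The three questions above reduce to a handful of percentage deltas. The following is a minimal sketch, not a definitive implementation: the VersionStats shape, its field names, and the use of p95 latency are assumptions made for illustration, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class VersionStats:
    """Aggregated metrics for one prompt version (hypothetical shape)."""
    requests: int
    total_cost: float        # total spend over the window
    input_tokens: int
    output_tokens: int
    p95_latency_ms: float


def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, as a percentage."""
    if old == 0:
        raise ValueError("baseline value is zero; percentage change is undefined")
    return (new - old) / old * 100.0


def compare(a: VersionStats, b: VersionStats) -> dict[str, float]:
    """Deltas of version B relative to version A, in percent."""
    return {
        "cost_per_request": pct_change(a.total_cost / a.requests, b.total_cost / b.requests),
        "input_tokens_per_request": pct_change(a.input_tokens / a.requests, b.input_tokens / b.requests),
        "output_tokens_per_request": pct_change(a.output_tokens / a.requests, b.output_tokens / b.requests),
        "p95_latency_ms": pct_change(a.p95_latency_ms, b.p95_latency_ms),
    }


# Illustrative numbers only.
a = VersionStats(requests=1000, total_cost=20.0, input_tokens=800_000,
                 output_tokens=200_000, p95_latency_ms=1200.0)
b = VersionStats(requests=1000, total_cost=25.0, input_tokens=800_000,
                 output_tokens=300_000, p95_latency_ms=1500.0)
deltas = compare(a, b)
for name, value in deltas.items():
    print(f"{name}: {value:+.1f}%")
```

In this made-up case input tokens are flat while output tokens grow, which points to longer completions as the cost driver.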
Minimum data quality rules
- Use one endpointTag and one time window for both versions.
- Require a minimum sample size before trusting deltas.
- Treat low-confidence results as advisory, not release-blocking.
- Verify model mix did not change between A and B.
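The rules above can be checked mechanically before any delta is trusted. This is a sketch under stated assumptions: the Sample shape, the minimum of 500 requests, and the 5-point tolerance on model share are hypothetical values, not recommendations from this guide.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One side of the comparison (hypothetical shape)."""
    endpoint_tag: str
    window: tuple[str, str]          # (start, end) of the time window
    requests: int
    model_share: dict[str, float]    # model name -> fraction of requests


MIN_REQUESTS = 500      # assumed minimum sample size; tune for your traffic
MAX_MIX_SHIFT = 0.05    # assumed tolerance for a change in any model's share


def quality_issues(a: Sample, b: Sample) -> list[str]:
    """Return the data quality rules that the A/B pair violates."""
    issues = []
    if a.endpoint_tag != b.endpoint_tag:
        issues.append("endpointTag differs between A and B")
    if a.window != b.window:
        issues.append("time window differs between A and B")
    if min(a.requests, b.requests) < MIN_REQUESTS:
        issues.append("sample size below minimum")
    models = set(a.model_share) | set(b.model_share)
    shift = max(
        (abs(a.model_share.get(m, 0.0) - b.model_share.get(m, 0.0)) for m in models),
        default=0.0,
    )
    if shift > MAX_MIX_SHIFT:
        issues.append("model mix changed between A and B")
    return issues


# Illustrative inputs; the model names are placeholders.
window = ("2024-01-01", "2024-01-08")
a = Sample("chat", window, 1200, {"model-x": 1.0})
b = Sample("chat", window, 300, {"model-x": 0.7, "model-y": 0.3})
issues = quality_issues(a, b)
print(issues)
```

An empty list means the comparison passes these checks; any entry is a reason to treat the result as advisory.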
Release gate decision policy
Define numeric thresholds before rollout. For example, require manual approval when cost/request rises by 15% or more and outputTokens rises by 20% or more.
Without a numeric gate, teams normalize drift and only notice at month-end.
- Block rollout when cost/request exceeds threshold and confidence is high.
- Allow rollout with monitoring when confidence is low.
- Always log owner, decision, and follow-up action.
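A gate like this is small enough to express as one function. The sketch below encodes only the three rules above, using the 15% cost/request threshold from the example as its default. The GateRecord shape, the "high"/"low" confidence labels, and the follow-up strings are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class GateRecord:
    """What gets logged for every decision (hypothetical shape)."""
    owner: str
    decision: str     # "block", "allow_with_monitoring", or "allow"
    follow_up: str


def gate(cost_delta_pct: float, confidence: str, owner: str,
         cost_threshold_pct: float = 15.0) -> GateRecord:
    # Low confidence is advisory only: allow, but keep watching.
    if confidence != "high":
        return GateRecord(owner, "allow_with_monitoring",
                          "re-check once the sample is large enough")
    # High confidence and over threshold: block the rollout.
    if cost_delta_pct > cost_threshold_pct:
        return GateRecord(owner, "block",
                          "investigate token drivers before retrying rollout")
    return GateRecord(owner, "allow", "none")


record = gate(cost_delta_pct=22.0, confidence="high", owner="alice")
print(json.dumps(asdict(record)))  # always log owner, decision, and follow-up
```

Keeping the threshold as a parameter makes it easy to fix the number before rollout and review it in version control.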
Fast containment if regression is already live
- Roll back promptVersion on top-cost endpoints first.
- Apply an output token cap while the rollback propagates.
- Re-run the A-vs-B check and confirm baseline recovery.
- Add a release checklist entry for future prompt deploys.
What to alert on
- cost/request drift by endpointTag or promptVersion
- unexpected tenant concentration in Top Users
- request burst with falling success ratio
- budget warning, spend-alert, and exceeded state transitions
Execution checklist
- Confirm spike type: volume, token, deploy, or abuse signal.
- Assign one incident owner and one communication channel.
- Apply immediate containment before deep optimization.
- Document the dominant endpoint, tenant, and promptVersion driver.
- Convert findings into one permanent guardrail update.
FAQ
Is userId required?
No. userId is optional, but recommended for tenant-level attribution. If the raw ID is sensitive, send a hashed identifier instead.
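One way to produce a hashed identifier is a keyed hash, so the same user always maps to the same value without exposing the raw ID. This is a sketch only: the function name and the secret key are assumptions, and the guide does not prescribe a specific hashing scheme.

```python
import hashlib
import hmac


def hashed_user_id(user_id: str, secret: bytes) -> str:
    """Return a stable, keyed SHA-256 digest of the user ID as hex."""
    return hmac.new(secret, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


# The secret below is a placeholder; load a real one from your secret store.
secret = b"replace-with-a-real-secret"
print(hashed_user_id("user-123", secret))
```

A keyed hash (HMAC) is used here instead of a plain hash so that identifiers cannot be recovered by hashing a list of known user IDs without the key.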
Where should token usage values come from?
Prefer the usage fields returned by the provider. If they are unavailable, fall back to tokenizer estimates and flag those values as estimated so they are not mistaken for exact counts.
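The fallback order can be sketched as follows. Everything named here is an assumption: the provider field names input_tokens and output_tokens, the "estimated" flag, and the rough four-characters-per-token heuristic, which stands in for a real tokenizer.

```python
from typing import Optional


def estimate_tokens(text: str) -> int:
    # Placeholder heuristic (about 4 characters per token); swap in a real tokenizer.
    return max(1, len(text) // 4) if text else 0


def resolve_usage(provider_usage: Optional[dict], prompt: str, completion: str) -> dict:
    """Prefer provider-reported usage; otherwise estimate and say so."""
    if provider_usage and "input_tokens" in provider_usage and "output_tokens" in provider_usage:
        return {
            "inputTokens": provider_usage["input_tokens"],
            "outputTokens": provider_usage["output_tokens"],
            "estimated": False,
        }
    return {
        "inputTokens": estimate_tokens(prompt),
        "outputTokens": estimate_tokens(completion),
        "estimated": True,   # marks the uncertainty for downstream consumers
    }


print(resolve_usage({"input_tokens": 10, "output_tokens": 5}, "prompt", "completion"))
print(resolve_usage(None, "a" * 40, "b" * 8))
```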
How should retries be handled?
Keep the same externalRequestId for the same logical request so idempotency remains stable across retries.
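In practice this means generating the ID once, outside the retry loop. A minimal sketch, assuming a hypothetical send callable that raises ConnectionError on transient failures; only externalRequestId comes from this guide.

```python
import uuid


def call_with_retries(send, payload: dict, max_attempts: int = 3):
    """Retry a logical request while reusing one externalRequestId."""
    external_request_id = str(uuid.uuid4())  # generated once per logical request
    last_error = None
    for _ in range(max_attempts):
        try:
            # Every attempt carries the same ID, so retries are not counted twice.
            return send({**payload, "externalRequestId": external_request_id})
        except ConnectionError as err:
            last_error = err
    raise last_error


# Hypothetical sender that fails twice and then succeeds.
seen_ids = []

def flaky_send(body: dict) -> str:
    seen_ids.append(body["externalRequestId"])
    if len(seen_ids) < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = call_with_retries(flaky_send, {"endpointTag": "chat"})
print(result, len(set(seen_ids)))
```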
Can telemetry break production flow?
It should not. Use short timeouts, catch errors, and keep telemetry asynchronous so provider calls keep running.
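The three measures can be combined in a small wrapper. This is a sketch under assumptions: send is a hypothetical callable that accepts the event and a timeout in seconds (for example, a thin wrapper around your HTTP client), and the 0.5 second default is illustrative.

```python
import logging
from concurrent.futures import Future, ThreadPoolExecutor

log = logging.getLogger("telemetry")
_executor = ThreadPoolExecutor(max_workers=2)


def send_telemetry_safely(send, event: dict, timeout_s: float = 0.5) -> Future:
    """Submit telemetry in the background; never raise into the caller."""
    def task() -> None:
        try:
            send(event, timeout_s)   # short timeout is enforced by the sender
        except Exception:
            # Drop the event rather than affect the provider call path.
            log.warning("telemetry send failed; dropping event", exc_info=True)
    return _executor.submit(task)


# Hypothetical senders for demonstration.
delivered = []

def working_send(event: dict, timeout_s: float) -> None:
    delivered.append(event)

def broken_send(event: dict, timeout_s: float) -> None:
    raise TimeoutError("telemetry endpoint too slow")

ok_future = send_telemetry_safely(working_send, {"endpointTag": "chat"})
bad_future = send_telemetry_safely(broken_send, {"endpointTag": "chat"})
ok_future.result(timeout=5)
bad_future.result(timeout=5)   # returns None; the error was caught and logged
```

The caller gets a Future back but does not need to wait on it; the provider call can proceed immediately after submitting.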
Evaluation resources
For security and procurement reviews, use our trust summary before final tool selection.