System prompt growth: how hidden context quietly inflates LLM spend
System prompts change less visibly than user prompts, but they often drive sustained token inflation after release cycles.
Full guide: Prompt deploy cost regressions: catch silent cost spikes
What this guide answers
- What changed in cost, cost per request, or budget posture.
- Which endpoint, prompt, model, or tenant likely drove the delta.
- Which validation step or control to apply next in Opsmeter.io.
What to send (payload example)
{
  "externalRequestId": "req_01HZXB6MQZ2WQ9D2KCF9M4V2QY",
  "provider": "provider_id",
  "model": "model_id",
  "endpointTag": "checkout.ai_summary",
  "promptVersion": "summary_v3",
  "userId": "tenant_acme_hash",
  "inputTokens": 540,
  "outputTokens": 180,
  "latencyMs": 892,
  "status": "success",
  "dataMode": "real",
  "environment": "prod"
}
Common mistakes
- Comparing totals only instead of cost/request and token deltas by promptVersion.
- Skipping long-tail outlier review (p95/p99) where regressions hide.
- Letting retrieval config drift (top-k/chunk overlap) without a token budget.
- Not capping output tokens on low-risk endpoints after a deploy.
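To make the first two checks concrete, here is a minimal sketch that groups requests by promptVersion and computes cost per request plus input-token percentiles. The record shape follows the payload example above; `costUsd` is an illustrative field added for this sketch, not part of the payload.

```python
from collections import defaultdict
from statistics import mean, quantiles

# Hypothetical per-request records, shaped like the telemetry payload above;
# "costUsd" is an illustrative field added for this sketch.
requests = [
    {"promptVersion": "summary_v2", "inputTokens": 420, "costUsd": 0.0021},
    {"promptVersion": "summary_v2", "inputTokens": 430, "costUsd": 0.0022},
    {"promptVersion": "summary_v3", "inputTokens": 540, "costUsd": 0.0027},
    {"promptVersion": "summary_v3", "inputTokens": 560, "costUsd": 0.0028},
]

def per_version_stats(records):
    """Cost per request and input-token stats grouped by promptVersion."""
    by_version = defaultdict(list)
    for r in records:
        by_version[r["promptVersion"]].append(r)
    stats = {}
    for version, rows in by_version.items():
        tokens = [r["inputTokens"] for r in rows]
        stats[version] = {
            "costPerRequest": mean(r["costUsd"] for r in rows),
            "avgInputTokens": mean(tokens),
            # Tail view: regressions often hide above the average.
            "p95InputTokens": quantiles(tokens, n=20, method="inclusive")[-1]
            if len(tokens) > 1 else tokens[0],
        }
    return stats
```

Comparing these per-version numbers before and after a deploy surfaces drift that totals alone would hide.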
How to verify in the Opsmeter.io dashboard
- Use Overview to confirm spike window and budget posture.
- Use Top Endpoints to find feature-level concentration.
- Use Top Users to find tenant-level concentration.
- Use Prompt Versions to validate deploy-linked cost drift.
Where hidden growth appears
- Policy and style instructions appended over time.
- Safety and routing directives duplicated across layers.
- Embedded examples that never get pruned.
Use this workflow
Turn diagnosis into action
Identify the cost driver, validate it with attribution, then apply one durable control before the next billing cycle.
Apply in your workspace
Re-run this workflow on your own spend data
Follow the same path from article insight to telemetry verification, then validate with your own cost signals.
Containment checks
- Diff system prompt payloads per release.
- Track baseline input token deltas by promptVersion.
- Set max token guardrails for low-risk endpoints.
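The first containment check can be sketched with a per-release prompt diff. The prompt strings and version names below are illustrative, and the whitespace-split word count is only a rough proxy for real tokenizer output:

```python
import difflib

# Hypothetical system prompt payloads for two releases.
prompt_v2 = "You are a concise summarizer.\nFollow the style guide.\n"
prompt_v3 = (
    "You are a concise summarizer.\n"
    "Follow the style guide.\n"
    "Always append the legal disclaimer.\n"  # silent growth
)

def diff_prompts(old, new):
    """Unified diff plus a rough size delta (whitespace-split words)."""
    diff = list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="summary_v2", tofile="summary_v3", lineterm=""))
    delta = len(new.split()) - len(old.split())
    return diff, delta

diff, delta = diff_prompts(prompt_v2, prompt_v3)
```

Running this per release turns "the prompt grew" from a hunch into a reviewable artifact with an attached size delta.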
Measure a system prompt budget (so drift is visible)
Hidden context is expensive because it is paid on every request. Treat system prompt size as a budgeted dependency like latency or error rate.
Track the baseline inputTokens for each endpointTag and alert when it grows after a promptVersion or routing change.
- Baseline: avgInputTokens and p95 inputTokens per endpointTag.
- Change detection: compare before/after windows per promptVersion.
- Ownership: assign a maintainer for shared instruction layers.
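The change-detection step above can be sketched as a before/after window comparison per endpointTag. The 10% growth threshold is an illustrative assumption, not an Opsmeter.io default:

```python
from statistics import mean

# Illustrative alert threshold: flag >10% growth in average inputTokens.
GROWTH_THRESHOLD = 0.10

def input_token_growth(before_window, after_window):
    """Relative growth in average inputTokens between two windows."""
    base = mean(before_window)
    return (mean(after_window) - base) / base

def breaches_budget(before_window, after_window, threshold=GROWTH_THRESHOLD):
    return input_token_growth(before_window, after_window) > threshold

# Example: checkout.ai_summary before vs. after the summary_v3 deploy.
before = [420, 415, 425, 420]  # avgInputTokens samples, pre-deploy
after = [540, 535, 545, 540]   # post-deploy
```

In practice the windows would come from your telemetry store, keyed by endpointTag and promptVersion.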
Pruning strategies that reduce hidden context
- Remove duplicated policy blocks across layers (system + tool + router).
- Replace long examples with short templates and references.
- Move rarely used instructions into on-demand retrieval.
- Keep a strict "prompt budget" per endpointTag and enforce it.
- Review promptVersion diffs with token deltas, not only output quality.
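The on-demand retrieval idea can be sketched as conditional prompt assembly. The triggers and policy blocks are illustrative, and the substring match stands in for whatever retrieval mechanism you actually use:

```python
# Illustrative core prompt and on-demand policy blocks.
CORE_PROMPT = "You are a concise summarizer."
ON_DEMAND = {
    "refund": "Refund policy: refunds are honored within 30 days.",
    "legal": "Legal disclaimer: this summary is not legal advice.",
}

def build_system_prompt(query, core=CORE_PROMPT, extras=ON_DEMAND):
    """Attach a policy block only when the query actually needs it."""
    parts = [core]
    for trigger, block in extras.items():
        if trigger in query.lower():
            parts.append(block)
    return "\n".join(parts)
```

The payoff: the common-path request pays only for the core prompt, and the policy blocks cost tokens only on the requests that trigger them.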
Deploy guardrails (keep the prompt from growing back)
- Require a promptVersion bump when shared instruction layers change.
- Gate releases on inputTokens deltas, not only output quality.
- Cap output tokens on endpoints where verbosity drift is likely.
- Move long policies into retrieval only when needed (avoid always-on context).
- Review tail outliers (p95/p99) where hidden context hurts most.
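A minimal sketch of the release-gate idea, assuming budgets are configured per endpointTag; the budget and cap values are illustrative, not Opsmeter.io defaults:

```python
# Hypothetical CI release gate: block a deploy when the candidate prompt
# version's baseline input tokens exceed the endpoint's budget.
PROMPT_BUDGETS = {
    "checkout.ai_summary": 600,  # max avg inputTokens per request
}
OUTPUT_TOKEN_CAPS = {
    "checkout.ai_summary": 256,  # hard cap for a low-risk endpoint
}

def gate_release(endpoint_tag, candidate_avg_input_tokens, budgets=PROMPT_BUDGETS):
    """Return True when the candidate stays within its input-token budget."""
    budget = budgets.get(endpoint_tag)
    if budget is None:
        return True  # no budget configured; pass by default, flag for review
    return candidate_avg_input_tokens <= budget
```

Wired into CI, a failing gate blocks the deploy until the promptVersion diff is reviewed or the budget is deliberately raised.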
Related guides
Evaluation resources
For security and procurement reviews, use our trust summary before final tool selection.