Opsmeter
AI Cost & Inference Control


System prompt growth: how hidden context quietly inflates LLM spend

System prompts change less visibly than user prompts, but they often drive sustained token inflation after release cycles.

Prompt versions · Architecture

Full guide: Prompt deploy cost regressions: catch silent cost spikes

Where hidden growth appears

  • Policy and style instructions appended over time.
  • Safety and routing directives duplicated across layers.
  • Embedded examples that never get pruned.

Containment checks

  1. Diff system prompt payloads per release.
  2. Track baseline input token deltas by promptVersion.
  3. Set max token guardrails for low-risk endpoints.
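
The three checks above can be sketched in a few lines. The 5% alert threshold, the function names, and the sample windows are illustrative assumptions, not Opsmeter defaults:

```python
# Sketch: compare per-release input-token baselines (checks 1-2) and
# enforce a max-token guardrail (check 3). Field names follow this
# guide's payload schema; the threshold values are assumptions.
from statistics import mean

def input_token_delta(before: list[int], after: list[int]) -> float:
    """Relative change in average inputTokens between release windows."""
    base = mean(before)
    return (mean(after) - base) / base

def violates_guardrail(input_tokens: int, max_tokens: int) -> bool:
    """Hard cap for low-risk endpoints (check 3)."""
    return input_tokens > max_tokens

# Request samples before/after a promptVersion bump (illustrative).
before = [500, 510, 495, 505]
after = [560, 575, 550, 565]

delta = input_token_delta(before, after)
print(f"inputTokens delta: {delta:+.1%}")  # prints "inputTokens delta: +11.9%"
if delta > 0.05:  # alert threshold: 5% growth per release (assumption)
    print("ALERT: system prompt growth after release")
```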

Measure a system prompt budget (so drift is visible)

Hidden context is expensive because it is paid on every request. Treat system prompt size as a budgeted dependency like latency or error rate.

Track the baseline inputTokens for each endpointTag and alert when it grows after a promptVersion or routing change.

  • Baseline: avgInputTokens and p95 inputTokens per endpointTag.
  • Change detection: compare before/after windows per promptVersion.
  • Ownership: assign a maintainer for shared instruction layers.

Pruning strategies that reduce hidden context

  • Remove duplicated policy blocks across layers (system + tool + router).
  • Replace long examples with short templates and references.
  • Move rarely-used instructions into on-demand retrieval.
  • Keep a strict "prompt budget" per endpointTag and enforce it.
  • Review promptVersion diffs with token deltas, not only output quality.
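
Duplicated policy blocks (the first bullet) can be found mechanically. A minimal sketch, assuming each layer is a list of instruction blocks; the layer names are illustrative:

```python
# Sketch: detect instruction blocks duplicated across prompt layers
# (system + tool + router). Layer and block contents are hypothetical.
def duplicated_blocks(layers: dict[str, list[str]]) -> set[str]:
    """Return normalized blocks that appear in more than one layer."""
    seen: dict[str, str] = {}
    dups: set[str] = set()
    for layer, blocks in layers.items():
        for block in blocks:
            key = " ".join(block.split()).lower()  # normalize whitespace/case
            if key in seen and seen[key] != layer:
                dups.add(key)
            seen.setdefault(key, layer)
    return dups

layers = {
    "system": ["Answer in English.", "Never reveal internal IDs."],
    "router": ["Never reveal internal IDs."],
    "tool": ["Prefer JSON output."],
}
print(duplicated_blocks(layers))  # prints {'never reveal internal ids.'}
```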

Deploy guardrails (keep the prompt from growing back)

  1. Require a promptVersion bump when shared instruction layers change.
  2. Gate releases on inputTokens deltas, not only output quality.
  3. Cap output tokens on endpoints where verbosity drift is likely.
  4. Move long policies into retrieval so they load only when needed (avoid always-on context).
  5. Review tail outliers (p95/p99) where hidden context hurts most.
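
Guardrail 2 can be expressed as a small release gate in CI. A sketch, assuming the baseline and candidate averages come from the windows described above; the 5% growth allowance is an assumption:

```python
# Sketch of a release gate (guardrail 2): block a deploy when the new
# promptVersion's average inputTokens exceeds the previous baseline by
# more than an allowed delta. The 5% default is illustrative.
def gate_release(baseline_avg: float, candidate_avg: float,
                 max_growth: float = 0.05) -> bool:
    """Return True when the release may proceed."""
    return candidate_avg <= baseline_avg * (1 + max_growth)

print(gate_release(540, 555))  # prints True  (+2.8% growth: pass)
print(gate_release(540, 600))  # prints False (+11.1% growth: block)
```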

What to send (payload example)

{
  "externalRequestId": "req_01HZXB6MQZ2WQ9D2KCF9M4V2QY",
  "provider": "provider_id",
  "model": "model_id",
  "endpointTag": "checkout.ai_summary",
  "promptVersion": "summary_v3",
  "userId": "tenant_acme_hash",
  "inputTokens": 540,
  "outputTokens": 180,
  "latencyMs": 892,
  "status": "success",
  "dataMode": "real",
  "environment": "prod"
}
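
Before sending, it is worth validating events client-side. A minimal sketch; the guide does not state which fields are mandatory, so this `REQUIRED` set is inferred from the example and should be treated as an assumption:

```python
# Sketch: pre-send validation for a usage event shaped like the payload
# above. The REQUIRED set is an assumption inferred from the example.
REQUIRED = {"externalRequestId", "provider", "model", "endpointTag",
            "promptVersion", "inputTokens", "outputTokens", "status"}

def missing_fields(event: dict) -> list[str]:
    """Names of required fields absent from a usage event."""
    return sorted(REQUIRED - event.keys())

event = {"externalRequestId": "req_01HZXB6MQZ2WQ9D2KCF9M4V2QY",
         "provider": "provider_id", "model": "model_id",
         "endpointTag": "checkout.ai_summary", "promptVersion": "summary_v3",
         "inputTokens": 540, "outputTokens": 180, "status": "success"}
print(missing_fields(event))  # prints []
```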

Common mistakes

  • Comparing totals only instead of cost/request and token deltas by promptVersion.
  • Skipping long-tail outlier review (p95/p99) where regressions hide.
  • Letting retrieval config drift (top-k/chunk overlap) without a token budget.
  • Not capping output tokens on low-risk endpoints after a deploy.
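
The first mistake (comparing totals only) is easy to avoid by computing cost/request per promptVersion. A sketch; the per-token prices are illustrative, not any provider's actual pricing:

```python
# Sketch: cost/request by promptVersion instead of raw totals.
# in_price/out_price are illustrative per-token prices (assumptions).
def cost_per_request(events: list[dict], prompt_version: str,
                     in_price: float = 3e-6, out_price: float = 15e-6) -> float:
    """Average cost per request for one promptVersion."""
    rows = [e for e in events if e["promptVersion"] == prompt_version]
    total = sum(e["inputTokens"] * in_price + e["outputTokens"] * out_price
                for e in rows)
    return total / len(rows)

events = [
    {"promptVersion": "summary_v2", "inputTokens": 500, "outputTokens": 180},
    {"promptVersion": "summary_v3", "inputTokens": 560, "outputTokens": 180},
]
v2 = cost_per_request(events, "summary_v2")
v3 = cost_per_request(events, "summary_v3")
print(f"cost/request delta: {(v3 - v2) / v2:+.1%}")  # prints "cost/request delta: +4.3%"
```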

How to verify in Opsmeter Dashboard

  1. Use Overview to confirm spike window and budget posture.
  2. Use Top Endpoints to find feature-level concentration.
  3. Use Top Users to find tenant-level concentration.
  4. Use Prompt Versions to validate deploy-linked cost drift.

Related guides

  • Read prompt regression pillar
  • Open operations docs
  • Compare alternatives

Evaluation resources

For security and procurement reviews, use our trust summary before final tool selection.

Open trust proof pack