RAG context creep: how top-k and chunk size inflate cost
Top-k, chunk overlap, and retrieval fan-out can double spend without changing model list price. Treat retrieval configuration as a cost control surface.
Full guide: Prompt deploy cost regressions: catch silent cost spikes
What this guide answers
- What changed in cost, cost per request, or budget posture.
- Which endpoint, prompt, model, or tenant likely drove the delta.
- Which validation step or control to apply next in Opsmeter.io.
What to send (payload example)
{
"externalRequestId": "req_01HZXB6MQZ2WQ9D2KCF9M4V2QY",
"provider": "provider_id",
"model": "model_id",
"endpointTag": "rag.answer",
"promptVersion": "rag_v4",
"userId": "tenant_acme_hash",
"inputTokens": 2100,
"outputTokens": 340,
"latencyMs": 892,
"status": "success",
"dataMode": "real",
"environment": "prod"
}
Common mistakes
- Increasing topK to fix one edge case and leaving it high for all traffic.
- Using heavy overlap (or duplicate sources) that silently doubles input tokens.
- Not caching retrieval for repeated queries and high-traffic tenants.
- Changing retrieval config without a version tag, then losing attribution.
- Debugging cost drift by changing models instead of measuring input-token growth.
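A telemetry record like the payload example above can be assembled in application code before posting it to the ingestion API; a minimal sketch (field names follow the example payload, while `build_record` is a hypothetical helper, not an Opsmeter.io SDK function):

```python
import json

def build_record(endpoint_tag, prompt_version, input_tokens, output_tokens,
                 latency_ms, status="success", environment="prod"):
    # Assemble a usage record matching the payload schema shown above.
    return {
        "endpointTag": endpoint_tag,
        "promptVersion": prompt_version,
        "inputTokens": input_tokens,
        "outputTokens": output_tokens,
        "latencyMs": latency_ms,
        "status": status,
        "dataMode": "real",
        "environment": environment,
    }

record = build_record("rag.answer", "rag_v4", 2100, 340, 892)
payload = json.dumps(record)  # ready to POST to the ingestion endpoint
```

Keeping `endpointTag` and `promptVersion` on every record is what makes the attribution steps below possible.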
How to verify in the Opsmeter.io dashboard
- Open Top Endpoints and identify the RAG endpointTag with the highest spend delta.
- Compare avgInputTokens and cost/request before and after the retrieval change window.
- Inspect outliers (highest inputTokens) and map them to document size, topK, and overlap.
- Check Prompt Versions (or deploy notes) to correlate the spike to a retrieval/prompt change.
- Apply a rollback (topK/overlap) and confirm cost/request returns to baseline.
Why context growth happens
- Top-k defaults increase after relevance tuning and never reset.
- Chunk overlap multiplies duplicate tokens across similar passages.
- Fallback retrieval chains pull extra context even on simple prompts.
Use this workflow
Turn diagnosis into action: identify the cost driver, validate it with attribution, then apply one durable control before the next billing cycle. Re-run the same path on your own spend data, moving from article insight to telemetry verification, and validate with your own cost signals.
A simple cost model for retrieval (what actually drives the bill)
RAG spend is mostly input-token spend. Retrieval settings decide how many tokens you stuff into the prompt before the model even starts answering.
A rough model is: inputTokens ~= systemPrompt + userPrompt + topK * (chunkSize + overlap) + tool outputs. Overlap counts once per retrieved chunk, so small parameter changes compound quickly at scale.
- topK increases input tokens linearly, and usually latency as well, since larger context payloads and retrieval fan-out take longer to process.
- Chunk overlap is duplicated tokens every request.
- Long "instruction" system prompts can hide inside retrieval wrappers and grow over time.
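The rough model above can be turned into a quick estimator; a sketch in Python (the token counts and per-1K price are illustrative, not real list prices):

```python
def estimate_input_tokens(system_tokens, user_tokens, top_k, chunk_size,
                          overlap_tokens, tool_output_tokens=0):
    # inputTokens ~= system + user + topK * (chunkSize + overlap) + tools
    return (system_tokens + user_tokens
            + top_k * (chunk_size + overlap_tokens)
            + tool_output_tokens)

def input_cost(tokens, price_per_1k):
    # Input-token spend at a given $/1K-token price.
    return tokens / 1000 * price_per_1k

# Illustrative: bumping topK from 4 to 10 with 500-token chunks and 50-token overlap
before = estimate_input_tokens(400, 120, top_k=4, chunk_size=500, overlap_tokens=50)
after = estimate_input_tokens(400, 120, top_k=10, chunk_size=500, overlap_tokens=50)
```

In this sketch the topK change alone pushes input tokens from 2,720 to 6,020 per request, more than doubling input-token spend with no model or pricing change.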
What to measure per deploy
- avgInputTokens delta by endpointTag
- cost/request drift by promptVersion
- retrieval hit-rate versus token growth
- latency regression when context payload grows
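The per-deploy measurements above can be computed directly from raw usage records; a sketch that averages a field per endpointTag and compares two time windows (record shape follows the payload example; window selection is assumed to happen upstream):

```python
from collections import defaultdict

def avg_by_endpoint(records, field):
    # Average a numeric field per endpointTag.
    sums, counts = defaultdict(float), defaultdict(int)
    for r in records:
        sums[r["endpointTag"]] += r[field]
        counts[r["endpointTag"]] += 1
    return {tag: sums[tag] / counts[tag] for tag in sums}

def input_token_delta(before_window, after_window):
    # avgInputTokens delta per endpointTag between two windows.
    b = avg_by_endpoint(before_window, "inputTokens")
    a = avg_by_endpoint(after_window, "inputTokens")
    return {tag: a[tag] - b.get(tag, 0.0) for tag in a}
```

A large positive delta on one endpointTag right after a retrieval change is the signature of context creep rather than a pricing problem.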
Symptoms that look like "model pricing" problems (but are retrieval drift)
- Cost/request rises while request volume stays flat.
- Latency rises in the same window as avgInputTokens.
- The answer quality is unchanged, but context payload doubled.
- A subset of tenants suddenly dominates spend due to long documents or repeated queries.
How to reduce RAG token cost without losing accuracy
- Use dynamic top-k (lower for simple questions, higher for complex ones).
- Add a reranker and retrieve fewer, higher-quality chunks.
- Reduce chunk overlap; overlap is duplicated tokens every request.
- Compress context (summary or extractive highlights) before the final call.
- Cache retrieval results for repeated queries and high-traffic tenants.
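The caching step above can start as simply as keying retrieval results by a normalized query; a minimal in-process sketch (a production setup would use a shared cache with TTLs and invalidation on index updates):

```python
import hashlib

class RetrievalCache:
    # Cache retrieved chunks by normalized query to skip duplicate fan-out.
    def __init__(self):
        self._store = {}
        self.hits = 0

    def _key(self, query):
        # Normalize whitespace and case so trivially different queries share a key.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get_or_retrieve(self, query, retrieve_fn):
        key = self._key(query)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        result = retrieve_fn(query)
        self._store[key] = result
        return result
```

For high-traffic tenants asking near-identical questions, even this naive cache removes a large share of repeated retrieval token cost.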
Make retrieval changes attributable (treat it like a deploy)
Most teams version prompt changes but forget retrieval configuration. That makes cost drift feel random.
Version retrieval config (topK, chunking, reranker) alongside promptVersion so you can correlate cost/request changes to a single change event.
- Ship retrieval changes behind a canary.
- Record a retrievalVersion or include it in promptVersion notes.
- Alert on avgInputTokens delta after retrieval changes.
- Review top drivers weekly and reset topK defaults that crept up.
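Recording a retrievalVersion can be as simple as hashing the retrieval parameters into a stable tag; a sketch (the field names and `ret_` prefix are illustrative conventions, not an Opsmeter.io format):

```python
import hashlib
import json

def retrieval_version(config):
    # Derive a stable version tag from retrieval parameters.
    # sort_keys makes the hash independent of dict ordering.
    canonical = json.dumps(config, sort_keys=True)
    return "ret_" + hashlib.sha256(canonical.encode()).hexdigest()[:8]

config = {"topK": 6, "chunkSize": 500, "overlap": 50, "reranker": "none"}
version = retrieval_version(config)
```

Emit this tag alongside promptVersion on every record, and any cost/request change correlates to exactly one retrieval change event.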
Evaluation resources
For security and procurement reviews, use our trust summary before final tool selection.