RAG context creep: how top-k and chunk size inflate cost
Top-k, chunk overlap, and retrieval fan-out can double spend without changing model list price. Treat retrieval configuration as a cost control surface.
Full guide: Prompt deploy cost regressions: catch silent cost spikes
Why context growth happens
- Top-k defaults increase after relevance tuning and never reset.
- Chunk overlap multiplies duplicate tokens across similar passages.
- Fallback retrieval chains pull extra context even on simple prompts.
A simple cost model for retrieval (what actually drives the bill)
RAG spend is mostly input-token spend. Retrieval settings decide how many tokens you stuff into the prompt before the model even starts answering.
A rough model is: inputTokens ~= systemPrompt + userPrompt + (topK * chunkSize) + overlap + toolOutputs. Small parameter changes compound quickly at scale.
- topK increases input tokens linearly, and can compound latency when each extra chunk feeds downstream tool calls.
- Chunk overlap adds duplicated tokens to every request.
- Long "instruction" system prompts can hide inside retrieval wrappers and grow over time.
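The cost model above can be sketched as a few lines of Python. The function and all numbers below are illustrative assumptions, not Opsmeter API calls; the point is how quickly a top-k bump moves the input-token bill.

```python
def estimate_input_tokens(system_prompt_tokens: int,
                          user_prompt_tokens: int,
                          top_k: int,
                          chunk_size_tokens: int,
                          overlap_tokens: int,
                          tool_output_tokens: int = 0) -> int:
    """Rough pre-call estimate: retrieval settings dominate the sum."""
    retrieval_tokens = top_k * chunk_size_tokens + overlap_tokens
    return (system_prompt_tokens + user_prompt_tokens
            + retrieval_tokens + tool_output_tokens)

# Illustrative: raising top_k from 4 to 8 on 512-token chunks
# adds 4 * 512 = 2048 input tokens to every single request.
baseline = estimate_input_tokens(400, 120, 4, 512, 128)
tuned = estimate_input_tokens(400, 120, 8, 512, 128)
```

At a few hundred thousand requests per day, that single default change is billions of extra input tokens per month with no change in model list price.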
What to measure per deploy
- avgInputTokens delta by endpointTag
- cost/request drift by promptVersion
- retrieval hit-rate versus token growth
- latency regression when context payload grows
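As one way to compute the first two measurements, here is a minimal sketch that aggregates request records into per-endpointTag avgInputTokens and takes the delta across a deploy boundary. The record shape mirrors the payload example later in this guide; the function names are hypothetical.

```python
from collections import defaultdict

def avg_input_tokens_by_tag(requests: list[dict]) -> dict[str, float]:
    """Average inputTokens per endpointTag from a batch of request records."""
    totals = defaultdict(lambda: [0, 0])  # tag -> [token_sum, request_count]
    for r in requests:
        entry = totals[r["endpointTag"]]
        entry[0] += r["inputTokens"]
        entry[1] += 1
    return {tag: token_sum / count for tag, (token_sum, count) in totals.items()}

def token_delta(before: list[dict], after: list[dict]) -> dict[str, float]:
    """Per-tag avgInputTokens delta across a deploy boundary."""
    b = avg_input_tokens_by_tag(before)
    a = avg_input_tokens_by_tag(after)
    return {tag: avg - b.get(tag, 0.0) for tag, avg in a.items()}
```

Group the same way by promptVersion to get cost/request drift per prompt change.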
Symptoms that look like "model pricing" problems (but are retrieval drift)
- Cost/request rises while request volume stays flat.
- Latency rises in the same window as avgInputTokens.
- The answer quality is unchanged, but context payload doubled.
- A subset of tenants suddenly dominates spend due to long documents or repeated queries.
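The first two symptoms can be turned into a cheap automated check: flag windows where cost/request rose while request volume stayed roughly flat. This is a heuristic sketch with illustrative thresholds, not a prescribed alerting rule.

```python
def looks_like_retrieval_drift(prev: dict, curr: dict,
                               cost_rise: float = 0.2,
                               volume_tolerance: float = 0.1) -> bool:
    """True when cost/request climbed but volume did not, which points at
    context growth (retrieval drift) rather than pricing or traffic.
    Thresholds (20% cost rise, 10% volume tolerance) are assumptions."""
    cost_delta = (curr["costPerRequest"] - prev["costPerRequest"]) / prev["costPerRequest"]
    volume_delta = abs(curr["requests"] - prev["requests"]) / prev["requests"]
    return cost_delta > cost_rise and volume_delta < volume_tolerance
```

A window that trips this check is a cue to inspect avgInputTokens before blaming model pricing.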
How to reduce RAG token cost without losing accuracy
- Use dynamic top-k (lower for simple questions, higher for complex ones).
- Add a reranker and retrieve fewer, higher-quality chunks.
- Reduce chunk overlap; every overlapping token is billed again on every request.
- Compress context (summary or extractive highlights) before the final call.
- Cache retrieval results for repeated queries and high-traffic tenants.
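The dynamic top-k idea can be as simple as routing on query complexity. The heuristic below (word count as a complexity proxy) is a toy assumption; production systems would more likely use a classifier or a retrieval-confidence signal.

```python
def dynamic_top_k(query: str, base_k: int = 3, max_k: int = 8) -> int:
    """Scale top-k with query complexity so simple questions do not
    pay for a worst-case context window. Word count is a stand-in
    for a real complexity signal."""
    words = len(query.split())
    if words <= 8:       # short factual lookup
        return base_k
    if words <= 20:      # moderate, multi-clause question
        return min(base_k + 2, max_k)
    return max_k         # long, complex question
```

Even keeping max_k for hard queries, dropping simple queries from 8 to 3 chunks cuts their retrieval tokens by more than half.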
Make retrieval changes attributable (treat it like a deploy)
Most teams version prompt changes but forget retrieval configuration. That makes cost drift feel random.
Version retrieval config (topK, chunking, reranker) alongside promptVersion so you can correlate cost/request changes to a single change event.
- Ship retrieval changes behind a canary.
- Record a retrievalVersion or include it in promptVersion notes.
- Alert on avgInputTokens delta after retrieval changes.
- Review top drivers weekly and reset topK defaults that crept up.
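One lightweight way to record a retrievalVersion is to derive it from the config itself, so any change to topK, chunking, or the reranker produces a new tag automatically. This is a sketch under assumed field names, not a required schema.

```python
import hashlib
import json

def retrieval_version(config: dict) -> str:
    """Derive a stable retrievalVersion tag from the retrieval config.
    Canonical JSON (sorted keys) makes the hash deterministic, so the
    same config always maps to the same tag."""
    canonical = json.dumps(config, sort_keys=True)
    return "rv_" + hashlib.sha256(canonical.encode()).hexdigest()[:12]

config = {"topK": 4, "chunkSize": 512, "overlap": 64, "reranker": "none"}
```

Log the resulting tag alongside promptVersion on every request; a cost/request step change then lines up with exactly one config hash changing.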
What to send (payload example)
{
"externalRequestId": "req_01HZXB6MQZ2WQ9D2KCF9M4V2QY",
"provider": "provider_id",
"model": "model_id",
"endpointTag": "rag.answer",
"promptVersion": "rag_v4",
"userId": "tenant_acme_hash",
"inputTokens": 2100,
"outputTokens": 340,
"latencyMs": 892,
"status": "success",
"dataMode": "real",
"environment": "prod"
}
Common mistakes
- Increasing topK to fix one edge case and leaving it high for all traffic.
- Using heavy overlap (or duplicate sources) that silently doubles input tokens.
- Not caching retrieval for repeated queries and high-traffic tenants.
- Changing retrieval config without a version tag, then losing attribution.
- Debugging cost drift by changing models instead of measuring input-token growth.
How to verify in Opsmeter Dashboard
- Open Top Endpoints and identify the RAG endpointTag with the highest spend delta.
- Compare avgInputTokens and cost/request before and after the retrieval change window.
- Inspect outliers (highest inputTokens) and map them to document size, topK, and overlap.
- Check Prompt Versions (or deploy notes) to correlate the spike to a retrieval/prompt change.
- Apply a rollback (topK/overlap) and confirm cost/request returns to baseline.