Opsmeter
AI Cost & Inference Control


RAG context creep: how top-k and chunk size inflate cost

Top-k, chunk overlap, and retrieval fan-out can double spend without changing model list price. Treat retrieval configuration as a cost control surface.

Prompt versions · Architecture · Operations

Full guide: Prompt deploy cost regressions: catch silent cost spikes

Why context growth happens

  • Top-k defaults increase after relevance tuning and never reset.
  • Chunk overlap multiplies duplicate tokens across similar passages.
  • Fallback retrieval chains pull extra context even on simple prompts.

A simple cost model for retrieval (what actually drives the bill)

RAG spend is mostly input-token spend. Retrieval settings decide how many tokens you stuff into the prompt before the model even starts answering.

A rough model is: inputTokens ~= systemPrompt + userPrompt + (topK * chunkSize) + ((topK - 1) * overlap) + toolOutputs. Small parameter changes compound quickly at scale.

  • topK increases input tokens linearly (and can increase latency more than linearly when retrieval fans out into tool chains).
  • Chunk overlap re-sends duplicated tokens on every request.
  • Long "instruction" system prompts can hide inside retrieval wrappers and grow over time.
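The cost model above can be sketched as a small function. The function name, parameters, and example numbers are illustrative assumptions for this guide, not part of any Opsmeter API:

```python
# Rough input-token model for a single RAG request. Tune the numbers
# to your own pipeline; these defaults are purely illustrative.
def estimate_input_tokens(system_prompt: int, user_prompt: int,
                          top_k: int, chunk_size: int,
                          overlap: int, tool_output: int = 0) -> int:
    """Estimate prompt tokens before the model starts answering.

    Each retrieved chunk contributes chunk_size tokens; overlapping
    chunks duplicate roughly `overlap` tokens per adjacent pair.
    """
    retrieval = top_k * chunk_size + max(top_k - 1, 0) * overlap
    return system_prompt + user_prompt + retrieval + tool_output

# Example: bumping top_k from 4 to 8 with 512-token chunks and 64-token overlap.
before = estimate_input_tokens(400, 80, top_k=4, chunk_size=512, overlap=64)
after = estimate_input_tokens(400, 80, top_k=8, chunk_size=512, overlap=64)
print(before, after)  # the retrieval payload roughly doubles
```

Plugging in your real chunk size and overlap makes the "double spend without changing list price" effect concrete before you ship the change.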

What to measure per deploy

  1. avgInputTokens delta by endpointTag
  2. cost/request drift by promptVersion
  3. retrieval hit-rate versus token growth
  4. latency regression when context payload grows
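Metrics 1 and 2 reduce to grouping request records and diffing averages across a deploy window. A minimal sketch, assuming records shaped like the payload example later in this guide (the record dicts here are invented sample data):

```python
from collections import defaultdict

def avg_input_tokens_by_tag(records):
    """Average inputTokens per endpointTag over a window of request records."""
    sums, counts = defaultdict(int), defaultdict(int)
    for r in records:
        sums[r["endpointTag"]] += r["inputTokens"]
        counts[r["endpointTag"]] += 1
    return {tag: sums[tag] / counts[tag] for tag in sums}

def delta_by_tag(before_avgs, after_avgs):
    """Positive delta means context grew after the deploy."""
    return {tag: after_avgs[tag] - before_avgs[tag]
            for tag in after_avgs if tag in before_avgs}

# Sample windows before and after a retrieval change.
before = [{"endpointTag": "rag.answer", "inputTokens": 2000},
          {"endpointTag": "rag.answer", "inputTokens": 2200}]
after = [{"endpointTag": "rag.answer", "inputTokens": 4100},
         {"endpointTag": "rag.answer", "inputTokens": 4300}]
deltas = delta_by_tag(avg_input_tokens_by_tag(before),
                      avg_input_tokens_by_tag(after))
print(deltas)  # a large positive delta flags context creep
```

The same grouping works for cost/request by promptVersion: swap the key and the summed field.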

Symptoms that look like "model pricing" problems (but are retrieval drift)

  • Cost/request rises while request volume stays flat.
  • Latency rises in the same window as avgInputTokens.
  • Answer quality is unchanged, but the context payload has doubled.
  • A subset of tenants suddenly dominates spend due to long documents or repeated queries.

How to reduce RAG token cost without losing accuracy

  • Use dynamic top-k (lower for simple questions, higher for complex ones).
  • Add a reranker and retrieve fewer, higher-quality chunks.
  • Reduce chunk overlap; overlap is duplicated tokens every request.
  • Compress context (summary or extractive highlights) before the final call.
  • Cache retrieval results for repeated queries and high-traffic tenants.
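The first suggestion, dynamic top-k, can start as a simple heuristic. The sketch below uses word count as a stand-in for query complexity; a real system would use a classifier or reranker score, and every threshold here is an assumption:

```python
def dynamic_top_k(query: str, base_k: int = 3, max_k: int = 8) -> int:
    """Heuristic sketch: scale top-k with query complexity.

    Word count is a crude proxy for complexity; replace it with a
    classifier or reranker confidence in production.
    """
    words = len(query.split())
    if words <= 8:          # simple lookup-style question
        return base_k
    if words <= 20:         # moderately complex question
        return min(base_k + 2, max_k)
    return max_k            # long, multi-part question
```

Even this crude version keeps short lookup queries from paying the worst-case retrieval bill on every request.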

Make retrieval changes attributable (treat it like a deploy)

Most teams version prompt changes but forget retrieval configuration. That makes cost drift feel random.

Version retrieval config (topK, chunking, reranker) alongside promptVersion so you can correlate cost/request changes to a single change event.

  1. Ship retrieval changes behind a canary.
  2. Record a retrievalVersion or include it in promptVersion notes.
  3. Alert on avgInputTokens delta after retrieval changes.
  4. Review top drivers weekly and reset topK defaults that crept up.
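Step 2 can be as simple as hashing the retrieval config into a stable tag. The `rcfg_` prefix and field names below are hypothetical, not an Opsmeter convention:

```python
import hashlib
import json

def retrieval_version(config: dict) -> str:
    """Derive a deterministic version tag from retrieval configuration.

    Sorting keys makes the hash stable across dict ordering, so any
    change to topK, chunking, or reranker yields a new version string.
    """
    canonical = json.dumps(config, sort_keys=True)
    return "rcfg_" + hashlib.sha256(canonical.encode()).hexdigest()[:12]

v1 = retrieval_version({"topK": 4, "chunkSize": 512, "overlap": 64, "reranker": "none"})
v2 = retrieval_version({"topK": 8, "chunkSize": 512, "overlap": 64, "reranker": "none"})
print(v1, v2)  # different configs, different tags
```

Attach the resulting tag to every request (or fold it into promptVersion notes) and cost drift stops feeling random: each spike maps to one config hash.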

What to send (payload example)

{
  "externalRequestId": "req_01HZXB6MQZ2WQ9D2KCF9M4V2QY",
  "provider": "provider_id",
  "model": "model_id",
  "endpointTag": "rag.answer",
  "promptVersion": "rag_v4",
  "userId": "tenant_acme_hash",
  "inputTokens": 2100,
  "outputTokens": 340,
  "latencyMs": 892,
  "status": "success",
  "dataMode": "real",
  "environment": "prod"
}
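Before sending events like the one above, a small client-side check can catch missing fields. The required-field set below is inferred from the example payload, not from a published Opsmeter schema, so adjust it to the fields your account actually requires:

```python
# Assumed required fields, taken from the example payload above.
REQUIRED = {"externalRequestId", "provider", "model", "endpointTag",
            "promptVersion", "inputTokens", "outputTokens",
            "latencyMs", "status", "environment"}

def validate_event(event: dict) -> list[str]:
    """Return the names of required fields missing from a usage event."""
    return sorted(REQUIRED - event.keys())
```

Validating at the call site is cheaper than debugging why a tenant's spend vanished from the dashboard because events were silently dropped.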

Common mistakes

  • Increasing topK to fix one edge case and leaving it high for all traffic.
  • Using heavy overlap (or duplicate sources) that silently doubles input tokens.
  • Not caching retrieval for repeated queries and high-traffic tenants.
  • Changing retrieval config without a version tag, then losing attribution.
  • Debugging cost drift by changing models instead of measuring input-token growth.

How to verify in Opsmeter Dashboard

  1. Open Top Endpoints and identify the RAG endpointTag with the highest spend delta.
  2. Compare avgInputTokens and cost/request before and after the retrieval change window.
  3. Inspect outliers (highest inputTokens) and map them to document size, topK, and overlap.
  4. Check Prompt Versions (or deploy notes) to correlate the spike to a retrieval/prompt change.
  5. Apply a rollback (topK/overlap) and confirm cost/request returns to baseline.

Related guides

  • Open prompt regression guide
  • Open operations docs
  • Compare alternatives

Evaluation resources

For security and procurement reviews, use our trust summary before final tool selection.

Open trust proof pack