Opsmeter.io
AI Cost & Inference Control


RAG context creep: how top-k and chunk size inflate cost

Top-k, chunk overlap, and retrieval fan-out can double spend without changing model list price. Treat retrieval configuration as a cost control surface.

Prompt versions · Architecture · Operations

Full guide: Prompt deploy cost regressions: catch silent cost spikes

What this guide answers

  • What changed in cost, cost per request, or budget posture.
  • Which endpoint, prompt, model, or tenant likely drove the delta.
  • Which validation step or control to apply next in Opsmeter.io.

What to send (payload example)

{
  "externalRequestId": "req_01HZXB6MQZ2WQ9D2KCF9M4V2QY",
  "provider": "provider_id",
  "model": "model_id",
  "endpointTag": "rag.answer",
  "promptVersion": "rag_v4",
  "userId": "tenant_acme_hash",
  "inputTokens": 2100,
  "outputTokens": 340,
  "latencyMs": 892,
  "status": "success",
  "dataMode": "real",
  "environment": "prod"
}
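A minimal sketch of constructing and validating this payload before sending it. The required-field set and the `build_usage_event` helper are assumptions for illustration, not part of the Opsmeter.io SDK; the real API may enforce a different schema.

```python
import json

# Fields this sketch treats as mandatory; the real ingest API may differ.
REQUIRED_FIELDS = {
    "externalRequestId", "provider", "model", "endpointTag",
    "promptVersion", "inputTokens", "outputTokens", "status",
}

def build_usage_event(**fields) -> str:
    """Validate required keys and token counts, then serialize to JSON."""
    missing = REQUIRED_FIELDS - fields.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    for key in ("inputTokens", "outputTokens"):
        if not isinstance(fields[key], int) or fields[key] < 0:
            raise ValueError(f"{key} must be a non-negative integer")
    return json.dumps(fields)

event = build_usage_event(
    externalRequestId="req_01HZXB6MQZ2WQ9D2KCF9M4V2QY",
    provider="provider_id",
    model="model_id",
    endpointTag="rag.answer",
    promptVersion="rag_v4",
    inputTokens=2100,
    outputTokens=340,
    status="success",
)
```

Validating client-side keeps malformed events out of your attribution data, so endpoint and prompt-version rollups stay trustworthy.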

Common mistakes

  • Increasing topK to fix one edge case and leaving it high for all traffic.
  • Using heavy overlap (or duplicate sources) that silently doubles input tokens.
  • Not caching retrieval for repeated queries and high-traffic tenants.
  • Changing retrieval config without a version tag, then losing attribution.
  • Debugging cost drift by changing models instead of measuring input-token growth.

How to verify in the Opsmeter.io dashboard

  1. Open Top Endpoints and identify the RAG endpointTag with the highest spend delta.
  2. Compare avgInputTokens and cost/request before and after the retrieval change window.
  3. Inspect outliers (highest inputTokens) and map them to document size, topK, and overlap.
  4. Check Prompt Versions (or deploy notes) to correlate the spike to a retrieval/prompt change.
  5. Apply a rollback (topK/overlap) and confirm cost/request returns to baseline.
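Step 2 of the checklist above can be sketched as a before/after split over raw request records. The record shape mirrors the payload example; the per-1K-token prices are illustrative placeholders, not any provider's list prices.

```python
from statistics import mean

# Illustrative per-1K-token prices; substitute your provider's real rates.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}

def cost(rec):
    """Cost of one request from its token counts."""
    return (rec["inputTokens"] * PRICE_PER_1K["input"]
            + rec["outputTokens"] * PRICE_PER_1K["output"]) / 1000

def cost_per_request(records, endpoint_tag, since_change):
    """Split one endpoint's records at the change timestamp and
    return (baseline cost/request, post-change cost/request)."""
    hits = [r for r in records if r["endpointTag"] == endpoint_tag]
    before = [cost(r) for r in hits if r["ts"] < since_change]
    after = [cost(r) for r in hits if r["ts"] >= since_change]
    return mean(before), mean(after)
```

If the post-change average does not return to the baseline after a rollback, the driver is likely elsewhere (tenant mix, document size) rather than the retrieval config you reverted.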

Why context growth happens

  • Top-k defaults increase after relevance tuning and never reset.
  • Chunk overlap multiplies duplicate tokens across similar passages.
  • Fallback retrieval chains pull extra context even on simple prompts.

Use this workflow

Turn diagnosis into action

Identify the cost driver, validate it with attribution, then apply one durable control before the next billing cycle.

Apply in your workspace

Re-run this workflow on your own spend data

Follow the same path from article insight to telemetry verification, then validate with your own cost signals.

  • Quickstart path: send a first payload, confirm attribution, then return here for operations context. (Open quickstart)
  • Evaluation path: pair this guide with trust proof, status, and compare surfaces during review. (Open trust proof pack)

A simple cost model for retrieval (what actually drives the bill)

RAG spend is mostly input-token spend. Retrieval settings decide how many tokens you stuff into the prompt before the model even starts answering.

A rough model is: inputTokens ~= systemPrompt + userPrompt + topK * (chunkSize + overlap) + tool outputs. Overlap sits inside the multiplier because every retrieved chunk carries its own duplicated margin. Small parameter changes compound quickly at scale.

  • topK increases input tokens linearly (and often latency too, since longer contexts and extra tool calls add processing time).
  • Chunk overlap is duplicated tokens every request.
  • Long "instruction" system prompts can hide inside retrieval wrappers and grow over time.
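The rough model above can be turned into a quick estimate. The chunk size, overlap, and prompt lengths below are illustrative numbers, not recommendations:

```python
def estimate_input_tokens(system_prompt, user_prompt, top_k, chunk_size,
                          overlap, tool_output=0):
    """Rough input-token estimate: each retrieved chunk contributes
    its own tokens plus the overlap margin it duplicates."""
    return (system_prompt + user_prompt
            + top_k * (chunk_size + overlap) + tool_output)

# Bumping topK from 4 to 8 with 512-token chunks and 64-token overlap:
baseline = estimate_input_tokens(400, 150, top_k=4, chunk_size=512, overlap=64)
tuned = estimate_input_tokens(400, 150, top_k=8, chunk_size=512, overlap=64)
```

Here doubling topK takes the estimate from 2,854 to 5,158 input tokens, roughly 1.8x the prompt cost per request, with zero change to model list price.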

What to measure per deploy

  1. avgInputTokens delta by endpointTag
  2. cost/request drift by promptVersion
  3. retrieval hit-rate versus token growth
  4. latency regression when context payload grows

Symptoms that look like "model pricing" problems (but are retrieval drift)

  • Cost/request rises while request volume stays flat.
  • Latency rises in the same window as avgInputTokens.
  • Answer quality is unchanged, but the context payload has doubled.
  • A subset of tenants suddenly dominates spend due to long documents or repeated queries.

How to reduce RAG token cost without losing accuracy

  • Use dynamic top-k (lower for simple questions, higher for complex ones).
  • Add a reranker and retrieve fewer, higher-quality chunks.
  • Reduce chunk overlap; overlap is duplicated tokens every request.
  • Compress context (summary or extractive highlights) before the final call.
  • Cache retrieval results for repeated queries and high-traffic tenants.
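The first and last bullets can be sketched together. The complexity heuristic (question length and clause count) and the stubbed retriever are assumptions for illustration; a real system would key the cache on a normalized query and an index version.

```python
from functools import lru_cache

def dynamic_top_k(question: str, base_k: int = 3, max_k: int = 8) -> int:
    """Crude heuristic: longer, multi-clause questions get more chunks."""
    words = len(question.split())
    clauses = question.count(",") + question.count(" and ")
    return min(base_k + words // 20 + clauses, max_k)

def retrieve(query: str, top_k: int):
    """Placeholder for a real vector-store lookup."""
    return tuple(f"doc_{i}" for i in range(top_k))

@lru_cache(maxsize=4096)
def cached_retrieve(query: str, top_k: int):
    """Memoize retrieval for repeated (query, top_k) pairs so hot
    queries stop paying the retrieval (and token) cost twice."""
    return retrieve(query, top_k)
```

The point is that top-k becomes a per-request decision with a bounded ceiling, instead of a global default that only ever drifts upward.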

Make retrieval changes attributable (treat it like a deploy)

Most teams version prompt changes but forget retrieval configuration. That makes cost drift feel random.

Version retrieval config (topK, chunking, reranker) alongside promptVersion so you can correlate cost/request changes to a single change event.

  1. Ship retrieval changes behind a canary.
  2. Record a retrievalVersion or include it in promptVersion notes.
  3. Alert on avgInputTokens delta after retrieval changes.
  4. Review top drivers weekly and reset topK defaults that crept up.
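Step 3 above can be sketched as a simple threshold check over pre- and post-change samples. The 20% default threshold is an arbitrary placeholder; tune it to your endpoint's normal variance.

```python
def input_token_alert(before, after, threshold_pct=20.0):
    """Return an alert message when avgInputTokens grows more than
    threshold_pct across a retrieval/prompt change, else None."""
    base = sum(before) / len(before)
    cur = sum(after) / len(after)
    growth = (cur - base) / base * 100
    if growth > threshold_pct:
        return (f"avgInputTokens up {growth:.0f}% "
                f"({base:.0f} -> {cur:.0f}); check retrievalVersion diff")
    return None
```

Firing this on canary traffic, keyed by endpointTag and retrievalVersion, turns "cost feels random" into a single attributable change event.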

Related guides

  • Open prompt regression guide
  • Open operations docs
  • Compare alternatives

Evaluation resources

For security and procurement reviews, use our trust summary before final tool selection.

Open trust proof pack