Cost attribution
OpenAI cost per endpoint: how to compute cost per request correctly
Endpoint-level cost is where product decisions happen. Use a normalized request model and include retry overhead to avoid false conclusions.
Full guide: LLM cost attribution: endpoint, prompt version, tenant, and user
Computation model
- Map provider usage to input/output token fields.
- Attach endpointTag and promptVersion per request.
- Include retry attempts in effective cost per successful request.
- Aggregate by endpoint and tenant for ownership reviews.
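The computation model above can be sketched as a normalized request record plus an ownership rollup. This is a minimal illustration, not a prescribed schema: the field names (endpointTag, promptVersion, externalRequestId) come from the text, but the dataclass shape and the cost function are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class LlmRequest:
    external_request_id: str  # reused across retries of one logical request
    endpoint_tag: str         # e.g. "support.reply"
    prompt_version: str
    tenant_id: str
    model: str
    input_tokens: int         # mapped from provider usage fields
    output_tokens: int
    success: bool

def rollup_by_endpoint_tenant(requests, cost_fn):
    """Aggregate total cost per (endpoint_tag, tenant_id) for ownership reviews."""
    totals = defaultdict(float)
    for r in requests:
        totals[(r.endpoint_tag, r.tenant_id)] += cost_fn(r)
    return dict(totals)
```

Any per-request cost function can be plugged in; the point is that every request already carries the attribution keys when it reaches the aggregation step.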
Per-request cost formula (pricing table + token usage)
Do not hardcode prices into application logic. Use a versioned pricing table (with effective dates) and compute each request's cost from measured usage.
The baseline formula is: cost = inputTokens * inputPrice + outputTokens * outputPrice (plus any additional token classes your provider bills).
- Keep price snapshots immutable by effective date (audit-safe).
- Separate input vs output token rates (they often differ).
- Treat unknown-model requests as a queue to resolve, not as "other".
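The baseline formula and the bullets above can be combined into one lookup-and-compute step. A minimal sketch, assuming a hypothetical in-memory pricing table ("example-model" and its prices are illustrative, not real provider rates); each snapshot is immutable and selected by effective date:

```python
from bisect import bisect_right
from datetime import date

# Hypothetical pricing table: model -> sorted (effective_date, input_price, output_price).
# Prices here are per token; providers typically quote per 1M tokens.
PRICING = {
    "example-model": [
        (date(2024, 1, 1), 10.0 / 1_000_000, 30.0 / 1_000_000),
        (date(2024, 7, 1), 5.0 / 1_000_000, 15.0 / 1_000_000),
    ],
}

def request_cost(model, input_tokens, output_tokens, when):
    snapshots = PRICING.get(model)
    if snapshots is None:
        # Unknown model: surface it for resolution instead of bucketing as "other".
        raise LookupError(f"unknown model {model!r}: route to resolution queue")
    # Latest snapshot whose effective date is <= the request date (audit-safe).
    idx = bisect_right([d for d, _, _ in snapshots], when) - 1
    if idx < 0:
        raise LookupError(f"no pricing effective for {model!r} on {when}")
    _, in_price, out_price = snapshots[idx]
    return input_tokens * in_price + output_tokens * out_price
```

Because old snapshots are never mutated, recomputing last quarter's costs after a price change reproduces the original numbers.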
Retries, fallbacks, and success-adjusted endpoint cost
A cheaper model can still cost more overall if it needs more attempts to succeed. For endpoint ownership, track effective cost per successful request.
Roll retries and fallback calls into the same externalRequestId so one logical user action has one traceable cost.
- Reuse externalRequestId across retries for the same logical request.
- Record attempt number and final status (success/failure).
- Compute attempts-per-success by endpointTag and model.
- Alert when retries rise even if token price stays constant.
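The success-adjusted metric described above is straightforward once retries share an externalRequestId: charge every attempt (including failures and fallbacks) to the numerator, and count only distinct successful logical requests in the denominator. A minimal sketch, with assumed dict keys:

```python
def effective_cost_per_success(attempts, cost_fn):
    """attempts: records sharing external_request_id across retries of one
    logical request. All attempts count toward cost; only distinct successful
    logical requests count toward the denominator."""
    total_cost = sum(cost_fn(a) for a in attempts)
    succeeded = {a["external_request_id"] for a in attempts if a["success"]}
    return total_cost / len(succeeded) if succeeded else float("inf")
```

Computed per endpointTag and model, this is the number that exposes a "cheap" model whose retry rate quietly doubles its real cost.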
Common implementation errors
- Mixing test/demo rows with real traffic.
- Missing externalRequestId reuse across retries.
- Ignoring unknown-model rows in endpoint totals.
Example endpoint rollup (what the report should show)
- EndpointTag: support.reply
- Volume: request count + successful request count
- Cost: total cost + effective cost per successful request
- Tokens: avgInputTokens vs avgOutputTokens + p95 outliers
- Drivers: top tenants/users + promptVersion deltas in the same window
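The volume, cost, and token fields of the rollup above can be computed from per-request records in one pass. A sketch under assumed field names; the p95 uses a simple nearest-rank percentile, which is an implementation choice, not a requirement from the text:

```python
import math

def endpoint_rollup(rows):
    """rows: per-request dicts with cost, input_tokens, output_tokens, success.
    Returns the volume/cost/token fields of an endpoint report row."""
    def p95(xs):
        s = sorted(xs)
        return s[max(0, math.ceil(0.95 * len(s)) - 1)]  # nearest-rank percentile
    n = len(rows)
    n_success = sum(1 for r in rows if r["success"])
    total_cost = sum(r["cost"] for r in rows)
    return {
        "requests": n,
        "successful_requests": n_success,
        "total_cost": total_cost,
        "effective_cost_per_success": total_cost / n_success if n_success else None,
        "avg_input_tokens": sum(r["input_tokens"] for r in rows) / n,
        "avg_output_tokens": sum(r["output_tokens"] for r in rows) / n,
        "p95_input_tokens": p95([r["input_tokens"] for r in rows]),
    }
```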
What to report for decision-ready endpoint ownership
- cost/request plus token/request (input vs output split)
- top tenants and users driving the endpoint spend
- promptVersion changes in the same window
- retry ratio and fallback behavior that inflates "effective" cost
- a baseline comparison period so changes are explainable
What to alert on
- cost/request drift by endpointTag or promptVersion
- unexpected tenant concentration in Top Users
- request burst with falling success ratio
- budget warning, spend-alert, and exceeded state transitions
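The first alert above, cost/request drift by endpointTag or promptVersion, reduces to comparing the current window against the baseline period. A minimal sketch; the 25% threshold is an assumed default, not a recommendation from the text:

```python
def cost_per_request_drift(baseline, current, threshold=0.25):
    """Fire when cost/request moves more than `threshold` (relative) vs baseline.
    `baseline` and `current` are cost-per-request values for comparable windows."""
    if baseline == 0:
        return current > 0
    return abs(current - baseline) / baseline > threshold
```

The same comparison, keyed by promptVersion instead of endpointTag, catches a deploy that quietly changed token usage even when the per-token price stayed constant.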
Execution checklist
- Confirm spike type: volume, token, deploy, or abuse signal.
- Assign one incident owner and one communication channel.
- Apply immediate containment before deep optimization.
- Document the dominant endpoint, tenant, and promptVersion driver.
- Convert findings into one permanent guardrail update.
FAQ
Do we need per-user tracking for OpenAI cost monitoring?
It depends. If you are B2B, tenant-level tracking is usually the fastest path to margin control. Per-user tracking helps when you need granular abuse detection or internal chargeback.
Should we ignore unknown models until later?
No. Unknown-model rows break endpoint ownership reports. Treat unknown-model pricing as an operational queue and resolve it quickly so dashboards stay trustworthy.
What if token usage is missing for some requests?
Use provider usage when available. If it is missing, fall back to tokenizer estimates and flag those rows so audits and pricing decisions are not built on uncertain data.
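The fallback described in this answer can be sketched as a small helper: prefer provider-reported usage, and when it is absent, estimate and flag the row. The ~4-characters-per-token heuristic below is a loud assumption for illustration; a real tokenizer (e.g. tiktoken for OpenAI models) gives much better estimates:

```python
def usage_or_estimate(provider_usage, prompt_text, completion_text):
    """Prefer provider-reported usage; otherwise fall back to a rough
    ~4-chars-per-token heuristic (assumption) and flag the row as estimated
    so audits and pricing decisions can exclude or discount it."""
    if provider_usage is not None:
        return {**provider_usage, "estimated": False}
    return {
        "input_tokens": max(1, len(prompt_text) // 4),
        "output_tokens": max(1, len(completion_text) // 4),
        "estimated": True,  # surfaced in dashboards; never silently mixed in
    }
```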
Related guides
Evaluation resources
For security and procurement reviews, use our trust summary before final tool selection.