Per-tenant LLM margin operating model for AI SaaS
Tenant profitability connects telemetry to finance decisions. This pillar standardizes per-tenant margin workflows.
What to monitor weekly
- Tenant spend concentration and trend
- Tenant cost per workflow or feature
- Negative-margin tenant early signals
- Budget drift caused by one customer segment
Margin review cadence
- Review the top 10 tenants by spend and by margin delta.
- Track cost per workflow for high-volume tenants.
- Compare planned versus actual margin after major deploys.
- Flag tenants with sustained negative unit economics.
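The first review step can be sketched as a small ranking helper. This is a minimal illustration, not a prescribed implementation: the row fields `spend` and `marginDelta` are hypothetical names, and ranking margin delta by absolute size is an assumption about what "by margin delta" should mean in your review.

```python
def top_tenants(rows, key, n=10, by_magnitude=False):
    """Return tenantIds ranked by `key`, largest first.

    With by_magnitude=True the ranking uses the absolute value, so large
    negative margin swings surface alongside large positive ones.
    """
    if by_magnitude:
        score = lambda row: abs(row[key])
    else:
        score = lambda row: row[key]
    ranked = sorted(rows, key=score, reverse=True)
    return [row["tenantId"] for row in ranked[:n]]


# Hypothetical weekly rows; field names are assumptions for illustration.
rows = [
    {"tenantId": "acme", "spend": 100.0, "marginDelta": -5.0},
    {"tenantId": "beta", "spend": 50.0, "marginDelta": -40.0},
    {"tenantId": "core", "spend": 10.0, "marginDelta": 2.0},
]

by_spend = top_tenants(rows, "spend", n=2)
by_delta = top_tenants(rows, "marginDelta", n=2, by_magnitude=True)
```

Reviewing both lists side by side helps separate tenants that are simply large from tenants whose margin is moving.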
Actions when one tenant drives risk
- Apply tenant-specific budget thresholds.
- Tune model tiering for non-critical flows.
- Introduce pricing or quota adjustments with customer success.
- Document exception policy for strategic accounts.
Why per-tenant visibility is mandatory in AI SaaS
Workspace totals can look healthy while one customer quietly destroys margin. Tenant-level attribution turns “we spent more” into “this account drove the change”.
Per-tenant cost attribution also reduces internal debate: instead of guessing, you can show the feature mix, promptVersion changes, and endpoints responsible.
Minimum telemetry to make tenant profitability real
- tenantId (or stable tenant mapping) for commercial ownership
- endpointTag for feature-level margin breakdown
- promptVersion for deploy-linked cost drift
- dataMode/environment to keep reporting clean
- plan tier or segment label (so finance can interpret outcomes)
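A quick way to enforce the list above is to check each event for missing fields before it reaches reporting. The sketch below assumes events are plain dictionaries; `planTier` is a hypothetical field name standing in for the plan tier or segment label.

```python
# Field names follow the list above; "planTier" is an assumed name.
REQUIRED_FIELDS = ("tenantId", "endpointTag", "promptVersion", "dataMode", "planTier")


def missing_fields(event):
    """Return the required telemetry fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not event.get(name)]


incomplete = {"tenantId": "acme", "endpointTag": "summarize"}
gaps = missing_fields(incomplete)
```

Events with gaps can still be stored, but excluding them from margin reports keeps per-tenant totals interpretable.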
A simple margin model you can run weekly
You do not need perfect accounting to make good decisions. Start with a simple, repeatable model and improve it over time.
The goal is to identify negative-margin tenants early, understand the drivers, and choose a policy response: price, quota, routing, or product change.
- Gross margin (tenant) = revenue - (LLM cost + variable infra cost estimate)
- LLM cost (tenant) = sum(requestCost) grouped by tenantId
- Driver view = endpointTag + promptVersion + retry ratio over the same window
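The model above fits in a few lines of Python. This is a minimal sketch under stated assumptions: events are dictionaries carrying `tenantId`, `endpointTag`, `promptVersion`, and `requestCost`, and the revenue and infra figures come from your own finance data. The retry ratio is left out because its definition depends on how retries are recorded.

```python
from collections import defaultdict


def llm_cost_by_tenant(events):
    """LLM cost (tenant) = sum(requestCost) grouped by tenantId."""
    totals = defaultdict(float)
    for event in events:
        totals[event["tenantId"]] += event["requestCost"]
    return dict(totals)


def gross_margin(revenue, llm_cost, infra_cost_estimate):
    """Gross margin (tenant) = revenue - (LLM cost + variable infra cost estimate)."""
    return revenue - (llm_cost + infra_cost_estimate)


def driver_view(events, tenant_id):
    """Cost for one tenant grouped by (endpointTag, promptVersion)."""
    totals = defaultdict(float)
    for event in events:
        if event["tenantId"] == tenant_id:
            totals[(event["endpointTag"], event["promptVersion"])] += event["requestCost"]
    return dict(totals)


# Hypothetical sample data for one weekly window.
events = [
    {"tenantId": "acme", "endpointTag": "summarize", "promptVersion": "v2", "requestCost": 30.0},
    {"tenantId": "acme", "endpointTag": "chat", "promptVersion": "v1", "requestCost": 20.0},
    {"tenantId": "beta", "endpointTag": "chat", "promptVersion": "v1", "requestCost": 5.0},
]
costs = llm_cost_by_tenant(events)
acme_margin = gross_margin(revenue=40.0, llm_cost=costs["acme"], infra_cost_estimate=5.0)
acme_drivers = driver_view(events, "acme")
```

In this sample, acme ends the week at a negative margin, and the driver view points to the summarize endpoint on promptVersion v2 as the largest contributor.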
Segment tenants by cost drivers (not just ARR)
- High-volume / low-risk: route to cheaper models, strict output caps.
- Low-volume / high-stakes: allow flagship models with tighter QA gates.
- Tool-heavy workflows: monitor tool output bloat and step counts.
- RAG-heavy workflows: track avgInputTokens and retrieval parameters.
Per-tenant budgets and recovery actions
Tenant budgets are an escalation mechanism. They protect shared margin and create a clear owner path for exceptions.
When a tenant reaches the warning or exceeded state, you need a pre-decided set of actions that contain spend while preserving user experience.
- Warning: notify account owner with top endpointTag and promptVersion drivers.
- Contain: cap output tokens and throttle non-critical endpoints for the tenant.
- Degrade: route low-risk paths to cheaper models for the exceeded tenant.
- Decide: approve overrun, enforce quota, or upsell to a higher tier.
- Document: record one permanent policy update for the next cycle.
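The escalation path above can be expressed as a small state function plus a lookup of pre-decided actions. The 80% warning ratio and the action names are assumptions for illustration; substitute your own thresholds and playbook.

```python
def budget_state(spend, budget, warning_ratio=0.8):
    """Classify a tenant's spend against its budget.

    warning_ratio=0.8 is an assumed default, not a recommended value.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    if ratio >= 1.0:
        return "exceeded"
    if ratio >= warning_ratio:
        return "warning"
    return "ok"


# Hypothetical action names mirroring the Warning, Contain, and Degrade steps.
ACTIONS_BY_STATE = {
    "ok": [],
    "warning": ["notify_account_owner"],
    "exceeded": [
        "cap_output_tokens",
        "throttle_non_critical_endpoints",
        "route_low_risk_paths_to_cheaper_model",
    ],
}


def planned_actions(spend, budget):
    """Look up the pre-decided actions for the tenant's current state."""
    return ACTIONS_BY_STATE[budget_state(spend, budget)]
```

The Decide and Document steps stay with people: the code only selects the containment actions that were agreed in advance.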
Pricing levers that protect margin without harming retention
The best pricing policy matches cost drivers. If cost is driven by a small set of endpoints, attach quotas and overages to those workflows.
Avoid surprise enforcement. Use clear warning thresholds, transparent quotas, and a documented path to upgrades.
- Quota by feature (endpointTag) rather than only by total requests.
- Overage pricing for heavy usage instead of silently absorbing cost.
- Separate demo/test usage from billable reporting (dataMode).
- Use plan-tier routing rules to keep costs predictable.
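Quota by feature with overage pricing can be sketched as follows. The quota and price tables are hypothetical; the assumption here is that an endpointTag without a quota is unmetered and an endpointTag without an overage price is charged nothing.

```python
def overage_charge(usage_by_endpoint, quota_by_endpoint, overage_price_by_endpoint):
    """Sum overage charges across endpointTags for one tenant and one billing window."""
    total = 0.0
    for endpoint_tag, used in usage_by_endpoint.items():
        quota = quota_by_endpoint.get(endpoint_tag)
        if quota is None:
            continue  # assumed: no quota means the feature is unmetered
        extra_requests = max(0, used - quota)
        total += extra_requests * overage_price_by_endpoint.get(endpoint_tag, 0.0)
    return total


# Hypothetical billable usage; demo/test traffic should be filtered out by dataMode first.
charge = overage_charge(
    usage_by_endpoint={"summarize": 1200, "chat": 300},
    quota_by_endpoint={"summarize": 1000, "chat": 500},
    overage_price_by_endpoint={"summarize": 0.5},
)
```

Here only the summarize workflow exceeds its quota, so the charge attaches to the feature that actually drives cost.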
What to alert on
- cost/request drift by endpointTag or promptVersion
- unexpected tenant concentration in Top Users
- request burst with falling success ratio
- budget warning, spend-alert, and exceeded state transitions
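As one example, cost/request drift can be checked by comparing a baseline window with the current window. This is a minimal sketch: the 25% threshold is an assumed value, and keys with no baseline are skipped rather than alerted on.

```python
from collections import defaultdict


def cost_per_request_by(events, key):
    """Average requestCost per value of `key` (for example endpointTag)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for event in events:
        totals[event[key]] += event["requestCost"]
        counts[event[key]] += 1
    return {name: totals[name] / counts[name] for name in totals}


def drift_alerts(baseline_events, current_events, key="endpointTag", threshold=0.25):
    """Return {key value: relative drift} where drift exceeds the threshold."""
    baseline = cost_per_request_by(baseline_events, key)
    current = cost_per_request_by(current_events, key)
    alerts = {}
    for name, current_cost in current.items():
        baseline_cost = baseline.get(name)
        if not baseline_cost:
            continue  # no usable baseline, so drift is undefined
        drift = current_cost / baseline_cost - 1.0
        if drift > threshold:
            alerts[name] = drift
    return alerts
```

Passing `key="promptVersion"` gives the deploy-linked view of the same check.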
Execution checklist
- Confirm spike type: volume, token, deploy, or abuse signal.
- Assign one incident owner and one communication channel.
- Apply immediate containment before deep optimization.
- Document the dominant endpoint, tenant, and promptVersion driver.
- Convert findings into one permanent guardrail update.
FAQ
Is userId required?
No. userId is optional, but recommended for tenant-level attribution. If needed, send a hashed identifier.
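One way to produce a hashed identifier is a keyed hash, so the raw ID never leaves your system and cannot be recovered by hashing guessed values without the key. This sketch uses HMAC-SHA256 from the Python standard library; the function name and key handling are assumptions.

```python
import hashlib
import hmac


def hashed_user_id(raw_user_id, secret_key):
    """Return a stable, non-reversible identifier for a raw user ID.

    secret_key is bytes and should be kept private and constant, otherwise
    the same user maps to different identifiers over time.
    """
    return hmac.new(secret_key, raw_user_id.encode("utf-8"), hashlib.sha256).hexdigest()
```

The same input and key always yield the same identifier, which keeps attribution consistent across requests.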
Where should token usage values come from?
Prefer provider usage fields first. If they are unavailable, use tokenizer estimates and flag those values as estimated in your workflow.
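That preference order can be sketched as a small resolver. The `inputTokens` and `outputTokens` keys are hypothetical (provider response shapes differ), and the default estimator is a crude characters-divided-by-four heuristic that stands in for a real tokenizer.

```python
def rough_token_estimate(text):
    """Crude stand-in for a tokenizer: roughly four characters per token (assumption)."""
    if not text:
        return 0
    return max(1, len(text) // 4)


def resolve_token_usage(provider_usage, prompt_text, completion_text, estimate=rough_token_estimate):
    """Prefer provider-reported usage; fall back to estimates and flag them."""
    if provider_usage and "inputTokens" in provider_usage and "outputTokens" in provider_usage:
        return {
            "inputTokens": provider_usage["inputTokens"],
            "outputTokens": provider_usage["outputTokens"],
            "estimated": False,
        }
    return {
        "inputTokens": estimate(prompt_text),
        "outputTokens": estimate(completion_text),
        "estimated": True,
    }
```

Carrying the `estimated` flag through to reporting lets finance see how much of a tenant's cost rests on approximations.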
How should retries be handled?
Keep the same externalRequestId for the same logical request so idempotency remains stable across retries.
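A minimal sketch of that rule: generate the externalRequestId once per logical request, outside the retry loop, and reuse it on every attempt. The `send` callable and the retry count are hypothetical, and catching every exception is a simplification.

```python
import uuid


def call_with_retries(send, payload, max_attempts=3):
    """Retry `send`, reusing one externalRequestId for the whole logical request."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    external_request_id = str(uuid.uuid4())  # created once, before any attempt
    last_error = None
    for _ in range(max_attempts):
        try:
            return send({**payload, "externalRequestId": external_request_id})
        except Exception as error:  # a real client would catch narrower errors
            last_error = error
    raise last_error
```

Because every attempt carries the same ID, the receiving side can deduplicate retries instead of counting them as separate requests.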
Can telemetry break production flow?
It should not. Use short timeouts, catch errors, and keep telemetry asynchronous so provider calls keep running.
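A minimal sketch of that pattern using a background thread: the telemetry call runs off the request path, receives a short timeout, and swallows its own errors. The `send` callable is hypothetical and is expected to honor the timeout it is given, for example by passing it to its HTTP client.

```python
import threading


def send_telemetry_async(send, event, timeout_seconds=0.5):
    """Fire-and-forget telemetry; failures never propagate to the caller."""

    def worker():
        try:
            send(event, timeout_seconds)
        except Exception:
            pass  # telemetry errors must not affect the provider call

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

The caller can ignore the returned thread entirely; it is returned only so tests or shutdown hooks can wait for delivery.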
Evaluation resources
For security and procurement reviews, use our trust summary before final tool selection.