Opsmeter
AI Cost & Inference Control

Model strategy

Choosing models for cost: when to use mini vs flagship models

Model tiering is a product decision. Use business-risk segmentation to reserve flagship models for high-value paths.

Operations · Architecture

Full guide: LLM cost attribution: endpoint, prompt version, tenant, and user

Tiering pattern

  • Mini model for high-volume low-risk tasks.
  • Flagship model for high-stakes user outcomes.
  • Fallback route only when quality threshold fails.
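The tiering pattern above can be sketched as a small router. This is a minimal illustration, not Opsmeter's implementation: the model names, the `quality_ok` check, and the `call_model` function are all assumptions supplied by the caller.

```python
# Hypothetical tier-to-model map; model names are placeholders.
TIER_MODELS = {
    "tier1": "flagship-model",  # high-stakes user outcomes
    "tier3": "mini-model",      # high-volume, low-risk tasks
}

def route(endpoint_tier: str, prompt: str, call_model, quality_ok) -> str:
    """Try the tier's default model; use the fallback route only
    when the quality threshold fails."""
    model = TIER_MODELS.get(endpoint_tier, "mini-model")
    answer = call_model(model, prompt)
    if quality_ok(answer):
        return answer
    # Explicit fallback: escalate to the flagship model.
    return call_model(TIER_MODELS["tier1"], prompt)
```

The fallback is deliberately explicit so that fallback-rate spikes can be counted and alerted on, as described below.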

Decision checklist

  1. Map endpointTag to business criticality tier.
  2. Track quality and cost together by promptVersion.
  3. Alert on fallback-rate spikes that affect margin.
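Step 3 of the checklist, alerting on fallback-rate spikes, can be approximated with a simple per-endpointTag rate check. The 5% threshold and the event shape here are illustrative assumptions; real alerting would also window by time.

```python
def fallback_alerts(events, threshold=0.05):
    """events: iterable of (endpoint_tag, used_fallback: bool) pairs.
    Returns endpoint tags whose fallback rate exceeds the threshold."""
    totals, fallbacks = {}, {}
    for tag, used_fallback in events:
        totals[tag] = totals.get(tag, 0) + 1
        if used_fallback:
            fallbacks[tag] = fallbacks.get(tag, 0) + 1
    return [tag for tag in totals
            if fallbacks.get(tag, 0) / totals[tag] > threshold]
```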

Safe rollout approach

  • Start with one endpointTag and one tenant cohort.
  • Measure success-adjusted cost (include retries and rework).
  • Use promptVersion to tie results to deploy windows.
  • Keep an explicit fallback route and alert on fallback spikes.
  • Promote changes only when both quality and cost improve.
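"Success-adjusted cost" from the rollout steps above can be computed as total spend, including retries and rework calls, divided by successful outcomes. A minimal sketch, assuming each call record carries a cost and a status (field names are illustrative, not Opsmeter's schema):

```python
def success_adjusted_cost(calls):
    """calls: list of dicts with 'costUsd' and 'status' keys.
    Returns spend per successful outcome; retries and rework
    inflate the numerator without adding to the denominator."""
    total_cost = sum(c["costUsd"] for c in calls)
    successes = sum(1 for c in calls if c["status"] == "success")
    return total_cost / successes if successes else float("inf")
```

This is why a cheaper model can cost more in practice: if half its calls are retries, its effective cost per success doubles.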

Map endpointTag to risk tiers (simple and effective)

The same model choice can be great for one endpoint and terrible for another. The right unit is the endpointTag (feature path).

Tier endpoints by business risk and user impact, then assign model tiers intentionally.

  • Tier 1 (high-stakes): flagship models with strict regression monitoring
  • Tier 2 (medium): balanced models + output caps
  • Tier 3 (low-risk/high-volume): mini models + aggressive caps and throttles
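The tier assignment can live in a plain mapping from endpointTag to tier. The tags below are examples (only `checkout.ai_summary` appears in this guide's payload); defaulting unknown endpoints to Tier 1 is a conservative assumption, not a requirement.

```python
# Illustrative endpointTag -> risk-tier map.
ENDPOINT_TIERS = {
    "checkout.ai_summary": 1,  # high-stakes: flagship + regression monitoring
    "support.draft_reply": 2,  # medium: balanced model + output caps
    "search.autocomplete": 3,  # low-risk/high-volume: mini + caps and throttles
}

def tier_for(endpoint_tag: str) -> int:
    # Unknown endpoints default to the most conservative tier.
    return ENDPOINT_TIERS.get(endpoint_tag, 1)
```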

Quality gates that prevent “cheaper model” regressions

  • Track success rate and retry ratio by endpointTag after rollout.
  • Watch rework loops (users asking for re-answers) as a hidden multiplier.
  • Measure outputTokens growth (verbosity drift) by promptVersion.
  • Compare cost/request, not just token price.
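The last gate, comparing cost per request rather than token price, is worth making concrete. A sketch with made-up prices: a mini model with a lower token price can still cost more per successful request once its retry ratio is factored in.

```python
def cost_per_success(price_per_1k_tokens, tokens_per_call, retry_ratio):
    """retry_ratio: extra calls per successful request
    (1.0 means every success needed one retry on average)."""
    calls_per_success = 1 + retry_ratio
    return price_per_1k_tokens * tokens_per_call / 1000 * calls_per_success

# Illustrative numbers only:
mini = cost_per_success(0.5, 1000, 1.0)      # cheap tokens, heavy retries
flagship = cost_per_success(0.8, 1000, 0.0)  # pricier tokens, no retries
```

Here the "cheaper" mini model ends up at 1.0 per success versus 0.8 for the flagship.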

Common mistakes

  1. Switching models globally instead of per endpointTag.
  2. Optimizing token price while retries increase total cost.
  3. No promptVersion tagging, so changes cannot be traced.
  4. No caps or degraded-mode policy when budgets hit warning/exceeded.

What to send (payload example)

{
  "externalRequestId": "req_01HZXB6MQZ2WQ9D2KCF9M4V2QY",
  "provider": "provider_id",
  "model": "model_id",
  "endpointTag": "checkout.ai_summary",
  "promptVersion": "summary_v3",
  "userId": "tenant_acme_hash",
  "inputTokens": 540,
  "outputTokens": 180,
  "latencyMs": 892,
  "status": "success",
  "dataMode": "real",
  "environment": "prod"
}
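A payload like the one above can be posted with the standard library. The ingest URL and auth header below are placeholders, not Opsmeter's real API; check the API documentation for the actual endpoint and authentication scheme.

```python
import json
import urllib.request

def build_event_request(payload: dict, api_key: str,
                        url: str = "https://api.example.com/v1/events"):
    """Build a POST request carrying the event payload as JSON.
    Send it with urllib.request.urlopen(...) when ready."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

Building the request separately from sending it keeps the payload easy to inspect and test before anything leaves the process.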

Common payload mistakes

  • Missing endpointTag or using inconsistent naming across teams.
  • Not tagging promptVersion, so deploys cannot be linked to spend changes.
  • Sending raw user identifiers instead of hashed mapping for privacy.
  • Mixing demo/test dataMode into production operational reviews.
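On the hashed-identifier point: one way to avoid sending raw user identifiers is a keyed hash of the tenant/user id, so the value is stable for attribution but not reversible without the key. The salt handling and truncation here are assumptions for illustration.

```python
import hashlib
import hmac

def hash_user_id(raw_id: str, salt: bytes) -> str:
    """Deterministic, privacy-preserving userId for cost attribution.
    Same raw_id + salt always yields the same value."""
    digest = hmac.new(salt, raw_id.encode("utf-8"), hashlib.sha256)
    return "tenant_" + digest.hexdigest()[:16]
```

Keep the salt in a secrets store so the mapping stays consistent across services without exposing raw identifiers.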

How to verify in Opsmeter Dashboard

  1. Use Overview to confirm spike window and budget posture.
  2. Use Top Endpoints to find feature-level concentration.
  3. Use Top Users to find tenant-level concentration.
  4. Use Prompt Versions to validate deploy-linked cost drift.

Related guides

  • Open cost-per-call pillar
  • Open compare hub
  • Compare alternatives

Evaluation resources

For security and procurement reviews, use our trust summary before final tool selection.

Open trust proof pack