Operations
Alerts inbox to root cause: drill-down workflow for fast containment
A clean drill-down path from alert to root cause reduces mean time to resolution (MTTR) and turns alerts into concrete operational actions.
Full guide: LLM cost attribution: endpoint, prompt version, tenant, and user
The investigation path that should be one click
- Alert event -> investigation time window
- Time window -> current vs baseline compare
- Top driver -> focused endpoint/prompt/user/tenant view
- Focused view -> containment action and postmortem note
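The first two hops of that path can be sketched in code. This is a minimal illustration, not a product API: the alert payload shape, the padding around the trigger time, and the one-week baseline offset are all assumptions you would tune to your own data.

```python
from datetime import datetime, timedelta

# Hypothetical alert payload; field names are illustrative, not a fixed schema.
alert = {
    "triggered_at": datetime(2024, 5, 1, 14, 30),
    "metric": "cost_per_request",
    "dimension": "endpointTag",
}

def investigation_window(alert, pad_minutes=30):
    """Alert event -> investigation time window (first hop of the drill-down)."""
    start = alert["triggered_at"] - timedelta(minutes=pad_minutes)
    end = alert["triggered_at"] + timedelta(minutes=pad_minutes)
    return start, end

def baseline_window(start, end, offset_days=7):
    """Time window -> the same window one week earlier, for the
    current-vs-baseline compare (second hop)."""
    delta = timedelta(days=offset_days)
    return start - delta, end - delta

cur_start, cur_end = investigation_window(alert)
base_start, base_end = baseline_window(cur_start, cur_end)
```

Keeping the baseline window the same length and weekday as the current window avoids comparing a weekday spike against weekend traffic.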
Dimension-specific drill-down map
- Endpoint driver -> Top Endpoints with focused endpointTag
- Prompt driver -> Prompt Versions with focused promptVersion
- User/Tenant driver -> Top Users with focused identity context
- All paths preserve date range, dataMode, and environment filters
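The map above is mechanical enough to express as a lookup table plus a link builder. A minimal sketch, assuming dashboard paths and query-parameter names that are purely illustrative; only the filter keys (dataMode, environment) and focus fields (endpointTag, promptVersion) come from the text.

```python
from urllib.parse import urlencode

# driver type -> (focused view, focus query parameter); paths are hypothetical.
DRIVER_VIEWS = {
    "endpoint": ("top-endpoints", "endpointTag"),
    "prompt": ("prompt-versions", "promptVersion"),
    "identity": ("top-users", "userId"),
}

def drilldown_url(driver_type, driver_value, shared_filters):
    """Build a focused-view link that preserves date range, dataMode,
    and environment across every drill-down path."""
    view, focus_param = DRIVER_VIEWS[driver_type]
    params = dict(shared_filters)  # carry shared filters through unchanged
    params[focus_param] = driver_value
    return f"/dashboard/{view}?{urlencode(params)}"

url = drilldown_url("prompt", "v42", {
    "from": "2024-05-01",
    "to": "2024-05-02",
    "dataMode": "live",
    "environment": "prod",
})
```

The point of the single builder is that no path can accidentally drop a filter: every focused view inherits the exact context the investigator was already looking at.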
Containment actions by driver type
- Volume driver: throttle non-critical endpoint traffic.
- Token driver: cap output tokens and roll back the promptVersion.
- Identity concentration: enforce tenant/user limits and investigate abuse.
- Unknown-model driver: fill the pricing map before finance reconciliation.
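Because each driver type maps to exactly one containment action, the dispatch is worth encoding so an unclassified spike fails loudly instead of silently. A sketch; the action strings mirror the list above, and the function name is illustrative.

```python
def containment_action(driver_type):
    """Map a confirmed spike driver to its immediate containment step."""
    actions = {
        "volume": "throttle non-critical endpoint traffic",
        "token": "cap output tokens and roll back the promptVersion",
        "identity": "enforce tenant/user limits and open an abuse review",
        "unknown-model": "fill the pricing map before finance reconciliation",
    }
    try:
        return actions[driver_type]
    except KeyError:
        # Force manual triage rather than guessing at containment.
        raise ValueError(f"unclassified driver: {driver_type!r}; triage manually")
```

Failing on an unknown driver type keeps the runbook honest: a spike that fits none of the four categories is itself a finding for the postmortem.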
Post-incident hardening
- Document root cause and exact containment step.
- Convert one manual action into a policy/threshold rule.
- Update runbook owner and escalation channel.
- Review weekly summaries for recurrence signals.
What to alert on
- cost/request drift by endpointTag or promptVersion
- unexpected tenant concentration in Top Users
- request burst with falling success ratio
- budget warning, spend-alert, and exceeded state transitions
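The first alert condition, cost/request drift by endpointTag or promptVersion, can be sketched as a simple comparison against a baseline period. The 25% threshold and the aggregate shape (`{"cost": ..., "requests": ...}` per key) are assumptions, not a documented schema.

```python
def cost_per_request_drift(current, baseline, threshold=0.25):
    """Return dimension keys whose cost/request drifted more than `threshold`
    relative to baseline. `current` and `baseline` are dicts keyed by
    endpointTag (or promptVersion) with {"cost": ..., "requests": ...}."""
    drifting = []
    for key, cur in current.items():
        base = baseline.get(key)
        # Skip keys with no baseline or no traffic; they need a different alert.
        if not base or base["requests"] == 0 or cur["requests"] == 0:
            continue
        cur_rate = cur["cost"] / cur["requests"]
        base_rate = base["cost"] / base["requests"]
        if base_rate > 0 and abs(cur_rate - base_rate) / base_rate > threshold:
            drifting.append((key, cur_rate, base_rate))
    return drifting

flagged = cost_per_request_drift(
    {"checkout": {"cost": 30.0, "requests": 100}},
    {"checkout": {"cost": 20.0, "requests": 100}},
)
```

Note that this rule deliberately uses a ratio of rates, not absolute spend, so a low-traffic endpoint with a sudden per-request jump still fires.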
Execution checklist
- Confirm spike type: volume, token, deploy, or abuse signal.
- Assign one incident owner and one communication channel.
- Apply immediate containment before deep optimization.
- Document the dominant endpoint, tenant, and promptVersion driver.
- Convert findings into one permanent guardrail update.
FAQ
Is userId required?
No. userId is optional but recommended for tenant-level attribution. If you need one, send a hashed identifier rather than raw PII.
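A hashed identifier can be as simple as a salted SHA-256 digest. A minimal sketch; the salt value is a placeholder you would replace with a per-deployment secret.

```python
import hashlib

def hashed_user_id(raw_id, salt="per-deployment-secret"):  # placeholder salt
    """Produce a stable, non-reversible identifier to send instead of a raw
    userId. The same raw_id always hashes to the same value, so tenant-level
    attribution still works without exposing the original identifier."""
    return hashlib.sha256(f"{salt}:{raw_id}".encode()).hexdigest()
```

Stability is the key property: attribution only works if the same user always maps to the same hash, which is why the salt must be fixed per deployment rather than random per request.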
Where should token usage values come from?
Prefer provider usage fields first. If unavailable, use tokenizer estimates and mark uncertainty in your workflow.
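That preference order can be captured in a small helper. A sketch under stated assumptions: the `usage`/`total_tokens` field names vary by provider, and the `estimated` flag is an illustrative way to mark uncertainty downstream.

```python
def token_usage(provider_response, tokenizer_estimate):
    """Prefer provider-reported token usage; fall back to a tokenizer
    estimate and flag the record so downstream analysis can surface
    the uncertainty."""
    usage = provider_response.get("usage")  # field name varies by provider
    if usage and "total_tokens" in usage:
        return {"tokens": usage["total_tokens"], "estimated": False}
    return {"tokens": tokenizer_estimate, "estimated": True}
```

Carrying the flag forward matters more than the fallback itself: a cost report that mixes exact and estimated tokens without saying so will mislead finance reconciliation.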
How should retries be handled?
Keep the same externalRequestId for the same logical request so idempotency remains stable across retries.
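The key detail is generating the id once per logical request, outside the retry loop. A minimal sketch; the `send` callable, payload shape, and backoff schedule are assumptions for illustration.

```python
import time
import uuid

def call_with_retries(send, payload, max_attempts=3):
    """Retries reuse one externalRequestId so the backend can deduplicate
    the same logical request across attempts."""
    external_request_id = str(uuid.uuid4())  # generated ONCE, not per attempt
    for attempt in range(max_attempts):
        try:
            return send({**payload, "externalRequestId": external_request_id})
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff

# Illustrative usage: a send function that fails once, then succeeds.
attempt_ids = []

def flaky_send(payload):
    attempt_ids.append(payload["externalRequestId"])
    if len(attempt_ids) == 1:
        raise ConnectionError("transient failure")
    return "ok"

result = call_with_retries(flaky_send, {"endpointTag": "checkout"})
```

If the id were generated inside the loop, every retry would look like a new request and the spend for one logical call would be double-counted.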
Can telemetry break production flow?
It should not. Use short timeouts, catch errors, and keep telemetry asynchronous so provider calls keep running.
Evaluation resources
For security and procurement reviews, use our trust summary before final tool selection.