Each episode models support-operations judgment: ticket selection, evidence gathering, safe escalation, and structured customer communication.
The benchmark includes an interactive console, an oracle trajectory viewer, a reproducible inference script, and captured validation results.
Agents earn partial credit for meaningful progress and are penalized for loops, irrelevant inspection, or unsafe direct actions.
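The partial-credit-minus-penalties rule above can be sketched as follows. The penalty names, weights, and the `score_episode` helper are illustrative assumptions for demonstration, not the benchmark's actual values.

```python
# Sketch of partial-credit scoring with loop/irrelevance/unsafe penalties.
# All weights and penalty keys here are assumptions, not benchmark constants.

PENALTY_WEIGHTS = {
    "loop": 0.10,           # repeating an already-taken action
    "irrelevant": 0.05,     # inspecting evidence unrelated to the ticket
    "unsafe_action": 0.30,  # direct action that should have been escalated
}

def score_episode(rubric_hits: int, rubric_total: int, penalties: list) -> float:
    """Return a score in [0, 1]: rubric progress minus accumulated penalties."""
    progress = rubric_hits / rubric_total if rubric_total else 0.0
    deduction = sum(PENALTY_WEIGHTS.get(p, 0.0) for p in penalties)
    return max(0.0, progress - deduction)

# 3 of 4 rubric items hit, with one loop and one irrelevant inspection.
print(score_episode(3, 4, ["loop", "irrelevant"]))
```

Clamping at zero keeps a heavily penalized episode from going negative while still rewarding any real progress.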
/console
Manual benchmark UI for resetting episodes, sending structured actions, and inspecting observation/state updates in real time.
/trajectory-viewer
Judge-friendly oracle viewer with per-step rewards, rubric progress, penalties, and final score breakdown.
/benchmark-card
Compact machine-readable summary of task counts, validation posture, and public benchmark routes.
/tasks
Fixture catalog containing 30 surfaced fixtures, truthfully labeled as core, v2, generalization, or showcase, plus judged/oracle metadata.
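Since /benchmark-card is machine-readable, consumers can total the surfaced fixtures directly from it. The payload shape and field names below are assumptions sketched from the counts stated on this page (9 canonical, 18 held-out, 3 showcase), not the card's actual schema.

```python
import json

# Hypothetical /benchmark-card payload; field names are assumptions.
EXAMPLE_CARD = json.loads("""
{
  "task_counts": {"canonical": 9, "held_out": 18, "showcase": 3},
  "routes": ["/console", "/trajectory-viewer", "/benchmark-card", "/tasks"]
}
""")

def total_surfaced_fixtures(card: dict) -> int:
    """Sum the per-bucket task counts into the overall surfaced-fixture total."""
    return sum(card["task_counts"].values())

print(total_surfaced_fixtures(EXAMPLE_CARD))  # 30, matching the /tasks catalog
```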
`billing_seat_adjustment`, `login_incident_triage`, and `suspicious_admin_request` remain the stable baseline tasks used for direct before-versus-after comparison.
`customer_escalation_chain`, `multi_tier_billing_dispute`, `data_breach_response_lifecycle`, `contract_renewal_negotiation`, `service_reinstatement_review`, and `api_partner_access_audit` add multi-agent, long-horizon, and world-aware behavior.
Eighteen surfaced variants now serve as judged held-out fixtures, so we can train on the 9 canonical tasks and test transfer to unseen but structurally similar support cases at a more credible benchmark scale.
`tax_exemption_credit_review`, `api_rate_limit_escalation`, and `admin_role_transfer_verification` remain available as showcase demos and oracle-viewer examples without changing the main training story.
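The train/held-out separation described above amounts to a label-based partition of the fixture catalog. The label values and helper below are illustrative assumptions; only the bucket sizes (9 canonical, 18 held-out, 3 showcase) come from this page.

```python
# Partition surfaced fixtures into train and held-out sets by label.
# Fixture names and label strings are assumptions for demonstration.

fixtures = (
    [{"name": f"canonical_{i}", "label": "canonical"} for i in range(9)]
    + [{"name": f"variant_{i}", "label": "held_out"} for i in range(18)]
    + [{"name": f"showcase_{i}", "label": "showcase"} for i in range(3)]
)

def split_fixtures(items: list):
    """Train on canonical tasks; hold out judged variants; skip showcase demos."""
    train = [f for f in items if f["label"] == "canonical"]
    test = [f for f in items if f["label"] == "held_out"]
    return train, test

train, test = split_fixtures(fixtures)
print(len(train), len(test))  # 9 18
```

Showcase fixtures fall into neither split, which is what keeps them demo-only without changing the main training story.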
/
Plain-text landing view served to clients by default unless they request HTML.
This landing page is a human-facing view layered on top of the same judged contract.
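The text-by-default behavior of the root route can be sketched as simple Accept-header content negotiation. The helper name and the exact negotiation rule are assumptions, not the server's actual implementation.

```python
# Sketch of serving plain text at / unless the client asks for HTML.
# The function name and matching rule are illustrative assumptions.

def pick_representation(accept_header: str) -> str:
    """Return 'html' when the Accept header asks for HTML, else 'text'."""
    return "html" if "text/html" in accept_header.lower() else "text"

print(pick_representation("text/html,application/xhtml+xml"))  # html
print(pick_representation("*/*"))                              # text
```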