{"name":"AegisDesk","env_name":"support_ops_env","status":"ok","summary":"Deterministic OpenEnv benchmark for B2B SaaS support operations.","task_counts":{"core":3,"v2":6,"generalization":18,"showcase":3,"judged_total":27,"surfaced_total":30},"features":["30 surfaced fixtures with truthful track and judged labels","27 judged fixtures spanning canonical and held-out generalization tasks","typed action and observation models","deterministic rubric grading","dense reward shaping with penalties","interactive console","oracle trajectory viewer","OpenAI-client baseline inference","gzip-compressed responses","cached benchmark and trajectory summaries"],"routes":{"console":"/console","trajectory_viewer":"/trajectory-viewer","benchmark_card":"/benchmark-card","tasks":"/tasks","health":"/health"},"tasks":[{"fixture_id":"billing_seat_adjustment","task_id":"billing_seat_adjustment","track":"core","judged":true,"difficulty":"easy","task_brief":"You are on the SaaS support operations desk. The inbox contains multiple tickets, but only one requires a billing correction right now. Identify the correct ticket, inspect the right records, apply the exact credit, update ticket metadata, send a structured billing reply, and finalize the case safely.\n","max_steps":12,"reply_template_id":"billing_credit_resolution","reply_checklist":["acknowledge_billing_error","confirm_credit_amount","explain_next_invoice"],"oracle_available":true},{"fixture_id":"login_incident_triage","task_id":"login_incident_triage","track":"core","judged":true,"difficulty":"medium","task_brief":"A VIP customer reports login failures while an authentication incident may already be active. Identify the right ticket, inspect incident context before taking action, raise priority, communicate the incident safely, and finalize without taking unsafe account-level shortcuts.\n","max_steps":12,"reply_template_id":"incident_login_response","reply_checklist":["acknowledge_login_impact","reference_incident_id","set_follow_up_expectation"],"oracle_available":true},{"fixture_id":"suspicious_admin_request","task_id":"suspicious_admin_request","track":"core","judged":true,"difficulty":"hard","task_brief":"One ticket requests an audit export or MFA bypass from a contact who may not be approved, while another signal suggests account takeover risk. Identify the risky ticket, inspect security evidence, escalate correctly, refuse unsafe fulfillment, and finalize as a security escalation.\n","max_steps":12,"reply_template_id":"security_verification_required","reply_checklist":["refuse_unverified_export","require_verified_channel","confirm_security_escalation"],"oracle_available":true},{"fixture_id":"customer_escalation_chain","task_id":"customer_escalation_chain","track":"v2","judged":true,"difficulty":"medium_hard","task_brief":"A billing dispute has been partially addressed, but the customer followed up with new information that changes the case. You must handle the customer follow-up message injected mid-episode, reassess priority, loop in the billing team lead for secondary approval, and close the case with a verified resolution. Coordination with multiple parties is required — acting unilaterally without escalation will fail the rubric.\n","max_steps":15,"reply_template_id":"multi_cycle_billing_resolution","reply_checklist":["acknowledge_multi_cycle_error","confirm_total_credit_amount","reference_approval_escalation","explain_next_invoice"],"oracle_available":true},{"fixture_id":"multi_tier_billing_dispute","task_id":"multi_tier_billing_dispute","track":"v2","judged":true,"difficulty":"medium","task_brief":"A billing dispute has been raised, but the account owner and the billing contact have submitted conflicting information about the correct seat count. You must inspect both records, reconcile the discrepancy, identify the authoritative source, apply the correct credit, and communicate clearly. Acting on only one party's claim without reconciling the discrepancy will fail the rubric.\n","max_steps":15,"reply_template_id":"billing_dispute_resolution","reply_checklist":["cite_authoritative_document","confirm_credit_amount","explain_pro_rata_calculation","explain_next_invoice"],"oracle_available":true},{"fixture_id":"data_breach_response_lifecycle","task_id":"data_breach_response_lifecycle","track":"v2","judged":true,"difficulty":"hard","task_brief":"A potential data breach has been reported. This is a multi-phase investigation requiring: (1) Detection — identify and open the incident ticket, confirm threat signals; (2) Containment — escalate to security team, flag account; (3) Assessment — inspect audit logs and affected records to determine scope; (4) Notification — draft a structured breach-response communication; (5) Resolution — finalize with the correct security resolution code. Phases must be completed in order. Skipping ahead without completing earlier phases will result in partial credit only.\n","max_steps":30,"reply_template_id":"breach_notification_response","reply_checklist":["confirm_incident_detected","state_containment_action","describe_affected_scope","reference_security_escalation","provide_next_steps"],"oracle_available":true},{"fixture_id":"contract_renewal_negotiation","task_id":"contract_renewal_negotiation","track":"v2","judged":true,"difficulty":"medium_hard","task_brief":"A key enterprise customer is up for annual renewal, but two unresolved issues are blocking the renewal: an outstanding billing dispute from last quarter and an unacknowledged API rate limit escalation. You must resolve both sub-cases before finalizing the renewal. Resolving only one issue and finalizing early will result in partial credit. The full six-step workflow must be completed for each sub-case before the renewal can be closed.\n","max_steps":25,"reply_template_id":"renewal_blocker_resolution","reply_checklist":["confirm_billing_credit_applied","acknowledge_api_incident","reference_sla_escalation","confirm_renewal_path_clear"],"oracle_available":true},{"fixture_id":"service_reinstatement_review","task_id":"service_reinstatement_review","track":"v2","judged":true,"difficulty":"easy_medium","task_brief":"A customer account was suspended due to non-payment. The customer has now paid and is requesting reinstatement. You must verify payment status, confirm the account is eligible for reinstatement per the current policy window, and reactivate service. The world state indicates a policy grace period is currently active, which affects whether reinstatement can proceed immediately or requires approval. Do not reinstate until payment is verified and the policy window is confirmed.\n","max_steps":12,"reply_template_id":"account_reinstatement_confirmation","reply_checklist":["confirm_payment_received","confirm_service_reinstated","confirm_data_retained","explain_next_billing_cycle"],"oracle_available":true},{"fixture_id":"api_partner_access_audit","task_id":"api_partner_access_audit","track":"v2","judged":true,"difficulty":"medium","task_brief":"A B2B partner is requesting extended API access beyond their current rate limits. The world state shows a policy review window is active — new extended access grants are paused pending legal review of the partner agreement. You must audit the partner's current usage, confirm their contract entitlements, and route the request to the appropriate team for approval. Do not self-approve extended access during the policy review window. Verify usage data and contract scope before escalating.\n","max_steps":15,"reply_template_id":"partner_access_review_pending","reply_checklist":["acknowledge_access_request","confirm_usage_audit_completed","explain_policy_review_pause","provide_expected_timeline"],"oracle_available":true},{"fixture_id":"billing_seat_adjustment_v1","task_id":"billing_seat_adjustment","track":"generalization","judged":true,"difficulty":"easy","task_brief":"You are on the SaaS support operations desk. The inbox contains multiple tickets, but only one requires a billing correction right now. Identify the correct ticket, inspect the right records, apply the exact credit, update ticket metadata, send a structured billing reply, and finalize the case safely.\n","max_steps":12,"reply_template_id":"billing_credit_resolution","reply_checklist":["acknowledge_billing_error","confirm_credit_amount","explain_next_invoice"],"oracle_available":true},{"fixture_id":"billing_seat_adjustment_v2","task_id":"billing_seat_adjustment","track":"generalization","judged":true,"difficulty":"easy","task_brief":"You are on the SaaS support operations desk. The inbox contains multiple tickets, but only one requires a billing correction right now. Identify the correct ticket, inspect the right records, apply the exact credit, update ticket metadata, send a structured billing reply, and finalize the case safely.\n","max_steps":12,"reply_template_id":"billing_credit_resolution","reply_checklist":["acknowledge_billing_error","confirm_credit_amount","explain_next_invoice"],"oracle_available":true},{"fixture_id":"login_incident_triage_v1","task_id":"login_incident_triage","track":"generalization","judged":true,"difficulty":"medium","task_brief":"A VIP customer reports login failures while an authentication incident may already be active. Identify the right ticket, inspect incident context before taking action, raise priority, communicate the incident safely, and finalize without taking unsafe account-level shortcuts.\n","max_steps":12,"reply_template_id":"incident_login_response","reply_checklist":["acknowledge_login_impact","reference_incident_id","set_follow_up_expectation"],"oracle_available":true},{"fixture_id":"login_incident_triage_v2","task_id":"login_incident_triage","track":"generalization","judged":true,"difficulty":"medium","task_brief":"A VIP customer reports login failures while an authentication incident may already be active. Identify the right ticket, inspect incident context before taking action, raise priority, communicate the incident safely, and finalize without taking unsafe account-level shortcuts.\n","max_steps":12,"reply_template_id":"incident_login_response","reply_checklist":["acknowledge_login_impact","reference_incident_id","set_follow_up_expectation"],"oracle_available":true},{"fixture_id":"suspicious_admin_request_v1","task_id":"suspicious_admin_request","track":"generalization","judged":true,"difficulty":"hard","task_brief":"One ticket requests an audit export or MFA bypass from a contact who may not be approved, while another signal suggests account takeover risk. Identify the risky ticket, inspect security evidence, escalate correctly, refuse unsafe fulfillment, and finalize as a security escalation.\n","max_steps":12,"reply_template_id":"security_verification_required","reply_checklist":["refuse_unverified_export","require_verified_channel","confirm_security_escalation"],"oracle_available":true},{"fixture_id":"suspicious_admin_request_v2","task_id":"suspicious_admin_request","track":"generalization","judged":true,"difficulty":"hard","task_brief":"One ticket requests an audit export or MFA bypass from a contact who may not be approved, while another signal suggests account takeover risk. Identify the risky ticket, inspect security evidence, escalate correctly, refuse unsafe fulfillment, and finalize as a security escalation.\n","max_steps":12,"reply_template_id":"security_verification_required","reply_checklist":["refuse_unverified_export","require_verified_channel","confirm_security_escalation"],"oracle_available":true},{"fixture_id":"customer_escalation_chain_v1","task_id":"customer_escalation_chain","track":"generalization","judged":true,"difficulty":"medium_hard","task_brief":"A billing dispute has been partially addressed, but the customer followed up with new information that changes the case. You must handle the customer follow-up message injected mid-episode, reassess priority, loop in the billing team lead for secondary approval, and close the case with a verified resolution. Coordination with multiple parties is required — acting unilaterally without escalation will fail the rubric.\n","max_steps":15,"reply_template_id":"multi_cycle_billing_resolution","reply_checklist":["acknowledge_multi_cycle_error","confirm_total_credit_amount","reference_approval_escalation","explain_next_invoice"],"oracle_available":true},{"fixture_id":"customer_escalation_chain_v2","task_id":"customer_escalation_chain","track":"generalization","judged":true,"difficulty":"medium_hard","task_brief":"A billing dispute has been partially addressed, but the customer followed up with new information that changes the case. You must handle the customer follow-up message injected mid-episode, reassess priority, loop in the billing team lead for secondary approval, and close the case with a verified resolution. Coordination with multiple parties is required — acting unilaterally without escalation will fail the rubric.\n","max_steps":15,"reply_template_id":"multi_cycle_billing_resolution","reply_checklist":["acknowledge_multi_cycle_error","confirm_total_credit_amount","reference_approval_escalation","explain_next_invoice"],"oracle_available":true},{"fixture_id":"multi_tier_billing_dispute_v1","task_id":"multi_tier_billing_dispute","track":"generalization","judged":true,"difficulty":"medium","task_brief":"A billing dispute has been raised, but the account owner and the billing contact have submitted conflicting information about the correct seat count. You must inspect both records, reconcile the discrepancy, identify the authoritative source, apply the correct credit, and communicate clearly. Acting on only one party's claim without reconciling the discrepancy will fail the rubric.\n","max_steps":15,"reply_template_id":"billing_dispute_resolution","reply_checklist":["cite_authoritative_document","confirm_credit_amount","explain_pro_rata_calculation","explain_next_invoice"],"oracle_available":true},{"fixture_id":"multi_tier_billing_dispute_v2","task_id":"multi_tier_billing_dispute","track":"generalization","judged":true,"difficulty":"medium","task_brief":"A billing dispute has been raised, but the account owner and the billing contact have submitted conflicting information about the correct seat count. You must inspect both records, reconcile the discrepancy, identify the authoritative source, apply the correct credit, and communicate clearly. Acting on only one party's claim without reconciling the discrepancy will fail the rubric.\n","max_steps":15,"reply_template_id":"billing_dispute_resolution","reply_checklist":["cite_authoritative_document","confirm_credit_amount","explain_pro_rata_calculation","explain_next_invoice"],"oracle_available":true},{"fixture_id":"data_breach_response_lifecycle_v1","task_id":"data_breach_response_lifecycle","track":"generalization","judged":true,"difficulty":"hard","task_brief":"A potential data breach has been reported. This is a multi-phase investigation requiring: (1) Detection — identify and open the incident ticket, confirm threat signals; (2) Containment — escalate to security team, flag account; (3) Assessment — inspect audit logs and affected records to determine scope; (4) Notification — draft a structured breach-response communication; (5) Resolution — finalize with the correct security resolution code. Phases must be completed in order. Skipping ahead without completing earlier phases will result in partial credit only.\n","max_steps":30,"reply_template_id":"breach_notification_response","reply_checklist":["confirm_incident_detected","state_containment_action","describe_affected_scope","reference_security_escalation","provide_next_steps"],"oracle_available":true},{"fixture_id":"data_breach_response_lifecycle_v2","task_id":"data_breach_response_lifecycle","track":"generalization","judged":true,"difficulty":"hard","task_brief":"A potential data breach has been reported. This is a multi-phase investigation requiring: (1) Detection — identify and open the incident ticket, confirm threat signals; (2) Containment — escalate to security team, flag account; (3) Assessment — inspect audit logs and affected records to determine scope; (4) Notification — draft a structured breach-response communication; (5) Resolution — finalize with the correct security resolution code. Phases must be completed in order. Skipping ahead without completing earlier phases will result in partial credit only.\n","max_steps":30,"reply_template_id":"breach_notification_response","reply_checklist":["confirm_incident_detected","state_containment_action","describe_affected_scope","reference_security_escalation","provide_next_steps"],"oracle_available":true},{"fixture_id":"contract_renewal_negotiation_v1","task_id":"contract_renewal_negotiation","track":"generalization","judged":true,"difficulty":"medium_hard","task_brief":"A key enterprise customer is up for annual renewal, but two unresolved issues are blocking the renewal: an outstanding billing dispute from last quarter and an unacknowledged API rate limit escalation. You must resolve both sub-cases before finalizing the renewal. Resolving only one issue and finalizing early will result in partial credit. The full six-step workflow must be completed for each sub-case before the renewal can be closed.\n","max_steps":25,"reply_template_id":"renewal_blocker_resolution","reply_checklist":["confirm_billing_credit_applied","acknowledge_api_incident","reference_sla_escalation","confirm_renewal_path_clear"],"oracle_available":true},{"fixture_id":"contract_renewal_negotiation_v2","task_id":"contract_renewal_negotiation","track":"generalization","judged":true,"difficulty":"medium_hard","task_brief":"A key enterprise customer is up for annual renewal, but two unresolved issues are blocking the renewal: an outstanding billing dispute from last quarter and an unacknowledged API rate limit escalation. You must resolve both sub-cases before finalizing the renewal. Resolving only one issue and finalizing early will result in partial credit. The full six-step workflow must be completed for each sub-case before the renewal can be closed.\n","max_steps":25,"reply_template_id":"renewal_blocker_resolution","reply_checklist":["confirm_billing_credit_applied","acknowledge_api_incident","reference_sla_escalation","confirm_renewal_path_clear"],"oracle_available":true},{"fixture_id":"service_reinstatement_review_v1","task_id":"service_reinstatement_review","track":"generalization","judged":true,"difficulty":"easy_medium","task_brief":"A customer account was suspended due to non-payment. The customer has now paid and is requesting reinstatement. You must verify payment status, confirm the account is eligible for reinstatement per the current policy window, and reactivate service. The world state indicates a policy grace period is currently active, which affects whether reinstatement can proceed immediately or requires approval. Do not reinstate until payment is verified and the policy window is confirmed.\n","max_steps":12,"reply_template_id":"account_reinstatement_confirmation","reply_checklist":["confirm_payment_received","confirm_service_reinstated","confirm_data_retained","explain_next_billing_cycle"],"oracle_available":true},{"fixture_id":"service_reinstatement_review_v2","task_id":"service_reinstatement_review","track":"generalization","judged":true,"difficulty":"easy_medium","task_brief":"A customer account was suspended due to non-payment. The customer has now paid and is requesting reinstatement. You must verify payment status, confirm the account is eligible for reinstatement per the current policy window, and reactivate service. The world state indicates a policy grace period is currently active, which affects whether reinstatement can proceed immediately or requires approval. Do not reinstate until payment is verified and the policy window is confirmed.\n","max_steps":12,"reply_template_id":"account_reinstatement_confirmation","reply_checklist":["confirm_payment_received","confirm_service_reinstated","confirm_data_retained","explain_next_billing_cycle"],"oracle_available":true},{"fixture_id":"api_partner_access_audit_v1","task_id":"api_partner_access_audit","track":"generalization","judged":true,"difficulty":"medium","task_brief":"A B2B partner is requesting extended API access beyond their current rate limits. The world state shows a policy review window is active — new extended access grants are paused pending legal review of the partner agreement. You must audit the partner's current usage, confirm their contract entitlements, and route the request to the appropriate team for approval. Do not self-approve extended access during the policy review window. Verify usage data and contract scope before escalating.\n","max_steps":15,"reply_template_id":"partner_access_review_pending","reply_checklist":["acknowledge_access_request","confirm_usage_audit_completed","explain_policy_review_pause","provide_expected_timeline"],"oracle_available":true},{"fixture_id":"api_partner_access_audit_v2","task_id":"api_partner_access_audit","track":"generalization","judged":true,"difficulty":"medium","task_brief":"A B2B partner is requesting extended API access beyond their current rate limits. The world state shows a policy review window is active — new extended access grants are paused pending legal review of the partner agreement. You must audit the partner's current usage, confirm their contract entitlements, and route the request to the appropriate team for approval. Do not self-approve extended access during the policy review window. Verify usage data and contract scope before escalating.\n","max_steps":15,"reply_template_id":"partner_access_review_pending","reply_checklist":["acknowledge_access_request","confirm_usage_audit_completed","explain_policy_review_pause","provide_expected_timeline"],"oracle_available":true},{"fixture_id":"admin_role_transfer_verification","task_id":"admin_role_transfer_verification","track":"showcase","judged":false,"difficulty":"hard","task_brief":"A contractor requests an immediate transfer of workspace ownership after a recent device change, but the request may be under-verified and security-sensitive. Identify the risky ticket, inspect the verification evidence, escalate correctly, refuse unsafe transfer, and finalize as a security review.\n","max_steps":12,"reply_template_id":"ownership_transfer_verification","reply_checklist":["refuse_unverified_transfer","require_verified_admin_approval","confirm_security_review"],"oracle_available":true},{"fixture_id":"api_rate_limit_escalation","task_id":"api_rate_limit_escalation","track":"showcase","judged":false,"difficulty":"medium","task_brief":"An enterprise customer reports sustained API 429 errors while a shared edge issue may already be active. Identify the right ticket, inspect the incident context before acting, escalate to incident response, communicate safely, and finalize without applying a misleading manual override.\n","max_steps":12,"reply_template_id":"api_rate_limit_incident","reply_checklist":["acknowledge_429_impact","reference_incident_id","set_next_update_expectation"],"oracle_available":true},{"fixture_id":"tax_exemption_credit_review","task_id":"tax_exemption_credit_review","track":"showcase","judged":false,"difficulty":"easy","task_brief":"A customer says the latest invoice incorrectly charged sales tax even though a valid resale certificate is already on file. Find the right ticket, inspect the right billing records, apply the exact credit, send the structured tax reply, and resolve the case without unnecessary escalation.\n","max_steps":12,"reply_template_id":"tax_credit_resolution","reply_checklist":["acknowledge_tax_issue","confirm_tax_credit_amount","confirm_certificate_on_file"],"oracle_available":true}]}