OpenEnv Benchmark • Live Space

AegisDesk

A real-world agent benchmark for B2B SaaS support operations. Agents must triage a realistic support inbox, inspect the right records, follow policy, avoid unsafe shortcuts, and finalize a deterministic, gradable resolution.
30 surfaced fixtures
27 judged fixtures
3 showcase fixtures
Deterministic scores in [0, 1]
OpenAI-client inference path

Why This Project Stands Out

Core Tasks: 3
Round 2 Tasks: 6
Held-out Generalization: 18
Showcase Fixtures: 3
Scoring: Deterministic

Real workflow, not a toy

Each episode models support-operations judgment: ticket selection, evidence gathering, safe escalation, and structured customer communication.

Judge-friendly by design

The benchmark includes an interactive console, an oracle trajectory viewer, a reproducible inference script, and captured validation results.

Dense rewards with safety penalties

Agents receive partial credit for meaningful progress and are penalized for loops, irrelevant inspection, and unsafe direct actions.
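
To make the scoring idea concrete, here is a minimal Python sketch of how partial credit and penalties could combine into a deterministic score in [0, 1]. The rubric item names, penalty names, and weights are all hypothetical; this is not AegisDesk's actual grader.

```python
from dataclasses import dataclass, field

# Hypothetical penalty weights; the real grader's values are not given on this page.
PENALTY_WEIGHTS = {
    "loop": 0.05,
    "irrelevant_inspection": 0.03,
    "unsafe_action": 0.25,
}


@dataclass
class EpisodeLog:
    # Completed rubric items mapped to their (hypothetical) partial-credit weight.
    rubric_credit: dict = field(default_factory=dict)
    # One entry per penalty occurrence, e.g. ["loop", "loop", "unsafe_action"].
    penalties: list = field(default_factory=list)


def final_score(log: EpisodeLog) -> float:
    """Sum partial credit, subtract penalties, and clamp the result into [0, 1]."""
    earned = sum(log.rubric_credit.values())
    lost = sum(PENALTY_WEIGHTS[name] for name in log.penalties)
    return max(0.0, min(1.0, earned - lost))


# Usage: full rubric credit with one loop penalty.
log = EpisodeLog(
    rubric_credit={"selected_ticket": 0.2, "gathered_evidence": 0.3, "final_resolution": 0.5},
    penalties=["loop"],
)
print(final_score(log))
```

Because the function depends only on the recorded episode log, the same trajectory always yields the same score, and the clamp keeps heavily penalized episodes from dropping below zero.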

Live Routes

/console: Manual benchmark UI for resetting episodes, sending structured actions, and inspecting observation/state updates in real time.
/trajectory-viewer: Judge-friendly oracle viewer with per-step rewards, rubric progress, penalties, and final score breakdown.
/benchmark-card: Compact machine-readable summary of task counts, validation posture, and public benchmark routes.
/tasks: Fixture catalog of all 30 surfaced fixtures, each accurately labeled core, v2, generalization, or showcase and carrying judged/oracle metadata.

Task Catalog Highlights

Core

`billing_seat_adjustment`, `login_incident_triage`, and `suspicious_admin_request` remain the stable baseline tasks used for direct before-versus-after comparison.

Round 2

`customer_escalation_chain`, `multi_tier_billing_dispute`, `data_breach_response_lifecycle`, `contract_renewal_negotiation`, `service_reinstatement_review`, and `api_partner_access_audit` add multi-agent, long-horizon, and world-aware behavior.

Held-out Generalization

Eighteen surfaced variants serve as judged held-out fixtures. Agents can train on the 9 canonical tasks (3 core plus 6 Round 2) and then be tested for transfer to unseen but structurally similar support cases, which makes the generalization claim more credible than a handful of test cases would.

Showcase

`tax_exemption_credit_review`, `api_rate_limit_escalation`, and `admin_role_transfer_verification` remain available as showcase demos and oracle-viewer examples without changing the main training story.
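
The catalog numbers above fit together as simple arithmetic, which the following Python sketch checks. The label strings and the rule that only showcase fixtures are unjudged are assumptions inferred from the counts on this page, not a documented schema.

```python
# Fixture counts per label, as stated on this page.
# The exact label strings are assumed for illustration.
CATALOG = {"core": 3, "v2": 6, "generalization": 18, "showcase": 3}

# Assumption: every label except "showcase" counts as judged.
JUDGED_LABELS = {"core", "v2", "generalization"}


def summarize(catalog: dict) -> dict:
    """Derive the headline counts from the per-label fixture counts."""
    return {
        "surfaced": sum(catalog.values()),
        "judged": sum(n for label, n in catalog.items() if label in JUDGED_LABELS),
        "canonical_train": catalog["core"] + catalog["v2"],
        "held_out": catalog["generalization"],
    }


print(summarize(CATALOG))
# {'surfaced': 30, 'judged': 27, 'canonical_train': 9, 'held_out': 18}
```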

API clients and validators still receive the standard JSON health response from / unless they request HTML. This landing page is a human-facing view layered on top of the same judged contract.
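
The JSON-by-default behavior described above can be sketched as a small content-negotiation check on the `Accept` header. This is an illustrative Python sketch, not the Space's actual handler: the function names, the health payload, and the HTML body are hypothetical, and quality values (`q=`) are ignored for simplicity.

```python
import json


def wants_html(accept_header: str) -> bool:
    """Return True only when the client explicitly lists text/html."""
    for part in accept_header.split(","):
        # Drop parameters such as ";q=0.9" and normalize the media type.
        media_type = part.split(";", 1)[0].strip().lower()
        if media_type == "text/html":
            return True
    return False


def root_response(accept_header: str = "") -> tuple:
    """Pick the representation for "/": JSON unless HTML is requested."""
    if wants_html(accept_header):
        # Hypothetical stand-in for the landing page markup.
        return "text/html", "<h1>AegisDesk</h1>"
    # Hypothetical health payload; the real response shape is not shown on this page.
    return "application/json", json.dumps({"status": "ok"})


# A browser-style header gets HTML; a wildcard or missing header gets JSON.
print(root_response("text/html,application/xhtml+xml")[0])
print(root_response("*/*")[0])
```

Treating `*/*` as a JSON request is the key design choice in this sketch: API clients and validators that send a wildcard or no `Accept` header keep receiving the health response, while browsers, which list `text/html` explicitly, see the landing page.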