Each episode models support-operations judgment: ticket selection, evidence gathering, safe escalation, and structured customer communication.
The benchmark includes an interactive console, an oracle trajectory viewer, a reproducible inference script, and captured validation results.
Agents earn partial credit for meaningful progress and are penalized for loops, irrelevant inspection, or unsafe direct actions.
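The partial-credit-minus-penalties rule above can be sketched as follows. The penalty names, weights, and the `score_episode` helper are illustrative assumptions for demonstration, not the benchmark's actual values.

```python
# Sketch of partial-credit scoring with loop/irrelevance/unsafe penalties.
# All weights and penalty keys here are assumptions, not benchmark constants.

PENALTY_WEIGHTS = {
    "loop": 0.10,           # repeating an already-taken action
    "irrelevant": 0.05,     # inspecting evidence unrelated to the ticket
    "unsafe_action": 0.30,  # direct action that should have been escalated
}

def score_episode(rubric_hits: int, rubric_total: int, penalties: list) -> float:
    """Return a score in [0, 1]: rubric progress minus accumulated penalties."""
    progress = rubric_hits / rubric_total if rubric_total else 0.0
    deduction = sum(PENALTY_WEIGHTS.get(p, 0.0) for p in penalties)
    return max(0.0, progress - deduction)

# 3 of 4 rubric items hit, with one loop and one irrelevant inspection.
print(score_episode(3, 4, ["loop", "irrelevant"]))
```

Clamping at zero keeps a heavily penalized episode from going negative while still rewarding any real progress.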
/console
Manual benchmark UI for resetting episodes, sending structured actions, and inspecting observation/state updates in real time.
/trajectory-viewer
Judge-friendly oracle viewer with per-step rewards, rubric progress, penalties, and final score breakdown.
/benchmark-card
Compact machine-readable summary of task counts, validation posture, and public benchmark routes.
/tasks
Fixture catalog containing 30 surfaced fixtures, truthfully labeled as core, v2, generalization, or showcase, plus judged/oracle metadata.
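Since /benchmark-card is machine-readable, consumers can total the surfaced fixtures directly from it. The payload shape and field names below are assumptions sketched from the counts stated on this page (9 canonical, 18 held-out, 3 showcase), not the card's actual schema.

```python
import json

# Hypothetical /benchmark-card payload; field names are assumptions.
EXAMPLE_CARD = json.loads("""
{
  "task_counts": {"canonical": 9, "held_out": 18, "showcase": 3},
  "routes": ["/console", "/trajectory-viewer", "/benchmark-card", "/tasks"]
}
""")

def total_surfaced_fixtures(card: dict) -> int:
    """Sum the per-bucket task counts into the overall surfaced-fixture total."""
    return sum(card["task_counts"].values())

print(total_surfaced_fixtures(EXAMPLE_CARD))  # 30, matching the /tasks catalog
```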
`billing_seat_adjustment`, `login_incident_triage`, and `suspicious_admin_request` remain the stable baseline tasks used for direct before-versus-after comparison.
`customer_escalation_chain`, `multi_tier_billing_dispute`, `data_breach_response_lifecycle`, `contract_renewal_negotiation`, `service_reinstatement_review`, and `api_partner_access_audit` add multi-agent, long-horizon, and world-aware behavior.
Eighteen surfaced variants now serve as judged held-out fixtures, so we can train on the 9 canonical tasks and test transfer to unseen but structurally similar support cases at a more credible benchmark scale.
`tax_exemption_credit_review`, `api_rate_limit_escalation`, and `admin_role_transfer_verification` remain available as showcase demos and oracle-viewer examples without changing the main training story.
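The train/held-out separation described above amounts to a label-based partition of the fixture catalog. The label values and helper below are illustrative assumptions; only the bucket sizes (9 canonical, 18 held-out, 3 showcase) come from this page.

```python
# Partition surfaced fixtures into train and held-out sets by label.
# Fixture names and label strings are assumptions for demonstration.

fixtures = (
    [{"name": f"canonical_{i}", "label": "canonical"} for i in range(9)]
    + [{"name": f"variant_{i}", "label": "held_out"} for i in range(18)]
    + [{"name": f"showcase_{i}", "label": "showcase"} for i in range(3)]
)

def split_fixtures(items: list):
    """Train on canonical tasks; hold out judged variants; skip showcase demos."""
    train = [f for f in items if f["label"] == "canonical"]
    test = [f for f in items if f["label"] == "held_out"]
    return train, test

train, test = split_fixtures(fixtures)
print(len(train), len(test))  # 9 18
```

Showcase fixtures fall into neither split, which is what keeps them demo-only without changing the main training story.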
/
Plain-text landing view served to clients by default unless they request HTML.
This landing page is a human-facing view layered on top of the same judged contract.
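The text-by-default behavior of the root route can be sketched as simple Accept-header content negotiation. The helper name and the exact negotiation rule are assumptions, not the server's actual implementation.

```python
# Sketch of serving plain text at / unless the client asks for HTML.
# The function name and matching rule are illustrative assumptions.

def pick_representation(accept_header: str) -> str:
    """Return 'html' when the Accept header asks for HTML, else 'text'."""
    return "html" if "text/html" in accept_header.lower() else "text"

print(pick_representation("text/html,application/xhtml+xml"))  # html
print(pick_representation("*/*"))                              # text
```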