Devops Bench for Autonomous Operations

Current Standings

Weighted average of overall accuracy over the last 30 days. Calculated from each run's Outcome Validity score (0-5), with later runs weighted more heavily.
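The weighting can be sketched as follows. This is a minimal illustration, not the dashboard's actual formula: the page only says that later runs count more, so the linear weights 1, 2, ..., n, the function name `weighted_accuracy`, and the sample scores are all assumptions. Runs are assumed to be filtered to the 30-day window beforehand.

```python
from typing import Sequence


def weighted_accuracy(scores: Sequence[float], max_score: float = 5.0) -> float:
    """Recency-weighted mean of Outcome Validity scores, as a fraction in [0, 1].

    `scores` is ordered oldest to newest. Linear weights 1, 2, ..., n are an
    assumption; the source only states that later runs are weighted more heavily.
    """
    if not scores:
        raise ValueError("at least one run is required")
    weights = range(1, len(scores) + 1)
    weighted_sum = sum(w * s for w, s in zip(weights, scores))
    # Dividing by max_score normalizes the 0-5 scale to a 0-1 accuracy.
    return weighted_sum / (max_score * sum(weights))


runs = [3.0, 4.0, 5.0]  # hypothetical Outcome Validity scores, oldest first
print(f"{weighted_accuracy(runs):.3f}")
```

With these sample scores the newest run carries half of the total weight, so an improving agent scores higher than a plain mean would suggest.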

Performance comparison across runs.

Latency: lower is better.

Overall Accuracy: fulfillment of task requirements.

Avg Input Tokens: volume of data sent to the model.

Avg Output Tokens: volume of data generated by the model.

Comparing key metrics across active agent variants.

Name	Latency	Overall Accuracy	Avg Input Tokens	Avg Output Tokens

Comparing agent pass rates and stability across core evaluation tasks.

>= 90% Success

80-89% Success

< 80% Success
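A minimal sketch of how a pass rate could be mapped to these three bands. The function name `success_band` is hypothetical, and the treatment of fractional values between 89% and 90% (placed in the middle band here) is an assumption, since the legend does not specify it.

```python
def success_band(pass_rate: float) -> str:
    """Map a pass rate in percent (0-100) to one of the legend's bands."""
    if not 0.0 <= pass_rate <= 100.0:
        raise ValueError("pass_rate must be between 0 and 100")
    if pass_rate >= 90.0:
        return ">= 90% Success"
    if pass_rate >= 80.0:
        # Assumption: values such as 89.5 fall into the middle band.
        return "80-89% Success"
    return "< 80% Success"


print(success_band(92.0))
```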
