DevOps Bench for Autonomous Operations

Current Standings

Weighted average of overall accuracy over the last 30 days. Calculated from Outcome Validity scores (0-5) across runs, with more recent runs weighted more heavily.
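The page does not state the exact weighting scheme, so the following is only a minimal sketch of one way a recency-weighted accuracy could be computed. The linear weights (1 for the oldest run up to n for the newest), the normalization by the maximum score of 5, and the function name `weighted_accuracy` are all assumptions made for illustration.

```python
from typing import Sequence


def weighted_accuracy(scores: Sequence[float], max_score: float = 5.0) -> float:
    """Recency-weighted mean of Outcome Validity scores, as a fraction in [0, 1].

    `scores` must be ordered oldest to newest. The linear weights 1..n are an
    assumption; the benchmark may use a different scheme (e.g. exponential decay).
    """
    if not scores:
        raise ValueError("at least one run is required")
    weights = range(1, len(scores) + 1)  # later runs get larger weights
    weighted_sum = sum(w * s for w, s in zip(weights, scores))
    # Dividing by max_score * sum(weights) maps a perfect record to 1.0.
    return weighted_sum / (max_score * sum(weights))
```

For example, under these assumptions `weighted_accuracy([3, 4, 5])` gives about 0.867, compared with 0.8 for an unweighted mean, because the most recent score counts three times as much as the oldest.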

Overall Accuracy Comparison

Performance comparison across runs.

Metric Comparisons

Mean Latency (Seconds)

Lower is better.

Outcome Validity

How fully the agent's output fulfills the task requirements, scored 0-5.

Input Tokens

Number of tokens sent to the model.

Output Tokens

Number of tokens generated by the model.

Agent Configurations

Comparing key metrics across active agent variants.

Name | Latency | Overall Accuracy | Avg Input Tokens | Avg Output Tokens

Specific Task Performance Matrix

Comparing agent pass rates and stability across core evaluation tasks.

>= 90% Success
80-89% Success
< 80% Success
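As a rough sketch, the three legend bands could be assigned from a pass rate as follows. The function name `success_band` is hypothetical, and treating the middle band as the half-open interval from 80 up to (but not including) 90 is an assumption, since the legend does not say how fractional values such as 89.5% are handled.

```python
def success_band(pass_rate_percent: float) -> str:
    """Map a pass rate in percent (0-100) to one of the legend bands.

    Assumption: the middle band covers [80, 90), so 89.9 falls in "80-89%".
    """
    if not 0.0 <= pass_rate_percent <= 100.0:
        raise ValueError("pass rate must be between 0 and 100")
    if pass_rate_percent >= 90.0:
        return ">= 90% Success"
    if pass_rate_percent >= 80.0:
        return "80-89% Success"
    return "< 80% Success"
```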