Current Standings
Weighted average of overall accuracy over the past 30 days, calculated from Outcome Validity scores (0-5) across runs, with later runs weighted more heavily.
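As a rough illustration of how such a recency-weighted figure could be computed, the sketch below assumes linear weights (1 for the oldest run, n for the newest) and normalizes by the maximum score of 5. The function name `weighted_accuracy` and the linear weighting scheme are assumptions made for illustration. The dashboard does not state the weights it actually uses.

```python
from typing import Sequence


def weighted_accuracy(scores: Sequence[float], max_score: float = 5.0) -> float:
    """Return a recency-weighted accuracy in the range [0, 1].

    `scores` are Outcome Validity scores (0-5), ordered oldest to newest.
    Linear weights 1, 2, ..., n are an assumption, not the dashboard's
    documented scheme.
    """
    if not scores:
        raise ValueError("at least one run is required")
    weights = range(1, len(scores) + 1)  # later runs get larger weights
    weighted_sum = sum(w * s for w, s in zip(weights, scores))
    # Divide by the best possible weighted sum so the result is a fraction.
    return weighted_sum / (max_score * sum(weights))


if __name__ == "__main__":
    # Three runs, oldest first, with scores 3, 4 and 5.
    print(f"{weighted_accuracy([3, 4, 5]):.1%}")
```

With this weighting, a recent improvement moves the figure more than an older run of the same size would, which matches the "later runs weighted more heavily" description.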
Overall Accuracy Comparison
Performance comparison across runs.
Metric Comparisons
Mean Latency (Seconds)
Lower is better.
Outcome Validity
Fulfillment of task requirements.
Input Tokens
Volume of data sent to the model.
Output Tokens
Volume of data generated by the model.
Agent Configurations
Comparing key metrics across active agent variants.
| Name | Mean Latency (s) | Overall Accuracy | Avg Input Tokens | Avg Output Tokens |
|---|---|---|---|---|
Specific Task Performance Matrix
Comparing agent pass rates and stability across core evaluation tasks.
>= 90% Success
80-89% Success
< 80% Success