# k8s-ai-bench Leaderboard
k8s-ai-bench is a benchmark that evaluates and compares Large Language Models on their ability to solve real-world Kubernetes tasks.
Tasks range from creating deployments to debugging crash loops, and each model is tested across 120 execution runs. Check tasks for more details.
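For illustration, here is a minimal Deployment manifest of the kind a "create a deployment" task might ask an agent to produce. All names and the image are hypothetical, not taken from the benchmark's task set:

```yaml
# Hypothetical solution to a "create a deployment" task;
# names and image are illustrative, not from the benchmark.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.27
          ports:
            - containerPort: 80
```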
- **Pass@1** — Can the agent solve the task on the first try? Measures immediate correctness.
- **Pass@5** — Can the agent solve the task at least once in 5 attempts? Measures potential capability.
- **Pass^5** — Can the agent solve the task 5 times in a row? Measures reliability and consistency.
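As a sketch of how these three metrics can be computed from raw run results, assuming each task records a list of per-attempt outcomes (the task names and results below are hypothetical, not benchmark data):

```python
from typing import Callable, Dict, List

def pass_at_1(attempts: List[bool]) -> bool:
    # Immediate correctness: did the first attempt succeed?
    return attempts[0]

def pass_at_5(attempts: List[bool]) -> bool:
    # Potential capability: at least one success in 5 attempts.
    return any(attempts[:5])

def pass_hat_5(attempts: List[bool]) -> bool:
    # Reliability: all 5 attempts succeed.
    return all(attempts[:5])

def score(results: Dict[str, List[bool]],
          metric: Callable[[List[bool]], bool]) -> float:
    # Fraction of tasks on which the metric holds.
    return sum(metric(a) for a in results.values()) / len(results)

# Hypothetical per-task outcomes (5 attempts each).
results = {
    "task-a": [True, True, True, True, True],
    "task-b": [False, True, False, True, False],
    "task-c": [False, False, False, False, False],
}
print(score(results, pass_at_1))   # only task-a passes on the first try
print(score(results, pass_at_5))   # task-a and task-b pass at least once
print(score(results, pass_hat_5))  # only task-a passes all 5 times
```

Note that Pass^5 is always a lower bound on Pass@1, which in turn is bounded by Pass@5, so the three scores order a model's consistency, first-try accuracy, and best-case capability.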
| Model | Score | Type |
|---|---|---|