k8s-ai-bench Leaderboard

k8s-ai-bench is a benchmark that evaluates and compares Large Language Models on their capability to solve real-world Kubernetes tasks.

Tasks range from creating deployments to debugging crash loops, and each model is evaluated across 120 execution runs. See tasks for more details.

Pass@1

Can the agent solve the task on the first try? Measures immediate correctness.

Pass@5

Can the agent solve the task at least once in 5 attempts? Measures potential capability.

Pass^5

Can the agent solve the task 5 times in a row? Measures reliability and consistency.
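
The three metrics above can be sketched as simple aggregations over per-task attempt results. This is a minimal illustration, not the benchmark's actual scoring code; the task names and results below are hypothetical.

```python
from typing import Dict, List

def pass_at_1(attempts: List[bool]) -> bool:
    # Pass@1: solved on the very first attempt.
    return attempts[0]

def pass_at_5(attempts: List[bool]) -> bool:
    # Pass@5: solved at least once in 5 attempts.
    return any(attempts[:5])

def pass_pow_5(attempts: List[bool]) -> bool:
    # Pass^5: solved on all 5 attempts in a row.
    return all(attempts[:5])

# Hypothetical per-task results for one model (5 attempts each).
runs: Dict[str, List[bool]] = {
    "create-deployment": [True, True, True, True, True],
    "debug-crashloop":   [False, True, False, True, False],
}

# Fraction of tasks passing under each metric.
scores = {
    "pass@1": sum(pass_at_1(a) for a in runs.values()) / len(runs),
    "pass@5": sum(pass_at_5(a) for a in runs.values()) / len(runs),
    "pass^5": sum(pass_pow_5(a) for a in runs.values()) / len(runs),
}
```

Note how the same attempt data yields different scores: a model that occasionally solves a task scores well on Pass@5 but poorly on Pass^5, which is why the three metrics are reported separately.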

Model | Score | Type