About k8s-ai-bench

k8s-ai-bench is a benchmark that evaluates and compares Large Language Models on their ability to solve real-world Kubernetes tasks. Tasks range from creating deployments to debugging crash loops, and each model is tested across 120 execution runs. See the tasks section for more details.

How it works

The benchmark runs a series of 24 predefined tasks against different LLMs. Each task is attempted five times (24 tasks × 5 attempts = 120 runs) to gauge both consistency and potential.

Metrics

  • Pass@1: Measures immediate correctness. Can the agent solve the task on the first try?
  • Pass@5: Measures potential capability. Can the agent solve the task at least once in 5 attempts?
  • Pass^5: Measures reliability & consistency. Can the agent solve the task 5 times in a row?
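With exactly five attempts per task, the three metrics above reduce to simple aggregations over the per-attempt results. The sketch below shows one way to compute them; the `results` structure (a mapping from task name to a list of five booleans) is an assumption for illustration, not the benchmark's actual data format.

```python
from statistics import mean

def score(results):
    """Compute (Pass@1, Pass@5, Pass^5) as fractions over all tasks.

    results: dict mapping task name -> list of 5 booleans,
             one per attempt (True = task solved on that attempt).
    """
    # Pass@1: average per-attempt success rate (chance of solving on a single try)
    pass_at_1 = mean(mean(attempts) for attempts in results.values())
    # Pass@5: fraction of tasks solved at least once in 5 attempts
    pass_at_5 = mean(any(attempts) for attempts in results.values())
    # Pass^5: fraction of tasks solved in all 5 attempts (5 times in a row)
    pass_pow_5 = mean(all(attempts) for attempts in results.values())
    return pass_at_1, pass_at_5, pass_pow_5
```

Note how Pass@5 and Pass^5 bracket Pass@1: a model that is lucky but inconsistent scores high on Pass@5 and low on Pass^5, while a reliable model keeps the three numbers close together.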

Links