
50 days of cto bench

Feb 4, 2026

Written by Simon Spurrier

We shipped cto bench because our friends kept asking for it. They wanted to know which models our users preferred for real software engineering tasks.

The results were often surprising.

But aren’t there already loads of benchmarks? Yes, but lately it has been hard to know who or what to believe. Traditional benchmarks become noisier and less reliable over time, and there are plenty of examples of models touting top scores in their marketing but failing to live up to expectations in the real world.

Benchmaxxing

At first, a benchmark tends to attract a small research crowd, since benchmarks are usually run by researchers themselves. Then, once it has gained authority and recognition, the marketers move in and take advantage.

Once a benchmark is on the internet, information about it is likely to work its way into model training data. This ranges from the fairly innocuous, where benchmark material simply becomes part of a massive pre-training corpus, to the openly adversarial, where models are fine-tuned explicitly to optimise for benchmark performance. Either way, the model gets a good score but fails to generalise as well as you might expect.


If you have deep enough pockets, you can also run the benchmark as many times as you like against any combination of model hyperparameters, fine-tunes, or harness variations. You can then cherry-pick your results and claim pass@whatever or best-of-N.
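To make the arithmetic concrete: pass@k is usually estimated from n sampled attempts of which c succeed, using the standard unbiased estimator sketched below. The numbers in the example are made up, but they show how a weak per-attempt success rate still produces an impressive headline once you sample enough.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n attempts with c successes."""
    if n - c < k:
        return 1.0
    # Probability that a random draw of k attempts contains at least one success.
    return 1.0 - comb(n - c, k) / comb(n, k)

# A 15% per-attempt success rate still yields a headline pass@10 of ~0.81.
print(pass_at_k(n=200, c=30, k=10))
```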

It also goes the other way. Benchmarks can ruin models. Ever feel like ChatGPT has an overly verbose style, loves unordered lists, and uses emojis too often? According to Surge AI, you can probably blame LMArena since it encourages labs to optimise for impulsive user preference where attention spans are short and sycophancy wins.

An unbenchmaxxable benchmark

We designed cto bench to be resistant to overfitting, deep pockets, and statistical anomalies.

  1. There’s no test set and no arbitrary voting system. Every ‘test case’ is a real coding task, and the result depends on whether the user merged the code or not.

  2. No cherry-picking. You can’t run cto bench on your own model or harness, so you can’t cherry-pick your results. When a new model is released, we add it to our rotation and its first score appears a few days later.

  3. It’s statistically significant and up-to-date. Tasks used in cto bench metrics are assigned to models at random, and only models that meet a minimum usage threshold over the measurement period are included. Each model displays only its most recent score, which is recomputed every day (a rough sketch of this bookkeeping follows the list).
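As promised, a rough sketch of that bookkeeping. It is illustrative only, not our production pipeline: the record shape, seven-day window, and 50-task threshold are assumptions for the example, not our real values.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Hypothetical task record: (model, completed_at, merged_by_user)
TaskRecord = tuple[str, datetime, bool]

def leaderboard(tasks: list[TaskRecord],
                window_days: int = 7,
                min_tasks: int = 50) -> dict[str, float]:
    """Merge rate per model over a rolling window.

    Models below the usage threshold are left off the board entirely
    rather than shown with a noisy score.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(days=window_days)
    merged: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for model, completed_at, was_merged in tasks:
        if completed_at < cutoff:
            continue  # only the most recent window counts
        total[model] += 1
        merged[model] += int(was_merged)
    return {
        model: merged[model] / total[model]
        for model in total
        if total[model] >= min_tasks
    }
```

Because only the most recent window counts, yesterday’s lucky streak ages out on its own.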


The results

Our benchmark often surprises us. For example, smaller, cheaper, open-source models beat flagship models all the time.

Minimax M2.1 has been consistently outperformed by Minimax M2, an older model by the same lab.

Gemini 3 Flash typically beats Gemini 3 Pro, a slower model that’s ~4x more expensive.

Most recently, Kimi K2.5 entered the leaderboard at the top spot, beating both Claude Sonnet 3.5 and GPT-5.2 Codex.

We regard these results as the gold standard, certainly for code agents built using harnesses like ours. We constantly update our automatic model routing logic to account for the latest success rates across every model we measure.
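We don’t spell out the routing logic here, so treat the following as an illustration of the general idea rather than what actually runs in production: the model names, rates, and power-weighting scheme are all hypothetical.

```python
import random

def pick_model(merge_rates: dict[str, float], sharpness: float = 10.0) -> str:
    """Sample a model, biased toward higher recent merge rates.

    A higher `sharpness` concentrates traffic on the current leaders;
    a lower one keeps every model sampled often enough to stay measurable.
    """
    models = list(merge_rates)
    weights = [rate ** sharpness for rate in merge_rates.values()]
    return random.choices(models, weights=weights, k=1)[0]

# Hypothetical recent merge rates, not real cto bench numbers.
print(pick_model({"model-a": 0.62, "model-b": 0.55, "model-c": 0.48}))
```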

We hope these results are useful for you, too.

All rights reserved 2025,
Era Technologies Holdings Inc.
