cto bench

Model success rates on real end-to-end coding tasks from cto.new users

Leaderboard

Methodology

The benchmark measures the percentage of completed tasks whose code was merged

We report a 72-hour rolling success rate with a 2-day lag to allow for task resolution

We exclude data from teams that have never merged any code using cto.new

Only models that meet a minimum usage threshold for statistical significance are included

The leaderboard displays the most recent available measurement for each model that met the benchmark criteria within the last calendar month
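
To make the arithmetic concrete, here is a minimal sketch of how such a rate could be computed. It is not cto.new's implementation. The `Task` record, the `teams_with_merges` set, the `min_tasks` threshold, and the reading that the 72-hour window ends 2 days before the measurement time are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional, Set

WINDOW = timedelta(hours=72)  # rolling window length stated in the methodology
LAG = timedelta(days=2)       # lag that gives tasks time to resolve


@dataclass(frozen=True)
class Task:
    """Hypothetical record of one completed task."""
    model: str
    team: str
    completed_at: datetime
    merged: bool


def success_rate(
    tasks: Iterable[Task],
    model: str,
    now: datetime,
    teams_with_merges: Set[str],
    min_tasks: int = 1,  # assumed placeholder; the real threshold is not published
) -> Optional[float]:
    """Return merged / completed for one model, or None if below the threshold."""
    window_end = now - LAG              # assumption: the window closes 2 days ago
    window_start = window_end - WINDOW  # and covers the 72 hours before that
    eligible = [
        t for t in tasks
        if t.model == model
        and t.team in teams_with_merges  # drop teams that have never merged code
        and window_start <= t.completed_at < window_end
    ]
    if len(eligible) < min_tasks:
        return None  # too little usage to report a rate
    return sum(t.merged for t in eligible) / len(eligible)
```

Under these assumptions, a model with two eligible tasks in the window, one of them merged, would be reported at 0.5, and a model below `min_tasks` would be left off the leaderboard.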

Toolset

ReadFile

Reads text files from an absolute path.

WriteFile

Writes/overwrites a file at an absolute path.

EditFile

Replaces one uniquely identified text block in an existing file.
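
The uniqueness requirement can be illustrated with a short sketch. The function name `replace_unique_block` and its error behavior are hypothetical; the actual EditFile tool may differ in how it reports missing or ambiguous matches.

```python
def replace_unique_block(text: str, old: str, new: str) -> str:
    """Replace `old` with `new` only if `old` occurs exactly once in `text`."""
    if not old:
        raise ValueError("the block to replace must not be empty")
    first = text.find(old)
    if first == -1:
        raise ValueError("block not found")
    # Search again from the next character so overlapping matches are caught too.
    if text.find(old, first + 1) != -1:
        raise ValueError("block is not unique; add surrounding context")
    return text[:first] + new + text[first + len(old):]
```

Refusing ambiguous matches means an edit can never land on the wrong copy of a repeated snippet. The caller has to include enough surrounding lines to pin down a single location.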

GlobTool

Finds files by glob patterns.

GrepTool

Searches file contents with regex.
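
As a rough picture of what a regex content search does, the sketch below walks a directory tree and returns matching lines. The name `grep`, the result shape, and the decision to skip unreadable files are assumptions, not a description of GrepTool's real interface.

```python
import re
from pathlib import Path


def grep(root: str, pattern: str) -> list[tuple[str, int, str]]:
    """Return (path, line number, line) for every line under `root` matching `pattern`."""
    regex = re.compile(pattern)
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((str(path), lineno, line))
    return hits
```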

LsTool

Lists directories/files at an absolute path with pagination.
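
Pagination here can be thought of as slicing a sorted listing. The sketch below is only an illustration: the `offset` and `limit` parameters and the `next_offset` return value are hypothetical, and LsTool's actual paging scheme is not documented here.

```python
import os
from typing import Optional


def list_dir(path: str, offset: int = 0, limit: int = 50) -> tuple[list[str], Optional[int]]:
    """Return one page of entry names plus the offset of the next page, if any."""
    if not os.path.isabs(path):
        raise ValueError("path must be absolute")
    names = sorted(os.listdir(path))  # sort so pages are stable between calls
    page = names[offset:offset + limit]
    next_offset = offset + limit if offset + limit < len(names) else None
    return page, next_offset
```

A caller would keep requesting pages with the returned `next_offset` until it comes back as `None`.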

TerminalTool

Runs shell commands and interacts with processes in a VM terminal.