cto bench
Model success rates on real end-to-end coding tasks from cto.new users
Leaderboard
Methodology
The benchmark measures the share of completed tasks whose code was merged.
We report a 72-hour rolling success rate with a 2-day lag to allow time for tasks to resolve.
We exclude data from teams that have never merged any code using cto.new.
Only models that meet a minimum usage threshold for statistical significance are included.
The leaderboard shows the most recent available measurement for each model that met the benchmark criteria within the last calendar month.
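As a rough illustration of the methodology above, here is a minimal sketch of how such a rolling rate could be computed. It is not cto.new's actual pipeline: the `Task` record, the `rolling_success_rate` function, the `min_tasks` parameter, and the reading of the window as ending at `now - lag` are all assumptions made for the example, and the real usage threshold is not published here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Task:
    """Hypothetical record of one completed task (field names are assumptions)."""
    model: str
    team: str
    completed_at: datetime
    merged: bool


def rolling_success_rate(
    tasks: list[Task],
    model: str,
    now: datetime,
    window: timedelta = timedelta(hours=72),
    lag: timedelta = timedelta(days=2),
    min_tasks: int = 1,  # placeholder for the unpublished usage threshold
) -> Optional[float]:
    """Return merged / completed for `model` over the lagged window, or None."""
    # Assumption: a team counts as eligible if it has at least one merged task
    # anywhere in the supplied data.
    eligible_teams = {t.team for t in tasks if t.merged}

    # Assumption: the window ends `lag` before `now`, so recent tasks have
    # had time to resolve before they are counted.
    end = now - lag
    start = end - window

    in_window = [
        t for t in tasks
        if t.model == model
        and t.team in eligible_teams
        and start <= t.completed_at < end
    ]
    if len(in_window) < min_tasks:
        return None  # not enough usage to report a rate
    return sum(t.merged for t in in_window) / len(in_window)
```

Returning `None` below the threshold keeps low-volume models off the board rather than reporting a noisy rate, which is one plausible reading of the minimum usage rule.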
Toolset
ReadFile
Reads text files from an absolute path.
WriteFile
Writes/overwrites a file at an absolute path.
EditFile
Replaces one uniquely identified text block in an existing file.
GlobTool
Finds files by glob patterns.
GrepTool
Searches file contents with regex.
LsTool
Lists directories/files at an absolute path with pagination.
TerminalTool
Runs shell commands and interacts with processes in a VM terminal.
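To make the EditFile description above concrete (replacing one uniquely identified text block), here is a minimal sketch of that uniqueness rule applied to an in-memory string. The function name `edit_text` and its error messages are hypothetical; the tool's real interface and behavior on failure are not documented here.

```python
def edit_text(content: str, old_block: str, new_block: str) -> str:
    """Replace `old_block` with `new_block` only if it occurs exactly once."""
    if not old_block:
        raise ValueError("old_block must not be empty")

    # Count non-overlapping occurrences; an edit is only safe when the
    # target block identifies a single location.
    occurrences = content.count(old_block)
    if occurrences == 0:
        raise ValueError("old_block not found")
    if occurrences > 1:
        raise ValueError(f"old_block is ambiguous: found {occurrences} matches")

    return content.replace(old_block, new_block, 1)
```

Requiring a unique match means an ambiguous edit fails loudly instead of silently changing the wrong location; a caller would typically retry with more surrounding context in `old_block`.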