Read the codebases of Codex, Gemini CLI, Mistral Vibe, and OpenCode, then forked three of them to run GLM-4.7 on Terminal-Bench 2.0. Same model, 2x performance gap — the scaffolding is what matters. Also benchmarked all four agents on an unpublished NP-hard optimization problem; Claude Code beat my 8-year-old C++ solution.
Builders behind this project.