I benchmarked a number of agentic LLMs in a simple autonomous dev loop: a “Tech Lead” agent researches the problem and spawns dev sub-agents to complete the task. Each agent, tech lead and dev alike, has its own assigned LLM and toolset. More details on the methodology are in this post; take a look at the demos below.
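
For context, the control flow looks roughly like the sketch below. The function shape, parameter names, and the injected callLLM / runTools helpers are hypothetical stand-ins, not Matic's actual API:

```javascript
// Hypothetical outline of the tech-lead / dev loop. callLLM, runTools, and
// devTools are injected dependencies; none of this is Matic's real interface.
async function runBenchmark(prompt, { techLead, dev, callLLM, runTools, devTools }) {
  // 1. The tech lead researches the problem and breaks it into dev tasks.
  const plan = await callLLM(techLead, [
    { role: 'system', content: 'You are a tech lead. Research the problem and emit a task list.' },
    { role: 'user', content: prompt },
  ]);

  // 2. Each dev sub-agent works one task in a tool-use loop until it stops
  //    requesting tools.
  for (const task of plan.tasks) {
    const messages = [{ role: 'user', content: task.description }];
    for (;;) {
      const reply = await callLLM(dev, messages, devTools);
      messages.push(reply.message);
      if (!reply.toolCalls || reply.toolCalls.length === 0) break; // dev is done
      messages.push(...await runTools(reply.toolCalls));           // feed tool results back
    }
  }
}
```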

tl;dr: GLM 4.6 was the clear winner, and one of the cheapest to boot.

Prompt

Create a falling sand cellular automata web page using only vanilla html, javascript, and css (No external packages). Sand should fall from the top center of the page and collect at the bottom of the page. When the user left clicks on the page, sand should be generated at the cursor location.
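
The rule the models had to implement is simple: each grain of sand falls straight down into an empty cell, otherwise it slides diagonally down-left or down-right, otherwise it stays put. A minimal vanilla-JS sketch of that update, where the grid size, colors, and one-grain-per-frame spawn rate are arbitrary illustration choices and not taken from any model's output:

```javascript
// Minimal falling-sand sketch. 1 = sand, 0 = empty.
const canvas = document.createElement('canvas');
canvas.width = 200;
canvas.height = 150;
document.body.appendChild(canvas);
const ctx = canvas.getContext('2d');
const W = canvas.width, H = canvas.height;
let grid = new Uint8Array(W * H);

function step() {
  const next = new Uint8Array(W * H);
  // Sweep bottom-up so each grain moves at most one cell per frame.
  for (let y = H - 1; y >= 0; y--) {
    for (let x = 0; x < W; x++) {
      if (!grid[y * W + x]) continue;
      const below = (y + 1) * W + x;
      if (y + 1 < H && !grid[below] && !next[below]) {
        next[below] = 1;                                             // fall straight down
      } else if (y + 1 < H && x > 0 && !grid[below - 1] && !next[below - 1]) {
        next[below - 1] = 1;                                         // slide down-left
      } else if (y + 1 < H && x < W - 1 && !grid[below + 1] && !next[below + 1]) {
        next[below + 1] = 1;                                         // slide down-right
      } else {
        next[y * W + x] = 1;                                         // blocked: stay put
      }
    }
  }
  grid = next;
}

function draw() {
  const img = ctx.createImageData(W, H);
  for (let i = 0; i < grid.length; i++) {
    const o = i * 4;
    if (grid[i]) { img.data[o] = 230; img.data[o + 1] = 190; img.data[o + 2] = 90; }
    img.data[o + 3] = 255;                                           // opaque; empty cells stay black
  }
  ctx.putImageData(img, 0, 0);
}

canvas.addEventListener('mousedown', (e) => {
  if (e.button !== 0) return;                                        // left click only
  const r = canvas.getBoundingClientRect();
  const x = Math.floor(e.clientX - r.left), y = Math.floor(e.clientY - r.top);
  if (x >= 0 && x < W && y >= 0 && y < H) grid[y * W + x] = 1;       // spawn at cursor
});

(function loop() {
  grid[Math.floor(W / 2)] = 1;                                       // emit sand at top center
  step();
  draw();
  requestAnimationFrame(loop);
})();
```

Sweeping bottom-up into a second buffer keeps each grain to one move per frame, so sand piles up instead of falling through multiple cells in a single update.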

GLM 4.6 / GLM 4.6

  • Cost: $0.06
  • Execution Time: 3m 3s
  • Lines of code: 251

Claude 4.5 Sonnet / Claude 4.5 Haiku

  • Cost: $1.47
  • Execution Time: 5m 4s
  • Lines of code: 1,142 (includes a readme and test scripts)

Gemini 2.5 Pro / Gemini 2.5 Flash

  • Cost: $0.12
  • Execution Time: 2m 21s
  • Lines of code: 87

GPT-5 / GPT-5 mini

  • Cost: $0.10
  • Execution Time: 5m 11s
  • Lines of code: 273

Qwen 3 Max / Qwen 3 Coder Plus

  • Cost: $0.20
  • Execution Time: 4m 52s
  • Lines of code: 207

Grok 4 / Grok Code Fast 1

  • Cost: $0.63
  • Execution Time: 11m 52s
  • Lines of code: 103

Future

The prompt here was pretty simple and likely something that Claude Code could have one-shot with little or no user interaction. I’d like to use the same system with increasingly complex projects to compare performance. If you’re interested in following along, check out the platform I used to run the benchmarks: Matic.