I started out 2025 deeply skeptical of AI development tools. I had played around a bit with generating code via back-and-forth conversations with LLM chat apps (ChatGPT and Claude) but was unimpressed: it felt like a toy for getting simple projects off the ground and an active time sink for large, complex codebases. Then in March I tried Claude Code.
I set up Claude Code on a side project (a codebase with ~10k lines of Elixir) and was immediately several times more productive. Switching between planning mode and build mode felt natural - very analogous to how I reasoned about coding tasks myself. Yes, the AI made mistakes, but it built and tested the changes itself, so the debug feedback loop was tight and hands-off. I found myself spending much more time in “Architect” mode than thinking about individual lines of code or getting stuck on mundane bugs.
Fast forward to November 2025 and my conviction that we’re in the midst of a software revolution is stronger than ever. With models like Sonnet 4.5 and GPT-5, coding mistakes have become less frequent, larger and larger tasks are within the realm of end-to-end AI ownership, and the planning phase has felt more collaborative and productive than ever.
There are, of course, limitations. I work within a very large monorepo for my day job, and managing context for the AI is a challenging but not impossible task. LLMs can be incredibly myopic. Just as with humans, context switching is costly for AI. The psychological concept of “priming” is readily apparent in LLMs: if they were working on one part of the codebase during a previous phase of the conversation, they will tend to laser-focus on that same area when you bring up another issue or feature, regardless of whether it is relevant to the problem at hand. As a coder, developing a mental model of the structure of the LLM’s context window is a skill that takes time and talent to master.
There’s a question on the minds of many software developers who leverage AI dev tools: how close are we to fully autonomous AI development? It’s a controversial topic, because many developers feel the same way I did in January: AI just hasn’t “clicked”, and skepticism abounds. But among those who have begun the journey of mastering AI dev tools, many see a not-too-distant future where the job of software development looks wildly different than it does now.
I began building an AI agent workflow platform a few months ago as a way to understand the internals of these tools: how the core agent loop functions, what works and what doesn’t when handing off execution from one agent to another, how to manage tool calls, context, you name it. This project has developed into a pretty sophisticated AI dev tool in its own right, with GitHub and git lifecycle integration, sandboxed container-based code development, sub-agents… It’s fun and informative to test different configurations and workflows to see how far the AI can get completely by itself on tasks of varying complexity. I’m just scratching the surface, but thought I’d share my tinkering and experimentation so far.
Setup
The experiments were run using two services I developed specifically for testing out autonomous development.
Workflow execution platform
This is a web application I built on the Phoenix (Elixir) framework, with a working title of Matic. A workflow consists of a set of Agents connected to each other to form a directed execution graph. An agent is defined by its custom system prompt, its model, the tools available to it (local and remote MCP servers are supported, as well as custom internal tools defined by the platform), and its structured output. I use OpenRouter as the LLM provider.
Each agent defines its expected input (just a set of text fields) and applies these inputs to its template to construct an initial prompt. The structured output of an agent is determined by the input fields of the downstream agent in the execution graph - if the agent isn’t a leaf node, its toolset includes a “Handoff tool” to set the template variables for the input of the next agent it hands execution to (alternatively, if the model supports “Structured Output”, it leverages that feature for handoff).
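To make this concrete, here is a rough sketch of how an agent and its handoff could be modeled. The module and field names below are my shorthand for this post, not the platform’s actual schema.

```elixir
# Hypothetical shape of an agent definition -- field names are illustrative, not Matic's real schema.
defmodule Matic.Agent do
  defstruct [
    :name,
    :model,          # OpenRouter model slug, e.g. "openai/gpt-5"
    :system_prompt,  # template rendered with the input fields below
    :input_fields,   # text fields this agent expects from its upstream agent
    :tools,          # local/remote MCP servers plus internal platform tools
    :handoff_to      # downstream agent; its input_fields define this agent's structured output
  ]
end

# A non-leaf agent gets a handoff tool whose parameters mirror the downstream agent's
# input fields, so its structured output falls out of the graph definition itself.
developer = %Matic.Agent{
  name: "Developer",
  model: "openai/gpt-5-mini",
  system_prompt: "Complete the following task: {{task_description}}",
  input_fields: ["task_description", "acceptance_criteria"],
  tools: [:docker_mcp]
}

tech_lead = %Matic.Agent{
  name: "TechLead",
  model: "openai/gpt-5",
  system_prompt: "Break these requirements into technical tasks: {{requirements}}",
  input_fields: ["requirements"],
  tools: [:docker_mcp],
  handoff_to: developer
}
```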
Each agent can optionally spawn sub-agents - this functionality is exposed as a custom spawn_subagent tool in its toolset. This flow is useful when the complexity or number of tasks to be completed in the workflow isn’t known ahead of time. For a software development workflow, this kind of loop is critical - the input could be anything from “add a button” to “build a full-fledged SaaS application”.
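As a rough illustration, the tool definition handed to the model might look something like this; the schema below is an assumption for this post, not the exact definition the platform uses.

```elixir
# Hypothetical spawn_subagent tool definition, in the JSON Schema style most
# tool-calling APIs expect -- parameter names here are assumptions.
spawn_subagent_tool = %{
  name: "spawn_subagent",
  description: "Spawn a Developer sub-agent to complete a single, well-scoped task.",
  input_schema: %{
    type: "object",
    properties: %{
      task_description: %{type: "string", description: "What the sub-agent should build"},
      acceptance_criteria: %{type: "string", description: "How to verify the task is done"}
    },
    required: ["task_description"]
  }
}
```

When the model calls this tool, the platform spawns a fresh sub-agent, fills its prompt template with the supplied arguments, and hands execution over until that sub-agent finishes.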
Docker MCP server
The second part of the architecture is a standalone MCP server intended to be run inside a Docker container. The MCP server is exposed via HTTP and includes a set of standard software dev tools (bash for executing arbitrary commands, file editing and search tools, etc.). The MCP server can also aggregate any number of other locally running MCP servers. This is useful for leveraging additional tools that must run alongside the development environment - for example, the Playwright MCP.
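To sketch the aggregation idea: the container-local server answers its built-in tools directly and proxies anything else to whichever locally running MCP server registered that tool. The module and helper names below are assumptions for illustration, not the actual implementation.

```elixir
# Hypothetical tool-call dispatch inside the container's MCP server.
defmodule DockerMCP.Dispatcher do
  @builtin ~w(bash read_file edit_file search)

  def handle_tool_call(%{"name" => name, "arguments" => args}, upstream_servers) do
    cond do
      name in @builtin ->
        run_builtin(name, args)

      server = Enum.find(upstream_servers, fn s -> name in s.tool_names end) ->
        # Forward to an aggregated local MCP server, e.g. the Playwright MCP.
        MCPClient.call_tool(server, name, args)

      true ->
        {:error, :unknown_tool}
    end
  end

  defp run_builtin("bash", %{"command" => cmd}) do
    {output, exit_code} = System.cmd("sh", ["-c", cmd], stderr_to_stdout: true)
    {:ok, %{output: output, exit_code: exit_code}}
  end

  defp run_builtin(_other, _args), do: {:error, :not_implemented}
end
```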
Putting it all together
I tried out a few configurations to run my experiments. First, I set up a simple workflow where a ProductOwner agent takes some prompt and hands off requirements to a TechLead agent, who breaks those out into technical tasks and hands off to a Developer agent to complete them, with a QA agent at the end to confirm everything works. This setup ended up not being optimal for a couple of reasons:
- As stated previously, this flow doesn’t really take task complexity into account. For a simple prompt, having a ProductOwner break it out into multiple requirements and phases was unnecessary. For a complex task, having a single Developer agent work on the whole thing was untenable.
- There was no real feedback. Since the dev tasks were all completed by a single Developer agent who hands off to the QA at the end, if the Developer got stuck or did something wrong, the best case is that the QA would catch it and mark the workflow as failed. There was no way to catch drift in technical implementation early.
After adding sub-agent execution, the workflow started to click. For the purposes of this experiment, the workflow is incredibly simple: just a single TechLead agent that can spawn any number of Developer agents. At the beginning of the flow, the GitHub repo is cloned into the Docker container, a new branch is created for the current workflow execution, and any pre-defined setup scripts are run (for example, installing project dependencies). When a developer agent is spawned, execution is immediately handed off to this agent, who performs the dev task via the Docker MCP tools. The Developer agent’s output is structured to define a commit message, which is used to write a commit after the agent is finished with its task and before handing execution back to the TechLead. The TechLead then scrutinizes the changes to ensure its requirements are met before spawning another Developer agent, and so on. At the very end of the workflow, a hook is run to push the code and create a PR in GitHub.
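Here’s a condensed sketch of that loop, assuming hypothetical Sandbox, AgentRunner, and GitHub helpers (the real platform’s module boundaries almost certainly differ):

```elixir
# Illustrative sketch of the TechLead/Developer loop and git lifecycle hooks.
defmodule Matic.Workflows.SoftwareDev do
  def run(repo_url, prompt) do
    sandbox = Sandbox.start(repo_url)                  # clone the repo into the Docker container
    Sandbox.exec(sandbox, "git checkout -b #{branch_name()}")
    Sandbox.run_setup_scripts(sandbox)                 # e.g. install project dependencies

    tech_lead = AgentRunner.start(:tech_lead, prompt, sandbox)
    loop(tech_lead, sandbox)

    Sandbox.exec(sandbox, "git push -u origin HEAD")   # end-of-workflow hook
    GitHub.create_pull_request(sandbox)
  end

  # Each spawn_subagent call hands execution to a Developer agent; its structured
  # output supplies the commit message, then control returns to the TechLead.
  defp loop(tech_lead, sandbox) do
    case AgentRunner.next_action(tech_lead) do
      {:spawn_subagent, task} ->
        %{commit_message: message} = AgentRunner.run(:developer, task, sandbox)
        Sandbox.exec(sandbox, ~s(git add -A && git commit -m "#{message}"))
        loop(tech_lead, sandbox)                       # TechLead reviews the diff before the next spawn

      :done ->
        :ok
    end
  end

  defp branch_name, do: "matic/run-#{System.unique_integer([:positive])}"
end
```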
The Prompt
For this first set of experiments I’ve deliberately kept the prompt pretty simple. This was for a number of reasons, but most importantly, these experiments can get very expensive very fast, especially when using the big, state-of-the-art LLMs. I wanted a task that would most likely result in at least a couple of sub-agents running but wouldn’t break the bank. I also wanted an end product that would be flashy: easy to glance at and judge whether it was high quality or not. I landed on the following prompt:
Create a falling sand cellular automata web page using only vanilla html, javascript, and css (No external packages). Sand should fall from the top center of the page and collect at the bottom of the page. When the user left clicks on the page, sand should be generated at the cursor location.
Some might argue that this prompt is too simple, that modern AI dev tools can easily one-shot it anyway. To that I say…probably, but that’s not really the point. These experiments were intended to serve two purposes: to evaluate a variety of LLMs on a uniform prompt with no developer interaction, and to stress test the workflow platform itself and uncover any quirks when running it under different models and configurations.
Results
Each workflow execution started off in a blank repository, and the final output was a PR in that repository. You can see these PRs here. The title and the contents of each PR were AI-generated - I added a prefix to label the models used during that workflow. For example, “[GPT-5 / GPT-5 mini]” means that I used the GPT-5 model for the TechLead agent and the GPT-5 mini model for the Developer agent.
| Models (TechLead / Developer) | Cost | Execution Time | Lines of Code |
| --- | --- | --- | --- |
| Claude 4.5 Sonnet / 4.5 Haiku | $1.47 | 5m 4s | 1,142 (includes a readme and test scripts) |
| Gemini 2.5 Pro / Gemini 2.5 Flash | $0.12 | 2m 21s | 87 |
| GPT-5 / GPT-5 mini | $0.10 | 5m 11s | 273 |
| Qwen 3 Max / Qwen 3 Coder Plus | $0.20 | 4m 52s | 207 |
| Grok 4 / Grok Code Fast 1 | $0.63 | 11m 52s | 103 |
| GLM 4.6 | $0.06 | 3m 3s | 251 |
Conclusions
As you can see, the quality of the delivered product, as well as details like cost, execution time, and lines of code, varied widely based on the chosen model pair. The Claude models wrote extensive documentation and test scripts - while I didn’t explicitly ask for these in the prompt, the system prompt of the Developer agent urges it to ensure a functional solution before returning. The most surprising finding was that the clear winner, in my mind, was GLM 4.6, which also happened to be the cheapest and one of the fastest models tested.
What’s next?
Now that I’ve tested a bit with building small greenfield applications autonomously, I’d like to branch out to more complex tasks: services with both frontend and backend implementations, very large codebases, you name it. For these kinds of problems, context is king. I plan to add new tools specifically for managing context in large codebases: first, support for AGENTS.md files to give each agent a good starting point; second, instructions and possibly specialized tools to manage context documents within each repo. AI agents excel at managing their own context by writing text-based files to keep track of longer projects or complex aspects of a codebase. By structuring these documents in the repo itself, we get version control out of the box and make it easy for the agent to choose which sub-folders and documents are relevant for the task.
While we may not be quite to the point where we can ask AI agents to build a complex application, walk away, and return to a fully functional, tested, and deployed service that meets all initial requirements, the truth is that with the right tools this dream does not seem very far off. It’s a safe bet that LLM technology will continue to progress quickly, at least in the short term. We should focus on building tools that both fully leverage existing models and anticipate the capabilities of future models.