Vaguely Aligned

Strange Loops and technological absurdity

It's 2026. Can LLMs Play NetHack Yet?

Ah, NetHack. Despite playing on and off for years, I never quite made that first ascension, only managing to find an increasing number of unique and creative ways to die. Picking it back up recently (with similar results), I wondered how effectively modern LLMs could navigate the procedurally generated dungeons of NetHack and whether maybe, just maybe, the Amulet of Yendor was within reach for them. What follows is my journey towards building a new agent harness for NetHack; if you prefer to skip ahead to the project itself, take a look at the GitHub repo. ...

Self-directed AI Dev: Cellular Automata

I benchmarked a number of agentic LLMs in a simple autonomous dev loop: a “Tech Lead” agent researches the problem and spawns dev sub-agents to complete the task. The tech lead / dev agents each have an assigned LLM and toolset. More details regarding methodology can be found in this post. Take a look at the demos below. tl;dr: GLM 4.6 was the clear winner, and one of the cheapest to boot. ...

Experiments in Autonomous AI Development

I started out 2025 deeply skeptical of AI development tools. I had played around a bit with generating code via back-and-forth conversations with LLM chat apps (ChatGPT and Claude) but was unimpressed: I viewed it as mostly a toy, useful for getting simple projects off the ground but an active waste of time on large, complex codebases. Then in March I tried Claude Code. I set it up on a side project (a codebase with ~10k lines of Elixir) and was immediately several times more productive. Switching between planning mode and build mode felt natural, much like how I reasoned about coding tasks myself. Yes, the AI made mistakes, but it built and tested the changes itself, so the debug feedback loop was tight and hands-off. I found myself spending much more time in “Architect” mode than thinking about individual lines of code or getting stuck on mundane bugs. ...