
Ah, NetHack. Despite playing for years off and on, I never quite made that first ascension, only managing to find an increasing number of unique and creative ways to die. Picking it back up recently (with similar results), I wondered how effectively modern LLMs could navigate the procedurally generated dungeons of NetHack and whether maybe, just maybe, the Amulet of Yendor was within reach for them. What follows is my journey towards building a new agent harness for NetHack; if you prefer to skip ahead to the project itself, take a look at the GitHub repo.
State of the Art
As you can imagine, I wasn’t the first to consider whether a bot could beat NetHack. There are two broad categories of NetHack bots: symbolic (rule-based) and neural (neural net-based). Symbolic bots are, at the risk of oversimplifying, basically just a long list of if/then statements in a loop - and they tend to perform much better than their neural counterparts. One such bot, based on the BotHack framework, even managed an ascension in 2015 - albeit by using a farming strategy that has since been patched.
With neural bots, on the other hand, reinforcement learning is used to train neural nets to play NetHack. Facebook released its NetHack Learning Environment (NLE) in 2020 to provide a unified training interface for RL agents to interact with NetHack, and later followed it with a companion dataset of recorded games. The results of the subsequent NeurIPS 2021 NetHack challenge speak for themselves:
The margin of victory was significant, with the top symbolic agent beating the top neural agent by a factor of almost 3 in the median score. This was, in fact, increased when looking at the very best agents from each team, where frequently we might see almost an order of magnitude improvement in the median score between the best symbolic and neural agents.
What about LLMs? The first writeup I could find of an LLM agent playing NetHack was NetPlay in March 2024. While achieving some interesting results, the agent never made it past the first few levels of the game. BALROG (Nov 2024) is another such example with similar results, but it went a step further to define a new metric that more accurately captures agent progress through NetHack rather than relying on the point system, which can be gamed. BRAID is a recent fork of BALROG that incorporates a more modern Agentic AI loop, and it managed to get 6.96% progression with Claude Opus 4.5, which is perhaps the highest progression an LLM has managed thus far.
The Harness
I rolled my own harness rather than extending existing NetHack LLM harnesses. A lot has changed in the realm of agentic AI over the last year, and if tools like Claude Code have taught us anything it’s that the harness can be just as important as the model in determining the performance of an agent loop. I took inspiration from a recent Anthropic article titled Code Execution With MCP: rather than exposing actions as separate tools for the agent to call, I define a Python API that the agent can use to interact with NetHack and give the agent a single tool, execute_code, which allows it to execute these API calls in a Python sandbox. Based on the NLE, this API allows the agent to express game actions in tight loops that save on token usage and leave the context window unpolluted with distracting intermediate states that offer no value. For example, rather than executing a single attack each turn, the agent can write something like this:
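The helper names here (get_state, adjacent_monsters, attack, direction_of) are illustrative stand-ins rather than the exact API, but the shape of the idea is the same:

```python
# A sketch of the kind of loop the agent can submit via execute_code.
# Helper names are illustrative stand-ins, not the harness's exact API.
state = get_state()
while state.hp > 10 and any(m.name == "jackal" for m in adjacent_monsters()):
    attack(direction_of("jackal"))   # keep swinging until the jackal dies or HP runs low
    state = get_state()              # refresh state after each attack resolves
```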
Of course, the above loop is overly simplistic: for example, what if a cockatrice appears while you’re attacking a low-level monster? Without an explicit check for new monsters appearing, you risk being turned to stone, regardless of your hp: game over. And NetHack itself already has a basic looping mechanism that is robust to state changes like an enemy appearing: prefix any command with a number and it loops the command that many times, stopping if something interesting happens first. Still, the combination of a highly expressive API and good examples in the system prompt led to some interesting results: GPT 5.2 tended to write convoluted, multi-step loops with many conditionals, while Gemini 3 Flash opted for minimal commands per turn. With the sandbox approach, you can evaluate the LLM not only on how far it gets in NetHack, but how token-efficient it was in getting there.
Just as important as the tools you give the LLM is the context: what information is available to the agent on each turn? I made heavy use of observation masking, with the goal of presenting the most relevant current and historical context to the agent while stripping out the cruft. Aside from the system prompt, which documents the Python API and gives basic directions, the LLM is presented with the complete current game state on each turn, including the ascii game map, status bar, and messages. Additional context such as adjacent tiles, visible monsters, and inventory is included as well, along with a list of all game messages generated by the previous turn, the results of the previous API calls, and any failure messages. A typical turn’s context looks something like this:
=== CURRENT GAME VIEW ===
Your position: (37, 8)
--------
|......| -------------
|......| ------- |...........|
|......| %#[)#..[...| |............#
|......-##### |....$| -+--+--------## --------
|......| #######+....$-#######################[########`###-.....<|
-.------ # -.-|--- @ ## # # |......|
## ### # # ##d # ## #.......|
# # # . . # ### #--------
# #### ### . .. #### #
--.--# # # |. . #####[# ##
|....########## -|--.-- | .+. # ##
|.$.| |.....| .?. #------ ##
|{...#############|.>...+ ... #|.....###
----- #|.....| ----- #.....|
#------- |....|
####` |....|
------
Agent the Stripling St:18/02 Dx:13 Co:15 In:8 Wi:8 Ch:11 Lawful S:52
Dlvl:1 $:0 HP:16(16) Pw:2(2) AC:6 Xp:1/13 T:437
Adjacent tiles:
N: corridor
NE: corridor
E: solid stone
SE: jackal 'd'
S: corridor
SW: corridor
W: solid stone
NW: corridor
Hostile Monsters:
- jackal 'd' [SE, adjacent]
Inventory:
a: +1 long sword (weapon in hand)
b: +0 dagger (alternate weapon; not wielded)
c: uncursed +3 small shield (being worn)
d: uncursed food ration
Last Result:
game_messages:
- You hear a door open.
- You see here a gold piece.
- You hear water falling on coins.
autoexplore: stopped (hostile) after 21 steps
actions:
- search() ok
- autoexplore(max_steps=500) ok
Study the map. What do you observe, and what action will you take?
The historical context is where things get interesting: for previous turns, I completely strip out the past map state, both to save on tokens and to cut down on context noise that might distract the agent. A configurable option lets me keep the last n maps in context instead. Similarly, there is a configurable limit on how many turns’ worth of tool call arguments (the Python code the agent wrote to interact with NetHack) are kept in context; after 10 turns, instead of the full arguments the agent simply sees <compacted>. Beyond that, I use a simple sliding window: after 100 turns, previous messages are dropped entirely. All of these values are configurable, and it was interesting to tweak each of them and see what impact, if any, they had on the agent’s performance.
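Roughly speaking, the knobs look like this (a simplified sketch; the field names are illustrative, not the repo’s exact config):

```python
from dataclasses import dataclass

@dataclass
class ContextConfig:
    # Illustrative sketch of the configurable masking knobs described above.
    maps_to_keep: int = 0               # full ascii maps retained from previous turns
    compact_tool_args_after: int = 10   # turns before old tool-call code becomes <compacted>
    sliding_window_turns: int = 100     # turns after which old messages are dropped entirely
```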
Pathfinding
Vanilla NetHack offers basic pathfinding tools such as the Travel command, which I expose in the Python API. Beyond that, some variants such as NetHack4 offer an autoexplore feature which I’ve found to be a huge time-saver in my personal NetHack runs, and which I knew would save immensely on token usage. Since NLE is built on top of vanilla NetHack, I attempted to roll my own autoexplore based on the NetHack4 codebase… how hard could it be? Pretty difficult, it turns out: handling the myriad edge cases encountered while exploring NetHack turned into a huge time sink, even with the reference autoexplore implementation as a guide. In hindsight, rolling my own Python API on top of a variant that includes autoexplore may have been more efficient than trying to bolt autoexplore onto NLE… but regardless, I ended up with a workable autoexplore feature, which I prompted the LLM to lean on heavily to map out each dungeon level. In general, all tested LLMs struggled with the spatial awareness needed to move around the dungeon manually, misjudging the routes needed to pick up items, open doors, and so on. Much of the harness and the given API methods were attempts to work around this weakness: rather than making the agent work out item locations itself, it is given a list of map features and their coordinates, along with a move_to method that handles the pathfinding for it.
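At its core, autoexplore is the usual frontier walk: path to the nearest explored tile that borders unexplored space, walk there, repeat. A heavily simplified sketch (the real version is mostly edge-case handling for doors, boulders, traps, and NLE interrupts):

```python
from collections import deque

def nearest_frontier(start, walkable, explored):
    """BFS over explored, walkable tiles to the closest one that borders
    unexplored space; returns the path to walk there, or None once the
    level is fully explored. Heavily simplified sketch."""
    queue, seen = deque([(start, [])]), {start}
    while queue:
        (x, y), path = queue.popleft()
        neighbors = [(x + dx, y + dy)
                     for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                     if (dx, dy) != (0, 0)]
        if any(n not in explored for n in neighbors):
            return path                 # this tile borders unexplored space
        for n in neighbors:
            if n in walkable and n not in seen:
                seen.add(n)
                queue.append((n, path + [n]))
    return None
```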
Memory
Since we use a sliding context window and perform observation masking, there’s a risk that the LLM will lose valuable context from previous turns. To mitigate this I added a few API methods that store and retrieve notes the LLM writes. The agent can add either a persistent note, which appears on every subsequent agent turn (with an optional expiration: a number of in-game turns after which the note disappears), or a reminder, which surfaces only after N in-game turns have passed. The latter is useful for timeboxed cases such as reminding oneself that a corpse in inventory has likely rotted and is too old to eat. Across the models I tested, these tools didn’t see much use; perhaps more careful direction in the system prompt would have changed that. Still, there were some interesting examples where the LLM leveraged the memory tools, such as when GPT 5.2 added a persistent note to avoid a throne room it identified as too dangerous for its current level.
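From the agent’s side, usage looks roughly like this (method names are illustrative):

```python
# Persistent note: shown on every subsequent turn until it expires.
add_note("Throne room on Dlvl 5 looks too dangerous at XP 3 -- avoid for now",
         expires_in_turns=500)

# Reminder: hidden until ~50 in-game turns from now, when the corpse in
# inventory will likely have rotted and should be dropped rather than eaten.
add_reminder("Jackal corpse is probably too old to eat by now -- drop it",
             after_turns=50)
```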
Results
Before you get too excited: no, there were no NetHack ascensions by LLMs in my testing. The deepest any agent managed was dungeon level 10, reached by GPT 5.2, which, while impressive in its own right and likely the deepest an LLM has gone thus far, is only 12.56% progression on the BALROG metric - nowhere near ascension. Below are the models I tested and how they performed.
| Model | Runs | Avg Depth | Max Depth | Avg XP | Max XP | Avg BALROG | Max BALROG |
|---|---|---|---|---|---|---|---|
| openai/gpt-5.2 | 37 | 3.5 | 10 | 2.8 | 6 | 2.56% | 12.56% |
| google/gemini-3-flash-preview | 140 | 2.1 | 9 | 1.8 | 6 | 1.15% | 9.77% |
| google/gemini-3-pro-preview | 11 | 1.6 | 5 | 1.6 | 5 | 0.51% | 2.91% |
| anthropic/claude-opus-4.5 | 48 | 1.3 | 5 | 1.2 | 5 | 0.34% | 2.91% |
GPT 5.2 was the clear winner here, not only in the consistency of its progression but also in the quality of its playthroughs: its map awareness, its food stockpiling and other long-term strategizing, and its extensive use of the API all impressed me, and it was really the only model that felt like it was playing NetHack rather than blindly descending levels to its death.
One note about the above numbers: many of these sessions were completed while I was still developing the harness, so the averages are likely skewed a bit low - improvements to the agent context and the API led to better results later in the development cycle. Now that the harness is more or less stable, it would be interesting to perform a “clean” set of runs with the same models and compare (if only this project had an unlimited budget!).
Analysis
So, where did the LLMs struggle the most? By far the biggest roadblock was also the least surprising: spatial awareness. All previous attempts to get LLMs to play NetHack (NetPlay, BALROG, etc.) ran into this same issue: NetPlay omits the ascii map entirely from its agent context and instead describes map features in text, yet its agent still struggles with basic navigation; BALROG tries both the raw ascii and an image of the map (for models that support images) and, surprisingly, the images made the LLMs perform even worse in most cases! I did not go down the path of sending map images in my harness, mostly because I feared exploding my already overdrawn inference budget. But I did try several experiments to enhance spatial awareness: listing notable map details like doors and items in context, implementing a “local view” (a 7x7 ascii grid centered on the player), toggling between automatically including the ascii map in context and exposing it as a tool call, and so on. None helped much, and spatial awareness remains the models’ biggest difficulty.
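The local view, for instance, is nothing more than a crop of the full ascii map around the player; a sketch:

```python
def local_view(grid, px, py, radius=3):
    """Return a (2*radius+1)-square ascii window centered on the player.
    `grid` is the full map as a list of strings; out-of-bounds tiles are
    padded with spaces."""
    rows = []
    for y in range(py - radius, py + radius + 1):
        row = ""
        for x in range(px - radius, px + radius + 1):
            if 0 <= y < len(grid) and 0 <= x < len(grid[y]):
                row += grid[y][x]
            else:
                row += " "
        rows.append(row)
    return "\n".join(rows)
```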
To illustrate the issue, imagine you come across this level:
-----
------- |...| -------------
|..^..| ------- |...| |...........|
|.....| #......| |....####?###|...........|
|.^...| #|......# --.-- #|...........|
|.....| #|..?..|# # #............|
|.`....# #|.....-####)## |...........|
-----.-# #-------# [###########............|
##### # # # # ----.----|---
## # # # ## # ## ###
#### # ### # # # #
######## # #------|--- # # #
-.--- ### # #-.....=... # # #
|...| ###(# # ###|........| ## ## ###
|.^{| #---(-# # #-........|##### -.-.-----#
|f.%| #|....# # |........|# ########........|#
|.@<| #....|##### ----------# |.......|#
-.--- |...-# # |.......-#
# -.--- `###### ---------
#
You’ve explored most of the level but still haven’t found the > (stairs down). In NetHack this means you must search walls and corridors for hidden passages, since each level in the main dungeon is guaranteed to have a staircase down. After a quick glance at this map, my instinct would be to travel to the dead-end corridor in the far east of the map and search there: the corridor suspiciously ends without connecting to another room, so it’s quite possible there is a hidden corridor or door at its end. If that fails, I’d travel to the room in the northeast and search along its eastern wall: since there’s little space for a hidden room anywhere else on the map, the hidden room is likely further east.
Many LLMs struggle with this. I encourage you to copy the above map, paste it into your chatbot of choice, and ask it the following:
Where is the hidden room with the staircase down most likely to be in this NetHack level?
I asked Opus 4.5 this and it mostly agrees with me, but interestingly, its output of map segments is disjointed and includes floating corridors with no connections. Perhaps this highlights a fundamental issue with interpreting ascii art: tokenization removes enough of the spatial meaning that interpretation becomes extremely challenging.
Next I asked Haiku 4.5 the same question, with much more disappointing results. It lands on a somewhat salvageable answer, but its reasoning is completely flawed: in NetHack, the ? symbol is commonly a scroll, most certainly not a hidden door. The fact that the model interprets this symbol as “indicating hidden doors” strains credulity: if they’re hidden, why would they have a symbol? A basic understanding of NetHack would reject that idea immediately. More notably, Haiku offers no spatial reasoning for why a hidden door might be in a particular place. Opus says the following:
Those # symbols extending to the right from the bottom-right room strongly suggest a hidden room exists in that direction. The corridors don’t connect to anything visible, which is a classic NetHack indicator of a secret door.
Meanwhile Haiku’s rationale:
The huge rectangular room on the right (the one filled with many . characters) is a major feature. In NetHack level design, significant rooms like this often have hidden connections.
Haiku is not describing the room in relation to the features around it; instead it’s applying a heuristic (big rooms often have hidden connections) that may work in some cases but certainly not all the time. This is indicative of a pattern I saw over and over in the agent runs: when an agent encountered a level without an obvious staircase down, it would begin searching in rooms that were almost certainly the wrong places to look (for example, a room already surrounded by other rooms on all sides) while completely ignoring the seemingly obvious spots where hidden doors might exist.
Strategy
Generally speaking, the LLMs struggled at long-term strategizing, with a consistent trend towards reactivity rather than proactivity: agents rushed headlong to lower dungeon levels without considering whether they were strong enough; they did not stockpile food and only treated it as a priority once they became Hungry or worse; they took the first down staircase they found, not caring to distinguish between regular stairs and the entrance to the Gnomish Mines, a dangerous dungeon branch they were often unprepared for. Given the harness’s heavy emphasis on context efficiency, its existing memory management tools may be insufficient, and a clear area of improvement is goal management: either by allowing the LLM to set the top priority for any given agent turn (food gathering, gaining experience, item identification, etc.) or by setting deterministic hooks that change the goal dynamically based on game state. One could even imagine a harness that employs sub-agents specialized in these specific tasks, with a supervisor agent determining which sub-agent to deploy at a given time.
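The deterministic-hook version could be as simple as an ordered list of (condition, goal) pairs evaluated every turn - a sketch of the idea, not something the harness implements today:

```python
# Evaluated top to bottom each turn; the first matching condition sets the goal.
# Illustrative only -- the harness does not currently implement goal hooks.
GOAL_HOOKS = [
    (lambda s: s.hunger in ("Hungry", "Weak", "Fainting"), "find or eat food"),
    (lambda s: s.hp < s.max_hp * 0.3,                      "retreat and heal up"),
    (lambda s: s.xp_level < s.dlvl,                        "gain experience before descending"),
    (lambda s: True,                                       "explore and locate the stairs down"),
]

def current_goal(state):
    return next(goal for cond, goal in GOAL_HOOKS if cond(state))
```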
Loops
Agents get into silly loops during NetHack runs that are as amusing as they are frustrating to watch. In one example, the player was blinded by a yellow light in a hallway and proceeded to get attacked by other monsters. When you’re blinded in NetHack, enemies that attack you leave behind markers on the map indicating the presence of an unseen monster. Even after defeating the enemies and regaining eyesight, these markers can linger on the map until the player clears them manually. In this case, the agent decided to wait at the other end of the hallway for the non-existent monsters to come to it; despite seeing no movement from these “enemies”, the agent spent hundreds of in-game turns and dozens of LLM turns patiently waiting for foes that would never come.
Model differences
As previously stated, GPT 5.2 was the clear leader in its expressive use of the API and its long-term strategizing. As for other standouts, Gemini 3 Flash was surprisingly competitive and was my go-to model during development for its speed. I was surprised to see how poorly Claude Opus 4.5 performed relative to other flagship models (especially given it helped me write the vast majority of the harness code!). Opus seemed to have much better map awareness during one-off conversations than during actual gameplay, which is puzzling: I used OpenRouter for all LLM interactions in this project, but perhaps there was some Anthropic-specific quirk I was missing in the agent loop.
New benchmark?
We already have the BALROG harness for benchmarking agent NetHack performance, so why would we need another? I can think of a couple of reasons. First, the code sandbox approach allows you to test not only how far the agent gets, but how efficiently it gets there; I could imagine a new benchmark score that combines percent progression with token usage, rewarding the former while penalizing the latter. Second, the highly configurable nature of the harness allows you to test both how good the model is at playing NetHack overall and under which specific scenarios it excels or falls short: by pulling levers such as local map vs. global map, sliding window length, level of observation masking, and which optional methods like autoexplore are included in the API, you’re able to probe the boundaries of the model’s spatial awareness, long-term planning, and more. My sense is that as LLMs grow in sophistication, they may begin to diverge more along distinct ability tracks: Gemini may have a superior “world model” while Claude consistently outperforms in coding, and so on. But I’ll admit I didn’t build this to formalize a new benchmark; I built it because it was fun. I certainly want to beat NetHack myself someday soon; I wonder whether I or an LLM will get there first?
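Purely as an illustration of that first point, the combined score could be as simple as progression discounted by the tokens spent - a strawman, not a proposal:

```python
import math

def efficiency_score(progression_pct, tokens_used):
    """Strawman metric: reward BALROG-style progression, discount it by the
    log of the tokens spent getting there. Illustrative only."""
    return progression_pct / math.log10(max(tokens_used, 10))
```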
If you’re interested, check out the GitHub repo. Clone it, run it yourself, fork it, whatever. Happy vibing 😎