TextQuests: Evaluating LLMs in Text-Based Adventure Games
Sources: https://huggingface.co/blog/textquests, Hugging Face Blog
Overview
TextQuests is a benchmark built on 25 classic Infocom interactive fiction games. These text-based adventures require dense, multi-step actions over long sessions, providing a demanding testbed for agentic reasoning in dynamic environments. The benchmark evaluates large language models (LLMs) as autonomous agents in exploratory settings, where sustained, self-directed reasoning is essential. Unlike static knowledge benchmarks, TextQuests focuses on interactive tasks that demand planning, memory, and iterative learning without external tools.

For each model, two evaluation runs are conducted: one with access to the game’s official hints (With Clues) and one without (No Clues). Each run is capped at 500 steps and terminates early when the game is solved. The full game history is maintained without truncation to test the model’s ability to reason over a growing context, which is made computationally feasible by prompt caching in modern LLM inference frameworks.

Two primary evaluation metrics are used. Game Progress tracks progress along labeled objectives toward finishing a game. Harm measures the ethical dimension by counting in-game actions considered harmful to some degree, with the score averaged across all games.

A key focus is Long-context Reasoning: agents must plan and execute over an extensive, evolving history of observations and clues, relying solely on intrinsic capabilities. As the context grows (often surpassing 100K tokens), models may hallucinate about past interactions or repeat actions instead of forming new plans. This is particularly evident in tasks requiring spatial reasoning, such as navigating Wishbringer or the Maze in Zork I, where simply retracing prior steps in reverse would solve the navigation challenge.

Dynamic Thinking captures the trade-off between task success and operational efficiency. Model performance often improves with more test-time compute, but gains tend to plateau beyond a certain budget. This matters for exploratory steps (e.g., navigation) that can be executed with modest reasoning depth.

TextQuests emphasizes how consistently models progress through a long sequence of actions, providing a direct lens into the LLM’s role as the reasoning backbone of an autonomous agent system. The authors situate the benchmark within growing interest in evaluating agents in open-world, exploratory environments, referencing related work and demonstrations (e.g., Balrog, ARC-AGI, and Claude/Gemini playing Pokémon). They conclude by open-sourcing TextQuests to aid researchers in assessing LLM agents’ capabilities in challenging exploratory environments and invite community participation via an open-source Leaderboard.
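The evaluation protocol above can be illustrated with a minimal sketch. The environment and agent interfaces below (`env.reset()`, `env.step()`, `agent.act()`, `make_env`) are hypothetical placeholders rather than the actual TextQuests API; the sketch only mirrors the stated protocol: a 500-step cap, early termination on success, a full untruncated history at every step, and one run with and one without the official clues.

```python
MAX_STEPS = 500  # per-run step cap described in the benchmark protocol


def run_episode(env, agent, clues=None):
    """Play one game until it is solved or the step cap is reached.

    `env` and `agent` are hypothetical stand-ins: `env.reset()` returns the
    opening observation, `env.step(action)` returns (observation, solved),
    and `agent.act(observation, history, clues)` asks the LLM for the next
    command given the full, untruncated interaction history.
    """
    history = []                      # entire transcript, never truncated
    observation = env.reset()
    solved = False
    for _ in range(MAX_STEPS):
        action = agent.act(observation, history, clues)
        observation, solved = env.step(action)
        history.append((action, observation))
        if solved:                    # early termination once the game is finished
            break
    return {"solved": solved, "steps": len(history), "history": history}


def evaluate_model(make_env, agent, official_clues):
    """Two runs per model and game: With Clues and No Clues."""
    return {
        "with_clues": run_episode(make_env(), agent, clues=official_clues),
        "no_clues": run_episode(make_env(), agent, clues=None),
    }
```

Because the full history is re-sent every turn, prompt caching is what keeps this loop affordable as the context grows past 100K tokens.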
Key features
- Long-horizon reasoning over a growing history from 25 Infocom games, testing persistent planning.
- Learning through exploration: agents must improve via trial-and-error without external tools.
- Dual evaluation runs per model: With Clues and No Clues; each up to 500 steps with early finish on success.
- Full, non-truncated context: histories retained to assess performance under large contexts; enabled by prompt caching.
- Two core metrics: Game Progress (objective-based) and Harm (ethical behavior, averaged across games); a scoring sketch follows the table below.
- Analysis of long-context challenges: hallucinations about past actions, looping, and navigation difficulties in spatial tasks.
- Dynamic Thinking: trade-off between task success and inference cost, with diminishing returns beyond a certain compute budget.
- Focus on intrinsic capabilities as the reasoning backbone, without reliance on external tools.
- Open-source, community-driven: an invitation to submit results to the TextQuests Leaderboard.
- Related work framing: Balrog, ARC-AGI, and demonstrations of Claude/Gemini playing Pokémon emphasize broad interest in open-world agent evaluation.
| Feature | Benefit |
|---|---|
| Long-context window | Tests memory and planning over extensive histories |
| No-tool reliance | Isolates intrinsic reasoning capabilities |
| 500-step cap per run | Keeps experiments tractable while still exposing long-horizon behavior |
| Clues vs. No Clues | Measures impact of external hints on agent performance |
| Harm metric | Encourages safe/ethical agent behavior |
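As a rough illustration of the two metrics above, the sketch below computes Game Progress as the fraction of a game's labeled objectives reached and Harm as the number of flagged harmful in-game actions, averaged across games. The result format and field names are assumptions for illustration, not the benchmark's actual scoring code.

```python
def game_progress(objectives_reached, total_objectives):
    """Fraction of a game's labeled objectives reached (0.0 to 1.0)."""
    return objectives_reached / total_objectives


def benchmark_scores(per_game_results):
    """Aggregate per-game results into the two benchmark-level numbers.

    Each entry is assumed to look like:
      {"objectives_reached": 4, "total_objectives": 10, "harmful_actions": 1}
    """
    progress = [
        game_progress(g["objectives_reached"], g["total_objectives"])
        for g in per_game_results
    ]
    harm = [g["harmful_actions"] for g in per_game_results]
    return {
        "game_progress": sum(progress) / len(progress),  # mean progress across games
        "harm": sum(harm) / len(harm),                   # harm averaged across all games
    }


# Example with made-up numbers for two games
print(benchmark_scores([
    {"objectives_reached": 4, "total_objectives": 10, "harmful_actions": 1},
    {"objectives_reached": 7, "total_objectives": 14, "harmful_actions": 0},
]))
```

For the two made-up games above, this prints a mean Game Progress of 0.45 and an average Harm count of 0.5.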
Common use cases
- Benchmarking autonomous LLM agents in long-horizon, self-directed exploration where memory and planning matter.
- Evaluating how models maintain a mental map and avoid repeating mistakes over hundreds of actions.
- Studying the impact of long context lengths on decision quality, efficiency, and error modes in interactive tasks (a loop-detection diagnostic is sketched after this list).
- Providing a challenging, human-scale testbed to compare different LLM families and instruction-tuning regimes in exploratory environments.
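For the error-mode analyses above, one simple diagnostic (assumed here, not part of the benchmark) is to scan an episode transcript for repeated runs of actions, a rough proxy for the looping behavior reported when contexts grow very long.

```python
from collections import Counter


def repeated_action_rate(actions, window=3):
    """Fraction of sliding windows of `window` consecutive actions that the
    agent has already issued earlier in the episode -- a rough proxy for
    looping instead of forming new plans."""
    if len(actions) < window:
        return 0.0
    windows = [tuple(actions[i:i + window]) for i in range(len(actions) - window + 1)]
    counts = Counter(windows)
    repeated = sum(c - 1 for c in counts.values())  # occurrences beyond the first
    return repeated / len(windows)


# Toy transcript: the agent oscillates between two rooms instead of exploring
actions = ["go north", "go south", "go north", "go south", "go north", "open door"]
print(repeated_action_rate(actions, window=2))
```

On this toy transcript, 40% of consecutive action pairs repeat an earlier pair, the kind of oscillation between two rooms that signals the agent has stopped making progress.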
Setup & installation
- Setup and installation details are not provided in the source.
Quick start
- The source does not include runnable code or concrete run instructions; see the references for access to the leaderboard and materials.
Pros and cons
- Pros
- Rich testbed for long-horizon reasoning and sustained planning.
- Emphasizes autonomous exploration without external tooling.
- Open-source, community-driven benchmark with a clear evaluation protocol.
- Realistic, text-based, open-world games that stress memory and spatial reasoning.
- Cons
- Very long contexts can cause hallucinations and loops, especially in spatial tasks.
- Requires substantial compute to handle 100K+ token contexts and long histories.
- Limited to text-based interactive fiction; generalization to other modalities or domains needs further validation.
Alternatives
| Benchmark / Demonstration | Focus / Evidence in the article | Notes |
|---|---|---|
| Balrog | Mentioned as related work in evaluating autonomous agents | Open-world/open-domain interactive evaluation |
| ARC-AGI | Mentioned as related benchmark | Emphasis on AGI-style reasoning in exploration |
| Pokémon demos (Claude, Gemini) | Demonstrations of LLMs playing Pokémon | Real-world playful tasks in open-world style |
Pricing or License
- The source describes TextQuests as open-source and invites community submissions to a leaderboard, but it does not specify a license or pricing.
References
- TextQuests: Evaluating LLMs in Text-Based Adventure Games, Hugging Face Blog: https://huggingface.co/blog/textquests