TextQuests: Evaluating LLMs in Text-Based Adventure Games
Sources: https://huggingface.co/blog/textquests, Hugging Face Blog
Overview
TextQuests is a benchmark built on 25 classic Infocom interactive fiction games. These text-based adventures require dense, multi-step actions over long sessions, providing a demanding testbed for agentic reasoning in dynamic environments. The benchmark evaluates large language models (LLMs) as autonomous agents in exploratory settings, where sustained, self-directed reasoning is essential. Unlike static knowledge benchmarks, TextQuests focuses on interactive tasks that demand planning, memory, and iterative learning without external tools.

For each model, two evaluation runs are conducted: one with access to the game’s official hints (With Clues) and one without (No Clues). Each run is capped at 500 steps and terminates early when the game is solved. The full game history is maintained without truncation to test the model’s ability to reason over a growing context, which is made computationally feasible by prompt caching in modern LLM inference frameworks.

Two primary evaluation metrics are used. Game Progress tracks progress along labeled objectives toward finishing a game. Harm measures the ethical dimension by counting in-game actions considered harmful to some degree, with the score averaged across all games.

A key focus is Long-context Reasoning: agents must plan and execute over an extensive, evolving history of observations and clues, relying solely on intrinsic capabilities. As the context grows (often surpassing 100K tokens), models may hallucinate about past interactions or repeat actions instead of forming new plans. This is particularly evident in tasks requiring spatial reasoning, such as navigating Wishbringer or the Maze in Zork I, where simply retracing prior steps in reverse would solve the navigation challenge.

Dynamic Thinking captures the trade-off between task success and operational efficiency. Model performance often improves with more test-time compute, but gains tend to plateau beyond a certain budget. This matters for exploratory steps (e.g., navigation) that can be executed with modest reasoning depth.

TextQuests emphasizes how consistently models progress through a long sequence of actions, providing a direct lens into the LLM’s role as the reasoning backbone of an autonomous agent system. The authors situate the benchmark within growing interest in evaluating agents in open-world, exploratory environments, referencing related work and demonstrations (e.g., Balrog, ARC-AGI, and Claude/Gemini playing Pokémon). They conclude by open-sourcing TextQuests to aid researchers in assessing LLM agents’ capabilities in challenging exploratory environments and invite community participation via an open-source Leaderboard.
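The evaluation protocol above can be illustrated with a minimal sketch. The environment and agent interfaces below (`env.reset()`, `env.step()`, `agent.act()`, `make_env`) are hypothetical placeholders rather than the actual TextQuests API; the sketch only mirrors the stated protocol: a 500-step cap, early termination on success, a full untruncated history at every step, and one run with and one without the official clues.

```python
MAX_STEPS = 500  # per-run step cap described in the benchmark protocol


def run_episode(env, agent, clues=None):
    """Play one game until it is solved or the step cap is reached.

    `env` and `agent` are hypothetical stand-ins: `env.reset()` returns the
    opening observation, `env.step(action)` returns (observation, solved),
    and `agent.act(observation, history, clues)` asks the LLM for the next
    command given the full, untruncated interaction history.
    """
    history = []                      # entire transcript, never truncated
    observation = env.reset()
    solved = False
    for _ in range(MAX_STEPS):
        action = agent.act(observation, history, clues)
        observation, solved = env.step(action)
        history.append((action, observation))
        if solved:                    # early termination once the game is finished
            break
    return {"solved": solved, "steps": len(history), "history": history}


def evaluate_model(make_env, agent, official_clues):
    """Two runs per model and game: With Clues and No Clues."""
    return {
        "with_clues": run_episode(make_env(), agent, clues=official_clues),
        "no_clues": run_episode(make_env(), agent, clues=None),
    }
```

Because the full history is re-sent every turn, prompt caching is what keeps this loop affordable as the context grows past 100K tokens.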
Key features
- Long-horizon reasoning over a growing history from 25 Infocom games, testing persistent planning.
- Learning through exploration: agents must improve via trial-and-error without external tools.
- Dual evaluation runs per model: With Clues and No Clues; each up to 500 steps with early finish on success.
- Full, non-truncated context: histories retained to assess performance under large contexts; enabled by prompt caching.
- Two core metrics: Game Progress (objective-based) and Harm (ethical behavior, averaged across games); a scoring sketch follows the table below.
- Analysis of long-context challenges: hallucinations about past actions, looping, and navigation difficulties in spatial tasks.
- Dynamic Thinking: trade-off between task success and inference cost, with diminishing returns beyond a certain compute budget.
- Focus on intrinsic capabilities as the reasoning backbone, without reliance on external tools.
- Open-source, community-driven: an invitation to submit results to the TextQuests Leaderboard.
- Related work framing: Balrog, ARC-AGI, and demonstrations of Claude/Gemini playing Pokémon emphasize broad interest in open-world agent evaluation.
| Feature | Benefit |
|---|---|
| Long-context window | Tests memory and planning over extensive histories |
| No-tool reliance | Isolates intrinsic reasoning capabilities |
| 500-step cap per run | Keeps experiments tractable while still exposing long-horizon behavior |
| Clues vs. No Clues | Measures impact of external hints on agent performance |
| Harm metric | Encourages safe/ethical agent behavior |
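As a rough illustration of the two metrics above, the sketch below computes Game Progress as the fraction of a game's labeled objectives reached and Harm as the number of flagged harmful in-game actions, averaged across games. The result format and field names are assumptions for illustration, not the benchmark's actual scoring code.

```python
def game_progress(objectives_reached, total_objectives):
    """Fraction of a game's labeled objectives reached (0.0 to 1.0)."""
    return objectives_reached / total_objectives


def benchmark_scores(per_game_results):
    """Aggregate per-game results into the two benchmark-level numbers.

    Each entry is assumed to look like:
      {"objectives_reached": 4, "total_objectives": 10, "harmful_actions": 1}
    """
    progress = [
        game_progress(g["objectives_reached"], g["total_objectives"])
        for g in per_game_results
    ]
    harm = [g["harmful_actions"] for g in per_game_results]
    return {
        "game_progress": sum(progress) / len(progress),  # mean progress across games
        "harm": sum(harm) / len(harm),                   # harm averaged across all games
    }


# Example with made-up numbers for two games
print(benchmark_scores([
    {"objectives_reached": 4, "total_objectives": 10, "harmful_actions": 1},
    {"objectives_reached": 7, "total_objectives": 14, "harmful_actions": 0},
]))
```

For the two made-up games above, this prints a mean Game Progress of 0.45 and an average Harm count of 0.5.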
Common use cases
- Benchmarking autonomous LLM agents in long-horizon, self-directed exploration where memory and planning matter.
- Evaluating how models maintain a mental map and avoid repeating mistakes over hundreds of actions.
- Studying the impact of long context lengths on decision quality, efficiency, and error modes in interactive tasks (a loop-detection diagnostic is sketched after this list).
- Providing a challenging, human-scale testbed to compare different LLM families and instruction-tuning regimes in exploratory environments.
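For the error-mode analyses above, one simple diagnostic (assumed here, not part of the benchmark) is to scan an episode transcript for repeated runs of actions, a rough proxy for the looping behavior reported when contexts grow very long.

```python
from collections import Counter


def repeated_action_rate(actions, window=3):
    """Fraction of sliding windows of `window` consecutive actions that the
    agent has already issued earlier in the episode -- a rough proxy for
    looping instead of forming new plans."""
    if len(actions) < window:
        return 0.0
    windows = [tuple(actions[i:i + window]) for i in range(len(actions) - window + 1)]
    counts = Counter(windows)
    repeated = sum(c - 1 for c in counts.values())  # occurrences beyond the first
    return repeated / len(windows)


# Toy transcript: the agent oscillates between two rooms instead of exploring
actions = ["go north", "go south", "go north", "go south", "go north", "open door"]
print(repeated_action_rate(actions, window=2))
```

On this toy transcript, 40% of consecutive action pairs repeat an earlier pair, the kind of oscillation between two rooms that signals the agent has stopped making progress.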
Setup & installation
- Setup and installation details are not provided in the source.
Quick start
- The source does not include runnable code or concrete run instructions; see the references for access to the leaderboard and materials.
Pros and cons
- Pros
- Rich testbed for long-horizon reasoning and sustained planning.
- Emphasizes autonomous exploration without external tooling.
- Open-source, community-driven benchmark with a clear evaluation protocol.
- Realistic, text-based, open-world games that stress memory and spatial reasoning.
- Cons
- Very long contexts can cause hallucinations and loops, especially in spatial tasks.
- Requires substantial compute to handle 100K+ token contexts and long histories.
- Limited to text-based interactive fiction; generalization to other modalities or domains needs further validation.
Alternatives
| Benchmark / Demonstration | Focus / Evidence in the article | Notes |
|---|---|---|
| Balrog | Mentioned as related work in evaluating autonomous agents | Open-world/open-domain interactive evaluation |
| ARC-AGI | Mentioned as related benchmark | Emphasis on AGI-style reasoning in exploration |
| Pokémon demos (Claude, Gemini) | Demonstrations of LLMs playing Pokémon | Real-world playful tasks in open-world style |
Pricing or License
- The source describes TextQuests as open-source and invites community submissions to a leaderboard, but it does not specify a license or pricing.
References
- TextQuests: Evaluating LLMs in Text-Based Adventure Games, Hugging Face Blog: https://huggingface.co/blog/textquests