The Xent Benchmark Explained
This article provides an overview of the Xent benchmark. In the course of this piece, we'll draw upon ideas presented in the Xent theory and Xent games articles. They aren't prerequisites, but we recommend reading them if you want a deeper understanding of how the Xent benchmark works.
High-Level Overview
The Xent benchmark is an LLM benchmark designed to evaluate LLM agents on their general intelligence and capabilities. It comprises a set of games. Each game contains a set of "maps." An LLM agent plays multiple rounds of a game map, with opportunities to improve its score based on earlier rounds.
Currently, the benchmark includes three games: Condense, Contrast, and Synthesize. Each game has 20 generated maps, for a total of 60 maps. Each map is played for approximately 30 rounds.
Scores for an agent are summed across all maps for a given game to compute the game-specific leaderboard, and across all games to compute the overall leaderboard. The game-specific views include charts that characterize learning dynamics (described in Results).
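For concreteness, here is a minimal Python sketch of that aggregation; the nested data layout and function names are illustrative assumptions, not the benchmark's own code.

def game_leaderboard(scores: dict, game: str) -> dict:
    # scores[agent][game][map_id] holds one agent's total score on one map (assumed layout).
    return {agent: sum(per_game[game].values()) for agent, per_game in scores.items()}

def overall_leaderboard(scores: dict) -> dict:
    # Sum an agent's scores across every map of every game.
    return {agent: sum(sum(maps.values()) for maps in per_game.values())
            for agent, per_game in scores.items()}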
Benchmark Configuration
The benchmark exposes several configuration points. Let's focus on three.
Judge Model
The judge model is the LLM used to generate maps, score games, and enforce rules. This model should be a base (pretrained) LLM; chat-tuned judges are possible, but they introduce complications we avoided for this benchmark version. Similarly, for simplicity, the judge should be a dense model, not a mixture of experts (MoE).
For the current version of the Xent benchmark, we used Qwen3-14B-Base, which is state-of-the-art among open pretrained dense models of its size.
Number of Maps Per Game
Since game maps are generated by the judge model, the number of generated maps is simply a configuration parameter. For the current Xent benchmark, we generated 20 maps per game. Experimentally, we've found that 20 maps is a good trade-off between cost and accuracy.
Number of Game Iterations
Xent caps gameplay by executed code lines ("steps"), not by round count. This evenly distributes compute across games. Longer games, with more lines of code (and thus more compute per round), therefore run fewer rounds.
For the current Xent benchmark, we configured 100 steps per map. In practice, with the relatively simple games selected for the benchmark, this yields approximately 30 rounds played per map.
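Putting these three configuration points together, here is a hypothetical sketch of such a configuration; the class and field names are illustrative, though the values match the ones described above.

from dataclasses import dataclass

@dataclass
class XentBenchmarkConfig:
    judge_model: str = "Qwen/Qwen3-14B-Base"   # base (pretrained), dense judge
    maps_per_game: int = 20                    # maps generated by the judge per game
    steps_per_map: int = 100                   # cap on executed code lines per map
    games: tuple = ("Condense", "Contrast", "Synthesize")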
The Games
Below we describe the games and why we selected them.
The Xent benchmark can be configured to include any valid Xent game. Games can be hand-written, programmatically generated, or created by AI agents.
For the current version, we wrote three simple games: "Condense," "Contrast," and "Synthesize." These games are closely related. While Xent is intended to be general, covering a wide range of skills and capabilities, we prioritized depth over breadth in this release by probing a narrow skill set.
Notably, even this reduced game set yields a leaderboard that aligns with our prior observations. Individual Xent games, even basic ones, can be deep. We plan to expand the benchmark to cover a broader range of skills and more human-facing tasks in future releases.
We summarize each included game below.
Note: see Xent games for notation and background.
Condense
assign(s=story())
elicit(x, 10)
assign(x1=remove_common_words(x, s))
reward(xed(s | x1))
Condense is one of our favorite Xent games. It's also one of the simplest.
The point of Condense is to find a prefix to a given story that helps an LLM predict the story. There are only two restrictions: the player may not use words that are in the story[1], and the answer must be 10 tokens or fewer.
In simplified terms, the score is computed as likelihood(text | prefix) - likelihood(text). As the prefix increases the text's likelihood, the score increases.[2]
A simple example is setting s to "A long time ago in a galaxy far, far away....". In this case, an excellent prefix would likely be "Star Wars opening crawl". Once an LLM sees the prefix, it will have a very easy time predicting the subsequent text.
Another example: for “The process by which a liquid substance transforms into a gas or vapor,” a prefix such as “evaporation definition” performs well.
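To make the scoring concrete, below is a minimal Python sketch of Condense-style scoring, assuming a Hugging Face causal LM as the judge. The helper name xent and the overall structure are illustrative assumptions, not the benchmark's actual implementation, and the sketch glosses over details such as the common-word filter, the token limit, and tokenization at the prefix/story boundary.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Judge used in this benchmark version; any open base LM works for this sketch.
MODEL = "Qwen/Qwen3-14B-Base"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

def xent(text: str, prefix: str = "") -> float:
    # Total cross-entropy (in nats) of `text`, optionally conditioned on `prefix`.
    text_ids = tok(text, return_tensors="pt").input_ids
    if prefix:
        prefix_ids = tok(prefix, return_tensors="pt").input_ids
        input_ids = torch.cat([prefix_ids, text_ids], dim=1)
        n_prefix = prefix_ids.shape[1]
    else:
        input_ids = text_ids
        n_prefix = 1  # without a prefix, the very first token has no context and is skipped
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1].float(), dim=-1)
    targets = input_ids[:, 1:]
    token_xent = -log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    # Only the story tokens are scored; prefix tokens serve purely as context.
    return token_xent[:, n_prefix - 1:].sum().item()

story = "A long time ago in a galaxy far, far away...."
prefix = "Star Wars opening crawl: "
score = xent(story) - xent(story, prefix)  # positive when the prefix helps the judge
print(f"score: {score:.2f}")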
Condense is simple yet rich. If you're interested, a playable version is available at xentlabs.ai.
Synthesize
assign(s1=story(), s2=story(), s3=story())
elicit(x, 10)
assign(x1=remove_common_words(x, s1 + s2 + s3))
reward(xed(s1 | x1))
reward(xed(s2 | x1))
reward(xed(s3 | x1))
Synthesize extends Condense to three stories instead of one.
Synthesize is substantially harder than Condense and has a higher skill ceiling. If a player finds the right words that cut across the three texts, the rewards are larger; depending on the texts, such prefixes can be difficult to discover.
With three stories, the common-word restriction is more constraining, making concise expression harder.
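Reusing the xent helper from the Condense sketch above, Synthesize's reward could be sketched as follows; the function name is illustrative, not the benchmark's own code.

def synthesize_reward(stories: list, prefix: str) -> float:
    # Sum the predictability gain the same prefix provides for each story.
    return sum(xent(s) - xent(s, prefix) for s in stories)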
Contrast
assign(s1=story(), s2=story())
elicit(x, 10)
assign(x1=remove_common_words(x, s1 + s2))
reward(xed(s1 | x1))
reward(dex(s2 | x1))
Contrast introduces a key inversion. The player is given two texts. For the first text, the player is effectively playing Condense. For the second, the player is playing “reverse Condense”: the prefix should make the second story less likely rather than more likely.
This inversion enables distinct strategies. For example, a player may find it more rewarding to reduce the likelihood of s2 and discount rewards from s1. It also pressures agents to reason about cross-entropy dynamics and adapt to feedback across iterations.
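Again reusing the xent helper from the Condense sketch, here is a hedged sketch of Contrast's combined reward. The dex function is assumed here to be the sign flip of xed, paying out when the prefix makes a story less likely; the actual definitions live in the Xent games article.

def xed(story: str, prefix: str) -> float:
    # Gain in predictability: positive when the prefix helps.
    return xent(story) - xent(story, prefix)

def dex(story: str, prefix: str) -> float:
    # Loss in predictability: positive when the prefix hurts.
    return xent(story, prefix) - xent(story)

def contrast_reward(s1: str, s2: str, prefix: str) -> float:
    return xed(s1, prefix) + dex(s2, prefix)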
The Results
With that background, we summarize how results are presented.
The overview tab shows each player's summed score across all maps. This cumulative score indicates the broad-strokes strength of an LLM.
The game-specific tabs have leaderboards that sum each player's scores across all maps of that game. These scores indicate the strength of an LLM on a specific game (and, correspondingly, on the skills required to play that game well).
Each game tab includes ARMS (Average Running Max Score) and ASPI (Average Score Per Iteration) charts. These metrics are aggregated across all the maps for the game: a point on ASPI at iteration n is the mean score at that iteration across maps, and a point on ARMS is the best-score-so-far at that iteration, averaged across maps.
The ARMS chart gives an indication of how well a model can continue to improve. A flat ARMS curve indicates that the model was not able to improve on its results any further; typically this occurs when the accumulated data extends beyond its effective context window.[3] A steep ARMS slope shows that the model is able to ingest new data and make significant improvements based on it.
ASPI reflects variability across iterations. Low variance suggests convergence; sustained variance suggests ongoing exploration.
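To make these definitions concrete, here is a minimal NumPy sketch of both curves computed from a matrix of per-map scores; the array layout is an assumption, and the benchmark's own aggregation code may differ.

import numpy as np

def aspi(scores: np.ndarray) -> np.ndarray:
    # Average Score Per Iteration: mean score at each iteration across maps.
    # `scores` has shape (num_maps, num_iterations).
    return scores.mean(axis=0)

def arms(scores: np.ndarray) -> np.ndarray:
    # Average Running Max Score: best-so-far per map, then averaged across maps.
    return np.maximum.accumulate(scores, axis=1).mean(axis=0)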
When comparing two agents, the one with higher ARMS can exhibit lower (and more variable) ASPI. This pattern suggests continued exploration rather than early convergence. For example, in Synthesize, o3 shows higher ARMS with lower, more variable ASPI than gpt-5; this pattern appears across all three games but is most pronounced on Synthesize.
Farewell for Now
Phew! That was a lot of text. Hopefully you now have a better understanding of the Xent benchmark, how exactly it works, and what the results mean.
If you want to learn more, you should read our Xent theory and Xent games articles. And if you want to learn a lot more, you should read our paper.
As always, feel free to reach out to us at ace@xentlabs.ai or ch@xentlabs.ai - we would love to hear from you.
Footnotes
[1] Astute readers may note that this rule enforcement is slightly different from the one presented in our Xent games article. The ensure statement used in the other piece is fully supported in XGL. However, the judge model (Qwen3-14B-Base) we used for this first version of the Xent benchmark was simply not strong enough to enforce such rules reliably. We expect to address this in subsequent benchmark versions with a stronger judge, as well as through industry-wide model improvements.

[2] Technically the game looks not at the likelihood but at the cross-entropy ("xent"); see the Xent games article for more details. We glossed over this in the interest of a quick introduction, since likelihood is a close enough concept to convey the meaning of the game.

[3] While LLMs nominally support extremely long context windows, experience has shown that there is a much smaller window within which data can be used effectively. Once data is pushed out of this "effective context window", the model can no longer leverage it as well as it could.