The Xent Benchmark Explained
This article provides an overview of the Xent benchmark. In the course of this piece, we'll draw upon ideas presented in the Xent theory and Xent games articles. They aren't prerequisites, but we recommend reading them if you want a deeper understanding of how the Xent benchmark works.
High-Level Overview
The Xent benchmark is an LLM benchmark designed to evaluate LLM agents on their general intelligence and capabilities. It comprises a set of games. Each game contains a set of "maps." An LLM agent plays multiple rounds of a game map, with opportunities to improve its score based on earlier rounds.
Currently, the benchmark includes three games: Condense, Contrast, and Synthesize. Each game has 20 generated maps, for a total of 60 maps. Each map is played for approximately 30 rounds.
Scores for an agent are summed across all maps for a given game to compute the game-specific leaderboard, and across all games to compute the overall leaderboard. The game-specific views include charts that characterize learning dynamics (described in Results).
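For concreteness, here is a minimal Python sketch of that aggregation; the nested data layout and function names are illustrative assumptions, not the benchmark's own code.

def game_leaderboard(scores: dict, game: str) -> dict:
    # scores[agent][game][map_id] holds one agent's total score on one map (assumed layout).
    return {agent: sum(per_game[game].values()) for agent, per_game in scores.items()}

def overall_leaderboard(scores: dict) -> dict:
    # Sum an agent's scores across every map of every game.
    return {agent: sum(sum(maps.values()) for maps in per_game.values())
            for agent, per_game in scores.items()}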
Benchmark Configuration
The benchmark exposes several configuration points. Let's focus on three.
Judge Model
The judge model is the LLM used to generate maps, score games, and enforce rules. This model should be a base (pretrained) LLM; chat-tuned judges are possible, but they introduce complications we avoided for this benchmark version. Similarly, for simplicity, the judge should be a dense model, not a mixture of experts (MoE).
For the current version of the Xent benchmark, we used Qwen3-14B-Base, which is state-of-the-art among open pretrained dense models of its size.
Number of Maps Per Game
Since game maps are generated by the judge model, the number of generated maps is simply a configuration parameter. For the current Xent benchmark, we generated 20 maps per game. Experimentally, we've found that 20 maps is a good trade-off between cost and accuracy.
Number of Game Iterations
Xent caps gameplay by executed code lines ("steps"), not by round count. This evenly distributes compute across games. Longer games, with more lines of code (and thus more compute per round), therefore run fewer rounds.
For the current Xent benchmark, we configured 100 steps per map. In practice, with the relatively simple games selected for the benchmark, this yields approximately 30 rounds played per map.
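Putting these three configuration points together, here is a hypothetical sketch of such a configuration; the class and field names are illustrative, though the values match the ones described above.

from dataclasses import dataclass

@dataclass
class XentBenchmarkConfig:
    judge_model: str = "Qwen/Qwen3-14B-Base"   # base (pretrained), dense judge
    maps_per_game: int = 20                    # maps generated by the judge per game
    steps_per_map: int = 100                   # cap on executed code lines per map
    games: tuple = ("Condense", "Contrast", "Synthesize")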
The Games
Below we describe the games and why we selected them.
The Xent benchmark can be configured to include any valid Xent game. Games can be hand-written, programmatically generated, or created by AI agents.
For the current version, we wrote three simple games: "Condense," "Contrast," and "Synthesize." These games are closely related. While Xent is intended to be general, covering a wide range of skills and capabilities, we prioritized depth over breadth in this release by probing a narrow skill set.
Notably, even this reduced game set yields a leaderboard that aligns with our prior observations. Individual Xent games, even basic ones, can be deep. We plan to expand the benchmark to cover a broader range of skills and more human-facing tasks in future releases.
We summarize each included game below.
Note: see Xent games for notation and background.
Condense
assign(s=story())
elicit(x, 10)
assign(x1=remove_common_words(x, s))
reward(xed(s | x1))
Condense is one of our favorite Xent games. It's also one of the simplest.
The point of Condense is to find a prefix to a given story that helps an LLM predict the story. There are only two restrictions: the player may not use words that are in the story[1], and the answer must be 10 tokens or fewer.
In simplified terms, the score is computed as likelihood(text | prefix) - likelihood(text). As the prefix increases the text's likelihood, the score increases.[2]
A simple example is setting s to "A long time ago in a galaxy far, far away....". In this case, an excellent prefix would likely be "Star Wars opening crawl". Once an LLM sees the prefix, it will have a very easy time predicting the subsequent text.
Another example: for “The process by which a liquid substance transforms into a gas or vapor,” a prefix such as “evaporation definition” performs well.
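To make the scoring concrete, below is a minimal Python sketch of Condense-style scoring, assuming a Hugging Face causal LM as the judge. The helper name xent and the overall structure are illustrative assumptions, not the benchmark's actual implementation, and the sketch glosses over details such as the common-word filter, the token limit, and tokenization at the prefix/story boundary.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Judge used in this benchmark version; any open base LM works for this sketch.
MODEL = "Qwen/Qwen3-14B-Base"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

def xent(text: str, prefix: str = "") -> float:
    # Total cross-entropy (in nats) of `text`, optionally conditioned on `prefix`.
    text_ids = tok(text, return_tensors="pt").input_ids
    if prefix:
        prefix_ids = tok(prefix, return_tensors="pt").input_ids
        input_ids = torch.cat([prefix_ids, text_ids], dim=1)
        n_prefix = prefix_ids.shape[1]
    else:
        input_ids = text_ids
        n_prefix = 1  # without a prefix, the very first token has no context and is skipped
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1].float(), dim=-1)
    targets = input_ids[:, 1:]
    token_xent = -log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    # Only the story tokens are scored; prefix tokens serve purely as context.
    return token_xent[:, n_prefix - 1:].sum().item()

story = "A long time ago in a galaxy far, far away...."
prefix = "Star Wars opening crawl: "
score = xent(story) - xent(story, prefix)  # positive when the prefix helps the judge
print(f"score: {score:.2f}")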
Condense is simple yet rich. If you're interested, a playable version is available at xentlabs.ai.
Synthesize
assign(s1=story(), s2=story(), s3=story())
elicit(x, 10)
assign(x1=remove_common_words(x, s1 + s2 + s3))
reward(xed(s1 | x1))
reward(xed(s2 | x1))
reward(xed(s3 | x1))
Synthesize extends Condense to three stories instead of one.
Synthesize is substantially harder than Condense and has a higher skill ceiling. If a player finds the right words that cut across the three texts, the rewards are larger; depending on the texts, such prefixes can be difficult to discover.
With three stories, the common-word restriction is more constraining, making concise expression harder.
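Reusing the xent helper from the Condense sketch above, Synthesize's reward could be sketched as follows; the function name is illustrative, not the benchmark's own code.

def synthesize_reward(stories: list, prefix: str) -> float:
    # Sum the predictability gain the same prefix provides for each story.
    return sum(xent(s) - xent(s, prefix) for s in stories)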
Contrast
assign(s1=story(), s2=story())
elicit(x, 10)
assign(x1=remove_common_words(x, s1 + s2))
reward(xed(s1 | x1))
reward(dex(s2 | x1))
Contrast introduces a key inversion. The player is given two texts. For the first text, the player is effectively playing Condense. For the second, the player is playing “reverse Condense”: the prefix should make the second story less likely rather than more likely.
This inversion enables distinct strategies. For example, a player may find it more rewarding to reduce the likelihood of s2 and discount rewards from s1. It also pressures agents to reason about cross-entropy dynamics and adapt to feedback across iterations.
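Again reusing the xent helper from the Condense sketch, here is a hedged sketch of Contrast's combined reward. The dex function is assumed here to be the sign flip of xed, paying out when the prefix makes a story less likely; the actual definitions live in the Xent games article.

def xed(story: str, prefix: str) -> float:
    # Gain in predictability: positive when the prefix helps.
    return xent(story) - xent(story, prefix)

def dex(story: str, prefix: str) -> float:
    # Loss in predictability: positive when the prefix hurts.
    return xent(story, prefix) - xent(story)

def contrast_reward(s1: str, s2: str, prefix: str) -> float:
    return xed(s1, prefix) + dex(s2, prefix)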
The Results
With that background, we summarize how results are presented.
The overview tab shows each player's summed score across all maps. This cumulative score indicates the broad-strokes strength of an LLM.
The game-specific tabs have leaderboards that sum each player's scores across all maps of that game. These scores indicate the strength of an LLM on a specific game (and, correspondingly, on the skills required to play that game well).
Each game tab includes ARMS (Average Running Max Score) and ASPI (Average Score Per Iteration) charts. These metrics are aggregated across all the maps for the game: a point on ASPI at iteration n is the mean score at that iteration across maps, and a point on ARMS is the best-score-so-far at that iteration, averaged across maps.
The ARMS chart gives an indication of how well a model can continue to improve. A flat ARMS curve indicates that the model was not able to improve on its results any further; typically this occurs when the accumulated data extends beyond its effective context window.[3] A steep ARMS slope shows that the model is able to ingest new data and make significant improvements based on it.
ASPI reflects variability across iterations. Low variance suggests convergence; sustained variance suggests ongoing exploration.
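To make these definitions concrete, here is a minimal NumPy sketch of both curves computed from a matrix of per-map scores; the array layout is an assumption, and the benchmark's own aggregation code may differ.

import numpy as np

def aspi(scores: np.ndarray) -> np.ndarray:
    # Average Score Per Iteration: mean score at each iteration across maps.
    # `scores` has shape (num_maps, num_iterations).
    return scores.mean(axis=0)

def arms(scores: np.ndarray) -> np.ndarray:
    # Average Running Max Score: best-so-far per map, then averaged across maps.
    return np.maximum.accumulate(scores, axis=1).mean(axis=0)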
When comparing two agents, the one with higher ARMS can exhibit lower (and more variable) ASPI. This pattern suggests continued exploration rather than early convergence. For example, in Synthesize, o3 shows higher ARMS with lower, more variable ASPI than gpt-5; this pattern appears across all three games but is most pronounced on Synthesize.
Farewell for Now
Phew! That was a lot of text. Hopefully you now have a better understanding of the Xent benchmark, how exactly it works, and what the results mean.
If you want to learn more, you should read our Xent theory and Xent games articles. And if you want to learn a lot more, you should read our paper.
As always, feel free to reach out to us at ace@xentlabs.ai or ch@xentlabs.ai - we would love to hear from you.
Footnotes
[1] Astute readers may note that this rule enforcement is slightly different from the one presented in our Xent games article. The ensure statement used in the other piece is fully supported in XGL. However, the judge model (Qwen3-14B-Base) we used for this first version of the Xent benchmark was simply not strong enough to enforce such rules reliably. We expect to address this in subsequent benchmark versions with a stronger judge, as well as through industry-wide model improvements.

[2] Technically the game looks not at the likelihood but at the cross-entropy ("xent"); see the Xent games article for more details. We glossed over this in the interest of a quick introduction, since likelihood is a close enough concept to convey the meaning of the game.

[3] While LLMs nominally support extremely long context windows, experience has shown that there is a much smaller window within which data can be used effectively. Once data is pushed out of this "effective context window", the model can no longer leverage it as well as it could.