The Xent Benchmark in 10 Ideas

DATE: October 4, 2025 · SPECIMEN: XENT-BENCHMARK-10

Clément Hongler

My Ten Favorite Xent Benchmark Ideas

Since the Xent paper is relatively long, here is a list of my ten favorite ideas reported in it, which we came up with while building the Xent Game Theory benchmark (note: these ideas may or may not be new; to us, they were new; at the very least, we believe that their combination is most likely new).

Trust and uncheatability in a benchmark cannot be achieved without some sort of game-play where a model interacts with an environment (where the trust is guaranteed by using verifiably random seeds); anything else is prone to cheating and data contamination.
Everything that can be learned superficially about an external world can be encoded in a judge model pre-trained on data from that world (which we know how to do), and all deep understanding of that world can be derived by exploring the implicit knowledge of that judge model via game-play.
The concept of playability of a game (i.e. the ability to make progress via repeated few-shot play) as central to measuring useful skill (i.e. the games that measure skill are the playable ones).
The link between in-context learning and reinforcement learning for xent games, with the former being useful to directly measure a model's abilities, and the latter to measure transfer values between games.
The understanding that there is an unbounded number of games and tasks, but that these form a web with measurable links, rather than a disconnected set of objects; the understanding that these links are defined from transfer values.
The conjecture that xent games capture all the 'reasonable' game skills associated with an environment arbitrated by a judge model. This conjecture is simply based on the high modularity of the space of xent games and the fact that we could not find any playable game based on a judge model that could not be approximated by a xent game.
The conjecture that the space of xent games is a tightly inter-connected space: this follows from the modularity of the game space, where from each game a certain number of moves can be performed in game space (which the xent game language makes obvious), keeping (as suggested experimentally) good transfer values.
The idea that general abilities may be probed by a sequence of ever-increasing scopes, constructed from an algorithm motivated by evolution-based ideas, exploring the set of games like a web crawler would, using the links.
The notion of a benchmark built from a scope (i.e. a set of games, representing a set of skills): from the scope, a fair cover can be extracted using transfer values, leading to a set of games that fairly represents the skills associated with the scope. This enables us to use the scores on the games in the cover as the basis for a benchmark associated with that scope.
The measure of capability on a given game in terms of the speed of learning to reach a certain level. This is, in our opinion, the right way to measure intelligence: the speed at which a model is able to learn from the hard rules of its environment and the feedback it receives from that environment.
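
To make the last idea a bit more concrete, here is a minimal Python sketch of one way a "speed of learning" score could be computed from a learning curve. It is an illustration under assumptions, not the definition used in the paper: the function names `rounds_to_reach` and `learning_speed`, the choice of 1 / rounds as the speed, and the example scores are all hypothetical.

```python
from typing import Optional, Sequence


def rounds_to_reach(scores: Sequence[float], target: float) -> Optional[int]:
    """Return the 1-based round at which the score first reaches `target`.

    `scores` is a hypothetical learning curve: one score per round of play.
    Returns None if the target level is never reached.
    """
    for round_index, score in enumerate(scores, start=1):
        if score >= target:
            return round_index
    return None


def learning_speed(scores: Sequence[float], target: float) -> float:
    """Assumed speed measure: 1 / (rounds needed), or 0.0 if never reached."""
    rounds = rounds_to_reach(scores, target)
    return 0.0 if rounds is None else 1.0 / rounds


# Made-up learning curves for two models playing the same game.
fast_learner = [0.10, 0.45, 0.80, 0.85]
slow_learner = [0.10, 0.20, 0.35, 0.55, 0.80]

if __name__ == "__main__":
    print(learning_speed(fast_learner, target=0.80))
    print(learning_speed(slow_learner, target=0.80))
```

Under this sketch, both models end up at the same level, but the one that gets there in fewer rounds receives the higher score, which is the distinction the idea is after.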