Thoughts on ARC-AGI-2
03 Apr 2025
No AGI yet, human cognition and videogames
The recent ARC-AGI-2 release makes me think about how we learn, and about puzzle videogames. OpenAI's recent o3 model reached 75.7% and 87.5% accuracy on the first version of the ARC-AGI benchmark. However, it had a per-task cost of $200 for the high-efficiency version, and much more for the low-efficiency one. They don't mention that cost; they just say it requires 172x the compute. You can make an estimate yourself...
I won't analyze o3 and LLMs here; for that, I'll refer to the blog post and the YouTube interview with Machine Learning Street Talk. What matters for now is that the second version of this benchmark has been released, bringing the performance of every model on earth below 5%. I've long held the belief that LLMs alone do not and cannot be AGI. The test-time compute used in o3 and recent reasoning models is valuable, but it is not the System 2 thinking we need.
We need something that allows LLMs to perform symbolic reasoning without hardcoding anything. We need something that lets them form self-consistent building blocks and then compose them, tinker with them, like you do when mixing concepts in your head, or when you combine mathematical concepts you have not fully internalized. In other words, these systems need to be able to build discrete rules and then apply them.
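To make "discrete rules you can compose" concrete, here is a toy sketch of my own (not anything from o3 or the ARC Prize entries): rules as small, named grid transformations, with composition as the way a handful of primitives can cover many situations.

```python
# Toy illustration: discrete "rules" as small grid transformations
# that can be parameterized and composed.
from typing import Callable, List

Grid = List[List[int]]
Rule = Callable[[Grid], Grid]

def flip_horizontal(g: Grid) -> Grid:
    return [row[::-1] for row in g]

def transpose(g: Grid) -> Grid:
    return [list(col) for col in zip(*g)]

def recolor(src: int, dst: int) -> Rule:
    # A parameterized rule: replace one colour with another everywhere.
    return lambda g: [[dst if c == src else c for c in row] for row in g]

def compose(*rules: Rule) -> Rule:
    # Composition is what lets a few primitives cover many tasks.
    def composed(g: Grid) -> Grid:
        for r in rules:
            g = r(g)
        return g
    return composed

rotate_cw = compose(transpose, flip_horizontal)   # a derived rule
print(rotate_cw([[1, 0], [2, 3]]))                # [[2, 1], [3, 0]]
```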
In On the Measure of Intelligence, François Chollet suggested narrowing the scope of general intelligence and focusing, at least for now, not on an intelligence that solves every economic problem but on our own kind, the only one we know to be general. It's becoming increasingly difficult to evaluate LLM performance; in many cases we resort to "LLMs as judges" when there is no ground truth, or for tasks in which LLMs already outperform humans. Before ARC-AGI, no one had taken this particular approach, at least not with success. The tasks in this benchmark are (almost) all straightforward for humans and require zero prior knowledge, yet they assume certain cognitive priors: to mention a couple, the ones that probably allow us to recognize intent in moving abstract shapes, or to perceive adjacent elements as a single coherent object. This benchmark requires models to adapt to completely new problems on the fly; as Chollet says, models cannot "buy" performance with sane levels of compute.
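Concretely, a public ARC task is just a small JSON file with a few demonstration pairs and one or more test inputs, and the solver has to infer the transformation from the demonstrations alone. A minimal sketch of loading one and checking a candidate hypothesis (the "train"/"test" field names follow the public ARC dataset; the rule itself is only a placeholder):

```python
import json

# Public ARC tasks are JSON objects with "train" and "test" lists of
# {"input": grid, "output": grid} pairs; a grid is a list of lists of
# integers 0-9.
task = json.loads("""
{
  "train": [
    {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
    {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]}
  ],
  "test": [
    {"input": [[3, 0], [0, 0]], "output": [[0, 3], [0, 0]]}
  ]
}
""")

def candidate_rule(grid):
    # Placeholder hypothesis: mirror each row left to right.
    return [row[::-1] for row in grid]

# The only signal available: does the hypothesis reproduce every demonstration?
assert all(candidate_rule(p["input"]) == p["output"] for p in task["train"])
print(candidate_rule(task["test"][0]["input"]))  # [[0, 3], [0, 0]]
```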
This makes me think about puzzle videogames, specifically those that try to convey ideas without any textual explanation, relying only on a sequence of puzzles that gradually introduce the game's logic. Some of these are called "system-based" puzzles because they build upon a consistent set of rules.
By contrast, the puzzles in typical "brain teaser" magazines rarely deviate from a familiar template; once you've seen one, you know the general idea and just have to "execute". Solving them can also require outside cultural knowledge: Professor Layton, for instance, presents self-contained puzzles that often lean on external knowledge. Sudoku is another example where the rules never change.
Imagine a more freeform puzzle design in which you start with a few axioms (like in mathematics), discovered through simple movement and interaction with the environment; then each level is specifically designed to constrain your view of the world, to force you to see it in a specific way and discover corollaries. Level after level, your model of the world expands. You usually have all the "tools" from the start, but it's not obvious how to use them until you experiment. Interactions aren't gated or unlocked; they are revealed through insight.
Take The Witness, by Jonathan Blow. It's built around questions like: What does it mean to know? Why are we curious? How do we foster epiphanies? The entire game has you drawing lines on panels to solve puzzles, but there's no text explaining the rules and nobody giving you clear goals. You learn solely by trial and error: you attempt a solution, see if it works, then form and refine your hypotheses. As you solve more panels and explore different areas, your understanding grows. Panels that once looked impossible become straightforward once you've internalized the underlying mechanics. They were never complicated; it's not like those games that hand you their rules for free and then simply ask you to repeat a known solution in an obfuscated environment that makes the relevant part of the solution hard to recognize. In these games each level, each idea, should be implemented as simply as possible, in order to communicate its message as clearly as possible.
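That player's loop, guess a rule, test it against what you can observe, refine, maps almost one-to-one onto the most naive way to attack an ARC-style task in code. A deliberately crude sketch of my own (not how o3 or any real ARC entry works): enumerate compositions of a few primitives and keep the first one consistent with all the demonstrations.

```python
from itertools import product

# Naive guess-and-check: search over short compositions of grid primitives.
PRIMITIVES = {
    "flip_h":    lambda g: [row[::-1] for row in g],
    "flip_v":    lambda g: g[::-1],
    "transpose": lambda g: [list(c) for c in zip(*g)],
    "identity":  lambda g: g,
}

def search(train_pairs, max_depth=3):
    # Try every sequence of primitives up to max_depth and return the first
    # one that reproduces all demonstration outputs.
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            def apply(g, names=names):
                for n in names:
                    g = PRIMITIVES[n](g)
                return g
            if all(apply(p["input"]) == p["output"] for p in train_pairs):
                return names, apply
    return None, None

train = [
    {"input": [[1, 2], [3, 4]], "output": [[3, 1], [4, 2]]},  # a clockwise rotation
]
names, rule = search(train)
print(names)                   # ('flip_v', 'transpose') with this primitive ordering
print(rule([[5, 6], [7, 8]]))  # [[7, 5], [8, 6]]
```

Brute force like this collapses as soon as the rule space grows, which is exactly why the interesting question is how to guess well rather than enumerate.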
The missing piece for AI seems to be the ability to engage in a similarly flexible process: look at a new problem, devise or guess a rule, see if it works, refine, and repeat. Reinforcement learning setups, like the one used for DeepSeek-R1, move somewhat in that direction, but true autonomy, and truly general puzzle-solving, will demand some symbolic manipulation and self-guided exploration. That's the sort of System 2 reasoning these tasks (and these games) implicitly reward. The future is exciting; I'm sure we will get there, thanks in part to the valuable feedback from ARC-AGI-2.