Is Pokémon a tough benchmark for artificial intelligence? One team of researchers argues that Super Mario Bros. is tougher. On Friday, Hao AI Lab, a research group at the University of California, San Diego, put AI models into live Super Mario Bros. games. Anthropic's Claude 3.7 performed best, followed by Claude 3.5. Google's Gemini 1.5 Pro and OpenAI's GPT-4o performed poorly.
To be clear, this was not exactly the same version of Super Mario Bros. as the original 1985 release. The game ran in an emulator and was integrated with a framework called GamingAgent, which let the AI models control Mario.
GamingAgent, developed by Hao AI Lab, gives the AI basic instructions, such as "If there is an obstacle or enemy approaching, move left/jump to avoid," along with in-game screenshots. The AI then generates the inputs that control Mario in the form of Python code.
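As a rough illustration of that loop, here is a minimal sketch in Python. It is not GamingAgent's actual code or API. The names `stub_model`, `run_step`, `press`, and `ALLOWED_BUTTONS` are hypothetical, and the model call is replaced by a stub that returns a fixed snippet. Only the instruction text comes from the article.

```python
from typing import Callable, List

# Instruction text quoted in the article; everything else here is hypothetical.
INSTRUCTIONS = "If there is an obstacle or enemy approaching, move left/jump to avoid"
ALLOWED_BUTTONS = {"left", "right", "jump"}


def stub_model(instructions: str, screenshot: bytes) -> str:
    """Stand-in for a real model call: returns Python source as text."""
    return "press('right')\npress('jump')"


def run_step(model: Callable[[str, bytes], str], screenshot: bytes) -> List[str]:
    """Ask the model for code, execute it, and collect the button presses."""
    pressed: List[str] = []

    def press(button: str) -> None:
        if button not in ALLOWED_BUTTONS:
            raise ValueError(f"unknown button: {button}")
        pressed.append(button)

    code = model(INSTRUCTIONS, screenshot)
    # Only `press` is exposed to the generated code. Emptying __builtins__
    # is NOT a real sandbox; untrusted code needs proper isolation.
    exec(code, {"__builtins__": {}, "press": press})
    return pressed


print(run_step(stub_model, b""))
```

In a real setup, the screenshot would come from the emulator and the collected presses would be forwarded to it. Both steps are left out here because the article does not describe them.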
Still, Hao AI Lab says, the game forced each model to "learn" to plan complex maneuvers and develop gameplay strategies. Interestingly, the lab found that reasoning models (such as OpenAI's o1, which "thinks" through a problem step by step to arrive at a solution) performed worse than "non-reasoning" models, even though they are generally stronger on most benchmarks.
Researchers say one of the main reasons reasoning models struggle with real-time games like this is that they take a while, often seconds, to decide on an action. Timing is everything in Super Mario Bros.: a second can mean the difference between a safely cleared jump and a fall into a pit.
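To put that delay in perspective, the short Python sketch below converts decision latency into game frames. It assumes a frame rate of roughly 60 frames per second, which is typical of the original console hardware. The emulator's actual rate, the latencies, and the 30-frame hazard window are illustrative assumptions, not figures from the lab.

```python
def frames_elapsed(latency_s: float, fps: float = 60.0) -> int:
    """Number of game frames that pass while the model is deciding."""
    return round(latency_s * fps)


def acts_in_time(latency_s: float, frames_until_hazard: int, fps: float = 60.0) -> bool:
    """True if the decision arrives before the hazard reaches Mario."""
    return frames_elapsed(latency_s, fps) <= frames_until_hazard


# Hypothetical latencies: a fast model versus one that deliberates for seconds.
for latency in (0.2, 1.0, 3.0):
    print(latency, frames_elapsed(latency), acts_in_time(latency, frames_until_hazard=30))
```

Under these assumptions, a model that answers in 0.2 seconds acts within the 30-frame window, while one that takes a full second or more has already missed it.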
Games have been used as a benchmark for artificial intelligence for decades. But some experts question the wisdom of linking AI's gaming skills to technological advances. Unlike the real world, games tend to be abstract and relatively simple, and they provide a theoretically unlimited amount of data for training AI.
The recent flashy gaming benchmarks point to what Andrej Karpathy, a research scientist and founding member of OpenAI, has called an "evaluation crisis."
"I really don't know what [AI] metrics to look at right now. TLDR, my reaction is that I really don't know how good these models are right now," he wrote in a post on X.
But at least we can watch AI play Mario.