The latest Code Arena leaderboard was released today. Qwen3.7-Max broke into the global top four with a score of 1541, overtaking GPT-5.5, Gemini 3.5 Flash, and other top models in one stroke. The only models still ahead of it are Claude Opus 4.7 and Opus 4.6.



In other words, in the global arena for programming models, Alibaba is the only Chinese vendor on the leaderboard, second only to Anthropic.

Qwen3.7-Max breaks into the top five in the world

The only non-Claude model

In fact, before Code Arena released the leaderboard, Qwen3.7-Max had already made a name for itself among overseas developers.

Atomic Chat ran a head-to-head comparison of Opus 4.7, GPT-5.5, and Qwen3.7-Max. The task was to write a Tetris AI that can train itself.

Qwen3.7-Max came out ahead of both Opus 4.7 and GPT-5.5 at a token cost of only $1.32, and also improved performance by 56%.


Another overseas developer used Qwen3.7-Max to build a 3D model of the universe, and the result is stunning.


In a task to generate a "3D pixel-art-style miniature pagoda model", Qwen3.7-Max also won on both output speed and quality.







Developer Paul Couvert even said that once Qwen3.7-Max is connected to Hermes Agent and OpenCode, it can basically replace GPT-5.5 and Opus 4.7.


Programming is so awesome

However, no matter how high the benchmark score, nothing beats a hands-on test.

We set up a hardcore "racing game" challenge for Qwen3.7-Max.

We gave it a detailed prompt, and after a short wait Qwen3.7-Max output a playable HTML file directly.


There was a small bug in the first version: the A/D steering keys were swapped left and right.

After a second round of simple conversational tweaks, a 3D racing game with a complete experience was ready.


The moment I opened it, to be honest, I was a little shocked.

Four cars race together for three laps on a circuit. More than 100 gold coins are scattered along the track. Hitting an obstacle slows the car down and makes it lose control.

The post-race results panel includes ranking, time, number of gold coins, and fastest lap.

But what is really surprising are two details that only Qwen3.7-Max got right.

One is the start screen. When the four models were tested side by side, it was the only one to build a proper start page for the game, where you click "Start" to enter the race. The other three start running the moment they are opened, without even a title screen.

The other is sound effects. The prompt ended with a bonus request: add sound effects for the engine roar and for picking up gold coins. Of the four models, it was the only one to deliver this bonus, with both the engine sounds and the coin jingles in place.


Now let's look at how the other contenders did.

The visuals from Gemini 3.5 Flash are noticeably flatter and lack a vivid sense of depth.

The UI layout also has problems. The dashboard information is spread across the four corners of the screen, which scatters the visual focus.

In contrast, Qwen3.7-Max concentrates the key indicators in the center of the screen, which is closer to where the player's eyes naturally rest.



The result from Claude Opus 4.6 is hard to put into words.

Not only are there very few gold coins on the track, but the three AI cars also move almost in lockstep, with no randomness at all, as if they had been copied and pasted.

Finally, there is GPT-5.5.

Its visuals are clearly much better than those of the previous two, and the controls are smoother.

But for some reason, the gold coins came out as yellow “donuts”…

Styling is a minor matter. The key point is that Gemini, Claude, and GPT-5.5 all needed several rounds of bug fixes before every feature worked.

Only Qwen3.7-Max's first-round output was basically playable.

The benchmark scores are close, the hands-on results hold up, and the price is only a fraction of the competition's. The rest is up to developers, who will vote with their feet.

The “base” model for the Agent era

Why can Qwen3.7-Max perform at such a high level in the most demanding programming arena? The answer lies in its product positioning.

A few days ago, when Alibaba released Qwen3.7-Max, it gave the model a very particular label: Agent base model.

It is a model designed to perform tasks autonomously for long periods of time.

Internal test data shows that in one autonomous programming task, Qwen3.7-Max ran continuously for 35 hours and executed 1158 tool calls.

The final generated code achieves an astonishing 10x geometric mean speedup compared to the Triton reference implementation.
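For readers unfamiliar with the metric, a geometric mean speedup averages per-kernel speedups multiplicatively, so a single outlier kernel cannot dominate the headline number. Below is a minimal Python sketch of that calculation. The function name and the timings are hypothetical, invented for illustration, and are not data from the internal test.

```python
import math


def geometric_mean_speedup(baseline_times, optimized_times):
    """Geometric mean of per-kernel speedups (baseline time / optimized time)."""
    if len(baseline_times) != len(optimized_times) or not baseline_times:
        raise ValueError("need two equal-length, non-empty lists of timings")
    speedups = [b / o for b, o in zip(baseline_times, optimized_times)]
    # Average in log space: exp(mean(log s_i)) equals (s_1 * ... * s_n) ** (1 / n).
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Hypothetical timings in milliseconds, made up for this example.
baseline_ms = [8.0, 40.0, 100.0]   # reference implementation
optimized_ms = [4.0, 4.0, 2.0]     # generated code

print(geometric_mean_speedup(baseline_ms, optimized_ms))
```

With per-kernel speedups of 2x, 10x, and 50x, the geometric mean comes out at about 10x, whereas a plain arithmetic mean of the same three values would report roughly 20.7x.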


Even more striking is its capacity for a “protracted war”:

Past the 30th hour of reasoning, the model remained sharp and kept exploring new room for optimization.

Zero context degradation, zero instruction drift, and zero infinite loops throughout the entire process!

It has to be said that the hard part here is not the 1,000 tool calls themselves. Since the MCP protocol was released, calling tools 1,000 times is nothing unusual.

The difficulty lies in 35 hours of coherent reasoning.

Most models collapse on long-running tasks: either the context piles up and becomes muddled, so that goals set in the first half are completely forgotten later, or they fall into an infinite loop and keep retrying the same failed solution.

Qwen3.7-Max manages to "keep doing the right thing".

Core technology revealed

As we understand it, Qwen3.7-Max's leap in programming may come down to upgrades in two training methods.

The first is environment expansion.

During Qwen3.7-Max's programming training, each task is split into three independent dimensions: the task itself, the execution framework, and the verification method. The three can be freely combined.

The same task is sometimes run in the Claude Code framework, sometimes in OpenClaw, and sometimes checked with a different verification method.

The effect is like rotating an intern through every project team. What the model is forced to learn is a general problem-solving strategy, not "how to game a specific framework."

This explains a counter-intuitive phenomenon. Qwen3.7-Max performs very consistently across the Claude Code, OpenClaw, and Qwen Code frameworks, with no sign of being "very strong in its own framework but clumsy in any other".
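The three-way split can be pictured as a cross product of tasks, frameworks, and verification methods. The Python sketch below is only an illustration of that idea, not Alibaba's training pipeline. The task and verifier names are hypothetical placeholders, and the framework names are simply the ones mentioned in this article.

```python
import itertools
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingEnvironment:
    task: str      # what has to be solved
    harness: str   # execution framework the model runs inside
    verifier: str  # how the result is checked


# Hypothetical task and verifier names, invented for illustration.
TASKS = ["fix-failing-test", "implement-lru-cache", "optimize-kernel"]
HARNESSES = ["Claude Code", "OpenClaw", "Qwen Code"]
VERIFIERS = ["unit-tests", "output-diff", "benchmark-threshold"]


def all_environments(tasks, harnesses, verifiers):
    """Every (task, harness, verifier) combination."""
    return [
        TrainingEnvironment(t, h, v)
        for t, h, v in itertools.product(tasks, harnesses, verifiers)
    ]


def sample_environment(tasks, harnesses, verifiers, rng):
    """Pick each dimension independently, so no task is tied to one harness."""
    return TrainingEnvironment(
        rng.choice(tasks), rng.choice(harnesses), rng.choice(verifiers)
    )


envs = all_environments(TASKS, HARNESSES, VERIFIERS)
print(len(envs))  # 3 * 3 * 3 = 27 combinations

rng = random.Random(0)  # seeded only to make the example repeatable
print(sample_environment(TASKS, HARNESSES, VERIFIERS, rng))
```

Because each dimension is drawn independently, three tasks already yield 27 distinct environments, which is the sense in which the same task shows up under several frameworks and checks.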


The second upgrade is long-horizon autonomous execution.

During training, the team introduced a "dynamic cumulative survival game" framework.

That is, the model makes more than a thousand consecutive decisions in a continuously changing simulated environment, forming its own hypotheses and adjusting its strategy based on feedback, without suffering "context corruption" from running too long.

One figure makes this concrete. YC-Bench simulates running a startup for a full year. Qwen3.7-Max reached revenue of $2.08 million, about twice that of the previous generation ($1.05 million).

More importantly, it shows how its strategy evolves. It can change direction on its own when it hits a mid-run crisis, identify and block malicious customers, and finally converge on a stable execution loop.


This is what underpins the 35-hour kernel optimization case, and it is why Qwen3.7-Max achieves a speedup in 96% of scenarios on Kernel Bench L3.
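A "96% of scenarios" figure is a different kind of number from a geometric mean: it counts how often the generated code is faster at all, not how much faster. Below is a minimal sketch of that kind of count, assuming a speedup above 1.0 is what qualifies as acceleration. The article does not state the exact threshold or how Kernel Bench scores it, and the function name and sample values are hypothetical.

```python
def speedup_rate(speedups, threshold=1.0):
    """Fraction of tasks whose speedup over the baseline exceeds the threshold."""
    if not speedups:
        raise ValueError("need at least one speedup value")
    faster = sum(1 for s in speedups if s > threshold)
    return faster / len(speedups)


# Hypothetical per-task speedups (baseline time / generated-code time).
speedups = [0.9, 1.2, 3.5, 1.0, 2.0]

print(speedup_rate(speedups))  # 3 of the 5 values exceed 1.0, so this prints 0.6
```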

And programming is only the first battlefield. This foundation of long-horizon reasoning and tool calling points to a bigger ambition: a general-purpose Agent base.

There is one more spoiler in the programming finals

Since its launch, Code Arena has always tested hard skills. Multi-step reasoning, tool orchestration, and complete project delivery are all real Agent-level abilities.

Today, Qwen3.7-Max moved into fourth place with a score of 1541, wedged between Opus 4.6 Thinking and Opus 4.6.

On a track that Claude has dominated for more than half a year, it has given its own answer. Chinese models are not only chasers; they can also set the standard.

The global programming model competition is no longer a one-man show in Silicon Valley.