Artificial intelligence (AI) isn’t ready to replace your fund manager, and a series of public tests illustrates why. In a new round of trading competitions featuring the world’s leading AI models, performance so far has been poor. Most systems suffered losses. They trade too frequently and make completely different decisions when given the exact same instructions. And no one yet knows whether these flaws will disappear as the models are iteratively upgraded, or whether they reveal a fundamental gap between large language models and how markets actually work.
Take Alpha Arena, run by the technology startup Nof1. The platform pits eight major cutting-edge AI systems, including Anthropic’s Claude, Google’s Gemini, OpenAI’s ChatGPT, and Elon Musk’s Grok, against one another across four competitions. Each system was given $10,000 at the start of each contest and then independently traded U.S. technology stocks for two weeks. The challenges included trading on multiple signals, adopting defensive strategies, reacting to competitors’ performance, and operating with high leverage.
Overall, the portfolios ended up losing about one-third of their capital. Across all 32 sets of results, the models turned a profit only 6 times. Grok 4.20 achieved the top result in the challenge that gave models visibility into competitors’ performance, making just 158 trades; Alibaba’s Qianwen made 1,418 trades under the same prompt.
Alpha Arena is just one of a growing number of experiments testing whether large language models can do the most difficult job in finance: beating the market. While the competitions are far from academically rigorous, they are the most public demonstration yet of what happens when these systems take on some of Wall Street’s most lucrative and risky work.
These preliminary results matter because trading is one of the few jobs in finance that the industry remains cautious about handing over entirely to AI. Over the past few years, firms from JPMorgan Chase to Balyasny Asset Management have adopted the technology in almost every other area. Today, large language models parse news at quant shops, draft memos at hedge funds, flag fraud at large banks, and more. But when real money is on the line, “human in the loop” remains the industry creed, and for understandable reasons.
Nof1 founder Jay Azhang said: "The large language model by itself cannot really make money. You basically need a very complex set of constraint frameworks, support systems, and data platforms to give it a chance to play."
He said that large language models are good at doing research, and good at finding and calling the appropriate tools for a given task. But they still can't weigh the importance of the many variables that influence stock price movements, including analyst ratings, insider trading, and shifts in market sentiment. They tend to mistime trades, size positions incorrectly, and buy and sell too frequently.
The AI blog Flat Circle tracked 11 market-related competition platforms, each of which had at least one model achieve profitability. But on only two of those 11 platforms was the median model profitable, indicating that most models struggled to beat the market.
This is consistent with human performance: most actively managed funds are known to underperform the market too. And just like humans, the models exhibit pronounced biases. Multiple competitions have shown that AI systems make very different decisions when given the same instructions, which has significant implications for the institutions deploying them. Azhang gave an example: in the latest round of Alpha Arena, Claude mostly tended to go long, Gemini was not averse to short selling, and Qianwen was more willing to take risks with high leverage.
Doug Clinton, who runs Intelligent Alpha, said: "They have their own 'personality,' and you have to manage them just like you would a human analyst." Results can improve if a model is made aware of a bias it exhibits, he said. Intelligent Alpha runs a fund powered by large language models and publishes its own benchmark of how well AI predicts corporate earnings.
Intelligent Alpha’s benchmark gives 10 AI models access to financial filings, analyst forecasts, earnings call transcripts, macroeconomic data, and up to 10 web searches. Large language models perform better on this test thanks to its narrower focus. In the fourth quarter of 2025, OpenAI's ChatGPT reached 68% accuracy in predicting the direction of earnings changes, the best result to date. These models typically keep improving with each new release, Clinton said.