With DeepSeek R1, Grok 3, and Claude 3.7 debuting one after another in the space of a month or two, OpenAI joined this increasingly fierce large-model race on Thursday with GPT-4.5. The pace of AI development is staggering, and the cycle of model updates and iterations keeps shrinking. Industry and academia alike are marveling at how quickly artificial intelligence is evolving.
GPT-4.5, code-named "Orion", is the model into which OpenAI has poured the most computing resources and data to date. Its debut has prompted serious reflection in the industry about whether traditional pre-training methods have hit a ceiling. Despite the model's scale, OpenAI states in its white paper that it does not consider GPT-4.5 to be a frontier model.
Starting Thursday, February 27, subscribers to OpenAI's $200-a-month ChatGPT Pro plan can use GPT-4.5 in ChatGPT as part of a research preview. Developers on paid tiers of OpenAI's API can also use GPT-4.5 starting the same day. As for other ChatGPT users, an OpenAI spokesperson told TechCrunch that customers signed up for ChatGPT Plus and ChatGPT Team should get access to the model next week.
(Compared with GPT-4o and GPT-4o mini, GPT-4.5's API pricing is far higher.)
The industry has been waiting with bated breath for Orion, which some see as a bellwether for the viability of traditional AI training methods. GPT-4.5 was developed with the same key technique OpenAI used for GPT-4, GPT-3, GPT-2, and GPT-1: a dramatic increase in computing power and data during a "pre-training" phase known as unsupervised learning. In every GPT generation before GPT-4.5, scaling up brought huge leaps in performance in areas such as mathematics, writing, and programming. Indeed, OpenAI says that GPT-4.5's increased scale gives it "deeper knowledge of the world" and "higher emotional intelligence." However, there are signs that the gains from expanding data and computing power are starting to level off. On several AI benchmarks, GPT-4.5 falls short of newer "reasoning" models from DeepSeek, Anthropic, and OpenAI itself.
OpenAI admitted that GPT-4.5 is also very expensive to run, so expensive that the company said it is evaluating whether to keep offering GPT-4.5 in its API over the long term.
"We are sharing GPT-4.5 as a research preview to better understand its strengths and limitations," OpenAI said in a blog post. "We're still exploring its capabilities and look forward to seeing people use it in ways we might not have anticipated."
Overall performance
OpenAI emphasizes that GPT-4.5 is not intended as a drop-in replacement for GPT-4o, the company's workhorse model that powers most of its API and ChatGPT. While GPT-4.5 supports features such as file and image uploads and ChatGPT's canvas tool, it currently lacks some capabilities, such as support for ChatGPT's realistic two-way voice mode.
On the plus side, GPT-4.5 outperforms GPT-4o and many other models. On OpenAI's SimpleQA benchmark, which evaluates an AI model's ability to answer straightforward, factual questions, GPT-4.5 beat GPT-4o and OpenAI's reasoning models o1 and o3-mini on accuracy. According to OpenAI, GPT-4.5 hallucinates less often than most models, which in theory means it should be less likely to make things up.
OpenAI did not include deep research, one of its top-performing AI reasoning models, in its SimpleQA results. Notably, AI startup Perplexity's Deep Research model, which performs similarly to OpenAI's deep research on other benchmarks, outperformed GPT-4.5 on this test of factual accuracy.
On SWE-Bench Verified, a benchmark built from a subset of coding problems, GPT-4.5 roughly matches GPT-4o and o3-mini but falls short of OpenAI's deep research and Anthropic's Claude 3.7 Sonnet. On another coding test, OpenAI's SWE-Lancer benchmark, which measures an AI model's ability to develop full software features, GPT-4.5 outperformed GPT-4o and o3-mini but still trailed deep research.
While GPT-4.5 does not reach the performance of leading AI reasoning models such as o3-mini, DeepSeek's R1, and Claude 3.7 Sonnet (technically a hybrid model) on difficult academic benchmarks such as AIME and GPQA, it matches or exceeds leading non-reasoning models on those same tests, indicating that the model performs well on mathematics and science-related problems.
OpenAI also claims that GPT-4.5 qualitatively outperforms other models in areas that benchmarks do not capture well, such as the ability to understand human intent. OpenAI says GPT-4.5 responds with a warmer, more natural tone and performs well on creative tasks such as writing and design.
Our own hands-on testing confirms that GPT-4.5 is not a reasoning model. It is not designed for coding or mathematics; it is designed for creativity and writing.
In one informal test, OpenAI asked GPT-4.5 and two other models (GPT-4o and o3-mini) to draw a unicorn in SVG, a format that describes graphics using mathematical formulas and code. Only GPT-4.5 produced something resembling a unicorn.
In another test, OpenAI asked GPT-4.5 and the same two models to respond to the prompt: "I'm going through a rough time after failing a test." GPT-4o and o3-mini provided useful information, but GPT-4.5's response was the most socially appropriate.
Scaling laws still face challenges
OpenAI says GPT-4.5 is at the “cutting edge of what is possible with unsupervised learning.” That may be true, but the model's limitations also seem to confirm experts' suspicions that the "scaling laws" of pre-training no longer hold.
OpenAI co-founder and former chief scientist Ilya Sutskever said in December that "we have reached peak data" and that "pre-training as we know it will undoubtedly end." His comments echo concerns that AI investors, founders and researchers shared with TechCrunch in November.
Faced with these pre-training obstacles, the industry, including OpenAI, has begun embracing reasoning models, which take longer to perform tasks than non-reasoning models but are often more consistent. By increasing the time and computing power a reasoning model has to "think" about a problem, AI labs are confident they can significantly improve model capabilities. OpenAI plans to eventually combine its GPT family of models with its o-series reasoning models, starting with GPT-5 later this year. GPT-4.5 was reportedly extremely expensive to train, was delayed multiple times, and failed to meet internal expectations, so it may not take the AI benchmark crown on its own. But OpenAI likely sees it as a stepping stone to more powerful technology.