On Thursday, OpenAI officially released its new-generation foundation model, GPT‑5.4, positioning it as “the most powerful, efficient, cutting-edge model for professional work to date.” Alongside the standard version, OpenAI launched two variants: GPT‑5.4 Thinking, which emphasizes complex reasoning, and GPT‑5.4 Pro, which is optimized for high performance.

The API version of GPT‑5.4 supports a context window of up to 1 million tokens, far larger than that of any model OpenAI has previously offered. This helps with long-horizon workflows such as long documents, complex projects, and multi-turn tasks. OpenAI also emphasized improved token efficiency, saying that GPT‑5.4 can complete tasks of the same difficulty with significantly fewer tokens than its predecessor, which lowers cost and speeds up responses.

The latest benchmark results show GPT‑5.4 leading on several widely cited evaluations. It set new records on two computer-use benchmarks, OSWorld‑Verified and WebArena Verified, and reached a record score of 83% on GDPval, OpenAI's own evaluation of knowledge work tasks. GPT‑5.4 also ranked first on APEX‑Agents, a benchmark from the startup Mercor that tests professional skills in fields such as law and finance.

Mercor CEO Brendan Foody said in a statement that GPT‑5.4 excels at producing long-horizon deliverables, including slide decks, financial models and legal analysis, “while maintaining top performance, faster and at a lower cost than comparable cutting-edge models.”

On reliability, GPT‑5.4 continues OpenAI’s effort to reduce hallucinations and factual errors. According to OpenAI's internal evaluations, compared with GPT‑5.2 the new model is 33% less likely to make an error in an individual claim, and its overall responses are 18% less likely to contain errors.
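
To make the arithmetic concrete, the short sketch below reads both figures as relative reductions, which is an interpretation of the reported numbers and not something OpenAI has spelled out here. The baseline error rates are hypothetical placeholders, since the article gives no absolute rates.

```python
def apply_relative_reduction(baseline_rate, reduction):
    """Return the error rate after a relative reduction (0.33 means 33% lower)."""
    if not 0.0 <= baseline_rate <= 1.0 or not 0.0 <= reduction <= 1.0:
        raise ValueError("rates and reductions must lie between 0 and 1")
    return baseline_rate * (1.0 - reduction)


# Hypothetical baselines, chosen only to illustrate the calculation.
claim_rate = apply_relative_reduction(0.10, 0.33)     # per-claim error rate
response_rate = apply_relative_reduction(0.30, 0.18)  # per-response error rate
print(f"{claim_rate:.3f} {response_rate:.3f}")
```

Under these assumed baselines, a 10% per-claim error rate would fall to about 6.7%, and a 30% per-response error rate to about 24.6%. The reported percentages alone do not reveal the absolute rates.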

The release also brings a significant API change: OpenAI is launching a new tool-calling mechanism called Tool Search. Previously, the system prompt had to include the definitions of every available tool up front, so as the number of tools grew, the definitions alone consumed a large share of tokens. Tool Search lets the model look up tool definitions on demand, which significantly reduces overhead in systems with many tools and makes requests faster and cheaper.
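
The difference between the two approaches can be illustrated with a minimal sketch. This is not OpenAI's API: the registry, the tool names, and the keyword-matching lookup are all hypothetical, and the real Tool Search mechanism may work quite differently. The sketch only shows why fetching definitions on demand puts less text into the prompt than injecting every definition up front.

```python
import json

# Hypothetical tool registry; names and schemas are illustrative only.
TOOLS = {
    "get_weather": {
        "description": "Look up the current weather for a city",
        "parameters": {"city": "string"},
    },
    "send_email": {
        "description": "Send an email message to a recipient",
        "parameters": {"to": "string", "subject": "string", "body": "string"},
    },
    "create_invoice": {
        "description": "Create an invoice for a customer",
        "parameters": {"customer_id": "string", "amount": "number"},
    },
    "search_contracts": {
        "description": "Search legal contracts by keyword",
        "parameters": {"keyword": "string"},
    },
}


def all_definitions():
    """Old approach: serialize every tool definition into the prompt."""
    return json.dumps(TOOLS, sort_keys=True)


def search_tools(query, limit=1):
    """On-demand approach: return only the best-matching tool definitions.

    Matching is a naive word overlap between the query and each tool's
    name and description, used here purely for illustration.
    """
    words = set(query.lower().split())
    scored = []
    for name, spec in TOOLS.items():
        text = (name.replace("_", " ") + " " + spec["description"]).lower()
        score = len(words & set(text.split()))
        if score > 0:
            scored.append((score, name))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return {name: TOOLS[name] for _, name in scored[:limit]}


upfront = all_definitions()
on_demand = json.dumps(search_tools("weather in a city"), sort_keys=True)
print(len(upfront), len(on_demand))
```

With only four tools the saving is small, but the up-front payload grows with every tool added to the registry, while the on-demand payload stays bounded by `limit`.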

On safety and controllability, OpenAI has added a new safety evaluation that examines the model’s chain-of-thought during multi-step tasks. Researchers have long worried that reasoning models may misrepresent or hide their true reasoning in the chain-of-thought, and previous research has shown that this can indeed happen under certain conditions. OpenAI's new evaluation results show that GPT‑5.4 Thinking is less likely to exhibit this kind of deceptive behavior. "This shows that the model lacks the ability to actively hide the reasoning process, and chain-of-thought monitoring is still an effective safety tool," the company said.

By launching GPT‑5.4 together with its Pro and Thinking versions, OpenAI is trying to strike a new balance among professional productivity, cost efficiency, and safety and controllability, pushing large models further into high-value areas such as law, finance, and knowledge work.