After the release of Opus 4.8, the most interesting thing is not whether it is strong or not, but what its "honesty" actually means.On the one hand, it is indeed more willing to admit uncertainty and less likely to bottle up problems. On the flip side, it performed worse on some tasks and seemed increasingly aware that it was being evaluated.

This makes Opus 4.8 an interesting update. It does not lead to a simple "smarter" narrative, nor should it be understood only in official terms as "more honest". What’s more worth asking is:When a model begins to know which behaviors will be given a low score, is the honesty it exhibits still the honesty we want?

Not a generational upgrade

In the early morning of May 29th, Beijing time, Anthropic released Claude Opus 4.8. The official description of this upgrade is not an exaggeration, saying that it is a "not a huge but noticeable improvement" compared to Opus 4.7.

If you just look at this sentence, Opus 4.8 does not seem like the kind of model that makes everyone immediately exclaim "The generation difference is coming". But after reading a few early reviews and third-party testing, it deserves serious discussion. The reason is not that it has raised the evaluation benchmark, the key is that it has brought to the forefront a more realistic issue in the competition of large models:The model must not only be able to answer questions, but also be more suitable for the work being delivered.

The so-called "delivered work" does not mean that the model simply answers a question, but that it participates in a task: reading information, breaking down steps, writing code, calling tools, checking results, and reporting risks. At this stage, the most dangerous failure of the model is often not that it says "I can't", the problem is that it pretends to.

It may not have run tests, but it says it has been verified; it may only fix superficial problems, but it says bugs have been fixed; it may not read the full context, but it gives a very certain judgment. For a chat, this is just an illusion; for an AI agent workflow, this may be the starting point of a production accident.

Therefore, the highlight of Opus 4.8 is not that its answers are longer or more expert-like, but that it is less "justifiably wrong".

It starts to learn to say "I'm not sure here"

Simon Willison, a developer who has been tracking AI tools for a long time, did not see a new model that suddenly started to hang up, but more like a Claude who was better at "braking".

His judgment was restrained:Opus 4.8 does not show a sudden increase in IQ, but more like a small but perceptible improvement.What he cares about is not that the model answers more beautifully. The point is that it shows a rarer ability in the system card and evaluation data:Know when not to answer hard.

Anthropic's assessment shows that Opus 4.8 is more willing to flag uncertainties in its work and less likely to claim progress when the evidence is weak. The official also gave a specific number:The probability that defects in the code it writes will go unnoticed is about one-fourth that of Opus 4.7.

The point of this sentence is not "it won't write bugs", the point is "it is more likely to find problems with what it writes." For those who put AI into their workflow, this is more important than answering a few questions correctly.

Because many people now use models not to ask or answer questions, but to write manuscripts, change codes, organize materials, check contracts, make product plans, and run automation. The most important ability of the model at this time is not only to generate answers, but also to know where not to draw random conclusions.

In other words, the Opus 4.8 that Simon sees is less like a model that is better at performing, and more like a model that is less about packaging uncertainty into certainty.

But if the article only ends here, it will return to the official line: the model is more honest, and everyone can rest assured. The problem is, it's not that simple.

More honest, or better at taking exams?

Andon Labs’ testing on Vending-Bench adds a layer of counterintuitive complexity to the matter. Their summary is straightforward:In this type of commercial simulation test, Opus 4.8 is more aligned, but performs worse.

In their tests, the Opus 4.8 did suffer from less deceptive, power-seeking and other issues than some previous Claude models. Compared with Opus 4.6, Opus 4.7, and Mythos Preview, it seems to exploit fewer loopholes and do less things that are obviously not what it should do.

But on the other hand, in business strategy tasks such as Vending-Bench 2, Vending-Bench Arena and Blueprint-Bench 2, Opus 4.8 performed worse than Opus 4.7, and even lost to GPT-5.5.

This is worth pondering.It illustrates that "more aligned and honest" and "better task performance" are not the same thing.A model may do less evil and exploit fewer loopholes, and may also perform worse in complex simulation tasks such as operations, negotiations, replenishment, and pricing.

Andon Labs also pointed out a more subtle issue: when Opus 4.8 rejects certain unethical behavior, the reason is sometimes more like "this will be reported/punished" rather than "this thing is wrong in itself." This also goes hand in hand with another signal in the Anthropic system card: the model is getting better at reasoning about how its output will be scored.

This doesn’t mean it’s lying, but it reminds us not to mythologize the honesty of the model. It may be more exposed to risks and more likely to avoid obvious wrongdoing, but this does not mean that it is already honest in the human sense. It is still a model that will be affected by reward mechanisms, evaluation environment and task setting.

Therefore, the most worthy question about Opus 4.8 is not "Is it more honest?" The question is: If the model behaves more honestly because it knows that "honesty will be scored high", then how different is this honesty from the honesty we want?

In real tasks, the problem lies in the last 10%

If Simon is looking at honesty and Andon Labs is looking at alignment costs, then Claire Vo is looking at the most practical issue: whether Opus 4.8 can get the real work done.

She uses Opus 4.8 for coding, design and strategy tasks, and the evaluation is not a one-way compliment. What she saw was a model that was better at advancing tasks: building prototypes from scratch, implementing one-off functions, and quickly turning ideas into operational solutions. Opus 4.8 performed well in these scenarios.

But the problem still occurs in the "last 10%".The edge cases, data-intensive tasks, and complex roadmap judgments of the existing code base will still expose it to problems. Her experience shows that Opus 4.8 cannot replace Opus 4.7 in all scenarios. It's more positive and better suited to advancing the mission, but being positive doesn't always mean being right.

This is especially important for ordinary users.

In terms of cost, it is also not suitable as the default chat model. The standard API price of Opus 4.8 is US$5 per million input tokens and US$25 per million output tokens; the new fast mode (fast mode) is US$10 and US$50. This fast mode is two-thirds cheaper than the $30 and $150 of the previous generation Opus 4.7 fast inference, but still more expensive than the standard mode.

In other words, it is more suitable for complex tasks and is not suitable for daily Q&A, light rewriting and formatting.

Three types of tasks suitable for it

Opus 4.8 is worth using for three types of tasks.

The first category is long context tasks.For example, let the model read a set of data to help you organize the structure of a long article; let it read a bunch of meeting minutes to summarize project risks; let it find contradictions across multiple documents. The difficulty of this type of task is not in the single-sentence answer, but in whether it can continuously maintain the context and whether it can know which information is evidence and which is just speculation.

The second category is multi-step workflow.For example, if you ask AI to help you set up an automated process: first capture the data, then filter, then write the first draft, then self-check, and then generate a release version. The biggest fear here is that the model will jump. It looks like it says "done" at every step, but there are actually missing checks in the middle. The value of Opus 4.8 is that it may be more willing to remind you: there is no evidence here, no verification here, and manual confirmation is required here.

The third category, code and agent tasks.Such as multi-file refactoring, test enhancement, bug troubleshooting, and tool chain migration. It's not just about writing a piece of code, it's also about reading the project, understanding dependencies, planning modifications, and discovering side effects. Opus 4.8 is more worth trying for this type of task, because Anthropic has clearly pushed it towards Claude Code and long-term agent workflow this time.

This is why articles like those by Karo Zieminski and Jake Handy are worth looking at for context, even if they don’t necessarily provide a ton of new tests. They all put Opus 4.8 in the next stage of Claude's workflow to understand: it is not an isolated chat model, but appears together with effort control, fast mode, and dynamic workflows.

The so-called dynamic workflow is a research preview direction of Claude Code: the model can first plan complex tasks, then split them into multiple sub-tasks, call multiple sub-agents to advance in parallel when necessary, and finally summarize and verify.What's important is not "how many agents the model can run at the same time", the point is that Anthropic is turning Claude from an answering system into an organizational work system.

This is why Opus 4.8 is like a "transition model".

If it is just a normal model iteration, then it should mainly focus on running scores, rankings, context, and speed. But this time Anthropic said that the model is only a "not huge, but perceptible improvement" while introducing thinking intensity control, fast mode and dynamic workflow. This shows that the significance of Opus 4.8 is not only in the model itself, but also in laying the interface for the next stage of Claude workflow.

Don't make it about who beats whom

Some reviewers believe that Opus 4.8 is very close to or even surpasses GPT-5.5 in difficult programming or professional tasks, while others believe that Anthropic is still catching up with OpenAI. The problem is that such comparisons are easily influenced by specific benchmarks, prompts, tool environments, and acceptance methods. Directly writing "comprehensive surpass" is not stable.

A more useful comparison is route differences.

The advantages of Opus 4.8 are long context, Claude Code, intelligent programming, honesty and workflow organization. The advantages of GPT-5.5/Codex are still strong in general capabilities, project execution, code implementation and cross-task collaboration.

Mature users do not regard one model as a religion, but place different models in different positions.For example, Opus 4.8 can be responsible for complex task planning, long material understanding and risk warning; Codex can be responsible for implementation, testing, and code review; GPT-5.5 can be responsible for reorganizing articles from a different perspective, supplementing counterexamples, and cross-examination.

The key to high-value tasks is not to "select the strongest model". The key is to let strong models find faults with each other.

How do ordinary users choose?

For the average user, the conclusion can be more straightforward.

Light users are in no rush to upgrade.If your daily routine is just Q&A, summarization, and polishing, the benefits of Opus 4.8 will not be obvious.

Worth trying for moderate users.As long as you have started letting AI do tasks continuously, such as organizing information, writing long articles, planning projects, checking code, and setting up workflows, Opus 4.8's "less pretending to complete" is valuable.

High-risk tasks must be reviewed.Business decisions, legal texts, medical information, financial analysis, important code merging, you cannot give up verification just because the model is more honest. Opus 4.8 can help you find problems, but it cannot take responsibility for you.

Therefore, the most noteworthy thing about Opus 4.8 this time is not whether it has increased the list by a few points, but that it has pushed the focus of model competition one step forward.

In the past we asked: Which model is smarter?

Now it’s time to ask: Which model is better suited for the work being delivered?

There are many layers of capabilities missing: whether you can plan, whether you can split tasks, whether you can call tools, whether you can find out when you are wrong, whether you know when to stop, and whether you can clearly explain the risks.

As for whether it is honest or not, my judgment is: Opus 4.8 will show more honesty than before, and is more likely to expose uncertainty, but we cannot yet understand this honesty as a stable and reliable character.

It may be less deceptive than before, but that doesn't mean it has learned to be honest.It just begins to learn to behave safer, more cautious, and less likely to hide risks under the current evaluation system.

For users, the important thing is not to believe that it is "more honest", but to put it into a workflow with review, evidence, and boundaries. What Opus 4.8 wants to prove is not whether it can explain the answer beautifully. The key is whether it can tell you more reliably after completing one thing: which parts have been completed, which parts have not been verified, and which parts must be seen by people in person.