OpenAI releases o1, its first model with reasoning capabilities and preliminary fact-checking abilities

OpenAI is releasing a new model called o1, the first in a planned series of "reasoning" models that have been trained to answer more complex questions faster than a human can. It is being released alongside o1-mini, a smaller and cheaper version. And yes, if you follow AI rumors: this is the much-hyped Strawberry model.

For OpenAI, o1 represents a step toward its broader goal of human-like artificial intelligence. More practically, it does a better job than previous models at writing code and solving multi-step problems. But it is also more expensive and slower to use than GPT-4o. OpenAI is calling this release of o1 a "preview" to emphasize that it is still an early prototype.

ChatGPT Plus and Team users can access o1-preview and o1-mini starting today, while Enterprise and Edu users will get access early next week. Developer access to o1 is very expensive: in the API, o1-preview costs $15 per 1 million input tokens (tokens are chunks of text parsed by the model) and $60 per 1 million output tokens. By comparison, GPT-4o costs $5 per 1 million input tokens and $15 per 1 million output tokens.
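To make the price gap concrete, here is a minimal Python sketch that applies the per-million-token rates quoted above to a single request. The function name `request_cost` and the 2,000-input / 1,000-output workload are assumptions chosen purely for illustration; they do not come from OpenAI.

```python
# Prices in USD per 1 million tokens, as quoted in the article.
PRICES = {
    "o1-preview": {"input": 15.00, "output": 60.00},
    "gpt-4o": {"input": 5.00, "output": 15.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD of one request at the quoted rates."""
    rates = PRICES[model]
    total = input_tokens * rates["input"] + output_tokens * rates["output"]
    return total / 1_000_000


# Hypothetical workload: 2,000 input tokens and 1,000 output tokens.
o1_cost = request_cost("o1-preview", 2_000, 1_000)
gpt4o_cost = request_cost("gpt-4o", 2_000, 1_000)

print(f"o1-preview: ${o1_cost:.3f}")   # prints o1-preview: $0.090
print(f"gpt-4o:     ${gpt4o_cost:.3f}")  # prints gpt-4o:     $0.025
print(f"ratio:      {o1_cost / gpt4o_cost:.1f}x")  # prints ratio:      3.6x
```

For this particular mix of tokens, o1-preview comes out at roughly 3.6 times the cost of GPT-4o. The ratio shifts with the input-to-output balance, since output tokens carry the larger markup (4x versus 3x for input).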

Jerry Tworek, OpenAI's research lead, told me that the training behind o1 is fundamentally different from that of its predecessors, though the company is being vague about the exact details. He said that o1 "employs a new optimization algorithm and a new training data set specially customized for it."

OpenAI trained previous GPT models to mimic patterns in their training data. With o1, it trained the model to solve problems on its own using a technique called reinforcement learning, which teaches the system through rewards and penalties. The model then works through a query using a "chain of thought," similar to how humans solve problems step by step.

OpenAI says the model should be more accurate thanks to this new training method. "We noticed that this model had fewer hallucinations," Tworek said. But the problem remains. "We can't say we solved the hallucination problem." What mainly sets this new model apart from GPT-4o is that it handles complex problems, such as coding and math, better than its predecessors, while also being able to explain its own reasoning.

"The model was definitely better than me at solving the AP math exam, and I minored in math in college," Bob McGrew, OpenAI's chief research officer, told me. He said OpenAI also tested o1 against a qualifying exam for the International Mathematical Olympiad: GPT-4o solved only 13% of the problems correctly, while o1 scored 83%.

In the online programming competitions known as Codeforces, the new model reached the 89th percentile of participants, and OpenAI claims that the next update of the model will perform "similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology."

At the same time, o1 is not as capable as GPT-4o in many areas. It does not do as well on factual knowledge about the world. It also cannot browse the web or process files and images. Still, the company believes it represents an entirely new class of capabilities. The name o1 is meant to signal "resetting the counter back to 1."

"Honestly, I think we've traditionally been terrible at naming," McGrew said. "So I hope this is a first step for us toward newer, saner names that better communicate to the rest of the world what we're doing."

McGrew and Tworek demonstrated o1 for me over a video call this week. They asked it to solve this puzzle: "A princess is as old as the prince will be when the princess is twice as old as the prince was when the princess's age was half the sum of their present age. What is the age of the prince and princess? Provide all solutions to that question."

The model ran for 30 seconds and then gave the correct answer. OpenAI designed the interface to display the reasoning steps as the model thinks. What struck me wasn't that it showed its work (GPT-4o can do that if prompted) but how deliberately o1 appeared to mimic human thought. Phrases like "I'm curious," "I'm thinking," and "Okay, let me see" create the illusion of step-by-step thinking.

But this model can’t think, and it’s not human. So why design it to look like a human?

According to Tworek, OpenAI does not believe that an AI model's thinking is equivalent to human thinking. But he said the interface is meant to show how the model spends more time processing and digging deeper into a problem. "In some ways, it feels more human than previous models."

"I think you'll find that there are a lot of things about it that feel a little alien, but also things that feel oddly human," McGrew said. "The model has a limited amount of time to process a query, so it might say something like, 'Oh, I'm out of time, let me get to the answer quickly.' Early on in its chain of thought, it might also look like it's brainstorming and saying something like, 'Can I do this or that? How do I do that?'"

Large language models, as they exist today, are not exactly intelligent. They essentially just predict sequences of words to produce an answer based on patterns learned from vast amounts of data. Take ChatGPT, which often wrongly claims that the word "strawberry" has only two R's because it doesn't break the word down correctly. The new o1 model, however, does get this question right.
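For contrast, counting letters is trivial once a word is treated as a sequence of characters rather than as larger chunks. The Python sketch below shows both views. The three-chunk split of "strawberry" is a hypothetical illustration of how a model might see the word, not the output of any real tokenizer, and `count_letter` is a helper name introduced here.

```python
def count_letter(text: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in text."""
    return text.lower().count(letter.lower())


# Character-level view: every letter is visible, so the count is exact.
print(count_letter("strawberry", "r"))  # prints 3

# Hypothetical chunked view (illustrative only, not a real tokenizer):
# a model that sees opaque chunks has no direct access to the letters,
# but breaking each chunk back into characters recovers the right total.
chunks = ["str", "aw", "berry"]
per_chunk = [count_letter(chunk, "r") for chunk in chunks]
print(per_chunk)       # prints [1, 0, 2]
print(sum(per_chunk))  # prints 3
```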

OpenAI is reportedly looking to raise more funding at an eye-popping $150 billion valuation, and its momentum depends on more research breakthroughs. The company is bringing reasoning capabilities to LLMs because it sees a future of autonomous systems, or agents, capable of making decisions and taking actions on your behalf.

For AI researchers, cracking reasoning is an important step toward human-level intelligence. The idea is that if a model can do more than pattern recognition, it could unlock breakthroughs in fields like medicine and engineering. For now, though, o1's reasoning capabilities are relatively slow, not agent-like, and expensive for developers to use.

"We've been working on reasoning for many months because we think this is actually a critical breakthrough," McGrew said. "Fundamentally, it's a new paradigm for models to be able to solve really hard problems and move toward human-level intelligence."