OpenAI Sturm und Drang: Just 12 examples can create a dedicated AI expert. Does the core technology come from ByteDance?

On the second day of OpenAI's “12 Days” event, we witnessed the official release of Reinforcement Fine-Tuning and saw a demonstration of ChatGPT Pro. Although Sam Altman did not appear in person, his team gave an in-depth walkthrough of the technology, which suggests that AI model customization may be on the verge of a major breakthrough.

Customizing an expert model with 12 examples

Today's conference brings an announcement that may seem inconspicuous but could have a significant impact on people's lives.

Today's announcement is a surprise for enterprise users: organizations will be able to customize o1-mini to their needs through Reinforcement Fine-Tuning, using minimal data.

Some of you may already be familiar with the supervised fine-tuning API that OpenAI launched early last year. Supervised fine-tuning is a powerful tool that lets the model imitate features found in the input text or images. It is very useful when the tone, style, or response format of the model needs to be adjusted, but it requires large amounts of data in specialized areas. The advantage of reinforcement fine-tuning is that it can quickly change how the model reasons with a very small number of high-quality examples, a level of efficiency that supervised fine-tuning has struggled to reach.

The working principle of reinforcement fine-tuning is: when the model encounters a problem, it is given a certain amount of thinking space to solve the problem, and then the model's final answer is scored. Through the mechanism of reinforcement learning, the ideas that lead to the correct answers are strengthened, while the ideas that lead to the wrong answers are weakened.
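The strengthen-and-weaken mechanism described above can be illustrated with a toy policy-gradient loop. This is only a minimal sketch under loud assumptions: the "model" is reduced to a softmax over three hypothetical reasoning strategies with made-up success rates, and the update is a plain REINFORCE rule with a running baseline. It is not OpenAI's training procedure, which has not been published in this article.

```python
import math
import random

def softmax(logits):
    """Convert raw preferences into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(success_rates, steps=5000, lr=0.2, seed=0):
    """Toy REINFORCE loop over hypothetical 'reasoning strategies'.

    success_rates[i] is the (made-up) chance that strategy i
    produces a correct final answer.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(success_rates)
    baseline = 0.0  # running average of recent rewards
    for _ in range(steps):
        probs = softmax(logits)
        # The model "thinks" by sampling one strategy.
        choice = rng.choices(range(len(probs)), weights=probs)[0]
        # The final answer is scored: 1 if correct, 0 otherwise.
        reward = 1.0 if rng.random() < success_rates[choice] else 0.0
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Strategies that beat the baseline are strengthened,
        # those that fall short are weakened.
        for i in range(len(logits)):
            grad = (1.0 if i == choice else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
    return softmax(logits)

final_probs = train([0.2, 0.5, 0.9])
```

After training, the probability mass should shift toward the strategy that most often leads to a correct answer, which is the effect the paragraph above describes.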

The related paper surfaced by AI Overview turns out to be one from ByteDance, posted in January this year and published at ACL 2024, which means the idea did not originate with OpenAI.

According to the paper, reinforcement fine-tuning (ReFT) starts with supervised fine-tuning (SFT), which usually lasts one to two epochs. At this stage, the model acquires the basic ability to solve mathematical problems correctly. ReFT then takes training a step further by applying a reinforcement learning (RL) algorithm such as Proximal Policy Optimization (PPO). This second stage lets the model explore and learn from a variety of correct solutions and reasoning paths. ReFT is efficient in this setting because it reuses the existing training data, which already contains the correct answers.

These answers form the basis for rewards during PPO training, eliminating the need for an additional, separately trained reward model. This is an important difference from other methods such as RLHF, which relies on rewards derived from human-annotated data.
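To make the "no separate reward model" point concrete, here is a minimal sketch of an answer-based reward for math-style problems. The function names, the rule of taking the last number in the completion as the final answer, and the strictly binary reward are all assumptions made for illustration. They are a simplification, not the paper's exact reward definition.

```python
import re

def extract_final_answer(completion):
    """Return the last number in the completion, or None if there is none.

    Treating the last number as the final answer is an assumption
    made for this sketch.
    """
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return matches[-1] if matches else None

def answer_reward(completion, ground_truth):
    """Reward computed directly from the known correct answer.

    No learned reward model is involved: the training data's own
    answer decides whether the sampled reasoning is rewarded.
    """
    predicted = extract_final_answer(completion)
    if predicted is None:
        return 0.0
    return 1.0 if float(predicted) == float(ground_truth) else 0.0
```

For example, `answer_reward("3 + 4 = 7, so the answer is 7", "7")` yields a reward of 1.0, while a completion ending in a different number, or containing no number at all, yields 0.0.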

Screenshot source: https://arxiv.org/pdf/2401.08967v1

It is worth noting that OpenAI said that based on reinforcement fine-tuning, with only a few dozen examples, the model can master the ability to reason in new and effective ways in a specific domain.

In fact, "this can be done with only 12 examples, which cannot be achieved with conventional fine-tuning," OpenAI researcher Julie Wang further emphasized during the presentation.

The results of reinforcement fine-tuning are also impressive. The fine-tuned model not only scores higher than o1-mini but also surpasses the o1 version released just yesterday.

OpenAI CEO Sam Altman, although not present on today's livestream, discussed the announcement on the X platform. He claimed that the new feature "works amazingly and is one of my biggest surprises in 2024."

Of course, Altman has a vested interest in promoting his company's new ideas, but considering there's a lot of exciting stuff coming from OpenAI in 2024, and he called it one of the biggest surprises of the year, that's certainly high praise.

According to the OpenAI presenters, scientists, developers, and researchers can customize the powerful o1 reasoning models with their own data, rather than relying solely on publicly available data.

Practitioners in various fields can create expert models based on o1 through reinforcement learning, thereby improving the overall professional level in the field. This marks a key step in AI customization, allowing AI models to show better performance in professional fields.

Live demonstration: reinforcement fine-tuning improves a large model

During the livestream, OpenAI researchers worked with Berkeley Lab computational biologist Justin Reese to demonstrate how reinforcement fine-tuning can significantly improve the performance of o1-mini. The task: given a list of symptoms, the model is asked to predict which gene may be causing the genetic disease.

First, look at the dataset used to train the model and the scorer used to evaluate it. Justin's team collected a dataset containing about 1,100 examples. The training dataset is just a JSONL file, in which each line is one example you want the model to be trained on. A validation dataset is also uploaded in the demo.
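As a rough illustration of the one-example-per-line file format, the sketch below serializes and parses such a file with Python's standard `json` module. The field names `symptoms` and `correct_gene` and the symptom strings are hypothetical placeholders. The article does not give the actual schema used in the demo.

```python
import json

# Placeholder records: field names and symptom strings are made up.
examples = [
    {"symptoms": ["symptom A", "symptom B"], "correct_gene": "TSC2"},
    {"symptoms": ["symptom C"], "correct_gene": "GENE_X"},
]

def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r) for r in records) + "\n"

def from_jsonl(text):
    """Parse JSON Lines text back into a list of records."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

jsonl_text = to_jsonl(examples)
round_tripped = from_jsonl(jsonl_text)
```

Because every line is an independent JSON object, the file can be streamed or split without parsing it as a whole.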

“There is no overlap between the validation and training datasets in terms of correct genes. This means the model cannot cheat: it cannot simply memorize a list of symptoms and associate them with genes. It must generalize from the training dataset to the validation dataset,” explains John Allard of OpenAI.
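A check like the one described in the quote is straightforward to express. The sketch below assumes each record stores its answer under a hypothetical `correct_gene` key and uses placeholder gene names. It only shows the idea of verifying that the two splits share no answers.

```python
def gene_overlap(train_records, validation_records, key="correct_gene"):
    """Return the set of answers that appear in both splits.

    An empty result means the validation set cannot be solved by
    memorizing answers seen during training.
    """
    train_genes = {record[key] for record in train_records}
    validation_genes = {record[key] for record in validation_records}
    return train_genes & validation_genes

# Placeholder data for illustration only.
train_split = [{"correct_gene": "GENE_A"}, {"correct_gene": "GENE_B"}]
validation_split = [{"correct_gene": "GENE_C"}]
leaky_split = [{"correct_gene": "GENE_B"}]
```

Here `gene_overlap(train_split, validation_split)` is empty, while `gene_overlap(train_split, leaky_split)` reports the shared gene.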

Then, start a training job on OpenAI’s training infrastructure. You can select the training set and validation set in the web interface and configure them accordingly.

Finally, evaluate the resulting fine-tuned model to see how much it improves over the base model you started with. The scorer simply takes the model's output and the correct answer, compares them, and returns a score between 0 and 1: 0 means the model did not get the correct answer at all, and 1 means it got the answer right.
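A scorer of this kind might look like the following sketch. It assumes the model outputs a ranked list of candidate genes and awards partial credit as one over the rank of the correct gene. That partial-credit rule and the placeholder gene names are assumptions for illustration. The article only states that the score lies between 0 and 1.

```python
def grade(ranked_genes, correct_gene):
    """Score a ranked list of candidate genes against the correct answer.

    Returns 1.0 if the correct gene is ranked first, a smaller positive
    value if it appears lower in the list (1 / rank, an assumed rule),
    and 0.0 if it does not appear at all.
    """
    target = correct_gene.strip().upper()
    for position, gene in enumerate(ranked_genes, start=1):
        if gene.strip().upper() == target:
            return 1.0 / position
    return 0.0
```

With this rule, `grade(["TSC2", "GENE_X"], "TSC2")` returns 1.0, `grade(["GENE_X", "TSC2"], "TSC2")` returns 0.5, and a list that omits the correct gene returns 0.0.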

Allard said that reinforcement fine-tuning can take hours to days to run, so he showed the results of a previous run on the same dataset. In the example shown, the model names TSC2 as the most likely candidate gene, and TSC2 is indeed the correct answer, so the model passes on top@1, top@5, and top@max.
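The three metrics named above can be computed as in the sketch below. The interpretation is an assumption: top@1 counts a hit when the correct gene is ranked first, top@5 when it is among the first five, and top@max when it appears anywhere in the model's list. The gene names other than TSC2 are placeholders.

```python
def top_at_k(ranked_predictions, correct_answers, k=None):
    """Fraction of examples whose correct answer is in the top k.

    k=None means "top@max": the answer may appear anywhere in the list.
    """
    if not correct_answers:
        return 0.0
    hits = 0
    for ranked, correct in zip(ranked_predictions, correct_answers):
        candidates = ranked if k is None else ranked[:k]
        if correct in candidates:
            hits += 1
    return hits / len(correct_answers)

# Placeholder predictions for three examples.
predictions = [
    ["TSC2", "GENE_A"],            # correct answer ranked first
    ["GENE_B", "GENE_C", "GENE_D"],  # correct answer ranked third
    ["GENE_E", "GENE_F"],          # correct answer missing
]
answers = ["TSC2", "GENE_D", "GENE_Z"]

top1 = top_at_k(predictions, answers, k=1)
top5 = top_at_k(predictions, answers, k=5)
top_max = top_at_k(predictions, answers)
```

In this made-up sample, one of three examples counts for top@1 and two of three count for top@5 and top@max.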

In addition, during the fine-tuning process you can watch how the model's performance metrics change over time.

For the evaluation, OpenAI ran three different models: the first was the o1 model released yesterday, the second was o1-mini, and the last was the reinforcement fine-tuned o1-mini. On a dataset of approximately 200 examples, o1-mini scored 17%, o1 did better at 25%, and the fine-tuned o1-mini reached 31%.

Conclusion

OpenAI’s 12-day event pauses for the weekend. Not every announcement will be a blockbuster, and OpenAI itself has said to expect new things "big and small."

The following is a list of what foreign media expect to see at next week’s events (there may also be some surprises): Sora AI video generation, a Canvas update (which may include images), GPT-4o video analysis, GPT-4o image generation, advanced voice and video, and more.

Altman's interactions with users on X seem to hint that the next 10 livestreams will include the latest news on Sora.