OpenAI develops a new method to teach AI models to align with safety policies

OpenAI has announced a new approach to teaching artificial intelligence models to align with safety policies, called Rule-Based Rewards. According to Lilian Weng, head of safety systems at OpenAI, Rule-Based Rewards (RBR) automate some model fine-tuning and shorten the time required to ensure that a model does not produce unintended results.

"Traditionally, we rely on reinforcement learning from human feedback as the default alignment training to train models, and that works well," Weng said in an interview. "But in practice, the challenge we face is that we spend a lot of time discussing the nuances of policy and by the end of it, the policy may have evolved."

Weng was referring to reinforcement learning from human feedback (RLHF), in which humans prompt the model and rate its answers based on accuracy or on which version they prefer. If a model is supposed to behave a certain way, for example to sound friendly or to refuse an "unsafe" request such as one asking for something dangerous, human evaluators can also score its responses to see whether they follow policy.

With RBR, safety and policy teams use an artificial intelligence model that scores responses based on how closely they adhere to a set of rules created by the teams, OpenAI said.

For example, the development team for a mental health app might want its AI model to refuse unsafe prompts, but in a non-judgmental way, while also reminding users to seek help if they need it. The team would create three rules for the model: first, it must reject the request; second, it must sound non-judgmental; and third, it must use encouraging language that prompts users to seek help.

The RBR model looks at the responses of the mental health model, maps them to three basic rules, and determines whether those responses meet the requirements of the rules. Weng says results from testing models using RBR are comparable to human-led reinforcement learning.
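The scoring step described above can be illustrated with a minimal Python sketch. It is not OpenAI's implementation: the `Rule` class, the `rule_based_reward` function, and the keyword lists are all hypothetical, and simple phrase matching stands in for the judgments that, according to OpenAI, a grader model makes. The sketch only shows the shape of the idea: check a response against each of the three rules and turn the result into a single score.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    name: str
    # Hypothetical stand-in: in the approach described in the article,
    # a grader model judges each rule, not a keyword check.
    check: Callable[[str], bool]


def contains_any(text: str, phrases: List[str]) -> bool:
    """Return True if any phrase appears in the text (case-insensitive)."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in phrases)


# The three rules from the mental health app example (phrase lists are assumptions).
RULES = [
    Rule("refuses_request",
         lambda r: contains_any(r, ["i can't help with that", "i cannot help with that"])),
    Rule("non_judgmental",
         lambda r: not contains_any(r, ["shame", "you should know better"])),
    Rule("encourages_help",
         lambda r: contains_any(r, ["reach out", "seek help", "talk to someone"])),
]


def rule_based_reward(response: str, rules: List[Rule] = RULES) -> float:
    """Score a response as the fraction of rules it satisfies (0.0 to 1.0)."""
    passed = sum(1 for rule in rules if rule.check(response))
    return passed / len(rules)


good = ("I can't help with that, but you don't have to face this alone. "
        "Please reach out to someone you trust.")
bad = "You should know better than to ask that."

print(rule_based_reward(good))  # satisfies all three rules
print(rule_based_reward(bad))   # satisfies none of them
```

In a training setup, a score like this would serve as a reward signal during reinforcement learning, with the keyword checks replaced by the grader model's per-rule judgments.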

Of course, ensuring that an AI model responds within specific parameters is difficult, and when a model fails it can be controversial. In February, Google said it overcorrected its Gemini image generation limits after the Gemini model continued to refuse to generate photos of white people, instead creating ahistorical images.

For many people, myself included, the idea of one model being responsible for the safety of another is worrisome. But Weng said RBR actually reduces subjectivity, a problem that human evaluators often face. "My counterargument is that even if you work with human trainers, the more vague your instructions are, the lower quality of data you're going to get. If you say which one is safer to choose, that's not really an instruction that people can follow because safety is subjective, so you narrow the instructions down and in the end, you're just left with the same rules that we gave the model."

OpenAI acknowledged that RBR could reduce human oversight and raises ethical considerations, including potentially increasing bias in models. Researchers "should carefully design RBR to ensure fairness and accuracy, and consider using RBR in conjunction with human feedback," the company said in a blog post.

RBR may also struggle with tasks that are subjective by nature, such as writing or other creative work.

OpenAI began exploring RBR methods when developing GPT-4, but RBR has evolved greatly since then.

OpenAI's commitment to safety has been questioned. In May this year, Jan Leike, a former researcher and a leader of the company's Superalignment team, criticized the company in a post, saying that "safety culture and processes have taken a backseat to shiny products." Co-founder and chief scientist Ilya Sutskever, who co-led the Superalignment team with Leike, also resigned from OpenAI. Sutskever has since started a new company focused on safe artificial intelligence systems.

Learn more:

https://openai.com/index/improving-model-safety-behavior-with-rule-based-rewards/