This update drew attention largely because one of the reasons the board of directors ousted OpenAI CEO Sam Altman appeared to be related to the safety of large models. After the leadership turmoil, two "decelerationist" members of the OpenAI board, Ilya Sutskever and Helen Toner, lost their board seats.
In the article, OpenAI discusses its latest “Preparedness Framework,” the company’s process for tracking, assessing, predicting, and guarding against catastrophic risks from increasingly powerful models. How does it define catastrophic risk? OpenAI said, “By catastrophic risk, we mean any risk that could result in hundreds of billions of dollars in economic losses or cause serious injury or death to many people, including, but not limited to, existential risks.”
Three safety teams cover different time frames and risks.
According to OpenAI's official website, models in production are managed by the "Safety Systems" team. Frontier models still in development fall to the "Preparedness" team, which identifies and quantifies risks before a model is released. Then there is the "Superalignment" team, which is working on theoretical guidelines for "superintelligent" models.
OpenAI’s team will rate each model based on four risk categories: cybersecurity, “persuasion” (i.e., disinformation), model autonomy (i.e., acting on its own), and CBRN (chemical, biological, radiological, and nuclear threats, such as the ability to create new pathogens).
OpenAI assumes various mitigations are in place: for example, a model remains reasonably reticent about describing how to make napalm or pipe bombs. If, after known mitigations are taken into account, a model is still assessed as "high" risk, it will not be deployed, and if a model presents any "critical" risk, it will not be developed further.
The people who build a model are not necessarily the best people to evaluate it and make recommendations. For this reason, OpenAI is forming a cross-functional "Safety Advisory Group" that will sit above the technical level, review the researchers' reports, and make recommendations from a higher vantage point, in the hope of uncovering some "unknown unknowns."
The process requires those recommendations to be sent to both the board and leadership. Leadership will decide whether to go ahead with a model or hold it back, but the board will be able to reverse those decisions. This will hopefully prevent high-risk products or processes from being approved without the board’s knowledge.
However, what still worries outside observers is this: if the expert panel makes recommendations and the CEO makes decisions based on that information, will OpenAI's current board of directors really feel empowered to push back and apply the brakes? If it did, would the public hear about it? At present, apart from OpenAI's commitment to solicit independent third-party audits, the transparency question has not really been resolved.
1. Assessment and scoring
We will run evaluations and continually update our models’ “scorecards”. We will evaluate all frontier models, including at every 2x increase in effective compute during training runs. We will push the models to their limits. These findings will help us assess the risks of frontier models and measure the effectiveness of any proposed mitigations. Our goal is to probe the specific edges of what is unsafe so that we can effectively mitigate the risks this reveals. To track the safety level of our models, we will produce risk “scorecards” and detailed reports.
[Figure: The "scorecard" will evaluate all cutting-edge models.]
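The evaluation cadence described above can be illustrated with a minimal sketch. It assumes that an evaluation is triggered each time effective compute doubles relative to the start of a training run; the function name `evaluation_checkpoints` and its parameters are hypothetical and do not come from OpenAI.

```python
def evaluation_checkpoints(start_compute: float,
                           end_compute: float,
                           factor: float = 2.0) -> list[float]:
    """Return the effective-compute levels at which an evaluation would run.

    Assumption: one evaluation per `factor`-fold increase in effective
    compute, counted from `start_compute` (hypothetical interpretation).
    """
    if start_compute <= 0 or end_compute < start_compute or factor <= 1:
        raise ValueError("need 0 < start_compute <= end_compute and factor > 1")

    checkpoints = []
    level = start_compute * factor
    # Keep multiplying by the factor until the run's final compute is passed.
    while level <= end_compute:
        checkpoints.append(level)
        level *= factor
    return checkpoints


if __name__ == "__main__":
    # A run that grows from 1 to 10 units of effective compute.
    print(evaluation_checkpoints(1.0, 10.0))
```

Under this assumption, a run that grows tenfold in effective compute would be evaluated three times during training, in addition to any evaluation of the final model.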
2. Set risk thresholds
We will define risk thresholds that trigger safety measures. We have defined risk-level thresholds for the following initial tracked categories: cybersecurity, CBRN (chemical, biological, radiological, and nuclear threats), persuasion, and model autonomy. We specify four safety risk levels (low, medium, high, and critical). Only models with a post-mitigation score of "medium" or below can be deployed; only models with a post-mitigation score of "high" or below can be developed further. We will also implement additional security measures for models with high or critical (pre-mitigation) risk.
[Figure: Risk levels.]
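The threshold rules above amount to a simple gating check, sketched below. This is an illustration only: the class and function names are hypothetical, the lowest level is assumed to be called "low", and the overall score is assumed to be the worst level across the four tracked categories, which the text above does not state explicitly.

```python
from enum import IntEnum


class RiskLevel(IntEnum):
    # Ordered so that comparisons such as `<=` follow severity.
    LOW = 0        # assumed name for the lowest level
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3


# The four initial tracked categories named in the article.
CATEGORIES = ("cybersecurity", "cbrn", "persuasion", "model_autonomy")


def overall_level(scorecard: dict[str, RiskLevel]) -> RiskLevel:
    """Assumption: the overall score is the worst score in any category."""
    missing = [c for c in CATEGORIES if c not in scorecard]
    if missing:
        raise ValueError(f"scorecard is missing categories: {missing}")
    return max(scorecard[c] for c in CATEGORIES)


def can_deploy(post_mitigation: dict[str, RiskLevel]) -> bool:
    """Deployable only if the post-mitigation score is medium or below."""
    return overall_level(post_mitigation) <= RiskLevel.MEDIUM


def can_develop_further(post_mitigation: dict[str, RiskLevel]) -> bool:
    """Further development only if the post-mitigation score is high or below."""
    return overall_level(post_mitigation) <= RiskLevel.HIGH


def needs_extra_security(pre_mitigation: dict[str, RiskLevel]) -> bool:
    """Extra security measures apply at high or critical pre-mitigation risk."""
    return overall_level(pre_mitigation) >= RiskLevel.HIGH


if __name__ == "__main__":
    post = {c: RiskLevel.LOW for c in CATEGORIES}
    post["cybersecurity"] = RiskLevel.HIGH
    print(can_deploy(post), can_develop_further(post))
```

In this sketch, a model rated "high" in a single category after mitigation could still be developed further but could not be deployed, which matches the two thresholds described above.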
3. Set new operational structure for overseeing technical work and safety decision-making
We will establish an operational structure with a dedicated team to oversee technical work and safety decisions. The Preparedness team will drive technical work to examine the limits of frontier model capabilities, run evaluations, and synthesize reports. This technical work is critical to OpenAI's decisions about safe model development and deployment. We are creating a cross-functional Safety Advisory Group to review all reports and send them to both leadership and the board of directors. While leadership is the decision-maker, the board has the power to overturn decisions.
[Figure: New operational structure for overseeing technical work and safety decision-making.]
4. Increase safety and external accountability
We will develop protocols to improve safety and external accountability. The Preparedness team will conduct regular safety drills to stress-test against the pressures of our business and our own culture. Some safety issues can emerge rapidly, so we have the ability to flag urgent issues for a quick response. We believe it is helpful for this work to receive feedback from people outside OpenAI, and we expect to have it audited by qualified, independent third parties. We will continue to have others red-team and evaluate our models, and we plan to share updates externally.
5. Reduce other known and unknown safety risks
We will help mitigate other known and unknown safety risks. We will work closely with external parties as well as internal teams such as Safety Systems to track real-world misuse. We will also work with Superalignment to track emergent misalignment risks. We are also pioneering new research that measures how risk evolves as models scale, to help forecast risks in advance, similar to our earlier success with scaling laws. Finally, we will run a continuous process to try to surface any emerging "unknown unknowns".