A recent study published in Nature Medicine found that ChatGPT Health, a chatbot launched by OpenAI for medical scenarios, often underestimates the severity of medical emergencies when triaging cases. The research team fed 60 real-world medical cases into the system and compared its triage recommendations with the judgments that three clinicians made based on guidelines and experience.

The results showed that, among the cases that doctors determined should go to the emergency department immediately, ChatGPT Health rated 51.6% as cases where the patient "can see a doctor within 24 to 48 hours," an error known as under-triage. The cases classified as emergencies included diabetic ketoacidosis, impending respiratory failure and other serious conditions that can be fatal if not treated promptly. Ashwin Ramaswamy, lead author of the study and a lecturer in urology at Mount Sinai Hospital in New York City, noted that any doctor with some training would know that such patients must be rushed to the emergency department immediately, but the chatbot seemed to be "waiting for the condition to be undeniably serious" before recommending a trip. For emergencies with very typical symptoms, such as stroke, however, ChatGPT Health classified 100% of cases correctly in this study.
The study also looked at how the system performed across different demographic characteristics. Each case was turned into 16 variants that changed the patient's gender, race and other information; by design, the correct triage recommendation should have been the same for every variant. The study found no evidence of systematic bias in the results by gender or race.
The study also found that ChatGPT Health had the opposite problem with non-urgent cases: compared with doctors, it over-triaged 64.8% of them, for example by advising a patient who had only had a sore throat for three days, and who could have been managed with home care, to be seen within 24 to 48 hours. Ramaswamy said he struggled to see the logic behind the model's recommendations in different scenarios, saying its risk judgments were "kind of inverted, almost the opposite" of clinical risk.
ChatGPT Health's performance was similarly inconsistent in situations involving suicidal ideation or risk of self-harm. OpenAI's policy states that when a user expresses suicidal thoughts, the chatbot should direct them to call 988, the Suicide & Crisis Lifeline, and ChatGPT Health follows the same mechanism. In this study, however, the system sometimes suggested calling 988 when it was not needed and failed to give that advice when it was truly necessary.
In response to the study's conclusions, an OpenAI spokesperson said that the company welcomes research on the application of artificial intelligence in medicine, but believes that the design of this study does not reflect the typical or intended use of ChatGPT Health. According to OpenAI, ChatGPT Health’s interaction model encourages users to keep asking questions and supply more background information, rather than relying on it to make a one-time judgment from a single description. ChatGPT Health is currently available only to a limited group of users. OpenAI says it is continuing to improve the safety and reliability of the model and has not yet rolled it out broadly. Official information also emphasizes that the product is "not for diagnosis or treatment," but is built on a more secure platform that allows users to upload more sensitive personal medical information.
A report released by OpenAI in January this year showed that more than 40 million people around the world have used ChatGPT to answer health-related questions. Nearly 2 million conversations every week relate to health insurance. The vast majority of health consultations occur outside doctors’ normal consultation hours, and more than 500,000 messages every week come from areas that are more than 30 minutes’ drive from the nearest hospital. Researchers point out that AI tools are very attractive to these people because they are inexpensive to access, place no limit on the number of questions, and let users upload all the documents and details they want to discuss. In Ramaswamy’s view, many people are looking not just for advice but also for the interactive experience of a “medical companion.”
However, several experts who were not involved in the research cautioned that the medical capabilities of current chatbots should not be overestimated. John Mafi, an internist at UCLA Health System, said that any AI medical product that affects patient safety must go through rigorous randomized controlled trials to prove that the benefits outweigh the risks before it is rolled out on a large scale. Experts generally believe that chatbots can provide useful health information in many scenarios, but that they cannot yet replace a doctor's face-to-face judgment.
Monica Agrawal, an assistant professor in the Department of Biostatistics and Computer Science at Duke University, pointed out that outsiders still lack transparency into the training data and training methods of large language models, and that many existing evaluation metrics (such as high scores on licensing exams) do not directly reflect real clinical ability. She also mentioned that large language models tend toward sycophancy, echoing the user's opinions even when those opinions are inaccurate, which may reinforce patients' existing misunderstandings and biases. Mafi added that AI tools are “designed to please you,” but doctors sometimes have to say things patients don’t want to hear.
On the question of whether it is safe to rely on chatbots for medical advice, Ramaswamy’s view is that, at least at the current stage, the answer is no. In emergencies especially, people should not rely on AI and should contact doctors or emergency services first. Ethan Goh, executive director of ARISE, an AI research network in Singapore, believes that in many specific situations AI can indeed give safe and workable suggestions, but that users must be aware of its limitations and should not treat it as a substitute for doctors. Experts emphasize that a safer way forward is to use AI in conjunction with doctors, with medical institutions and technology companies cooperating more closely to continuously oversee and improve the tools.
Ramaswamy said that if model capabilities continue to improve, a three-way "patient-AI-doctor" collaboration could bring tangible benefits to patients in remote areas or in global health settings where medical resources are scarce. Until then, how to evaluate and constrain these systems rigorously enough before they make decisions that truly affect lives remains a difficult problem for the medical and technology industries.