In medical dramas, from George Clooney to Noah Wyle in "ER," emergency physicians have long been portrayed as heroes who save lives. But a new study from Harvard shows that in high-pressure emergency triage situations, artificial intelligence systems have surpassed human doctors in diagnostic accuracy. The researchers describe the result as a technological turning point that will "reshape medicine."

The study, published in the journal Science and led by a team at Harvard Medical School, is believed by independent experts to mark a "real advance" in AI's clinical reasoning capabilities, beyond just passing exams or solving artificially constructed test questions. The study used a large-scale experimental design to compare hundreds of doctors with a large language model (LLM), focusing on evaluating performance differences in key scenarios such as emergency triage and long-term treatment planning.

In one of the core experiments, the research team selected 76 real patients who visited the emergency room of a hospital in Boston. The AI system and a team of two human doctors were given the exact same standard electronic medical records, including vital-sign data, demographic information, and a brief nurse's note describing the reason for the visit. Asked to make an initial diagnosis from this limited information, the AI gave an accurate or very close diagnosis in 67% of cases, while human doctors were correct only 50%–55% of the time.

The researchers point out that AI's advantage is particularly pronounced in triage scenarios where information is extremely limited and rapid judgment is required. When the AI and the doctors were provided with more detailed clinical information, the diagnostic accuracy of the AI (OpenAI's o1 reasoning model) improved further to 82%, while the accuracy of the human experts ranged from 70% to 79%, although this difference was not statistically significant.

Beyond emergency triage, AI also outperformed doctors at formulating long-term treatment plans. In another trial, the research team asked the AI and 46 doctors to review five clinical cases, with tasks ranging from designing antibiotic regimens to planning long-term management such as end-of-life care. The treatment plans given by the AI scored significantly higher, at 89%, while doctors relying on traditional sources such as search engines scored only 34%.

Despite this, the researchers emphasized that it is far too early to declare emergency doctors obsolete. The study compared the diagnostic capabilities of AI and humans only at the level of medical-record data that can be rendered as text; it excluded many signals that are crucial in real clinical situations, such as a patient's expressions of pain, emotional state, body language, and other non-textual information like interactions with family members. In other words, the AI in this study was closer to a "behind-the-scenes doctor" offering a second opinion based on paperwork.

"I don't think our findings mean AI will replace doctors," said Arjun Manrai, one of the study's first authors and director of the AI Lab at Harvard Medical School. "I think what it means is that we are witnessing a profound technological change that will reshape the entire health care system." Fellow lead author Adam Rodman, a clinician at Beth Israel Deaconess Medical Center in Boston, called large language models "one of the most impactful technologies in recent decades." He predicted that over the next ten years AI will not replace doctors but will instead form a new "tripartite care model" of doctors, patients, and artificial intelligence systems.

The study also presented a representative clinical case: a patient came to the hospital with pulmonary blood clots and worsening symptoms. The human doctors initially judged that anticoagulant treatment had failed, allowing the disease to progress; but after reading the medical history, the AI noticed a key detail: the patient had lupus erythematosus, an autoimmune disease that can also cause lung inflammation. Further testing confirmed the AI's inference.

The clinical application of AI is not confined to the laboratory; a large number of doctors are already using it in practice. Nearly one in five U.S. doctors has introduced AI-assisted tools into their diagnostic workflow, according to recent research released by the American Medical Association. In the UK, a new survey from the Royal College of Physicians found that 16% of doctors use such technology daily and a further 15% use it at least once a week, with "clinical decision support" among the most common use cases.

However, the British doctors surveyed also expressed strong caution about AI, particularly concerns about the risk of misdiagnosis and questions of liability. Although billions of dollars have poured into medical AI startups worldwide, how responsibility is assigned when AI goes wrong, and who bears the consequences, remains an urgent institutional gap. "There is currently no formal accountability framework," Rodman pointed out, stressing that when faced with life-and-death decisions or complex treatment plans, patients "ultimately want to be guided, accompanied and explained by humans."

Professor Ewen Harrison, co-director of the Center for Medical Informatics at the University of Edinburgh, said the research was significant because it showed "these systems are no longer just about passing medical exams or responding to artificially constructed test questions". In his view, AI is gradually becoming a useful "second opinion tool" for clinicians, especially in scenarios where it is necessary to comprehensively sort out potential diagnoses and avoid missing important causes of disease.

At the same time, Wei Xing, an assistant professor in the School of Mathematics and Physical Sciences at the University of Sheffield in the UK, cautioned that some results in the study suggest that when doctors collaborate with AI, they may unconsciously defer to the AI's conclusions and weaken their independent thinking. "This tendency is likely to increase further as AI becomes routinely used in clinical settings," he noted. Xing also emphasized that the study did not fully disclose for which types of patients the AI performs worse, for example whether elderly patients or patients who are not native English speakers are harder for it to diagnose. These questions cannot be ignored when evaluating safety.

Therefore, although the results of the Harvard trial are encouraging, they do not prove that AI is safe enough for routine, independent use in clinical diagnosis and treatment, nor do they mean the public should turn to free AI tools as a substitute for professional medical advice. For the foreseeable future, AI is more likely to serve as a high-performance "intelligent stethoscope" and "second brain" embedded in a human-led medical system, making diagnosis and treatment more accurate and efficient while raising new questions of responsibility, ethics, and trust for society.