Clinical AI that is more honest about what it doesn’t know
AI for Health Institute researchers develop framework that helps clinical language models know when to be confident, when to be cautious
Combining clinical expertise and experience with the vast and ever-increasing knowledge of artificial intelligence has the potential to transform healthcare by providing earlier diagnoses and predicting outcomes. However, today’s AI has inherent risks of error or overconfidence in a prediction.
Sizhe Wang, a graduate student in the lab of Chenyang Lu, the Fullgraf Professor in the McKelvey School of Engineering at Washington University in St. Louis, developed a framework that teaches clinical AI when to be confident and when to be cautious by providing more trustworthy estimates of certainty and uncertainty in its predictions. The framework, called Clinical Uncertainty Risk Alignment (CURA), will be presented at the Association for Computational Linguistics annual meeting in July 2026.
Lu, also director of the AI for Health Institute at WashU, said AI-human collaboration is among the most important problems in AI for healthcare today, citing studies showing that combining AI with clinicians has led to poorer outcomes than using AI alone.
“This is counterintuitive, because AI provides data-driven predictions, and clinicians have clinical expertise,” he said. “If you combine them in the ideal world, you’re supposed to do better than AI alone. The problem is partly because there are cases where AI was wrong and the clinician followed the suggestion, or sometimes AI was correct, but the clinician rejected the prediction. That’s why this accurate or calibrated uncertainty estimate can be tremendously helpful.”
The CURA framework trains clinical language models to predict patient risk, then calibrates their uncertainty estimates so they better signal when they may be right or wrong. Wang and collaborators used clinical notes and prediction labels from the MIMIC-IV critical care dataset to fine-tune three pretrained clinical language models for risk prediction, then fine-tuned them again for calibrated uncertainty.
“With individual uncertainty calibration, we align a prediction’s uncertainty with the likelihood of error,” said Wang, a first-year doctoral student. “If the model’s prediction is likely correct, we encourage it to be more confident. If predictions are more likely to be wrong, we want the model to express higher uncertainty.”
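The idea of aligning uncertainty with the likelihood of error can be illustrated with a small sketch. This is not CURA's published objective, which the article does not spell out. It assumes, purely for illustration, that uncertainty is the normalized binary entropy of a predicted probability and that misalignment is the squared gap between that uncertainty and a 0/1 error indicator. The function names and the 0.5 decision threshold are hypothetical.

```python
import math


def binary_entropy_uncertainty(p: float) -> float:
    """Normalized binary entropy in [0, 1]: 0 is fully confident, 1 is a coin flip."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    h = -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))
    return h / math.log(2.0)


def alignment_penalty(probs, labels, threshold=0.5):
    """Mean squared gap between uncertainty and a 0/1 error indicator.

    Illustrative assumption only: a low value means the model is confident
    when it is right and uncertain when it is wrong.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        predicted = 1 if p >= threshold else 0
        error = 1.0 if predicted != y else 0.0
        total += (binary_entropy_uncertainty(p) - error) ** 2
    return total / len(probs)
```

Under this toy penalty, a confident wrong prediction is penalized heavily, while a wrong prediction that the model flagged as uncertain is penalized only lightly, which is the behavior Wang describes.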
Their evaluations measured both whether predictions were accurate and whether the stated confidence was reliable across five clinical risk prediction tasks.
“Our main result is that we have better calibration with no loss in its ability to tell high‑risk from low‑risk cases,” Wang said. “CURA improves calibration consistently across all five tasks across three existing clinical language models.”
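The article does not name the metrics used. As a rough sketch, calibration is often summarized with expected calibration error (ECE), and the ability to tell high-risk from low-risk cases with the area under the ROC curve (AUROC). Both are assumed here for illustration, and the function names and bin count are hypothetical.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between mean predicted risk and observed event rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_p = sum(p for p, _ in bucket) / len(bucket)
        event_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / len(probs)) * abs(mean_p - event_rate)
    return ece


def auroc(probs, labels):
    """Probability that a random positive case is scored above a random negative one."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))
```

A result like the one Wang reports would show up in such metrics as a lower calibration error with the ranking metric essentially unchanged.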
CURA shows that predictions with low uncertainty form a safer pool for automated triage, while predictions with high uncertainty flag ambiguous cases for a clinician’s review.
“Original clinical language models often showed near-zero uncertainty, indicating overconfidence,” Wang said. “CURA reduced that overconfidence and assigned higher uncertainty to difficult high-risk cases, helping prioritize cases for clinician review.”
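The routing idea described above can be sketched in a few lines. This is a minimal illustration rather than part of CURA itself: the function name, the `(case_id, uncertainty)` input format, and the 0.3 threshold are all hypothetical, and a real deployment would need a clinically validated cutoff.

```python
def route_by_uncertainty(cases, threshold=0.3):
    """Split (case_id, uncertainty) pairs into auto-triage and clinician-review lists.

    Cases at or below the threshold go to the auto-triage pool; the rest are
    returned for review, most uncertain first.
    """
    auto = [case_id for case_id, u in cases if u <= threshold]
    flagged = sorted(
        (c for c in cases if c[1] > threshold),
        key=lambda c: c[1],
        reverse=True,
    )
    review = [case_id for case_id, _ in flagged]
    return auto, review
```

Sorting the review list by uncertainty is one simple way to prioritize the most ambiguous cases for a clinician.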
Going forward, the team plans to extend CURA to broader patient populations and evaluate its benefits for trustworthy clinical decision making in healthcare settings.
Wang S, Xu Z, Najjuuko C, Alba C, Lu C. CURA: Clinical uncertainty risk alignment for language model–based risk prediction. Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL). Accepted. https://arxiv.org/abs/2604.14651
This work was supported in part by the Fullgraf Foundation and the AI for Health Institute at Washington University in St. Louis. Charles Alba was partially supported by the National University of Singapore Development Grant and the Danforth Scholarship at Washington University in St. Louis. Claire Najjuuko was partially supported by the NIH Researcher Resilience Training Grant (R25MH118935-01). The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies.