To evaluate and compare the diagnostic performance of 2 clinical decision support system tools—ORADIII and ORAD DDx—against histopathological diagnosis in identifying intrabony jaw lesions using orthopantomograms.
Patients and Methods
A diagnostic accuracy, cross-sectional study was conducted in the Department of Oral Medicine and Radiology, Kathmandu University School of Medical Sciences, Dhulikhel Hospital, Kavre, Nepal, from January 1, 2025, to April 30, 2025, after institutional review committee approval. The study was conducted on a sample comprising both lesion and nonlesion cases based on radiographic evaluation. Diagnostic outputs from ORADIII and ORAD DDx were compared with histopathology. Key performance indicators—including sensitivity, specificity, accuracy, F1 score, positive predictive value, negative predictive value, and likelihood ratios (positive and negative)—were calculated for both systems.
Results
Among the 350 samples evaluated (175 lesion-positive and 175 nonlesion cases), ORAD DDx demonstrated superior diagnostic performance compared with ORADIII. The sensitivity, specificity, accuracy, and F1 score for ORADIII were 64.57%, 60.00%, 62.28%, and 0.6314, respectively. In contrast, ORAD DDx achieved a sensitivity, specificity, accuracy, and F1 score of 70.29%, 65.71%, 68.00%, and 0.687, respectively.
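The indicators listed in the methods all follow from a 2x2 confusion matrix. The sketch below shows one way to compute them. The cell counts are not reported in the abstract; they are inferred here from the published percentages and the 175/175 split, so treat them as an assumption. The names `Confusion` and `metrics` are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # lesion present, tool flags lesion
    fn: int  # lesion present, tool misses it
    tn: int  # no lesion, tool reports none
    fp: int  # no lesion, tool flags lesion


def metrics(c: Confusion) -> dict:
    """Standard diagnostic accuracy indicators from a 2x2 table."""
    sens = c.tp / (c.tp + c.fn)
    spec = c.tn / (c.tn + c.fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "accuracy": (c.tp + c.tn) / (c.tp + c.fn + c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
        "f1": 2 * c.tp / (2 * c.tp + c.fp + c.fn),
        "lr_positive": sens / (1 - spec),
        "lr_negative": (1 - sens) / spec,
    }


# ASSUMPTION: counts back-calculated from the reported percentages,
# not taken from the paper's tables.
oradiii = Confusion(tp=113, fn=62, tn=105, fp=70)
orad_ddx = Confusion(tp=123, fn=52, tn=115, fp=60)

for name, table in (("ORADIII", oradiii), ("ORAD DDx", orad_ddx)):
    m = metrics(table)
    print(name, {k: round(v, 4) for k, v in m.items()})
```

Under these inferred counts, the ORAD DDx values reproduce the reported figures. The ORADIII accuracy and F1 score differ from the reported values in the last digit, which may reflect rounding in the source or counts that differ slightly from the ones assumed here.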
Conclusion
ORAD DDx showed better diagnostic performance than ORADIII across most metrics, indicating its potential as a more reliable clinical decision support system for diagnosing intrabony jaw lesions. The difference may be due in part to how it categorizes lesions and their variations. Further validation with larger, stratified, multicenter data sets is recommended.
Diagnostic Accuracy of Clinical Decision Support Systems ORADIII and ORAD DDx to Histopathological Diagnosis of Jaw Lesions. Harleen Bali MDS; Dashrath Kafle MDS; Sagar Adhikari MDS; Nitesh Kumar Chaurasia MDS; Pratibha Poudel MDS; Bhoj Raj Adhikari PhD; Garima Adhikari BDS; Sachita Thapa MDS. Mayo Clinic Proceedings. Digital health, 4(1), Article 100306. Pub Date: 2025-11-17 DOI: 10.1016/j.mcpdig.2025.100306
Pub Date: 2025-11-07 DOI: 10.1016/j.mcpdig.2025.100304
Reviewers for Mayo Clinic Proceedings: Digital Health (2025). Mayo Clinic Proceedings. Digital health, 3(4), Article 100304.
Pub Date: 2025-10-28 DOI: 10.1016/j.mcpdig.2025.100300
Jiwoong Jeong MS; Chieh-Ju Chao MD; Reza Arsanjani MD; Chadi Ayoub MBBS, PhD; Steven J. Lester MD; Milagros Pereyra MD; Ebram F. Said MD; Michael Roarke BS; Cecilia Tagle-Cornell MS; Laura M. Koepke MSN; Yi-Lin Tsai MD; Chen Jung-Hsuan MD; Chun-Chin Chang MD; Juan M. Farina MD; Hari Trivedi MD; Bhavik N. Patel MD, MBA; Imon Banerjee PhD
Objective
To create an opportunistic screening model to predict coronary calcium burden and associated cardiovascular risk using only commonly available frontal chest x-rays (CXR) and patient demographics.
Patients and Methods
We proposed a novel multitask learning framework and trained a model using 2121 patients with paired gated computed tomography scans and CXR images internally (Mayo Clinic) from January 1, 2012, to December 31, 2022, with coronary artery calcification (CAC) scores (0, 1-99, and 100+) as ground truths. Results from the internal training were validated on multiple external datasets (Emory University Healthcare and Taipei Veterans General Hospital—from January 1, 2012, to December 31, 2022) with significant racial and ethnic differences.
Results
Classification performance across the 0, 1-99, and 100+ CAC score classes was moderate on both the internal test set and the external datasets, with average F1 scores of 0.71±0.04 for Mayo, 0.65±0.02 for Emory University Healthcare, and 0.70±0.06 for Taipei Veterans General Hospital. For clinically relevant risk identification, the model reached areas under the receiver operating characteristic curve of 0.86±0.02, 0.77±0.03, and 0.82±0.03 on the internal and 2 external datasets, respectively, for 0 versus 400+. For 0 versus 100+, the corresponding areas under the curve were 0.83±0.03, 0.71±0.02, and 0.78±0.01. Prospective evaluation across 3 Mayo Clinic sites was on par with the external validations and showed only minimal temporal drift.
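The results refer to two kinds of targets: a 3-class label (0, 1-99, 100+) and binary contrasts (0 versus 100+, 0 versus 400+). The sketch below shows one plausible way to derive both from a raw CAC score. The function names, the handling of fractional scores, and the choice to drop intermediate scores from the binary contrasts are assumptions for illustration, not the authors' pipeline.

```python
from typing import Iterable, List, Tuple


def cac_class(score: float) -> str:
    """Map a CAC score to the 3 classes named in the abstract."""
    if score < 0:
        raise ValueError("CAC score cannot be negative")
    if score == 0:
        return "0"
    # ASSUMPTION: any positive score below 100 falls in the 1-99 class.
    if score < 100:
        return "1-99"
    return "100+"


def binary_task(scores: Iterable[float], threshold: float) -> List[Tuple[float, int]]:
    """Build a '0 versus threshold+' contrast.

    Scores equal to 0 are labeled 0, scores at or above the threshold
    are labeled 1, and scores in between are excluded (an assumption).
    """
    return [(s, 0 if s == 0 else 1) for s in scores if s == 0 or s >= threshold]


example_scores = [0, 50, 150, 420]  # hypothetical values
print([cac_class(s) for s in example_scores])
print(binary_task(example_scores, threshold=400))
```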
Conclusion
The open-source fusion artificial intelligence-CXR model performed better than existing state-of-the-art models for predicting CAC scores only on the internal cohort, with robust performance on external datasets. The proposed model may be useful as a robust, first-pass opportunistic screening method for cardiovascular risk from regular CXR.
Artificial Intelligence Chest X-Ray Opportunistic Screening Model for Coronary Artery Calcium Deposition: A Multi-Objective Model With Multimodal Data Fusion. Mayo Clinic Proceedings. Digital health, 3(4), Article 100300.
Pub Date: 2025-10-24 DOI: 10.1016/j.mcpdig.2025.100299
Jennifer L. St. Sauver PhD; Brandon R. Grossardt MS; Alexander D. Weston PhD; Hillary W. Garner MD; Alanna M. Chamberlain PhD; Walter A. Rocca MD; Perry J. Pickhardt MD; Blake Thackeray; Owen R. Keegan; Andrew D. Rule MD
Objective
To determine whether abdominal computed tomography (CT) measures of body composition are associated with fall risk in adults aged 20 to 89 years.
Patients and Methods
We identified persons who received an abdominal CT scan from 2010 to 2020 using the Rochester Epidemiology Project. We calculated subcutaneous and visceral adipose tissue area, skeletal muscle area and density, and vertebral bone area and density using a validated deep learning algorithm applied to an abdominal CT section. Sex-specific tertiles of body composition biomarkers were used for primary analyses. We identified falls using International Classification of Diseases codes and verified them via chart review. Associations between body composition tertiles and falls were assessed using Cox proportional hazards models adjusted for body mass index and the presence of 18 chronic conditions.
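Sex-specific tertiles mean that the cut points are computed separately within each sex before persons are assigned to the lowest, middle, or highest third. A minimal sketch using only the Python standard library follows. The records, the function name, and the default quantile method of `statistics.quantiles` are assumptions for illustration; the abstract does not say how the cut points were computed.

```python
from collections import defaultdict
from statistics import quantiles
from typing import List, Tuple


def sex_specific_tertiles(records: List[Tuple[str, float]]) -> List[str]:
    """Label each (sex, value) record by tertile within its own sex."""
    by_sex = defaultdict(list)
    for sex, value in records:
        by_sex[sex].append(value)

    # quantiles(..., n=3) returns the two cut points between the thirds.
    cuts = {sex: quantiles(values, n=3) for sex, values in by_sex.items()}

    labels = []
    for sex, value in records:
        low_cut, high_cut = cuts[sex]
        if value <= low_cut:
            labels.append("lowest")
        elif value <= high_cut:
            labels.append("middle")
        else:
            labels.append("highest")
    return labels


# Hypothetical muscle density values, not data from the study.
records = [("F", v) for v in (10, 20, 30, 40, 50, 60)]
records += [("M", v) for v in (110, 120, 130, 140, 150, 160)]
print(sex_specific_tertiles(records))
```

Because the cut points differ by sex, a value of 60 is in the highest tertile for the hypothetical "F" group even though it would be far below every "M" value.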
Results
We included 3972 persons aged 20 to 89 years. Subcutaneous and visceral fat area, skeletal muscle area, bone area, and bone density were not associated with fall risk (all adjusted P>.05). By contrast, lower muscle density was associated with an increased risk of falls (adjusted hazard ratio for the lowest vs the middle tertile, 2.31; 95% CI, 1.70-3.14). The association between low muscle density and an increased risk of falls was strongest in persons aged 45 to 64 years (adjusted hazard ratio, 4.98; 95% CI, 2.80-8.85).
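Confidence intervals for hazard ratios are usually built on the log scale, so the point estimate should sit near the geometric mean of the two bounds. The sketch below applies that consistency check to the intervals reported above and recovers the implied standard error of the log hazard ratio. It assumes Wald-type 95% intervals with z = 1.96, which the abstract does not state, and the function name is illustrative.

```python
import math
from typing import Tuple


def log_scale_summary(lower: float, upper: float, z: float = 1.96) -> Tuple[float, float]:
    """Return (implied point estimate, implied SE of the log ratio).

    ASSUMPTION: the interval is symmetric on the log scale,
    i.e. exp(log(HR) +/- z * SE).
    """
    center = math.sqrt(lower * upper)  # geometric mean of the bounds
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return center, se


for label, lo, hi in (
    ("lowest vs middle tertile", 1.70, 3.14),
    ("ages 45 to 64 years", 2.80, 8.85),
):
    center, se = log_scale_summary(lo, hi)
    print(f"{label}: implied HR {center:.2f}, SE(log HR) {se:.3f}")
```

The implied centers round to 2.31 and 4.98, matching the reported adjusted hazard ratios. The wider interval in the 45 to 64 subgroup corresponds to a larger standard error, as expected for a smaller subgroup.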
Conclusion
Muscle density measures from abdominal CT scans may be useful for understanding physiologic changes in the abdomen that place persons at an increased risk of falls as early as middle age.
Associations Between Deep Learning–Derived Fat, Muscle, and Bone Measures From Abdominal Computed Tomography Scans and Fall Risk in Persons Aged 20 Years or Older. Mayo Clinic Proceedings. Digital health, 3(4), Article 100299.
Pub Date: 2025-10-15 DOI: 10.1016/j.mcpdig.2025.100297
Jorge C. Correia MD, MSc; Katarzyna Wac PhD; Catherine Joly BSc; Jean-Philippe Assal MD; Surabhi Joshi MA; Cosette Fakih El Khoury PhD; Zoltan Pataky MD
Therapeutic Patient Education in the Digital Era: Opportunities and Challenges in Diabetes Care. Mayo Clinic Proceedings. Digital health, 3(4), Article 100297.
Digital tools are often seen as promising avenues for promoting and sustaining healthy lifestyle behaviors. They not only offer benefits such as personalization, scalability, and cost-effectiveness but also raise significant ethical concerns. Issues such as equitable access, informed consent, and fair outcomes, particularly for vulnerable populations, must be addressed. An ethical framework is needed to guide the creation of digital lifestyle interventions. A narrative review was conducted across 3 domains: (1) general ethical principles for public health interventions, (2) ethical frameworks for lifestyle interventions, and (3) ethical considerations for digital tools in health promotion. A total of 16 articles were found across all 3 inclusion domains. The following 5 core ethical themes were identified: (1) respect for autonomy; (2) beneficence; (3) harms; (4) equity; and (5) responsibility, sustainability, and accountability. Two ethical considerations stood out in the context of digital interventions: health equity and privacy. Although digital tools may be an effective form of lifestyle intervention, they can disproportionately benefit individuals already in advantaged positions. We present a basic ethical framework for guiding the development and deployment of these digital tools. The framework highlights the tensions that may arise between competing ethical principles and helps developers determine which considerations are most relevant, and to whom, at different stages of intervention design and development.
Creating a Basic Ethical Framework for Digital Lifestyle Interventions: A Narrative Review. Nicolien D.M. Dinklo MA; Maartje H.N. Schermer MD, PhD; Ineke Bolt PhD; Hafez Ismaili M’hamdi PhD. Mayo Clinic Proceedings. Digital health, 3(4), Article 100295. Pub Date: 2025-10-14 DOI: 10.1016/j.mcpdig.2025.100295
Pub Date: 2025-10-14 DOI: 10.1016/j.mcpdig.2025.100296
Donald U. Apakama MD, MS; Kim-Anh-Nhi Nguyen MS; Daphnee Hyppolite MPA, RHIA; Shelly Soffer MD; Aya Mudrik BS; Emilia Ling MD, MBA, MS; Akini Moses MD; Ivanka Temnycky MS; Allison Glasser MBA; Rebecca Anderson MPH; Prathamesh Parchure MS; Evajoyce Woullard MS; Masoud Edalati PhD; Lili Chan MD, MS; Clair Kronk PhD; Robert Freeman RN; Arash Kia MD; Prem Timsina MD, PhD; Matthew A. Levin MD; Rohan Khera MD, MS; Girish N. Nadkarni MD, MPH
Objective
To evaluate whether generative pretrained transformer (GPT)-4 can detect and revise biased language in emergency department (ED) notes, against human-adjudicated gold-standard labels, and to identify modifiable factors associated with biased documentation.
Patients and Methods
We randomly sampled 50,000 ED medical and nursing notes from the Mount Sinai Health System (January 1, 2023, to December 31, 2023). We also randomly sampled 500 discharge notes from the Medical Information Mart for Intensive Care IV database. GPT-4 flagged 4 types of bias: discrediting, stigmatizing/labeling, judgmental, and stereotyping. Two human reviewers verified the model's detections. We used multivariable logistic regression to examine associations between bias and health care utilization, presenting problems (eg, substance use), shift timing, and provider type. We then asked physicians to rate GPT-4’s proposed language revisions on a 10-point scale.
Results
GPT-4 showed 97.6% sensitivity and 85.7% specificity compared with human review. Biased language appeared in 6.5% (3229 of 50,000) of Mount Sinai notes and 7.4% (37 of 500) of Medical Information Mart for Intensive Care IV notes. In adjusted models, frequent health care utilization (adjusted odds ratio [aOR], 2.85; 95% CI, 1.95-4.17), substance use presentations (aOR, 3.09; 95% CI, 2.51-3.80), and overnight shifts (aOR, 1.37; 95% CI, 1.23-1.52) were associated with elevated odds of biased documentation. Physicians were more likely than nurses to include biased language (aOR, 2.26; 95% CI, 2.07-2.46). GPT-4’s recommended revisions received mean physician ratings above 9 of 10.
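The two prevalence figures can be recomputed directly from the reported counts. The sketch below does so and also attaches a Wilson 95% interval to each proportion to show how much less precise the 500-note sample is. The intervals are not reported in the abstract; they are added here purely as an illustration, and the function name is hypothetical.

```python
import math
from typing import Tuple


def proportion_with_wilson(k: int, n: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Return (proportion, Wilson lower bound, Wilson upper bound)."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, center - half_width, center + half_width


for label, k, n in (
    ("Mount Sinai notes", 3229, 50_000),
    ("MIMIC-IV notes", 37, 500),
):
    p, lo, hi = proportion_with_wilson(k, n)
    # The interval is an illustrative addition, not a figure from the study.
    print(f"{label}: {p * 100:.1f}% (Wilson 95% interval {lo * 100:.1f}% to {hi * 100:.1f}%)")
```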
Conclusion
The study showed that GPT-4 accurately detects biased language in clinical notes, identifies modifiable contributors to that bias, and delivers physician-endorsed revisions. This approach may help mitigate documentation bias and reduce disparities in care.
Identifying Bias at Scale in Clinical Notes Using Large Language Models. Mayo Clinic Proceedings. Digital health, 3(4), Article 100296.