AI Awareness and Tobacco Policy Messaging Among US Adults: Electronic Experimental Study
Julia Mary Alber, David Askay, Anuraj Dhillon, Lauren Sandoval, Sofia Ramos, Katharine Santilena
Background: Despite public health efforts, tobacco use remains the leading cause of preventable death in the United States and continues to disproportionately affect underrepresented populations. Public policies are needed to improve health equity in tobacco-related health outcomes. One strategy for promoting public support for these policies is through health messaging. Improvements in artificial intelligence (AI) technology offer new opportunities to create tailored policy messages quickly; however, there is limited research on how the public might perceive the use of AI for public health messages.
Objective: This study aimed to examine how knowledge of AI use impacts perceptions of a tobacco control policy video.
Methods: A national sample of US adults (N=500) was shown the same AI-generated video that focused on a tobacco control policy. Participants were then randomly assigned to 1 of 4 conditions where they were (1) told the narrator of the video was AI, (2) told the narrator of the video was human, (3) told it was unknown whether the narrator was AI or human, or (4) not provided any information about the narrator.
Results: Perceived video rating, effectiveness, and credibility did not significantly differ among the conditions. However, the mean speaker rating was significantly higher (P=.001) when participants were told the narrator of the health message was human (mean 3.65, SD 0.91) compared to the other conditions. Notably, positive attitudes toward AI were highest among those not provided information about the narrator; however, this difference was not statistically significant (mean 3.04, SD 0.90).
Conclusions: Results suggest that AI may impact perceptions of the speaker of a video; however, more research is needed to understand if these impacts would occur over time and after multiple exposures to content. Further qualitative research may help explain why potential differences may have occurred in speaker ratings. Public health professionals and researchers should further consider the ethics and cost-effectiveness of using AI for health messaging.
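The analysis implied above (random assignment to four narrator-information conditions followed by a comparison of mean ratings) can be sketched as follows. This is a minimal illustration only; it assumes pandas and SciPy, and the simulated ratings and column names are placeholders, not study data.

    import numpy as np
    import pandas as pd
    from scipy import stats

    rng = np.random.default_rng(42)
    conditions = ["told_ai", "told_human", "told_unknown", "no_information"]

    # 500 simulated participants, each randomly assigned to one narrator-information condition
    df = pd.DataFrame({
        "condition": rng.choice(conditions, size=500),
        "speaker_rating": rng.integers(1, 6, size=500).astype(float),  # 1-5 Likert-style rating
    })

    # Compare mean speaker ratings across the four conditions with a one-way ANOVA
    groups = [df.loc[df["condition"] == c, "speaker_rating"] for c in conditions]
    f_stat, p_value = stats.f_oneway(*groups)
    print(df.groupby("condition")["speaker_rating"].agg(["mean", "std"]).round(2))
    print(f"F={f_stat:.2f}, P={p_value:.3f}")
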
{"title":"AI Awareness and Tobacco Policy Messaging Among US Adults: Electronic Experimental Study.","authors":"Julia Mary Alber, David Askay, Anuraj Dhillon, Lauren Sandoval, Sofia Ramos, Katharine Santilena","doi":"10.2196/72987","DOIUrl":"10.2196/72987","url":null,"abstract":"<p><strong>Background: </strong>Despite public health efforts, tobacco use remains the leading cause of preventable death in the United States and continues to disproportionately affect underrepresented populations. Public policies are needed to improve health equity in tobacco-related health outcomes. One strategy for promoting public support for these policies is through health messaging. Improvements in artificial intelligence (AI) technology offer new opportunities to create tailored policy messages quickly; however, there is limited research on how the public might perceive the use of AI for public health messages.</p><p><strong>Objective: </strong>This study aimed to examine how knowledge of AI use impacts perceptions of a tobacco control policy video.</p><p><strong>Methods: </strong>A national sample of US adults (N=500) was shown the same AI-generated video that focused on a tobacco control policy. Participants were then randomly assigned to 1 of 4 conditions where they were (1) told the narrator of the video was AI, (2) told the narrator of the video was human, (3) told it was unknown whether the narrator was AI or human, or (4) not provided any information about the narrator.</p><p><strong>Results: </strong>Perceived video rating, effectiveness, and credibility did not significantly differ among the conditions. However, the mean speaker rating was significantly higher (P=.001) when participants were told the narrator of the health message was human (mean 3.65, SD 0.91) compared to the other conditions. Notably, positive attitudes toward AI were highest among those not provided information about the narrator; however, this difference was not statistically significant (mean 3.04, SD 0.90).</p><p><strong>Conclusions: </strong>Results suggest that AI may impact perceptions of the speaker of a video; however, more research is needed to understand if these impacts would occur over time and after multiple exposures to content. Further qualitative research may help explain why potential differences may have occurred in speaker ratings. Public health professionals and researchers should further consider the ethics and cost-effectiveness of using AI for health messaging.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e72987"},"PeriodicalIF":2.0,"publicationDate":"2025-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12558419/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145380050","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Development and Evaluation of a Retrieval-Augmented Generation Chatbot for Orthopedic and Trauma Surgery Patient Education: Mixed-Methods Study
David Baur, Jörg Ansorg, Christoph-Eckhard Heyde, Anna Voelker
Background: Large language models are increasingly applied in health care for documentation, patient education, and clinical decision support. However, their factual reliability can be compromised by hallucinations and a lack of source traceability. Retrieval-augmented generation (RAG) enhances response accuracy by combining generative models with document retrieval mechanisms. While promising in medical contexts, RAG-based systems remain underexplored in orthopedic and trauma surgery patient education, particularly in non-English settings.
Objective: This study aimed to develop and evaluate a RAG-based chatbot that provides German-language, evidence-based information on common orthopedic conditions. We assessed the system's performance in terms of response accuracy, contextual precision, and alignment with retrieved sources. In addition, we examined user satisfaction, usability, and perceived trustworthiness.
Methods: The chatbot integrated OpenAI's GPT language model with a Qdrant vector database for semantic search. Its corpus consisted of 899 curated German-language documents, including national orthopedic guidelines and patient education content from the Orthinform platform of the German Society of Orthopedics and Trauma Surgery. After preprocessing, the data were segmented into 18,197 retrievable chunks. Evaluation occurred in two phases: (1) human validation by 30 participants (orthopedic specialists, medical students, and nonmedical users), who rated 12 standardized chatbot responses using a 5-point Likert scale, and (2) automated evaluation of 100 synthetic queries using the Retrieval-Augmented Generation Assessment Scale, measuring answer relevancy, contextual precision, and faithfulness. A permanent disclaimer indicated that the chatbot provides general information only and is not intended for diagnosis or treatment decisions.
Results: Human ratings indicated high perceived quality for accuracy (mean 4.55, SD 0.45), helpfulness (mean 4.61, SD 0.57), ease of use (mean 4.90, SD 0.30), and clarity (mean 4.77, SD 0.43), while trust scored slightly lower (mean 4.23, SD 0.56). Retrieval-Augmented Generation Assessment Scale evaluation confirmed strong technical performance for answer relevancy (mean 0.864, SD 0.223), contextual precision (mean 0.891, SD 0.201), and faithfulness (mean 0.853, SD 0.171). Performance was highest for knee and back-related topics and lower for hip-related queries (eg, gluteal tendinopathy), which showed elevated error rates in differential diagnosis.
Conclusions: The chatbot demonstrated strong performance in delivering orthopedic patient education through an RAG framework. Its deployment on the national Orthinform platform has led to more than 9500 real-world user interactions, supporting its relevance and acceptance. Future improvements should focus on expanding domain coverage and enhancing retrieval precision.
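The retrieval-augmented generation pattern described in the Methods can be sketched roughly as follows. This is a minimal illustration assuming the openai and qdrant-client Python packages; the collection name, the "text" payload field, the embedding and chat model names, and the prompt wording are chosen for the example and are not the authors' implementation.

    from openai import OpenAI
    from qdrant_client import QdrantClient

    llm = OpenAI()                              # reads OPENAI_API_KEY from the environment
    store = QdrantClient(url="http://localhost:6333")

    def answer(question: str, collection: str = "ortho_docs", k: int = 5) -> str:
        # 1. Embed the user question.
        emb = llm.embeddings.create(model="text-embedding-3-small", input=question)
        query_vector = emb.data[0].embedding

        # 2. Retrieve the k most similar document chunks (indexed earlier with a "text" payload field).
        hits = store.search(collection_name=collection, query_vector=query_vector, limit=k)
        context = "\n\n".join(hit.payload["text"] for hit in hits)

        # 3. Generate an answer constrained to the retrieved context.
        chat = llm.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system",
                 "content": "Answer the patient's question using only the provided context. "
                            "State clearly that this is general information, not a diagnosis."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
            ],
        )
        return chat.choices[0].message.content

    print(answer("Was hilft bei einem Bandscheibenvorfall?"))
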
{"title":"Development and Evaluation of a Retrieval-Augmented Generation Chatbot for Orthopedic and Trauma Surgery Patient Education: Mixed-Methods Study.","authors":"David Baur, Jörg Ansorg, Christoph-Eckhard Heyde, Anna Voelker","doi":"10.2196/75262","DOIUrl":"10.2196/75262","url":null,"abstract":"<p><strong>Background: </strong>Large language models are increasingly applied in health care for documentation, patient education, and clinical decision support. However, their factual reliability can be compromised by hallucinations and a lack of source traceability. Retrieval-augmented generation (RAG) enhances response accuracy by combining generative models with document retrieval mechanisms. While promising in medical contexts, RAG-based systems remain underexplored in orthopedic and trauma surgery patient education, particularly in non-English settings.</p><p><strong>Objective: </strong>This study aimed to develop and evaluate a RAG-based chatbot that provides German-language, evidence-based information on common orthopedic conditions. We assessed the system's performance in terms of response accuracy, contextual precision, and alignment with retrieved sources. In addition, we examined user satisfaction, usability, and perceived trustworthiness.</p><p><strong>Methods: </strong>The chatbot integrated OpenAI's GPT language model with a Qdrant vector database for semantic search. Its corpus consisted of 899 curated German-language documents, including national orthopedic guidelines and patient education content from the Orthinform platform of the German Society of Orthopedics and Trauma Surgery. After preprocessing, the data were segmented into 18,197 retrievable chunks. Evaluation occurred in two phases: (1) human validation by 30 participants (orthopedic specialists, medical students, and nonmedical users), who rated 12 standardized chatbot responses using a 5-point Likert scale, and (2) automated evaluation of 100 synthetic queries using the Retrieval-Augmented Generation Assessment Scale, measuring answer relevancy, contextual precision, and faithfulness. A permanent disclaimer indicated that the chatbot provides general information only and is not intended for diagnosis or treatment decisions.</p><p><strong>Results: </strong>Human ratings indicated high perceived quality for accuracy (mean 4.55, SD 0.45), helpfulness (mean 4.61, SD 0.57), ease of use (mean 4.90, SD 0.30), and clarity (mean 4.77, SD 0.43), while trust scored slightly lower (mean 4.23, SD 0.56). Retrieval-Augmented Generation Assessment Scale evaluation confirmed strong technical performance for answer relevancy (mean 0.864, SD 0.223), contextual precision (mean 0.891, SD 0.201), and faithfulness (mean 0.853, SD 0.171). Performance was highest for knee and back-related topics and lower for hip-related queries (eg, gluteal tendinopathy), which showed elevated error rates in differential diagnosis.</p><p><strong>Conclusions: </strong>The chatbot demonstrated strong performance in delivering orthopedic patient education through an RAG framework. Its deployment on the national Orthinform platform has led to more than 9500 real-world user interactions, supporting its relevance and acceptance. 
Future improvements should focus on expanding domain coverage, enhancing retrieval pre","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e75262"},"PeriodicalIF":2.0,"publicationDate":"2025-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12551339/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145356933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Aiding Large Language Models Using Clinical Scoresheets for Neurobehavioral Diagnostic Classification From Text: Algorithm Development and Validation
Kaiying Lin, Abdur Rasool, Saimourya Surabhi, Cezmi Mutlu, Haopeng Zhang, Dennis P Wall, Peter Washington
Background: Large language models (LLMs) have demonstrated the ability to perform complex tasks traditionally requiring human intelligence. However, their use in automated diagnostics for psychiatry and behavioral sciences remains understudied.
Objective: This study aimed to evaluate whether incorporating structured clinical assessment scales improved the diagnostic performance of LLM-based chatbots for neuropsychiatric conditions (we evaluated autism spectrum disorder, aphasia, and depression datasets) across two prompting strategies: (1) direct diagnosis and (2) code generation. We aimed to contextualize LLM-based diagnostic performance by benchmarking it against prior work that applied traditional machine learning classifiers to the same datasets, allowing us to assess whether LLMs offer competitive or complementary capabilities in clinical classification tasks.
Methods: We tested two approaches using ChatGPT, Gemini, and Claude models: (1) direct diagnostic querying and (2) execution of chatbot-generated code for classification. Three diagnostic datasets were used: ASDBank (autism spectrum disorder), AphasiaBank (aphasia), and Distress Analysis Interview Corpus-Wizard-of-Oz interviews (depression and related conditions). Each approach was evaluated with and without the aid of clinical assessment scales. Performance was compared to existing machine learning benchmarks on these datasets.
Results: Across all 3 datasets, incorporating clinical assessment scales led to little improvement in performance, and results remained inconsistent and generally below those reported in previous studies. On the AphasiaBank dataset, the direct diagnosis approach using ChatGPT with GPT-4 produced a low F1-score of 65.6% and specificity of 33%. The code generation method improved results, with ChatGPT with GPT-4o reaching an F1-score of 81.4%, specificity of 78.6%, and sensitivity of 84.3%. ChatGPT with GPT-o3 and Gemini 2.5 Pro performed even better, with F1-scores of 86.5% and 84.3%, respectively. For the ASDBank dataset, direct diagnosis results were lower, with F1-scores of 56% for ChatGPT with GPT-4 and 54% for ChatGPT with GPT-4o. Under code generation, ChatGPT with GPT-o3 reached 67.9%, and Claude 3.5 performed reasonably well with 60%. Gemini 2.5 Pro failed to respond under this assessment condition. In the Distress Analysis Interview Corpus-Wizard-of-Oz dataset, direct diagnosis yielded high accuracy (70.9%) but a poor F1-score of 8% using ChatGPT with GPT-4o. Code generation improved specificity (88.6% with ChatGPT with GPT-4o), but F1-scores remained low overall. These findings suggest that, while clinical scales may help structure outputs, prompting alone remains insufficient for consistent diagnostic accuracy.
Conclusions: Current LLM-based chatbots, when prompted naively, underperform previously reported machine learning benchmarks on these neurobehavioral diagnostic tasks.
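The sensitivity, specificity, and F1-score figures reported above can be derived from binary predictions as in the brief sketch below; it assumes scikit-learn, and the labels are illustrative, not drawn from the study datasets.

    from sklearn.metrics import confusion_matrix, f1_score

    # 1 = condition present (eg, aphasia), 0 = control; purely illustrative labels
    y_true = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 1, 0, 1, 1, 0, 1, 0, 0]   # eg, parsed from a chatbot's yes/no diagnosis reply

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)               # true positive rate
    specificity = tn / (tn + fp)               # true negative rate
    f1 = f1_score(y_true, y_pred)
    print(f"F1={f1:.1%}, sensitivity={sensitivity:.1%}, specificity={specificity:.1%}")
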
{"title":"Aiding Large Language Models Using Clinical Scoresheets for Neurobehavioral Diagnostic Classification From Text: Algorithm Development and Validation.","authors":"Kaiying Lin, Abdur Rasool, Saimourya Surabhi, Cezmi Mutlu, Haopeng Zhang, Dennis P Wall, Peter Washington","doi":"10.2196/75030","DOIUrl":"10.2196/75030","url":null,"abstract":"<p><strong>Background: </strong>Large language models (LLMs) have demonstrated the ability to perform complex tasks traditionally requiring human intelligence. However, their use in automated diagnostics for psychiatry and behavioral sciences remains under-studied.</p><p><strong>Objective: </strong>This study aimed to evaluate whether incorporating structured clinical assessment scales improved the diagnostic performance of LLM-based chatbots for neuropsychiatric conditions (we evaluated autism spectrum disorder, aphasia, and depression datasets) across two prompting strategies: (1) direct diagnosis and (2) code generation. We aimed to contextualize LLM-based diagnostic performance by benchmarking it against prior work that applied traditional machine learning classifiers to the same datasets, allowing us to assess whether LLMs offer competitive or complementary capabilities in clinical classification tasks.</p><p><strong>Methods: </strong>We tested two approaches using ChatGPT, Gemini, and Claude models: (1) direct diagnostic querying and (2) execution of chatbot-generated code for classification. Three diagnostic datasets were used: ASDBank (autism spectrum disorder), AphasiaBank (aphasia), and Distress Analysis Interview Corpus-Wizard-of-Oz interviews (depression and related conditions). Each approach was evaluated with and without the aid of clinical assessment scales. Performance was compared to existing machine learning benchmarks on these datasets.</p><p><strong>Results: </strong>Across all 3 datasets, incorporating clinical assessment scales led to little improvement in performance, and results remained inconsistent and generally below those reported in previous studies. On the AphasiaBank dataset, the direct diagnosis approach using ChatGPT with GPT-4 produced a low F<sub>1</sub>-score of 65.6% and specificity of 33%. The code generation method improved results, with ChatGPT with GPT-4o reaching an F<sub>1</sub>-score of 81.4%, specificity of 78.6%, and sensitivity of 84.3%. ChatGPT with GPT-o3 and Gemini 2.5 Pro performed even better, with F<sub>1</sub>-scores of 86.5% and 84.3%, respectively. For the ASDBank dataset, direct diagnosis results were lower, with F<sub>1</sub>-scores of 56% for ChatGPT with GPT-4 and 54% for ChatGPT with GPT-4o. Under code generation, ChatGPT with GPT-o3 reached 67.9%, and Claude 3.5 performed reasonably well with 60%. Gemini 2.5 Pro failed to respond under this assessment condition. In the Distress Analysis Interview Corpus-Wizard-of-Oz dataset, direct diagnosis yielded high accuracy (70.9%) but poor F<sub>1</sub>-scores of 8% using ChatGPT with GPT-4o. Code generation improved specificity-88.6% with ChatGPT with GPT-4o-but F<sub>1</sub>-scores remained low overall. 
These findings suggest that, while clinical scales may help structure outputs, prompting alone remains insufficient for consistent diagnostic accuracy.</p><p><strong>Conclusions: </strong>Current LLM-based chatbots, when prompted naively, under","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e75030"},"PeriodicalIF":2.0,"publicationDate":"2025-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12587012/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145350379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Comparison of Japanese Mpox (Monkeypox) Health Education Materials and Texts Created by Artificial Intelligence: Cross-Sectional Quantitative Content Analysis Study
Shinya Ito, Emi Furukawa, Tsuyoshi Okuhara, Hiroko Okada, Takahiro Kiuchi
Background: Mpox (monkeypox) outbreaks since 2022 have emphasized the importance of accessible health education materials. However, many Japanese online resources on mpox are difficult to understand, creating barriers for public health communication. Recent advances in artificial intelligence (AI) such as ChatGPT-4o show promise in generating more comprehensible and actionable health education content.
Objective: The aim of this study was to evaluate the comprehensibility, actionability, and readability of Japanese health education materials on mpox compared with texts generated by ChatGPT-4o.
Methods: A cross-sectional study was conducted using systematic quantitative content analysis. A total of 119 publicly available Japanese health education materials on mpox were compared with 30 texts generated by ChatGPT-4o. Websites containing videos, social media posts, academic papers, and non-Japanese language content were excluded. For generating ChatGPT-4o texts, we used 3 separate prompts with 3 different keywords. For each keyword, text generation was repeated 10 times, with prompt history deleted each time to prevent previous outputs from influencing subsequent generations and to account for output variability. The Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P) was used to assess the understandability and actionability of the generated text, while the Japanese Readability Measurement System (jReadability) was used to evaluate readability. The Journal of the American Medical Association benchmark criteria were applied to evaluate the quality of the materials.
Results: A total of 119 Japanese mpox-related health education web pages and 30 ChatGPT-4o-generated texts were analyzed. AI-generated texts significantly outperformed web pages in understandability, with 80% (24/30) scoring ≥70% in PEMAT-P (P<.001). Readability scores for AI texts (mean 2.9, SD 0.4) were also higher than those for web pages (mean 2.4, SD 1.0; P=.009). However, web pages included more visual aids and actionable guidance such as practical instructions, which were largely absent in AI-generated content. Government agencies authored 90 (75.6%) out of 119 web pages, but only 31 (26.1%) included proper attribution. Most web pages (117/119, 98.3%) disclosed sponsorship and ownership.
Conclusions: AI-generated texts were easier to understand and read than traditional web-based materials. However, web-based texts provided more visual aids and practical guidance. Combining AI-generated texts with traditional web-based materials may enhance the effectiveness of health education materials and improve accessibility to a broader audience. Further research is needed to explore the integration of AI-generated content into public health communication strategies and policies to optimize information delivery during health crises such as the mpox outbreak.
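A minimal sketch of the stateless generation procedure described in the Methods, in which each request is an independent single-turn call so that no prompt history carries over between repetitions. It assumes the openai Python package; the keywords, prompt wording, and model name are illustrative rather than the authors' exact settings.

    from openai import OpenAI

    client = OpenAI()
    keywords = ["エムポックス", "サル痘", "mpox"]   # 3 illustrative keywords
    texts = []

    for keyword in keywords:
        for _ in range(10):
            # Each request is a fresh single-turn conversation, so earlier outputs
            # cannot influence later generations (the "deleted prompt history" condition).
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user",
                           "content": f"{keyword}について、一般向けの健康教育資料を日本語で作成してください。"}],
            )
            texts.append(response.choices[0].message.content)

    print(len(texts))   # 3 keywords x 10 repetitions = 30 texts
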
{"title":"Comparison of Japanese Mpox (Monkeypox) Health Education Materials and Texts Created by Artificial Intelligence: Cross-Sectional Quantitative Content Analysis Study.","authors":"Shinya Ito, Emi Furukawa, Tsuyoshi Okuhara, Hiroko Okada, Takahiro Kiuchi","doi":"10.2196/70604","DOIUrl":"10.2196/70604","url":null,"abstract":"<p><strong>Background: </strong>Mpox (monkeypox) outbreaks since 2022 have emphasized the importance of accessible health education materials. However, many Japanese online resources on mpox are difficult to understand, creating barriers for public health communication. Recent advances in artificial intelligence (AI) such as ChatGPT-4o show promise in generating more comprehensible and actionable health education content.</p><p><strong>Objective: </strong>The aim of this study was to evaluate the comprehensibility, actionability, and readability of Japanese health education materials on mpox compared with texts generated by ChatGPT-4o.</p><p><strong>Methods: </strong>A cross-sectional study was conducted using systematic quantitative content analysis. A total of 119 publicly available Japanese health education materials on mpox were compared with 30 texts generated by ChatGPT-4o. Websites containing videos, social media posts, academic papers, and non-Japanese language content were excluded. For generating ChatGPT-4o texts, we used 3 separate prompts with 3 different keywords. For each keyword, text generation was repeated 10 times, with prompt history deleted each time to prevent previous outputs from influencing subsequent generations and to account for output variability. The Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P) was used to assess the understandability and actionability of the generated text, while the Japanese Readability Measurement System (jReadability) was used to evaluate readability. The Journal of the American Medical Association benchmark criteria were applied to evaluate the quality of the materials.</p><p><strong>Results: </strong>A total of 119 Japanese mpox-related health education web pages and 30 ChatGPT-4o-generated texts were analyzed. AI-generated texts significantly outperformed web pages in understandability, with 80% (24/30) scoring ≥70% in PEMAT-P (P<.001). Readability scores for AI texts (mean 2.9, SD 0.4) were also higher than those for web pages (mean 2.4, SD 1.0; P=.009). However, web pages included more visual aids and actionable guidance such as practical instructions, which were largely absent in AI-generated content. Government agencies authored 90 (75.6%) out of 119 web pages, but only 31 (26.1%) included proper attribution. Most web pages (117/119, 98.3%) disclosed sponsorship and ownership.</p><p><strong>Conclusions: </strong>AI-generated texts were easier to understand and read than traditional web-based materials. However, web-based texts provided more visual aids and practical guidance. Combining AI-generated texts with traditional web-based materials may enhance the effectiveness of health education materials and improve accessibility to a broader audience. 
Further research is needed to explore the integration of AI-generated content into public health communication strategies and policies to optimize information delivery during health crises such as the mpox outbreak","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e70604"},"PeriodicalIF":2.0,"publicationDate":"2025-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12579291/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145314268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep Learning Models to Screen Electronic Health Records for Breast and Colorectal Cancer Progression: Performance Evaluation Study
Pascal Lambert, Rayyan Khan, Marshall Pitz, Harminder Singh, Helen Chen, Kathleen Decker
Background: Cancer progression is an important outcome in cancer research. However, it is frequently documented only in electronic health records (EHRs) as unstructured text, which requires lengthy and costly chart reviews to extract for retrospective studies.
Objective: This study aimed to evaluate the performance of 3 deep learning language models in determining breast and colorectal cancer progression in EHRs.
Methods: EHRs for individuals diagnosed with stage 4 breast or colorectal cancer between 2004 and 2020 in Manitoba, Canada, were extracted. A chart review was conducted to identify cancer progression in each EHR. Data were analyzed with pretrained deep learning language models (Bio+ClinicalBERT, Clinical-BigBird, and Clinical-Longformer). Sensitivity, positive predictive value, area under the curve, and scaled Brier scores were used to evaluate performance. Influential tokens were identified by removing and adding tokens to EHRs and examining changes in predicted probabilities.
Results: Clinical-BigBird and Clinical-Longformer models for breast and colorectal cancer cohorts demonstrated higher accuracy than the Bio+ClinicalBERT models (scaled Brier scores for breast cancer models: 0.70-0.79 vs 0.49-0.71; scaled Brier scores for colorectal cancer models: 0.61-0.65 vs 0.49-0.61). The same models also demonstrated higher sensitivity (breast cancer models: 86.6%-94.3% vs 76.6%-87.1%; colorectal cancer models: 73.1%-78.9% vs 62.8%-77.0%) and positive predictive value (breast cancer models: 77.9%-92.3% vs 80.6%-85.5%; colorectal cancer models: 81.6%-86.3% vs 72.9%-82.9%) compared to Bio+ClinicalBERT models. All models could remove more than 84% of charts from the chart review process. The most influential token was the word "progression," which was influenced by the presence of other tokens and its position within an EHR.
Conclusions: The deep learning language models could help identify breast and colorectal cancer progression in EHRs and remove most charts from the chart review process. A limited number of tokens may influence model predictions. Improvements in model performance could be obtained by increasing the training dataset size and analyzing EHRs at the sentence level rather than at the EHR level.
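As a brief aside on the scaled Brier score used above: it rescales the model's Brier score against a reference model that always predicts the observed event prevalence, so 1.0 is perfect and 0.0 is no better than prevalence alone. A minimal sketch, assuming NumPy, with illustrative data rather than study predictions:

    import numpy as np

    def scaled_brier(y_true: np.ndarray, y_prob: np.ndarray) -> float:
        """Scaled Brier score: 1 - BS/BS_ref, where the reference model always
        predicts the observed event prevalence."""
        bs = np.mean((y_prob - y_true) ** 2)
        prevalence = y_true.mean()
        bs_ref = np.mean((prevalence - y_true) ** 2)
        return 1.0 - bs / bs_ref

    # Illustrative predicted probabilities of progression for 8 charts
    y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
    y_prob = np.array([0.9, 0.2, 0.8, 0.7, 0.3, 0.1, 0.6, 0.4])
    print(round(scaled_brier(y_true, y_prob), 2))
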
Robust Cancer Crowdfunding Predictions: Leveraging Large Language Models and Machine Learning for Success Analysis
Runa Bhaumik, Abhishikta Roy, Vineet Srivastava, Lokesh Boggavarapu, Ranganathan Chandrasekaran, Edward K Mensah, John Galvin
Background: Recent advances in large language models (LLMs), such as GPT-4o, offer a transformative opportunity to extract nuanced linguistic, emotional, and social features from campaign texts at scale. These models enable a deeper understanding of the factors influencing campaign success, far beyond what structured data alone can reveal. Given these advancements, there is a pressing need for an integrated modeling framework that leverages both LLM-derived features and machine learning algorithms to more accurately predict and explain success in medical crowdfunding.
Objective: This study addresses that gap by leveraging cutting-edge machine learning techniques alongside state-of-the-art large language models such as GPT-4o to automatically generate and extract nuanced linguistic, social, and clinical features from campaign narratives. By combining these features with ensemble learning approaches, the proposed methodology offers a novel and more comprehensive strategy for understanding and predicting crowdfunding success in the medical domain.
Methods: We used GPT-4o to extract linguistic and social determinants of health (SDOH) features from cancer crowdfunding campaign narratives. A Random Forest model with permutation importance was applied to rank features based on their contribution to predicting campaign success. Four machine learning algorithms (Random Forest, Gradient Boosting, Logistic Regression, and Elastic Net) were evaluated using stratified 10-fold cross-validation, with performance measured by accuracy, sensitivity, and specificity.
Results: Gradient Boosting consistently outperformed the other algorithms in terms of sensitivity (0.786-0.798 across folds), indicating a superior ability to identify successful crowdfunding campaigns from linguistic and social determinants of health features. Permutation importance scores revealed that severe medical conditions, income loss, chemotherapy treatment, clear and effective communication, cognitive understanding, family involvement, and empathetic and social behaviors played an important role in campaign success.
Conclusions: This study demonstrates that large language models like GPT-4o can effectively extract nuanced linguistic and social features from crowdfunding narratives, offering deeper insights than traditional methods. These features, when combined with machine learning, significantly improve the identification of key predictors of campaign success, such as medical severity, financial hardship, and empathetic communication. Our findings underscore the potential of LLMs to enhance predictive modeling in health-related crowdfunding and support more targeted policy and communication strategies to reduce financial vulnerability among cancer patients.
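The evaluation protocol described in the Methods (stratified 10-fold cross-validation plus permutation importance on a Random Forest) can be sketched with scikit-learn as follows; the synthetic features stand in for the LLM-extracted linguistic and SDOH features and are illustrative only.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # Synthetic stand-in for LLM-extracted linguistic/SDOH features and a campaign-success label
    X, y = make_classification(n_samples=600, n_features=12, n_informative=6, random_state=0)

    # Stratified 10-fold cross-validation, reporting sensitivity (recall of the positive class)
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for name, model in [("Random Forest", RandomForestClassifier(random_state=0)),
                        ("Gradient Boosting", GradientBoostingClassifier(random_state=0))]:
        scores = cross_val_score(model, X, y, cv=cv, scoring="recall")
        print(f"{name}: sensitivity {scores.mean():.3f} (SD {scores.std():.3f})")

    # Permutation importance ranks features by how much shuffling each one degrades predictions
    rf = RandomForestClassifier(random_state=0).fit(X, y)
    result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
    print("Top features by importance:", np.argsort(result.importances_mean)[::-1][:5])
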
{"title":"Robust Cancer Crowdfunding Predictions: Leveraging Large Language Models and Machine Learning for Success Analysis.","authors":"Runa Bhaumik, Abhishikta Roy, Vineet Srivastava, Lokesh Boggavarapu, Ranganathan Chandrasekaran, Edward K Mensah, John Galvin","doi":"10.2196/73448","DOIUrl":"10.2196/73448","url":null,"abstract":"<p><strong>Background: </strong>Recent advances in large language models (LLMs), such as GPT-4o, offer a transformative opportunity to extract nuanced linguistic, emotional, and social features from campaign texts at scale. These models enable a deeper understanding of the factors influencing campaign success-far beyond what structured data alone can reveal. Given these advancements, there is a pressing need for an integrated modeling framework that leverages both LLM-derived features and machine learning algorithms to more accurately predict and explain success in medical crowdfunding.</p><p><strong>Objective: </strong>This study addresses that gap by leveraging cutting-edge machine learning techniques alongside state-of-the-art large language models such as GPT-4o to automatically generate and extract nuanced linguistic, social, and clinical features from campaign narratives. By combining these features with ensemble learning approaches, the proposed methodology offers a novel and more comprehensive strategy for understanding and predicting crowdfunding success in the medical domain.</p><p><strong>Methods: </strong>We used GPT-4o to extract linguistic and social determinants of health (SDOH) features from cancer crowdfunding campaign narratives. A Random Forest model with permutation importance was applied to rank features based on their contribution to predicting campaign success. Four machine learning algorithms-Random Forest, Gradient Boosting, Logistic Regression, and Elastic Net-were evaluated using stratified 10-fold cross-validation, with performance measured by accuracy, sensitivity, and specificity.</p><p><strong>Results: </strong>Gradient Boosting consistently outperforms the other algorithms in terms of sensitivity (consistently around 0.786 to 0.798), indicating its superior ability to identify successful crowdfunding campaigns using linguistic and social determinants of health features. The permutation importance score reveals that for severe medical conditions, income loss, chemotherapy treatment, clear and effective communication, cognitive understanding, family involvement, empathy and social behaviors play an important role in the success of campaigns.</p><p><strong>Conclusions: </strong>This study demonstrates that large language models like GPT-4o can effectively extract nuanced linguistic and social features from crowdfunding narratives, offering deeper insights than traditional methods. These features, when combined with machine learning, significantly improve the identification of key predictors of campaign success, such as medical severity, financial hardship, and empathetic communication. 
Our findings underscore the potential of LLMs to enhance predictive modeling in health-related crowdfunding and support more targeted policy and communication strategies to reduce financial vulnerability among cancer patients.</p><p><strong>Clinicaltrial: </strong></p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":" ","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145287861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Real-World Evidence Synthesis of Digital Scribes Using Ambient Listening and Generative Artificial Intelligence for Clinician Documentation Workflows: Rapid Review
Naga Sasidhar Kanaparthy, Yenny Villuendas-Rey, Tolulope Bakare, Zihan Diao, Mark Iscoe, Andrew Loza, Donald Wright, Conrad Safranek, Isaac V Faustino, Alexandria Brackett, Edward R Melnick, R Andrew Taylor
Background: As physicians spend up to twice as much time on electronic health record tasks as on direct patient care, digital scribes have emerged as a promising solution to restore patient-clinician communication and reduce documentation burden, making it essential to study their real-world impact on clinical workflows, efficiency, and satisfaction.
Objective: This study aimed to synthesize evidence on clinician efficiency, user satisfaction, quality, and practical barriers associated with the use of digital scribes using ambient listening and generative artificial intelligence (AI) in real-world clinical settings.
Methods: A rapid review was conducted to evaluate the real-world evidence of digital scribes using ambient listening and generative AI in clinical practice from 2014 to 2024. Data were collected from Ovid MEDLINE, Embase, Web of Science-Core Collection, Cochrane CENTRAL and Reviews, and PubMed Central. Predefined eligibility criteria focused on studies addressing clinical implementation, excluding those centered solely on technical development or model validation. The findings of each study were synthesized and analyzed through the QUEST human evaluation framework for quality and safety and the Systems Engineering Initiative for Patient Safety (SEIPS) 3.0 model to assess integration into clinicians' workflows and experience.
Results: Of the 1450 studies identified, 6 met the inclusion criteria. These studies included an observational study, a case report, a peer-matched cohort study, and survey-based assessments conducted across academic health systems, community settings, and outpatient practices. The major themes noted were as follows: (1) digital scribes decreased self-reported documentation times, with an associated increase in note length; (2) physician burnout measured using standardized scales was unaffected, but physician engagement improved; (3) physician productivity, assessed via billing metrics, was unchanged; and (4) the studies fell short when compared against standardized evaluation frameworks.
Conclusions: Digital scribes show promise in reducing documentation burden and enhancing clinician satisfaction, thereby supporting workflow efficiency. However, the currently available evidence is sparse. Future real-world, multifaceted studies are needed before AI scribes can be recommended unequivocally.
Use of Automated Machine Learning to Detect Undiagnosed Diabetes in US Adults: Development and Validation Study
Jianxiu Liu, Fred Ssewamala, Ruopeng An, Mengmeng Ji
Background: Early diagnosis of diabetes is essential for early interventions to slow the progression of dysglycemia and its comorbidities. However, among individuals with diabetes, about 23% were unaware of their condition.
Objective: This study aims to investigate the potential use of automated machine learning (AutoML) models and self-reported data in detecting undiagnosed diabetes among US adults.
Methods: Individual-level data, including biochemical tests for diabetes, demographic characteristics, family history of diabetes, anthropometric measures, dietary intakes, health behaviors, and chronic conditions, were retrieved from the National Health and Nutrition Examination Survey, 1999-2020. Undiagnosed diabetes was defined as having no prior self-reported diagnosis but meeting diagnostic criteria for elevated hemoglobin A1c, fasting plasma glucose, or 2-hour plasma glucose. The H2O AutoML framework, which allows for automated hyperparameter tuning, model selection, and ensemble learning, was used to automate the machine learning workflow. For comparative analysis, 4 traditional machine learning models (logistic regression, support vector machines, random forest, and extreme gradient boosting) were implemented. Model performance was evaluated using the area under the receiver operating characteristic curve.
Results: The study included 11,815 participants aged 20 years and older, comprising 2256 patients with undiagnosed diabetes and 9559 without diabetes. The average age was 59.76 (SD 15.0) years for participants with undiagnosed diabetes and 46.78 (SD 17.2) years for those without diabetes. The AutoML model demonstrated superior performance compared with the 4 traditional machine learning models. The trained AutoML model achieved an area under the receiver operating characteristic curve of 0.909 (95% CI 0.897-0.921) in the test set. The model demonstrated a sensitivity of 70.26%, specificity of 90.46%, positive predictive value of 64.10%, and negative predictive value of 92.61% for identifying undiagnosed diabetes from nondiabetes.
Conclusions: To our knowledge, this study is the first to utilize the AutoML model for detecting undiagnosed diabetes in US adults. The model's strong performance and applicability to the broader US population make it a promising tool for large-scale diabetes screening efforts.
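A minimal sketch of the H2O AutoML workflow described in the Methods; the file name, column names, and split are hypothetical placeholders, not the NHANES variables or the study's configuration.

    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()

    # Illustrative: a frame of self-reported predictors and a binary undiagnosed-diabetes label
    frame = h2o.import_file("nhanes_selfreport.csv")        # hypothetical file name
    frame["undiagnosed_diabetes"] = frame["undiagnosed_diabetes"].asfactor()
    train, test = frame.split_frame(ratios=[0.8], seed=42)

    predictors = [c for c in frame.columns if c != "undiagnosed_diabetes"]
    aml = H2OAutoML(max_models=20, seed=42, sort_metric="AUC")
    aml.train(x=predictors, y="undiagnosed_diabetes", training_frame=train)

    print(aml.leaderboard.head())                      # candidate models ranked by cross-validated AUC
    print(aml.leader.model_performance(test).auc())    # AUC of the best model on the held-out test set
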
Reinforcement Learning to Prevent Acute Care Events Among Medicaid Populations: Mixed Methods Study
Sanjay Basu, Bhairavi Muralidharan, Parth Sheth, Dan Wanek, John Morgan, Sadiq Patel
Background: Multidisciplinary care management teams must rapidly prioritize interventions for patients with complex medical and social needs. Current approaches rely on individual training, judgment, and experience, missing opportunities to learn from longitudinal trajectories and prevent adverse outcomes through recommender systems.
Objective: This study aims to evaluate whether a reinforcement learning approach could outperform standard care management practices in recommending optimal interventions for patients with complex needs.
Methods: Using data from 3175 Medicaid beneficiaries in care management programs across 2 states from 2023 to 2024, we compared alternative approaches for recommending "next best step" interventions: the standard experience-based approach (status quo) and a state-action-reward-state-action (SARSA) reinforcement learning model. We evaluated performance using clinical impact metrics, conducted counterfactual causal inference analyses to estimate reductions in acute care events, assessed fairness across demographic subgroups, and performed qualitative chart reviews where the models differed.
Results: In counterfactual analyses, SARSA-guided care management reduced acute care events by 12 percentage points (95% CI 2.2-21.8 percentage points, a 20.7% relative reduction; P=.02) compared to the status quo approach, with a number needed to treat of 8.3 (95% CI 4.6-45.2) to prevent 1 acute event. The approach showed improved fairness across demographic groups, including gender (3.8% vs 5.3% disparity in acute event rates, reduction 1.5%, 95% CI 0.3%-2.7%) and race and ethnicity (5.6% vs 8.9% disparity, reduction 3.3%, 95% CI 1.1%-5.5%). In qualitative reviews, the SARSA model detected and recommended interventions for specific medical-social interactions, such as respiratory issues associated with poor housing quality or food insecurity in individuals with diabetes.
Conclusions: SARSA-guided care management shows potential to reduce acute care use compared to standard practice. The approach demonstrates how reinforcement learning can improve complex decision-making in situations where patients face concurrent clinical and social factors while maintaining safety and fairness across demographic subgroups.
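The state-action-reward-state-action (SARSA) model named above rests on the on-policy update Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]. Below is a minimal tabular sketch assuming NumPy; the states, actions, and reward are illustrative placeholders, not the study's patient features or intervention set.

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions = 10, 4              # illustrative: coarse patient states x candidate interventions
    alpha, gamma, epsilon = 0.1, 0.95, 0.1   # learning rate, discount factor, exploration rate
    Q = np.zeros((n_states, n_actions))

    def choose_action(state: int) -> int:
        # Epsilon-greedy policy over the current Q estimates
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[state]))

    def sarsa_update(s: int, a: int, r: float, s_next: int, a_next: int) -> None:
        # On-policy update: Q(s,a) += alpha * [r + gamma * Q(s',a') - Q(s,a)]
        Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

    # One illustrative transition: a negative reward encodes an acute care event
    # (eg, an ED visit) following the chosen intervention.
    s, a = 3, choose_action(3)
    r, s_next = -1.0, 5
    a_next = choose_action(s_next)
    sarsa_update(s, a, r, s_next, a_next)
    print(Q[3])
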
Joshua Simmich, Megan Heather Ross, Trevor Glen Russell
Background: Australians can face significant challenges in navigating the health care system, especially in rural and regional areas. Generative search tools, powered by large language models (LLMs), show promise in improving health information retrieval by generating direct answers. However, concerns remain regarding their accuracy and reliability when compared to traditional search engines in a health care context.
Objective: This study aimed to compare the effectiveness of a generative artificial intelligence (AI) search (ie, Microsoft Copilot) versus a conventional search engine (Google Web Search) for navigating health care information.
Methods: A total of 97 adults in Queensland, Australia, participated in a web-based survey, answering scenario-based health care navigation questions using either Microsoft Copilot or Google Web Search. Accuracy was assessed using binary correct or incorrect ratings, graded correctness (incorrect, partially correct, or correct), and numerical scores (0-2 for service identification and 0-6 for criteria). Participants also completed a Technology Rating Questionnaire (TRQ) to evaluate their experience with their assigned tool.
Results: Participants assigned to Microsoft Copilot outperformed the Google Web Search group on 2 health care navigation tasks (identifying aged care application services and listing mobility allowance eligibility criteria), with no clear evidence of a difference in the remaining 6 tasks. On the TRQ, participants rated Google Web Search higher in willingness to adopt and perceived impact on quality of life, and lower in effort needed to learn. Both tools received similar ratings in perceived value, confidence, help required to use, and concerns about privacy.
Conclusions: Generative AI tools can achieve comparable accuracy to traditional search engines for health care navigation tasks, though this did not translate into an improved user experience. Further evaluation is needed as AI technology improves and users become more familiar with its use.
{"title":"Assessing the Capability of Large Language Models for Navigation of the Australian Health Care System: Comparative Study.","authors":"Joshua Simmich, Megan Heather Ross, Trevor Glen Russell","doi":"10.2196/76203","DOIUrl":"10.2196/76203","url":null,"abstract":"<p><strong>Background: </strong>Australians can face significant challenges in navigating the health care system, especially in rural and regional areas. Generative search tools, powered by large language models (LLMs), show promise in improving health information retrieval by generating direct answers. However, concerns remain regarding their accuracy and reliability when compared to traditional search engines in a health care context.</p><p><strong>Objective: </strong>This study aimed to compare the effectiveness of a generative artificial intelligence (AI) search (ie, Microsoft Copilot) versus a conventional search engine (Google Web Search) for navigating health care information.</p><p><strong>Methods: </strong>A total of 97 adults in Queensland, Australia, participated in a web-based survey, answering scenario-based health care navigation questions using either Microsoft Copilot or Google Web Search. Accuracy was assessed using binary correct or incorrect ratings, graded correctness (incorrect, partially correct, or correct), and numerical scores (0-2 for service identification and 0-6 for criteria). Participants also completed a Technology Rating Questionnaire (TRQ) to evaluate their experience with their assigned tool.</p><p><strong>Results: </strong>Participants assigned to Microsoft Copilot outperformed the Google Web Search group on 2 health care navigation tasks (identifying aged care application services and listing mobility allowance eligibility criteria), with no clear evidence of a difference in the remaining 6 tasks. On the TRQ, participants rated Google Web Search higher in willingness to adopt and perceived impact on quality of life, and lower in effort needed to learn. Both tools received similar ratings in perceived value, confidence, help required to use, and concerns about privacy.</p><p><strong>Conclusions: </strong>Generative AI tools can achieve comparable accuracy to traditional search engines for health care navigation tasks, though this did not translate into an improved user experience. Further evaluation is needed as AI technology improves and users become more familiar with its use.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e76203"},"PeriodicalIF":2.0,"publicationDate":"2025-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12508777/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145253999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}