Background: General anesthesia comprises 3 essential components: hypnosis, analgesia, and immobility. Among these, maintaining an appropriate hypnotic state, or anesthetic depth, is crucial for patient safety. Excessively deep anesthesia may lead to hemodynamic instability and postoperative cognitive dysfunction, whereas inadequate anesthesia increases the risk of intraoperative awareness. Electroencephalography (EEG)-based monitoring has therefore become a cornerstone for evaluating anesthetic depth. However, processed electroencephalography (pEEG) indices remain vulnerable to various sources of interference, including electromyographic activity, interindividual variability, and anesthetic drug effects, which can yield inaccurate numerical outputs.
Objective: With recent advances in machine learning, particularly unsupervised learning, data-driven methods that classify signals according to inherent patterns offer new possibilities for anesthetic depth analysis. This study aimed to establish a methodology for automatically identifying anesthesia depth using an unsupervised, machine learning-based clustering approach applied to pEEG data.
Methods: Standard frontal EEG data from participants undergoing elective lumbar spine surgery were retrospectively analyzed, yielding more than 16,000 data points. The signals were filtered with a fourth-order Butterworth bandpass filter and transformed using the fast Fourier transform to estimate power spectral density. Normalized band power ratios for delta, high-theta, alpha, and beta frequencies were extracted as input features. Fuzzy C-Means (FCM) clustering (c=3, m=2) was applied to categorize anesthetic depth into slight, proper, and deep clusters.
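The pipeline in the Methods can be sketched compactly. The fragment below is a minimal illustration, assuming a 128 Hz sampling rate, conventional band edges, Welch's FFT-based power spectral density estimate, and a hand-rolled FCM in place of whatever implementation the authors used; only c=3 and m=2 are taken from the study.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, welch

FS = 128  # Hz; assumed sampling rate, not reported in the abstract

def band_power_features(epoch, fs=FS):
    """Bandpass-filter one EEG epoch and return normalized band power ratios."""
    sos = butter(4, [0.5, 30.0], btype="bandpass", fs=fs, output="sos")
    filtered = sosfiltfilt(sos, epoch)
    freqs, psd = welch(filtered, fs=fs, nperseg=2 * fs)  # FFT-based PSD estimate
    bands = {"delta": (0.5, 4), "high_theta": (6, 8), "alpha": (8, 12), "beta": (13, 30)}
    in_range = (freqs >= 0.5) & (freqs <= 30)
    total = np.trapz(psd[in_range], freqs[in_range])
    return np.array([np.trapz(psd[(freqs >= lo) & (freqs <= hi)],
                              freqs[(freqs >= lo) & (freqs <= hi)]) / total
                     for lo, hi in bands.values()])

def fuzzy_c_means(X, c=3, m=2.0, n_iter=100, seed=0):
    """Minimal FCM: returns cluster centers and the (n x c) membership matrix."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))  # memberships; each row sums to 1
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return centers, U
```

Taking the argmax of each membership row assigns an epoch to the slight, proper, or deep cluster, while the membership values themselves quantify the transitional states described in the Results.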
Results: FCM clustering successfully identified 3 physiologically interpretable clusters consistent with EEG dynamics during progressive anesthesia. As anesthesia deepened, frontal alpha oscillations became more prominent within a delta-dominant background, while beta activity decreased with loss of consciousness. The fuzzy membership values quantified transitional states and captured the continuum of anesthetic depth. Visualization confirmed strong correspondence among cluster transitions, Patient State Index trends, and spectral density patterns.
Conclusions: This study demonstrates the feasibility of using unsupervised machine learning to enhance anesthetic depth assessment. By applying FCM clustering to pEEG data, this approach improves the understanding of anesthesia depth and integrates effectively with existing monitoring modalities. The proposed FCM-based method complements current EEG indices and may assist anesthesia practitioners and even nonanesthesia professionals in assessing anesthetic depth to enhance patient safety.
{"title":"Enhancing Anesthetic Depth Assessment via Unsupervised Machine Learning in Processed Electroencephalography Analysis: Novel Methodological Study.","authors":"Po-Yu Huang, Wei-Lun Hong, Hui-Zen Hee, Wen-Kuei Chang, Ching-Hung Lee, Chien-Kun Ting","doi":"10.2196/77830","DOIUrl":"10.2196/77830","url":null,"abstract":"<p><strong>Background: </strong>General anesthesia comprises 3 essential components-hypnosis, analgesia, and immobility. Among these, maintaining an appropriate hypnotic state, or anesthetic depth, is crucial for patient safety. Excessively deep anesthesia may lead to hemodynamic instability and postoperative cognitive dysfunction, whereas inadequate anesthesia increases the risk of intraoperative awareness. Electroencephalography (EEG)-based monitoring has therefore become a cornerstone for evaluating anesthetic depth. However, processed electroencephalography (pEEG) indices remain vulnerable to various sources of interference, including electromyographic activity, interindividual variability, and anesthetic drug effects, which can yield inaccurate numerical outputs.</p><p><strong>Objective: </strong>With recent advances in machine learning, particularly unsupervised learning, data-driven methods that classify signals according to inherent patterns offer new possibilities for anesthetic depth analysis. This study aimed to establish a methodology for automatically identifying anesthesia depth using an unsupervised, machine learning-based clustering approach applied to pEEG data.</p><p><strong>Methods: </strong>Standard frontal EEG data from participants undergoing elective lumbar spine surgery were retrospectively analyzed, yielding more than 16,000 data points. The signals were filtered with a fourth-order Butterworth bandpass filter and transformed using the fast Fourier transform to estimate power spectral density. Normalized band power ratios for delta, high-theta, alpha, and beta frequencies were extracted as input features. Fuzzy C-Means (FCM) clustering (c=3, m=2) was applied to categorize anesthetic depth into slight, proper, and deep clusters.</p><p><strong>Results: </strong>FCM clustering successfully identified 3 physiologically interpretable clusters consistent with EEG dynamics during progressive anesthesia. As anesthesia deepened, frontal alpha oscillations became more prominent within a delta-dominant background, while beta activity decreased with loss of consciousness. The fuzzy membership values quantified transitional states and captured the continuum of anesthetic depth. Visualization confirmed strong correspondence among cluster transitions, Patient State Index trends, and spectral density patterns.</p><p><strong>Conclusions: </strong>This study demonstrates the feasibility of using unsupervised machine learning to enhance anesthetic depth assessment. By applying FCM clustering to pEEG data, this approach improves the understanding of anesthesia depth and integrates effectively with existing monitoring modalities. 
The proposed FCM-based method complements current EEG indices and may assist anesthesia practitioners and even nonanesthesia professionals in assessing anesthetic depth to enhance patient safety.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"14 ","pages":"e77830"},"PeriodicalIF":3.8,"publicationDate":"2026-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12880611/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146133601","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unlabelled: Artificial intelligence (AI) scribes, software that can convert speech into concise clinical documents, have achieved remarkable clinical adoption at a pace rarely seen for digital technologies in health care. The reasons for this are understandable: the technology works well enough, it addresses a genuine pain point for clinicians, and it has largely sidestepped regulatory requirements. In many ways, clinical adoption of AI scribes has also occurred well ahead of robust evidence of their safety and efficacy. The papers in this theme issue demonstrate real progress in the technology and evidence of its benefit: documentation times are reported to decrease when using scribes, clinicians report feeling less burdened, and the notes produced are often of reasonable quality. Yet as we survey the emerging evidence base, one urgent question remains unanswered: Are AI scribes safe? We need to know the clinical outcomes achievable when scribes are used compared with other forms of note taking.
{"title":"AI Scribes: Are We Measuring What Matters?","authors":"Enrico Coiera, David Fraile-Navarro","doi":"10.2196/89337","DOIUrl":"10.2196/89337","url":null,"abstract":"<p><strong>Unlabelled: </strong>Artificial intelligence (AI) scribes, software that can convert speech into concise clinical documents, have achieved remarkable clinical adoption at a pace rarely seen for digital technologies in health care. The reasons for this are understandable: the technology works well enough, it addresses a genuine pain point for clinicians, and it has largely sidestepped regulatory requirements. In many ways, clinical adoption of AI scribes has also occurred well ahead of robust evidence of their safety and efficacy. The papers in this theme issue demonstrate real progress in the technology and evidence of its benefit: documentation times are reported to decrease when using scribes, clinicians report feeling less burdened, and the notes produced are often of reasonable quality. Yet as we survey the emerging evidence base, there remains one outstanding and urgent unanswered question: Are AI scribes safe? We need to know the clinical outcomes achievable when scribes are used compared to other forms of note taking.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"14 ","pages":"e89337"},"PeriodicalIF":3.8,"publicationDate":"2026-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12880588/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146133625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kyumin Park, Myung Jae Baik, YeongJun Hwang, Yen Shin, HoJae Lee, Ruda Lee, Sang Min Lee, Je Young Hannah Sun, Ah Rah Lee, Si Yeun Yoon, Dong-Ho Lee, Jihyung Moon, JinYeong Bak, Kyunghyun Cho, Jong-Woo Paik, Sungjoon Park
Background: Harmful suicide content on the internet poses significant risks, as it can induce suicidal thoughts and behaviors, particularly among vulnerable populations. Despite global efforts, existing moderation approaches remain insufficient, especially in high-risk regions such as South Korea, which has the highest suicide rate among Organisation for Economic Co-operation and Development countries. Previous research has primarily focused on assessing the suicide risk of the authors who wrote the content rather than the harmfulness of the content itself, which can lead readers toward self-harm or suicide, highlighting a critical gap in current approaches. Our study addresses this gap by shifting the focus from assessing the suicide risk of content authors to evaluating the harmfulness of the content itself and its potential to induce suicide risk among readers.
Objective: This study aimed to develop an artificial intelligence (AI)-driven system for classifying online suicide-related content into 5 levels: illegal, harmful, potentially harmful, harmless, and non-suicide-related. In addition, we constructed a multimodal benchmark dataset with expert annotations to improve content moderation and assist AI models in detecting and regulating harmful content more effectively.
Methods: We collected 43,244 user-generated posts from various online sources, including social media, question and answer (Q&A) platforms, and online communities. To reduce the workload on human annotators, GPT-4 was used for preannotation, filtering, and categorization of content before manual review by medical professionals. A task description document ensured consistency in classification. Ultimately, a benchmark dataset of 452 manually labeled entries was developed, including both Korean and English versions, to support AI-based moderation. The study also evaluated zero-shot and few-shot learning to determine the best AI approach for detecting harmful content.
Results: On the multimodal benchmark dataset, GPT-4 achieved the highest F1-scores (66.46 for illegal and 77.09 for harmful content detection). Image descriptions improved classification accuracy, while directly using raw images slightly decreased performance. Few-shot learning significantly enhanced detection, demonstrating that small but high-quality datasets can improve AI-driven moderation. However, translation challenges were observed, particularly for suicide-related slang and abbreviations, which were sometimes inaccurately conveyed in the English benchmark.
Conclusions: This study provides a high-quality benchmark for AI-based suicide content detection, showing that large language models can effectively assist in content moderation while reducing the burden on human moderators. Future work will focus on enhancing real-time detection and improving the handling of subtle or disguised content.
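As a concrete illustration of the preannotation step, the sketch below sends one post to GPT-4 and asks for a single label before human review. It assumes the current OpenAI Python client; the prompt wording and fallback behavior are illustrative, and only the 5 categories are taken from the study.

```python
from openai import OpenAI

LABELS = ["illegal", "harmful", "potentially harmful", "harmless", "non-suicide-related"]

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def preannotate(post_text: str) -> str:
    """Ask the model for one harmfulness label; experts review the output later."""
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Classify how harmful this suicide-related content is to readers. "
                        "Answer with exactly one of: " + ", ".join(LABELS) + "."},
            {"role": "user", "content": post_text},
        ],
    )
    label = resp.choices[0].message.content.strip().lower()
    # Anything outside the label set is routed back to human annotators.
    return label if label in LABELS else "needs-human-review"
```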
{"title":"Iterative Large Language Model-Guided Sampling and Expert-Annotated Benchmark Corpus for Harmful Suicide Content Detection: Development and Validation Study.","authors":"Kyumin Park, Myung Jae Baik, YeongJun Hwang, Yen Shin, HoJae Lee, Ruda Lee, Sang Min Lee, Je Young Hannah Sun, Ah Rah Lee, Si Yeun Yoon, Dong-Ho Lee, Jihyung Moon, JinYeong Bak, Kyunghyun Cho, Jong-Woo Paik, Sungjoon Park","doi":"10.2196/73725","DOIUrl":"10.2196/73725","url":null,"abstract":"<p><strong>Background: </strong>Harmful suicide content on the internet poses significant risks, as it can induce suicidal thoughts and behaviors, particularly among vulnerable populations. Despite global efforts, existing moderation approaches remain insufficient, especially in high-risk regions such as South Korea, which has the highest suicide rate among Organisation for Economic Co-operation and Development countries. Previous research has primarily focused on assessing the suicide risk of the authors who wrote the content rather than the harmfulness of content itself which potentially leads the readers to self-harm or suicide, highlighting a critical gap in current approaches. Our study addresses this gap by shifting the focus from assessing the suicide risk of content authors to evaluating the harmfulness of the content itself and its potential to induce suicide risk among readers.</p><p><strong>Objective: </strong>This study aimed to develop an artificial intelligence (AI)-driven system for classifying online suicide-related content into 5 levels: illegal, harmful, potentially harmful, harmless, and non-suicide-related. In addition, the researchers construct a multimodal benchmark dataset with expert annotations to improve content moderation and assist AI models in detecting and regulating harmful content more effectively.</p><p><strong>Methods: </strong>We collected 43,244 user-generated posts from various online sources, including social media, question and answer (Q&A) platforms, and online communities. To reduce the workload on human annotators, GPT-4 was used for preannotation, filtering, and categorizing content before manual review by medical professionals. A task description document ensured consistency in classification. Ultimately, a benchmark dataset of 452 manually labeled entries was developed, including both Korean and English versions, to support AI-based moderation. The study also evaluated zero-shot and few-shot learning to determine the best AI approach for detecting harmful content.</p><p><strong>Results: </strong>The multimodal benchmark dataset showed that GPT-4 achieved the highest F1-scores (66.46 for illegal and 77.09 for harmful content detection). Image descriptions improved classification accuracy, while directly using raw images slightly decreased performance. Few-shot learning significantly enhanced detection, demonstrating that small but high-quality datasets could improve AI-driven moderation. However, translation challenges were observed, particularly in suicide-related slang and abbreviations, which were sometimes inaccurately conveyed in the English benchmark.</p><p><strong>Conclusions: </strong>This study provides a high-quality benchmark for AI-based suicide content detection, proving that large language models can effectively assist in content moderation while reducing the burden on human moderators. 
Future work will focus on enhancing real-time detection and improving the handling of subtle or disguise","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"14 ","pages":"e73725"},"PeriodicalIF":3.8,"publicationDate":"2026-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12875420/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146127573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Background: Data linkage in pharmacoepidemiological research is commonly employed to ascertain exposures and outcomes or to obtain additional information on confounding variables. However, to protect patient confidentiality, unique patient identifiers are not provided, which makes data linkage across multiple sources challenging. The Saudi Real-World Evidence Network (SRWEN) aggregates electronic health records from various hospitals, which may require robust linkage techniques.
Objective: We aimed to evaluate and compare the performance of deterministic, probabilistic, and machine learning (ML) approaches for linking deidentified data of patients with multiple sclerosis (MS) from the SRWEN and Ministry of National Guard Health Affairs electronic health record systems.
Methods: A simulation-based validation framework was applied before linking the real-world data sources. Deterministic linkage was based on predefined rules, whereas probabilistic linkage was based on similarity score-based matching. For ML, both similarity score-based and classification approaches were applied using neural networks, logistic regression, and random forest models. The performance of each approach was assessed using confusion matrices, focusing on sensitivity, positive predictive value, F1 score, and computational efficiency.
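The probabilistic arm can be illustrated with a small similarity score-based matcher. The sketch below uses Python's standard-library SequenceMatcher; the quasi-identifier fields, weights, and acceptance threshold are assumptions for illustration, not the study's configuration.

```python
from difflib import SequenceMatcher

# Illustrative quasi-identifiers and weights (hypothetical, not from the study).
WEIGHTS = {"birth_year": 0.3, "sex": 0.1, "city": 0.2, "first_visit_date": 0.4}

def field_similarity(a, b) -> float:
    """Normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, str(a), str(b)).ratio()

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted similarity score across the chosen fields."""
    return sum(w * field_similarity(rec_a[f], rec_b[f]) for f, w in WEIGHTS.items())

def link(records_a, records_b, threshold=0.85):
    """Greedy one-to-one linkage: accept each record's best pair above threshold."""
    links, used = [], set()
    for ra in records_a:
        score, idx = max(((match_score(ra, rb), i)
                          for i, rb in enumerate(records_b) if i not in used),
                         default=(0.0, None))
        if idx is not None and score >= threshold:
            links.append((ra, records_b[idx], score))
            used.add(idx)
    return links
```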
Results: The study included linked data of 2247 patients with MS from 2016 to 2023. The deterministic approach resulted in an average F1 score of 97.2% in the simulation and demonstrated varying match rates in real-world linkage: 1046/2247 (46.6%) to 1946/2247 (86.6%). This linkage was computationally efficient, with run times of <1 second per rule. The probabilistic approach provided an average F1 score of 93.9% in the simulation, with real-world match rates ranging from 1472/2247 (65.5%) to 2144/2247 (95.4%) and processing times ranging from approximately 0.1 to 5 seconds per rule. ML approaches achieved high performance (F1 score reached 99.8%) but were computationally expensive. Processing times ranged from approximately 13 to 16,936 seconds for the classification-based approaches and from approximately 13 to 7467 seconds for the similarity score-based approaches. Real-world match rates from ML models were highly variable depending on the method used; the similarity score-based approach identified 789/2247 (35.1%) matched pairs, whereas the classification-based approach identified 2014/2247 (89.6%).
Conclusions: Probabilistic linkage offers high linkage capacity by recovering matches missed by deterministic methods and proved to be both flexible and efficient, particularly in real-world scenarios where unique identifiers are lacking. This method achieved a favorable balance between recall and precision, enabling better integration of various data sources to support MS research.
{"title":"Linking Electronic Health Records for Multiple Sclerosis Research: Comparative Study of Deterministic, Probabilistic, and Machine Learning Linkage Methods.","authors":"Ohoud Almadani, Yasser Albogami, Adel Alrwisan","doi":"10.2196/79869","DOIUrl":"10.2196/79869","url":null,"abstract":"<p><strong>Background: </strong>Data linkage in pharmacoepidemiological research is commonly employed to ascertain exposures and outcomes or to obtain additional information on confounding variables. However, to protect patient confidentiality, unique patient identifiers are not provided, which makes data linkage across multiple sources challenging. The Saudi Real-World Evidence Network (SRWEN) aggregates electronic health records from various hospitals, which may require robust linkage techniques.</p><p><strong>Objective: </strong>We aimed to evaluate and compare the performance of deterministic, probabilistic, and machine learning (ML) approaches for linking deidentified data of patients with multiple sclerosis (MS) from the SRWEN and Ministry of National Guard Health Affairs electronic health record systems.</p><p><strong>Methods: </strong>A simulation-based validation framework was applied before linking real-world data sources. Deterministic linkage was based on predefined rules, whereas probabilistic linkage was based on a similarity score-based matching. For ML, both similarity score-based and classification approaches were applied using neural networks, logistic regression, and random forest models. The performance of each approach was assessed using confusion matrices, focusing on sensitivity, positive predictive value, F1 score, and computational efficiency.</p><p><strong>Results: </strong>The study included linked data of 2247 patients with MS from 2016 to 2023. The deterministic approach resulted in an average F1 score of 97.2% in the simulation and demonstrated varying match rates in real-world linkage: 1046/2247 (46.6%) to 1946/2247 (86.6%). This linkage was computationally efficient, with run times of <1 second per rule. The probabilistic approach provided an average F1 score of 93.9% in the simulation, with real-world match rates ranging from 1472/2247 (65.5%) to 2144/2247 (95.4%) and processing times ranging from approximately 0.1 to 5 seconds per rule. ML approaches achieved high performance (F1 score reached 99.8%) but were computationally expensive. Processing times ranged from approximately 13 to 16,936 seconds for the classification-based approaches and from approximately 13 to 7467 seconds for the similarity score-based approaches. Real-world match rates from ML models were highly variable depending on the method used; the similarity score-based approach identified 789/2247 (35.1%) matched pairs, whereas the classification-based approach identified 2014/2247 (89.6%).</p><p><strong>Conclusions: </strong>Probabilistic linkage offers high linkage capacity by recovering matches missed by deterministic methods and proved to be both flexible and efficient, particularly in real-world scenarios where unique identifiers are lacking. 
This method achieved a great balance between recall and precision, enabling better integration of various data sources that could be useful in MS research.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"14 ","pages":"e79869"},"PeriodicalIF":3.8,"publicationDate":"2026-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12872214/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146121098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Donald Salami, Emily Koech, Janet M Turan, Kristen A Stafford, Lilly Muthoni Nyagah, Stephen Ohakanu, Anthony K Ngugi, Manhattan Charurat
Background: The Cox proportional hazards (CPH) model is a common choice for analyzing time to treatment interruption in patients on antiretroviral therapy (ART), valued for its straightforward interpretability and flexibility in handling time-dependent covariates. Machine learning (ML) models have increasingly been adapted for temporal data, with the added advantages of capturing complex, nonlinear relationships, scaling to large datasets, and providing clear practical interpretations.
Objective: This study aims to compare the predictive performance of the traditional CPH model and ML models in predicting treatment interruptions among patients on ART, while also providing both global and individual-level explanations to support personalized, data-driven interventions for improving treatment retention.
Methods: Using data from 621,115 patients who started ART in Kenya between 2017 and 2023, we compared the performance of the CPH model with the following ML models: gradient boosting machine, extreme gradient boosting, regularized generalized linear models (Ridge, Lasso, and Elastic-Net), and recursive partitioning, in predicting first and multiple treatment interruptions. A model-agnostic explainable surrogate technique was applied to interpret the best-performing model's predictions globally, using variable importance and partial dependence profiles, and at the individual level, using breakdown additive profiles, Shapley Additive Explanations, and ceteris paribus profiles.
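To make the comparison concrete, the sketch below computes the concordance index for a CPH model and for a generic ML learner on held-out data. It assumes lifelines and scikit-learn, column names of "time" and "event", and a plain gradient boosting regressor as a crude stand-in for the survival-aware learners evaluated in the study.

```python
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index
from sklearn.ensemble import GradientBoostingRegressor

def compare_cindex(train: pd.DataFrame, test: pd.DataFrame, covariates: list):
    # Cox proportional hazards: a higher partial hazard means higher interruption
    # risk, so negate it for the concordance index (higher score = longer time).
    cph = CoxPHFitter().fit(train[covariates + ["time", "event"]],
                            duration_col="time", event_col="event")
    c_cox = concordance_index(test["time"],
                              -cph.predict_partial_hazard(test),
                              test["event"])

    # Crude ML stand-in: regress observed time on covariates. This ignores
    # censoring; the study used survival-aware learners instead.
    gbm = GradientBoostingRegressor().fit(train[covariates], train["time"])
    c_gbm = concordance_index(test["time"], gbm.predict(test[covariates]),
                              test["event"])
    return c_cox, c_gbm
```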
Results: The recursive partitioning model achieved the best performance with a predictive concordance index score of 0.81 for first treatment interruptions and 0.89 for multiple interruptions, outperforming the CPH model, which scored 0.78 and 0.87 for the same scenarios, respectively. Recursive partitioning's performance can be attributed to its ability to model nonlinear relationships and automatically detect complex interactions. The global model-agnostic explanations aligned closely with the interpretations offered by hazard ratios in the CPH model, while offering additional insights into the impact of specific features on the model's predictions. The breakdown additive and Shapley Additive Explanations explainers demonstrated how different variables contribute to the predicted risk at the individual patient level. The ceteris paribus profiles further explored the time-varying model to illustrate how changes in a patient's covariates over time could impact their predicted risk of treatment interruption.
Conclusions: Our results highlight the superior predictive performance of ML models and their ability to provide patient-specific risk predictions and insights that can support targeted interventions to reduce treatment interruptions in ART care.
{"title":"Prediction of First and Multiple Antiretroviral Therapy Interruptions in People Living With HIV: Comparative Survival Analysis Using Cox and Explainable Machine Learning Models.","authors":"Donald Salami, Emily Koech, Janet M Turan, Kristen A Stafford, Lilly Muthoni Nyagah, Stephen Ohakanu, Anthony K Ngugi, Manhattan Charurat","doi":"10.2196/78964","DOIUrl":"10.2196/78964","url":null,"abstract":"<p><strong>Background: </strong>The Cox proportional hazards (CPH) model is a common choice for analyzing time-to-treatment interruptions in patients on antiretroviral therapy (ART), valued for its straightforward interpretability and flexibility in handling time-dependent covariates. Machine learning (ML) models have increasingly been adapted for handling temporal data, with added advantages of handling complex, nonlinear relationships and large datasets, and providing clear practical interpretations.</p><p><strong>Objective: </strong>This study aims to compare the predictive performance of the traditional CPH model and ML models in predicting treatment interruptions among patients on ART, while also providing both global and individual-level explanations to support personalized, data-driven interventions for improving treatment retention.</p><p><strong>Methods: </strong>Using data from 621,115 patients who started ART between 2017 and 2023, in Kenya, we compared the performance of the CPH with the following ML models-gradient boosting machine, extreme gradient boosting, regularized generalized linear models (Ridge, Lasso, and Elastic-Net), and recursive partitioning-in predicting first and multiple treatment interruptions. Explainable surrogate technique (model-agnostic) was applied to interpret the best performing model's predictions globally, using variable importance and partial dependence profiles, and at individual level, using breakdown additive, Shapley Additive Explanations, and ceteris paribus.</p><p><strong>Results: </strong>The recursive partitioning model achieved the best performance with a predictive concordance index score of 0.81 for first treatment interruptions and 0.89 for multiple interruptions, outperforming the CPH model, which scored 0.78 and 0.87 for the same scenarios, respectively. Recursive partitioning's performance can be attributed to its ability to model nonlinear relationships and automatically detect complex interactions. The global model-agnostic explanations aligned closely with the interpretations offered by hazard ratios in the CPH model, while offering additional insights into the impact of specific features on the model's predictions. The breakdown additive and Shapley Additive Explanations explainers demonstrated how different variables contribute to the predicted risk at the individual patient level. 
The ceteris paribus profiles further explored the time-varying model to illustrate how changes in a patient's covariates over time could impact their predicted risk of treatment interruption.</p><p><strong>Conclusions: </strong>Our results highlight the superior predictive performance of ML models and their ability to provide patient-specific risk predictions and insights that can support targeted interventions to reduce treatment interruptions in ART care.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"14 ","pages":"e78964"},"PeriodicalIF":3.8,"publicationDate":"2026-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12871577/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146121108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ho Heon Kim, Gisu Hwang, Won Chan Jeong, Young Sin Ko
Background: Multiple instance learning (MIL) is widely used for slide-level classification in digital pathology without requiring expert annotations. Even partial expert annotations offer valuable supervision, yet few studies have effectively leveraged this information within MIL frameworks.
Objective: This study aims to develop and evaluate a ranking-aware MIL framework, called rank induction, that effectively incorporates partial expert annotations to improve slide-level classification performance under realistic annotation constraints.
Methods: We developed rank induction, a MIL approach that incorporates expert annotations using a pairwise rank loss inspired by RankNet. The method encourages the model to assign higher attention scores to annotated regions than to unannotated ones, guiding it to focus on diagnostically relevant patches. We evaluated rank induction on 2 public datasets (Camelyon16 and DigestPath2019) and an in-house dataset (Seegene Medical Foundation-stomach; SMF-stomach) and tested its robustness under 3 real-world conditions: low-data regimes, coarse within-slide annotations, and sparse slide-level annotations.
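The core of rank induction, as described, is a RankNet-style pairwise loss over attention scores. The PyTorch sketch below is a minimal rendering under assumed tensor shapes; the mean reduction and the auxiliary-loss weighting are implementation guesses.

```python
import torch
import torch.nn.functional as F

def rank_induction_loss(attn: torch.Tensor, annotated: torch.Tensor) -> torch.Tensor:
    """
    attn:      (N,) attention scores for the N patches of one slide.
    annotated: (N,) boolean mask, True where an expert marked the patch.
    """
    pos = attn[annotated]     # scores that should rank higher
    neg = attn[~annotated]    # scores that should rank lower
    if pos.numel() == 0 or neg.numel() == 0:
        return attn.new_zeros(())
    diff = pos[:, None] - neg[None, :]   # all annotated/unannotated pairs
    # RankNet pairwise loss: -log sigmoid(diff) == softplus(-diff).
    return F.softplus(-diff).mean()

# Assumed usage: total = bag_loss + lambda_rank * rank_induction_loss(attn, mask)
```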
Results: Rank induction outperformed existing methodologies, achieving an area under the receiver operating characteristic curve (AUROC) of 0.839 on Camelyon16, 0.995 on DigestPath2019, and 0.875 on SMF-stomach. It remained robust under low-data conditions, maintaining an AUROC of 0.761 with only 60.2% (130/216) of the training data. When using coarse annotations (with 2240-pixel padding), performance slightly declined to 0.823. Remarkably, annotating just 20% (18/89) of the slides was enough to reach near-saturated performance (AUROC of 0.806, vs 0.839 with full annotations).
Conclusions: Incorporating expert annotations through ranking-based supervision improves MIL-based classification. Rank induction remains robust even with limited, coarse, or sparsely available annotations, demonstrating its practicality in real-world scenarios.
{"title":"Ranking-Aware Multiple Instance Learning for Histopathology Slide Classification: Development and Validation Study.","authors":"Ho Heon Kim, Gisu Hwang, Won Chan Jeong, Young Sin Ko","doi":"10.2196/84417","DOIUrl":"https://doi.org/10.2196/84417","url":null,"abstract":"<p><strong>Background: </strong>Multiple instance learning (MIL) is widely used for slide-level classification in digital pathology without requiring expert annotations. However, even partial expert annotations offer valuable supervision; few studies have effectively leveraged this information within MIL frameworks.</p><p><strong>Objective: </strong>This study aims to develop and evaluate a ranking-aware MIL framework, called rank induction, that effectively incorporates partial expert annotations to improve slide-level classification performance under realistic annotation constraints.</p><p><strong>Methods: </strong>We developed rank induction, a MIL approach that incorporates expert annotations using a pairwise rank loss inspired by RankNet. The method encourages the model to assign higher attention scores to annotated regions than to unannotated ones, guiding it to focus on diagnostically relevant patches. We evaluated rank induction on 2 public datasets (Camelyon16 and DigestPath2019) and an in-house dataset (Seegene Medical Foundation-stomach; SMF-stomach) and tested its robustness under 3 real-world conditions: low-data regimes, coarse within-slide annotations, and sparse slide-level annotations.</p><p><strong>Results: </strong>Rank induction outperformed existing methodologies, achieving an area under the receiver operating characteristic curve (AUROC) of 0.839 on Camelyon16, 0.995 on DigestPath2019, and 0.875 on SMF-stomach. It remained robust under low-data conditions, maintaining an AUROC of 0.761 with only 60.2% (130/216) of the training data. When using coarse annotations (with 2240-pixel padding), performance slightly declined to 0.823. Remarkably, annotating just 20% (18/89) of the slides was enough to reach near-saturated performance (AUROC of 0.806, vs 0.839 with full annotations).</p><p><strong>Conclusions: </strong>Incorporating expert annotations through ranking-based supervision improves MIL-based classification. Rank induction remains robust even with limited, coarse, or sparsely available annotations, demonstrating its practicality in real-world scenarios.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"14 ","pages":"e84417"},"PeriodicalIF":3.8,"publicationDate":"2026-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146121125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hong Jiang, Chun Yang, Wenbin Zhou, Cheng-Liang Yin, Shan Zhou, Rui He, Guanghui Ran, Wujie Wang, Meixian Wu, Juan Yu
Background: Artificial intelligence tools, particularly large language models (LLMs), have shown considerable potential across various domains. However, their performance in the diagnosis and treatment of breast cancer remains unknown.
Objective: This study aimed to evaluate the performance of LLMs in supporting radiologists within multidisciplinary breast cancer teams, with a focus on their roles in facilitating informed clinical decisions and enhancing patient care.
Methods: A set of 50 questions covering radiological and breast cancer guidelines was developed to assess performance in breast cancer diagnosis and treatment. These questions were posed to 9 popular LLMs and to clinical physicians, with the expectation of receiving direct "Yes" or "No" answers along with supporting analysis. The performances of the 9 models (ChatGPT-4.0, ChatGPT-4o, ChatGPT-4o mini, Claude 3 Opus, Claude 3.5 Sonnet, Gemini 1.5 Pro, Tongyi Qianwen 2.5, ChatGLM, and Ernie Bot 3.5) were evaluated against those of radiologists with varying experience levels (resident physicians, fellow physicians, and attending physicians). Responses were assessed for accuracy, confidence, and consistency based on alignment with the 2024 National Comprehensive Cancer Network Breast Cancer Guidelines and the 2013 American College of Radiology Breast Imaging-Reporting and Data System recommendations.
Results: Claude 3 Opus and ChatGPT-4 achieved the highest confidence scores of 2.78 and 2.74, respectively, while ChatGPT-4o led in accuracy with a score of 2.92. In response consistency, Claude 3 Opus and Claude 3.5 Sonnet led with scores of 3.0, closely followed by ChatGPT-4o, Gemini 1.5 Pro, and ChatGPT-4o mini, all scoring above 2.9. ChatGPT-4o mini excelled in clinical diagnostics with a top score of 3.0 among all LLMs, higher than all physician groups; however, no statistically significant differences were observed between it and any physician group (all P>.05). ChatGPT-4 also scored higher than the physician groups but showed statistically comparable performance (P>.05). Across radiological diagnostics, clinical diagnosis, and overall performance, ChatGPT-4o mini and the Claude models achieved higher mean scores than all physician groups, although these differences were statistically significant only in comparison with fellow physicians (P<.05). In contrast, ChatGLM and Ernie Bot 3.5 underperformed across diagnostic areas, with lower scores than all physician groups but no statistically significant differences (all P>.05). Among physician groups, attending physicians and resident physicians exhibited comparably high scores in radiological diagnostic performance, whereas fellow physicians scored somewhat lower, though the difference was not statistically significant (P>.05).
Conclusions: LLMs such as ChatGPT-4o and Claude 3 Opus showed potential to support radiologists in multidisciplinary breast cancer teams.
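The evaluation protocol implied by the Methods reduces to a scoring loop over the 50 questions. The sketch below tallies per-model agreement with a guideline-derived answer key; ask_model is a hypothetical callable standing in for each model's real interface, not a real vendor API.

```python
from typing import Callable, Dict, List

def evaluate_models(questions: List[str], answer_key: List[str],
                    models: Dict[str, Callable[[str], str]]) -> Dict[str, float]:
    """Fraction of questions each model answers in agreement with the key."""
    scores = {}
    for name, ask_model in models.items():
        correct = sum(ask_model(q).strip().lower().startswith(key.lower())
                      for q, key in zip(questions, answer_key))
        scores[name] = correct / len(questions)
    return scores

# Assumed usage: evaluate_models(questions, ["yes", "no", ...], {"gpt-4o": ask_gpt4o})
```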
{"title":"Evaluation of Large Language Models for Radiologists' Support in Multidisciplinary Breast Cancer Teams: Comparative Study.","authors":"Hong Jiang, Chun Yang, Wenbin Zhou, Cheng-Liang Yin, Shan Zhou, Rui He, Guanghui Ran, Wujie Wang, Meixian Wu, Juan Yu","doi":"10.2196/68182","DOIUrl":"https://doi.org/10.2196/68182","url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence tools, particularly large language models (LLMs), have shown considerable potential across various domains. However, their performance in the diagnosis and treatment of breast cancer remains unknown.</p><p><strong>Objective: </strong>This study aimed to evaluate the performance of LLMs in supporting radiologists within multidisciplinary breast cancer teams, with a focus on their roles in facilitating informed clinical decisions and enhancing patient care.</p><p><strong>Methods: </strong>A set of 50 questions covering radiological and breast cancer guidelines was developed to assess breast cancer. These questions were posed to 9 popular LLMs and clinical physicians, with the expectation of receiving direct \"Yes\" or \"No\" answers along with supporting analysis. The performances of the 9 models, including ChatGPT-4.0, ChatGPT-4o, ChatGPT-4o mini, Claude 3 Opus, Claude 3.5 Sonnet, Gemini 1.5 Pro, Tongyi Qianwen 2.5, ChatGLM, and Ernie Bot 3.5, were evaluated against that of radiologists with varying experience levels (resident physicians, fellow physicians, and attending physicians). Responses were assessed for accuracy, confidence, and consistency based on alignment with the 2024 National Comprehensive Cancer Network Breast Cancer Guidelines and the 2013 American College of Radiology Breast Imaging-Reporting and Data System recommendations.</p><p><strong>Results: </strong>Claude 3 Opus and ChatGPT-4 achieved the highest confidence scores of 2.78 and 2.74, respectively, while ChatGPT-4o led in accuracy with a score of 2.92. In terms of response consistency, Claude 3 Opus and Claude 3.5 Sonnet led the pack with scores of 3.0, closely followed by ChatGPT-4o, Gemini 1.5 Pro, and ChatGPT-4o mini, all recording impressive scores exceeding 2.9. ChatGPT-4o mini excelled in clinical diagnostics with a top score of 3.0 among all LLMs, and this score was also higher than all physician groups; however, no statistically significant differences were observed between it and any physician group (all P>.05). ChatGPT-4 also had a higher score than the physician groups but showed comparable statistical performance to them (P>.05). Across radiological diagnostics, clinical diagnosis, and overall performance, ChatGPT-4o mini and the Claude models achieved higher mean scores than all physician groups. However, these differences were statistically significant only when compared to fellow physicians (P<.05). However, ChatGLM and Ernie Bot 3.5 underperformed across diagnostic areas, with lower scores than all physician groups but no statistically significant differences (all P>.05). 
Among physician groups, attending physicians and resident physicians exhibited comparable high scores in radiological diagnostic performance, whereas fellow physicians scored somewhat lower, though the difference was not statistically significant (P>.05).</p><p><strong>Conclusions: </strong>LLMs such as ChatGPT-4o and Claude 3 Opus showed po","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"14 ","pages":"e68182"},"PeriodicalIF":3.8,"publicationDate":"2026-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146108623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This research letter summarizes early lessons from 4 enterprise implementations of artificial intelligence-enabled customer relationship management platforms in health care and describes governance practices associated with improvements in affordability, adherence, and access at the program level.
{"title":"AI-Enabled Customer Relationship Management Platforms for Patient Services in Health Care, Early Lessons From Governance, and Program-Level Outcomes.","authors":"Anup Kant Gupta","doi":"10.2196/83564","DOIUrl":"10.2196/83564","url":null,"abstract":"<p><p>This research letter summarizes early lessons from 4 enterprise implementations of artificial intelligence-enabled customer relationship management platforms in health care and describes governance practices associated with improvements in affordability, adherence, and access at program level.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":" ","pages":"e83564"},"PeriodicalIF":3.8,"publicationDate":"2026-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145893495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yan Zhang, Xia Ren, Luojie Liu, Junjie Zha, Yijie Gu, Hongwei Ye
Background: Venous thromboembolism (VTE) is a common and severe complication in intensive care unit (ICU) patients with sepsis. Conventional risk stratification tools lack sepsis-specific features and may inadequately capture complex, nonlinear interactions among clinical variables.
Objective: This study aimed to develop and validate an interpretable machine learning (ML) model for the early prediction of VTE in ICU patients with sepsis.
Methods: This multicenter retrospective study used data from the Medical Information Mart for Intensive Care IV (MIMIC-IV) database for model development and internal validation, and an independent cohort from Changshu Hospital for external validation. Candidate predictors were selected through univariate analysis, followed by least absolute shrinkage and selection operator (LASSO) regression. Retained variables were used in multivariable logistic regression to identify independent predictors, which were then used to develop 9 ML models: categorical boosting, decision tree, k-nearest neighbor, light gradient boosting machine, logistic regression, multilayer perceptron, naive Bayes, random forest, and support vector machine. Performance was evaluated by discrimination (area under the curve [AUC]), calibration, and clinical utility (decision curve analysis). A subgroup analysis stratified by the Sequential Organ Failure Assessment (SOFA) score was conducted in the external cohort to assess model stability across sepsis severity levels. Model interpretability was assessed using Shapley Additive Explanations (SHAP) to quantify the contribution of features to the predicted risk.
Results: A total of 25,197 patients from the MIMIC-IV cohort and 328 patients from the external cohort were included, with VTE incidences of 844/25,197 (3.4%) and 30/328 (9.2%), respectively. The light gradient boosting machine model performed best, achieving an AUC of 0.956 in internal validation. Despite the higher VTE incidence and clinical severity in the external cohort, the model maintained robust generalization with an AUC of 0.786. Notably, the model achieved enhanced discriminative ability in the severe sepsis subgroup (SOFA score >6) with an AUC of 0.816, compared with 0.769 in the mild to moderate sepsis subgroup. Calibration curves indicated strong agreement between predicted and observed outcomes, and decision curve analysis showed superior net benefit across clinically relevant thresholds. SHAP analysis identified central venous catheterization, serum chloride and bicarbonate levels, arterial catheterization, and prolonged partial thromboplastin time as the most influential predictors. Partial dependence plots revealed both linear and nonlinear associations between these variables and VTE risk. Individual-level force plots further enhanced interpretability by visualizing personalized feature contributions to each patient's predicted risk.
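A compressed sketch of the modeling chain (LASSO-style selection, the best-performing light gradient boosting machine, and SHAP attributions) appears below. It assumes pandas DataFrames; the hyperparameters and the use of an L1-penalized logistic regression for selection are illustrative rather than the study's exact configuration.

```python
import lightgbm as lgb
import shap
from sklearn.linear_model import LogisticRegression

def fit_vte_model(X_train, y_train):
    """X_train: pandas DataFrame of candidate predictors; y_train: 0/1 VTE labels."""
    # L1-penalized selection: keep predictors with nonzero coefficients.
    lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    lasso.fit(X_train, y_train)
    selected = X_train.columns[(lasso.coef_ != 0).ravel()]

    # Gradient-boosted trees on the retained predictors.
    model = lgb.LGBMClassifier(n_estimators=300, learning_rate=0.05)
    model.fit(X_train[selected], y_train)

    # SHAP quantifies each feature's contribution to every patient's prediction.
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_train[selected])
    return model, selected, shap_values
```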
{"title":"Machine Learning Algorithms to Predict Venous Thromboembolism in Patients With Sepsis in the Intensive Care Unit: Multicenter Retrospective Study.","authors":"Yan Zhang, Xia Ren, Luojie Liu, Junjie Zha, Yijie Gu, Hongwei Ye","doi":"10.2196/80969","DOIUrl":"https://doi.org/10.2196/80969","url":null,"abstract":"<p><strong>Background: </strong>Venous thromboembolism (VTE) is a common and severe complication in intensive care unit (ICU) patients with sepsis. Conventional risk stratification tools lack sepsis-specific features and may inadequately capture complex, nonlinear interactions among clinical variables.</p><p><strong>Objective: </strong>This study aimed to develop and validate an interpretable machine learning (ML) model for the early prediction of VTE in ICU patients with sepsis.</p><p><strong>Methods: </strong>This multicenter retrospective study used data from the Medical Information Mart for Intensive Care IV database for model development and internal validation, and an independent cohort from Changshu Hospital for external validation. Candidate predictors were selected through univariate analysis, followed by least absolute shrinkage and selection operator regression. Retained variables were used in multivariable logistic regression to identify independent predictors, which were then used to develop 9 ML models, including categorical boosting, decision tree, k-nearest neighbor, light gradient boosting machine, logistic regression, multilayer perceptron, naive Bayes, random forest, and support vector machine. Performance was evaluated by discrimination (area under the curve [AUC]), calibration, and clinical use (decision curve analysis). A subgroup analysis stratified by the Sequential Organ Failure Assessment score was conducted in the external cohort to assess model stability across sepsis severity levels. Model interpretability was assessed using Shapley Additive Explanations (SHAP) to quantify the contribution of features to the predicted risk.</p><p><strong>Results: </strong>A total of 25,197 patients from the Medical Information Mart for Intensive Care IV cohort and 328 patients from the external cohort were included, with VTE incidences of 844 out of 25,197 (3.4%) and 30 out of 328 (9.2%), respectively. The light gradient boosting machine model performed best, achieving an AUC of 0.956 in internal validation. Despite the higher VTE incidence and clinical severity in the external validation, the model maintained robust generalization with an AUC of 0.786. Notably, the model achieved enhanced discriminative ability in the severe sepsis subgroup (Sequential Organ Failure Assessment score >6) with an AUC of 0.816, compared with 0.769 in the mild to moderate sepsis subgroup. Calibration curves indicated strong agreement between predicted and observed outcomes, and decision curve analysis showed superior net benefit across clinically relevant thresholds. SHAP analysis identified central venous catheterization, serum chloride and bicarbonate levels, arterial catheterization, and prolonged partial thromboplastin time as the most influential predictors. Partial dependence plots revealed both linear and nonlinear associations between these variables and VTE risk. 
Individual-level force plots further enhanced interpretability by visualizing perso","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"14 ","pages":"e80969"},"PeriodicalIF":3.8,"publicationDate":"2026-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146095016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Background: Rib fractures are present in 10%-15% of thoracic trauma cases but are often missed on chest radiographs, delaying diagnosis and treatment. Artificial intelligence (AI) may improve detection and triage in emergency settings.
Objective: This study aims to evaluate diagnostic accuracy, processing speed, and technical feasibility of an artificial intelligence-assisted rib fracture detection system using prospectively collected data within a real-world, high-volume emergency department workflow.
Methods: We conducted an observational feasibility study with prospective data collection of a Faster R-CNN (faster region-based convolutional neural network) AI model deployed in the emergency department to analyze 23,251 real-world chest radiographs (22,946 anteroposterior; 305 oblique) from April 1 to July 2, 2023. This study was approved by the Institutional Review Board of MacKay Memorial Hospital (IRB No. 20MMHIS483e). The AI system operated passively, without influencing clinical decision-making. The reference standard was the final report issued by board-certified radiologists. A subset of discordant cases underwent post hoc computed tomography review for exploratory analysis.
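For orientation, the inference step of a Faster R-CNN detector looks like the sketch below. It uses torchvision's generic pretrained detector and an assumed 0.5 confidence threshold purely for illustration; the study deployed its own trained model inside the hospital workflow.

```python
import torch
import torchvision
from torchvision.io import ImageReadMode, read_image
from torchvision.transforms.functional import convert_image_dtype

# Generic pretrained detector as a stand-in for the study's trained model.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

@torch.no_grad()
def detect(path: str, score_threshold: float = 0.5):
    """Return candidate boxes and confidence scores for one radiograph."""
    img = convert_image_dtype(read_image(path, mode=ImageReadMode.RGB), torch.float)
    (out,) = model([img])  # one dict per image: boxes, labels, scores
    keep = out["scores"] >= score_threshold
    return out["boxes"][keep], out["scores"][keep]
```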
Results: AI achieved 74.5% sensitivity (95% CI 0.708-0.780), 93.3% specificity (95% CI 0.930-0.937), 24.2% positive predictive value, and 99.2% negative predictive value. Median inference time was 10.6 seconds versus 3.3 hours for radiologist reports (paired Wilcoxon signed-rank test W=112,987.5, P<.001). The analysis revealed peak imaging demand between 08:00 and 16:00 and Thursday-Saturday evenings. A 14-day graphics processing unit outage underscored the importance of infrastructure resilience.
Conclusions: The AI system demonstrated strong technical feasibility for real-time rib fracture detection in a high-volume emergency department setting, with rapid inference and stable performance during prospective deployment. Although the system showed high negative predictive value, the observed false-positive and false-negative rates indicate that it should be considered a supportive screening tool rather than a stand-alone diagnostic solution or a replacement for clinical judgment. These findings support further clinician-in-the-loop studies to evaluate clinical feasibility, workflow integration, and impact on diagnostic decision-making. However, interpretation is limited by reliance on radiology reports as the reference standard and the system's passive, non-interventional deployment.
{"title":"Prospective Diagnostic Accuracy and Technical Feasibility of Artificial Intelligence-Assisted Rib Fracture Detection on Chest Radiographs: Observational Study.","authors":"Shu-Tien Huang, Liong-Rung Liu, Ming-Feng Tsai, Ming-Yuan Huang, Hung-Wen Chiu","doi":"10.2196/77965","DOIUrl":"10.2196/77965","url":null,"abstract":"<p><strong>Background: </strong>Rib fractures are present in 10%-15% of thoracic trauma cases but are often missed on chest radiographs, delaying diagnosis and treatment. Artificial intelligence (AI) may improve detection and triage in emergency settings.</p><p><strong>Objective: </strong>This study aims to evaluate diagnostic accuracy, processing speed, and technical feasibility of an artificial intelligence-assisted rib fracture detection system using prospectively collected data within a real-world, high-volume emergency department workflow.</p><p><strong>Methods: </strong>We conducted an observational feasibility study with prospective data collection of a faster region-based convolutional neural network-based AI model deployed in the emergency department to analyze 23,251 real-world chest radiographs (22,946 anteroposterior; 305 oblique) from April 1 to July 2, 2023. This study was approved by the Institutional Review Board of MacKay Memorial Hospital (IRB No. 20MMHIS483e). AI operated passively, without influencing clinical decision-making. The reference standard was the final report issued by board-certified radiologists. A subset of discordant cases underwent post hoc computed tomography review for exploratory analysis.</p><p><strong>Results: </strong>AI achieved 74.5% sensitivity (95% CI 0.708-0.780), 93.3% specificity (95% CI 0.930-0.937), 24.2% positive predictive value, and 99.2% negative predictive value. Median inference time was 10.6 seconds versus 3.3 hours for radiologist reports (paired Wilcoxon signed-rank test W=112 987.5, P<.001). The analysis revealed peak imaging demand between 08:00 and 16:00 and Thursday-Saturday evenings. A 14-day graphics processing unit outage underscored the importance of infrastructure resilience.</p><p><strong>Conclusions: </strong>The AI system demonstrated strong technical feasibility for real-time rib fracture detection in a high-volume emergency department setting, with rapid inference and stable performance during prospective deployment. Although the system showed high negative predictive value, the observed false-positive and false-negative rates indicate that it should be considered a supportive screening tool rather than a stand-alone diagnostic solution or a replacement for clinical judgment. These findings support further clinician-in-the-loop studies to evaluate clinical feasibility, workflow integration, and impact on diagnostic decision-making. 
However, interpretation is limited by reliance on radiology reports as the reference standard and the system's passive, non-interventional deployment.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"14 ","pages":"e77965"},"PeriodicalIF":3.8,"publicationDate":"2026-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12854400/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146088219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}