Pub Date: 2025-09-10; DOI: 10.1136/bmjhci-2025-101435
Susannah Mclean, Paul Miller, Alistair Ewing, Juliet Anne Spiller, Lynsey Fielden
Objective: To deploy a digital application of the Recommended Summary Plan for Emergency Care and Treatment (ReSPECT) across health boards (HBs).
Methods: Clinicians, patients and other regional stakeholders collaborated with the Scottish National Technology Service (NTS) to define requirements. Development was agile and incorporated user feedback.
Results: The ReSPECT web application was developed on Scotland's National Digital Platform and uses an openEHR Clinical Data Repository. Plans can be viewed and edited across settings. The application was first deployed in 2020; by July 2025, 8 of 14 HBs were onboarded and >5500 patients had digital ReSPECT plans.
Discussion: openEHR structures clinical data in a modular way, enabling other applications to use the same data layer. Close collaboration between technicians and users ensured that the application met its requirements and that problems were solved together.
Conclusions: Collaboration on the digital ReSPECT accelerated deployment, enabling more people's wishes and clinical recommendations to be captured and shared across care settings and transitions. openEHR technology enables new data uses.
Title: New openEHR technology and clinical collaboration in vital steps toward improved patient care and true interoperability: Scotland's first digital ReSPECT emergency care plan.
Journal: BMJ Health & Care Informatics 32(1). Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12439144/pdf/
Pub Date: 2025-09-10; DOI: 10.1136/bmjhci-2025-101513
Michael Byczkowski
Data are the engine of modern medicine, yet their economic trade-off remains unequally distributed: hospitals and research institutions shoulder the effort of collection, while life science companies reap the financial rewards. This imbalance raises pressing questions about fairness, rights and sustainability.
Title: Data as medicine's backbone: redefining its value to foster innovation in the data economy.
Journal: BMJ Health & Care Informatics 32(1). Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12519381/pdf/
Objectives: To examine the associations between accelerometer-measured circadian rest-activity rhythm (CRAR), the most prominent circadian rhythm in humans, and the risk of mortality from all causes, cancer and cardiovascular disease (CVD) in patients with cancer.
Methods: 7456 cancer participants from the UK Biobank were included. All participants wore accelerometers from 2013 to 2015 and were followed up until 24 January 2024, with a median follow-up of 9.00 years. The multidimensional parameters of the CRAR were calculated using the 7-day accelerometer data collected under free-living conditions. Cox proportional hazard models were used to evaluate the associations between CRAR and all-cause, cancer and CVD mortality.
Results: Among 7456 cancer patients (mean age: 65.7±6.87 years; 58.85% women) aged 44-79 years, 934 (12.5%) deaths occurred over 9.00 years (64 525 person-years). CRAR disruptions, including low amplitude, low mesor and high fragmentation, were significantly associated with an increased risk of all-cause mortality (adjusted HR range, 1.30-2.00), cancer (adjusted HR range, 1.46-1.83) and CVD mortality (adjusted HR range, 1.73-2.66) in patients with cancer.
Discussion: These associations were robust across various cancer types. In addition, CRAR disruptions, particularly low amplitude, exceeded multiple traditional risk factors such as poor sleep, smoking, alcohol consumption, obesity and unhealthy diet in predicting mortality.
Conclusion: CRAR parameters may serve as novel and robust predictors of mortality in patients with cancer.
Title: Wearable device-measured circadian rest-activity rhythm with mortality risk in patients with cancer.
Authors: Xionge Mei, Nana Zheng, Biao Li, Yue Liu, Lulu Yang, Tong Luo, Ngan Yin Chan, Joey Wy Chan, Yaping Liu, Xiao Tan, Christian Benedict, Yun Kwok Wing, Jihui Zhang, Hongliang Feng
Pub Date: 2025-09-09; DOI: 10.1136/bmjhci-2025-101553
Journal: BMJ Health & Care Informatics 32(1). Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12421596/pdf/
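The CRAR abstract above reports "amplitude" and "mesor", which are conventionally defined through a cosinor model, y(t) = M + A·cos(2πt/24 − φ). The abstract does not say which fitting procedure the authors used, so the following is only a minimal sketch under stated assumptions: a single 24-hour cosine component and evenly spaced samples covering whole days. The function name `fit_cosinor` and the synthetic activity series are hypothetical illustrations, not the study's pipeline.

```python
import math


def fit_cosinor(samples, samples_per_day):
    """Estimate mesor, amplitude and acrophase of a 24-hour rhythm.

    Assumes evenly spaced samples spanning a whole number of days, in which
    case the least-squares cosinor fit reduces to the closed form below.
    """
    n = len(samples)
    if samples_per_day < 3 or n == 0 or n % samples_per_day != 0:
        raise ValueError("need whole days of data with at least 3 samples per day")
    omega = 2 * math.pi / samples_per_day  # angular step per sample
    mesor = sum(samples) / n  # rhythm-adjusted mean level
    beta = 2 / n * sum(y * math.cos(omega * k) for k, y in enumerate(samples))
    gamma = 2 / n * sum(y * math.sin(omega * k) for k, y in enumerate(samples))
    amplitude = math.hypot(beta, gamma)  # half the peak-to-trough difference
    acrophase = math.atan2(gamma, beta)  # phase of the daily peak, in radians
    return mesor, amplitude, acrophase


# Hypothetical hourly activity counts over 7 days (synthetic, not study data).
synthetic = [10 + 4 * math.cos(2 * math.pi * k / 24 - 1.0) for k in range(24 * 7)]
mesor, amplitude, acrophase = fit_cosinor(synthetic, samples_per_day=24)
print(round(mesor, 3), round(amplitude, 3), round(acrophase, 3))
```

In this framing, "low amplitude" means a weak day-night contrast in activity and "low mesor" means a low overall activity level. Fragmentation is a separate, non-parametric measure and is not covered by this sketch.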
Pub Date: 2025-08-22; DOI: 10.1136/bmjhci-2024-101419
Amos Otieno Olwendo, Gideon Kikuvi, Simon Karanja
Introduction: This study seeks to determine the incidence, comorbidities and drivers of new HIV infections in order to develop, test and validate a risk prediction model for screening for new cases of HIV.
Methods and analysis: The study has two components: a cross-sectional study to develop the prediction model using the HIV dataset from the Kenya AIDS and STI Control Programme and a 15-month prospective study for the validation of the model. Inferential analysis will be conducted using algorithms that perform best in disease prediction: Extreme Gradient Boosting (XGBoost) and Multilayer Perceptron. Model sensitivity and specificity will be examined using the receiver operating characteristic curve, and performance will be evaluated using metrics: accuracy, precision, recall and F1 score.
Ethics and dissemination: The study obtained ethical approval (JKU/ISERC/02321/1421) from the Jomo Kenyatta University of Agriculture and Technology Ethical and Research Board and a research licence (NACOSTI/P/24/414749) from the National Commission for Science, Technology and Innovation.
Title: Development and validation of a predictive model for new HIV infection screening among persons 15 years and above in primary healthcare settings in Kenya: a study protocol.
Journal: BMJ Health & Care Informatics 32(1). Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12374640/pdf/
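The HIV screening protocol above names accuracy, precision, recall and F1 score as its evaluation metrics. As a hedged illustration of how these four quantities relate to a binary confusion matrix, here is a minimal sketch; the function `binary_metrics` and the toy labels are hypothetical and are not taken from the protocol.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for 0/1 labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # also called sensitivity
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical screening outcomes: 1 = new HIV case, 0 = no new case.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(binary_metrics(y_true, y_pred))
```

In a screening setting where positives are rare, accuracy can stay high while recall is poor, which is one reason to report all four metrics rather than accuracy alone.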
Pub Date: 2025-08-18; DOI: 10.1136/bmjhci-2024-101146
Ying Wang, Farah Magrabi
Objective: To evaluate the transferability of BERT (Bidirectional Encoder Representations from Transformers) to patient safety, we use it to classify incident reports characterised by limited data and encompassing multiple imbalanced classes.
Methods: BERT was applied to classify 10 incident types and 4 severity levels by (1) fine-tuning and (2) extracting word embeddings for feature representation. Training datasets were collected from a state-wide incident reporting system in Australia (n_type/severity=2860/1160). Transferability was evaluated using three datasets: a balanced dataset (type/severity: n_benchmark=286/116); a real-world imbalanced dataset (n_original=444/4837, rare types/severity<=1%); and an independent hospital-level reporting system (n_independent=6000/5950, imbalanced). Model performance was evaluated by F-score, precision and recall, then compared with convolutional neural networks (CNNs) using BERT embeddings and local embeddings from incident reports.
Results: Fine-tuned BERT outperformed small CNNs trained with BERT embeddings and with static word embeddings developed from scratch. The default parameters of BERT were found to be the optimal configuration. For incident type, fine-tuned BERT achieved high F-scores above 89% across all test datasets (CNNs=81%). It effectively generalised to real-world settings, including rare incident types (eg, clinical handover with 11.1% and 30.3% improvement). For ambiguous medium and low severity levels, the F-score improvements ranged from 3.6% to 19.7% across all test datasets.
Discussion: Fine-tuned BERT led to improved performance, particularly in identifying rare classes and generalising effectively to unseen data, compared with small CNNs.
Conclusion: Fine-tuned BERT may be useful for classification tasks in patient safety where data privacy, scarcity and imbalance are common challenges.
Title: Assessing the transferability of BERT to patient safety: classifying multiple types of incident reports.
Journal: BMJ Health & Care Informatics 32(1). Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12366584/pdf/
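The BERT abstract above evaluates classifiers by F-score on imbalanced, multi-class incident data. The abstract does not state how per-class scores were averaged, so the sketch below simply assumes an unweighted (macro) average to show why F-scores expose weak performance on rare classes that overall accuracy hides. The function names and the label "fall" are hypothetical; "handover" stands in for the rare clinical handover type the abstract mentions.

```python
def per_class_f1(y_true, y_pred):
    """Return a dict mapping each class label to its F1 score."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical imbalanced test set: 8 common reports, 2 rare ones.
y_true = ["fall"] * 8 + ["handover"] * 2
y_pred = ["fall"] * 8 + ["fall", "handover"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, per_class_f1(y_true, y_pred), macro_f1(y_true, y_pred))
```

Here one missed rare report barely moves accuracy but pulls the rare-class F1 well below the common-class F1, which is the kind of gap the reported per-type improvements describe.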
Pub Date: 2025-08-05; DOI: 10.1136/bmjhci-2024-101417
Lynsey Threlfall, Cen Cong, Victoria Riccalton, Edward Meinert, Chris Plummer
Introduction: The second iteration of the National Early Warning Score has been adopted widely within the UK and internationally. It uses routinely collected physiological measurements to standardise the assessment and response to acute illness. Its use is associated with reduced mortality but has limited positive and negative predictive accuracy. There is a growing body of research demonstrating the effectiveness of artificial intelligence (AI) in predicting clinical deterioration, but there is limited evidence to show which aspect of AI is best suited to this task. This systematic review aims to establish which AI or machine learning algorithm is best suited to analysing physiological data sets to predict patient deterioration in a hospital setting.
Methods and analysis: A systematic review will be conducted in accordance with the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analysis) and the PICOS (Population, Intervention, Comparator, Outcome and Study) frameworks. Eight databases (PubMed, Embase, CINAHL, Cochrane Library, Web of Science, Scopus, IEEE Xplore and ACM Digital Library) will be used to search for studies published from 2007 to the present that meet the inclusion criteria. Two reviewers will screen the studies identified and extract data independently, with any discrepancies resolved by discussion. The review is expected to be completed by January 2026, and the results will be presented in publication by June 2026.
Ethics and dissemination: Ethical approval is not required as data will be obtained from published sources. Findings from this study will be disseminated via publication in a peer-reviewed journal.
Title: Predicting patient deterioration with physiological data using AI: systematic review protocol.
Journal: BMJ Health & Care Informatics 32(1). Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12336570/pdf/
Pub Date: 2025-07-31; DOI: 10.1136/bmjhci-2024-101285
Sam Freeman, Isuru Ranapanada, Md Ali Hossain, Kogul Srikandabala, Md Anisur Anisur Rahman, Damminda Alahakoon, Hamed Akhlaghi
Introduction: To address timely care in emergency departments, artificial neural networks (ANNs) with natural language processing will be applied to triage notes to predict patient disposition. This study will develop a predictive model that predicts disposition and type of admission.
Methods and analysis: This will include data preprocessing and quality enhancement, masked language modelling and an ANN-based fusion network for prediction. Generative artificial intelligence, along with a medical dictionary, will be employed to augment and contextually reconstruct triage notes to disambiguate them and improve their linguistic quality. Text features will be extracted, and cluster analysis will be performed on the extracted topics and text features to identify distinct patterns.
Title: Multisite study using a customised NLP model to predict disposition in the emergency department: protocol paper.
Journal: BMJ Health & Care Informatics 32(1). Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12314934/pdf/
Pub Date: 2025-07-25; DOI: 10.1136/bmjhci-2025-101433
Ben Bloom, Adrian Haimovich, Jason Pott, Sophie L Williams, Michael Cheetham, Sandra Langsted, Imogen Skene, Raine Astin-Chamberlain, Stephen H Thomas
Objectives: Identifying whether there is a traumatic intracranial bleed (ICB+) on head CT is critical for clinical care and research. Free text CT reports are unstructured and therefore must undergo time-consuming manual review. Existing artificial intelligence classification schemes are not optimised for the emergency department endpoint of classification of ICB+ or ICB-. We sought to assess three methods for classifying CT reports: a text classification (TC) programme, a commercial natural language processing programme (Clinithink) and a generative pretrained transformer large language model (Digitalizing English-language CT Interpretation for Positive Haemorrhage Evaluation Reporting (DECIPHER)-LLM).
Methods: Primary objective: determine the diagnostic classification performance of the dichotomous categorisation of each of the three approaches.
Secondary objective: determine whether the LLM could achieve a substantial reduction in CT report review workload while maintaining 100% sensitivity. Anonymised radiology reports of head CT scans performed for trauma were manually labelled as ICB+/-. Training and validation sets were randomly created to train the TC and natural language processing models. Prompts were written to train the LLM.
Results: 898 reports were manually labelled. Sensitivity and specificity (95% CI) were, respectively, 87.9% (76.7% to 95.0%) and 98.2% (96.3% to 99.3%) for TC; 75.9% (62.8% to 86.1%) and 96.2% (93.8% to 97.8%) for Clinithink; and 100% (93.8% to 100%) and 97.4% (95.3% to 98.8%) for DECIPHER-LLM (with probability of ICB set at 10%). With the DECIPHER-LLM ICB+ probability threshold set at 10% to identify CT reports requiring manual evaluation, the number of CT reports requiring manual classification fell by an estimated 385/449 cases (85.7% (95% CI 82.1% to 88.9%)) while maintaining 100% sensitivity.
Discussion and conclusion: DECIPHER-LLM outperformed other tested free-text classification methods.
Title: Digitalizing English-language CT Interpretation for Positive Haemorrhage Evaluation Reporting: the DECIPHER study.
Journal: BMJ Health & Care Informatics 32(1). Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12306305/pdf/
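The DECIPHER results above describe using a 10% probability threshold so that only flagged CT reports go to manual review. The sketch below is one plausible reading of that triage logic, not the study's code: `triage_summary` is a hypothetical helper, the probabilities and labels are toy values, and only the final line uses figures from the abstract (385 of 449 reports).

```python
def triage_summary(probs, labels, threshold=0.10):
    """Flag reports with P(ICB+) >= threshold for manual review.

    probs  : model-estimated probability of ICB+ per report
    labels : True if the report was manually labelled ICB+
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("inputs must be non-empty and of equal length")
    flagged = [p >= threshold for p in probs]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y)
    fp = sum(1 for f, y in zip(flagged, labels) if f and not y)
    tn = sum(1 for f, y in zip(flagged, labels) if not f and not y)
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        # Share of reports below the threshold, i.e. not sent to manual review.
        "auto_cleared_fraction": (tn + fn) / len(probs),
    }


# Toy example (hypothetical values, not study data).
probs = [0.95, 0.40, 0.12, 0.08, 0.02, 0.01, 0.30, 0.05]
labels = [True, True, True, False, False, False, False, False]
print(triage_summary(probs, labels))

# Arithmetic check of the reported workload reduction: 385 of 449 reports.
print(round(100 * 385 / 449, 1))
```

Setting the threshold low trades specificity for sensitivity: more reports are flagged, but fewer true bleeds are missed, which matches the abstract's emphasis on keeping sensitivity at 100%.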
Pub Date: 2025-07-25; DOI: 10.1136/bmjhci-2025-101570
AlHasan AlSammarraie, Ali Al-Saifi, Hassan Kamhia, Mohamed Aboagla, Mowafa Househ
Objectives: To develop and evaluate an agentic retrieval augmented generation (ARAG) framework using open-source large language models (LLMs) for generating evidence-based Arabic patient education materials (PEMs) and to assess the LLMs' capabilities as validation agents tasked with blocking harmful content.
Methods: We selected 12 LLMs and applied four experimental setups (base, base+prompt engineering, ARAG, and ARAG+prompt engineering). PEM generation quality was assessed via two-stage evaluation (automated LLM, then expert review) using 5 metrics (accuracy, readability, comprehensiveness, appropriateness and safety) against ground truth. Validation agent (VA) performance was evaluated separately using a harmful/safe PEM dataset, measuring blocking accuracy.
Results: ARAG-enabled setups yielded the best generation performance for 10/12 LLMs. Arabic-focused models occupied the top 9 ranks. Expert evaluation ranking mirrored the automated ranking. AceGPT-v2-32B with ARAG and prompt engineering (setup 4) was confirmed highest-performing. VA accuracy correlated strongly with model size; only models ≥27B parameters achieved >0.80 accuracy. Fanar-7B performed well in generation but poorly as a VA.
Discussion: Arabic-centred models demonstrated advantages for the Arabic PEM generation task. ARAG enhanced generation quality, although context limits impacted large-context models. The validation task highlighted model size as critical for reliable performance.
Conclusion: ARAG noticeably improves Arabic PEM generation, particularly with Arabic-centred models like AceGPT-v2-32B. Larger models appear necessary for reliable harmful content validation. Automated evaluation showed potential for ranking systems, aligning with expert judgement for top performers.
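The abstract reports validation agent performance as blocking accuracy on a harmful/safe PEM dataset but does not give the exact formula. The following is a minimal Python sketch of one plausible reading, assuming blocking accuracy is the fraction of PEMs for which the block/allow decision matches the ground-truth label. The names `LabelledPEM`, `blocking_accuracy` and `keyword_validator` are hypothetical, the keyword rule is a toy stand-in for an LLM validation agent, and the English demo texts are invented for illustration (the study used Arabic PEMs).

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class LabelledPEM:
    """One patient education material with a ground-truth safety label."""
    text: str
    is_harmful: bool


def blocking_accuracy(
    samples: Iterable[LabelledPEM],
    should_block: Callable[[str], bool],
) -> float:
    """Fraction of samples where the block/allow decision matches the label.

    Assumption: a correct decision means blocking a harmful PEM or
    allowing a safe one; both kinds of error count equally.
    """
    samples = list(samples)
    if not samples:
        raise ValueError("at least one labelled sample is required")
    correct = sum(should_block(s.text) == s.is_harmful for s in samples)
    return correct / len(samples)


def keyword_validator(text: str) -> bool:
    """Toy stand-in for an LLM validation agent (hypothetical rule)."""
    return "stop taking" in text.lower()


# Invented demo data: two safe and two harmful items.
demo = [
    LabelledPEM("Take your tablets with food as prescribed.", False),
    LabelledPEM("Stop taking insulin if you feel well.", True),
    LabelledPEM("Double the dose to recover faster.", True),
    LabelledPEM("Drink water regularly during hot weather.", False),
]

accuracy = blocking_accuracy(demo, keyword_validator)
print(f"blocking accuracy: {accuracy:.2f}")  # prints "blocking accuracy: 0.75"
```

In this toy run the validator misses one harmful item, so three of four decisions are correct. On a balanced dataset like this one, a validator that blocks everything would score 0.50, which is why accuracy alone may need to be read alongside the harmful/safe split of the dataset.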
{"title":"Development and evaluation of an agentic LLM based RAG framework for evidence-based patient education.","authors":"AlHasan AlSammarraie, Ali Al-Saifi, Hassan Kamhia, Mohamed Aboagla, Mowafa Househ","doi":"10.1136/bmjhci-2025-101570","DOIUrl":"10.1136/bmjhci-2025-101570","url":null,"abstract":"<p><strong>Objectives: </strong>To develop and evaluate an agentic retrieval augmented generation (ARAG) framework using open-source large language models (LLMs) for generating evidence-based Arabic patient education materials (PEMs) and assess the LLMs capabilities as validation agents tasked with blocking harmful content.</p><p><strong>Methods: </strong>We selected 12 LLMs and applied four experimental setups (base, base+prompt engineering, ARAG, and ARAG+prompt engineering). PEM generation quality was assessed via two-stage evaluation (automated LLM, then expert review) using 5 metrics (accuracy, readability, comprehensiveness, appropriateness and safety) against ground truth. Validation agent (VA) performance was evaluated separately using a harmful/safe PEM dataset, measuring blocking accuracy.</p><p><strong>Results: </strong>ARAG-enabled setups yielded the best generation performance for 10/12 LLMs. Arabic-focused models occupied the top 9 ranks. Expert evaluation ranking mirrored the automated ranking. AceGPT-v2-32B with ARAG and prompt engineering (setup 4) was confirmed highest-performing. VA accuracy correlated strongly with model size; only models ≥27B parameters achieved >0.80 accuracy. Fanar-7B performed well in generation but poorly as a VA.</p><p><strong>Discussion: </strong>Arabic-centred models demonstrated advantages for the Arabic PEM generation task. ARAG enhanced generation quality, although context limits impacted large-context models. 
The validation task highlighted model size as critical for reliable performance.</p><p><strong>Conclusion: </strong>ARAG noticeably improves Arabic PEM generation, particularly with Arabic-centred models like AceGPT-v2-32B. Larger models appear necessary for reliable harmful content validation. Automated evaluation showed potential for ranking systems, aligning with expert judgement for top performers.</p>","PeriodicalId":9050,"journal":{"name":"BMJ Health & Care Informatics","volume":"32 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2025-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12306375/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144717428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-23 DOI: 10.1136/bmjhci-2024-101371
Saba Esnaashari, Youmna Hashem, John Francis, Deborah Morgan, Anton Poletaev, Jonathan Bright
This research presents key findings from a project exploring UK doctors' perspectives on artificial intelligence (AI) in their work. Despite a growing interest in the use of AI in medicine, studies have yet to explore a representative sample of doctors' perspectives on, and experiences with, making use of different types of AI. Our research seeks to fill this gap by presenting findings from a survey exploring doctors' perceptions and experiences of using a variety of AI systems in their work. A sample of 929 doctors on the UK medical register participated in a survey between December 2023 and January 2024 which asked a range of questions about their understanding and use of AI systems. Overall, 29% of respondents reported using some form of AI in their practice within the last 12 months, with diagnostic-decision-support (16%) and generative-AI (16%) being the most prevalently used AI systems. We found that the majority of generative-AI users (62%) reported that these systems increase their productivity, and most diagnostic-decision-support users (62%) reported that the systems improve their clinical decision-making. More than half of doctors (52%) were optimistic about the integration of AI in healthcare, rising to 63% for AI users. Only 15% stated that advances in AI make them worried about their job security, with no significant difference between AI and non-AI users. However, there were relatively low reported levels of training, as well as understandings of risks and professional responsibilities, especially among generative-AI users. Just 12% of respondents agreed they have received sufficient training to understand their professional responsibilities when using AI, with this number decreasing to 8% for generative-AI users. We hope this work adds to the evidence base for policy-makers looking to support the integration of AI in healthcare.
{"title":"Exploring doctors' perspectives on generative-AI and diagnostic-decision-support systems.","authors":"Saba Esnaashari, Youmna Hashem, John Francis, Deborah Morgan, Anton Poletaev, Jonathan Bright","doi":"10.1136/bmjhci-2024-101371","DOIUrl":"10.1136/bmjhci-2024-101371","url":null,"abstract":"<p><p>This research presents key findings from a project exploring UK doctors' perspectives on artificial intelligence (AI) in their work. Despite a growing interest in the use of AI in medicine, studies have yet to explore a representative sample of doctors' perspectives on, and experiences with, making use of different types of AI. Our research seeks to fill this gap by presenting findings from a survey exploring doctors' perceptions and experiences of using a variety of AI systems in their work. A sample of 929 doctors on the UK medical register participated in a survey between December 2023 and January 2024 which asked a range of questions about their understanding and use of AI systems.Overall, 29% of respondents reported using some form of AI in their practice within the last 12 months, with diagnostic-decision-support (16%) and generative-AI (16%) being the most prevalently used AI systems.We found that the majority of generative-AI users (62%) reported that these systems increase their productivity, and most diagnostic- decision-support users (62%) reported that the systems improve their clinical decision-making. More than half of doctors (52%) were optimistic about the integration of AI in healthcare, rising to 63% for AI users. Only 15% stated that advances in AI make them worried about their job security, with no significant difference between AI and non-AI users. However, there were relatively low reported levels of training, as well as understandings of risks and professional responsibilities, especially among generative-AI users. 
Just 12% of respondents agreed they have received sufficient training to understand their professional responsibilities when using AI, with this number decreasing to 8% for generative-AI users. We hope this work adds to the evidence base for policy-makers looking to support the integration of AI in healthcare.</p>","PeriodicalId":9050,"journal":{"name":"BMJ Health & Care Informatics","volume":"32 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2025-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12306348/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144706292","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}