Development of a cloud framework for training and deployment of deep learning models in Radiology: automatic segmentation of the human spine from CT-scans as a case-study
Pub Date: 2024-08-28 · DOI: 10.1101/2024.08.27.24312635
Rui Santos, Nicholas Bünger, Benedikt Herzog, Sebastiano Caprara
Advancements in artificial intelligence (AI) and the digitalization of healthcare are revolutionizing clinical practices, with the deployment of AI models playing a crucial role in enhancing diagnostic accuracy and treatment outcomes. Our study aims to bridge image data collected in a clinical setting with the deployment of deep learning algorithms for segmentation of the human spine. The developed pipeline takes a decentralized approach, where selected clinical images are sent to a trusted research environment, part of a private tenant at a cloud service provider. As a use-case scenario, we used the TotalSegmentator CT-scan dataset, along with its annotated ground-truth spine data, to train a ResSegNet model native to the MONAI-Label framework. Training and validation were conducted using high-performance GPUs available on demand in the trusted research environment. Segmentation model performance was benchmarked with metrics including Dice score, intersection over union, accuracy, precision, sensitivity, specificity, boundary F1 score, Cohen's kappa, area under the curve, and Hausdorff distance. To further assess model robustness, we also trained a state-of-the-art nnU-Net model on the same dataset and compared both models with a pre-trained spine segmentation model available within MONAI-Label. The ResSegNet model, deployable via MONAI-Label, demonstrated performance comparable to the state-of-the-art nnU-Net framework, with both models showing strong results across multiple segmentation metrics. This study successfully trained, evaluated, and deployed a decentralized deep learning model for CT-scan spine segmentation in a cloud environment, and validated the new model against state-of-the-art alternatives. This comprehensive comparison highlights the value of MONAI-Label as an effective tool for label generation, model training, and deployment, and underscores its user-friendly nature and ease of deployment in clinical and research settings. Further, we demonstrate that such tools can be deployed in private, secure, decentralized cloud environments for clinical use.
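The abstract does not spell out its headline overlap metrics; as a reference, Dice and intersection over union for a binary segmentation mask reduce to a few lines of NumPy (a generic sketch, not the authors' benchmarking code):

```python
import numpy as np

def dice_and_iou(pred: np.ndarray, truth: np.ndarray) -> tuple[float, float]:
    """Dice score and intersection-over-union for two binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    total = pred.sum() + truth.sum()
    dice = 2.0 * inter / total if total else 1.0   # empty masks count as a perfect match
    iou = inter / union if union else 1.0
    return dice, iou
```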
{"title":"Development of a cloud framework for training and deployment of deep learning models in Radiology: automatic segmentation of the human spine from CT-scans as a case-study","authors":"Rui Santos, Nicholas Bünger, Benedikt Herzog, Sebastiano Caprara","doi":"10.1101/2024.08.27.24312635","DOIUrl":"https://doi.org/10.1101/2024.08.27.24312635","url":null,"abstract":"Advancements in artificial intelligence (AI) and the digitalization of healthcare are revolutionizing clinical practices, with the deployment of AI models playing a crucial role in enhancing diagnostic accuracy and treatment outcomes. Our current study aims at bridging image data collected in a clinical setting, with deployment of deep learning algorithms for the segmentation of the human spine. The developed pipeline takes a decentralized approach, where selected clinical images are sent to a trusted research environment, part of private tenant in a cloud service provider. As a use-case scenario, we used the TotalSegmentator CT-scan dataset, along with its annotated ground-truth spine data, to train a ResSegNet model native to the MONAI-Label framework. Training and validation were conducted using high performance GPUs available on demand in the Trusted Research Environment. Segmentation model performance benchmarking involved metrics such as dice score, intersection over union, accuracy, precision, sensitivity, specificity, bounding F1 score, Cohen’s kappa, area under the curve, and Hausdorff distance. To further assess model robustness, we also trained a state-of-the-art nnU-Net model using the same dataset and compared both models with a pre-trained spine segmentation model available within MONAI-Label. The ResSegNet model, deployable via MONAI-Label, demonstrated performance comparable to the state-of-the-art nnU-Net framework, with both models showing strong results across multiple segmentation metrics. This study successfully trained, evaluated and deployed a decentralized deep learning model for CT-scan spine segmentation in a cloud environment. This new model was validated against state-of-the-art alternatives. This comprehensive comparison highlights the value of the MONAI-Label as an effective tool for label generation, model training, and deployment, further highlighting its user-friendly nature and ease of deployment in clinical and research settings. Further we also demonstrate that such tools can be deployed in private and safe decentralized cloud environments for clinical use.","PeriodicalId":501454,"journal":{"name":"medRxiv - Health Informatics","volume":"45 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluation synthesis analysis can be accelerated through text mining, searching, and highlighting: A case-study on data extraction from 631 UNICEF evaluation reports
Pub Date: 2024-08-28 · DOI: 10.1101/2024.08.27.24312630
Lena Schmidt, Pauline Addis, Erica Mattellone, Hannah OKeefe, Kamilla Nabiyeva, Uyen Kim Huynh, Nabamallika Dehingia, Dawn Craig, Fiona Campbell
Background: The United Nations Children's Fund (UNICEF) is the United Nations agency dedicated to promoting and advocating for the protection of children's rights, meeting their basic needs, and expanding their opportunities to reach their full potential. It achieves this by working with governments, communities, and other partners via programmes that safeguard children from violence, provide access to quality education, ensure that children survive and thrive, provide access to water, sanitation, and hygiene, and provide life-saving support in emergency contexts. Programmes are evaluated as part of UNICEF's Evaluation Policy, and the publicly available reports include a wealth of information on results, recommendations, and lessons learned. Objective: To critically explore UNICEF's impact, a systematic synthesis of evaluations was conducted to summarize UNICEF's main achievements and areas for improvement, reflecting key recommendations, lessons learned, enablers, and barriers to achieving its goals, and to steer the organization's future direction and strategy. Since the evaluations are extensive, manual analysis was not feasible, so a semi-automated approach was taken. Methods: This paper examines the automation techniques used to increase the feasibility of undertaking broad evaluation synthesis analyses. Our semi-automated, human-in-the-loop methods supported extraction of data for 64 outcomes across 631 evaluation reports, each comprising hundreds of pages of text. The outcomes are derived from the five goal areas of UNICEF's 2022-2025 Strategic Plan. For text pre-processing we implemented PDF-to-text extraction, section parsing, and sentence mining via a neural network. Data extraction was supported by a freely available text-mining workbench, SWIFT-Review. Here, we describe using comprehensive adjacency-search-based queries to rapidly filter reports by outcomes and to highlight relevant sections of text to expedite data extraction. Results: While the methods used were not expected to produce 100% complete results for each outcome, they offer useful automation methods for researchers facing otherwise infeasible evaluation synthesis tasks. We reduced the text volume to 8% using deep learning (recall 0.93) and rapidly identified relevant evaluations across outcomes with a median precision of 0.6. All code is available and open-source. Conclusions: When the classic approach of systematically extracting information for all outcomes across all texts exceeds available resources, the proposed automation methods can be employed to speed up the process while retaining scientific rigour and reproducibility.
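SWIFT-Review's actual query syntax is not reproduced in the abstract, but the core idea of an adjacency (proximity) search — two terms occurring within n words of each other — can be sketched in plain Python; the terms, window size, and sample text below are chosen purely for illustration:

```python
import re

def near(text: str, term_a: str, term_b: str, window: int = 10) -> bool:
    """True if term_a and term_b occur within `window` words of each other."""
    words = re.findall(r"\w+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    return any(abs(a - b) <= window for a in pos_a for b in pos_b)

# Flag a report passage as relevant to a hypothetical WASH outcome:
report_text = "Households gained reliable access to safe water and sanitation services."
relevant = near(report_text, "sanitation", "access", window=8)  # -> True
```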
{"title":"Evaluation synthesis analysis can be accelerated through text mining, searching, and highlighting: A case-study on data extraction from 631 UNICEF evaluation reports","authors":"Lena Schmidt, Pauline Addis, Erica Mattellone, Hannah OKeefe, Kamilla Nabiyeva, Uyen Kim Huynh, Nabamallika Dehingia, Dawn Craig, Fiona Campbell","doi":"10.1101/2024.08.27.24312630","DOIUrl":"https://doi.org/10.1101/2024.08.27.24312630","url":null,"abstract":"Background: The United Nations Children's Fund (UNICEF) is the United Nations agency dedicated to promoting and advocating for the protection of children's rights, meeting their basic needs, and expanding their opportunities to reach their full potential. They achieve this by working with governments, communities, and other partners via programmes that safeguard children from violence, provide access to quality education, ensure that children survive and thrive, provide access to water, sanitation and hygiene, and provide life-saving support in emergency contexts. Programmes are evaluated as part of UNICEF Evaluation Policy, and the publicly available reports include a wealth of information on results, recommendations, and lessons learned. Objective: To critically explore UNICEF's impact, a systematic synthesis of evaluations was conducted to provide a summary of UNICEF main achievements and areas where they could improve, as a reflection of key recommendations, lessons learned, enablers, and barriers to achieving their goals and to steer its future direction and strategy. Since the evaluations are extensive, manual analysis was not feasible, so a semi-automated approach was taken. Methods: This paper examines the automation techniques used to try and increase the feasibility of undertaking broad evaluation syntheses analyses. Our semi-automated human-in-the-loop methods supported data extraction of data for 64 outcomes across 631 evaluation reports; each of which comprised hundreds of pages of text. The outcomes are derived from the five goal areas within UNICEF 2022-2025 Strategic Plan. For text pre-processing we implemented PDF-to-text extraction, section parsing, and sentence mining via a neural network. Data extraction was supported by a freely available text-mining workbench, SWIFT-Review. Here, we describe using comprehensive adjacency-search-based queries to rapidly filter reports by outcomes and to highlight relevant sections of text to expedite data extraction. Results: While the methods used were not expected to produce 100% complete results for each outcome, they present useful automation methods for researchers facing otherwise non-feasible evaluation syntheses tasks. We reduced the text volume down to 8% using deep learning (recall 0.93) and rapidly identified relevant evaluations across outcomes with a median precision of 0.6. All code is available and open-source. 
Conclusions: When the classic approach of systematically extracting information from all outcomes across all texts exceeds available resources, the proposed automation methods can be employed to speed up the process while retaining scientific rigour and reproducibility.","PeriodicalId":501454,"journal":{"name":"medRxiv - Health Informatics","volume":"79 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MedSegBench: A Comprehensive Benchmark for Medical Image Segmentation in Diverse Data Modalities
Pub Date: 2024-08-28 · DOI: 10.1101/2024.08.26.24312619
Zeki Kuş, Musa Aydin
MedSegBench is a comprehensive benchmark designed to evaluate deep learning models for medical image segmentation across a wide range of modalities, comprising 35 datasets with over 60,000 images from ultrasound, MRI, and X-ray. The benchmark addresses challenges in medical imaging by providing standardized datasets with train/validation/test splits, accounting for variability in image quality and dataset imbalances. It supports binary and multi-class segmentation tasks with up to 19 classes and uses the U-Net architecture with various encoder/decoder networks, such as ResNets, EfficientNet, and DenseNet, for evaluations. MedSegBench is a valuable resource for developing robust and flexible segmentation algorithms, allows fair comparisons across different models, and promotes the development of universal models for medical tasks. It is the most comprehensive study among medical segmentation datasets. The datasets and source code are publicly available, encouraging further research and development in medical image analysis.
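The abstract does not name the implementation behind its U-Net evaluations; one common way to pair a U-Net decoder with interchangeable encoders such as ResNet, EfficientNet, or DenseNet is the segmentation_models_pytorch package (an assumption for illustration, not necessarily the benchmark's own code):

```python
import segmentation_models_pytorch as smp

# U-Net with a swappable encoder backbone; `classes` matches the benchmark's
# largest multi-class task (up to 19 classes per the abstract).
model = smp.Unet(
    encoder_name="resnet34",      # or "efficientnet-b0", "densenet121", ...
    encoder_weights="imagenet",   # pretrained encoder initialization
    in_channels=1,                # single-channel ultrasound/MRI/X-ray slices
    classes=19,
)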
{"title":"MedSegBench: A Comprehensive Benchmark for Medical Image Segmentation in Diverse Data Modalities","authors":"Zeki Kuş, Musa Aydin","doi":"10.1101/2024.08.26.24312619","DOIUrl":"https://doi.org/10.1101/2024.08.26.24312619","url":null,"abstract":"MedSegBench is a comprehensive benchmark designed to evaluate deep learning models for medical image segmentation across a wide range of modalities. It covers a wide range of modalities, including 35 datasets with over 60,000 images from ultrasound, MRI, and X-ray. The benchmark addresses challenges in medical imaging by providing standardized datasets with train/validation/test splits, considering variability in image quality and dataset imbalances. The benchmark supports binary and multi-class segmentation tasks with up to 19 classes. It supports binary and multi-class segmentation tasks with up to 19 classes and uses the U-Net architecture with various encoder/decoder networks such as ResNets, EfficientNet, and DenseNet for evaluations. MedSegBench is a valuable resource for developing robust and flexible segmentation algorithms and allows for fair comparisons across different models, promoting the development of universal models for medical tasks. It is the most comprehensive study among medical segmentation datasets. The datasets and source code are publicly available, encouraging further research and development in medical image analysis.","PeriodicalId":501454,"journal":{"name":"medRxiv - Health Informatics","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large Language Model Augmented Clinical Trial Screening
Pub Date: 2024-08-28 · DOI: 10.1101/2024.08.27.24312646
Jacob Beattie, Dylan Owens, Ann Marie Navar, Luiza Giuliani Schmitt, Kimberly Taing, Sarah Neufeld, Daniel Yang, Christian Chukwuma, Ahmed Gul, Dong Soo Lee, Neil Desai, Dominic Moon, Jing Wang, Steve Jiang, Michael Dohopolski
Purpose: Identifying potential participants for clinical trials using traditional manual screening methods is time-consuming and expensive. Structured data in electronic health records (EHRs) are often insufficient to capture trial inclusion and exclusion criteria adequately. Large language models (LLMs) offer the potential for improved participant screening by searching text notes in the EHR, but optimal deployment strategies remain unclear. Methods: We evaluated the performance of GPT-3.5 and GPT-4 in screening a cohort of 74 patients (35 eligible, 39 ineligible) using EHR data, including progress notes, pathology reports, and imaging reports, for a phase 2 clinical trial in patients with head and neck cancer. Fourteen trial criteria were evaluated, including stage, histology, prior treatments, underlying conditions, and functional status, among others. Manually annotated data served as the ground truth. We tested three prompting approaches: Structured Output (SO), Chain of Thought (CoT), and Self-Discover (SD). SO and CoT were further tested using expert guidance and LLM guidance (EG and LLM-G, respectively). Prompts were developed and refined using 10 patients from each cohort and then assessed on the remaining 54 patients. Each approach was assessed for accuracy, sensitivity, specificity, and micro F1 score. We explored two eligibility predictions: strict eligibility required meeting all criteria, while proportional eligibility used the proportion of criteria met. Screening time and cost were measured, and a failure analysis identified common misclassification issues. Results: Fifty-four patients were evaluated (25 enrolled, 29 not enrolled). At the criterion level, GPT-3.5 showed a median accuracy of 0.761 (range: 0.554-0.910), with the Structured Output + EG approach performing best. GPT-4 demonstrated a median accuracy of 0.838 (range: 0.758-0.886), with the Self-Discover approach achieving the highest Youden Index of 0.729. For strict patient-level eligibility, GPT-3.5's Structured Output + EG approach reached an accuracy of 0.611, while GPT-4's CoT + EG achieved 0.65. Proportional eligibility performed better overall, with GPT-4's CoT + LLM-G approach having the highest AUC (0.82) and Youden Index (0.60). Screening times ranged from 1.4 to 3 minutes per patient for GPT-3.5 and 7.9 to 12.4 minutes for GPT-4, with costs of $0.02-$0.03 for GPT-3.5 and $0.15-$0.27 for GPT-4. Conclusion: LLMs can be used to assess specific clinical trial criteria but had difficulty identifying patients who met all criteria. Instead, using the proportion of criteria met to flag candidates for manual review may be a more practical approach. LLM performance varies by prompt, with GPT-4 generally outperforming GPT-3.5, but at higher costs and longer processing times. LLMs should complement, not replace, manual chart reviews for matching patients to clinical trials.
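The proportional-eligibility idea — rank patients by the fraction of criteria the LLM judges met rather than requiring all of them — is simple to express in code; the criterion names and threshold below are placeholders, not values from the study:

```python
def proportional_eligibility(criteria_met: dict[str, bool],
                             threshold: float = 0.8) -> tuple[float, bool]:
    """Fraction of trial criteria met, plus a flag for manual review."""
    proportion = sum(criteria_met.values()) / len(criteria_met)
    return proportion, proportion >= threshold

# Example: 3 of 4 hypothetical criteria met -> (0.75, False)
score, flag_for_review = proportional_eligibility(
    {"stage_ok": True, "histology_ok": True, "no_prior_rt": True, "ecog_ok": False}
)
```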
{"title":"Large Language Model Augmented Clinical Trial Screening","authors":"Jacob Beattie, Dylan Owens, Ann Marie Navar, Luiza Giuliani Schmitt, Kimberly Taing, Sarah Neufeld, Daniel Yang, Christian Chukwuma, Ahmed Gul, Dong Soo Lee, Neil Desai, Dominic Moon, Jing Wang, Steve Jiang, Michael Dohopolski","doi":"10.1101/2024.08.27.24312646","DOIUrl":"https://doi.org/10.1101/2024.08.27.24312646","url":null,"abstract":"Purpose: Identifying potential participants for clinical trials using traditional manual screening methods is time-consuming and expensive. Structured data in electronic health records (EHR) are often insufficient to capture trial inclusion and exclusion criteria adequately. Large language models (LLMs) offer the potential for improved participant screening by searching text notes in the EHR, but optimal deployment strategies remain unclear.\u0000Methods: We evaluated the performance of GPT-3.5 and GPT-4 in screening a cohort of 74 patients (35 eligible, 39 ineligible) using EHR data, including progress notes, pathology reports, and imaging reports, for a phase 2 clinical trial in patients with head and neck cancer. Fourteen trial criteria were evaluated, including stage, histology, prior treatments, underlying conditions, functional status, etc. Manually annotated data served as the ground truth. We tested three prompting approaches (Structured Output (SO), Chain of Thought (CoT), and Self-Discover (SD)). SO and CoT were further tested using expert and LLM guidance (EG and LLM-G, respectively). Prompts were developed and refined using 10 patients from each cohort and then assessed on the remaining 54 patients. Each approach was assessed for accuracy, sensitivity, specificity, and micro F1 score. We explored two eligibility predictions: strict eligibility required meeting all criteria, while proportional eligibility used the proportion of criteria met. Screening time and cost were measured, and a failure analysis identified common misclassification issues. Results: Fifty-four patients were evaluated (25 enrolled, 29 not enrolled). At the criterion level, GPT-3.5 showed a median accuracy of 0.761 (range: 0.554-0.910), with the Structured Out- put + EG approach performing best. GPT-4 demonstrated a median accuracy of 0.838 (range: 0.758-0.886), with the Self-Discover approach achieving the highest Youden Index of 0.729. For strict patient-level eligibility, GPT-3.5's Structured Output + EG approach reached an accuracy of 0.611, while GPT-4's CoT + EG achieved 0.65. Proportional eligibility performed better over- all, with GPT-4's CoT + LLM-G approach having the highest AUC (0.82) and Youden Index (0.60). Screening times ranged from 1.4 to 3 minutes per patient for GPT-3.5 and 7.9 to 12.4 minutes for GPT-4, with costs of $0.02-$0.03 for GPT-3.5 and $0.15-$0.27 for GPT-4.\u0000Conclusion: LLMs can be used to identify specific clinical trial criteria but had difficulties identifying patients who met all criteria. Instead, using the proportion of criteria met to flag candidates for manual review maybe a more practical approach. LLM performance varies by prompt, with GPT-4 generally outperforming GPT-3.5, but at higher costs and longer processing times. 
LLMs should complement, not replace, manual chart reviews for matching patients to clinical trials.","PeriodicalId":501454,"journal":{"name":"medRxiv - Health Informatics","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating Anti-LGBTQIA+ Medical Bias in Large Language Models
Pub Date: 2024-08-27 · DOI: 10.1101/2024.08.22.24312464
Crystal Tin-Tin Chang, Neha Srivathsa, Charbel Bou-Khalil, Akshay Swaminathan, Mitchell R Lunn, Kavita Mishra, Roxana Daneshjou, Sanmi Koyejo
From drafting responses to patient messages, to clinical decision support, to patient-facing educational chatbots, large language models (LLMs) present many opportunities for use in clinical situations. In these applications, we must consider potential harms to minoritized groups through the propagation of medical misinformation or previously held misconceptions. In this work, we evaluate the potential of LLMs to propagate anti-LGBTQIA+ medical bias and misinformation. We prompted four LLMs (Gemini 1.5 Flash, Claude 3 Haiku, GPT-4o, and Stanford Medicine Secure GPT (GPT-4.0)) with a set of 38 prompts consisting of explicit questions and synthetic clinical notes created by medically trained reviewers and LGBTQIA+ health experts. The prompts explored clinical situations across two axes: (i) situations where historical bias has been observed vs. not observed, and (ii) situations where LGBTQIA+ identity is relevant to clinical care vs. not relevant. Medically trained reviewers evaluated LLM responses for appropriateness (safety, privacy, hallucination/accuracy, and bias) and clinical utility. We find that all four LLMs evaluated generated inappropriate responses to our prompt set. LLM performance is strongly hampered by learned anti-LGBTQIA+ bias and over-reliance on the conditions mentioned in prompts. Given these results, future work should focus on tailoring output formats according to stated use cases, decreasing sycophancy and reliance on extraneous information in the prompt, and improving accuracy and decreasing bias for LGBTQIA+ patients and care providers.
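The study's 2×2 prompt design (historical bias observed vs. not; identity clinically relevant vs. not) lends itself to a simple tally of reviewer verdicts per cell; the records below are invented placeholders to show the structure, not study data:

```python
from collections import Counter

# One (bias_axis, relevance_axis, reviewer_verdict) tuple per LLM response.
ratings = [
    ("bias_observed",    "identity_relevant",     "inappropriate"),
    ("bias_observed",    "identity_not_relevant", "appropriate"),
    ("no_bias_observed", "identity_relevant",     "inappropriate"),
]

# Count inappropriate responses in each cell of the 2x2 design.
inappropriate_by_cell = Counter(
    (bias, relevance) for bias, relevance, verdict in ratings
    if verdict == "inappropriate"
)
print(inappropriate_by_cell)
```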
{"title":"Evaluating Anti-LGBTQIA+ Medical Bias in Large Language Models","authors":"Crystal Tin-Tin Chang, Neha Srivathsa, Charbel Bou-Khalil, Akshay Swaminathan, Mitchell R Lunn, Kavita Mishra, Roxana Daneshjou, Sanmi Koyejo","doi":"10.1101/2024.08.22.24312464","DOIUrl":"https://doi.org/10.1101/2024.08.22.24312464","url":null,"abstract":"From drafting responses to patient messages to clinical decision support to patient-facing educational chatbots, Large Language Models (LLMs) present many opportunities for use in clinical situations. In these applications, we must consider potential harms to minoritized groups through the propagation of medical misinformation or previously-held misconceptions. In this work, we evaluate the potential of LLMs to propagate anti-LGBTQIA+ medical bias and misinformation. We prompted 4 LLMs (Gemini 1.5 Flash, Claude 3 Haiku, GPT-4o, Stanford Medicine Secure GPT (GPT-4.0)) with a set of 38 prompts consisting of explicit questions and synthetic clinical notes created by medically trained reviewers and LGBTQIA+ health experts. The prompts explored clinical situations across two axes: (i) situations where historical bias has been observed vs. not observed, and (ii) situations where LGBTQIA+ identity is relevant to clinical care vs. not relevant. Medically trained reviewers evaluated LLM responses for appropriateness (safety, privacy, hallucination/accuracy, and bias) and clinical utility. We find that all 4 LLMs evaluated generated inappropriate responses to our prompt set. LLM performance is strongly hampered by learned anti-LGBTQIA+ bias and over-reliance on the mentioned conditions in prompts. Given these results, future work should focus on tailoring output formats according to stated use cases, decreasing sycophancy and reliance on extraneous information in the prompt, and improving accuracy and decreasing bias for LGBTQIA+ patients and care providers.","PeriodicalId":501454,"journal":{"name":"medRxiv - Health Informatics","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211536","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enhancing Clinical Documentation Workflow with Ambient Artificial Intelligence: Clinician Perspectives on Work Burden, Burnout, and Job Satisfaction
Pub Date: 2024-08-26 · DOI: 10.1101/2024.08.12.24311883
Michael Albrecht, Denton Shanks, Tina Shah, Taina Hudson, Jeffrey Thompson, Tanya Filardi, Kelli Wright, Greg Ator, Timothy Ryan Smith
Objective: This study assessed the effects of an ambient artificial intelligence (AI) documentation platform on clinicians' perceptions of documentation workflow. Materials and Methods: Pre- and post-implementation surveys evaluated ambulatory clinicians' perceptions of the impact of Abridge, an ambient AI documentation platform. Outcomes included clinical documentation burden, work after hours, clinician burnout, work satisfaction, and patient access. Data were analyzed using descriptive statistics and proportional odds logistic regression to compare changes for concordant questions across pre- and post-surveys. Covariate analysis examined the effects of specialty type and duration of use of the AI tool. Results: Survey response rates were 51.1% (94/181) pre-implementation and 75.9% (101/133) post-implementation. Clinician perceptions of ease of documentation workflow (OR = 6.91, 95% CI: 3.90 to 12.56, p<0.001) and of ease in completing notes (OR = 4.95, 95% CI: 2.87 to 8.69, p<0.001) improved significantly with usage of the AI tool. The majority of respondents agreed that the AI tool decreased documentation burden, decreased the time spent documenting outside clinical hours, reduced burnout risk, and increased job satisfaction, with 48% agreeing that an additional patient could be seen if needed. Clinician specialty type and number of days using the AI tool did not significantly affect survey responses. Discussion: Clinician experience and efficiency were dramatically improved with use of Abridge across a breadth of specialties.
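The abstract names proportional odds logistic regression for comparing ordered Likert responses pre- vs. post-implementation; in Python this is available as statsmodels' OrderedModel (a sketch with made-up column names and toy data, shown only to illustrate the method, not the study's analysis code):

```python
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# One row per survey response: `agreement` is an ordered 1-5 Likert item,
# `post` is 0 for pre-implementation and 1 for post-implementation.
df = pd.DataFrame({"agreement": [2, 3, 3, 4, 2, 3, 4, 4, 5, 5, 4, 5],
                   "post":      [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]})

model = OrderedModel(df["agreement"], df[["post"]], distr="logit")
result = model.fit(method="bfgs", disp=False)
print(result.summary())  # exp(coef on `post`) is the proportional odds ratio
```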
{"title":"Enhancing Clinical Documentation Workflow with Ambient Artificial Intelligence: Clinician Perspectives on Work Burden, Burnout, and Job Satisfaction","authors":"Michael Albrecht, Denton Shanks, Tina Shah, Taina Hudson, Jeffrey Thompson, Tanya Filardi, Kelli Wright, Greg Ator, Timothy Ryan Smith","doi":"10.1101/2024.08.12.24311883","DOIUrl":"https://doi.org/10.1101/2024.08.12.24311883","url":null,"abstract":"Objective: This study assessed the effects of an ambient artificial intelligence (AI) documentation platform on clinicians' perceptions of documentation workflow.\u0000Materials and Methods: A pre- and post-implementation survey evaluated ambulatory clinician perceptions on impact of Abridge, an ambient AI documentation platform. Outcomes included clinical documentation burden, work after-hours, clinician burnout, work satisfaction, and patient access. Data were analyzed using descriptive statistics and proportional odds logistic regression to compare changes for concordant questions across pre- and post-surveys. Covariate analysis examined effect of specialty type and duration of use of the AI tool. Results: Survey response rates were 51.1% (94/181) pre-implementation and 75.9% (101/133) post-implementation. Clinician perception of ease of documentation workflow (OR = 6.91, 95% CI: 3.90 to 12.56, p<0.001) and in completing notes associated with usage of the AI tool (OR = 4.95, 95% CI: 2.87 to 8.69, p<0.001) was significantly improved. The majority of respondents agreed that the AI tool decreased documentation burden, decreased the time spent documenting outside clinical hours, reduced burnout risk, and increased job satisfaction, with 48% agreeing that an additional patient could be seen if needed. Clinician specialty type and number of days using the AI tool did not significantly affect survey responses.\u0000Discussion: Clinician experience and efficiency was dramatically improved with use of Abridge across a breadth of specialties.","PeriodicalId":501454,"journal":{"name":"medRxiv - Health Informatics","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The MSPTDfast photoplethysmography beat detection algorithm: Design, benchmarking, and open-source distribution
Pub Date: 2024-08-26 · DOI: 10.1101/2024.08.23.24312514
Peter H Charlton, Erick Javier Arguello Prada, Jonathan Mant, Panicos A Kyriacou
Objective: Photoplethysmography is widely used for physiological monitoring, whether in clinical devices such as pulse oximeters or consumer devices such as smartwatches. A key step in the analysis of photoplethysmogram (PPG) signals is detecting heartbeats. The MSPTD algorithm has been found to be one of the most accurate PPG beat detection algorithms, but it is less computationally efficient than other algorithms. Therefore, the aim of this study was to develop a more efficient, open-source implementation of the MSPTD algorithm for PPG beat detection, named MSPTDfast (v.2). Approach: Five potential improvements to MSPTD were identified and evaluated on four datasets. MSPTDfast (v.2) was designed by incorporating each improvement that, on its own, reduced execution time whilst maintaining a high F1-score. After internal validation, MSPTDfast (v.2) was benchmarked against state-of-the-art beat detection algorithms on four additional datasets. Main results: MSPTDfast (v.2) incorporated two key improvements: pre-processing PPG signals to reduce the sampling frequency to 20 Hz, and only calculating scalogram scales corresponding to heart rates >30 bpm. During internal validation MSPTDfast (v.2) was found to have an execution time of between approximately one-third and one-twentieth of MSPTD, and a comparable F1-score. During benchmarking MSPTDfast (v.2) had the highest F1-score alongside MSPTD, and among the lowest execution times, with only MSPTDfast (v.1), qppgfast and MMPD (v.2) achieving shorter execution times. Significance: MSPTDfast (v.2) is an accurate and efficient PPG beat detection algorithm, available in an open-source Matlab toolbox.
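The two key optimizations can be made concrete: downsample the PPG to 20 Hz, and cap the local-maxima-scalogram scales at the beat spacing implied by a 30 bpm minimum heart rate. Below is a simplified AMPD-style sketch of this idea in Python, under those assumptions; it is not the authors' Matlab implementation of MSPTD:

```python
import numpy as np
from scipy.signal import resample

def detect_beats(ppg: np.ndarray, fs: float,
                 fs_target: float = 20.0, hr_min: float = 30.0) -> np.ndarray:
    # Optimization 1: downsample to 20 Hz.
    n = int(len(ppg) * fs_target / fs)
    x = resample(ppg, n)
    # Optimization 2: only scales up to half the longest beat interval (HR > 30 bpm).
    k_max = int(fs_target * (60.0 / hr_min) / 2)
    # Local-maxima scalogram: m[k-1, i] is True if x[i] exceeds both k-distant neighbours.
    m = np.zeros((k_max, n), dtype=bool)
    for k in range(1, k_max + 1):
        m[k - 1, k:n - k] = (x[k:n - k] > x[:n - 2 * k]) & (x[k:n - k] > x[2 * k:])
    # Keep scales up to the one with the most local maxima, then require
    # a maximum at every kept scale.
    k_opt = int(np.argmax(m.sum(axis=1))) + 1
    peaks = np.flatnonzero(m[:k_opt].all(axis=0))
    return peaks  # sample indices of detected beats at fs_target
```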
{"title":"The MSPTDfast photoplethysmography beat detection algorithm: Design, benchmarking, and open-source distribution","authors":"Peter H Charlton, Erick Javier Arguello Prada, Jonathan Mant, Panicos A Kyriacou","doi":"10.1101/2024.08.23.24312514","DOIUrl":"https://doi.org/10.1101/2024.08.23.24312514","url":null,"abstract":"Objective: Photoplethysmography is widely used for physiological monitoring, whether in clinical devices such as pulse oximeters, or consumer devices such as smartwatches. A key step in the analysis of photoplethysmogram (PPG) signals is detecting heartbeats. The MSPTD algorithm has been found to be one of the most accurate PPG beat detection algorithms, but is less computationally efficient than other algorithms. Therefore, the aim of this study was to develop a more efficient, open-source implementation of the MSPTD algorithm for PPG beat detection, named MSPTDfast (v.2). Approach: Five potential improvements to MSPTD were identified and evaluated on four datasets. MSPTDfast (v.2) was designed by incorporating each improvement which on its own reduced execu- tion time whilst maintaining a high F1-score. After internal validation, MSPTDfast (v.2) was benchmarked against state-of-the-art beat detection algorithms on four additional datasets. Main results: MSPTDfast (v.2) incorporated two key improvements: pre-processing PPG signals to reduce the sampling frequency to 20 Hz; and only calculating scalogram scales corresponding to heart rates >30 bpm. During internal validation MSPTDfast (v.2) was found to have an execution time of between approximately one-third and one-twentieth of MSPTD, and a comparable F1-score. During benchmarking MSPTDfast (v.2) was found to have the highest F1-score alongside MSPTD, and amongst one of the lowest execution times with only MSPTDfast (v.1), qppgfast and MMPD (v.2) achieving shorter execution times. Significance: MSPTDfast (v.2) is an accurate and efficient PPG beat detection algorithm, available in an open-source Matlab toolbox.","PeriodicalId":501454,"journal":{"name":"medRxiv - Health Informatics","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Leveraging Large Language Models for Identifying Interpretable Linguistic Markers and Enhancing Alzheimer's Disease Diagnostics
Pub Date: 2024-08-23 · DOI: 10.1101/2024.08.22.24312463
Tingyu Mo, Jacqueline Lam, Victor Li, Lawrence Cheung
Alzheimer's disease (AD) is a progressive and irreversible neurodegenerative disorder. Early detection of AD is crucial for timely disease intervention. This study proposes a novel LLM framework, which extracts interpretable linguistic markers from LLMs and incorporates them into supervised AD detection models, while evaluating model performance and interpretability. Our work consists of the following novelties. First, we design in-context few-shot and zero-shot prompting strategies to facilitate LLMs in extracting high-level linguistic markers discriminative of AD and normal controls (NC), providing interpretation and assessment of their strength, reliability, and relevance to AD classification. Second, we incorporate the linguistic markers extracted by LLMs into a smaller AI-driven model to enhance the performance of downstream supervised learning for AD classification, by assigning higher weights to the high-level linguistic markers/features extracted from LLMs. Third, we investigate whether the linguistic markers extracted by LLMs can enhance the accuracy and interpretability of downstream supervised learning-based models for AD detection. Our findings suggest that the accuracy of the supervised learning model led by LLM-extracted linguistic markers is less desirable than that of counterparts that do not incorporate LLM-extracted markers, highlighting the trade-offs between interpretability and accuracy in supervised AD classification. Although the use of these interpretable markers may not immediately improve detection accuracy, they significantly improve medical diagnosis and trustworthiness. These interpretable markers allow healthcare professionals to gain a deeper understanding of the linguistic changes that occur in individuals with AD, enabling them to make more informed decisions and provide better patient care.
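The in-context few-shot strategy described amounts to prepending labelled examples before the new case in the prompt; a minimal template sketch follows, in which the transcript snippets and marker names are invented for illustration and do not come from the study:

```python
FEW_SHOT_EXAMPLES = """Transcript: "I went to the... the place where you buy food."
Markers: word-finding difficulty; circumlocution. Label: AD

Transcript: "I drove to the supermarket and bought groceries for the week."
Markers: fluent, specific lexical choices. Label: NC
"""

def build_prompt(transcript: str) -> str:
    """Assemble a few-shot prompt asking an LLM for interpretable markers."""
    return (
        "You are a clinical-linguistics assistant. Identify interpretable "
        "linguistic markers of Alzheimer's disease (AD) versus normal "
        "controls (NC), and rate each marker's strength and reliability.\n\n"
        + FEW_SHOT_EXAMPLES
        + f'\nTranscript: "{transcript}"\nMarkers:'
    )
```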
Health Data Nexus: An Open Data Platform for AI Research and Education in Medicine
Pub Date: 2024-08-23 · DOI: 10.1101/2024.08.23.24312060
January L Adams, Rafal Cymerys, Karol Szuster, Daniel Hekman, Zoryana Salo, Rutvik Solanki, Muhammad Mamdani, Alistair Johnson, Katarzyna Ryniak, Tom L Pollard, David Rotenberg, Benjamin Haibe-Kains
We outline the development of the Health Data Nexus, a data platform which enables data storage and access management with a cloud-based computational environment. We describe the importance of this secure platform in an evolving public sector research landscape that utilizes significant quantities of data, particularly clinical data acquired from health systems, as well as the importance of providing meaningful benefits for three targeted user groups: data providers, researchers, and educators. We then describe the implementation of governance practices, technical standards, and data security and privacy protections needed to build this platform, as well as example use-cases highlighting the strengths of the platform in facilitating dataset acquisition, novel research, and hosting educational courses, workshops, and datathons. Finally, we discuss the key principles that informed the platform's development, highlighting the importance of flexible uses, collaborative development, and open-source science.
U-Net as a deep learning-based method for platelets segmentation in microscopic images
Pub Date: 2024-08-23 · DOI: 10.1101/2024.08.23.24312502
Eva Maria Valerio de Sousa, Ajay Kumar, Charlie Coupland, Tânia F. Vaz, Will Jones, Rubén Valcarce-Diñeiro, Simon D.J. Calaminus
Manual counting of platelets in microscopy images is greatly time-consuming. Our goal was to automatically segment and count platelets in these images using a deep learning approach, applying U-Net and Fully Convolutional Network (FCN) modelling. Data preprocessing was done by creating binary masks and utilizing supervised learning with ground-truth labels. Data augmentation was implemented for improved model robustness and detection. The number of detected regions was then retrieved as a count. The study investigated the U-Net model's performance with different datasets, indicating notable improvements in segmentation metrics as the dataset size increased, while FCN performance was only evaluated with the smaller dataset and abandoned due to poor results. U-Net surpassed FCN in both detection and counting measures on the smaller dataset (Dice 0.90, accuracy 0.96 for U-Net vs. Dice 0.60, accuracy 0.81 for FCN). When tested on a bigger dataset, U-Net produced even better values (Dice 0.99, accuracy 0.98). The U-Net model proves particularly effective as the dataset size increases, showcasing its versatility and accuracy in handling varying cell sizes and appearances. These data show potential areas for further improvement and the promising application of deep learning in automating cell segmentation for diverse life science research applications.
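The final counting step the abstract describes (retrieving the number of detected regions as a count) is typically a connected-component labelling over the thresholded network output; a generic sketch with scipy follows, where the 0.5 threshold is an assumption rather than a value from the paper:

```python
import numpy as np
from scipy import ndimage

def count_platelets(prob_map: np.ndarray, threshold: float = 0.5) -> int:
    """Count connected foreground regions in a predicted probability map."""
    mask = prob_map > threshold          # binarize the U-Net output
    _, n_regions = ndimage.label(mask)   # 4-connected components by default
    return n_regions
```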
{"title":"U-Net as a deep learning-based method for platelets segmentation in microscopic images","authors":"Eva Maria Valerio de Sousa, Ajay Kumar, Charlie Coupland, Tânia F. Vaz, Will Jones, Rubén Valcarce-Diñeiro, Simon D.J. Calaminus","doi":"10.1101/2024.08.23.24312502","DOIUrl":"https://doi.org/10.1101/2024.08.23.24312502","url":null,"abstract":"Manual counting of platelets, in microscopy images, is greatly time-consuming. Our goal was to automatically segment and count platelets images using a deep learning approach, applying U-Net and Fully Convolutional Network (FCN) modelling. Data preprocessing was done by creating binary masks and utilizing supervised learning with ground-truth labels. Data augmentation was implemented, for improved model robustness and detection. The number of detected regions was then retrieved as a count. The study investigated the U-Net models performance with different datasets, indicating notable improvements in segmentation metrics as the dataset size increased, while FCN performance was only evaluated with the smaller dataset and abandoned due to poor results. U-Net surpassed FCN in both detection and counting measures in the smaller dataset Dice 0.90, accuracy of 0.96 (U-Net) vs Dice 0.60 and 0.81 (FCN). When tested in a bigger dataset U-Net produced even better values (Dice 0.99, accuracy of 0.98). The U-Net model proves to be particularly effective as the dataset size increases, showcasing its versatility and accuracy in handling varying cell sizes and appearances. These data show potential areas for further improvement and the promising application of deep learning in automating cell segmentation for diverse life science research applications.","PeriodicalId":501454,"journal":{"name":"medRxiv - Health Informatics","volume":"45 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}