Emre Sezgin, Joseph Winstead Sirrianni, Kelly Kranz
Applied Clinical Informatics, published 2024-05-15. DOI: 10.1055/a-2327-4121
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11268986/pdf/
Evaluation of a Digital Scribe: Conversation Summarization for Emergency Department Consultation Calls.
Objective: We present a proof-of-concept digital scribe system, a pipeline that summarizes clinical conversations from Emergency Department (ED) consultation calls to support clinical documentation, and report its performance.
Materials and methods: We establish the digital scribe system with four pre-trained large language models: T5-small, T5-base, PEGASUS-PubMed, and BART-Large-CNN, using zero-shot and fine-tuning approaches. Our dataset includes 100 referral conversations among ED clinicians, along with the associated medical records. We report ROUGE-1, ROUGE-2, and ROUGE-L scores to compare model performance. In addition, we annotated the transcriptions to assess the quality of the generated summaries.
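The ROUGE metrics used here measure n-gram overlap (ROUGE-1, ROUGE-2) and longest-common-subsequence overlap (ROUGE-L) between a generated summary and a reference. The following is a minimal, self-contained sketch of these F1 variants for illustration; the study itself does not specify an implementation, and production work would typically use an established ROUGE package.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset (Counter) of n-grams from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(reference, candidate, n):
    """ROUGE-N F1: n-gram overlap between reference and candidate summaries."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    if not ref or not cand:
        return 0.0
    overlap = sum((ref & cand).values())  # clipped n-gram match count
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(reference, candidate):
    """ROUGE-L F1: longest-common-subsequence overlap."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    recall, precision = lcs / len(ref), lcs / len(cand)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

For example, scoring a candidate summary that drops one word from a six-word reference yields a ROUGE-1 F1 of about 0.91, while scores like the 0.49 reported below reflect much looser overlap, as is typical for abstractive clinical summaries.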
Results: The fine-tuned BART-Large-CNN model achieves the strongest summarization performance, with the highest ROUGE scores (F1 ROUGE-1 = 0.49, F1 ROUGE-2 = 0.23, F1 ROUGE-L = 0.35). In contrast, PEGASUS-PubMed lags notably (F1 ROUGE-1 = 0.28, F1 ROUGE-2 = 0.11, F1 ROUGE-L = 0.22). BART-Large-CNN's performance decreases by more than 50% under the zero-shot approach. Annotations show that BART-Large-CNN achieves 71.4% recall in identifying key information and a 67.7% accuracy rate.
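Aggregate recall and accuracy figures of the kind reported from the annotation analysis can be pooled across annotated transcripts along these lines. This is a hypothetical sketch only: the field names, the per-transcript count structure, and the pooled (micro-averaged) aggregation are illustrative assumptions, not the study's actual annotation protocol.

```python
def annotation_metrics(annotations):
    """Pool per-transcript annotation counts into overall recall and accuracy.

    Each entry is assumed (hypothetically) to carry:
      key_items - key-information items present in the source conversation
      captured  - items the generated summary mentioned
      correct   - captured items judged factually correct by the annotator
    """
    total_key = sum(a["key_items"] for a in annotations)
    captured = sum(a["captured"] for a in annotations)
    correct = sum(a["correct"] for a in annotations)
    recall = captured / total_key if total_key else 0.0      # key info identified
    accuracy = correct / captured if captured else 0.0       # correctness of what was captured
    return recall, accuracy
```

Micro-averaging (summing counts before dividing) weights transcripts by how much key information they contain; macro-averaging per transcript would be an equally reasonable design choice.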
Discussion: The BART-Large-CNN model demonstrates a high level of understanding of clinical dialogue structure, indicated by its performance with and without fine-tuning. Despite some instances of high recall, there is variability in the model's performance, particularly in achieving consistent correctness, suggesting room for refinement. The model's recall ability varies across different information categories.
Conclusion: The study provides evidence for the potential of AI-assisted tools to support clinical documentation. Future work should expand the research scope with additional language models and hybrid approaches, and include comparative analyses to measure documentation burden and human factors.
Journal introduction:
ACI is the third Schattauer journal dealing with biomedical and health informatics. It complements our other journals, Methods of Information in Medicine and the Yearbook of Medical Informatics. With the Yearbook of Medical Informatics serving as the "milestone" or state-of-the-art journal and Methods of Information in Medicine as the "science and research" journal of IMIA, ACI intends to be the "practical" journal of IMIA.