A Meta-Evaluation of Faithfulness Metrics for Long-Form Hospital-Course Summarization.

Griffin Adams, Jason Zucker, Noémie Elhadad
{"title":"A Meta-Evaluation of Faithfulness Metrics for Long-Form Hospital-Course Summarization.","authors":"Griffin Adams, Jason Zucker, Noémie Elhadad","doi":"","DOIUrl":null,"url":null,"abstract":"<p><p>Long-form clinical summarization of hospital admissions has real-world significance because of its potential to help both clinicians and patients. The factual consistency of summaries-their faithfulness-is critical to their safe usage in clinical settings. To better understand the limitations of state-of-the-art natural language processing (NLP) systems, as well as the suitability of existing evaluation metrics, we benchmark faithfulness metrics against fine-grained human annotations for model-generated summaries of a patient's Brief Hospital Course. We create a corpus of patient hospital admissions and summaries for a cohort of HIV patients, each with complex medical histories. Annotators are presented with summaries and source notes, and asked to categorize manually highlighted summary elements (clinical entities like conditions and medications as well as actions like \"following up\") into one of three categories: \"Incorrect,\" \"Missing,\" and \"Not in Notes.\" We meta-evaluate a broad set of faithfulness metrics-proposed for the general NLP domain-by measuring the correlation of metric scores to clinician ratings. Across metrics, we explore the importance of domain adaptation (e.g. the impact of in-domain pre-training and metric fine-tuning), the use of source-summary alignments, and the effects of distilling a single metric from an ensemble. We find that off-the-shelf metrics with no exposure to clinical text correlate well to clinician ratings yet overly rely on copy-and-pasted text. As a practical guide, we observe that most metrics correlate best to clinicians when provided with one summary sentence at a time and a minimal set of supporting sentences from the notes before discharge.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"219 ","pages":"2-30"},"PeriodicalIF":0.0000,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11441639/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of machine learning research","FirstCategoryId":"1085","ListUrlMain":"","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Long-form clinical summarization of hospital admissions has real-world significance because of its potential to help both clinicians and patients. The factual consistency of summaries, their faithfulness, is critical to their safe usage in clinical settings. To better understand the limitations of state-of-the-art natural language processing (NLP) systems, as well as the suitability of existing evaluation metrics, we benchmark faithfulness metrics against fine-grained human annotations for model-generated summaries of a patient's Brief Hospital Course. We create a corpus of patient hospital admissions and summaries for a cohort of HIV patients, each with complex medical histories. Annotators are presented with summaries and source notes, and asked to categorize manually highlighted summary elements (clinical entities like conditions and medications as well as actions like "following up") into one of three categories: "Incorrect," "Missing," and "Not in Notes." We meta-evaluate a broad set of faithfulness metrics, proposed for the general NLP domain, by measuring the correlation of metric scores to clinician ratings. Across metrics, we explore the importance of domain adaptation (e.g., the impact of in-domain pre-training and metric fine-tuning), the use of source-summary alignments, and the effects of distilling a single metric from an ensemble. We find that off-the-shelf metrics with no exposure to clinical text correlate well to clinician ratings yet overly rely on copy-and-pasted text. As a practical guide, we observe that most metrics correlate best to clinicians when provided with one summary sentence at a time and a minimal set of supporting sentences from the notes before discharge.
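To make the meta-evaluation protocol concrete, the sketch below illustrates how per-sentence faithfulness scores could be correlated with clinician ratings, using the sentence-level setup the abstract describes (one summary sentence at a time plus a minimal set of supporting source sentences). This is not the paper's code: the Jaccard-overlap alignment, the token-recall placeholder metric, and the choice of Spearman correlation are illustrative assumptions only.

```python
"""Minimal sketch of a sentence-level faithfulness meta-evaluation.

Assumptions (not from the paper): the `faithfulness_score` placeholder,
the Jaccard-overlap alignment, and Spearman correlation are illustrative
stand-ins for the metrics and alignment strategies actually studied.
"""
from typing import List
from scipy.stats import spearmanr


def align_source_sentences(summary_sent: str,
                           source_sents: List[str],
                           top_k: int = 3) -> List[str]:
    """Select a minimal set of supporting source sentences for one
    summary sentence, here ranked by simple token-overlap (Jaccard)."""
    summ_tokens = set(summary_sent.lower().split())
    scored = []
    for sent in source_sents:
        src_tokens = set(sent.lower().split())
        union = summ_tokens | src_tokens
        overlap = len(summ_tokens & src_tokens) / len(union) if union else 0.0
        scored.append((overlap, sent))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sent for _, sent in scored[:top_k]]


def faithfulness_score(summary_sent: str, support: List[str]) -> float:
    """Placeholder metric: fraction of summary tokens found in the aligned
    support. A real study would plug in the metrics under evaluation."""
    summ_tokens = summary_sent.lower().split()
    support_tokens = set(" ".join(support).lower().split())
    if not summ_tokens:
        return 0.0
    return sum(tok in support_tokens for tok in summ_tokens) / len(summ_tokens)


def meta_evaluate(summary_sents: List[str],
                  source_sents: List[str],
                  clinician_ratings: List[float]) -> float:
    """Correlate per-sentence metric scores with clinician ratings
    (e.g. 1.0 = faithful, 0.0 = flagged as unfaithful)."""
    metric_scores = []
    for sent in summary_sents:
        support = align_source_sentences(sent, source_sents)
        metric_scores.append(faithfulness_score(sent, support))
    rho, _ = spearmanr(metric_scores, clinician_ratings)
    return rho


if __name__ == "__main__":
    # Hypothetical toy example, not real patient data.
    source = [
        "Patient admitted with community-acquired pneumonia.",
        "Started on IV ceftriaxone and azithromycin.",
        "Discharged home with oral antibiotics and PCP follow-up.",
    ]
    summary = [
        "Admitted with community-acquired pneumonia.",
        "Treated with IV vancomycin.",  # unsupported claim
    ]
    ratings = [1.0, 0.0]  # hypothetical clinician judgments
    rho = meta_evaluate(summary, source, ratings)
    print(f"Spearman rho vs. clinician ratings: {rho:.2f}")
```

In practice, the placeholder metric would be replaced by the off-the-shelf or fine-tuned faithfulness metrics under study, and the ratings would be derived from the "Incorrect," "Missing," and "Not in Notes" annotations described above.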
