医学文本的分段简化

Ashwin Devaraj, I. Marshall, Byron C. Wallace, J. Li
{"title":"医学文本的分段简化","authors":"Ashwin Devaraj, I. Marshall, Byron C. Wallace, J. Li","doi":"10.18653/V1/2021.NAACL-MAIN.395","DOIUrl":null,"url":null,"abstract":"We consider the problem of learning to simplify medical texts. This is important because most reliable, up-to-date information in biomedicine is dense with jargon and thus practically inaccessible to the lay audience. Furthermore, manual simplification does not scale to the rapidly growing body of biomedical literature, motivating the need for automated approaches. Unfortunately, there are no large-scale resources available for this task. In this work we introduce a new corpus of parallel texts in English comprising technical and lay summaries of all published evidence pertaining to different clinical topics. We then propose a new metric based on likelihood scores from a masked language model pretrained on scientific texts. We show that this automated measure better differentiates between technical and lay summaries than existing heuristics. We introduce and evaluate baseline encoder-decoder Transformer models for simplification and propose a novel augmentation to these in which we explicitly penalize the decoder for producing “jargon” terms; we find that this yields improvements over baselines in terms of readability.","PeriodicalId":74542,"journal":{"name":"Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting","volume":"7 1","pages":"4972-4984"},"PeriodicalIF":0.0000,"publicationDate":"2021-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"47","resultStr":"{\"title\":\"Paragraph-level Simplification of Medical Texts\",\"authors\":\"Ashwin Devaraj, I. Marshall, Byron C. Wallace, J. Li\",\"doi\":\"10.18653/V1/2021.NAACL-MAIN.395\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We consider the problem of learning to simplify medical texts. This is important because most reliable, up-to-date information in biomedicine is dense with jargon and thus practically inaccessible to the lay audience. Furthermore, manual simplification does not scale to the rapidly growing body of biomedical literature, motivating the need for automated approaches. Unfortunately, there are no large-scale resources available for this task. In this work we introduce a new corpus of parallel texts in English comprising technical and lay summaries of all published evidence pertaining to different clinical topics. We then propose a new metric based on likelihood scores from a masked language model pretrained on scientific texts. We show that this automated measure better differentiates between technical and lay summaries than existing heuristics. We introduce and evaluate baseline encoder-decoder Transformer models for simplification and propose a novel augmentation to these in which we explicitly penalize the decoder for producing “jargon” terms; we find that this yields improvements over baselines in terms of readability.\",\"PeriodicalId\":74542,\"journal\":{\"name\":\"Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting\",\"volume\":\"7 1\",\"pages\":\"4972-4984\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-04-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"47\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.18653/V1/2021.NAACL-MAIN.395\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/V1/2021.NAACL-MAIN.395","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 47

摘要

我们考虑学习简化医学文本的问题。这一点很重要,因为大多数可靠的、最新的生物医学信息都充斥着行话,因此外行读者实际上无法理解。此外,人工简化并不适用于快速增长的生物医学文献,这促使人们需要自动化方法。不幸的是,没有大规模的资源可用于此任务。在这项工作中,我们介绍了一个新的语料库平行文本的英语,包括技术和lay总结所有已发表的证据有关不同的临床主题。然后,我们提出了一个基于基于科学文本预训练的屏蔽语言模型的似然分数的新度量。我们表明,这种自动度量比现有的启发式更好地区分了技术摘要和外行摘要。我们引入并评估了基线编码器-解码器转换器模型以简化,并提出了一种新的增强方法,其中我们明确地惩罚解码器产生“术语”术语;我们发现,这在可读性方面比基线有所提高。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Paragraph-level Simplification of Medical Texts
We consider the problem of learning to simplify medical texts. This is important because most reliable, up-to-date information in biomedicine is dense with jargon and thus practically inaccessible to the lay audience. Furthermore, manual simplification does not scale to the rapidly growing body of biomedical literature, motivating the need for automated approaches. Unfortunately, there are no large-scale resources available for this task. In this work we introduce a new corpus of parallel texts in English comprising technical and lay summaries of all published evidence pertaining to different clinical topics. We then propose a new metric based on likelihood scores from a masked language model pretrained on scientific texts. We show that this automated measure better differentiates between technical and lay summaries than existing heuristics. We introduce and evaluate baseline encoder-decoder Transformer models for simplification and propose a novel augmentation to these in which we explicitly penalize the decoder for producing “jargon” terms; we find that this yields improvements over baselines in terms of readability.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
ODD: A Benchmark Dataset for the Natural Language Processing Based Opioid Related Aberrant Behavior Detection. Towards Reducing Diagnostic Errors with Interpretable Risk Prediction. ScAN: Suicide Attempt and Ideation Events Dataset. ScAN: Suicide Attempt and Ideation Events Dataset Translational NLP: A New Paradigm and General Principles for Natural Language Processing Research.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1