Paragraph-level Simplification of Medical Texts

Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting. Pub Date: 2021-04-12. DOI: 10.18653/V1/2021.NAACL-MAIN.395

Ashwin Devaraj, I. Marshall, Byron C. Wallace, J. Li

Pages: 4972-4984

Citations: 47

Abstract

We consider the problem of learning to simplify medical texts. This is important because most reliable, up-to-date information in biomedicine is dense with jargon and thus practically inaccessible to the lay audience. Furthermore, manual simplification does not scale to the rapidly growing body of biomedical literature, motivating the need for automated approaches. Unfortunately, there are no large-scale resources available for this task. In this work we introduce a new corpus of parallel texts in English comprising technical and lay summaries of all published evidence pertaining to different clinical topics. We then propose a new metric based on likelihood scores from a masked language model pretrained on scientific texts. We show that this automated measure better differentiates between technical and lay summaries than existing heuristics. We introduce and evaluate baseline encoder-decoder Transformer models for simplification and propose a novel augmentation to these in which we explicitly penalize the decoder for producing “jargon” terms; we find that this yields improvements over baselines in terms of readability.
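The abstract does not say what form the decoder penalty takes. As a minimal sketch only, assume it is an extra loss term that grows with the probability the decoder assigns to tokens from a jargon list, added to the usual cross-entropy. The function names, the weight `alpha`, and the toy vocabulary below are all hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def jargon_penalized_loss(step_logits, target_ids, jargon_ids, alpha=1.0):
    """Mean per-token loss: cross-entropy plus a penalty on jargon probability.

    step_logits: one list of vocabulary scores per decoding step.
    target_ids:  gold token id for each step.
    jargon_ids:  set of token ids treated as jargon (assumed to be given).
    alpha:       weight of the penalty term (hypothetical hyperparameter).
    """
    total = 0.0
    for logits, target in zip(step_logits, target_ids):
        probs = softmax(logits)
        # Standard negative log-likelihood of the gold token.
        nll = -math.log(probs[target])
        # Penalty: -log(1 - p) for each jargon token, skipping the gold token
        # so the two terms never pull in opposite directions on the same id.
        penalty = 0.0
        for j in jargon_ids:
            if j != target:
                penalty -= math.log(max(1.0 - probs[j], 1e-12))
        total += nll + alpha * penalty
    return total / len(target_ids)


# Toy usage: a 4-token vocabulary in which token 3 is marked as jargon.
toy_logits = [[2.0, 0.0, 0.0, 1.0]]
plain = jargon_penalized_loss(toy_logits, [0], {3}, alpha=0.0)
penalized = jargon_penalized_loss(toy_logits, [0], {3}, alpha=1.0)
```

With `alpha=0.0` the function reduces to ordinary cross-entropy, so the difference between `penalized` and `plain` isolates how much the decoder is charged for the probability it puts on the jargon token at that step.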
