
Latest publications in Proceedings of the conference. Association for Computational Linguistics. Meeting

Overview of the Problem List Summarization (ProbSum) 2023 Shared Task on Summarizing Patients’ Active Diagnoses and Problems from Electronic Health Record Progress Notes
Pub Date: 2023-06-08 DOI: 10.48550/arXiv.2306.05270
Yanjun Gao, Dmitriy Dligach, Timothy Miller, M. Churpek, M. Afshar
The BioNLP Workshop 2023 initiated the launch of a shared task on Problem List Summarization (ProbSum) in January 2023. The aim of this shared task is to attract future research efforts in building NLP models for real-world diagnostic decision support applications, where a system generating relevant and accurate diagnoses will augment the healthcare providers’ decision-making process and improve the quality of care for patients. The goal for participants is to develop models that generate a list of diagnoses and problems using input from the daily care notes collected from the hospitalization of critically ill patients. Eight teams submitted their final systems to the shared task leaderboard. In this paper, we describe the tasks, datasets, evaluation metrics, and baseline systems. Additionally, the techniques and results of the evaluation of the different approaches tried by the participating teams are summarized.
Citations: 6
Multi-Task Training with In-Domain Language Models for Diagnostic Reasoning
Pub Date: 2023-06-07 DOI: 10.48550/arXiv.2306.04551
B. Sharma, Yanjun Gao, Timothy Miller, M. Churpek, M. Afshar, Dmitriy Dligach
Generative artificial intelligence (AI) is a promising direction for augmenting clinical diagnostic decision support and reducing diagnostic errors, a leading contributor to medical errors. To further the development of clinical AI systems, the Diagnostic Reasoning Benchmark (DR.BENCH) was introduced as a comprehensive generative AI framework, comprised of six tasks representing key components in clinical reasoning. We present a comparative analysis of in-domain versus out-of-domain language models as well as multi-task versus single task training with a focus on the problem summarization task in DR.BENCH. We demonstrate that a multi-task, clinically-trained language model outperforms its general domain counterpart by a large margin, establishing a new state-of-the-art performance, with a ROUGE-L score of 28.55. This research underscores the value of domain-specific training for optimizing clinical diagnostic reasoning tasks.
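The headline ROUGE-L figure above is an automatic overlap metric. As a rough illustration (not the authors' evaluation script), ROUGE-L between a reference problem list and a model output can be computed with the open-source rouge-score package; the example strings below are invented.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
reference = "acute hypoxic respiratory failure; septic shock; acute kidney injury"
prediction = "septic shock; acute respiratory failure"

# score(target, prediction) returns precision/recall/F1 for each requested metric
result = scorer.score(reference, prediction)["rougeL"]
print(f"ROUGE-L F1: {result.fmeasure:.4f}")
```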
Citations: 1
Less Likely Brainstorming: Using Language Models to Generate Alternative Hypotheses
Pub Date: 2023-05-30 DOI: 10.48550/arXiv.2305.19339
Liyan Tang, Yifan Peng, Yanshan Wang, Ying Ding, Greg Durrett, Justin F. Rousseau
A human decision-maker benefits the most from an AI assistant that corrects for their biases. For problems such as generating interpretation of a radiology report given findings, a system predicting only highly likely outcomes may be less useful, where such outcomes are already obvious to the user. To alleviate biases in human decision-making, it is worth considering a broad differential diagnosis, going beyond the most likely options. We introduce a new task, "less likely brainstorming," that asks a model to generate outputs that humans think are relevant but less likely to happen. We explore the task in two settings: a brain MRI interpretation generation setting and an everyday commonsense reasoning setting. We found that a baseline approach of training with less likely hypotheses as targets generates outputs that humans evaluate as either likely or irrelevant nearly half of the time; standard MLE training is not effective. To tackle this problem, we propose a controlled text generation method that uses a novel contrastive learning strategy to encourage models to differentiate between generating likely and less likely outputs according to humans. We compare our method with several state-of-the-art controlled text generation models via automatic and human evaluations and show that our models' capability of generating less likely outputs is improved.
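The abstract does not spell out the contrastive objective; the snippet below is only a generic margin-based sketch of the idea — encouraging a control-coded model to rank a human-labeled "less likely" continuation above a "likely" one — and is not the loss used in the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_control_loss(logp_less_likely: torch.Tensor,
                             logp_likely: torch.Tensor,
                             margin: float = 1.0) -> torch.Tensor:
    """Margin loss over per-sequence log-probabilities computed under the
    'less likely' control code: the less-likely target should score higher
    than the likely one by at least `margin`. Purely illustrative."""
    return F.relu(margin - (logp_less_likely - logp_likely)).mean()

# Toy usage with made-up sequence log-probabilities for a batch of three examples.
loss = contrastive_control_loss(torch.tensor([-12.0, -9.5, -15.0]),
                                torch.tensor([-10.0, -11.0, -14.0]))
print(loss.item())
```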
Citations: 3
Revisiting Relation Extraction in the era of Large Language Models
Pub Date: 2023-05-08 DOI: 10.48550/arXiv.2305.05003
Somin Wadhwa, Silvio Amir, Byron C. Wallace
Relation extraction (RE) is the core NLP task of inferring semantic relationships between entities from text. Standard supervised RE techniques entail training modules to tag tokens comprising entity spans and then predict the relationship between them. Recent work has instead treated the problem as a sequence-to-sequence task, linearizing relations between entities as target strings to be generated conditioned on the input. Here we push the limits of this approach, using larger language models (GPT-3 and Flan-T5 large) than considered in prior work and evaluating their performance on standard RE tasks under varying levels of supervision. We address issues inherent to evaluating generative approaches to RE by doing human evaluations, in lieu of relying on exact matching. Under this refined evaluation, we find that: (1) Few-shot prompting with GPT-3 achieves near SOTA performance, i.e., roughly equivalent to existing fully supervised models; (2) Flan-T5 is not as capable in the few-shot setting, but supervising and fine-tuning it with Chain-of-Thought (CoT) style explanations (generated via GPT-3) yields SOTA results. We release this model as a new baseline for RE tasks.
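As background for the sequence-to-sequence framing mentioned above, relation triples can be flattened into a target string that a generative model learns to emit; the template below is an assumption for illustration, not the format used in the paper.

```python
def linearize_relations(triples):
    """Flatten (head, relation, tail) triples into a single target string
    for a seq2seq or prompted generative model. Illustrative template only."""
    return " ; ".join(f"{head} | {rel} | {tail}" for head, rel, tail in triples)

source = "Aspirin reduces the risk of myocardial infarction."
target = linearize_relations([("Aspirin", "reduces risk of", "myocardial infarction")])
print(target)  # Aspirin | reduces risk of | myocardial infarction
```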
Citations: 13
Automatically Summarizing Evidence from Clinical Trials: A Prototype Highlighting Current Challenges.
Sanjana Ramprasad, Iain J Marshall, Denis Jered McInerney, Byron C Wallace

We present TrialsSummarizer, a system that aims to automatically summarize evidence presented in the set of randomized controlled trials most relevant to a given query. Building on prior work (Marshall et al., 2020), the system retrieves trial publications matching a query specifying a combination of condition, intervention(s), and outcome(s), and ranks these according to sample size and estimated study quality. The top-k such studies are passed through a neural multi-document summarization system, yielding a synopsis of these trials. We consider two architectures: A standard sequence-to-sequence model based on BART (Lewis et al., 2019), and a multi-headed architecture intended to provide greater transparency to end-users. Both models produce fluent and relevant summaries of evidence retrieved for queries, but their tendency to introduce unsupported statements renders them inappropriate for use in this domain at present. The proposed architecture may help users verify outputs, allowing users to trace generated tokens back to inputs. The demonstration video is available at https://vimeo.com/735605060, and the prototype, source code, and model weights are available at: https://sanjanaramprasad.github.io/trials-summarizer/.
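As a rough sketch of the standard sequence-to-sequence setup described above: the snippet uses a generic off-the-shelf BART checkpoint, naively concatenated inputs, and invented trial snippets — not the released TrialsSummarizer weights, which are linked in the abstract.

```python
# pip install transformers torch
from transformers import BartForConditionalGeneration, BartTokenizer

model_name = "facebook/bart-large-cnn"  # generic summarization checkpoint, not TrialsSummarizer
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

# Invented snippets standing in for the top-k retrieved trial reports.
trials = [
    "Trial A: drug X lowered systolic blood pressure by 8 mmHg versus placebo.",
    "Trial B: drug X showed no significant effect on all-cause mortality.",
]
inputs = tokenizer(" ".join(trials), return_tensors="pt", truncation=True, max_length=1024)
summary_ids = model.generate(**inputs, num_beams=4, max_length=96)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```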

Citations: 0
IRMA: the 335-million-word Italian coRpus for studying MisinformAtion.
Fabio Carrella, Alessandro Miani, Stephan Lewandowsky

The dissemination of false information on the internet has received considerable attention over the last decade. Misinformation often spreads faster than mainstream news, thus making manual fact checking inefficient or, at best, labor-intensive. Therefore, there is an increasing need to develop methods for automatic detection of misinformation. Although resources for creating such methods are available in English, other languages are often underrepresented in this effort. With this contribution, we present IRMA, a corpus containing over 600,000 Italian news articles (335+ million tokens) collected from 56 websites classified as 'untrustworthy' by professional fact-checkers. The corpus is freely available and comprises a rich set of text- and website-level data, representing a turnkey resource to test hypotheses and develop automatic detection algorithms. It contains texts, titles, and dates (from 2004 to 2022), along with three types of semantic measures (i.e., keywords, topics at three different resolutions, and LIWC lexical features). IRMA also includes domain-specific information such as source type (e.g., political, health, conspiracy, etc.), quality, and higher-level metadata, including several metrics of website incoming traffic that allow investigation of user online behavior. IRMA constitutes the largest corpus of misinformation available today in Italian, making it a valid tool for advancing quantitative research on untrustworthy news detection and ultimately helping limit the spread of misinformation.
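A hypothetical usage sketch: the abstract does not specify the corpus's distribution format, so the file name and column names below ("source_type", "date") are placeholders, not the actual IRMA schema.

```python
import pandas as pd

# Placeholder file and column names; consult the IRMA release for the real schema.
df = pd.read_csv("irma_articles.csv")
health_2020 = df[(df["source_type"] == "health")
                 & (df["date"].between("2020-01-01", "2020-12-31"))]
print(f"{len(health_2020)} articles from health-related untrustworthy sources in 2020")
```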

Citations: 0
Automatically Summarizing Evidence from Clinical Trials: A Prototype Highlighting Current Challenges
Pub Date: 2023-03-07 DOI: 10.48550/arXiv.2303.05392
S. Ramprasad, Denis Jered McInerney, Iain J. Marshal, Byron Wallace
In this work we present TrialsSummarizer, a system that aims to automatically summarize evidence presented in the set of randomized controlled trials most relevant to a given query. Building on prior work, the system retrieves trial publications matching a query specifying a combination of condition, intervention(s), and outcome(s), and ranks these according to sample size and estimated study quality. The top-k such studies are passed through a neural multi-document summarization system, yielding a synopsis of these trials. We consider two architectures: A standard sequence-to-sequence model based on BART, and a multi-headed architecture intended to provide greater transparency and controllability to end-users. Both models produce fluent and relevant summaries of evidence retrieved for queries, but their tendency to introduce unsupported statements renders them inappropriate for use in this domain at present. The proposed architecture may help users verify outputs, allowing users to trace generated tokens back to inputs. The demonstration video can be found at https://vimeo.com/735605060. The prototype, source code, and model weights are available at: https://sanjanaramprasad.github.io/trials-summarizer/
Citations: 3
Self-Repetition in Abstractive Neural Summarizers.
Nikita Salkar, Thomas Trikalinos, Byron C Wallace, Ani Nenkova

We provide a quantitative and qualitative analysis of self-repetition in the output of neural summarizers. We measure self-repetition as the number of n-grams of length four or longer that appear in multiple outputs of the same system. We analyze the behavior of three popular architectures (BART, T5 and Pegasus), fine-tuned on five datasets. In a regression analysis, we find that the three architectures have different propensities for repeating content across output summaries for inputs, with BART being particularly prone to self-repetition. Fine-tuning on more abstractive data, and on data featuring formulaic language, is associated with a higher rate of self-repetition. In qualitative analysis we find systems produce artefacts such as ads and disclaimers unrelated to the content being summarized, as well as formulaic phrases common in the fine-tuning domain. Our approach to corpus level analysis of self-repetition may help practitioners clean up training data for summarizers and ultimately support methods for minimizing the amount of self-repetition.
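The metric itself is easy to approximate from the description above: count n-grams of length four or longer that recur across different outputs of the same system. A minimal sketch follows; whitespace tokenization and the upper n-gram bound of 8 are assumptions, not details from the paper.

```python
from collections import Counter

def self_repetition(outputs, min_n=4, max_n=8):
    """Number of distinct n-grams (min_n <= n <= max_n) that appear in more
    than one output produced by the same summarization system."""
    counts = Counter()
    for text in outputs:
        tokens = text.lower().split()
        ngrams = set()
        for n in range(min_n, max_n + 1):
            ngrams |= {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        counts.update(ngrams)  # each output contributes a given n-gram at most once
    return sum(1 for c in counts.values() if c > 1)

outputs = [
    "the patient was discharged home in stable condition",
    "the patient was discharged home with outpatient follow up",
]
print(self_repetition(outputs))  # counts n-grams shared across the two outputs
```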

Citations: 0
Self-Repetition in Abstractive Neural Summarizers
Pub Date: 2022-10-14 DOI: 10.48550/arXiv.2210.08145
Nikita Salkar, T. Trikalinos, Byron C. Wallace, A. Nenkova
We provide a quantitative and qualitative analysis of self-repetition in the output of neural summarizers. We measure self-repetition as the number of n-grams of length four or longer that appear in multiple outputs of the same system. We analyze the behavior of three popular architectures (BART, T5, and Pegasus), fine-tuned on five datasets. In a regression analysis, we find that the three architectures have different propensities for repeating content across output summaries for inputs, with BART being particularly prone to self-repetition. Fine-tuning on more abstractive data, and on data featuring formulaic language is associated with a higher rate of self-repetition. In qualitative analysis, we find systems produce artefacts such as ads and disclaimers unrelated to the content being summarized, as well as formulaic phrases common in the fine-tuning domain. Our approach to corpus-level analysis of self-repetition may help practitioners clean up training data for summarizers and ultimately support methods for minimizing the amount of self-repetition.
Citations: 3
Evaluating Factuality in Text Simplification.
Pub Date: 2022-05-01 DOI: 10.18653/v1/2022.acl-long.506
Ashwin Devaraj, William Sheffield, Byron C Wallace, Junyi Jessy Li

Automated simplification models aim to make input texts more readable. Such methods have the potential to make complex information accessible to a wider audience, e.g., providing access to recent medical literature which might otherwise be impenetrable for a lay reader. However, such models risk introducing errors into automatically simplified texts, for instance by inserting statements unsupported by the corresponding original text, or by omitting key information. Providing more readable but inaccurate versions of texts may in many cases be worse than providing no such access at all. The problem of factual accuracy (and the lack thereof) has received heightened attention in the context of summarization models, but the factuality of automatically simplified texts has not been investigated. We introduce a taxonomy of errors that we use to analyze both references drawn from standard simplification datasets and state-of-the-art model outputs. We find that errors often appear in both that are not captured by existing evaluation metrics, motivating a need for research into ensuring the factual accuracy of automated simplification models.

Citations: 0