
Latest Publications: Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting

ScAN: Suicide Attempt and Ideation Events Dataset
Bhanu Pratap Singh Rawat, Samuel Kovaly, W. Pigeon, Hong-ye Yu
Suicide is an important public health concern and one of the leading causes of death worldwide. Suicidal behaviors, including suicide attempts (SA) and suicide ideation (SI), are leading risk factors for death by suicide. Information related to patients' previous and current SA and SI is frequently documented in electronic health record (EHR) notes. Accurate detection of such documentation may help improve surveillance and prediction of patients' suicidal behaviors and alert medical professionals for suicide prevention efforts. In this study, we first built the Suicide Attempt and Ideation Events (ScAN) dataset, a subset of the publicly available MIMIC III dataset spanning 12k+ EHR notes with 19k+ annotated SA and SI events. The annotations also contain attributes such as method of suicide attempt. We also provide a strong baseline model, ScANER (Suicide Attempt and Ideation Events Retriever), a multi-task RoBERTa-based model with a retrieval module to extract all relevant suicidal behavioral evidence from the EHR notes of a hospital stay, and a prediction module to identify the type of suicidal behavior (SA or SI) concluded during the patient's stay at the hospital. ScANER achieved a macro-weighted F1-score of 0.83 for identifying suicidal behavioral evidence, and macro F1-scores of 0.78 and 0.60 for classification of SA and SI for the patient's hospital stay, respectively. ScAN and ScANER are publicly available.
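The two-module design is straightforward to sketch. Below is a minimal, hypothetical PyTorch/Hugging Face sketch of a ScANER-style multi-task model: a shared RoBERTa encoder with an evidence-retrieval head and a stay-level prediction head. The label counts, pooling strategy, and head shapes are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of a ScANER-style multi-task model: a shared RoBERTa
# encoder with (1) an evidence-retrieval head that flags suicidal-behavior
# evidence in a note passage and (2) a prediction head that classifies the
# hospital stay (SA / SI / neither). Label sets and pooling are assumptions.
import torch.nn as nn
from transformers import RobertaModel, RobertaTokenizerFast

class MultiTaskScanModel(nn.Module):
    def __init__(self, n_evidence_labels=2, n_stay_labels=3):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained("roberta-base")
        hidden = self.encoder.config.hidden_size
        self.evidence_head = nn.Linear(hidden, n_evidence_labels)
        self.stay_head = nn.Linear(hidden, n_stay_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]          # [CLS]-style pooling
        return self.evidence_head(pooled), self.stay_head(pooled)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = MultiTaskScanModel()
batch = tokenizer(["Patient admitted after intentional overdose."],
                  return_tensors="pt", truncation=True)
evidence_logits, stay_logits = model(batch["input_ids"], batch["attention_mask"])
```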
{"title":"ScAN: Suicide Attempt and Ideation Events Dataset","authors":"Bhanu Pratap Singh Rawat, Samuel Kovaly, W. Pigeon, Hong-ye Yu","doi":"10.48550/arXiv.2205.07872","DOIUrl":"https://doi.org/10.48550/arXiv.2205.07872","url":null,"abstract":"Suicide is an important public health concern and one of the leading causes of death worldwide. Suicidal behaviors, including suicide attempts (SA) and suicide ideations (SI), are leading risk factors for death by suicide. Information related to patients’ previous and current SA and SI are frequently documented in the electronic health record (EHR) notes. Accurate detection of such documentation may help improve surveillance and predictions of patients’ suicidal behaviors and alert medical professionals for suicide prevention efforts. In this study, we first built Suicide Attempt and Ideation Events (ScAN) dataset, a subset of the publicly available MIMIC III dataset spanning over 12k+ EHR notes with 19k+ annotated SA and SI events information. The annotations also contain attributes such as method of suicide attempt. We also provide a strong baseline model ScANER (Suicide Attempt and Ideation Events Retriever), a multi-task RoBERTa-based model with a retrieval module to extract all the relevant suicidal behavioral evidences from EHR notes of an hospital-stay and, and a prediction module to identify the type of suicidal behavior (SA and SI) concluded during the patient’s stay at the hospital. ScANER achieved a macro-weighted F1-score of 0.83 for identifying suicidal behavioral evidences and a macro F1-score of 0.78 and 0.60 for classification of SA and SI for the patient’s hospital-stay, respectively. ScAN and ScANER are publicly available.","PeriodicalId":74542,"journal":{"name":"Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting","volume":"17 1","pages":"1029-1040"},"PeriodicalIF":0.0,"publicationDate":"2022-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78256254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Analysis of Behavior Classification in Motivational Interviewing.
Leili Tavabi, Trang Tran, Kalin Stefanov, Brian Borsari, Joshua D Woolley, Stefan Scherer, Mohammad Soleymani

Analysis of client and therapist behavior in counseling sessions can provide helpful insights for assessing the quality of the session and, consequently, the client's behavioral outcome. In this paper, we study the automatic classification of standardized behavior codes (i.e., annotations) used for assessment of psychotherapy sessions in Motivational Interviewing (MI). We develop models and examine the classification of client behaviors throughout MI sessions, comparing the performance of models trained on large pretrained embeddings (RoBERTa) versus interpretable and expert-selected features (LIWC). Our best-performing model, using the pretrained RoBERTa embeddings, beats the baseline model, achieving an F1 score of 0.66 in subject-independent 3-class classification. Through statistical analysis of the classification results, we identify prominent LIWC features that may not have been captured by the model using pretrained embeddings. Although classification using LIWC features underperforms RoBERTa, our findings motivate the future direction of incorporating auxiliary tasks in the classification of MI codes.

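As a rough illustration of the comparison described above, the sketch below feeds the same downstream classifier either frozen RoBERTa [CLS] embeddings or an alternative feature matrix; the utterances, labels, and classifier choice are toy assumptions, not the authors' experimental setup.

```python
# Hypothetical comparison setup for 3-class utterance classification:
# frozen RoBERTa embeddings as one feature set (LIWC-style features would
# be the other arm). Both feed the same classifier; data here is toy.
import torch
from transformers import RobertaModel, RobertaTokenizerFast
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
encoder = RobertaModel.from_pretrained("roberta-base").eval()

def roberta_features(texts):
    with torch.no_grad():
        batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        out = encoder(**batch).last_hidden_state[:, 0]   # [CLS] embedding
    return out.numpy()

utterances = ["I want to cut back on drinking.", "It's not a problem.", "Maybe."]
labels = [2, 0, 1]                  # e.g., change / sustain / neutral talk (toy)

X = roberta_features(utterances)
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print("macro F1 (toy):", f1_score(labels, clf.predict(X), average="macro"))
```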
{"title":"Analysis of Behavior Classification in Motivational Interviewing.","authors":"Leili Tavabi, Trang Tran, Kalin Stefanov, Brian Borsari, Joshua D Woolley, Stefan Scherer, Mohammad Soleymani","doi":"10.18653/v1/2021.clpsych-1.13","DOIUrl":"10.18653/v1/2021.clpsych-1.13","url":null,"abstract":"<p><p>Analysis of client and therapist behavior in counseling sessions can provide helpful insights for assessing the quality of the session and consequently, the client's behavioral outcome. In this paper, we study the automatic classification of standardized behavior codes (i.e. annotations) used for assessment of psychotherapy sessions in Motivational Interviewing (MI). We develop models and examine the classification of client behaviors throughout MI sessions, comparing the performance by models trained on large pretrained embeddings (RoBERTa) versus interpretable and expert-selected features (LIWC). Our best performing model using the pretrained RoBERTa embeddings beats the baseline model, achieving an F1 score of 0.66 in the subject-independent 3-class classification. Through statistical analysis on the classification results, we identify prominent LIWC features that may not have been captured by the model using pretrained embeddings. Although classification using LIWC features underperforms RoBERTa, our findings motivate the future direction of incorporating auxiliary tasks in the classification of MI codes.</p>","PeriodicalId":74542,"journal":{"name":"Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting","volume":"2021 ","pages":"110-115"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8321779/pdf/nihms-1727153.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39266882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
TextEssence: A Tool for Interactive Analysis of Semantic Shifts Between Corpora.
Denis Newman-Griffis, Venkatesh Sivaraman, Adam Perer, Eric Fosler-Lussier, Harry Hochheiser

Embeddings of words and concepts capture syntactic and semantic regularities of language; however, they have seen limited use as tools to study characteristics of different corpora and how they relate to one another. We introduce TextEssence, an interactive system designed to enable comparative analysis of corpora using embeddings. TextEssence includes visual, neighbor-based, and similarity-based modes of embedding analysis in a lightweight, web-based interface. We further propose a new measure of embedding confidence based on nearest neighborhood overlap, to assist in identifying high-quality embeddings for corpus analysis. A case study on COVID-19 scientific literature illustrates the utility of the system. TextEssence can be found at https://textessence.github.io.

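The neighborhood-overlap confidence measure is easy to prototype. The sketch below assumes one specific formulation, the Jaccard overlap of each word's k-nearest-neighbor sets across two embedding spaces; the paper's exact definition may differ.

```python
# Hypothetical sketch of a nearest-neighborhood-overlap confidence score:
# for each shared word, compare its k nearest neighbors in two embedding
# spaces and report the overlap. High overlap suggests a stable,
# high-confidence embedding; this Jaccard variant is an assumption.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_sets(matrix, k):
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(matrix)
    _, idx = nn.kneighbors(matrix)
    return [set(row[1:]) for row in idx]          # drop the self-match

def neighborhood_overlap(emb_a, emb_b, k=10):
    sets_a, sets_b = knn_sets(emb_a, k), knn_sets(emb_b, k)
    return np.array([len(a & b) / len(a | b) for a, b in zip(sets_a, sets_b)])

rng = np.random.default_rng(0)
emb_corpus_a = rng.normal(size=(500, 50))                    # toy embeddings
emb_corpus_b = emb_corpus_a + 0.1 * rng.normal(size=(500, 50))  # shifted copy
confidence = neighborhood_overlap(emb_corpus_a, emb_corpus_b, k=10)
print("mean confidence:", confidence.mean())
```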
{"title":"TextEssence: A Tool for Interactive Analysis of Semantic Shifts Between Corpora.","authors":"Denis Newman-Griffis,&nbsp;Venkatesh Sivaraman,&nbsp;Adam Perer,&nbsp;Eric Fosler-Lussier,&nbsp;Harry Hochheiser","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Embeddings of words and concepts capture syntactic and semantic regularities of language; however, they have seen limited use as tools to study characteristics of different corpora and how they relate to one another. We introduce TextEssence, an interactive system designed to enable comparative analysis of corpora using embeddings. TextEssence includes visual, neighbor-based, and similarity-based modes of embedding analysis in a lightweight, web-based interface. We further propose a new measure of embedding confidence based on nearest neighborhood overlap, to assist in identifying high-quality embeddings for corpus analysis. A case study on COVID-19 scientific literature illustrates the utility of the system. TextEssence can be found at https://textessence.github.io.</p>","PeriodicalId":74542,"journal":{"name":"Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting","volume":"2021 ","pages":"106-115"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8212692/pdf/nihms-1710045.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39251210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Translational NLP: A New Paradigm and General Principles for Natural Language Processing Research.
Denis Newman-Griffis, Jill Fain Lehman, Carolyn Rosé, Harry Hochheiser

Natural language processing (NLP) research combines the study of universal principles, through basic science, with applied science targeting specific use cases and settings. However, the process of exchange between basic NLP and applications is often assumed to emerge naturally, resulting in many innovations going unapplied and many important questions left unstudied. We describe a new paradigm of Translational NLP, which aims to structure and facilitate the processes by which basic and applied NLP research inform one another. Translational NLP thus presents a third research paradigm, focused on understanding the challenges posed by application needs and how these challenges can drive innovation in basic science and technology design. We show that many significant advances in NLP research have emerged from the intersection of basic principles with application needs, and present a conceptual framework outlining the stakeholders and key questions in translational research. Our framework provides a roadmap for developing Translational NLP as a dedicated research area, and identifies general translational principles to facilitate exchange between basic and applied research.

{"title":"Translational NLP: A New Paradigm and General Principles for Natural Language Processing Research.","authors":"Denis Newman-Griffis,&nbsp;Jill Fain Lehman,&nbsp;Carolyn Rosé,&nbsp;Harry Hochheiser","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Natural language processing (NLP) research combines the study of universal principles, through basic science, with applied science targeting specific use cases and settings. However, the process of exchange between basic NLP and applications is often assumed to emerge naturally, resulting in many innovations going unapplied and many important questions left unstudied. We describe a new paradigm of <i>Translational NLP</i>, which aims to structure and facilitate the processes by which basic and applied NLP research inform one another. Translational NLP thus presents a third research paradigm, focused on understanding the challenges posed by application needs and how these challenges can drive innovation in basic science and technology design. We show that many significant advances in NLP research have emerged from the intersection of basic principles with application needs, and present a conceptual framework outlining the stakeholders and key questions in translational research. Our framework provides a roadmap for developing Translational NLP as a dedicated research area, and identifies general translational principles to facilitate exchange between basic and applied research.</p>","PeriodicalId":74542,"journal":{"name":"Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting","volume":"2021 ","pages":"4125-4138"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8223521/pdf/nihms-1710048.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39115253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Empirical Evaluation of Pre-trained Transformers for Human-Level NLP: The Role of Sample Size and Dimensionality.
Adithya V Ganesan, Matthew Matero, Aravind Reddy Ravula, Huy Vu, H Andrew Schwartz

In human-level NLP tasks, such as predicting mental health, personality, or demographics, the number of observations is often smaller than the standard 768+ hidden-state sizes of each layer within modern transformer-based language models, limiting the ability to effectively leverage transformers. Here, we provide a systematic study of the role of dimension reduction methods (principal components analysis, factorization techniques, or multi-layer auto-encoders), as well as the dimensionality of embedding vectors and sample sizes, as a function of predictive performance. We first find that fine-tuning large models with a limited amount of data poses a significant difficulty, which can be overcome with a pre-trained dimension reduction regime. RoBERTa consistently achieves top performance in human-level tasks, with PCA giving a benefit over other reduction methods in better handling users that write longer texts. Finally, we observe that a majority of the tasks achieve results comparable to the best performance with just 1/12 of the embedding dimensions.

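As a concrete illustration of the reduced-dimension regime, the sketch below applies PCA to stand-in 768-dimensional transformer features before fitting a simple predictor; the data, component count, and probe task are illustrative assumptions.

```python
# Hypothetical small-sample pipeline: frozen transformer embeddings reduced
# with PCA before fitting a predictor, as when observations number in the
# hundreds but hidden states are 768-dimensional. All values are toy.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))      # stand-in for RoBERTa user embeddings
y = rng.integers(0, 2, size=200)     # stand-in human-level labels

reduced = PCA(n_components=64).fit_transform(X)   # roughly 1/12 of 768 dims
score = cross_val_score(RidgeClassifier(), reduced, y, cv=5).mean()
print("CV accuracy on reduced features:", round(score, 3))
```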
{"title":"Empirical Evaluation of Pre-trained Transformers for Human-Level NLP: The Role of Sample Size and Dimensionality.","authors":"Adithya V Ganesan,&nbsp;Matthew Matero,&nbsp;Aravind Reddy Ravula,&nbsp;Huy Vu,&nbsp;H Andrew Schwartz","doi":"10.18653/v1/2021.naacl-main.357","DOIUrl":"https://doi.org/10.18653/v1/2021.naacl-main.357","url":null,"abstract":"<p><p>In human-level NLP tasks, such as predicting mental health, personality, or demographics, the number of observations is often smaller than the standard 768+ hidden state sizes of each layer within modern transformer-based language models, limiting the ability to effectively leverage transformers. Here, we provide a systematic study on the role of dimension reduction methods (principal components analysis, factorization techniques, or multi-layer auto-encoders) as well as the dimensionality of embedding vectors and sample sizes as a function of predictive performance. We first find that fine-tuning large models with a limited amount of data pose a significant difficulty which can be overcome with a pre-trained dimension reduction regime. RoBERTa consistently achieves top performance in human-level tasks, with PCA giving benefit over other reduction methods in better handling users that write longer texts. Finally, we observe that a majority of the tasks achieve results comparable to the best performance with just <math> <mrow><mfrac><mn>1</mn> <mrow><mn>12</mn></mrow> </mfrac> </mrow> </math> of the embedding dimensions.</p>","PeriodicalId":74542,"journal":{"name":"Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting","volume":"2021 ","pages":"4515-4532"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8294338/pdf/nihms-1716243.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39215546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 20
What's in a Summary? Laying the Groundwork for Advances in Hospital-Course Summarization.
Griffin Adams, Emily Alsentzer, Mert Ketenci, Jason Zucker, Noémie Elhadad

Summarization of clinical narratives is a long-standing research problem. Here, we introduce the task of hospital-course summarization. Given the documentation authored throughout a patient's hospitalization, generate a paragraph that tells the story of the patient admission. We construct an English, text-to-text dataset of 109,000 hospitalizations (2M source notes) and their corresponding summary proxy: the clinician-authored "Brief Hospital Course" paragraph written as part of a discharge note. Exploratory analyses reveal that the BHC paragraphs are highly abstractive with some long extracted fragments; are concise yet comprehensive; differ in style and content organization from the source notes; exhibit minimal lexical cohesion; and represent silver-standard references. Our analysis identifies multiple implications for modeling this complex, multi-document summarization task.

{"title":"What's in a Summary? Laying the Groundwork for Advances in Hospital-Course Summarization.","authors":"Griffin Adams, Emily Alsentzer, Mert Ketenci, Jason Zucker, Noémie Elhadad","doi":"10.18653/v1/2021.naacl-main.382","DOIUrl":"10.18653/v1/2021.naacl-main.382","url":null,"abstract":"<p><p>Summarization of clinical narratives is a long-standing research problem. Here, we introduce the task of hospital-course summarization. Given the documentation authored throughout a patient's hospitalization, generate a paragraph that tells the story of the patient admission. We construct an English, text-to-text dataset of 109,000 hospitalizations (2M source notes) and their corresponding summary proxy: the clinician-authored \"Brief Hospital Course\" paragraph written as part of a discharge note. Exploratory analyses reveal that the BHC paragraphs are highly abstractive with some long extracted fragments; are concise yet comprehensive; differ in style and content organization from the source notes; exhibit minimal lexical cohesion; and represent silver-standard references. Our analysis identifies multiple implications for modeling this complex, multi-document summarization task.</p>","PeriodicalId":74542,"journal":{"name":"Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting","volume":"2021 ","pages":"4794-4811"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8225248/pdf/nihms-1705151.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39115254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Word centrality constrained representation for keyphrase extraction.
Zelalem Gero, Joyce C Ho

To keep pace with the increased generation and digitization of documents, automated methods that can improve search, discovery, and mining of the vast body of literature are essential. Keyphrases provide a concise representation by identifying salient concepts in a document. Various supervised approaches model keyphrase extraction using local context to predict the label for each token, and perform much better than their unsupervised counterparts. Unfortunately, this method fails for short documents where the context is unclear. Moreover, keyphrases, which are usually the gist of a document, need to reflect its central theme. We propose a new extraction model that introduces a centrality constraint to enrich the word representation of a bidirectional long short-term memory (BiLSTM) network. Performance evaluation on two publicly available datasets demonstrates that our model outperforms existing state-of-the-art approaches. Our model is publicly available at https://github.com/ZHgero/keyphrases_centrality.git.

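One plausible reading of the centrality constraint is sketched below: a per-token centrality score, here degree centrality in a window-based co-occurrence graph, is appended to each word embedding before a BiLSTM tagger. This is an assumption-heavy illustration, not the authors' released model.

```python
# Hypothetical sketch of centrality-enriched token representations for a
# BiLSTM keyphrase tagger: a per-token centrality score (degree centrality
# in a window-based co-occurrence graph) is concatenated to each word
# embedding. The paper's exact constraint may differ.
import torch
import torch.nn as nn
import networkx as nx

def degree_centrality(tokens, window=2):
    g = nx.Graph()
    g.add_nodes_from(tokens)
    for i, tok in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            g.add_edge(tok, tokens[j])
    cent = nx.degree_centrality(g)
    return torch.tensor([[cent[t]] for t in tokens])      # (seq_len, 1)

class CentralityBiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden=128, n_tags=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim + 1, hidden, bidirectional=True,
                            batch_first=True)
        self.out = nn.Linear(2 * hidden, n_tags)          # B/I/O tags

    def forward(self, token_ids, centrality):
        x = torch.cat([self.emb(token_ids), centrality], dim=-1)
        h, _ = self.lstm(x)
        return self.out(h)

tokens = ["automated", "keyphrase", "extraction", "improves", "search"]
ids = torch.tensor([[0, 1, 2, 3, 4]])
cent = degree_centrality(tokens).unsqueeze(0).float()
tagger = CentralityBiLSTMTagger(vocab_size=5)
print(tagger(ids, cent).shape)                            # (1, 5, 3)
```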
{"title":"Word centrality constrained representation for keyphrase extraction.","authors":"Zelalem Gero,&nbsp;Joyce C Ho","doi":"10.18653/v1/2021.bionlp-1.17","DOIUrl":"https://doi.org/10.18653/v1/2021.bionlp-1.17","url":null,"abstract":"<p><p>To keep pace with the increased generation and digitization of documents, automated methods that can improve search, discovery and mining of the vast body of literature are essential. Keyphrases provide a concise representation by identifying salient concepts in a document. Various supervised approaches model keyphrase extraction using local context to predict the label for each token and perform much better than the unsupervised counterparts. Unfortunately, this method fails for short documents where the context is unclear. Moreover, keyphrases, which are usually the gist of a document, need to be the central theme. We propose a new extraction model that introduces a centrality constraint to enrich the word representation of a Bidirectional long short-term memory. Performance evaluation on two publicly available datasets demonstrate our model outperforms existing state-of-the art approaches. Our model is publicly available at https://github.com/ZHgero/keyphrases_centrality.git.</p>","PeriodicalId":74542,"journal":{"name":"Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting","volume":" ","pages":"155-161"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9208728/pdf/nihms-1815573.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40396966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Paragraph-level Simplification of Medical Texts
Ashwin Devaraj, I. Marshall, Byron C. Wallace, J. Li
We consider the problem of learning to simplify medical texts. This is important because most reliable, up-to-date information in biomedicine is dense with jargon and thus practically inaccessible to the lay audience. Furthermore, manual simplification does not scale to the rapidly growing body of biomedical literature, motivating the need for automated approaches. Unfortunately, there are no large-scale resources available for this task. In this work we introduce a new corpus of parallel texts in English comprising technical and lay summaries of all published evidence pertaining to different clinical topics. We then propose a new metric based on likelihood scores from a masked language model pretrained on scientific texts. We show that this automated measure better differentiates between technical and lay summaries than existing heuristics. We introduce and evaluate baseline encoder-decoder Transformer models for simplification and propose a novel augmentation to these in which we explicitly penalize the decoder for producing “jargon” terms; we find that this yields improvements over baselines in terms of readability.
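The masked-LM metric can be prototyped as a pseudo-log-likelihood score: mask each token in turn and average the log-probability the model assigns to the original token. The sketch below uses roberta-base as a stand-in for the scientific-text model; the scoring details are assumptions rather than the paper's exact formulation.

```python
# Hypothetical pseudo-log-likelihood scorer under a masked language model:
# mask each token in turn and average the log-probability of the original
# token. Text dense with jargon should score higher under a
# science-pretrained LM; the paper's exact metric may differ.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

def pseudo_log_likelihood(text):
    ids = tok(text, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):                # skip special tokens
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        with torch.no_grad():
            logits = mlm(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total / (len(ids) - 2)                   # length-normalized

print(pseudo_log_likelihood("Myocardial infarction results from coronary occlusion."))
print(pseudo_log_likelihood("A heart attack happens when blood flow is blocked."))
```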
{"title":"Paragraph-level Simplification of Medical Texts","authors":"Ashwin Devaraj, I. Marshall, Byron C. Wallace, J. Li","doi":"10.18653/V1/2021.NAACL-MAIN.395","DOIUrl":"https://doi.org/10.18653/V1/2021.NAACL-MAIN.395","url":null,"abstract":"We consider the problem of learning to simplify medical texts. This is important because most reliable, up-to-date information in biomedicine is dense with jargon and thus practically inaccessible to the lay audience. Furthermore, manual simplification does not scale to the rapidly growing body of biomedical literature, motivating the need for automated approaches. Unfortunately, there are no large-scale resources available for this task. In this work we introduce a new corpus of parallel texts in English comprising technical and lay summaries of all published evidence pertaining to different clinical topics. We then propose a new metric based on likelihood scores from a masked language model pretrained on scientific texts. We show that this automated measure better differentiates between technical and lay summaries than existing heuristics. We introduce and evaluate baseline encoder-decoder Transformer models for simplification and propose a novel augmentation to these in which we explicitly penalize the decoder for producing “jargon” terms; we find that this yields improvements over baselines in terms of readability.","PeriodicalId":74542,"journal":{"name":"Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting","volume":"7 1","pages":"4972-4984"},"PeriodicalIF":0.0,"publicationDate":"2021-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78695199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 47