
Computational Linguistics: Latest Articles

Boring Problems Are Sometimes the Most Interesting
R. Sproat · IF 9.3 · CAS Tier 2 (Computer Science) · Q2 (Computer Science, Artificial Intelligence) · Computational Linguistics 48(1): 483–490 · Pub Date: 2022-03-07 · DOI: 10.1162/coli_a_00439
Abstract: In a recent position paper, Turing Award winners Yoshua Bengio, Geoffrey Hinton, and Yann LeCun make the case that symbolic methods are not needed in AI and that, while there are still many issues to be resolved, AI will be solved using purely neural methods. In this piece I issue a challenge: Demonstrate that a purely neural approach to the problem of text normalization is possible. Various groups have tried, but so far nobody has eliminated the problem of unrecoverable errors, errors where, due to insufficient training data or faulty generalization, the system substitutes some other reading for the correct one. Solutions have been proposed that involve a marriage of traditional finite-state methods with neural models, but thus far nobody has shown that the problem can be solved using neural methods alone. Though text normalization is hardly an “exciting” problem, I argue that until one can solve “boring” problems like that using purely AI methods, one cannot claim that AI is a success.
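To make the proposed finite-state/neural marriage concrete, here is a minimal sketch (an illustration of the idea, not Sproat's or any published system): a hand-written covering grammar enumerates the legal verbalizations of a token, and a mocked-up neural scorer only ranks those candidates, so substituting an unrelated reading, the unrecoverable error discussed above, is impossible by construction.

```python
# Minimal sketch of a finite-state + neural hybrid for text normalization.
# The grammar enumerates legal readings; the "neural" scorer (mocked here
# as a trivial heuristic) only ranks them, so the system can never emit a
# reading outside the candidate set.
ONES = "zero one two three four five six seven eight nine".split()
TEENS = ("ten eleven twelve thirteen fourteen fifteen "
         "sixteen seventeen eighteen nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()

def two_digit(n: int) -> str:
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    return TENS[n // 10] + ("" if n % 10 == 0 else " " + ONES[n % 10])

def verbalize_cardinal(n: int) -> str:
    # Toy cardinal reader: exact below 100, digit-by-digit above.
    return two_digit(n) if n < 100 else " ".join(ONES[int(d)] for d in str(n))

def verbalize_year(tok: str) -> str:
    return f"{two_digit(int(tok[:2]))} {two_digit(int(tok[2:]))}"

def covering_grammar(tok: str) -> list[str]:
    """Finite-state-style component: enumerate the legal verbalizations."""
    if tok.isdigit() and len(tok) == 4:
        return [verbalize_year(tok), verbalize_cardinal(int(tok))]
    if tok.isdigit():
        return [verbalize_cardinal(int(tok))]
    return [tok]

def neural_score(candidate: str, context: str) -> float:
    # Stand-in for a trained neural ranker.
    return 1.0 if context.endswith("in") and "nineteen" in candidate else 0.5

def normalize(tok: str, context: str) -> str:
    return max(covering_grammar(tok), key=lambda c: neural_score(c, context))

print(normalize("1997", "She moved there in"))  # nineteen ninety seven
```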
Citations: 5
Assessing Corpus Evidence for Formal and Psycholinguistic Constraints on Nonprojectivity
Himanshu Yadav, Samar Husain, Richard Futrell · IF 9.3 · CAS Tier 2 (Computer Science) · Q2 (Computer Science, Artificial Intelligence) · Computational Linguistics 48(1): 375–401 · Pub Date: 2022-03-07 · DOI: 10.1162/coli_a_00437
Abstract: Formal constraints on crossing dependencies have played a large role in research on the formal complexity of natural language grammars and parsing. Here we ask whether the apparent evidence for constraints on crossing dependencies in treebanks might arise because of independent constraints on trees, such as low arity and dependency length minimization. We address this question using two sets of experiments. In Experiment 1, we compare the distribution of formal properties of crossing dependencies, such as gap degree, between real trees and baseline trees matched for rate of crossing dependencies and various other properties. In Experiment 2, we model whether two dependencies cross, given certain psycholinguistic properties of the dependencies. We find surprisingly weak evidence for constraints originating from the mild context-sensitivity literature (gap degree and well-nestedness) beyond what can be explained by constraints on rate of crossing dependencies, topological properties of the trees, and dependency length. However, measures that have emerged from the parsing literature (e.g., edge degree, end-point crossings, and heads’ depth difference) differ strongly between real and random trees. Modeling results show that cognitive metrics relating to information locality and working-memory limitations affect whether two dependencies cross or not, but they do not fully explain the distribution of crossing dependencies in natural languages. Together these results suggest that crossing constraints are better characterized by processing pressures than by mildly context-sensitive constraints.
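Two of the formal measures the abstract relies on can be made concrete in a few lines. The sketch below (not the paper's code) computes crossing arcs and gap degree from a dependency tree encoded as a head array, where heads[i-1] is the head of word i and 0 marks the root.

```python
def edges(heads):
    """Dependency arcs as (dependent, head) position pairs, 1-indexed."""
    return [(d, h) for d, h in enumerate(heads, start=1) if h != 0]

def crossing_pairs(heads):
    """Pairs of arcs where exactly one endpoint of one arc lies strictly
    inside the span of the other."""
    es = edges(heads)
    out = []
    for a in range(len(es)):
        for b in range(a + 1, len(es)):
            (lo1, hi1), (lo2, hi2) = sorted(es[a]), sorted(es[b])
            if lo1 < lo2 < hi1 < hi2 or lo2 < lo1 < hi2 < hi1:
                out.append((es[a], es[b]))
    return out

def gap_degree(heads):
    """Maximum number of discontinuities in any subtree's yield."""
    n = len(heads)
    children = {i: [] for i in range(n + 1)}
    for d, h in enumerate(heads, start=1):
        children[h].append(d)

    def yield_of(node):
        ys = {node}
        for c in children[node]:
            ys |= yield_of(c)
        return ys

    worst = 0
    for node in range(1, n + 1):
        ys = sorted(yield_of(node))
        worst = max(worst, sum(1 for x, y in zip(ys, ys[1:]) if y - x > 1))
    return worst

heads = [2, 0, 5, 2, 2]       # toy non-projective tree: arc 3->5 crosses 4->2
print(crossing_pairs(heads))  # [((3, 5), (4, 2))]
print(gap_degree(heads))      # 1 (the subtree rooted at word 5 has one gap)
```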
Citations: 3
Challenges of Neural Machine Translation for Short Texts
Yu Wan, Baosong Yang, Derek F. Wong, Lidia S. Chao, Liang Yao, Haibo Zhang, Boxing Chen · IF 9.3 · CAS Tier 2 (Computer Science) · Q2 (Computer Science, Artificial Intelligence) · Computational Linguistics 48(1): 321–342 · Pub Date: 2022-03-07 · DOI: 10.1162/coli_a_00435
Abstract: Short texts (STs) appear in a variety of scenarios, including queries, dialogs, and entity names. Most existing studies in neural machine translation (NMT) focus on tackling open problems concerning long sentences rather than short ones. The intuition behind this is that, with respect to human learning and processing, short sequences are generally regarded as easy examples. In this article, we first dispel this speculation by conducting preliminary experiments, showing that the conventional state-of-the-art NMT approach, namely, Transformer (Vaswani et al. 2017), still suffers from over-translation and mistranslation errors on STs. After empirically investigating the rationale behind this, we summarize two challenges in NMT for STs, each associated with one of the error types above: (1) the imbalanced length distribution in the training set intensifies model inference calibration over STs, leading to more over-translation cases; and (2) the lack of contextual information forces NMT to have higher data uncertainty on short sentences, so the NMT model is troubled by considerable mistranslation errors. Some existing approaches, such as balancing the data distribution for training (e.g., data upsampling) and complementing contextual information (e.g., introducing translation memory), can alleviate these translation issues. We encourage researchers to investigate other challenges in NMT for STs, thus reducing ST translation errors and enhancing translation quality.
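The first mitigation mentioned above, balancing the length distribution by upsampling, might look like the following sketch; the bucketing threshold and target counts are illustrative assumptions, not the paper's recipe.

```python
# Sketch of length-balanced upsampling for an NMT training corpus: short
# pairs are duplicated (sampled with replacement) until the short bucket is
# as large as the long one.
import random
from collections import defaultdict

def upsample_short(pairs, short_len=5, seed=0):
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for src, tgt in pairs:
        key = "short" if len(src.split()) <= short_len else "long"
        buckets[key].append((src, tgt))
    target = max(len(b) for b in buckets.values())
    out = []
    for b in buckets.values():
        out.extend(b)
        out.extend(rng.choices(b, k=target - len(b)))  # with replacement
    rng.shuffle(out)
    return out

corpus = [("hi", "salut"), ("good morning", "bonjour"),
          ("the committee approved the proposal after a long debate",
           "le comité a approuvé la proposition après un long débat")]
balanced = upsample_short(corpus)
print(len(balanced))  # 4: the long bucket was upsampled to match the short one
```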
Citations: 10
Hierarchical Interpretation of Neural Text Classification
Hanqi Yan, Lin Gui, Yulan He · IF 9.3 · CAS Tier 2 (Computer Science) · Q2 (Computer Science, Artificial Intelligence) · Computational Linguistics 48(1): 987–1020 · Pub Date: 2022-02-20 · DOI: 10.1162/coli_a_00459
Abstract: Recent years have witnessed increasing interest in developing interpretable models in Natural Language Processing (NLP). Most existing models aim at identifying input features such as words or phrases important for model predictions. Neural models developed in NLP, however, often compose word semantics in a hierarchical manner. As such, interpretation by words or phrases only cannot faithfully explain model decisions in text classification. This article proposes a novel Hierarchical Interpretable Neural Text classifier, called HINT, which can automatically generate explanations of model predictions in the form of label-associated topics in a hierarchical manner. Model interpretation is no longer at the word level, but built on topics as the basic semantic unit. Experimental results on both review datasets and news datasets show that our proposed approach achieves text classification results on par with existing state-of-the-art text classifiers, and generates interpretations more faithful to model predictions and better understood by humans than other interpretable neural text classifiers.
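HINT itself is a neural architecture, but the shift it makes, from word-level to topic-level interpretation, can be illustrated with a toy aggregation step: per-word importance scores are summed over a (hypothetical) word-to-topic assignment.

```python
# Conceptual stand-in for topic-level interpretation (not the HINT model):
# aggregate word-level attributions into label-associated topic scores.

def topic_attributions(word_scores, word2topic):
    """Sum per-word importances into per-topic scores, sorted descending."""
    topics = {}
    for word, score in word_scores.items():
        topic = word2topic.get(word, "misc")
        topics[topic] = topics.get(topic, 0.0) + score
    return dict(sorted(topics.items(), key=lambda kv: -kv[1]))

word_scores = {"plot": 0.4, "acting": 0.3, "camera": 0.1, "popcorn": 0.05}
word2topic = {"plot": "story", "acting": "performance", "camera": "production"}
print(topic_attributions(word_scores, word2topic))
# {'story': 0.4, 'performance': 0.3, 'production': 0.1, 'misc': 0.05}
```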
Citations: 9
Transformers and the Representation of Biomedical Background Knowledge
Oskar Wysocki, Zili Zhou, Paul O'Regan, D. Ferreira, M. Wysocka, Dónal Landers, André Freitas · IF 9.3 · CAS Tier 2 (Computer Science) · Q2 (Computer Science, Artificial Intelligence) · Computational Linguistics 49(1): 73–115 · Pub Date: 2022-02-04 · DOI: 10.1162/coli_a_00462
Specialized transformer-based models (such as BioBERT and BioMegatron) are adapted for the biomedical domain based on publicly available biomedical corpora. As such, they have the potential to encode large-scale biological knowledge. We investigate the encoding and representation of biological knowledge in these models, and its potential utility to support inference in cancer precision medicine, namely, the interpretation of the clinical significance of genomic alterations. We compare the performance of different transformer baselines; we use probing to determine the consistency of encodings for distinct entities; and we use clustering methods to compare and contrast the internal properties of the embeddings for genes, variants, drugs, and diseases. We show that these models do indeed encode biological knowledge, although some of this is lost in fine-tuning for specific tasks. Finally, we analyze how the models behave with regard to biases and imbalances in the dataset.
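As a sketch of the clustering analysis described above, the snippet below runs k-means over entity vectors; random vectors offset by entity type stand in for real BioBERT/BioMegatron embeddings, which would require the actual checkpoints.

```python
# Clustering entity embeddings by type (genes, drugs, diseases). The vectors
# here are synthetic stand-ins; in the paper, embeddings come from
# biomedical transformer models.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
entities = ["BRCA1", "TP53", "imatinib", "gefitinib", "melanoma", "glioma"]
# Stand-in embeddings: two genes, two drugs, two diseases, offset by type.
type_offsets = np.repeat(np.eye(3) * 5.0, 2, axis=0)       # shape (6, 3)
embeddings = type_offsets + rng.normal(size=(6, 3)) * 0.1

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)
for name, lab in zip(entities, labels):
    print(f"{name}: cluster {lab}")
# With well-separated type offsets, genes, drugs, and diseases fall into
# three distinct clusters.
```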
Citations: 6
The Text Anonymization Benchmark (TAB): A Dedicated Corpus and Evaluation Framework for Text Anonymization
Ildikó Pilán, Pierre Lison, Lilja Øvrelid, Anthia Papadopoulou, David Sánchez, Montserrat Batet · IF 9.3 · CAS Tier 2 (Computer Science) · Q2 (Computer Science, Artificial Intelligence) · Computational Linguistics 48(1): 1053–1101 · Pub Date: 2022-01-25 · DOI: 10.1162/coli_a_00458
Abstract: We present a novel benchmark and associated evaluation metrics for assessing the performance of text anonymization methods. Text anonymization, defined as the task of editing a text document to prevent the disclosure of personal information, currently suffers from a shortage of privacy-oriented annotated text resources, making it difficult to properly evaluate the level of privacy protection offered by various anonymization methods. This paper presents TAB (Text Anonymization Benchmark), a new, open-source annotated corpus developed to address this shortage. The corpus comprises 1,268 English-language court cases from the European Court of Human Rights (ECHR) enriched with comprehensive annotations about the personal information appearing in each document, including their semantic category, identifier type, confidential attributes, and co-reference relations. Compared with previous work, the TAB corpus is designed to go beyond traditional de-identification (which is limited to the detection of predefined semantic categories), and explicitly marks which text spans ought to be masked in order to conceal the identity of the person to be protected. Along with presenting the corpus and its annotation layers, we also propose a set of evaluation metrics that are specifically tailored toward measuring the performance of text anonymization, both in terms of privacy protection and utility preservation. We illustrate the use of the benchmark and the proposed metrics by assessing the empirical performance of several baseline text anonymization models. The full corpus along with its privacy-oriented annotation guidelines, evaluation scripts, and baseline models are available at: https://github.com/NorskRegnesentral/text-anonymization-benchmark.
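TAB ships its own evaluation scripts (see the repository above); purely as an illustration of a privacy-oriented metric, the sketch below computes the fraction of annotated personal-information spans that a system's masks fully cover.

```python
# Toy privacy recall: a gold span counts as protected only if some system
# mask fully covers it. Not TAB's official scorer.

def span_recall(gold_spans, masked_spans):
    """gold_spans / masked_spans: lists of (start, end) character offsets."""
    def covered(span):
        s, e = span
        return any(ms <= s and e <= me for ms, me in masked_spans)
    hits = sum(covered(g) for g in gold_spans)
    return hits / len(gold_spans) if gold_spans else 1.0

gold = [(0, 10), (25, 40)]        # spans annotators say must be masked
masked = [(0, 12)]                # the system masked only the first
print(span_recall(gold, masked))  # 0.5 -> one personal-information leak
```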
Citations: 29
Domain Adaptation with Pre-trained Transformers for Query-Focused Abstractive Text Summarization
Md Tahmid Rahman Laskar, Enamul Hoque, J. Huang · IF 9.3 · CAS Tier 2 (Computer Science) · Q2 (Computer Science, Artificial Intelligence) · Computational Linguistics 48(1): 279–320 · Pub Date: 2021-12-22 · DOI: 10.1162/coli_a_00434
Abstract: The Query-Focused Text Summarization (QFTS) task aims at building systems that generate the summary of the text document(s) based on the given query. A key challenge in addressing this task is the lack of large labeled data for training the summarization model. In this article, we address this challenge by exploring a series of domain adaptation techniques. Given the recent success of pre-trained transformer models in a wide range of natural language processing tasks, we utilize such models to generate abstractive summaries for the QFTS task for both single-document and multi-document scenarios. For domain adaptation, we apply a variety of techniques using pre-trained transformer-based summarization models including transfer learning, weakly supervised learning, and distant supervision. Extensive experiments on six datasets show that our proposed approach is very effective in generating abstractive summaries for the QFTS task while setting a new state-of-the-art result in several datasets across a set of automatic and human evaluation metrics.
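The inference side of query-focused abstractive summarization can be sketched with an off-the-shelf pre-trained summarizer; prepending the query to the document is an assumption of this sketch, not necessarily the paper's input format, and the domain-adaptation training itself (transfer learning, weak and distant supervision) is not reproduced here.

```python
from transformers import pipeline

# Off-the-shelf pre-trained abstractive summarizer (BART fine-tuned on CNN/DM).
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def query_focused_summary(query: str, document: str) -> str:
    # Concatenating query and document with a separator is illustrative only.
    text = f"{query} </s> {document}"
    out = summarizer(text, max_length=60, min_length=10, do_sample=False)
    return out[0]["summary_text"]

doc = ("The committee met on Tuesday to discuss the budget. It also reviewed "
       "the hiring plan and approved three new positions in engineering.")
print(query_focused_summary("What was decided about hiring?", doc))
```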
Citations: 24
Novelty Detection: A Perspective from Natural Language Processing
Tirthankar Ghosal, Tanik Saikh, Tameesh Biswas, Asif Ekbal, P. Bhattacharyya · IF 9.3 · CAS Tier 2 (Computer Science) · Q2 (Computer Science, Artificial Intelligence) · Computational Linguistics 48(1): 77–117 · Pub Date: 2021-12-20 · DOI: 10.1162/coli_a_00429
The quest for new information is an inborn human trait and has always been quintessential for human survival and progress. Novelty drives curiosity, which in turn drives innovation. In Natural Language Processing (NLP), Novelty Detection refers to finding text that has some new information to offer with respect to whatever was earlier seen or known. With the exponential growth of information across the Web comes an accompanying menace of redundancy. A considerable portion of Web content is duplicated, and we need efficient mechanisms to retain new information and filter out redundant information. However, detecting redundancy at the semantic level and identifying novel text is not straightforward, because a text may have little lexical overlap with its sources yet convey the same information. On top of that, non-novel/redundant information in a document may have been assimilated from multiple source documents, not just one. The problem is compounded when the unit of discourse is the document, and numerous prior documents must be processed to ascertain the novelty of the current one. In this work, we build upon our earlier investigations of document-level novelty detection and present a comprehensive account of our efforts on the problem. We explore the role of pre-trained Textual Entailment (TE) models in dealing with multiple source contexts and present the outcome of our current investigations. We argue that a multi-premise entailment task is one close approximation toward identifying semantic-level non-novelty. Our recent approach either performs comparably to or achieves significant improvement over the latest reported results on several datasets and across several related tasks (paraphrasing, plagiarism, rewrite). We critically analyze our performance with respect to the existing state of the art and show the superiority and promise of our approach for future investigations. We also present our enhanced dataset TAP-DLND 2.0 and several baselines to the community for further research on document-level novelty detection.
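The entailment view of novelty can be sketched with an off-the-shelf NLI model. This is the single-premise baseline, not the authors' system; as the abstract notes, a multi-premise entailment task is the closer approximation, since redundant content may be assembled from several source documents.

```python
# Novelty scoring via textual entailment: a sentence is non-novel if some
# previously seen document entails it. Uses the public roberta-large-mnli
# checkpoint; threshold is an illustrative assumption.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def entailment_prob(premise: str, hypothesis: str) -> float:
    inputs = tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli(**inputs).logits.softmax(dim=-1)[0]
    return probs[nli.config.label2id["ENTAILMENT"]].item()

def is_novel(sentence: str, sources: list[str], thresh: float = 0.5) -> bool:
    # Single-premise approximation: check each source document separately.
    return max(entailment_prob(src, sentence) for src in sources) < thresh

seen = ["The company announced record quarterly profits on Monday."]
print(is_novel("The firm reported record profits for the quarter.", seen))
print(is_novel("The CEO resigned unexpectedly.", seen))
```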
Citations: 5
Linguistic Parameters of Spontaneous Speech for Identifying Mild Cognitive Impairment and Alzheimer Disease
V. Vincze, Martina Katalin Szabó, I. Hoffmann, L. Tóth, M. Pákáski, J. Kálmán, G. Gosztolya · IF 9.3 · CAS Tier 2 (Computer Science) · Q2 (Computer Science, Artificial Intelligence) · Computational Linguistics, pp. 119–153 · Pub Date: 2021-12-20 · DOI: 10.1162/coli_a_00428
In this article, we seek to automatically identify Hungarian patients suffering from mild cognitive impairment (MCI) or mild Alzheimer disease (mAD) based on their speech transcripts, focusing only on linguistic features. In addition to the features examined in our earlier study, we introduce syntactic, semantic, and pragmatic features of spontaneous speech that might affect the detection of dementia. In order to ascertain the most useful features for distinguishing healthy controls, MCI patients, and mAD patients, we carry out a statistical analysis of the data and investigate the significance level of the extracted features among various speaker group pairs and for various speaking tasks. In the second part of the article, we use this rich feature set as a basis for an effective discrimination among the three speaker groups. In our machine learning experiments, we analyze the efficacy of each feature group separately. Our model that uses all the features achieves competitive scores, either with or without demographic information (3-class accuracy values: 68%–70%, 2-class accuracy values: 77.3%–80%). We also analyze how different data recording scenarios affect linguistic features and how they can be productively used when distinguishing MCI patients from healthy controls.
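The classification setup, linguistic features feeding a standard classifier, can be sketched as below with synthetic data; the feature names are placeholders for the paper's syntactic, semantic, and pragmatic features, and the toy effect sizes are invented.

```python
# Generic feature-based classification sketch (not the paper's features or
# data): three synthetic linguistic features per speaker, three classes.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))   # columns: pause rate, type-token ratio, depth
y = np.repeat([0, 1, 2], 20)   # 0 = control, 1 = MCI, 2 = mAD (toy labels)
X[y == 1, 0] += 1.0            # invented effect: MCI -> more pauses
X[y == 2, 0] += 2.0            # invented effect: mAD -> even more pauses

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, X, y, cv=5)
print(f"3-class accuracy: {scores.mean():.2f}")
```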
Citations: 4
Obituary: Martin Kay
R. Kaplan, H. Uszkoreit · IF 9.3 · CAS Tier 2 (Computer Science) · Q2 (Computer Science, Artificial Intelligence) · Computational Linguistics 48(1): 1–3 · Pub Date: 2021-12-16 · DOI: 10.1162/coli_a_00424
It is with great sadness that we report the passing of Martin Kay in August 2021. Martin was a pioneer and intellectual trailblazer in computational linguistics. He was also a close friend and colleague of many years. Martin was a polyglot undergraduate student of modern and medieval languages at Cambridge University, with a particular interest in translation. He was not (yet) a mathematician or engineer, but idle speculation in 1958 about the possibilities of automating the translation process led him to Margaret Masterman at the Cambridge Language Research Unit, and a shift to a long and productive career. In 1960 he was offered an internship with Dave Hays and the Linguistics Project at The RAND Corporation in California, another early center of research in our emerging discipline. He stayed at RAND for more than a decade, working on basic technologies that are needed for machine processing of natural language. Among his contributions during that period was the development of the first so-called chart parser (Kay 1967), a computationally effective mechanism for dealing systematically with linguistic dependencies that cannot be expressed in context-free grammars. The chart architecture could be deployed for language generation as well as parsing, an important property for Martin’s continuing interest in translation. It was during the years at RAND that Martin found his second calling, as a teacher of computational linguistics, initially at UCLA and then in many other settings. He was a gifted and entertaining speaker and lecturer, able to present complex material with clarity and precision. He took great pleasure in the interactions with his students and the role that he played in helping to advance their careers. He left RAND in 1972 to become a full-time professor and chair of the Computer Science Department at the University of California at Irvine. His time at Irvine was short-lived, as he was attracted back to an open-ended research environment. In 1974 he joined with Danny Bobrow, Ron Kaplan, and Terry Winograd to form the Language Understander project at the recently created Palo Alto Research Center (PARC) of the Xerox Corporation. The group took as a first goal the construction of a mixed-initiative dialog system using state-of-the-art components for knowledge representation and reasoning, language understanding, language production, and dialog management (Bobrow et al. 1977). Martin took responsibility for …
Citations: 104