Computational Linguistics: Latest Articles

Probing Classifiers: Promises, Shortcomings, and Advances

IF 9.3 | Zone 2, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computational Linguistics

Pub Date : 2021-02-24 DOI: 10.1162/coli_a_00422

Yonatan Belinkov

Probing classifiers have emerged as one of the prominent methodologies for interpreting and analyzing deep neural network models of natural language processing. The basic idea is simple—a classifier is trained to predict some linguistic property from a model’s representations—and has been used to examine a wide variety of models and properties. However, recent studies have demonstrated various methodological limitations of this approach. This squib critically reviews the probing classifiers framework, highlighting their promises, shortcomings, and advances.

Citations: 160
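The basic probing setup described in the abstract above can be sketched in a few lines. This is a minimal illustration rather than the method of any particular study: the "representations" are synthetic vectors standing in for frozen model activations, the property is a made-up binary label, and the names `make_representations`, `train_probe`, and `accuracy` are hypothetical. The probe is a logistic-regression classifier trained with plain SGD and compared against a majority-class baseline.

```python
import math
import random


def make_representations(n, dim, rng):
    """Synthetic stand-in for frozen model representations (an assumption).

    Dimension 0 carries the binary property plus noise; all other
    dimensions are pure noise.
    """
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += 2.0 if label == 1 else -2.0
        data.append((vec, label))
    return data


def train_probe(data, dim, epochs=50, lr=0.1):
    """Fit a logistic-regression probe with SGD; the inputs are never updated."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for vec, label in data:
            z = sum(wi * xi for wi, xi in zip(w, vec)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - label  # gradient of the log loss with respect to z
            w = [wi - lr * g * xi for wi, xi in zip(w, vec)]
            b -= lr * g
    return w, b


def accuracy(data, w, b):
    """Fraction of examples whose predicted label matches the gold label."""
    correct = 0
    for vec, label in data:
        z = sum(wi * xi for wi, xi in zip(w, vec)) + b
        correct += int((z > 0) == (label == 1))
    return correct / len(data)


rng = random.Random(0)
train = make_representations(400, 8, rng)
test = make_representations(200, 8, rng)

w, b = train_probe(train, 8)
probe_acc = accuracy(test, w, b)

# Majority-class baseline: always predict the most frequent training label.
majority = int(sum(label for _, label in train) * 2 >= len(train))
baseline_acc = sum(int(label == majority) for _, label in test) / len(test)

print(f"probe accuracy:    {probe_acc:.3f}")
print(f"majority baseline: {baseline_acc:.3f}")
```

A probe that beats the baseline shows only that the property is linearly decodable from these vectors. As the abstract notes, this kind of result has methodological limitations, so it should not be read as evidence that a model uses the property.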

Position Information in Transformers: An Overview

IF 9.3 | Zone 2, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computational Linguistics

Pub Date : 2021-02-22 DOI: 10.1162/coli_a_00445

Philipp Dufter, Martin Schmitt, Hinrich Schütze

Transformers are arguably the main workhorse in recent natural language processing research. By definition, a Transformer is invariant with respect to reordering of the input. However, language is inherently sequential and word order is essential to the semantics and syntax of an utterance. In this article, we provide an overview and theoretical comparison of existing methods to incorporate position information into Transformer models. The objectives of this survey are to (1) showcase that position information in Transformer is a vibrant and extensive research area; (2) enable the reader to compare existing methods by providing a unified notation and systematization of different approaches along important model dimensions; (3) indicate what characteristics of an application should be taken into account when selecting a position encoding; and (4) provide stimuli for future research.

Citations: 69
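The reordering claim in the abstract above can be checked numerically on a toy example. The sketch below is a simplification and not the survey's notation: it uses a single attention head with identity query/key/value projections, mean pooling over the output, and fixed sinusoidal position vectors as one well-known choice among the many encodings the survey compares. All function names are hypothetical. Without position vectors, the pooled output is the same for any ordering of the input. Adding a position vector to each token breaks that invariance.

```python
import math


def sinusoidal_encoding(position, dim):
    """Fixed sinusoidal position vector: sine on even indices, cosine on odd."""
    vec = []
    for i in range(dim):
        angle = position / (10000 ** (2 * (i // 2) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def self_attention(xs):
    """Single-head scaled dot-product self-attention.

    Queries, keys, and values are the inputs themselves (identity
    projections), which is a simplifying assumption for illustration.
    """
    dim = len(xs[0])
    out = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in xs]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(
            [sum(w / total * v[j] for w, v in zip(weights, xs)) for j in range(dim)]
        )
    return out


def mean_pool(rows):
    """Average the output vectors into one sequence-level vector."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def add_positions(xs):
    """Add the position vector for index `pos` to the token at that index."""
    return [
        [x + p for x, p in zip(vec, sinusoidal_encoding(pos, len(vec)))]
        for pos, vec in enumerate(xs)
    ]


def max_abs_diff(u, v):
    return max(abs(a - b) for a, b in zip(u, v))


tokens = [[1.0, 0.0, 0.5, -0.5], [0.0, 1.0, -1.0, 0.25], [0.5, 0.5, 0.0, 1.0]]
reordered = [tokens[2], tokens[0], tokens[1]]

no_pos_a = mean_pool(self_attention(tokens))
no_pos_b = mean_pool(self_attention(reordered))
with_pos_a = mean_pool(self_attention(add_positions(tokens)))
with_pos_b = mean_pool(self_attention(add_positions(reordered)))

print("difference without positions:", max_abs_diff(no_pos_a, no_pos_b))
print("difference with positions:   ", max_abs_diff(with_pos_a, with_pos_b))
```

Strictly speaking, the per-token outputs of self-attention are reordered along with the input (equivariance). Only an order-insensitive readout such as the mean pooling used here is fully invariant.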

Sparse Transcription

IF 9.3 | Zone 2, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computational Linguistics

Pub Date : 2021-02-01 DOI: 10.1162/coli_a_00387

Steven Bird

The transcription bottleneck is often cited as a major obstacle for efforts to document the world’s endangered languages and supply them with language technologies. One solution is to extend methods from automatic speech recognition and machine translation, and recruit linguists to provide narrow phonetic transcriptions and sentence-aligned translations. However, I believe that these approaches are not a good fit with the available data and skills, or with long-established practices that are essentially word-based. In seeking a more effective approach, I consider a century of transcription practice and a wide range of computational approaches, before proposing a computational model based on spoken term detection that I call “sparse transcription.” This represents a shift away from current assumptions that we transcribe phones, transcribe fully, and transcribe first. Instead, sparse transcription combines the older practice of word-level transcription with interpretive, iterative, and interactive processes that are amenable to wider participation and that open the way to new methods for processing oral languages.

Citations: 29

Efficient Outside Computation

IF 9.3 | Zone 2, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computational Linguistics

Pub Date : 2021-02-01 DOI: 10.1162/coli_a_00386

D. Gildea

Weighted deduction systems provide a framework for describing parsing algorithms that can be used with a variety of operations for combining the values of partial derivations. For some operations, inside values can be computed efficiently, but outside values cannot. We view outside values as functions from inside values to the total value of all derivations, and we analyze outside computation in terms of function composition. This viewpoint helps explain why efficient outside computation is possible in many settings, despite the lack of a general outside algorithm for semiring operations.

Citations: 1

Generalizing Cross-Document Event Coreference Resolution Across Multiple Corpora

IF 9.3 | Zone 2, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computational Linguistics

Pub Date : 2020-11-24 DOI: 10.1162/coli_a_00407

M. Bugert, Nils Reimers, Iryna Gurevych

Cross-document event coreference resolution (CDCR) is an NLP task in which mentions of events need to be identified and clustered throughout a collection of documents. CDCR aims to benefit downstream multidocument applications, but despite recent progress on corpora and system development, downstream improvements from applying CDCR have not been shown yet. We make the observation that every CDCR system to date was developed, trained, and tested only on a single respective corpus. This raises strong concerns on their generalizability, a must-have for downstream applications where the magnitude of domains or event mentions is likely to exceed those found in a curated corpus. To investigate this assumption, we define a uniform evaluation setup involving three CDCR corpora: ECB+, the Gun Violence Corpus, and the Football Coreference Corpus (which we reannotate on token level to make our analysis possible). We compare a corpus-independent, feature-based system against a recent neural system developed for ECB+. Although being inferior in absolute numbers, the feature-based system shows more consistent performance across all corpora whereas the neural system is hit-or-miss. Via model introspection, we find that the importance of event actions, event time, and so forth, for resolving coreference in practice varies greatly between the corpora. Additional analysis shows that several systems overfit on the structure of the ECB+ corpus. We conclude with recommendations on how to achieve generally applicable CDCR systems in the future, the most important being that evaluation on multiple CDCR corpora is strongly necessary. To facilitate future research, we release our dataset, annotation guidelines, and system implementation to the public.

Citations: 9

Deep Learning for Text Style Transfer: A Survey

IF 9.3 | Zone 2, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computational Linguistics

Pub Date : 2020-11-01 DOI: 10.1162/coli_a_00426

Di Jin, Zhijing Jin, Zhiting Hu, Olga Vechtomova, Rada Mihalcea

Text style transfer is an important task in natural language generation, which aims to control certain attributes in the generated text, such as politeness, emotion, humor, and many others. It has a long history in the field of natural language processing, and recently has re-gained significant attention thanks to the promising performance brought by deep neural models. In this article, we present a systematic survey of the research on neural text style transfer, spanning over 100 representative articles since the first neural text style transfer work in 2017. We discuss the task formulation, existing datasets and subtasks, evaluation, as well as the rich methodologies in the presence of parallel and non-parallel data. We also provide discussions on a variety of important topics regarding the future development of this task.

Citations: 136

Sentence Meaning Representations Across Languages: What Can We Learn from Existing Frameworks?

IF 9.3 | Zone 2, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computational Linguistics

Pub Date : 2020-11-01 DOI: 10.1162/coli_a_00385

Z. Žabokrtský, Daniel Zeman, M. Ševčíková

This article gives an overview of how sentence meaning is represented in eleven deep-syntactic frameworks, ranging from those based on linguistic theories elaborated for decades to rather lightweight NLP-motivated approaches. We outline the most important characteristics of each framework and then discuss how particular language phenomena are treated across those frameworks, while trying to shed light on commonalities as well as differences.

Citations: 17

Tractable Lexical-Functional Grammar

IF 9.3 | Zone 2, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computational Linguistics

Pub Date : 2020-11-01 DOI: 10.1162/coli_a_00384

Jürgen Wedekind, R. Kaplan

The formalism for Lexical-Functional Grammar (LFG) was introduced in the 1980s as one of the first constraint-based grammatical formalisms for natural language. It has led to substantial contributions to the linguistic literature and to the construction of large-scale descriptions of particular languages. Investigations of its mathematical properties have shown that, without further restrictions, the recognition, emptiness, and generation problems are undecidable, and that they are intractable in the worst case even with commonly applied restrictions. However, grammars of real languages appear not to invoke the full expressive power of the formalism, as indicated by the fact that algorithms and implementations for recognition and generation have been developed that run—even for broad-coverage grammars—in typically polynomial time. This article formalizes some restrictions on the notation and its interpretation that are compatible with conventions and principles that have been implicit or informally stated in linguistic theory. We show that LFG grammars that respect these restrictions, while still suitable for the description of natural languages, are equivalent to linear context-free rewriting systems and allow for tractable computation.

Citations: 8

A Graph-Based Framework for Structured Prediction Tasks in Sanskrit

IF 9.3 | Zone 2, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computational Linguistics

Pub Date : 2020-10-22 DOI: 10.1162/coli_a_00390

A. Krishna, Ashim Gupta, Pawan Goyal, Bishal Santra, Pavankumar Satuluri

We propose a framework using energy-based models for multiple structured prediction tasks in Sanskrit. Ours is an arc-factored model, similar to the graph-based parsing approaches, and we consider the tasks of word segmentation, morphological parsing, dependency parsing, syntactic linearization, and prosodification, a “prosody-level” task we introduce in this work. Ours is a search-based structured prediction framework, which expects a graph as input, where relevant linguistic information is encoded in the nodes, and the edges are then used to indicate the association between these nodes. Typically, the state-of-the-art models for morphosyntactic tasks in morphologically rich languages still rely on hand-crafted features for their performance. But here, we automate the learning of the feature function. The feature function so learned, along with the search space we construct, encode relevant linguistic information for the tasks we consider. This enables us to substantially reduce the training data requirements to as low as 10%, as compared to the data requirements for the neural state-of-the-art models. Our experiments in Czech and Sanskrit show the language-agnostic nature of the framework, where we train highly competitive models for both the languages. Moreover, our framework enables us to incorporate language-specific constraints to prune the search space and to filter the candidates during inference. We obtain significant improvements in morphosyntactic tasks for Sanskrit by incorporating language-specific constraints into the model. In all the tasks we discuss for Sanskrit, we either achieve state-of-the-art results or ours is the only data-driven solution for those tasks.

Citations: 19

Statistical Significance Testing for Natural Language Processing

IF 9.3 | Zone 2, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computational Linguistics

Pub Date : 2020-10-20 DOI: 10.1162/coli_r_00388

Edwin Simpson

Like any other science, research in natural language processing (NLP) depends on the ability to draw correct conclusions from experiments. A key tool for this is statistical significance testing: We use it to judge whether a result provides meaningful, generalizable findings or should be taken with a pinch of salt. When comparing new methods against others, performance metrics often differ by only small amounts, so researchers turn to significance tests to show that improved models are genuinely better. Unfortunately, this reasoning often fails because we choose inappropriate significance tests or carry them out incorrectly, making their outcomes meaningless. Or, the test we use may fail to indicate a significant result when a more appropriate test would find one. NLP researchers must avoid these pitfalls to ensure that their evaluations are sound and ultimately avoid wasting time and money through incorrect conclusions. This book guides NLP researchers through the whole process of significance testing, making it easy to select the right kind of test by matching canonical NLP tasks to specific significance testing procedures. As well as being a handbook for researchers, the book provides theoretical background on significance testing, includes new methods that solve problems with significance tests in the world of deep learning and multidataset benchmarks, and describes the open research problems of significance testing for NLP. The book focuses on the task of comparing one algorithm with another. At the core of this is the p-value, the probability that a difference at least as extreme as the one we observed could occur by chance. If the p-value falls below a predetermined threshold, the result is declared significant. Leaving aside the fundamental limitation of turning the validity of results into a binary question with an arbitrary threshold, to be a valid statistical significance test, the p-value must be computed in the right way. 
The book describes the two crucial properties of an appropriate significance test: The test must be both valid and powerful. Validity refers to the avoidance of type 1 errors, in which the result is incorrectly declared significant. Common mistakes that lead to type 1 errors include deploying tests that make incorrect assumptions, such as independence between data points. The power of a test refers to its ability to detect a significant result and therefore to avoid type 2 errors. Here, knowledge of the data and experiment must be used to choose a test that makes the correct assumptions. There is a trade-off between validity and power, but for the most common NLP tasks (language modeling, sequence labeling, translation, etc.), there are clear choices of tests that provide a good balance.

Citations: 36
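The p-value described in the review above can be estimated by resampling when two systems are scored on the same test examples. The sketch below uses a paired approximate randomization (sign-flip permutation) test, chosen here only as a common illustration and not necessarily the procedure the book recommends for a given task. The function name `paired_permutation_test` and the toy scores are hypothetical. The test assumes the per-example score differences are exchangeable under the null hypothesis, which is exactly the kind of independence assumption the review warns can be violated.

```python
import random


def paired_permutation_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired approximate randomization test.

    Under the null hypothesis the two systems are interchangeable, so the
    sign of each per-example difference can be flipped at random. The
    p-value estimate is the share of sign-flipped samples whose absolute
    mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    at_least_as_extreme = 0
    for _ in range(trials):
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        # Small tolerance guards against floating-point ties.
        if abs(total / n) >= observed - 1e-12:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (trials + 1)


# Toy per-example correctness (1 = correct) on 100 shared test items.
system_a = [1] * 80 + [0] * 20  # 80% accuracy
system_b = [1] * 60 + [0] * 40  # 60% accuracy; wrong on 20 items A gets right

p_different = paired_permutation_test(system_a, system_b)
p_same = paired_permutation_test(system_a, system_a)

print(f"A vs. B: p = {p_different:.4f}")
print(f"A vs. A: p = {p_same:.4f}")
```

The resolution of the estimate is limited by `trials`: with 10,000 resamples the smallest reportable p-value is about 0.0001. A system compared with itself yields p = 1, since every resample is at least as extreme as an observed difference of zero.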