Validity, Reliability, and Significance: Empirical Methods for NLP and Data Science

Computational Linguistics · Published: 2022-11-07 · DOI: 10.1162/coli_r_00467
Richard Futrell
{"title":"Validity, Reliability, and Significance: Empirical Methods for NLP and Data Science","authors":"Richard Futrell","doi":"10.1162/coli_r_00467","DOIUrl":null,"url":null,"abstract":"When we come up with a new model in NLP and machine learning more generally, we usually look at some performance metric (one number), compare it against the same performance metric for a strong baseline model (one number), and if the new model gets a better number, we mark it in bold and declare it the winner. For anyone with a background in statistics or a field where conclusions must be drawn on the basis of noisy data, this procedure is frankly shocking. Suppose model A gets a BLEU score one point higher than model B: Is that difference reliable? If you used a slightly different dataset for training and evaluation, would that one point difference still hold? Would the difference even survive running the same models on the same datasets but with different random seeds? In fields such as psychology and biology, it is standard to answer such questions using standardized statistical procedures to make sure that differences of interest are larger than some quantification of measurement noise. Making a claim based on a bare difference of two numbers is unthinkable. Yet statistical procedures remain rare in the evaluation of NLP models, whose performance metrics are arguably just as noisy. To these objections, NLP practitioners can respond that they have faithfully followed the hallowed train-(dev-)test split paradigm. As long as proper test set discipline has been followed, the theory goes, the evaluation is secure: By testing on held-out data, we can be sure that our models are performing well in a way that is independent of random accidents of the training data, and by testing on that data only once, we guard against making claims based on differences that would not replicate if we ran the models again. But does the train-test split paradigm really guard against all problems of validity and reliability? Into this situation comes the book under review, Validity, Reliability, and Significance: Empirical Methods for NLP and Data Science, by Stefan Riezler and Michael Hagmann. The authors argue that the train-test split paradigm does not in fact insulate NLP from problems relating to the validity and reliability of its models, their features, and their performance metrics. They present numerous case studies to prove their point, and advocate and teach standard statistical methods as the solution, with rich examples","PeriodicalId":55229,"journal":{"name":"Computational Linguistics","volume":null,"pages":null},"PeriodicalIF":3.7000,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computational Linguistics","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1162/coli_r_00467","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

When we come up with a new model in NLP and machine learning more generally, we usually look at some performance metric (one number), compare it against the same performance metric for a strong baseline model (one number), and if the new model gets a better number, we mark it in bold and declare it the winner. For anyone with a background in statistics or a field where conclusions must be drawn on the basis of noisy data, this procedure is frankly shocking. Suppose model A gets a BLEU score one point higher than model B: Is that difference reliable? If you used a slightly different dataset for training and evaluation, would that one point difference still hold? Would the difference even survive running the same models on the same datasets but with different random seeds? In fields such as psychology and biology, it is standard to answer such questions using standardized statistical procedures to make sure that differences of interest are larger than some quantification of measurement noise. Making a claim based on a bare difference of two numbers is unthinkable. Yet statistical procedures remain rare in the evaluation of NLP models, whose performance metrics are arguably just as noisy.

To these objections, NLP practitioners can respond that they have faithfully followed the hallowed train-(dev-)test split paradigm. As long as proper test set discipline has been followed, the theory goes, the evaluation is secure: By testing on held-out data, we can be sure that our models are performing well in a way that is independent of random accidents of the training data, and by testing on that data only once, we guard against making claims based on differences that would not replicate if we ran the models again. But does the train-test split paradigm really guard against all problems of validity and reliability?

Into this situation comes the book under review, Validity, Reliability, and Significance: Empirical Methods for NLP and Data Science, by Stefan Riezler and Michael Hagmann. The authors argue that the train-test split paradigm does not in fact insulate NLP from problems relating to the validity and reliability of its models, their features, and their performance metrics. They present numerous case studies to prove their point, and advocate and teach standard statistical methods as the solution, with rich examples.
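The question the abstract raises, whether a one-point BLEU difference would survive a change of test data, can be made concrete with a paired resampling test of the kind the book advocates. Below is a minimal sketch of paired bootstrap resampling; the per-sentence scores are hypothetical placeholders rather than output of any real system.

```python
# A minimal sketch of a paired bootstrap resampling test. All numbers are
# hypothetical: in practice, scores_a and scores_b would be per-sentence
# metric scores for two systems evaluated on the same held-out test set.
import numpy as np

rng = np.random.default_rng(seed=0)
n_sentences = 2000

# Hypothetical per-sentence scores; system A is slightly better on average.
scores_a = rng.normal(loc=0.31, scale=0.10, size=n_sentences)
scores_b = rng.normal(loc=0.30, scale=0.10, size=n_sentences)

n_resamples = 10_000
wins_a = 0
for _ in range(n_resamples):
    # Resample test sentences with replacement, using the SAME indices for
    # both systems so that the comparison stays paired.
    idx = rng.integers(0, n_sentences, size=n_sentences)
    if scores_a[idx].mean() > scores_b[idx].mean():
        wins_a += 1

# If A beats B in, say, 95% or more of the resamples, the observed difference
# is unlikely to be an artifact of which sentences happened to land in the
# test set; if the proportion hovers near 50%, the "win" is probably noise.
print(f"System A ahead in {wins_a / n_resamples:.1%} of {n_resamples} resamples")
```

For a corpus-level metric such as BLEU, one would recompute the corpus statistic on each resampled set rather than averaging sentence scores, and an analogous resampling over training runs with different random seeds would address the seed-to-seed variation mentioned above.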
Source journal: Computational Linguistics
About the journal: Computational Linguistics, the longest-running publication dedicated solely to the computational and mathematical aspects of language and the design of natural language processing systems, provides university and industry linguists, computational linguists, AI and machine learning researchers, cognitive scientists, speech specialists, and philosophers with the latest insights into the computational aspects of language research.