Principal components analysis in stylometry

Digital Scholarship in the Humanities (IF 1.1; Zone 3, Literature; Humanities, Multidisciplinary). Pub Date: 2023-11-29. DOI: 10.1093/llc/fqad083

Hugh Craig



Abstract

Principal components analysis (PCA) has been one of the staple methods used in stylometry. In a 2021 article, Pervez Rizvi casts doubt on this method and argues that some widely cited results based on it should be set aside. In the current article, I show that none of Rizvi’s theoretical claims or experimental results stand up to examination. Rizvi argues that discarding the principal components beyond the first two makes the method unreliable, but permutation testing of PCAs shows that the top components in these trials are significant and robust, and the results across many experiments show the combination of the first and second component to be effective in classification. Rizvi argues that PCA components must be treated separately, and much of his critique of the PCA method is based on this standpoint, but this is not the practice in the work presented in the publications he cites or in the wider literature. Rizvi is unable to replicate a chart in an article by Craig, but his replication, unlike the original, does not account for the widely varying sizes of samples in his data. The current article shows that Rizvi’s claims are misguided and that using PCA in the Burrows tradition to find and formalize authorial discriminations in text samples from plays of the Shakespearean era is efficacious and robust.
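The abstract refers to permutation testing of PCAs without giving the procedure. As a rough illustration only, the sketch below shows one common way such a test can be set up; it is not the article's own method. It assumes a samples-by-words matrix of relative frequencies (rates rather than raw counts, so that samples of different sizes are comparable), standardizes each word variable, and compares the share of variance captured by the top components with the shares obtained after shuffling each word column independently. The function names, the synthetic data, and the choice of 200 permutations are all hypothetical.

```python
import numpy as np


def pca_explained_variance(X):
    """Proportion of variance explained by each principal component.

    X is a (samples x variables) array; columns are standardized first,
    so this corresponds to PCA on the correlation matrix.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Squared singular values of Z are proportional to component variances.
    s = np.linalg.svd(Z, compute_uv=False)
    var = s ** 2
    return var / var.sum()


def permutation_test(X, n_components=2, n_permutations=200, seed=0):
    """Permutation p-values for the variance shares of the top components.

    Each column is shuffled independently, which keeps every variable's
    own distribution but breaks the correlations between variables.
    """
    rng = np.random.default_rng(seed)
    observed = pca_explained_variance(X)[:n_components]
    exceed = np.zeros(n_components)
    for _ in range(n_permutations):
        Xp = np.column_stack(
            [rng.permutation(X[:, j]) for j in range(X.shape[1])]
        )
        permuted = pca_explained_variance(Xp)[:n_components]
        exceed += permuted >= observed
    # Add-one correction so that a p-value is never exactly zero.
    p_values = (exceed + 1) / (n_permutations + 1)
    return observed, p_values


# Synthetic stand-in for word-rate data (an assumption, not real texts):
# 40 samples, 20 word variables, two groups differing on the first 10 words.
demo_rng = np.random.default_rng(42)
demo = demo_rng.normal(size=(40, 20))
demo[:20, :10] += 2.0

shares, p_vals = permutation_test(demo)
print("variance shares of PC1, PC2:", shares)
print("permutation p-values:", p_vals)
```

A small p-value for a component indicates that it captures more shared variation than would be expected if the word variables were unrelated. Other permutation schemes, such as shuffling group labels and testing separation on the first two components together, are equally plausible readings of the abstract.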


Source journal

Digital Scholarship in the Humanities Multiple-

CiteScore: 1.80

Self-citation rate: 25.00%

Journal description: DSH, or Digital Scholarship in the Humanities, is an international, peer-reviewed journal that publishes original contributions on all aspects of digital scholarship in the humanities, including, but not limited to, the field currently called the Digital Humanities. Long and short papers report on theoretical, methodological, experimental, and applied research and include results of research projects, descriptions and evaluations of tools, techniques, and methodologies, and reports on work in progress. DSH also publishes reviews of books and resources. Digital Scholarship in the Humanities was previously known as Literary and Linguistic Computing.