Principal components analysis in stylometry
Author: Hugh Craig
Journal: Digital Scholarship in the Humanities
Publication type: Journal Article
Published: 2023-11-29
DOI: https://doi.org/10.1093/llc/fqad083
Citations: 0
Abstract
Principal components analysis (PCA) has been one of the staple methods used in stylometry. In a 2021 article, Pervez Rizvi casts doubt on this method and argues that some widely cited results based on it should be set aside. In the current article, I show that none of Rizvi’s theoretical claims or experimental results stand up to examination. Rizvi argues that discarding the principal components beyond the first two makes the method unreliable, but permutation testing of PCAs shows that the top components in these trials are significant and robust, and the results across many experiments show the combination of the first and second component to be effective in classification. Rizvi argues that PCA components must be treated separately, and much of his critique of the PCA method is based on this standpoint, but this is not the practice in the work presented in the publications he cites or in the wider literature. Rizvi is unable to replicate a chart in an article by Craig, but his replication, unlike the original, does not account for the widely varying sizes of samples in his data. The current article shows that Rizvi’s claims are misguided and that using PCA in the Burrows tradition to find and formalize authorial discriminations in text samples from plays of the Shakespearean era is efficacious and robust.
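The permutation testing mentioned in the abstract can be illustrated with a minimal sketch. This is not the article's procedure, which the abstract does not spell out. It assumes one common variant: shuffle each variable's values independently across samples to destroy correlations between variables, rerun the PCA, and compare the share of variance explained by the top components with the observed share. The function names, the synthetic data (standing in for word-frequency variables measured on text samples), and the parameter defaults are all hypothetical.

```python
import numpy as np

def explained_variance_ratios(X):
    """Share of total variance carried by each principal component of X.

    Rows are samples, columns are variables. Columns are standardized,
    so this is a correlation-matrix PCA (assumes no constant column).
    """
    Xc = X - X.mean(axis=0)
    Xc = Xc / Xc.std(axis=0, ddof=1)
    # Squared singular values are proportional to component variances.
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s ** 2
    return var / var.sum()

def pca_permutation_test(X, n_components=2, n_permutations=200, seed=0):
    """Permutation p-values for the top components (illustrative variant)."""
    rng = np.random.default_rng(seed)
    observed = explained_variance_ratios(X)[:n_components]
    exceed = np.zeros(n_components)
    for _ in range(n_permutations):
        # Shuffle every column independently: each variable keeps its own
        # distribution, but correlations between variables are destroyed.
        Xp = np.column_stack(
            [rng.permutation(X[:, j]) for j in range(X.shape[1])]
        )
        null = explained_variance_ratios(Xp)[:n_components]
        exceed += null >= observed
    # Add-one correction so a p-value is never exactly zero.
    p_values = (exceed + 1) / (n_permutations + 1)
    return observed, p_values

# Synthetic stand-in data: 60 "samples" by 20 "variables" driven by one
# shared latent factor plus noise (hypothetical, not the article's data).
rng = np.random.default_rng(1)
latent = rng.normal(size=(60, 1))
loadings = rng.normal(size=(1, 20))
X = latent @ loadings + 0.5 * rng.normal(size=(60, 20))

observed, p_values = pca_permutation_test(X)
print("explained variance ratios:", observed)
print("permutation p-values:", p_values)
```

A small p-value for a component indicates that it captures more variance than would be expected if the variables were unrelated. Testing components beyond the first in this simple way is only a rough guide, since the null distribution for later components depends on the earlier ones.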
About the journal:
DSH, or Digital Scholarship in the Humanities, is an international, peer-reviewed journal that publishes original contributions on all aspects of digital scholarship in the Humanities including, but not limited to, the field of what is currently called the Digital Humanities. Long and short papers report on theoretical, methodological, experimental, and applied research and include results of research projects, descriptions and evaluations of tools, techniques, and methodologies, and reports on work in progress. DSH also publishes reviews of books and resources. Digital Scholarship in the Humanities was previously known as Literary and Linguistic Computing.