Machine learning-based authorship attribution using token n-grams and other time tested features

International journal of hybrid intelligent systems, Pub Date: 2022-04-21, DOI: 10.3233/his-220005

S. Gupta, Swarupa Das, Jyotish Ranjan Mallik

Publication type: Journal Article. Pages: 37-51.

Citation count: 0

Abstract

Authorship attribution is the task of identifying the author of a given text document. It becomes relevant when two or more writers claim authorship of an unidentified or anonymous document, or when none of them is willing to accept authorship. This work applies several machine learning techniques to the author identification problem. In the proposed approach, textual features such as token n-grams, stylometric features, bag-of-words and TF-IDF are extracted. Experiments were performed on three datasets (the Spooky Author Identification dataset, the Reuter_50_50 dataset and the Manual dataset) with three train-test split ratios: 80-20, 70-30 and 66.67-33.33. Models were built and tested with supervised learning algorithms such as Naive Bayes, Support Vector Machine, K-Nearest Neighbor, Decision Tree and Random Forest. For the Spooky dataset, the best accuracy is 84.14%, obtained with bag-of-words features and a Naive Bayes classifier. For the Reuter_50_50 dataset, the best accuracy is 86.2%, obtained with the 2100 most frequent words and a Support Vector Machine classifier. For the Manual dataset, the best score is 96.67%, obtained with a Naive Bayes classification model under both 5-fold and 10-fold cross-validation when syntactic features are combined with the 600 most frequent unigrams.
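The abstract does not show the authors' implementation, so the following is only a minimal sketch of the general idea: extract token n-grams from each text and classify with a multinomial Naive Bayes model. The function and class names (`token_ngrams`, `NaiveBayesAttributor`), the whitespace tokenization, the Laplace smoothing constant and the four toy sentences are all assumptions made for illustration; none of them come from the paper or its datasets.

```python
import math
from collections import Counter, defaultdict


def token_ngrams(text, n_max=2):
    """Return lowercase word n-grams for n = 1..n_max.

    Assumption: simple whitespace tokenization; the paper may differ.
    """
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


class NaiveBayesAttributor:
    """Hypothetical multinomial Naive Bayes over n-gram counts."""

    def __init__(self, n_max=2, alpha=1.0):
        self.n_max = n_max
        self.alpha = alpha  # Laplace smoothing constant (assumed value)

    def fit(self, texts, authors):
        self.doc_counts = Counter(authors)
        self.feature_counts = defaultdict(Counter)
        for text, author in zip(texts, authors):
            self.feature_counts[author].update(token_ngrams(text, self.n_max))
        self.vocab = set()
        for counts in self.feature_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_author, best_score = None, -math.inf
        for author, n_docs in self.doc_counts.items():
            counts = self.feature_counts[author]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for gram in token_ngrams(text, self.n_max):
                if gram in self.vocab:  # skip n-grams unseen in training
                    score += math.log((counts[gram] + self.alpha) / denom)
            if score > best_score:
                best_author, best_score = author, score
        return best_author


# Invented toy corpus with two placeholder authors "A" and "B".
train_texts = [
    "the raven tapped upon the chamber door",
    "a dreary midnight and the raven spoke",
    "the creature fled across the frozen ice",
    "my creature wandered over the ice alone",
]
train_authors = ["A", "A", "B", "B"]

model = NaiveBayesAttributor().fit(train_texts, train_authors)
print(model.predict("the raven sat above the door"))  # prints "A"
```

In practice a study like this one would hold out part of each dataset (for example with the 80-20 split mentioned above) and report accuracy on the held-out portion; the toy corpus here is far too small for that and only demonstrates the mechanics.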


Source journal

International journal of hybrid intelligent systems

CiteScore

3.30

Self-citation rate

0.00%

Number of publications