Machine learning-based authorship attribution using token n-grams and other time tested features

S. Gupta, Swarupa Das, Jyotish Ranjan Mallik
International Journal of Hybrid Intelligent Systems, vol. 53, no. 1, pp. 37-51. Published 2022-04-21. DOI: 10.3233/his-220005 (https://doi.org/10.3233/his-220005)

Abstract

Authorship attribution is the task of determining or identifying the author of a given text document. It becomes relevant when two or more writers claim authorship of an unidentified or anonymous text document, or when none of them is willing to accept authorship. This work applies several machine learning techniques to the problem of author identification. In the proposed approach, a number of textual features are extracted, such as token n-grams, stylometric features, bag-of-words and TF-IDF. Experiments were performed on three datasets (the Spooky Author Identification dataset, the Reuter_50_50 dataset and the Manual dataset) with three train-test split ratios: 80-20, 70-30 and 66.67-33.33. Models were built and tested with supervised learning algorithms such as Naïve Bayes, Support Vector Machine, K-Nearest Neighbor, Decision Tree and Random Forest. For the Spooky dataset, the best accuracy obtained is 84.14%, using bag-of-words features with a Naïve Bayes classifier. For the Reuter_50_50 dataset, the best accuracy is 86.2%, using the 2100 most frequent words with a Support Vector Machine classifier. For the Manual dataset, the best score of 96.67% is obtained with a Naïve Bayes classification model under both 5-fold and 10-fold cross-validation, when syntactic features and the 600 most frequent unigrams are used in combination.
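To make the feature and classifier terminology above concrete, the following is a minimal sketch of token n-gram extraction feeding a multinomial Naïve Bayes classifier. It is not the authors' implementation: the whitespace tokenizer, the add-one smoothing, the names `token_ngrams` and `NaiveBayesAttributor`, and the four toy sentences with author labels "A" and "B" are all assumptions made for illustration, and none of the paper's datasets or reported accuracies are reproduced here.

```python
import math
from collections import Counter, defaultdict


def token_ngrams(text, n):
    """Return the token n-grams of a text as tuples.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


class NaiveBayesAttributor:
    """Multinomial Naive Bayes over n-gram counts with add-one smoothing."""

    def __init__(self, n=1):
        self.n = n  # n=1 gives a plain bag-of-words model

    def fit(self, texts, authors):
        self.doc_counts = Counter(authors)          # documents per author
        self.feature_counts = defaultdict(Counter)  # n-gram counts per author
        for text, author in zip(texts, authors):
            self.feature_counts[author].update(token_ngrams(text, self.n))
        self.vocab = {f for c in self.feature_counts.values() for f in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_author, best_score = None, -math.inf
        for author, counts in self.feature_counts.items():
            # Log prior plus smoothed log likelihood of each known n-gram.
            score = math.log(self.doc_counts[author] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for feat in token_ngrams(text, self.n):
                if feat in self.vocab:  # n-grams unseen in training are skipped
                    score += math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_author, best_score = author, score
        return best_author


# Hypothetical toy corpus, invented purely for illustration.
train_texts = [
    "the raven spoke of midnight dreary",
    "the monster fled across the ice",
    "midnight shadows and the raven",
    "the creature wandered over the ice",
]
train_authors = ["A", "B", "A", "B"]

model = NaiveBayesAttributor(n=1).fit(train_texts, train_authors)
print(model.predict("a raven at midnight"))
print(model.predict("the ice"))
```

Setting `n=2` in the constructor switches the features from unigrams to token bigrams. A realistic experiment would also strip punctuation, restrict the vocabulary to the most frequent items, and evaluate on a held-out split or with cross-validation.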