Automatic Identification of Authors' Stylistics and Gender on the Basis of the Corpus of Russian Fiction Using Extended Set-theoretic Model with Collocation Extraction

Glottometrics (Linguistics; IF 0.8, Q4), pp. 76-89. Pub Date: 2021-05-01. DOI: 10.53482/2021_50_389

Alexandr Osochkin, X. Piotrowska, Vladimir Fomin


Citations: 1

Abstract

We present a novel quantitative approach to classifying authors' style and gender differences based on the extraction of word collocations. The proposed algorithm mitigates previously described problems of text processing with vector models. We demonstrate the approach on a corpus of Russian prose. We discuss different approaches to classifying and identifying an author's style, as implemented by currently available software solutions and libraries for morphological analysis, text parameterization and indexing methods, artificial intelligence algorithms, and knowledge extraction. Our results demonstrate the efficiency and relative advantage of regression decision tree methods in identifying informative frequency indexes in a way that lends itself to logical interpretation. We develop a toolkit for comparative experiments that assess the effectiveness of classifying natural-language text data under three models of text representation: vector, set-theoretic, and the authors' set-theoretic model with collocation extraction. Comparing the ability of different methods to identify the style and gender differences of authors of fiction, we find that the proposed approach, which incorporates collocation information, alleviates some of the previously identified deficiencies and yields overall improvements in classification accuracy.
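The abstract does not spell out the model itself, so the following is only a minimal sketch of the general idea of a set-theoretic text representation extended with collocations. Everything in it is an assumption made for illustration: the whitespace tokenizer, pointwise mutual information (PMI) as the collocation score, the thresholds `min_count` and `min_pmi`, and Jaccard similarity as the way to compare two texts. The paper may well use different measures and features.

```python
import math
from collections import Counter

PUNCT = ".,;:!?\"'()"


def tokenize(text):
    """Hypothetical tokenizer: lowercase, split on whitespace, strip punctuation."""
    tokens = (w.strip(PUNCT).lower() for w in text.split())
    return [t for t in tokens if t]


def pmi_collocations(token_lists, min_count=2, min_pmi=1.0):
    """Return adjacent word pairs whose PMI over the corpus is at least min_pmi.

    PMI is an assumed association measure, not one taken from the paper.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in token_lists:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    found = set()
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # too rare to score reliably
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        if math.log2(p_ab / (p_a * p_b)) >= min_pmi:
            found.add((a, b))
    return found


def feature_set(tokens, collocations):
    """Set-theoretic representation: the set of words plus any known collocations."""
    features = set(tokens)
    features.update(bg for bg in zip(tokens, tokens[1:]) if bg in collocations)
    return features


def jaccard(a, b):
    """Jaccard similarity of two feature sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

In a sketch like this, two texts that share a collocation overlap on the individual words and also on the pair itself, which is one way collocation information could sharpen a purely word-based set comparison.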

Source journal

Glottometrics (Linguistics)

CiteScore: 0.50

Self-citation rate: 0.00%

About the journal: The aim of Glottometrics is the quantification, measurement and mathematical modeling of any kind of language phenomena. We invite contributions on probabilistic or other mathematical models (e.g. graph-theoretic or optimization approaches) that make it possible to establish language laws which can be validated by testing statistical hypotheses.