Genre Classification of Russian Texts Based on Modern Embeddings and Rhythm

IF 0.6 Q4 AUTOMATION & CONTROL SYSTEMS AUTOMATIC CONTROL AND COMPUTER SCIENCES Pub Date : 2024-02-27 DOI:10.3103/S0146411623070076
K. V. Lagutina
{"title":"Genre Classification of Russian Texts Based on Modern Embeddings and Rhythm","authors":"K. V. Lagutina","doi":"10.3103/S0146411623070076","DOIUrl":null,"url":null,"abstract":"<p>This article investigates modern vector text models for solving the problem of genre classifying Russian-language texts. The models include ELMo embeddings, a pretrained BERT language model, and a set of numerical rhythmic characteristics based on lexico-grammatical tools. The experiments have been carried out on a corpus of 10 000 texts in five genres: novels, scientific articles, reviews, posts from the VKontakte social network, and news from OpenCorpora. Visualization and analysis of statistics for rhythmic characteristics have made it possible to distinguish both the most diverse genres in terms of rhythm (novels and reviews) and the least (scientific articles). It is these genres that are subsequently classified best using rhythm and the LSTM neural network classifier. Clustering and classifying texts by genre using the ELMo and BERT embeddings make it possible to separate one genre from another with a small number of errors. The multiclassification F-measure reaches 99%. This study confirms the effectiveness of modern embeddings in the tasks of computational linguistics and highlights the advantages and limitations of the set rhythmic characteristics on the genre classification material.</p>","PeriodicalId":46238,"journal":{"name":"AUTOMATIC CONTROL AND COMPUTER SCIENCES","volume":"57 7","pages":"817 - 827"},"PeriodicalIF":0.6000,"publicationDate":"2024-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"AUTOMATIC CONTROL AND COMPUTER SCIENCES","FirstCategoryId":"1085","ListUrlMain":"https://link.springer.com/article/10.3103/S0146411623070076","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
引用次数: 0

Abstract

This article investigates modern vector text models for solving the problem of genre classifying Russian-language texts. The models include ELMo embeddings, a pretrained BERT language model, and a set of numerical rhythmic characteristics based on lexico-grammatical tools. The experiments have been carried out on a corpus of 10 000 texts in five genres: novels, scientific articles, reviews, posts from the VKontakte social network, and news from OpenCorpora. Visualization and analysis of statistics for rhythmic characteristics have made it possible to distinguish both the most diverse genres in terms of rhythm (novels and reviews) and the least (scientific articles). It is these genres that are subsequently classified best using rhythm and the LSTM neural network classifier. Clustering and classifying texts by genre using the ELMo and BERT embeddings make it possible to separate one genre from another with a small number of errors. The multiclassification F-measure reaches 99%. This study confirms the effectiveness of modern embeddings in the tasks of computational linguistics and highlights the advantages and limitations of the set rhythmic characteristics on the genre classification material.

Abstract Image

Abstract Image

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
基于现代嵌入和节奏的俄语文本体裁分类
摘要 本文研究了用于解决俄语文本体裁分类问题的现代矢量文本模型。这些模型包括 ELMo 嵌入、预训练 BERT 语言模型和一套基于词典-语法工具的数字节奏特征。实验是在一个包含 10 000 篇文本的语料库中进行的,这些文本包括五种体裁:小说、科学文章、评论、来自 VKontakte 社交网络的帖子和来自 OpenCorpora 的新闻。通过对节奏特征的可视化统计和分析,可以区分出在节奏方面最多样化的体裁(小说和评论)和最不多样化的体裁(科技文章)。正是这些体裁随后通过节奏和 LSTM 神经网络分类器得到了最佳分类。使用 ELMo 和 BERT 嵌入对文本进行流派聚类和分类,可以在误差较小的情况下将一种流派与另一种流派区分开来。多重分类的 F-measure 达到 99%。这项研究证实了现代嵌入式在计算语言学任务中的有效性,并强调了韵律特征集在体裁分类材料中的优势和局限性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
AUTOMATIC CONTROL AND COMPUTER SCIENCES
AUTOMATIC CONTROL AND COMPUTER SCIENCES AUTOMATION & CONTROL SYSTEMS-
CiteScore
1.70
自引率
22.20%
发文量
47
期刊介绍: Automatic Control and Computer Sciences is a peer reviewed journal that publishes articles on• Control systems, cyber-physical system, real-time systems, robotics, smart sensors, embedded intelligence • Network information technologies, information security, statistical methods of data processing, distributed artificial intelligence, complex systems modeling, knowledge representation, processing and management • Signal and image processing, machine learning, machine perception, computer vision
期刊最新文献
Advancing Driver Behavior Recognition: An Intelligent Approach Utilizing ResNet Airborne Chemical Detection Using IoT and Machine Learning in the Agricultural Area Chinese License Plate Recognition Based on OpenCV and LPCR Net Research on Groundwater Level Prediction Method in Karst Areas Based on Improved Attention Mechanism Fusion Time Convolutional Network Road Traffic Classification from Nighttime Videos Using the Multihead Self-Attention Vision Transformer Model and the SVM
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1