A Comprehensive Study of Learning Approaches for Author Gender Identification

IF 2 | Zone 4 (Computer Science) | Q3 AUTOMATION & CONTROL SYSTEMS | Information Technology and Control, pp. 429-445 | Pub Date: 2022-09-23 | DOI: 10.5755/j01.itc.51.3.29907
Tuǧba Dalyan, H. Ayral, Özgür Özdemir
Citations: 2

Abstract

In recent years, author gender identification has become an important yet challenging task in information retrieval and computational linguistics. This paper presents different learning approaches to author gender identification for Turkish articles. First, several classification algorithms are applied to representations based on different paradigms: fixed-length vector representations such as Stylometric Features (SF) and Bag-of-Words (BoW), and distributed word/document embeddings such as Word2vec, fastText and Doc2vec. Second, deep learning architectures are designed and their performance is compared: Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), special kinds of RNN such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), C-RNN, Bidirectional LSTM (bi-LSTM), Bidirectional GRU (bi-GRU), Hierarchical Attention Networks and Multi-head Attention (MHA). We conducted a variety of experiments and achieved outstanding empirical results. In conclusion, ML algorithms with BoW give promising results, and fastText appears to be a suitable choice among the embedding models. This comprehensive study contributes to the literature by applying different learning approaches to several kinds of representations. It is also the first significant attempt to identify author gender by applying SF, fastText and DNN architectures to the Turkish language.
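
To make the "ML algorithms with BoW" setting in the abstract concrete, the following is a minimal sketch of a Bag-of-Words text classifier. It is not the authors' pipeline: the abstract does not name the classifiers or preprocessing used, so a multinomial naive Bayes model with add-one smoothing is assumed here purely for illustration. The names `bag_of_words` and `MultinomialNaiveBayes`, the toy English sentences, and the placeholder labels `class_a` and `class_b` (standing in for the two author classes) are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens (a sparse BoW vector)."""
    return Counter(text.lower().split())


class MultinomialNaiveBayes:
    """Multinomial naive Bayes over BoW counts with add-one (Laplace) smoothing.

    Illustrative assumption only; the abstract does not specify this classifier.
    """

    def fit(self, texts, labels):
        # Number of training documents per class, used for the class prior.
        self.class_doc_counts = Counter(labels)
        # Per-class token frequencies accumulated over all documents of that class.
        self.token_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.token_counts[label].update(bag_of_words(text))
        # Vocabulary is the set of every token seen during training.
        self.vocabulary = {
            token for counts in self.token_counts.values() for token in counts
        }
        return self

    def predict(self, text):
        total_docs = sum(self.class_doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_doc_counts.items():
            counts = self.token_counts[label]
            total_tokens = sum(counts.values())
            # Start from the log prior, then add smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            for token, freq in bag_of_words(text).items():
                if token not in self.vocabulary:
                    continue  # skip tokens never seen in training
                likelihood = (counts[token] + 1) / (total_tokens + vocab_size)
                score += freq * math.log(likelihood)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy corpus; real experiments would use labeled Turkish articles.
train_texts = [
    "match goal team league",
    "team coach goal season",
    "recipe oven flour sugar",
    "sugar butter recipe cake",
]
train_labels = ["class_a", "class_a", "class_b", "class_b"]

model = MultinomialNaiveBayes().fit(train_texts, train_labels)
print(model.predict("goal season team"))
print(model.predict("flour cake"))
```

For an agglutinative language such as Turkish, whitespace tokenization is likely too coarse, which may be one reason subword-aware embeddings such as fastText are of interest; that preprocessing choice is outside the scope of this sketch.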
Source journal
Information Technology and Control (Engineering & Technology: Computer Science, Artificial Intelligence)
CiteScore: 2.70
Self-citation rate: 9.10%
Articles published: 36
Review time: 12 months
Journal description: The journal covers a wide range of problems in computer science and control systems, including: software and hardware engineering; management systems engineering; information systems and databases; embedded systems; physical systems modelling and application; computer networks and cloud computing; data visualization; human-computer interface; computer graphics, visual analytics, and multimedia systems.