Word embedding and cognitive linguistic models in text classification tasks

A. Surkova, S. Skorynin, Igor Chernobaev
DOI: 10.1145/3373722.3373778 (https://doi.org/10.1145/3373722.3373778)
Published in: Proceedings of the XI International Scientific Conference Communicative Strategies of the Information Society
Publication date: 2019-10-25
Citations: 4

Abstract

The paper considers two linguistic models and analyzes their suitability for text classification, as well as their combination in an integrated representation of texts. A cognitive approach to text classification is presented, and an algorithm that uses WordNet to identify the basic level of words is considered. A text classification model based on pre-trained word embeddings is also presented. The model consists of three layers: an embedding layer, a Long Short-Term Memory (LSTM) layer, and a softmax layer. It was trained and evaluated on the 20 Newsgroups dataset, and classification quality was assessed by F-measure, precision, and recall. The results are analyzed: both models perform well, and the low scores obtained for some texts are explained. The advantages and limitations of the linguistic models are shown. In future work, the authors plan to combine the proposed models and modify them. The model based on word embeddings offers considerable room for extension, from experimenting with different word embeddings and distance metrics to more complex layer architectures, state-of-the-art neural network models, and alternative activation functions and their modifications. Selecting an appropriate ensemble strategy is a further research direction.
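The three-layer structure described in the abstract (embedding, LSTM, softmax) can be illustrated with a minimal NumPy forward pass. This is a sketch under stated assumptions, not the authors' implementation: the weights are random rather than pre-trained or trained, the dimensions and the gate ordering are arbitrary choices, and the names `init_params` and `classify` are hypothetical. Only the 20 output classes are taken from the source, matching the 20 Newsgroups categories.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def init_params(vocab_size, embed_dim, hidden_dim, num_classes, seed=0):
    """Random parameters; a real model would load pre-trained embeddings."""
    rng = np.random.default_rng(seed)
    return {
        # Assumption: random stand-in for a pre-trained embedding matrix.
        "E": rng.normal(0.0, 0.1, (vocab_size, embed_dim)),
        # One stacked weight matrix for the four LSTM gates (i, f, o, g).
        "W": rng.normal(0.0, 0.1, (4 * hidden_dim, embed_dim + hidden_dim)),
        "b": np.zeros(4 * hidden_dim),
        # Output projection that feeds the softmax layer.
        "V": rng.normal(0.0, 0.1, (num_classes, hidden_dim)),
        "c": np.zeros(num_classes),
    }

def classify(token_ids, params):
    """Return class probabilities for one document given as token ids."""
    hidden_dim = params["V"].shape[1]
    h = np.zeros(hidden_dim)     # hidden state
    cell = np.zeros(hidden_dim)  # cell state
    for t in token_ids:
        x = params["E"][t]  # embedding layer: look up the word vector
        gates = params["W"] @ np.concatenate([x, h]) + params["b"]
        i, f, o, g = np.split(gates, 4)
        i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
        cell = f * cell + i * g  # LSTM layer: update the cell state
        h = o * np.tanh(cell)    # and the hidden state
    # Softmax layer over the final hidden state.
    return softmax(params["V"] @ h + params["c"])

params = init_params(vocab_size=50, embed_dim=8, hidden_dim=16, num_classes=20)
probs = classify([3, 17, 42, 5], params)
print(probs.shape, int(probs.argmax()))
```

Using only the final hidden state as the document representation is one common design choice; the abstract does not say whether the authors pooled over time steps or used the last state.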