Hierarchical Three-module Method of Text Classification in Web Big Data

Zahra Rezaei, B. Eslami, M. Amini, Mohammad Eslami
Published in: 2020 6th International Conference on Web Research (ICWR), 2020-04-01
DOI: 10.1109/ICWR49608.2020.9122326 (https://doi.org/10.1109/ICWR49608.2020.9122326)
Citations: 2

Abstract

Text analysis is a method for extracting knowledge from text. Memory and time limitations are crucial in processing big data because data sources are distributed across the web, search engines, and social network sites. In addition, with the automation of search, summarization, and the discovery of user interests, immediate classification of varied texts in a streaming manner has gained attention in industrial and scientific fields. Hierarchical classification of text is a common problem that is readily handled by traditional bag-of-words methods; however, with big data and a large number of class labels, traditional methods no longer meet the need. With the growth of data on the internet and social networks, more powerful methods are needed that can classify data accurately and immediately. Through abstraction over textual data, deep learning can deal with these challenges. This paper introduces a deep learning method based on hierarchical classification (HAN), named HAN-MODI, which can classify texts from social networks and web sites in real time, bilingually in English and Farsi, with an accuracy of 98.81%. The paper also shows that this complex network with three modules (word, sentence, and document) can work better at the word level, with no need to know the syntactic or semantic structure of the language. The novelty of the proposed method is the addition of a third level to the hierarchical structure, for general detection and for more exact detection of the class. In addition, classification with this method is multi-level, and with a change to HAN the method can be used with Farsi texts. The model is improved by adding a new layer above the HAN architecture: sentences are segmented into expressions, called Bag of Sentences, and a dynamic window is added at every stage, with the attention mechanism applied simultaneously.
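The three-level hierarchy described in the abstract can be illustrated with a minimal sketch. This is one possible reading, not the authors' HAN-MODI implementation: it assumes a document is split into segments (the "Bag of Sentences"), each segment into sentences, and each sentence into word vectors, and it pools every level with a simple dot-product attention. The recurrent encoders that a full HAN places before each attention step are omitted, and all names (`attention_pool`, `encode_document`, the context vectors) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, context):
    """Attention-weighted average of `vectors`.

    Each vector is scored by its dot product with a context vector
    (learned in a real model, fixed here for illustration).
    """
    scores = [sum(v_i * c_i for v_i, c_i in zip(v, context)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def encode_document(doc, ctx_word, ctx_sent, ctx_seg):
    """Pool words -> sentences -> segments -> one document vector.

    doc: list of segments; segment: list of sentences;
    sentence: list of word vectors (lists of floats).
    """
    segment_vecs = []
    for segment in doc:
        # Level 1: word vectors to one vector per sentence.
        sentence_vecs = [attention_pool(sent, ctx_word) for sent in segment]
        # Level 2: sentence vectors to one vector per segment.
        segment_vecs.append(attention_pool(sentence_vecs, ctx_sent))
    # Level 3 (the assumed extra level): segment vectors to a document vector.
    return attention_pool(segment_vecs, ctx_seg)


# Toy input: 2 segments, 2-dimensional word vectors (made-up values).
toy_doc = [
    [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]]],
    [[[0.2, 0.8], [0.9, 0.1]]],
]
doc_vector = encode_document(toy_doc, [1.0, 0.0], [0.0, 1.0], [0.5, 0.5])
print(doc_vector)
```

In a trained classifier the resulting document vector would be passed to an output layer over the class labels; here the context vectors are fixed only to keep the sketch self-contained.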