非标准词作为文本分类的特征

Slobodan Beliga, Sanda Martinčić-Ipšić
{"title":"非标准词作为文本分类的特征","authors":"Slobodan Beliga, Sanda Martinčić-Ipšić","doi":"10.1109/MIPRO.2014.6859744","DOIUrl":null,"url":null,"abstract":"This paper presents categorization of Croatian texts using Non-Standard Words (NSW) as features. NonStandard Words are: numbers, dates, acronyms, abbreviations, currency, etc. NSWs in Croatian language are determined according to Croatian NSW taxonomy. For the purpose of this research, 390 text documents were collected and formed the SKIPEZ collection with 6 classes: official, literary, informative, popular, educational and scientific. Text categorization experiment was conducted on three different representations of the SKIPEZ collection: in the first representation, the frequencies of NSWs are used as features; in the second representation, the statistic measures of NSWs (variance, coefficient of variation, standard deviation, etc.) are used as features; while the third representation combines the first two feature sets. Naive Bayes, CN2, C4.5, kNN, Classification Trees and Random Forest algorithms were used in text categorization experiments. The best categorization results are achieved using the first feature set (NSW frequencies) with the categorization accuracy of 87%. This suggests that the NSWs should be considered as features in highly inflectional languages, such as Croatian. NSW based features reduce the dimensionality of the feature space without standard lemmatization procedures, and therefore the bag-of-NSWs should be considered for further Croatian texts categorization experiments.","PeriodicalId":299409,"journal":{"name":"2014 37th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO)","volume":"28 5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Non-standard words as features for text categorization\",\"authors\":\"Slobodan Beliga, Sanda Martinčić-Ipšić\",\"doi\":\"10.1109/MIPRO.2014.6859744\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper presents categorization of Croatian texts using Non-Standard Words (NSW) as features. NonStandard Words are: numbers, dates, acronyms, abbreviations, currency, etc. NSWs in Croatian language are determined according to Croatian NSW taxonomy. For the purpose of this research, 390 text documents were collected and formed the SKIPEZ collection with 6 classes: official, literary, informative, popular, educational and scientific. Text categorization experiment was conducted on three different representations of the SKIPEZ collection: in the first representation, the frequencies of NSWs are used as features; in the second representation, the statistic measures of NSWs (variance, coefficient of variation, standard deviation, etc.) are used as features; while the third representation combines the first two feature sets. Naive Bayes, CN2, C4.5, kNN, Classification Trees and Random Forest algorithms were used in text categorization experiments. The best categorization results are achieved using the first feature set (NSW frequencies) with the categorization accuracy of 87%. This suggests that the NSWs should be considered as features in highly inflectional languages, such as Croatian. NSW based features reduce the dimensionality of the feature space without standard lemmatization procedures, and therefore the bag-of-NSWs should be considered for further Croatian texts categorization experiments.\",\"PeriodicalId\":299409,\"journal\":{\"name\":\"2014 37th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO)\",\"volume\":\"28 5 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-08-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2014 37th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/MIPRO.2014.6859744\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 37th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MIPRO.2014.6859744","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

摘要

本文提出了以非标准词(NSW)为特征的克罗地亚语文本分类。非标准词汇包括:数字、日期、首字母缩写、缩写、货币等。克罗地亚语中的新南威尔士州是根据克罗地亚的新南威尔士州分类法确定的。本研究收集了390份文本文件,形成了SKIPEZ系列,分为官方、文学、信息、大众、教育和科学6个类别。对SKIPEZ集合的三种不同表示进行了文本分类实验:在第一种表示中,使用NSWs的频率作为特征;在第二种表示中,使用NSWs的统计度量(方差、变异系数、标准差等)作为特征;而第三种表示结合了前两个特性集。文本分类实验采用朴素贝叶斯、CN2、C4.5、kNN、分类树和随机森林算法。使用第一个特征集(NSW频率)获得最佳分类结果,分类准确率为87%。这表明nsw应该被认为是高度屈折的语言的特征,比如克罗地亚语。基于NSW的特征在没有标准词形化过程的情况下降低了特征空间的维数,因此应该考虑将袋子-NSW用于进一步的克罗地亚文本分类实验。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Non-standard words as features for text categorization
This paper presents categorization of Croatian texts using Non-Standard Words (NSW) as features. NonStandard Words are: numbers, dates, acronyms, abbreviations, currency, etc. NSWs in Croatian language are determined according to Croatian NSW taxonomy. For the purpose of this research, 390 text documents were collected and formed the SKIPEZ collection with 6 classes: official, literary, informative, popular, educational and scientific. Text categorization experiment was conducted on three different representations of the SKIPEZ collection: in the first representation, the frequencies of NSWs are used as features; in the second representation, the statistic measures of NSWs (variance, coefficient of variation, standard deviation, etc.) are used as features; while the third representation combines the first two feature sets. Naive Bayes, CN2, C4.5, kNN, Classification Trees and Random Forest algorithms were used in text categorization experiments. The best categorization results are achieved using the first feature set (NSW frequencies) with the categorization accuracy of 87%. This suggests that the NSWs should be considered as features in highly inflectional languages, such as Croatian. NSW based features reduce the dimensionality of the feature space without standard lemmatization procedures, and therefore the bag-of-NSWs should be considered for further Croatian texts categorization experiments.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Non-standard words as features for text categorization Evaluation study and results of intelligent pedagogical agent-led learning scenarios in a virtual world Internet of things cloud mediator platform Use of iPads in foreign language classes Cloud based laboratory for distance education
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1