Exploration of a Balanced Reference Corpus with a Wide Variety of Text Mining Tools

Nicolas Turenne, Bokai Xu, Xinyue Li, Xindi Xu, Hongyu Liu, Xiaolin Zhu
Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence, published 2020-12-24
DOI: 10.1145/3446132.3446192 (https://doi.org/10.1145/3446132.3446192)

Abstract

Text mining techniques are usually compared on a single platform into which the user imports a text dataset. An alternative is to evaluate against a gold standard for a specific task, but a balanced common-language corpus is rarely used for this purpose. We chose the Corpus of Contemporary American English (COCA) as a balanced reference corpus and split it into categories, such as topics and genres, to which we applied families of feature extraction and machine learning algorithms. We found that the Stanford CoreNLP method was faster and more accurate than the NLTK method, and that it was more reliable and easier to understand. The clustering results show that higher modularity influences interpretation. For genre and topic classification, all techniques achieved relatively high scores, although these were below the state-of-the-art scores reported on challenge text datasets. Naïve Bayes outperformed the alternatives. We hope that balanced corpora from a variety of vernacular (or low-resource) languages can serve as references for assessing the efficiency of the wide range of state-of-the-art text mining tools.
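To make the classification setting concrete, the following is a minimal sketch of a multinomial Naïve Bayes text classifier with add-one smoothing, written in plain Python. It is an illustration only, not the authors' pipeline: the function names `train_naive_bayes` and `predict`, the whitespace tokenization, and the four toy training sentences are all assumptions introduced here, and none of the data comes from COCA.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Collect class and word counts from whitespace-tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        # Log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # skip words never seen during training
            # Smoothed log likelihood of the token given the class.
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy sentences standing in for genre-labelled texts (not COCA data).
train_docs = [
    "the senate passed the budget bill today",
    "officials said the election results were final",
    "she whispered and the moon rose over the hills",
    "he dreamed of the sea and the old house",
]
train_labels = ["news", "news", "fiction", "fiction"]

model = train_naive_bayes(train_docs, train_labels)
print(predict(model, "the election bill"))
print(predict(model, "the moon over the sea"))
```

On this toy data the first query is assigned to "news" and the second to "fiction". A real experiment would need a proper tokenizer, a held-out test split, and far more text per category.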