Exploration of a Balanced Reference Corpus with a Wide Variety of Text Mining Tools

Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence Pub Date : 2020-12-24 DOI:10.1145/3446132.3446192

Nicolas Turenne, Bokai Xu, Xinyue Li, Xindi Xu, Hongyu Liu, Xiaolin Zhu

{"title":"Exploration of a Balanced Reference Corpus with a Wide Variety of Text Mining Tools","authors":"Nicolas Turenne, Bokai Xu, Xinyue Li, Xindi Xu, Hongyu Liu, Xiaolin Zhu","doi":"10.1145/3446132.3446192","DOIUrl":null,"url":null,"abstract":"To compare various techniques, the same platform is generally used into which the user will import a text dataset. Another approach uses an evaluation based on a gold standard for a specific task, but a balanced common language corpus is not often used. We choose the Corpus of Contemporary American English Corpus (COCA) as a balanced reference corpus, and split this corpus into categories, such as topics and genres, to apply families of feature extraction and machine learning algorithms. We found that the Stanford CoreNLP method was faster and more accurate than the NLTK method, and was more reliable and easier to understand. The results of clustering show that a higher modularity influences interpretation. For genre and topic classification, all techniques achieved a relatively high score, though these were below the state-of-the-art scores from challenge text datasets. Naïve Bayes outperformed the other alternatives. We hope that balanced corpora from a variety of different vernacular (or low-resource) languages can be used as references to determine the efficiency of the wide diversity of state-of-the-art text mining tools.","PeriodicalId":125388,"journal":{"name":"Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence","volume":"24 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3446132.3446192","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

Abstract

To compare various techniques, the same platform is generally used into which the user will import a text dataset. Another approach uses an evaluation based on a gold standard for a specific task, but a balanced common language corpus is not often used. We choose the Corpus of Contemporary American English Corpus (COCA) as a balanced reference corpus, and split this corpus into categories, such as topics and genres, to apply families of feature extraction and machine learning algorithms. We found that the Stanford CoreNLP method was faster and more accurate than the NLTK method, and was more reliable and easier to understand. The results of clustering show that a higher modularity influences interpretation. For genre and topic classification, all techniques achieved a relatively high score, though these were below the state-of-the-art scores from challenge text datasets. Naïve Bayes outperformed the other alternatives. We hope that balanced corpora from a variety of different vernacular (or low-resource) languages can be used as references to determine the efficiency of the wide diversity of state-of-the-art text mining tools.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

使用多种文本挖掘工具探索平衡参考语料库

为了比较各种技术，通常使用相同的平台，用户将导入文本数据集。另一种方法使用基于特定任务的黄金标准的评估，但不经常使用平衡的公共语言语料库。我们选择当代美国英语语料库(COCA)作为平衡的参考语料库，并将该语料库划分为主题和类型等类别，以应用特征提取和机器学习算法家族。我们发现Stanford CoreNLP方法比NLTK方法更快、更准确，并且更可靠、更容易理解。聚类结果表明，较高的模块化影响解释。对于体裁和主题分类，所有技术都获得了相对较高的分数，尽管这些分数低于挑战文本数据集的最先进分数。Naïve贝叶斯优于其他选择。我们希望来自各种不同方言(或低资源)语言的平衡语料库可以作为参考，以确定各种最先进的文本挖掘工具的效率。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence

自引率

0.00%

发文量