Arabic text classification based on analogical proportions

Expert Systems · IF 3.0 · Zone 4 (Computer Science) · JCR Q2 (Computer Science, Artificial Intelligence) · Pub Date: 2024-06-17 · DOI: 10.1111/exsy.13609

Myriam Bounhas, Bilel Elayeb, Amina Chouigui, Amir Hussain, Erik Cambria


Citations: 0

Abstract




Text classification is the process of labelling a given set of text documents with predefined classes or categories. Existing Arabic text classifiers either apply classic machine learning algorithms such as k-NN and SVM or use modern deep learning techniques. The former are assessed on small text collections and their accuracy still leaves room for improvement, while the latter are efficient at classifying big data collections but show limited effectiveness on small corpora with a large number of categories. This paper proposes a new approach to Arabic text classification that handles both small and large data collections while improving on the classification rates of existing classifiers. We first demonstrate that analogical proportions (AP) (statements of the form ‘$x$ is to $y$ as $z$ is to $t$’), which have recently been shown to be effective in classifying ‘structured’ data, can also classify ‘unstructured’ text documents requiring preprocessing. We design an analogical model to express the relationship between text documents and their real categories. Next, based on this principle, we develop two new analogical Arabic text classifiers. These rely on the idea that the category of a new document can be predicted from the categories of three other documents in the training set, provided the four documents together form a ‘valid’ analogical proportion on all, or on a large number, of the components extracted from each of them. The two proposed classifiers (denoted AATC1 and AATC2) differ mainly in the keywords extracted for classification. To evaluate the proposed classifiers, we perform an extensive experimental study using five benchmark Arabic text collections of small or large size, namely ANT (Arabic News Texts) v2.1 and v1.1, BBC-Arabic, CNN-Arabic and AlKhaleej-2004. We also compare the analogical classifiers with both classical ML-based and deep learning-based classifiers. On the ANT corpus v2.1, AATC2 achieves the best average accuracy (78.78%) of all classifiers and the best average precision (0.77), followed by AATC1 (0.73), NB (0.73) and SVM (0.72). In addition, AATC1 achieves the best average precision on the BBC-Arabic corpus (0.88) and on AlKhaleej-2004 (0.92), and the best average accuracy (85.64%) on CNN-Arabic. These results demonstrate the utility of analogical proportions for text classification. In particular, the proposed analogical classifiers are shown to significantly outperform a number of existing Arabic classifiers and, in many cases, compare favourably to the robust SVM classifier.
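The abstract states the classification principle only at a high level, so the following is a minimal sketch rather than the authors' AATC1 or AATC2. It assumes Boolean keyword-presence features, the usual Boolean reading of an analogical proportion (a differs from b exactly as c differs from d), and a hypothetical `min_ratio` threshold standing in for "all or a large number of components". All function names and the toy data are illustrative.

```python
from collections import Counter
from itertools import permutations


def ap_holds(a, b, c, d):
    """Boolean analogical proportion a : b :: c : d.

    Valid exactly when a differs from b in the same way c differs from d.
    """
    a, b, c, d = bool(a), bool(b), bool(c), bool(d)
    return (a and not b) == (c and not d) and (b and not a) == (d and not c)


def solve_class(ca, cb, cc):
    """Solve ca : cb :: cc : x over class labels; None if no solution exists."""
    if ca == cb:
        return cc
    if ca == cc:
        return cb
    return None


def classify(train, new_doc, min_ratio=1.0):
    """Vote over ordered triples of training documents.

    A triple votes for the solved label when the proportion holds on at
    least `min_ratio` of the components (assumed threshold, not from the source).
    """
    votes = Counter()
    for (a, ca), (b, cb), (c, cc) in permutations(train, 3):
        label = solve_class(ca, cb, cc)
        if label is None:
            continue
        valid = sum(ap_holds(x, y, z, t) for x, y, z, t in zip(a, b, c, new_doc))
        if valid / len(new_doc) >= min_ratio:
            votes[label] += 1
    return votes.most_common(1)[0][0] if votes else None


# Toy data: each vector marks presence (1) or absence (0) of four hypothetical keywords.
train = [
    ((1, 0, 1, 0), "sport"),
    ((1, 1, 1, 0), "sport"),
    ((0, 0, 0, 1), "politics"),
]
new_doc = (0, 1, 0, 1)
print(classify(train, new_doc))
```

Enumerating every ordered triple costs cubic time in the size of the training set, so a practical classifier would need to restrict the candidate triples in some way; the sketch ignores this for clarity.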


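Source journal: Expert Systems (Engineering and Technology, Computer Science: Theory and Methods)

CiteScore: 7.40

Self-citation rate: 6.10%

Articles published: 266

Review time: 24 months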

Journal description: Expert Systems: The Journal of Knowledge Engineering publishes papers dealing with all aspects of knowledge engineering, including individual methods and techniques in knowledge acquisition and representation, and their application in the construction of systems – including expert systems – based thereon. Detailed scientific evaluation is an essential part of any paper. As well as traditional application areas, such as Software and Requirements Engineering, Human-Computer Interaction, and Artificial Intelligence, we are aiming at the new and growing markets for these technologies, such as Business, Economy, Market Research, and Medical and Health Care. The shift towards this new focus will be marked by a series of special issues covering hot and emergent topics.