An application of textual document classification for Arabic governmental correspondence

Kuwait Journal of Science (IF 1.2; Zone 4, comprehensive journals; JCR Q3, Multidisciplinary Sciences). Pub Date: 2024-07-22. DOI: 10.1016/j.kjs.2024.100299

Khaled Alzamel, Manayer Alajmi

Volume/issue: 52 1; pages: Article 100299; publication type: Journal Article.

Citations: 0

Abstract

Demand for automated classification of Arabic documents is growing, especially as the volume of linguistic data keeps increasing. Natural language processing (NLP) has become one of the most significant fields in artificial intelligence (AI), thanks to recent advances in transformer-based models. Transformers make models reusable through pre-trained models (PTMs). This study fine-tunes monolingual (AraBERT (Antoun et al., 2020)), bilingual (GigaBERT (Lan et al., 2020)), and multilingual (XLM-RoBERTa (Conneau et al., 2020)) transformer-based encoder models to classify official Arabic correspondence into pre-defined classes, and compares their predictive performance in terms of accuracy on a new balanced dataset. The dataset contains 22,741 Arabic texts in six categories, each labeled with the name of one of the most common ministries. The results show that GigaBERT achieved the highest accuracy, 98%. The implemented models may contribute to the domain of information systems (ISs) by facilitating the classification process in ministries without human intervention.
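The comparison described in the abstract rests on two simple quantities: how balanced the labeled set is, and each model's accuracy against the gold labels. The following is a minimal sketch of both in plain Python, not the authors' evaluation code. The label names, the model names `model_x` and `model_y`, and the toy predictions are all hypothetical placeholders, since the abstract does not list the ministries or publish per-document outputs.

```python
from collections import Counter

# Hypothetical label names: the abstract says only that the six classes are
# named after the most common ministries, so these are placeholders.
LABELS = ["ministry_a", "ministry_b", "ministry_c",
          "ministry_d", "ministry_e", "ministry_f"]


def class_balance(labels):
    """Ratio of the smallest class count to the largest (1.0 = perfectly balanced)."""
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values())


def accuracy(gold, predicted):
    """Fraction of documents whose predicted class equals the gold class."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def rank_models(gold, predictions_by_model):
    """Return (model_name, accuracy) pairs sorted from most to least accurate."""
    scores = {name: accuracy(gold, preds)
              for name, preds in predictions_by_model.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy gold labels: two documents per class, so the set is balanced.
    gold = [label for label in LABELS for _ in range(2)]
    # Made-up predictions for illustration only; these are not paper results.
    toy_predictions = {
        "model_x": list(gold),                 # every document correct
        "model_y": gold[:-1] + [LABELS[0]],    # last document misclassified
    }
    print("balance ratio:", class_balance(gold))
    for name, score in rank_models(gold, toy_predictions):
        print(f"{name}: {score:.3f}")
```

Accuracy is a reasonable headline metric here because the dataset is described as balanced; on a skewed label distribution, per-class precision and recall would be needed alongside it.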

Source journal

Kuwait Journal of Science (Multidisciplinary Sciences)

CiteScore: 1.60

Self-citation rate: 28.60%

Articles published: 132

About the journal: Kuwait Journal of Science (KJS) is indexed and abstracted by major services such as Chemical Abstracts, Science Citation Index, Current Contents, Mathematics Abstracts, and Microbiological Abstracts. KJS publishes peer-reviewed articles in various fields of science, including Mathematics, Computer Science, Physics, Statistics, Biology, Chemistry, and Earth & Environmental Sciences. It also aims to bring the results of scientific research carried out under a variety of intellectual traditions and organizations to the attention of a specialized scholarly readership. The publisher therefore expects original manuscripts that offer analyses of, and solutions to, important theoretical, empirical, and normative issues.