An application of textual document classification for Arabic governmental correspondence

IF 1.2 · Zone 4 · Multidisciplinary journal · Q3 (MULTIDISCIPLINARY SCIENCES) · Kuwait Journal of Science · Pub Date: 2024-07-22 · DOI: 10.1016/j.kjs.2024.100299
Citations: 0

Abstract

An application of textual document classification for Arabic governmental correspondence

Demand for automated classification of Arabic documents is growing, especially as the volume of linguistic data keeps increasing. Natural language processing (NLP) has become one of the most significant fields in artificial intelligence (AI), thanks to recent advances in transformer-based models. Transformers promote model reuse through pre-trained models (PTMs). This study fine-tunes monolingual (AraBERT (Antoun et al., 2020)), bilingual (GigaBERT (Lan et al., 2020)), and multilingual (XLM-RoBERTa (Conneau et al., 2020)) transformer-based encoder models to classify official Arabic correspondence into pre-defined classes, and compares their predictive performance in terms of accuracy on a new balanced dataset. The dataset contains 22,741 Arabic texts in six categories, each labeled with the name of one of the most common ministries. GigaBERT achieved the highest accuracy, 98%. The implemented models may contribute to the domain of information systems (ISs) by facilitating classification in ministries without human intervention.
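The evaluation described in the abstract (scoring several classifiers by accuracy on a class-balanced, six-category dataset) can be illustrated with a minimal sketch. Everything below is a hypothetical toy setup: the class names, the model names `model_x`, `model_y`, `model_z`, the helper `is_balanced` and its tolerance are assumptions made for illustration, and none of the numbers are results from the paper.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def is_balanced(labels, tolerance=0.05):
    """True if every class count is within `tolerance` (relative) of the mean count.

    The 5% tolerance is an arbitrary illustrative choice.
    """
    counts = Counter(labels)
    mean = len(labels) / len(counts)
    return all(abs(c - mean) <= tolerance * mean for c in counts.values())


# Hypothetical toy data: six placeholder classes, two examples per class.
CLASSES = ["ministry_a", "ministry_b", "ministry_c",
           "ministry_d", "ministry_e", "ministry_f"]
gold = [c for c in CLASSES for _ in range(2)]

# Hypothetical predictions from three placeholder models.
predictions = {
    "model_x": list(gold),                                # no errors
    "model_y": gold[:-1] + [CLASSES[0]],                  # one error
    "model_z": [CLASSES[1]] + gold[1:-1] + [CLASSES[0]],  # two errors
}

# Score every model on the same evaluation set and pick the most accurate one.
scores = {name: accuracy(gold, preds) for name, preds in predictions.items()}
best = max(scores, key=scores.get)

print("balanced:", is_balanced(gold))
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("best:", best)
```

Accuracy is a reasonable headline metric here only because the classes are balanced; on a skewed dataset, per-class measures would usually be reported alongside it.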

Source journal
Kuwait Journal of Science (MULTIDISCIPLINARY SCIENCES)
CiteScore: 1.60
Self-citation rate: 28.60%
Articles published: 132
Journal description: Kuwait Journal of Science (KJS) is indexed and abstracted by major services such as Chemical Abstract, Science Citation Index, Current Contents, Mathematics Abstract, Microbiological Abstracts, etc. KJS publishes peer-reviewed articles in various fields of science, including Mathematics, Computer Science, Physics, Statistics, Biology, Chemistry, and Earth & Environmental Sciences. It also aims to bring the results of scientific research carried out under a variety of intellectual traditions and organizations to the attention of a specialized scholarly readership. The publisher therefore expects original manuscripts that offer analysis of, and solutions to, important theoretical, empirical, and normative issues.