Arabic text classification methods: Systematic literature review of primary studies

2016 4th IEEE International Colloquium on Information Science and Technology (CiSt). Pub Date: 2016-10-01. DOI: 10.1109/CIST.2016.7805072

Waleed Alabbas, Haider M. Al-Khateeb, Ali Mansour


Citations: 20

Abstract


Arabic text classification methods: Systematic literature review of primary studies

Recent research on Big Data has proposed and evaluated a number of advanced techniques for extracting meaningful information from the large, complex volume of data available on the World Wide Web. Accurate text analysis usually begins with a Text Classification (TC) method. A review of the recent literature in this area shows that most studies focus on English (and other scripts), while attempts to classify Arabic text remain relatively limited. Hence, we intend to contribute the first Systematic Literature Review (SLR) that strictly follows a search protocol to summarize the key characteristics of the different TC techniques and methods used to classify Arabic text. This work also aims to identify and share scientific evidence of the gaps in the current literature, to help suggest areas for further research. Our SLR explicitly uses empirical evidence as a decision factor for including studies, and then concludes which classifier produced more accurate results. Further, our findings identify a lack of standardized corpora for Arabic text: authors compile their own, and most of the work focuses on Modern Arabic, with very little done on Colloquial Arabic despite its wide use on social media networks such as Twitter. In total, 1464 papers were surveyed, from which 48 primary studies were included and analyzed.
