Freely Available Arabic Corpora: A Scoping Review
Arfan Ahmed, Nashva Ali, Mahmood Alzubaidi, Wajdi Zaghouani, Alaa A Abd-alrazaq, Mowafa Househ
Computer Methods and Programs in Biomedicine Update, volume 2, Article 100049. Published 2022-01-01. DOI: 10.1016/j.cmpbup.2022.100049
https://www.sciencedirect.com/science/article/pii/S2666990022000015
Citations: 4
Abstract
Background
Corpora play a vital role in training machine learning (ML) models and building systems that use natural language processing (NLP). Researchers can find it challenging to access corpora in languages other than English, and even more so when the corpora are not available free of charge. Because the Quran, the core text of Islam, is written in Arabic, the language is used by more than 1.5 billion Muslims; it is also the native language of over 250 million people.
Objective
We aimed to highlight peer-reviewed literature reporting free and accessible Arabic corpora, giving researchers insight into what is freely available so that they can pursue their research goals more easily.
Methods
We conducted a scoping review following the PRISMA guidelines, searching the most common information technology (IT) databases to identify free and accessible Arabic corpora.
Results
We identified 48 accessible sources of Arabic corpora that are available free of charge. We present our findings by category, with direct links where available, to help readers understand the corpora. The corpora were classified by type into five categories based on their primary purpose.
Conclusion
Arabic is underrepresented among freely available corpora, most of which are in English. Although previous studies have searched for corpora, ours is the first of its kind: it follows the PRISMA guidelines and includes peer-reviewed articles obtained by searching the most common IT databases, as well as sources recommended by language experts.