Dotless Arabic text for Natural Language Processing

Computational Linguistics (IF 5.3; Zone 2, Computer Science) Pub Date: 2024-09-12 DOI: 10.1162/coli_a_00535

Maged S. Al-Shaibani, Irfan Ahmad



Abstract

This paper introduces a novel representation of Arabic text as an alternative approach for Arabic NLP, inspired by the dotless script of ancient Arabic. We explored this representation through extensive analysis of text corpora that differ in size and domain, tokenized using multiple tokenization techniques. Furthermore, we examined the information density of this representation and compared it with that of standard dotted Arabic text using text entropy analysis. Using parallel corpora, we also compared Arabic and English text to gain additional insights. Our investigation extended to upstream and downstream NLP tasks, including language modeling, text classification, sequence labeling, and machine translation, examining the implications of both representations. Specifically, we performed seven downstream tasks under various tokenization schemes, comparing standard dotted text with dotless Arabic text. Performance with the two representations was comparable across tokenizations. However, the dotless representation achieved these results with a significantly smaller vocabulary, in some scenarios up to 50% smaller. Additionally, we present a system that restores dots to dotless Arabic text, which is useful for tasks that require Arabic text as output.
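To make the idea concrete, the following is a minimal Python sketch of dot removal, vocabulary counting, and character-level entropy. It is not the authors' implementation: the letter mapping `DOTLESS_MAP`, the helper names `remove_dots`, `char_entropy`, and `vocab_size`, and the toy word list are all assumptions made for illustration. In particular, the mapping here ignores letter position, which a faithful dotless script may need to consider.

```python
import math
from collections import Counter

# Hypothetical, position-independent mapping from dotted letters to
# dotless base shapes (an assumption, not the paper's exact table).
DOTLESS_MAP = {
    "ب": "ٮ", "ت": "ٮ", "ث": "ٮ",
    "ج": "ح", "خ": "ح",
    "ذ": "د", "ز": "ر",
    "ش": "س", "ض": "ص", "ظ": "ط", "غ": "ع",
    "ف": "ڡ", "ق": "ٯ",
    "ن": "ں", "ي": "ى", "ة": "ه",
}
_TABLE = str.maketrans(DOTLESS_MAP)


def remove_dots(text: str) -> str:
    """Replace each dotted letter with its dotless shape; leave the rest."""
    return text.translate(_TABLE)


def char_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of the non-space characters."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def vocab_size(tokens) -> int:
    """Number of distinct word types in a token list."""
    return len(set(tokens))


# Toy word-level tokens; two of them differ only in their dots.
tokens = ["ثبت", "تبت", "بيت", "بنت"]
dotless_tokens = [remove_dots(t) for t in tokens]

print("vocab (dotted, dotless):", vocab_size(tokens), vocab_size(dotless_tokens))
print("entropy dotted :", char_entropy(" ".join(tokens)))
print("entropy dotless:", char_entropy(" ".join(dotless_tokens)))
```

Because several dotted letters collapse onto one shape, distinct words can merge into a single type, which shrinks the vocabulary. The character-level entropy cannot increase under such a many-to-one mapping. Real corpora and subword tokenizers would be needed to reproduce anything like the figures reported in the abstract.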


Source journal

Computational Linguistics Computer Science-Artificial Intelligence


About the journal: Computational Linguistics is the longest-running publication devoted exclusively to the computational and mathematical properties of language and the design and analysis of natural language processing systems. This highly regarded quarterly offers university and industry linguists, computational linguists, artificial intelligence and machine learning investigators, cognitive scientists, speech specialists, and philosophers the latest information about the computational aspects of all the facets of research on language.