Correction of whitespace and word segmentation in noisy Pashto text using CRF
Authors: Ijazul Haq, Weidong Qiu, Jie Guo, Peng Tang
Journal: Speech Communication, volume 153, Article 102970 (JCR Q2, Acoustics; impact factor 2.4; category: Computer Science)
Published: 2023-09-01
DOI: 10.1016/j.specom.2023.102970
URL: https://www.sciencedirect.com/science/article/pii/S0167639323001048
Citations: 1
Abstract
Word segmentation is the process of splitting text into words. In English and most European languages, word boundaries are marked by whitespace, whereas Pashto has no reliable explicit word delimiter: it uses whitespace to separate words, but not consistently, so whitespace cannot be treated as a dependable word-boundary marker. This inconsistency makes Pashto word segmentation unique and challenging. Moreover, Pashto is a low-resource, non-standardized language with no established rules for whitespace usage, which leads to two typical spelling errors: space omission and space insertion. These errors significantly degrade the performance of a word segmenter. This study aims to develop a state-of-the-art word segmenter for Pashto, together with a proofing tool that identifies and corrects the position of spaces in noisy text. Conditional random fields (CRF) are used to train two machine learning models for these tasks. To train the models, we developed a text corpus of nearly 3.5 million words, annotated with the correct positions of spaces and explicit word-boundary information using a lexicon-based technique and then manually checked for errors. In the experiments, the proofing model and the word segmenter achieve F1-scores of 99.2% and 96.7%, respectively.
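The abstract does not give the labeling scheme or the feature set used for the CRF models, so the following is only a minimal sketch of one common way to frame the task: treat the text as a character sequence, drop the (unreliable) whitespace, and label each character as beginning a word ("B") or continuing one ("I"). The function names, the B/I tag set, and the two-character context window are all assumptions made for illustration, not the authors' design.

```python
from typing import Dict, List, Tuple


def to_char_labels(words: List[str]) -> Tuple[List[str], List[str]]:
    """Turn a correctly segmented sentence into (characters, labels).

    "B" marks the first character of a word and "I" any later character.
    Whitespace is dropped, so the labels alone carry the word boundaries.
    (Assumed B/I scheme; the paper's actual tag set is not stated.)
    """
    chars: List[str] = []
    labels: List[str] = []
    for word in words:
        for i, ch in enumerate(word):
            chars.append(ch)
            labels.append("B" if i == 0 else "I")
    return chars, labels


def char_features(chars: List[str], i: int, window: int = 2) -> Dict[str, str]:
    """Context-window features for position i (hypothetical feature set)."""
    feats = {"bias": "1", "c0": chars[i]}
    for k in range(1, window + 1):
        # Neighbouring characters, padded at the sequence edges.
        feats[f"c-{k}"] = chars[i - k] if i - k >= 0 else "<BOS>"
        feats[f"c+{k}"] = chars[i + k] if i + k < len(chars) else "<EOS>"
    return feats


def labels_to_words(chars: List[str], labels: List[str]) -> List[str]:
    """Rebuild words from characters and (predicted) B/I labels."""
    words: List[str] = []
    for ch, label in zip(chars, labels):
        if label == "B" or not words:
            words.append(ch)      # start a new word
        else:
            words[-1] += ch       # extend the current word
    return words


if __name__ == "__main__":
    # A short Pashto-script sample, already split into words.
    sample = ["زه", "ښوونځي", "ته", "ځم"]
    chars, labels = to_char_labels(sample)
    features = [char_features(chars, i) for i in range(len(chars))]
    print(list(zip(chars, labels)))
    print(" ".join(labels_to_words(chars, labels)))
```

Per-sentence lists of such feature dictionaries and label sequences are the kind of input a linear-chain CRF toolkit typically expects. A space-correction model could be framed in the same way, with labels that say whether a space should follow each character.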
About the journal:
Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results.
The journal's primary objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.