Case Study on Data Collection of Kreol Morisien, a Low-Resourced Creole Language
David Joshen Bastien, Vijay Prakash Chumroo, Johan Patrice Bastien
2022 IST-Africa Conference (IST-Africa), 2022-05-16
DOI: 10.23919/IST-Africa56635.2022.9845658
Abstract
This case study lays the foundations for the development of Kreol Morisien NLP (KreMoN), a series of Natural Language Processing tools for processing Mauritian Creole. While most work so far has focused on detailing Machine Learning algorithms, this work focuses on the first step needed for any low-resourced language: the collection of data. We present a process currently being used to collect audio and textual data for a low-resourced language such as Mauritian Creole. This data will be used to develop a speech-to-text system as well as an Information Extractor for Mauritian Creole. As part of the case study, we detail some of the work done using existing textual data in non-standardized Mauritian Creole, for which an NLP pre-processing pipeline adapted to low-resourced languages has been developed.