Mary Joy P. Canon, Christian Y. Sy, T. Palaoag, R. Roxas, Lany L. Maceda
{"title":"Language Resource Construction of Multi-Domain Philippine English Text for Pre-training Objective","authors":"Mary Joy P. Canon, Christian Y. Sy, T. Palaoag, R. Roxas, Lany L. Maceda","doi":"10.1109/ICACSIS56558.2022.9923429","DOIUrl":null,"url":null,"abstract":"Pre-trained language models (PLMs) have gained significant attention in NLP because of its effectiveness in improving the performance of several downstream tasks. Pre-training these PLMs requires benchmark datasets to create universal language representation and to generate robust models. This paper established the first linguistic resource for Philippine English language to help future researchers in language modeling and other NLP tasks. We used NLP approach to prepare and build our data and transformers paradigm to generate small PLMs. The PHEnText corpus is composed of multi-domain Philippine English text data in formal language scraped from different sources. Tokenization process was performed using BPE and WordPiece tokenizer algorithms. Using a subset of the PHEnText, we generated four small versions of transformer-based language models. Cross-validation during the pre-training reported that a RoBERTa-base model outperformed all other variants in terms of training loss, evaluation loss and accuracy. This work introduced the PHEnText benchmark corpus composed of 2.6B tokens primarily intended for pre-training objective. The corpus provides starting point and opportunities for current and future NLP researches and once trained, can be used more efficiently via fine-tuning. Additionally, the dataset was prepared to be pre-training compatible with different transformer models. Furthermore, the generated PLMs using a subset of PHEnText rendered notable results in terms of minimal loss and nearly acceptable accuracy. Next step for this undertaking is to train PLMs using the entire PHEnText dataset and to test the models' effectiveness by fine-tuning them to NLP downstream tasks.","PeriodicalId":165728,"journal":{"name":"2022 International Conference on Advanced Computer Science and Information Systems (ICACSIS)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 International Conference on Advanced Computer Science and Information Systems (ICACSIS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICACSIS56558.2022.9923429","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Pre-trained language models (PLMs) have gained significant attention in NLP because of their effectiveness in improving the performance of many downstream tasks. Pre-training these PLMs requires benchmark datasets to create universal language representations and to generate robust models. This paper establishes the first linguistic resource for Philippine English to help future researchers in language modeling and other NLP tasks. We used an NLP approach to prepare and build the data and the transformer paradigm to generate small PLMs. The PHEnText corpus is composed of multi-domain Philippine English text in formal language scraped from different sources. Tokenization was performed using the BPE and WordPiece tokenizer algorithms. Using a subset of the PHEnText, we generated four small versions of transformer-based language models. Cross-validation during pre-training showed that a RoBERTa-base model outperformed all other variants in terms of training loss, evaluation loss, and accuracy. This work introduces the PHEnText benchmark corpus, composed of 2.6B tokens and primarily intended for the pre-training objective. The corpus provides a starting point and opportunities for current and future NLP research and, once models are trained on it, they can be reused efficiently via fine-tuning. Additionally, the dataset was prepared to be compatible with pre-training different transformer models. Furthermore, the PLMs generated from a subset of PHEnText rendered notable results in terms of minimal loss and nearly acceptable accuracy. The next step for this undertaking is to train PLMs on the entire PHEnText dataset and to test the models' effectiveness by fine-tuning them on NLP downstream tasks.
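The paper does not publish its training code, so the following is only a minimal sketch of the kind of pipeline the abstract describes: training a BPE tokenizer on a text corpus and then pre-training a small RoBERTa model with a masked-language-modeling objective. It assumes the Hugging Face `tokenizers`, `datasets`, and `transformers` libraries; the file name `phentext_subset.txt`, the vocabulary size, and all hyperparameters are illustrative assumptions, not the authors' actual settings.

```python
import os

from tokenizers import ByteLevelBPETokenizer
from datasets import load_dataset
from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# --- Step 1: train a byte-level BPE tokenizer on the (hypothetical) corpus file ---
CORPUS_FILE = "phentext_subset.txt"   # assumed stand-in for a PHEnText subset
TOKENIZER_DIR = "phentext-tokenizer"

os.makedirs(TOKENIZER_DIR, exist_ok=True)
bpe_tokenizer = ByteLevelBPETokenizer()
bpe_tokenizer.train(
    files=[CORPUS_FILE],
    vocab_size=52_000,                # assumed; the paper does not report vocab size here
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
bpe_tokenizer.save_model(TOKENIZER_DIR)  # writes vocab.json and merges.txt

# --- Step 2: configure a small RoBERTa and pre-train it with masked language modeling ---
tokenizer = RobertaTokenizerFast.from_pretrained(TOKENIZER_DIR, model_max_length=512)

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_hidden_layers=6,              # "small" model: fewer layers than RoBERTa-base
    num_attention_heads=12,
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config=config)

dataset = load_dataset("text", data_files={"train": CORPUS_FILE})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking: 15% of tokens are masked for the MLM objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="phentext-roberta-small",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    save_steps=10_000,
)

trainer = Trainer(model=model, args=args, data_collator=collator, train_dataset=tokenized)
trainer.train()
trainer.save_model("phentext-roberta-small")
```

A WordPiece variant, as also mentioned in the abstract, could be trained analogously by swapping `ByteLevelBPETokenizer` for `tokenizers.BertWordPieceTokenizer` and pairing it with a BERT-style configuration; the resulting checkpoints would then be fine-tuned on downstream tasks rather than used directly.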