Multilingual BERT Cross-Lingual Transferability with Pre-trained Representations on Tangut: A Survey
Xiaoming Lu, Wenjian Liu, Shengyi Jiang, Changqing Liu
DOI: 10.1109/ICNLP58431.2023.00048 (https://doi.org/10.1109/ICNLP58431.2023.00048)
Icon, vol. 14, no. 1, pp. 229-234, March 2023
Abstract
Natural Language Processing (NLP) systems comprise three main components: tokenization, embedding, and the model architecture (e.g., leading deep learning models such as BERT, GPT-2, or GPT-3). In this paper, we explore and summarize possible ways of fine-tuning the Multilingual BERT (mBERT) model and feeding it effective encodings of Tangut characters. Tangut is an extinct, low-resource language. We propose to introduce an embedding layer tailored to Tangut as part of the fine-tuning procedure, without altering mBERT's internal structure. Our initial work toward this goal is outlined here. By reviewing existing state-of-the-art (SOTA) approaches, we aim to further analyze the performance gains mBERT offers when applied to low-resource languages.
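To make the adaptation recipe concrete, the sketch below shows one common way to add new characters to mBERT without touching its Transformer layers: register the characters as tokens and resize the embedding matrix so they receive trainable vectors. This is a minimal illustration under stated assumptions (Hugging Face Transformers as the toolkit, the Unicode Tangut block U+17000 onward as the character source); it is not the authors' exact procedure.

```python
# Hypothetical sketch: extend mBERT's vocabulary with Tangut characters,
# leaving the pre-trained Transformer layers untouched.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Illustrative sample: the first few code points of the Unicode Tangut block
# (U+17000 onward); a real system would register the full character inventory.
tangut_chars = [chr(cp) for cp in range(0x17000, 0x17010)]
num_added = tokenizer.add_tokens(tangut_chars)

# Grow the input (and tied output) embedding matrix to cover the new tokens.
# The added rows are randomly initialized and learned during fine-tuning,
# which is what "a tailored embedding layer" amounts to in this recipe.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} Tangut tokens; vocabulary size is now {len(tokenizer)}")
```

After this step, the model can be fine-tuned on Tangut text with a standard masked-language-modeling objective, so only the new embedding rows (and optionally the existing weights) are updated.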