Language Model for Statistics Domain
Young-Seob Jeong, Eunjin Kim, JunHa Hwang, M. E. Mswahili, Youngjin Kim
2022 5th International Conference on Artificial Intelligence for Industries (AI4I), September 2022
DOI: 10.1109/AI4I54798.2022.00020
Abstract
Since the transformer appeared, many studies have proposed variants of representative language models (e.g., Bidirectional Encoder Representations from Transformers (BERT) [1] and the Generative Pre-Training (GPT) series [2]). Huge language models have been appearing recently (e.g., Chinchilla [3], Megatron-LM), while other studies develop domain-specific (or language-specific) language models, such as BioBERT for bioinformatics [4], SwahBERT for the Swahili language [5], and FinBERT for the financial domain [6]. Without doubt, statistics is a domain with an abundance of collected data (e.g., statistical reports). A language model pre-trained on the statistics domain will probably deliver substantial performance improvements on downstream tasks such as industry code classification and job code classification, and a more accurate system for these code classification tasks will contribute to better national statistics and taxation. Indeed, many countries are trying to develop such systems, and this paper summarizes some relevant findings and provides suggestions for developing language models for the statistics domain.
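To make the envisioned downstream use concrete, below is a minimal sketch of fine-tuning a pretrained BERT-style encoder for industry code classification with the Hugging Face transformers library. This is not the authors' implementation: the checkpoint name, the number of industry codes, and the example text and label are illustrative assumptions; a statistics-domain checkpoint of the kind the paper argues for would be substituted once available.

```python
# Hedged sketch: fine-tuning a generic pretrained encoder for industry
# code classification. All names and sizes below are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-multilingual-cased"  # placeholder; swap in a statistics-domain checkpoint
NUM_INDUSTRY_CODES = 232                     # hypothetical size of the industry code set

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_INDUSTRY_CODES
)

# A business description to be mapped to an industry code.
texts = ["Wholesale of computer peripheral equipment and software"]
labels = torch.tensor([17])  # hypothetical gold industry code index

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)  # returns loss and per-code logits
outputs.loss.backward()                  # one gradient step inside a training loop
predicted_code = outputs.logits.argmax(dim=-1)
```

The classification head here is randomly initialized and learned during fine-tuning; the expected gain from a statistics-domain model lies in the encoder weights, which would already reflect the vocabulary and phrasing of statistical reports.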