Tae-Yeon Kim, Seong-Uk Baek, Myeong-Hun Lim, Byungyoon Yun, Domyung Paek, Kyung Ehi Zoh, Kanwoo Youn, Yun Keun Lee, Yangho Kim, Jungwon Kim, Eunsuk Choi, Mo-Yeol Kang, YoonHo Cho, Kyung-Eun Lee, Juho Sim, Juyeon Oh, Heejoo Park, Jian Lee, Jong-Uk Won, Yu-Min Lee, Jin-Ha Yoon
{"title":"Occupation classification model based on DistilKoBERT: using the 5th and 6th Korean Working Condition Surveys.","authors":"Tae-Yeon Kim, Seong-Uk Baek, Myeong-Hun Lim, Byungyoon Yun, Domyung Paek, Kyung Ehi Zoh, Kanwoo Youn, Yun Keun Lee, Yangho Kim, Jungwon Kim, Eunsuk Choi, Mo-Yeol Kang, YoonHo Cho, Kyung-Eun Lee, Juho Sim, Juyeon Oh, Heejoo Park, Jian Lee, Jong-Uk Won, Yu-Min Lee, Jin-Ha Yoon","doi":"10.35371/aoem.2024.36.e19","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Accurate occupation classification is essential in various fields, including policy development and epidemiological studies. This study aims to develop an occupation classification model based on DistilKoBERT.</p><p><strong>Methods: </strong>This study used data from the 5th and 6th Korean Working Conditions Surveys conducted in 2017 and 2020, respectively. A total of 99,665 survey participants, who were nationally representative of Korean workers, were included. We used natural language responses regarding their job responsibilities and occupational codes based on the Korean Standard Classification of Occupations (7th version, 3-digit codes). The dataset was randomly split into training and test datasets in a ratio of 7:3. The occupation classification model based on DistilKoBERT was fine-tuned using the training dataset, and the model was evaluated using the test dataset. The accuracy, precision, recall, and F1 score were calculated as evaluation metrics.</p><p><strong>Results: </strong>The final model, which classified 28,996 survey participants in the test dataset into 142 occupational codes, exhibited an accuracy of 84.44%. For the evaluation metrics, the precision, recall, and F1 score of the model, calculated by weighting based on the sample size, were 0.83, 0.84, and 0.83, respectively. The model demonstrated high precision in the classification of service and sales workers yet exhibited low precision in the classification of managers. In addition, it displayed high precision in classifying occupations prominently represented in the training dataset.</p><p><strong>Conclusions: </strong>This study developed an occupation classification system based on DistilKoBERT, which demonstrated reasonable performance. Despite further efforts to enhance the classification accuracy, this automated occupation classification model holds promise for advancing epidemiological studies in the fields of occupational safety and health.</p>","PeriodicalId":46631,"journal":{"name":"Annals of Occupational and Environmental Medicine","volume":"36 ","pages":"e19"},"PeriodicalIF":1.2000,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11345209/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Annals of Occupational and Environmental Medicine","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.35371/aoem.2024.36.e19","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/1/1 0:00:00","PubModel":"eCollection","JCR":"Q4","JCRName":"PUBLIC, ENVIRONMENTAL & OCCUPATIONAL HEALTH","Score":null,"Total":0}
引用次数: 0
Abstract
Background: Accurate occupation classification is essential in various fields, including policy development and epidemiological studies. This study aims to develop an occupation classification model based on DistilKoBERT.
Methods: This study used data from the 5th and 6th Korean Working Conditions Surveys conducted in 2017 and 2020, respectively. A total of 99,665 survey participants, who were nationally representative of Korean workers, were included. We used natural language responses regarding their job responsibilities and occupational codes based on the Korean Standard Classification of Occupations (7th version, 3-digit codes). The dataset was randomly split into training and test datasets in a ratio of 7:3. The occupation classification model based on DistilKoBERT was fine-tuned using the training dataset, and the model was evaluated using the test dataset. The accuracy, precision, recall, and F1 score were calculated as evaluation metrics.
Results: The final model, which classified 28,996 survey participants in the test dataset into 142 occupational codes, exhibited an accuracy of 84.44%. For the evaluation metrics, the precision, recall, and F1 score of the model, calculated by weighting based on the sample size, were 0.83, 0.84, and 0.83, respectively. The model demonstrated high precision in the classification of service and sales workers yet exhibited low precision in the classification of managers. In addition, it displayed high precision in classifying occupations prominently represented in the training dataset.
Conclusions: This study developed an occupation classification system based on DistilKoBERT, which demonstrated reasonable performance. Despite further efforts to enhance the classification accuracy, this automated occupation classification model holds promise for advancing epidemiological studies in the fields of occupational safety and health.
期刊介绍:
Annals of Occupational and Environmental Medicine (AOEM) is an open access journal that considers original contributions relevant to occupational and environmental medicine and related fields, in the form of original articles, review articles, short letters and case reports. AOEM is aimed at clinicians and researchers working in the wide-ranging discipline of occupational and environmental medicine. Topic areas focus on, but are not limited to, interactions between work and health, covering occupational and environmental epidemiology, toxicology, hygiene, diagnosis and treatment of diseases, management, organization and policy. As the official journal of the Korean Society of Occupational and Environmental Medicine (KSOEM), members and authors based in the Republic of Korea are entitled to a discounted article-processing charge when they publish in AOEM.