Jiyong An, Jiyun Kim, Leonard Sunwoo, Hyunyoung Baek, Sooyoung Yoo, Seunggeun Lee
De-identification of clinical notes with pseudo-labeling using regular expression rules and pre-trained BERT.
BMC Medical Informatics and Decision Making, 25(1):82. Published 2025-02-17. DOI: 10.1186/s12911-025-02913-z (https://doi.org/10.1186/s12911-025-02913-z)
Abstract
Background: De-identification of clinical notes is essential for using the rich information in unstructured text data in medical research. However, only limited work has been done on removing personal information from clinical notes in Korea.
Methods: Our study used a comprehensive dataset stored in the Note table of the OMOP Common Data Model at Seoul National University Bundang Hospital. This dataset includes 11,181,617 radiology reports and 9,282,477 notes from various other departments (non-radiology notes). From these, 0.1% of the reports (11,182) were randomly selected for training and validation. We used two de-identification strategies to improve performance when annotated data are limited. First, in a rule-based approach, we constructed regular expressions from the 1,112 notes annotated by domain experts. Second, using these regular expressions as a labeler, we applied a semi-supervised approach to fine-tune a pre-trained Korean BERT model on pseudo-labeled notes.
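The idea of using regular expressions as a labeler can be sketched as follows. This is a minimal illustration, not the authors' rule set: the two patterns, the `DATE` and `PHONE` label names, the whitespace tokenization, and the `pseudo_label` helper are all assumptions made for the example. It shows how regex matches might be turned into token-level BIO tags that a token-classification model could then be fine-tuned on.

```python
import re

# Hypothetical patterns for illustration only; the paper's actual rules are not given here.
PATTERNS = {
    "DATE": re.compile(r"\b\d{4}[-./]\d{1,2}[-./]\d{1,2}\b"),
    "PHONE": re.compile(r"\b01\d-\d{3,4}-\d{4}\b"),
}


def pseudo_label(text):
    """Return (token, BIO tag) pairs for whitespace tokens, tagged by regex matches."""
    # Collect character spans of every regex match together with its label.
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))

    labeled = []
    for tok in re.finditer(r"\S+", text):
        tag = "O"  # default: token is outside any identifier span
        for start, end, label in spans:
            if tok.start() < end and tok.end() > start:  # token overlaps the span
                # First token of a span gets B-, continuation tokens get I-.
                prefix = "B" if tok.start() <= start else "I"
                tag = f"{prefix}-{label}"
                break
        labeled.append((tok.group(), tag))
    return labeled


note = "CT chest on 2021-03-15, call 010-1234-5678 for results."
for token, tag in pseudo_label(note):
    print(token, tag)
```

In a pipeline of this kind, the tagged output for a large pool of unannotated notes would serve as training data for the model, so the quality of the pseudo-labels depends directly on the precision of the rules.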
Results: Validation was conducted using 342 radiology and 12 non-radiology notes labeled at the token level. Our rule-based approach achieved 97.2% precision, 93.7% recall, and 96.2% F1 score on the radiology notes. For the machine learning approach, KoBERT-NER fine-tuned on 32,000 automatically pseudo-labeled notes achieved 96.5% precision, 97.6% recall, and 97.1% F1 score.
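For readers unfamiliar with token-level scoring, the sketch below shows one common way precision, recall, and F1 can be computed from gold and predicted tags. The abstract does not state the exact scoring protocol (for example, how averaging across entity types was done), so the `token_prf` helper and its counting convention are assumptions for illustration, and the tags are toy data rather than results from the study.

```python
def token_prf(gold, pred, outside="O"):
    """Token-level precision, recall, and F1 over non-outside tags.

    Assumed convention: a predicted tag counts as a true positive only if it
    exactly equals the gold tag; a wrong non-outside prediction on an
    identifier token counts as both a false positive and a false negative.
    """
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if p != outside and p == g)
    fp = sum(1 for g, p in pairs if p != outside and p != g)
    fn = sum(1 for g, p in pairs if g != outside and p != g)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy example: one correct tag, one spurious tag, one missed tag.
gold = ["O", "B-DATE", "O", "B-PHONE", "O"]
pred = ["O", "B-DATE", "B-DATE", "O", "O"]
print(token_prf(gold, pred))
```

Under this convention, recall reflects how many identifier tokens were caught, which is usually the more critical quantity for de-identification because a missed token leaves personal information in the note.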
Conclusion: Our results show that combining a rule-based approach with machine learning in a semi-supervised way can improve de-identification performance.
About the journal:
BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.