Synthetic Data-Driven Approaches for Chinese Medical Abstract Sentence Classification: Computational Study.

JMIR Formative Research · Q3 (Health Care Sciences & Services) · Impact Factor 2.0 · Pub Date: 2025-03-19 · DOI: 10.2196/54803
Jiajia Li, Zikai Wang, Longxuan Yu, Hui Liu, Haitao Song
{"title":"Synthetic Data-Driven Approaches for Chinese Medical Abstract Sentence Classification: Computational Study.","authors":"Jiajia Li, Zikai Wang, Longxuan Yu, Hui Liu, Haitao Song","doi":"10.2196/54803","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Medical abstract sentence classification is crucial for enhancing medical database searches, literature reviews, and generating new abstracts. However, Chinese medical abstract classification research is hindered by a lack of suitable datasets. Given the vastness of Chinese medical literature and the unique value of traditional Chinese medicine, precise classification of these abstracts is vital for advancing global medical research.</p><p><strong>Objective: </strong>This study aims to address the data scarcity issue by generating a large volume of labeled Chinese abstract sentences without manual annotation, thereby creating new training datasets. Additionally, we seek to develop more accurate text classification algorithms to improve the precision of Chinese medical abstract classification.</p><p><strong>Methods: </strong>We developed 3 training datasets (dataset #1, dataset #2, and dataset #3) and a test dataset to evaluate our model. Dataset #1 contains 15,000 abstract sentences translated from the PubMed dataset into Chinese. Datasets #2 and #3, each with 15,000 sentences, were generated using GPT-3.5 from 40,000 Chinese medical abstracts in the CSL database. Dataset #2 used titles and keywords for pseudolabeling, while dataset #3 aligned abstracts with category labels. The test dataset includes 87,000 sentences from 20,000 abstracts. We used SBERT embeddings for deeper semantic analysis and evaluated our model using clustering (SBERT-DocSCAN) and supervised methods (SBERT-MEC). 
Extensive ablation studies and feature analyses were conducted to validate the model's effectiveness and robustness.</p><p><strong>Results: </strong>Our experiments involved training both clustering and supervised models on the 3 datasets, followed by comprehensive evaluation using the test dataset. The outcomes demonstrated that our models outperformed the baseline metrics. Specifically, when trained on dataset #1, the SBERT-DocSCAN model registered an impressive accuracy and F1-score of 89.85% on the test dataset. Concurrently, the SBERT-MEC algorithm exhibited comparable performance with an accuracy of 89.38% and an identical F1-score. Training on dataset #2 yielded similarly positive results for the SBERT-DocSCAN model, achieving an accuracy and F1-score of 89.83%, while the SBERT-MEC algorithm recorded an accuracy of 86.73% and an F1-score of 86.51%. Notably, training with dataset #3 allowed the SBERT-DocSCAN model to attain the best with an accuracy and F1-score of 91.30%, whereas the SBERT-MEC algorithm also showed robust performance, obtaining an accuracy of 90.39% and an F1-score of 90.35%. Ablation analysis highlighted the critical role of integrated features and methodologies in improving classification efficiency.</p><p><strong>Conclusions: </strong>Our approach addresses the challenge of limited datasets for Chinese medical abstract classification by generating novel datasets. 
The deployment of SBERT-DocSCAN and SBERT-MEC models significantly enhances the precision of classifying Chinese medical abstracts, even when using synthetic datasets with pseudolabels.</p>","PeriodicalId":14841,"journal":{"name":"JMIR Formative Research","volume":"9 ","pages":"e54803"},"PeriodicalIF":2.0000,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11939029/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR Formative Research","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2196/54803","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
引用次数: 0

Abstract

Background: Medical abstract sentence classification is crucial for enhancing medical database searches, literature reviews, and generating new abstracts. However, Chinese medical abstract classification research is hindered by a lack of suitable datasets. Given the vastness of Chinese medical literature and the unique value of traditional Chinese medicine, precise classification of these abstracts is vital for advancing global medical research.

Objective: This study aims to address the data scarcity issue by generating a large volume of labeled Chinese abstract sentences without manual annotation, thereby creating new training datasets. Additionally, we seek to develop more accurate text classification algorithms to improve the precision of Chinese medical abstract classification.

Methods: We developed 3 training datasets (dataset #1, dataset #2, and dataset #3) and a test dataset to evaluate our model. Dataset #1 contains 15,000 abstract sentences translated from the PubMed dataset into Chinese. Datasets #2 and #3, each with 15,000 sentences, were generated using GPT-3.5 from 40,000 Chinese medical abstracts in the CSL database. Dataset #2 used titles and keywords for pseudolabeling, while dataset #3 aligned abstracts with category labels. The test dataset includes 87,000 sentences from 20,000 abstracts. We used SBERT embeddings for deeper semantic analysis and evaluated our model using clustering (SBERT-DocSCAN) and supervised methods (SBERT-MEC). Extensive ablation studies and feature analyses were conducted to validate the model's effectiveness and robustness.
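The pseudolabeling step behind dataset #2 — deriving a sentence's label from its abstract's title and keywords instead of manual annotation — can be illustrated with a minimal sketch. The category names and keyword lists below are hypothetical, and the study itself used GPT-3.5 rather than simple keyword matching:

```python
# Minimal illustration of pseudolabeling: tag every sentence of an abstract
# with the category whose (hypothetical) keyword list best overlaps the
# abstract's title words and keywords. The real pipeline used GPT-3.5.

CATEGORY_KEYWORDS = {
    "background": {"review", "overview", "importance"},
    "methods": {"randomized", "cohort", "protocol", "model"},
    "results": {"accuracy", "efficacy", "outcome"},
}

def pseudolabel(title_and_keywords: set) -> str:
    """Pick the category with the largest keyword overlap."""
    scores = {
        cat: len(title_and_keywords & kws)
        for cat, kws in CATEGORY_KEYWORDS.items()
    }
    return max(scores, key=scores.get)

def label_abstract(sentences, title_and_keywords):
    """Assign one pseudolabel, derived from title/keywords, to each sentence."""
    label = pseudolabel(set(title_and_keywords))
    return [(sentence, label) for sentence in sentences]
```

The payoff of this kind of scheme is scale: labels come for free from metadata already attached to each abstract, at the cost of label noise that the downstream classifier must tolerate.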

Results: Our experiments involved training both clustering and supervised models on the 3 datasets, followed by comprehensive evaluation on the test dataset. The outcomes demonstrated that our models outperformed the baseline metrics. Specifically, when trained on dataset #1, the SBERT-DocSCAN model registered an accuracy and F1-score of 89.85% on the test dataset, while the SBERT-MEC algorithm exhibited comparable performance with an accuracy of 89.38% and a matching F1-score. Training on dataset #2 yielded similarly positive results for the SBERT-DocSCAN model, which achieved an accuracy and F1-score of 89.83%, while the SBERT-MEC algorithm recorded an accuracy of 86.73% and an F1-score of 86.51%. Notably, training on dataset #3 allowed the SBERT-DocSCAN model to attain the best results, with an accuracy and F1-score of 91.30%, while the SBERT-MEC algorithm also performed robustly, obtaining an accuracy of 90.39% and an F1-score of 90.35%. Ablation analysis highlighted the critical role of the integrated features and methodologies in improving classification efficiency.
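The two metrics reported throughout — accuracy and F1-score — can be computed with a short dependency-free sketch (in practice scikit-learn's `accuracy_score` and `f1_score` would typically be used; the macro-averaged F1 shown here is one common convention for multiclass sentence classification):

```python
def accuracy(y_true, y_pred):
    """Fraction of sentences whose predicted class matches the gold class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def macro_f1(y_true, y_pred):
    """Average per-class F1 = 2PR/(P+R), weighting every class equally."""
    classes = set(y_true) | set(y_pred)
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```

Because macro averaging weights every class equally, accuracy and macro F1 coincide only when per-class performance is balanced, which is consistent with the near-identical accuracy/F1 pairs reported above.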

Conclusions: Our approach addresses the challenge of limited datasets for Chinese medical abstract classification by generating novel datasets. The deployment of SBERT-DocSCAN and SBERT-MEC models significantly enhances the precision of classifying Chinese medical abstracts, even when using synthetic datasets with pseudolabels.
