Robust privacy amidst innovation with large language models through a critical assessment of the risks.

IF 4.6 | CAS Zone 2 (Medicine) | JCR Q1, Computer Science, Information Systems | Journal of the American Medical Informatics Association | Pub Date: 2025-05-01 | DOI: 10.1093/jamia/ocaf037

Yao-Shun Chuang, Atiquer Rahman Sarkar, Yu-Chun Hsu, Noman Mohammed, Xiaoqian Jiang

Pages: 885-892. Publication type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12012348/pdf/

Citations: 0

Abstract

Objective: This study evaluates the integration of electronic health records (EHRs) and natural language processing (NLP) with large language models (LLMs) to enhance healthcare data management and patient care, focusing on using advanced language models to create secure, Health Insurance Portability and Accountability Act-compliant synthetic patient notes for global biomedical research.

Materials and methods: The study used de-identified and re-identified versions of the MIMIC III dataset with GPT-3.5, GPT-4, and Mistral 7B to generate synthetic clinical notes. Text generation employed templates and keyword extraction for contextually relevant notes, with One-shot generation for comparison. Privacy was assessed by analyzing protected health information (PHI) occurrence and co-occurrence, while utility was evaluated by training an ICD-9 coder using synthetic notes. Text quality was measured using ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and cosine similarity metrics to compare synthetic notes with source notes for semantic similarity.
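
As a rough illustration of the text-quality comparison described above, the sketch below computes a unigram ROUGE recall (ROUGE-1) and a cosine similarity over term-frequency vectors. The abstract does not state which ROUGE variant or which text representation was used for cosine similarity, so both choices, along with the helper names `tokenize`, `rouge1_recall`, and `cosine_similarity`, are assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_recall(reference, candidate):
    """Share of reference unigrams (with multiplicity) that also appear in the candidate."""
    ref = Counter(tokenize(reference))
    cand = Counter(tokenize(candidate))
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[token]) for token, count in ref.items())
    return overlap / total


def cosine_similarity(text_a, text_b):
    """Cosine similarity between term-frequency vectors (an assumed representation)."""
    vec_a = Counter(tokenize(text_a))
    vec_b = Counter(tokenize(text_b))
    dot = sum(vec_a[token] * vec_b[token] for token in vec_a)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    source_note = "patient admitted with chest pain"
    synthetic_note = "patient presented with chest pain"
    print(rouge1_recall(source_note, synthetic_note))
    print(cosine_similarity(source_note, synthetic_note))
```

Term-frequency vectors capture only lexical overlap. A study interested in semantic similarity would more plausibly use sentence embeddings, which this sketch leaves out to stay dependency-free.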

Results: The analysis of PHI occurrence and text utility via the ICD-9 coding task showed that the keyword-based method had low risk and good performance. One-shot generation exhibited the highest PHI exposure and PHI co-occurrence, particularly in geographic location and date categories. The Normalized One-shot method achieved the highest classification accuracy. Re-identified data consistently outperformed de-identified data.
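
One simple way to tally PHI occurrence and co-occurrence is sketched below. It assumes that some upstream tagger (not shown, and not specified in the abstract) has already labeled each synthetic note with the PHI categories it contains. The category labels and the function name `phi_counts` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def phi_counts(notes_phi):
    """Count per-category occurrence and pairwise co-occurrence across notes.

    notes_phi: iterable of collections of PHI category labels, one per note.
    Each category is counted at most once per note.
    """
    occurrence = Counter()
    co_occurrence = Counter()
    for categories in notes_phi:
        unique = sorted(set(categories))
        occurrence.update(unique)
        # Every unordered pair of categories found in the same note.
        co_occurrence.update(combinations(unique, 2))
    return occurrence, co_occurrence


if __name__ == "__main__":
    # Hypothetical tagger output for four synthetic notes.
    tagged = [
        {"DATE", "LOCATION"},
        {"DATE"},
        {"DATE", "LOCATION", "NAME"},
        set(),
    ]
    occ, co = phi_counts(tagged)
    print(occ)
    print(co)
```

Co-occurrence matters because a combination of categories in one note, such as a location together with a date, can narrow down an individual more than either category alone.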

Discussion: Privacy analysis revealed a critical balance between data utility and privacy protection, influencing future data use and sharing.

Conclusion: This study shows that keyword-based methods can create synthetic clinical notes that protect privacy while retaining data usability, potentially improving clinical data sharing. The use of dummy PHIs to counter privacy attacks may offer better utility and privacy than traditional de-identification.
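
The idea of substituting dummy PHI for de-identification placeholders can be sketched as follows. This is not the authors' procedure, which the abstract does not describe. It assumes placeholders of the bracketed form `[** ... **]`, as found in MIMIC-III notes, and the dummy values, keyword rules, and the function name `fill_dummy_phi` are all hypothetical.

```python
import re

# Assumed placeholder syntax: [** ... **], as found in MIMIC-III notes.
PLACEHOLDER = re.compile(r"\[\*\*(.*?)\*\*\]")

# Hypothetical dummy values; a real pipeline would sample from larger pools.
DUMMY_BY_KEYWORD = [
    ("name", "Jane Doe"),
    ("hospital", "General Hospital"),
    ("location", "Springfield"),
]
DUMMY_DATE = "2020-01-01"
DUMMY_DEFAULT = "UNKNOWN"


def fill_dummy_phi(note):
    """Replace each placeholder with a dummy value chosen by simple keyword rules."""

    def substitute(match):
        label = match.group(1).strip().lower()
        # Placeholders that contain only a date get a fixed dummy date.
        if re.fullmatch(r"\d{4}-\d{1,2}-\d{1,2}", label):
            return DUMMY_DATE
        for keyword, dummy in DUMMY_BY_KEYWORD:
            if keyword in label:
                return dummy
        return DUMMY_DEFAULT

    return PLACEHOLDER.sub(substitute, note)


if __name__ == "__main__":
    note = "Pt [**Known lastname 123**] seen at [**Hospital1 18**] on [**2101-10-20**]."
    print(fill_dummy_phi(note))
```

Fixed dummy values keep the example deterministic. Sampling varied surrogates per placeholder would yield more natural text while still containing no real patient information.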




Source journal

Journal of the American Medical Informatics Association (Medicine; Computer Science, Interdisciplinary Applications)

CiteScore: 14.50

Self-citation rate: 7.80%

Publication volume: 230

Review time: 3-8 weeks

Journal description: JAMIA is AMIA's premier peer-reviewed journal for biomedical and health informatics. Covering the full spectrum of activities in the field, JAMIA includes informatics articles in the areas of clinical care, clinical research, translational science, implementation science, imaging, education, consumer health, public health, and policy. JAMIA's articles describe innovative informatics research and systems that help to advance biomedical science and to promote health. Case reports, perspectives, and reviews also help readers stay connected with the most important informatics developments in implementation, policy, and education.