From general to specific: Tailoring large language models for real-world medical communications

Clinical and Translational Medicine · IF 6.8 · CAS Tier 1 (Medicine) · Q1 (Medicine, Research & Experimental) · Pub Date: 2024-12-31 · DOI: 10.1002/ctm2.70157
Xinti Sun, Wenjun Tang, Zigeng Huang, Erping Long, Peixing Wan

Abstract

The development of generative artificial intelligence (AI), such as large language models (LLMs), has garnered significant attention owing to their proficiency in interpreting instructions and generating human-like responses. Autoregressive LLMs, like GPT, are pre-trained on large-scale natural language corpora.1 They are subsequently fine-tuned on human-provided instructions, endowing them with generalized capabilities across a wide range of tasks.

However, these foundation models lack training on specialized medical corpora, which limits their ability to accurately interpret the critical aspects of patient-provider communication and leads to misinterpretations or inaccuracies in real-world applications.2 Over the past two years, several medical LLMs have been developed or fine-tuned using medical corpora and professional knowledge. For example, among Chinese LLMs, Sunsimiao-7B was fine-tuned from Qwen2-7B on medical corpora, achieving state-of-the-art performance on the Chinese Medical Benchmark.3 Similarly, HuatuoGPT-II, which adopts a one-stage adaptive training approach, performed outstandingly in the 2023 Chinese National Pharmacist Examination.4 Despite possessing extensive medical knowledge and excelling on medical examinations, these models have yet to be deployed in real-world healthcare settings. A primary barrier is the lack of “site-specific” knowledge: the unique protocols, workflows, and contextual information particular to each reception desk within a hospital. This indispensable site-specific knowledge underscores the need for further refinement of foundation models. Second, LLMs may generate hallucinations or fabricated facts, producing misinformation that not only undermines trust but also raises significant patient-safety concerns, a major barrier to practical implementation in healthcare settings.5 Third, randomized clinical trials (RCTs) of medical LLMs remain limited, and rigorous validation is needed to assess their practicality. As David Ouyang6 put it, “We need more RCTs of AI”.

To address these needs, we developed a site-specific prompt engineering chatbot (SSPEC) within the “Panoramic Data Collection-Knowledge Refinement-Algorithm Enhancement” framework7 (Figure 1). The process began with panoramic data collection at each reception site. Next, we incorporated site-specific knowledge and fine-tuned the foundation model, GPT-3.5 Turbo, using a prompt template with three components: Role of SSPEC, Patient Query, and Site-Specific Knowledge. This approach enriches the foundation model with site-specific knowledge and strengthens its logical reasoning, effectively adapting it to the heterogeneity of medical settings (Figure 2A). The model then undergoes iterative refinement through training and clinical trials, enabling SSPEC to achieve superior adaptability and performance relative to the foundation model in the RCT.
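As an illustration, a three-part prompt template of the kind described above could be assembled as follows. This is a hypothetical sketch: the `build_prompt` helper, the section headings, and the example strings are illustrative assumptions, not the authors' actual template.

```python
# Hypothetical sketch of a three-component site-specific prompt template
# (Role of SSPEC, Patient Query, Site-Specific Knowledge). All names and
# example text below are illustrative, not the authors' code.

def build_prompt(role: str, patient_query: str, site_knowledge: str) -> str:
    """Assemble a single prompt string from the three components."""
    return (
        f"### Role of SSPEC\n{role}\n\n"
        f"### Site-Specific Knowledge\n{site_knowledge}\n\n"
        f"### Patient Query\n{patient_query}\n"
    )

prompt = build_prompt(
    role="You are a reception-desk assistant at the ophthalmology clinic.",
    patient_query="Where do I pick up my dilation drops?",
    site_knowledge="Dilation drops are dispensed at window 3 on floor 2.",
)
```

Placing the site-specific knowledge before the patient query lets the model ground its answer in the local protocol rather than in its general pre-training alone.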

To further enhance SSPEC's safety, we developed a knowledge-aligned alert system to mitigate hallucinations. The system integrates three modules: a key-phrase matching module, an independent LLM evaluation module, and a Retrieval-Augmented Generation (RAG)-based automatic evaluation module. Key phrases are hallucination-related terms (uncertain responses) detected in our real-world corpora; semantic similarity between an SSPEC response and these key phrases is measured by cosine similarity. The independent LLM evaluation module assesses the potential risks of SSPEC responses using an independent model, GPT-4.0. The RAG-based automatic evaluation module builds a knowledge base from site-specific knowledge and assesses hallucination by measuring the recall rate of responses against this knowledge base. If a response fails these evaluations, the system issues an alert prompting a nurse to review the response, and the nurse's feedback is used to improve SSPEC's performance (Figure 2B). In the RCT, this nurse-SSPEC collaboration model effectively mitigated the potential harm of uncertain responses, with a specificity of 99.8% and a sensitivity of 85.0%.
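The key-phrase matching step can be sketched minimally as below, using a toy bag-of-words embedding in place of real sentence embeddings. The `embed` helper, the phrase list, and the 0.8 threshold are all illustrative assumptions, not the published system's parameters.

```python
# Minimal sketch of cosine-similarity key-phrase matching for flagging
# uncertain responses. The embedding, phrase list, and threshold are
# illustrative stand-ins for the real system's components.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding': token counts (real systems use dense vectors)."""
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

KEY_PHRASES = ["i am not sure", "i cannot confirm"]  # hallucination-related terms
THRESHOLD = 0.8  # illustrative alert threshold

def needs_alert(response: str) -> bool:
    """Flag a response for nurse review if it resembles an uncertain key phrase."""
    return any(
        cosine_similarity(embed(response), embed(phrase)) >= THRESHOLD
        for phrase in KEY_PHRASES
    )
```

A flagged response would then pass to the other two modules (independent LLM review and RAG-based recall against the site knowledge base) before the alert reaches a nurse.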

In populous developing countries with limited healthcare resources, such as China, healthcare providers often experience intense workloads and struggle to maintain high-quality communication, resulting in inefficiencies and heightened patient-provider conflicts. A 2016–2019 survey of nearly 40,000 Chinese healthcare providers found that 62% of doctors and 43.8% of nurses were consistently overburdened, a situation further exacerbated by the coronavirus disease 2019 pandemic.8 Another survey from 2017 to 2019 across 136 tertiary hospitals showed that the proportion of patient-nurse conflicts increased from 20.47% to 28.61%.9 Despite numerous efforts from the Chinese government, such as the “Healthy China 2030” initiative,10 a shortage of qualified healthcare professionals has slowed the progress of these policies. By leveraging generative AI technology to optimize resource allocation and enhance communication, SSPEC has significantly improved healthcare worker efficiency and reduced workload. It has also fostered greater empathy towards patients, effectively lowering patient-provider conflicts and offering a promising solution to these challenges.

Despite SSPEC's potential, several challenges remain. First, AI must not become a cold barrier between healthcare providers and patients: over-reliance by nurses could weaken the essential human connection in patient-provider interactions. Second, AI acceptance across age groups should be carefully considered so that its use is not forced in ways that overlook the concerns of particular individuals and vulnerable groups. Third, fairness is key: this model should aim to reduce, not exacerbate, healthcare resource disparities between developed and underdeveloped regions.
