Harnessing Large Language Models for Rheumatic Disease Diagnosis: Advancing Hybrid Care and Task Shifting
Fabian Lechner, Sebastian Kuhn, Johannes Knitza
International Journal of Rheumatic Diseases, vol. 28, no. 2, published 2025-02-06
DOI: 10.1111/1756-185X.70124 (https://onlinelibrary.wiley.com/doi/10.1111/1756-185X.70124)
Abstract
Rheumatology is facing an expanding care gap, as the number of newly referred patients continues to outpace the availability of rheumatologists [1], resulting in longer diagnostic delays—often weeks to months—that lead to irreversible damage, poorer treatment outcomes, and higher societal costs [2]. Patients and physicians alike struggle with fluctuating, often nonspecific symptoms (e.g., joint pain), and this challenge is compounded by limited awareness of rheumatic diseases among both the general population and general practitioners. The poor specificity of referrals and the inability of traditional triage approaches to improve the situation widen the care gap further. Although patient education is integral to rheumatology care, it remains underutilized due to inadequate reimbursement and workforce shortages, leaving many patients feeling poorly informed about their disease. Clinicians also face a significant time burden with clinical documentation [3], especially for newly referred patients.
In response to these multifaceted challenges, digital health technologies (DHT) have emerged as a promising cornerstone for enhancing diagnosis, information provision, patient education, and documentation, and for alleviating workforce shortages. With the rapid proliferation of smartphones and advanced DHT, traditional care delivery models should be reevaluated to leverage these innovations [4]. Task-shifting, wherein tasks are delegated from physicians to nurses, medical students, or other healthcare professionals, is increasingly being implemented to mitigate workforce shortages. However, task-shifting remains limited in scale and cost-efficiency, and DHT could significantly facilitate its widespread implementation [5].
Currently, increasing numbers of rheumatic patients turn to online platforms for initial symptom assessment [6] and to diagnostic decision support systems (DDSS) that can provide preliminary diagnoses within minutes. Although computer-aided diagnosis for rheumatologists has existed for decades [7], adoption has been hindered by poor usability [8], including time-intensive data entry [9] and restricted querying options. These limitations also affect patient education, as static, often printed information leaves patients scrolling through lengthy materials rather than engaging in open-ended, personalized exploration. To bridge these limitations, recent advances in large language model (LLM) technology offer unprecedented scalability and multimodal data processing. DHT usability, performance, and the patient-provider relationship could therefore be significantly improved by integrating LLM-driven decision support within a collaborative digital health triad [4]. By continuously processing patient- and provider-generated data, LLMs can deliver more personalized, accessible, and dynamic support, transforming care delivery with the aim of closing the rheumatology care gap.
LLMs have demonstrated remarkable proficiency in clinical reasoning owing to their ability to process large datasets across medical fields, including rare diseases [10]. By passively and continuously evaluating the vast amount of available clinical data, LLMs could accelerate diagnosis and the identification of at-risk individuals, enabling a more proactive approach to care without imposing additional burdens on physicians or patients. LLM capabilities have been highlighted by outperforming human experts on standardized exams such as the United States Medical Licensing Examination (USMLE) and rheumatology exams [11]. Importantly, in a direct comparison study in which both were given the same anamnestic information from real patients presenting to a rheumatology service, ChatGPT's diagnostic accuracy was found to be non-inferior to that of experienced rheumatologists [12]. Notably, the model exhibited exceptional sensitivity in identifying inflammatory rheumatic diseases (IRDs), listing the correct diagnosis among its top three options in 86% of IRD cases, surpassing the rheumatologists' rate of 74%.
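The top-three metric behind these figures is simple to make precise: a case counts as a hit if the correct diagnosis appears anywhere in the model's ranked list of three suggestions. A minimal sketch, using entirely hypothetical case data for illustration:

```python
def top_k_sensitivity(cases, k=3):
    """Fraction of cases whose true diagnosis appears in the top-k ranked suggestions."""
    hits = sum(1 for true_dx, ranked in cases if true_dx in ranked[:k])
    return hits / len(cases)

# Hypothetical examples: (true diagnosis, model's ranked differential)
cases = [
    ("rheumatoid arthritis", ["rheumatoid arthritis", "psoriatic arthritis", "gout"]),
    ("axial spondyloarthritis", ["fibromyalgia", "axial spondyloarthritis", "osteoarthritis"]),
    ("systemic lupus erythematosus", ["fibromyalgia", "osteoarthritis", "gout"]),
    ("gout", ["gout", "pseudogout", "septic arthritis"]),
]
print(top_k_sensitivity(cases))  # 3 of 4 hits -> 0.75
```

Reporting top-three rather than top-one sensitivity reflects how such tools are used in practice: as a differential-diagnosis aid for the clinician rather than a single verdict.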
Building on this, Venerito and Iannone used a locally fine-tuned LLM, optimized through prompt engineering, to diagnose fibromyalgia by analyzing subtle expressions of pain and emotion in patient communications [13]. This innovative approach achieved an accuracy of 87% and an AUROC of 0.86, underscoring the potential of LLMs to tackle diagnostic challenges in subjective and linguistically intricate conditions by broadening the scope of considerations and highlighting less obvious diagnoses. Additionally, multiple studies have demonstrated that LLMs can extract diagnostic information from patient dialogues even when symptoms are described in simple or colloquial language [14]. This linguistic adaptability allows LLMs to comprehend patient narratives effectively and identify subtle cues that might be overlooked in traditional assessments. Combined with the structured nature of multi-turn dialogues, this capability has shown significant potential for clinical applications [14].
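The AUROC reported here has a direct probabilistic reading: it is the probability that the model scores a randomly chosen positive case (e.g., fibromyalgia) higher than a randomly chosen negative one, with ties counted as half. A minimal pure-Python sketch with hypothetical scores:

```python
def auroc(scores_pos, scores_neg):
    """AUROC as P(score_pos > score_neg), ties counted half (Mann-Whitney U form)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical model confidence scores: positive vs. negative cases
pos = [0.9, 0.8, 0.7, 0.4]
neg = [0.6, 0.3, 0.2]
print(auroc(pos, neg))  # 11 of 12 pairs correctly ordered -> ~0.917
```

An AUROC of 0.86 thus means the model ranks a true case above a non-case in 86% of such pairs, independent of any particular decision threshold.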
One of these applications gaining more traction is the introduction of LLMs for documentation tasks such as summarizing clinical conversations, generating structured clinical notes, and extracting critical keywords. Research in this area has introduced improved note formats like K-SOAP and domain-specific datasets such as CliniKnote, which combine simulated doctor-patient dialogues with meticulously curated notes. Through advanced fine-tuning, prompting strategies, and sophisticated NLP methods, LLMs can enhance the efficiency and quality of clinical documentation, ultimately reducing clinician workload and enabling more effective patient care [15].
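At its simplest, such a documentation pipeline amounts to assembling an instruction prompt that constrains the model to a fixed note schema. The sketch below shows only the prompt-construction step; the section list follows the K-SOAP idea (keywords plus the classic SOAP fields), but the exact wording and field names here are illustrative assumptions, and the actual model call is deliberately omitted:

```python
# Illustrative K-SOAP-style section list; real systems would pair such a
# prompt with fine-tuning or few-shot examples from a curated dataset.
KSOAP_SECTIONS = ["Keywords", "Subjective", "Objective", "Assessment", "Plan"]

def build_ksoap_prompt(transcript: str) -> str:
    """Assemble an instruction prompt asking an LLM to emit a structured note."""
    sections = "\n".join(f"- {s}:" for s in KSOAP_SECTIONS)
    return (
        "Summarize the following doctor-patient dialogue as a structured "
        "clinical note with exactly these sections:\n"
        f"{sections}\n\n"
        f"Dialogue:\n{transcript}\n"
    )

prompt = build_ksoap_prompt(
    "Doctor: What brings you in? Patient: My finger joints ache in the morning."
)
print(prompt)
```

Constraining the output schema in this way is what makes the generated notes machine-parseable downstream, which is where much of the workload reduction comes from.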
Furthermore, LLMs lend themselves to educational applications, as exemplified by their ability to address patient queries with accuracy, empathy, and comprehensiveness. For instance, when ChatGPT-4 was tested with questions commonly posed by patients with systemic lupus erythematosus, its responses were rated not only more empathic but also qualitatively better than those of expert rheumatologists [16]. These capabilities stem from the transformer-based architectures underlying LLMs [17]. By integrating large, diverse knowledge sources, from clinical guidelines to authoritative research publications [18], these models can maintain extensive contextual understanding and dynamically incorporate new information. As a result, LLMs hold the potential to improve diagnostic accuracy, streamline documentation, enhance patient education, and broaden the range of differential diagnoses considered. In doing so, they may help alleviate clinician workload, support more proactive and patient-centered care, and ultimately elevate the overall quality of healthcare delivery.
However, the clinical deployment of AI-driven diagnostic tools faces significant regulatory hurdles. Determining the intended purpose of these technologies is central to their classification as either medical or non-medical devices, a distinction that directly influences compliance requirements. Under the EU AI Act, general-purpose AI models such as LLMs supporting clinical decisions may face stringent obligations, especially regarding transparency, risk classification, and post-market monitoring. Simultaneously, regulatory requirements necessitate robust clinical evaluation, posing challenges in validating AI's predictive capabilities. Ensuring alignment with these frameworks is critical for advancing AI adoption while safeguarding patient safety and compliance with regulations.
While these regulatory challenges must be addressed, LLMs also pose inherent risks, such as generating medical hallucinations: plausible yet incorrect or unverifiable information. This has been highlighted in the Med-HALT framework, in which models such as GPT-3.5 hallucinated severely on more complex tasks. In a field where precision is paramount, such inaccuracies could misguide clinical decisions and jeopardize patient safety [19]. Ensuring LLM transparency and explainability has become increasingly challenging, making the grounding of these models a crucial area of research. A promising grounding technique gaining significant attention is Retrieval-Augmented Generation (RAG). RAG addresses the transparency issue by first querying a database of known information related to a user's question or input, retrieving only semantically similar text blocks that are likely to answer the question or support the generated content. The model then produces an output based solely on this retrieved information, allowing it to cite the sources it drew on. This approach enables users not only to verify the model's output against known literature but also to explore the subject further by reviewing the referenced documents, such as publications or guidelines [20]. As illustrated in Figure 1, RAG enhances both the accuracy and verifiability of LLM outputs by grounding responses in relevant, validated information from a knowledge base. While RAG systems are widely used in academic search engines, their effectiveness in medical contexts, particularly for patient education or diagnosis, remains largely unexplored. Collaborative efforts among AI developers, clinicians, and researchers are essential to optimize LLM utility while mitigating risks. Further work on grounding methods and on specialized models tailored to rheumatology can enhance their effectiveness.
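The retrieve-then-generate loop described above can be sketched in a few lines. In this toy sketch, word-overlap (Jaccard) similarity stands in for a real embedding index, the "generation" step simply concatenates the retrieved passages so every claim stays traceable to a citation, and the guideline snippets are hypothetical:

```python
def jaccard(a: str, b: str) -> float:
    """Crude lexical similarity, a stand-in for embedding-based retrieval."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def retrieve(query: str, corpus: dict, k: int = 2) -> list:
    """Return the k (source, passage) pairs most similar to the query."""
    ranked = sorted(corpus.items(), key=lambda kv: jaccard(query, kv[1]), reverse=True)
    return ranked[:k]

def answer_with_sources(query: str, corpus: dict) -> str:
    """Ground the answer in retrieved passages and cite their sources.

    A real system would feed the passages to an LLM as context; here the
    output is just the retrieved text plus its citations.
    """
    hits = retrieve(query, corpus)
    body = " ".join(passage for _, passage in hits)
    cites = ", ".join(source for source, _ in hits)
    return f"{body} [Sources: {cites}]"

# Hypothetical knowledge base of guideline snippets
corpus = {
    "Guideline-RA": "Morning stiffness and symmetric joint swelling suggest rheumatoid arthritis.",
    "Guideline-Gout": "Acute monoarthritis of the first toe with hyperuricemia suggests gout.",
    "Patient-Leaflet": "Regular exercise helps maintain joint mobility.",
}
print(answer_with_sources("swelling and morning stiffness in joints", corpus))
```

Because the answer carries its source identifiers, a clinician or patient can check each retrieved passage against the original guideline, which is precisely the verifiability benefit the text attributes to RAG.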
Integrating LLMs into the diagnosis of rheumatic diseases presents a transformative opportunity to reduce diagnostic delays, alleviate clinician workload, and enhance patient education. Despite existing challenges, the synergistic advancement of AI innovation and regulatory compliance can help bridge care gaps, improve patient outcomes, and elevate the professional experience of healthcare providers, ultimately fostering more efficient and patient-centered rheumatology care.
Fabian Lechner and Johannes Knitza drafted the manuscript. Sebastian Kuhn provided suggestions, reviewed and edited the manuscript several times.
Fabian Lechner declares honoraria from Lilly, Novo Nordisk, Siemens Healthineers, Diabetes.de, and the German Diabetes Association (DDG). Sebastian Kuhn is founder and shareholder of MED.digital GmbH. Johannes Knitza declares research support from Abbvie, GSK, Vila Health, honoraria and consulting fees from Abbvie, AstraZeneca, BMS, Boehringer Ingelheim, Chugai, GAIA, Galapagos, GSK, Janssen, Lilly, Medac, Novartis, Pfizer, Sobi, Rheumaakademie, UCB, Vila Health and Werfen.
Journal Introduction:
The International Journal of Rheumatic Diseases (formerly APLAR Journal of Rheumatology) is the official journal of the Asia Pacific League of Associations for Rheumatology. The Journal accepts original articles on clinical or experimental research pertinent to the rheumatic diseases, work on connective tissue diseases and other immune and allergic disorders. The acceptance criteria for all papers are the quality and originality of the research and its significance to our readership. Except where otherwise stated, manuscripts are peer reviewed by two anonymous reviewers and the Editor.