Surgical planning can be highly complicated and personalized, requiring a surgeon to balance multiple decisional dimensions including surgical effectiveness, risk, cost, and the patient's conditions and preferences. Turning to artificial intelligence for support is therefore appealing, yet suitable systems have been lacking. This study filled that gap with Multi-Dimensional Recommendation (MUDI), an interpretable, data-driven intelligent system that supports personalized surgical recommendations on both the patient's and the surgeon's side, jointly considering multiple decisional dimensions. Applied to Pelvic Organ Prolapse, a common disease in women with a significant impact on quality of life, MUDI outperformed a range of competing methods and achieved performance comparable to that of top urogynecologists, with a transparent process that eased communication between surgeons and patients. Users were willing to accept the recommendations and achieved higher accuracy with the aid of MUDI. This success indicates that MUDI has the potential to address similar challenges in other clinical settings.
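MUDI's architecture is not detailed in this abstract; as a minimal sketch of the general idea only, the hypothetical Python example below scores candidate procedures along several decisional dimensions and aggregates them with patient-specific weights. The option names, dimension values, and weights are all invented for illustration and do not reproduce MUDI's actual model.

```python
from dataclasses import dataclass

# Hypothetical multi-dimensional scoring sketch; not MUDI's actual model.

@dataclass
class SurgicalOption:
    name: str
    effectiveness: float  # 0-1, higher is better
    risk: float           # 0-1, higher is worse
    cost: float           # 0-1 normalized, higher is worse

def recommend(options, weights):
    """Rank options by a patient-weighted aggregate score.

    weights: dict with keys 'effectiveness', 'risk', 'cost';
    risk and cost enter negatively so safer, cheaper options score higher.
    """
    def score(o):
        return (weights["effectiveness"] * o.effectiveness
                - weights["risk"] * o.risk
                - weights["cost"] * o.cost)
    ranked = sorted(options, key=score, reverse=True)
    # Return scores alongside names so the ranking stays transparent
    return [(o.name, round(score(o), 3)) for o in ranked]

options = [
    SurgicalOption("native-tissue repair", 0.70, 0.20, 0.30),
    SurgicalOption("mesh-augmented repair", 0.85, 0.40, 0.60),
]
# A patient who strongly prioritizes low risk
print(recommend(options, {"effectiveness": 0.4, "risk": 0.5, "cost": 0.1}))
```

Because the per-dimension scores are returned with the ranking, a surgeon can explain to the patient why one option was preferred, which is the kind of transparency the abstract emphasizes.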
Demonstrating cardiovascular (CV) benefits of lipid-lowering therapy (LLT) requires long-term randomized clinical trials (RCTs) enrolling thousands of patients. Innovative approaches such as in silico trials, which apply a computational disease model to virtual patients receiving multiple treatments, offer a complementary way to rapidly generate comparative effectiveness data. A mechanistic computational model of atherosclerotic cardiovascular disease (ASCVD) was built from published knowledge, describing lipoprotein homeostasis, LLT effects, and the progression of atherosclerotic plaques leading to myocardial infarction, ischemic stroke, major adverse limb events, and CV death. The ASCVD model was successfully calibrated and validated, reproducing the LLT effects observed in selected RCTs (ORION-10 and FOURIER for calibration; ORION-11, ODYSSEY-OUTCOMES, and FOURIER-OLE for validation) on lipoproteins and on ASCVD event incidence at both the population and subgroup levels. This enables the future use of the model to conduct the SIRIUS programme, which aims to predict CV event reduction with inclisiran, an siRNA targeting hepatic PCSK9 mRNA.
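The mechanistic ASCVD model is far richer than can be shown here; the toy sketch below conveys only the in silico trial pattern: simulate virtual patients in two arms, let the therapy lower LDL-C, and assume event hazard declines with achieved LDL-C reduction. The ~22%-per-mmol/L relative risk reduction echoes published meta-analytic estimates but is used here purely as an illustrative assumption, as are all other parameters.

```python
import random

random.seed(0)

def simulate_arm(n_patients, years, ldl_reduction):
    """Count CV events in one treatment arm of a toy in silico trial.

    ldl_reduction: fractional LDL-C lowering from therapy (0 = placebo).
    """
    events = 0
    for _ in range(n_patients):
        baseline_ldl = max(random.gauss(3.5, 0.8), 1.0)  # mmol/L, assumed
        achieved_drop = baseline_ldl * ldl_reduction      # mmol/L lowered
        # Assumed 2%/year base hazard, ~22% relative reduction per
        # mmol/L of LDL-C lowering (illustrative, not the SIRIUS model)
        annual_hazard = 0.02 * (0.78 ** achieved_drop)
        for _ in range(years):
            if random.random() < annual_hazard:
                events += 1
                break
    return events

placebo = simulate_arm(10_000, 5, 0.0)
treated = simulate_arm(10_000, 5, 0.50)  # e.g. ~50% LDL-C reduction
print(f"5-year events: placebo {placebo}, treated {treated}")
print(f"relative risk: {treated / placebo:.2f}")
```

Calibration, in this framing, amounts to tuning such hazard parameters until simulated arms reproduce the lipid and event curves of the anchor RCTs; validation then tests the tuned model against trials it never saw.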
As the volume of medical literature grows at an accelerating pace, efficient tools are needed to synthesize evidence for clinical practice and research, and interest in leveraging large language models (LLMs) to generate clinical reviews has surged. However, there are significant concerns about the reliability of integrating LLMs into the clinical review process. This study presents a systematic comparison between LLM-generated and human-authored clinical reviews, revealing that while AI can produce reviews quickly, the resulting reviews often contain fewer references, less comprehensive insights, and lower logical consistency, and their citations exhibit lower authenticity and accuracy. Additionally, a higher proportion of the references come from lower-tier journals. Moreover, the study uncovers a concerning weakness in current detection systems for identifying AI-generated content, suggesting a need for more advanced checking systems and a stronger ethical framework to ensure academic transparency. Addressing these challenges is vital for the responsible integration of LLMs into clinical research.
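One reliability check implied above, citation authenticity, can be automated. The sketch below queries the public CrossRef REST API to test whether a generated reference resolves to a real indexed work; the relevance-score threshold is an assumption, and this is a generic illustration rather than the study's actual verification pipeline.

```python
import requests

def citation_exists(citation_text, min_score=60.0):
    """Check whether a free-text citation matches a real indexed work.

    Uses CrossRef's bibliographic search; min_score is an assumed
    cutoff on CrossRef's relevance score, below which a match is
    treated as unverified.
    """
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": citation_text, "rows": 1},
        timeout=10,
    )
    items = resp.json()["message"]["items"]
    return bool(items) and items[0].get("score", 0) >= min_score

# A fabricated citation should fail to verify
print(citation_exists("Smith J et al. A fabricated trial of nothing. 2023."))
```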
Understanding the factors associated with persistent symptoms after SARS-CoV-2 infection is critical to improving long-term health outcomes. Using a wearable-derived behavioral and physiological dataset (n = 20,815), we identified individuals with self-reported persistent fatigue and shortness of breath after SARS-CoV-2 infection. Compared with symptom-free COVID-19-positive (n = 150) and negative controls (n = 150), these individuals (n = 50) had higher resting heart rates (mean difference 2.37 and 1.49 bpm, respectively) and lower daily step counts (on average 3030 and 2909 fewer steps, respectively), even at least three weeks before SARS-CoV-2 infection. In addition, persistent fatigue and shortness of breath were associated with a significant reduction in mean quality of life (WHO-5, EQ-5D), even before infection. Here we show that persistent symptoms after SARS-CoV-2 infection may be associated with pre-existing lower fitness levels or health conditions. These findings additionally highlight the potential of wearable devices to track health dynamics and provide valuable insights into the long-term outcomes of infectious diseases.
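As an illustration of the kind of pre-infection group comparison reported above, the sketch below averages each user's wearable readings and then compares group means. The table layout, column names, and values are hypothetical placeholders, not the study's actual schema or data.

```python
import pandas as pd

# Hypothetical wearable data: one row per user-day, pre-infection window
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3],
    "group":   ["persistent", "persistent", "covid_ctrl",
                "covid_ctrl", "neg_ctrl", "neg_ctrl"],
    "resting_hr": [68, 70, 65, 66, 64, 65],
    "steps":      [5200, 4800, 8100, 7900, 8400, 8600],
})

# Average within each user first, so users with more days don't dominate
per_user = df.groupby(["user_id", "group"], as_index=False).mean()
group_means = per_user.groupby("group")[["resting_hr", "steps"]].mean()
print(group_means)
print("HR difference vs negative controls:",
      group_means.loc["persistent", "resting_hr"]
      - group_means.loc["neg_ctrl", "resting_hr"])
```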
Artificial Intelligence (AI) is revolutionizing healthcare, but its true impact depends on seamless interaction between humans and AI. While most research focuses on technical metrics, frameworks to measure the compatibility and synergy of real-world human-AI interactions in healthcare settings are lacking. We propose a multimodal toolkit combining ecological momentary assessment, quantitative observations, and baseline measurements to optimize AI implementation.
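Ecological momentary assessment (EMA) collects brief, repeated in-the-moment ratings from users as they work. As a purely hypothetical sketch of what one record in such a toolkit might look like (the fields are assumptions, not the proposed toolkit's actual schema):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical EMA record for one human-AI interaction episode
@dataclass
class EMARecord:
    clinician_id: str
    timestamp: datetime
    ai_tool: str           # which AI system was just used
    trust_rating: int      # e.g. 1-5 momentary trust in the output
    workload_rating: int   # e.g. 1-5 perceived added workload
    free_text: str = ""    # optional open-ended comment

record = EMARecord(
    clinician_id="c-017",
    timestamp=datetime.now(timezone.utc),
    ai_tool="triage-assistant",
    trust_rating=4,
    workload_rating=2,
    free_text="Suggestion matched my impression; saved a lookup.",
)
print(record)
```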
Differential diagnosis (DDx) is crucial in medicine, as it helps healthcare providers systematically distinguish between conditions that share similar symptoms. This study evaluates the influence of lab test results on the accuracy of DDx generated by large language models (LLMs). Clinical vignettes were created from 50 randomly selected case reports in PMC-Patients, incorporating demographics, symptoms, and lab data. Five LLMs (GPT-4, GPT-3.5, Llama-2-70b, Claude-2, and Mixtral-8x7B) were tested to generate Top 10, Top 5, and Top 1 DDx with and without lab data. Results show that incorporating lab data improves accuracy by up to 30% across models. GPT-4 achieved the highest performance, with a Top 1 accuracy of 55% (0.41–0.69) and a lenient accuracy of 79% (0.68–0.90). Statistically significant improvements (Holm-adjusted p values < 0.05) were observed, with GPT-4 and Mixtral excelling. Lab tests, including liver function, metabolic/toxicology panels, and serology, were generally interpreted correctly by the LLMs for DDx.
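The Top-k metric used to score a ranked differential is straightforward to compute. The sketch below uses exact string matching as a stand-in; the study's strict and lenient scoring would involve clinician judgment or a more forgiving match, and the example cases are invented.

```python
def top_k_accuracy(cases, k):
    """cases: list of (gold_diagnosis, ranked_prediction_list) pairs.

    A case counts as a hit if the gold diagnosis appears anywhere in
    the model's top-k ranked predictions (exact match here; real
    evaluation would use a looser equivalence).
    """
    hits = sum(
        any(gold.lower() == p.lower() for p in preds[:k])
        for gold, preds in cases
    )
    return hits / len(cases)

# Invented examples of gold diagnoses and model-ranked differentials
cases_with_labs = [
    ("acute hepatitis A", ["acute hepatitis A", "cholangitis", "sepsis"]),
    ("thyroid storm",     ["sepsis", "thyroid storm", "pheochromocytoma"]),
]
for k in (1, 5, 10):
    print(f"Top-{k} accuracy: {top_k_accuracy(cases_with_labs, k):.2f}")
```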
Large language models (LLMs) can answer expert-level questions in medicine but are prone to hallucinations and arithmetic errors. Early evidence suggests LLMs cannot reliably perform clinical calculations, limiting their potential integration into clinical workflows. We evaluated ChatGPT's performance on 48 medical calculation tasks, finding incorrect responses in one-third of trials (n = 212). We then assessed three forms of agentic augmentation across 10,000 trials: retrieval-augmented generation, a code interpreter tool, and a set of task-specific calculation tools (OpenMedCalc). Models with access to task-specific tools showed the greatest improvement: LLaMa- and GPT-based models demonstrated a 5.5-fold (88% vs. 16% incorrect) and a 13-fold (64% vs. 4.8% incorrect) reduction in incorrect responses, respectively, compared with the unaugmented models. Our findings suggest that integration of machine-readable, task-specific tools may help overcome LLMs' limitations in medical calculations.
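The task-specific tool pattern routes arithmetic away from the model: the LLM emits a structured tool call, and a deterministic function computes the answer. The sketch below illustrates this with the standard Cockcroft-Gault creatinine clearance formula; the tool registry and call format are assumptions for illustration, not OpenMedCalc's actual API.

```python
def cockcroft_gault(age, weight_kg, creatinine_mg_dl, sex):
    """Estimated creatinine clearance (mL/min), Cockcroft-Gault formula."""
    crcl = (140 - age) * weight_kg / (72 * creatinine_mg_dl)
    return crcl * 0.85 if sex == "female" else crcl

# Hypothetical registry mapping tool names to deterministic calculators
TOOLS = {"creatinine_clearance": cockcroft_gault}

def execute_tool_call(call):
    """call: dict like {'tool': name, 'args': {...}} emitted by the LLM."""
    return TOOLS[call["tool"]](**call["args"])

# e.g. the model, asked to dose a renally cleared drug, emits:
call = {"tool": "creatinine_clearance",
        "args": {"age": 70, "weight_kg": 80,
                 "creatinine_mg_dl": 1.2, "sex": "male"}}
print(f"CrCl = {execute_tool_call(call):.1f} mL/min")
```

Because the formula runs as code rather than as token prediction, the numeric result is reproducible across trials, which is consistent with why the task-specific tools outperformed the unaugmented models.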
Attention-deficit/hyperactivity disorder (ADHD) is a prevalent neurodevelopmental disorder characterized by diagnostic complexity and symptom heterogeneity. Here, we explored machine learning (ML) analysis of retinal fundus photographs as a noninvasive biomarker for ADHD screening and for stratification of executive function (EF) deficits. From April to October 2022, 323 children and adolescents with ADHD were recruited from two tertiary South Korean hospitals, and data from age- and sex-matched individuals with typical development were retrospectively collected. We used the AutoMorph pipeline to extract retinal features, applied four types of ML models for ADHD screening and EF subdomain prediction, and adopted the Shapley additive explanations (SHAP) method for interpretability. The ADHD screening models achieved AUROCs of 95.5%–96.9%. For EF stratification, the visual subdomain showed strong performance (AUROC > 85%), whereas the auditory subdomain performed poorly. Our analysis of retinal fundus photographs demonstrates potential as a noninvasive biomarker for ADHD screening and for stratifying EF deficits in the visual attention domain.
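The screening setup reduces to tabular features feeding a classifier scored by AUROC. The sketch below trains a random forest on synthetic placeholder features of the kind AutoMorph extracts (e.g., vessel density, tortuosity, cup-disc ratio, fractal dimension); the data, model choice, and label rule are illustrative assumptions, not the study's pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600
# Four synthetic "retinal features" per subject
X = rng.normal(size=(n, 4))
# Synthetic labels loosely dependent on two of the features
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=1.0, size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("AUROC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

# The study used Shapley additive explanations; with the shap package:
#   import shap
#   shap_values = shap.TreeExplainer(model).shap_values(X_te)
```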