Background: ChatGPT-4o, Google Gemini, and Microsoft Copilot have shown potential in generating health care-related information. However, their accuracy, completeness, and safety for providing drug-related information in Thai contexts remain underexplored.
Objective: This study aims to evaluate the performance of artificial intelligence (AI) systems in responding to drug-related questions in Thai.
Methods: An analytical cross-sectional study was conducted using 76 public drug-related questions compiled from medical databases and social media between November 1, 2019, and December 31, 2024. All questions were categorized into 19 distinct categories, each comprising 4 questions. ChatGPT-4o, Google Gemini, and Microsoft Copilot were queried in a single session on March 1, 2025, by using input in Thai. All responses were evaluated for correctness, completeness, risk, and reproducibility independently by clinical pharmacists using standardized evaluation criteria.
Results: All 3 AI models provided generally complete responses (P=.08). ChatGPT-4o yielded the highest proportion of fully correct responses (P=.08). The overall proportions of high-risk answers did not differ significantly across models (P=.12). Response correctness was influenced by the category of the drug-related question (P=.002), whereas completeness was not (P=.23). The correctness of Google Gemini and Microsoft Copilot was higher than that of ChatGPT-4o for pharmacology queries. Question category also had a statistically significant effect on the risk level of the answers (P=.04). In particular, the pregnancy and lactation category had the highest high-risk response rate (1/76, 1% per system). All 3 AI models demonstrated consistent response patterns when the same questions were re-queried after 1, 7, and 14 days.
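The abstract does not state which statistical test produced these P values. As a hedged illustration only, a chi-square test of independence comparing response-correctness counts across the 3 systems could be computed as follows; the counts and category labels are hypothetical and do not reproduce the study's data.

```python
# Hypothetical sketch: comparing response-correctness counts across 3 chatbots
# with a chi-square test of independence. The counts are illustrative only and
# the abstract does not name the exact test used in the study.
from scipy.stats import chi2_contingency

# Rows: ChatGPT-4o, Google Gemini, Microsoft Copilot
# Columns: fully correct, partially correct, incorrect (out of 76 questions each)
counts = [
    [52, 18, 6],
    [45, 22, 9],
    [43, 24, 9],
]

chi2, p_value, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.2f}, df = {dof}, P = {p_value:.3f}")
```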
Conclusions: The evaluated AI chatbots were able to answer the queries with generally complete content; however, we found limited accuracy and occasional high-risk errors in responding to drug-related questions in Thai. All models exhibited good reproducibility.
Background: Mental disorders are frequently evaluated using questionnaires, which have been developed over the past decades for the assessment of different conditions. Despite the rigorous validation of these tools, high levels of content divergence have been reported for questionnaires measuring the same construct of psychopathology. Previous studies that examined the content overlap required manual symptom labeling, which is observer-dependent and time-consuming.
Objective: In this study, we used large language models (LLMs) to analyze content overlap of mental health questionnaires in an observer-independent way and compare our results with clinical expertise.
Methods: We analyzed questionnaires from a range of mental health conditions, including adult depression (n=7), childhood depression (n=15), clinical high risk for psychosis (CHR-P; n=11), mania (n=7), obsessive-compulsive disorder (n=7), and sleep disorder (n=12). Two different LLM-based approaches were tested. First, we used sentence Bidirectional Encoder Representations from Transformers (sBERT) to derive numerical representations (embeddings) for each questionnaire item, which were then clustered using k-means to group semantically similar symptoms. Second, questionnaire items were submitted as prompts to a Generative Pretrained Transformer (GPT) model to identify underlying symptom clusters. Clustering results were compared with a manual categorization by experts using the adjusted Rand index. Further, we assessed the content overlap within each diagnostic domain based on LLM-derived clusters.
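As a minimal sketch of the first (sBERT-based) pipeline described above, questionnaire items could be embedded, clustered with k-means, and compared against an expert categorization via the adjusted Rand index roughly as follows; the model name, example items, cluster count, and expert labels are placeholders, not the study's materials.

```python
# Minimal sketch of the sBERT pipeline: embed items, cluster with k-means,
# and compare against an expert categorization via the adjusted Rand index.
# Model name, items, cluster count, and expert labels are illustrative only.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

items = [
    "I feel sad most of the day",
    "I have lost interest in activities I used to enjoy",
    "I have trouble falling asleep",
    "I wake up much earlier than I intend to",
]
expert_labels = [0, 0, 1, 1]  # hypothetical expert symptom categories

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder sBERT model
embeddings = model.encode(items)                 # one embedding vector per item

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
ari = adjusted_rand_score(expert_labels, kmeans.labels_)
print(f"Adjusted Rand index vs expert clustering: {ari:.3f}")
```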
Results: We observed varying degrees of similarity between expert-based and LLM-based clustering across diagnostic domains. Overall, agreement between experts was higher than between experts and LLMs. Among the 2 LLM approaches, GPT showed greater alignment with expert ratings than sBERT, ranging from weak to strong similarity depending on the diagnostic domain. Using GPT-based clustering of questionnaire items to assess the content overlap within each diagnostic domain revealed a weak (CHR-P: 0.344) to moderate (adult depression: 0.574; childhood depression: 0.433; mania: 0.419; obsessive-compulsive disorder [OCD]: 0.450; sleep disorder: 0.445) content overlap of questionnaires. Compared with previous studies that manually investigated content overlap among these scales, our results showed some variation, although the differences were not substantial.
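The abstract does not specify how content overlap was quantified. One common choice in this literature is a mean pairwise Jaccard index over the sets of symptom clusters covered by each questionnaire; the sketch below illustrates that assumed metric with invented cluster assignments, not the study's definition or data.

```python
# Hypothetical sketch: content overlap within a diagnostic domain computed as
# the mean pairwise Jaccard index over the sets of LLM-derived symptom
# clusters covered by each questionnaire. Both the metric choice and the
# cluster assignments are illustrative assumptions.
from itertools import combinations

# Each questionnaire maps to the set of cluster IDs its items fall into.
questionnaire_clusters = {
    "scale_A": {0, 1, 2, 3},
    "scale_B": {1, 2, 3, 4},
    "scale_C": {0, 2, 4, 5},
}

def jaccard(a, b):
    return len(a & b) / len(a | b)

pairs = list(combinations(questionnaire_clusters.values(), 2))
overlap = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
print(f"Mean pairwise content overlap: {overlap:.3f}")
```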
Conclusions: These findings demonstrate the feasibility of using LLMs to objectively assess content overlap in diagnostic questionnaires. Notably, the GPT-based approach showed particular promise in aligning with expert-derived symptom structures.
Background: Systematic literature reviews (SLRs) build the foundation for evidence synthesis, but they are exceptionally demanding in terms of time and resources. While recent advances in artificial intelligence (AI), particularly large language models, offer the potential to accelerate this process, their use introduces challenges to transparency and reproducibility. Reporting guidelines such as the PRISMA-AI (Preferred Reporting Items for Systematic Reviews and Meta-Analyses-Artificial Intelligence Extension) primarily focus on AI as a subject of research, not as a tool in the review process itself.
Objective: To address the gap in reporting standards, this study aimed to develop and propose a discipline-agnostic checklist extension to the PRISMA 2020 statement. The goal was to ensure transparent reporting when AI is used as a methodological tool in evidence synthesis, fostering trust in the next generation of SLRs.
Methods: The proposed checklist, named PRISMA-trAIce (PRISMA-Transparent Reporting of Artificial Intelligence in Comprehensive Evidence Synthesis), was developed through a systematic process. We conducted a literature search to identify established, consensus-based AI reporting guidelines (eg, CONSORT-AI [Consolidated Standards of Reporting Trials-Artificial Intelligence] and TRIPOD-AI [Transparent Reporting of a Multivariable Prediction Model of Individual Prognosis or Diagnosis-Artificial Intelligence]). Relevant items from these frameworks were extracted, analyzed, and thematically synthesized to form a modular checklist that integrated with the PRISMA 2020 structure.
Results: The primary result of this work is the PRISMA-trAIce checklist, a comprehensive set of reporting items designed to document the use of AI in SLRs. The checklist covers the entire structure of an SLR, from title and abstract to methods and discussion, and includes specific items for identifying AI tools, describing human-AI interaction, reporting performance evaluation, and discussing limitations.
Conclusions: PRISMA-trAIce establishes an important framework to improve the transparency and methodological integrity of AI-assisted systematic reviews, enhancing the trust required for their responsible application in evidence synthesis. We present this work as a foundational proposal, explicitly inviting the scientific community to join an open science process of consensus building. Through this collaborative refinement, we aim to evolve PRISMA-trAIce into a formally endorsed guideline, thereby ensuring the collective validation and scientific rigor of future AI-driven research.
Background: Spinal cord injury (SCI) is a complex and heterogeneous condition that has received considerable attention. Increasingly, the prognosis of patients with SCI is being predicted using machine learning (ML) techniques.
Objective: This study aims to evaluate the performance and quality of ML models in predicting the outcomes of SCI.
Methods: Literature searches were conducted in PubMed, Web of Science, Embase, PROSPERO, Scopus, Cochrane Library, China National Knowledge Infrastructure, China Biomedical Literature Service System, and Wanfang databases. Meta-analysis of the area under the receiver operating characteristic curve of ML models was performed to comprehensively evaluate their performance.
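The abstract does not describe the pooling model used for the meta-analysis. As a hedged illustration, a simple inverse-variance random-effects (DerSimonian-Laird) pooling of study-level AUCs, with standard errors back-calculated from reported 95% CIs, might look roughly like this; the input values are hypothetical.

```python
# Hypothetical sketch: inverse-variance random-effects (DerSimonian-Laird)
# pooling of AUC estimates, with standard errors derived from 95% CI widths.
# The input values are illustrative and the pooling model is an assumption;
# the abstract does not state the exact meta-analytic method used.
import math

# (AUC, lower 95% CI, upper 95% CI) for several hypothetical studies
studies = [(0.81, 0.74, 0.88), (0.75, 0.66, 0.84), (0.84, 0.78, 0.90)]

aucs = [a for a, lo, hi in studies]
ses = [(hi - lo) / (2 * 1.96) for a, lo, hi in studies]  # SE from CI width
w_fixed = [1 / se**2 for se in ses]

# Between-study heterogeneity (tau^2) via the DerSimonian-Laird estimator
fixed_mean = sum(w * a for w, a in zip(w_fixed, aucs)) / sum(w_fixed)
q = sum(w * (a - fixed_mean) ** 2 for w, a in zip(w_fixed, aucs))
df = len(studies) - 1
c = sum(w_fixed) - sum(w**2 for w in w_fixed) / sum(w_fixed)
tau2 = max(0.0, (q - df) / c)

w_random = [1 / (se**2 + tau2) for se in ses]
pooled = sum(w * a for w, a in zip(w_random, aucs)) / sum(w_random)
pooled_se = math.sqrt(1 / sum(w_random))
print(f"Pooled AUC = {pooled:.3f} "
      f"(95% CI {pooled - 1.96 * pooled_se:.3f}-{pooled + 1.96 * pooled_se:.3f})")
```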
Results: A total of 1254 articles were retrieved, and 13 eligible studies were included. Predictive outcomes included spinal cord function prognosis, postoperative complications, independent living ability, and walking ability. For spinal cord function prognosis, the area under the curve (AUC) of the random forest algorithm was 0.832, the AUC of the logistic regression algorithm was 0.813 (95% CI 0.805-0.883), the AUC of the decision tree algorithm was 0.747 (95% CI 0.677-0.802), and the AUC of the XGBoost (extreme gradient boosting) algorithm was 0.867. For postoperative complications, the AUC of the random forest algorithm was 0.627 (95% CI 0.441-0.812), the AUC of the logistic regression algorithm was 0.747 (95% CI 0.597-0.896), and the AUC of the decision tree algorithm was 0.688. For independent living ability, the AUC of the classification and regression tree model was 0.813. For walking ability, the model based on the vector machine algorithm was the most effective, with an AUC of 0.780.
Conclusions: The ML models predict SCI outcomes with relative accuracy, particularly in spinal cord function prognosis. They are expected to become important tools for clinicians in assessing the prognosis of patients with SCI, with the XGBoost algorithm showing the best performance. Prediction models are expected to continue improving as larger datasets are used and ML algorithms develop.
Background: Neglected tropical diseases (NTDs) are among the most prevalent diseases in tropical and subtropical regions and comprise 21 different conditions. One-half of these conditions have skin manifestations, known as skin NTDs. The diagnosis of skin NTDs incorporates visual examination of patients, and deep learning (DL)-based diagnostic tools can be used to assist the diagnostic process. The use of advanced DL-based methods, including multimodal data fusion (MMDF) functionality, could be a potential approach to enhance the diagnosis of these diseases. However, little has been done to apply such tools, as reflected by the very few available studies that have implemented MMDF for skin NTDs.
Objective: This article presents a systematic review regarding the use of DL-based MMDF methods for the diagnosis of skin NTDs and related diseases (non-NTD skin diseases), including the ethical risks and potential risk of bias.
Methods: The review was conducted based on the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) method using 6 parameters (research approach followed, disease[s] diagnosed, dataset[s] used, algorithm[s] applied, performance achieved, and future direction[s]).
Results: Initially, 437 articles were collected from 5 major groups of identified sources; 14 articles were selected for the final analysis. Results revealed that, compared with traditional methods, the MMDF methods improved model performance for the diagnosis of skin NTDs and non-NTD skin diseases. Algorithmically, convolutional neural network (CNN)-based models were the predominantly used DL architectures (9/14 studies, 64%), performing feature extraction, feature fusion, and disease classification; these tasks were also carried out with transformer-based methods (1/14, 7%). Furthermore, recurrent neural networks were used in combination with CNN-based feature extractors to fuse multimodal data (1/14, 7%) and with generative models (1/14, 7%). The remaining studies used study-specific algorithms based on transformers (1/14, 7%) and generative models (1/14, 7%).
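As a hedged illustration of the CNN-based fusion approach that most of the reviewed studies describe, a minimal late-fusion model might combine CNN image features with tabular clinical metadata before classification; the architecture, layer sizes, and class count below are illustrative assumptions, not a model drawn from the reviewed studies.

```python
# Minimal sketch of a CNN-based multimodal data fusion (MMDF) classifier:
# an image branch (small CNN) and a tabular metadata branch are fused by
# concatenation before classification. All sizes and the class count are
# illustrative; this is not an architecture from the reviewed studies.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, n_meta_features: int = 8, n_classes: int = 5):
        super().__init__()
        self.image_branch = nn.Sequential(           # CNN feature extractor
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),                             # -> 32-dim image features
        )
        self.meta_branch = nn.Sequential(             # tabular metadata encoder
            nn.Linear(n_meta_features, 16), nn.ReLU(),
        )
        self.classifier = nn.Linear(32 + 16, n_classes)  # fusion by concatenation

    def forward(self, image, metadata):
        fused = torch.cat([self.image_branch(image), self.meta_branch(metadata)], dim=1)
        return self.classifier(fused)

model = LateFusionClassifier()
logits = model(torch.randn(2, 3, 128, 128), torch.randn(2, 8))
print(logits.shape)  # torch.Size([2, 5])
```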
Conclusions: This review suggests that further studies should be conducted on the use of DL-based MMDF methods for skin NTDs, considering model efficiency, data scarcity, algorithm selection and use, fusion strategies for multiple modalities, and the possible adoption of such tools in resource-constrained areas.
Unlabelled: Artificial intelligence (AI) is revolutionizing digital health, driving innovation in care delivery and operational efficiency. Despite its potential, many AI systems fail to meet real-world expectations due to limited evaluation practices that focus narrowly on short-term metrics such as efficiency and technical accuracy. Ignoring factors such as usability, trust, transparency, and adaptability hinders AI adoption, scalability, and long-term impact in health care. This paper emphasizes the importance of embedding scientific evaluation as a core operational layer throughout the AI life cycle. We outline practical guidelines for digital health companies to improve AI integration and evaluation, informed by over 35 years of experience in science, the digital health industry, and AI development. We describe a multistep approach, including stakeholder analysis, real-time monitoring, and iterative improvement, that digital health companies can adopt to ensure robust AI integration. Key recommendations include assessing stakeholder needs, designing AI systems that can check their own work, conducting testing to address usability and biases, and ensuring continuous improvement to keep systems user-centered and adaptable. By integrating these guidelines, digital health companies can improve AI reliability, scalability, and trustworthiness, driving better health care delivery and stakeholder alignment.
Background: Artificial intelligence (AI) and machine learning models are frequently developed in medical research to optimize patient care, yet they remain rarely used in clinical practice.
Objective: This study aims to understand the disconnect between model development and implementation by surveying physicians of all specialties across the United States.
Methods: The present survey was distributed to residency coordinators at Accreditation Council for Graduate Medical Education-accredited residency programs to disseminate among attending physicians and resident physicians affiliated with their academic institution. Respondents were asked to identify and quantify the extent of their training and specialization, as well as the type and location of their practice. Physicians were then asked follow-up questions regarding AI in their practice, including whether its use is permitted, whether they would use it if made available, primary reasons for using or not using AI, elements that would encourage its use, and ethical concerns.
Results: Of the 941 physicians who responded to the survey, 384 (40.8%) were attending physicians and 557 (59.2%) were resident physicians. The majority of the physicians (651/795, 81.9%) indicated that they would adopt AI in clinical practice if given the opportunity. The most cited intended uses for AI were risk stratification, image analysis or segmentation, and disease prognosis. The most common reservations were concerns about clinical errors made by AI and the potential to replicate human biases.
Conclusions: To date, this study comprises the largest and most diverse dataset of physician perspectives on AI. Our results emphasize that most academic physicians in the United States are open to adopting AI in their clinical practice. However, for AI to become clinically relevant, developers and physicians must work synergistically to design models that are accurate, accessible, and intuitive while thoroughly addressing ethical concerns associated with the implementation of AI in medicine.

