Unlabelled: Associations between eHealth literacy and mental health literacy were examined; no significant association was identified between overall eHealth literacy and mental health literacy, and only weak associations between specific skills were recorded. Results are interpreted in light of the difference between perceived ability and actual performance.
Unlabelled: With population aging, an increase in total life expectancy at birth (TLE) should ideally be accompanied by an equal increase in health span (HS), or by an increasing HS/TLE ratio. Hong Kong has one of the longest life expectancies in the world; however, its HS/TLE ratio is declining, such that the absolute number of people with dependencies is increasing. To address this challenge, the World Health Organization proposed the model of integrated care for older people (ICOPE), which combines health and social elements in community care and uses the measurement of intrinsic capacity (IC) as a metric for monitoring performance across countries. The use of technology is essential for achieving wide population coverage in assessing IC, followed by an individually tailored plan of action. This model can be adapted to different health and social care systems in different countries. Hong Kong has an extensive network of community centers, where the basic assessment may be based, followed by further assessments and personalized activities; referral to medical professionals may only be needed in the presence of disease. Conversely, the medical sector may refer patients to the community for activities designed to optimize the various domains of IC. Such a model of care has the potential to address manpower shortages and mitigate inequalities in healthy aging, as well as enable the monitoring of physiological systems in community-dwelling adults using digital biomarkers as a metric of IC.
Background: Multiple-choice questions (MCQs) are essential in medical education for assessing knowledge and clinical reasoning. Traditional MCQ development involves expert reviews and revisions, which can be time-consuming and subject to bias. Large language models (LLMs) have emerged as potential tools for evaluating MCQ accuracy and efficiency. However, direct comparisons of these models in orthopedic MCQ assessments are limited.
Objective: This study compared the performance of ChatGPT and DeepSeek in terms of correctness, response time, and reliability when answering MCQs from an orthopedic examination for medical students.
Methods: This cross-sectional study included 209 orthopedic MCQs from summative assessments during the 2023-2024 academic year. ChatGPT (including the "Reason" function) and DeepSeek (including the "DeepThink" function) were used to identify the correct answers. Correctness and response times were recorded and compared using a χ2 test and Mann-Whitney U test where appropriate. The two LLMs' reliability was assessed using the Cohen κ coefficient. The MCQs incorrectly answered by both models were reviewed by orthopedic faculty to identify ambiguities or content issues.
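A minimal sketch of the comparisons described in this Methods section, written in Python, is shown below. This is not the authors' code; the 0/1 correctness vectors, response times, and variable names are illustrative placeholders.

```python
# Sketch of the statistical comparisons described above (placeholder data, not study data).
from scipy.stats import chi2_contingency, mannwhitneyu
from sklearn.metrics import cohen_kappa_score

chatgpt_correct = [1, 0, 1, 1]   # per-MCQ correctness (1 = correct), placeholder values
deepseek_correct = [1, 0, 0, 1]
chatgpt_times = [8.2, 12.5, 9.1, 11.0]    # response times in seconds, placeholder values
deepseek_times = [30.4, 41.2, 28.7, 37.9]

# Chi-square test on the 2x2 model-by-correctness table
table = [
    [sum(chatgpt_correct), len(chatgpt_correct) - sum(chatgpt_correct)],
    [sum(deepseek_correct), len(deepseek_correct) - sum(deepseek_correct)],
]
chi2, p_correct, _, _ = chi2_contingency(table)

# Mann-Whitney U test on response times
u_stat, p_time = mannwhitneyu(chatgpt_times, deepseek_times)

# Cohen kappa for agreement between the two models' per-question answers
kappa = cohen_kappa_score(chatgpt_correct, deepseek_correct)
print(p_correct, p_time, kappa)
```

Here the Cohen κ is computed on the two models' per-question correctness labels, which is one plausible reading of the agreement analysis described above.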
Results: ChatGPT achieved a correctness rate of 80.4% (168/209), while DeepSeek achieved 74.2% (155/209; P=.04). ChatGPT's Reason function also answered more questions correctly than DeepSeek's DeepThink function (84.7%, 177/209 vs 80.4%, 168/209), although this difference was not statistically significant (P=.12). The average response time for ChatGPT was 10.40 (SD 13.29) seconds, significantly shorter than DeepSeek's 34.42 (SD 25.48) seconds (P<.001). Regarding reliability, ChatGPT demonstrated almost perfect agreement (κ=0.81), whereas DeepSeek showed substantial agreement (κ=0.78). A completely false response was recorded in 7.7% (16/209) of responses for both models.
Conclusions: ChatGPT outperformed DeepSeek in correctness and response time, demonstrating its efficiency in evaluating orthopedic MCQs. Its high reliability suggests potential for integration into medical assessments. However, our results indicate that some MCQs will require revision by instructors to improve their clarity. Further studies are needed to evaluate the role of artificial intelligence in other disciplines and to validate other LLMs.
Background: Despite the transformative potential of artificial intelligence (AI)-based chatbots in medicine, their implementation is hindered by data privacy and security concerns. DeepSeek offers a conceivable solution through its capability for local offline operations. However, as of 2025, it remains unclear whether DeepSeek can achieve an accuracy comparable to that of conventional, cloud-based AI chatbots.
Objective: This study aims to evaluate whether DeepSeek, an AI-based chatbot capable of offline operation, achieves answer accuracy on medical multiple-choice questions (MCQs) comparable to that of leading chatbots (ie, ChatGPT and Gemini) on German medical MCQs, thereby assessing its potential as a privacy-preserving alternative for clinical use.
Methods: A total of 200 interdisciplinary MCQs from the German Progress Test Medicine were administered to ChatGPT (GPT-o3-mini), DeepSeek (DeepSeek-R1), and Gemini (Gemini 2.0 Flash). Accuracy was defined as the proportion of correctly solved questions. Overall differences among the 3 models were tested with the Cochran Q test, while pairwise comparisons were conducted using the McNemar test. Subgroup analyses were performed by medical domain (Fisher exact test) and question length (Wilcoxon rank-sum test). An a priori power analysis indicated a minimum sample size of 195 questions.
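As a rough illustration (not the study's analysis code), the tests named above could be run in Python as follows, assuming a question-by-model matrix of 0/1 correctness scores; all counts and values below are placeholders.

```python
# Sketch of the Cochran Q, McNemar, Fisher exact, and Wilcoxon rank-sum tests (placeholder data).
import numpy as np
from scipy.stats import fisher_exact, ranksums
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

# Rows = questions, columns = ChatGPT, DeepSeek, Gemini (1 = correct, 0 = incorrect)
scores = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 1],
    [0, 0, 1],
])

# Overall difference among the 3 models
q_result = cochrans_q(scores)

# Pairwise comparison (e.g., ChatGPT vs DeepSeek) via McNemar on the paired 2x2 table
a, b = scores[:, 0], scores[:, 1]
table = [[np.sum((a == 1) & (b == 1)), np.sum((a == 1) & (b == 0))],
         [np.sum((a == 0) & (b == 1)), np.sum((a == 0) & (b == 0))]]
pairwise = mcnemar(table, exact=True)

# Subgroup example: accuracy by domain, compared with the Fisher exact test
_, p_domain = fisher_exact([[90, 10], [95, 5]])

# Question length (e.g., word counts) for incorrect vs correct responses (Wilcoxon rank-sum)
len_incorrect, len_correct = [120, 95, 140], [60, 75, 80, 70]
_, p_length = ranksums(len_incorrect, len_correct)
print(q_result, pairwise.pvalue, p_domain, p_length)
```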
Results: All 3 chatbots surpassed the conventional passing threshold of 60%, with accuracies of 96% (192/200) for DeepSeek, 94% (188/200) for Gemini, and 92.5% (185/200) for ChatGPT. Neither the overall difference among models (P=.10) nor any of the pairwise comparisons was statistically significant. However, incorrect responses were significantly associated with longer question length for DeepSeek (P=.049) and ChatGPT (P=.04) but not for Gemini. No significant differences in performance were observed across clinical versus preclinical domains or medical specialties (all P>.05).
Conclusions: Overall, DeepSeek demonstrates outstanding performance on German medical MCQs comparable to the widely used chatbots ChatGPT and Gemini. Similar to ChatGPT, DeepSeek's performance declined with increasing question length, highlighting verbosity as a persistent challenge for large language models. While DeepSeek's offline capability and lower operational costs are advantageous, its safe and reliable application in clinical contexts requires further investigation.
Background: Parenting skills programs are the primary intervention for conduct disorders in children. The Pause app enhances these programs by providing digital microinterventions that reinforce learning between sessions and after program completion. The potential of artificial intelligence (AI) in this context remains untapped. Hackathons have proven effective for health care innovation and can facilitate collaborative development in this space.
Objective: We aimed to rapidly build AI-powered features in the Pause app to enhance parenting skills.
Methods: We undertook a 1-day hackathon that included an ideation phase drawing on the Design Council's double diamond framework and a development phase using microsprints based on agile and scrum approaches. The interdisciplinary participants included medical professionals, developers, and product managers.
Results: Participants identified 3 core problems: generating age-appropriate distractions, receiving feedback on parenting efforts, and effectively using the journal function. During the solution phase, a wide range of options was explored, resulting in 3 key solutions: AI-assisted idea generation, a tool for summarizing parenting interactions, and a weekly journal roundup. During the development phase, participants completed 4 microsprints. Teams focused on 3 workstreams: building a "weekly roundup" module, creating an AI-based distraction generator, and developing a summarizer for active play sessions. These prototypes were integrated into the preproduction environment, with each workstream producing a functional component. Participant feedback (n=4) was unanimously positive, with all participants rating the event as "excellent" and highlighting the value of in-person collaboration.
Conclusions: This 1-day hackathon used the double diamond approach to develop AI-powered features for parenting programs. Three solutions were explored across workstreams, resulting in 2 fully functioning app components and 1 near-functioning component. The rapid problem-solving approach mirrors other health technology hackathons and highlights the untapped potential of AI in digital parenting support, surpassing traditional e-learning or video-based methods. This work suggests broader applications of AI-driven coaching in fields like social care. Despite a small team, the hackathon was focused and productive, generating relevant solutions based on prior engagement with parents and practitioners. Future research will assess the impact of the app's AI-powered features on parenting outcomes.
Background: Assessment of medical information provided by artificial intelligence (AI) chatbots like ChatGPT and Google's Gemini and comparison with international guidelines is a burgeoning area of research. These AI models are increasingly being considered for their potential to support clinical decision-making and patient education. However, their accuracy and reliability in delivering medical information that aligns with established guidelines remain under scrutiny.
Objective: This study aims to assess the accuracy of medical information generated by ChatGPT and Gemini and its alignment with international guidelines for sepsis management.
Methods: ChatGPT and Gemini were asked 18 questions about the Surviving Sepsis Campaign guidelines, and the responses were evaluated by 7 independent intensive care physicians. The responses generated were scored as follows: 3=correct, complete, and accurate; 2=correct but incomplete or inaccurate; and 1=incorrect. This scoring system was chosen to provide a clear and straightforward assessment of the accuracy and completeness of the responses. The Fleiss κ test was used to assess the agreement between evaluators, and the Mann-Whitney U test was used to test for the significance of differences between the correct responses generated by ChatGPT and Gemini.
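A minimal Python sketch of this analysis is shown below. This is not the authors' code: the ratings are randomly generated placeholders, and aggregating each question to its median rating for the Mann-Whitney U comparison is an assumption.

```python
# Sketch of the Fleiss kappa and Mann-Whitney U analyses described above (placeholder data).
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = 18 questions, columns = 7 physician raters, values = score (1, 2, or 3); simulated here
chatgpt_ratings = np.random.default_rng(0).integers(1, 4, size=(18, 7))
gemini_ratings = np.random.default_rng(1).integers(1, 4, size=(18, 7))

# Fleiss kappa: convert rater-level scores to per-question category counts first
counts_c, _ = aggregate_raters(chatgpt_ratings)
kappa_chatgpt = fleiss_kappa(counts_c)

counts_g, _ = aggregate_raters(gemini_ratings)
kappa_gemini = fleiss_kappa(counts_g)

# Mann-Whitney U test on per-question scores (here, the median rating per question; an assumption)
chatgpt_scores = np.median(chatgpt_ratings, axis=1)
gemini_scores = np.median(gemini_ratings, axis=1)
_, p_value = mannwhitneyu(chatgpt_scores, gemini_scores)
print(kappa_chatgpt, kappa_gemini, p_value)
```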
Results: ChatGPT provided 5 (28%) perfect responses, 12 (67%) nearly perfect responses, and 1 (5%) low-quality response, with substantial agreement among the evaluators (Fleiss κ=0.656). Gemini, on the other hand, provided 3 (17%) perfect responses, 14 (78%) nearly perfect responses, and 1 (5%) low-quality response, with moderate agreement among the evaluators (Fleiss κ=0.582). The Mann-Whitney U test revealed no statistically significant difference between the two platforms (P=.48).
Conclusions: ChatGPT and Gemini both demonstrated potential for generating medical information. Despite their current limitations, both showed promise as complementary tools in patient education and clinical decision-making. The medical information generated by ChatGPT and Gemini still needs ongoing evaluation regarding its accuracy and alignment with international guidelines in different medical domains, particularly in the sepsis field.

