Artificial intelligence in reproductive endocrinology: an in-depth longitudinal analysis of ChatGPTv4's month-by-month interpretation and adherence to clinical guidelines for diminished ovarian reserve.

IF 2.9 | Zone 3 (Medicine) | JCR Q2 (Medicine) | Endocrine | Pub Date: 2024-12-01 | Epub Date: 2024-09-28 | DOI: 10.1007/s12020-024-04031-8

Tugba Gurbuz, Oya Gokmen, Belgin Devranoglu, Arzu Yurci, Asena Ayar Madenli

Journal: Endocrine, pp. 1171-1177. Publication type: Journal Article.

Citations: 0

Abstract

Objective: To quantitatively assess the performance of ChatGPTv4, an Artificial Intelligence Language Model, in adhering to clinical guidelines for Diminished Ovarian Reserve (DOR) over two months, evaluating the model's consistency in providing guideline-based responses.

Design: A longitudinal study design was employed to evaluate ChatGPTv4's response accuracy and completeness using a structured questionnaire at baseline and at a two-month follow-up.

Setting: ChatGPTv4 was tasked with interpreting DOR questionnaires based on standardized clinical guidelines.

Participants: The study did not involve human participants; the questionnaire was exclusively administered to the ChatGPT model to generate responses about DOR.

Methods: A guideline-based questionnaire comprising 176 open-ended, 166 multiple-choice, and 153 true/false questions was deployed to rigorously assess ChatGPTv4's ability to provide accurate medical advice aligned with current DOR clinical guidelines. AI-generated responses were rated on a 6-point Likert scale for accuracy and a 3-point scale for completeness. The two-phase design assessed the stability and consistency of AI-generated answers over two months.
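The questionnaire structure described in the Methods can be sketched as a small data model. This is a minimal illustration, not the authors' instrument: the `Rating` class, its field names, and the phase labels are hypothetical, and the assumption that the scales run from 1 to 6 (accuracy) and 1 to 3 (completeness) is ours, since the abstract gives only the number of scale points.

```python
from dataclasses import dataclass

# Question counts as stated in the Methods paragraph.
QUESTION_COUNTS = {"open_ended": 176, "multiple_choice": 166, "true_false": 153}

# Assumed scale bounds (the abstract states only "6-point" and "3-point").
ACCURACY_RANGE = (1, 6)
COMPLETENESS_RANGE = (1, 3)
PHASES = ("baseline", "follow_up")  # hypothetical labels for the two phases


@dataclass(frozen=True)
class Rating:
    """One rater score for one AI-generated answer (hypothetical record)."""
    question_id: int
    phase: str
    accuracy: int
    completeness: int

    def __post_init__(self):
        # Reject records that fall outside the assumed scales.
        if self.phase not in PHASES:
            raise ValueError(f"unknown phase: {self.phase!r}")
        if not ACCURACY_RANGE[0] <= self.accuracy <= ACCURACY_RANGE[1]:
            raise ValueError(f"accuracy out of range: {self.accuracy}")
        if not COMPLETENESS_RANGE[0] <= self.completeness <= COMPLETENESS_RANGE[1]:
            raise ValueError(f"completeness out of range: {self.completeness}")


# Total number of questions across the three formats.
total_questions = sum(QUESTION_COUNTS.values())
print(total_questions)
```

Validating at construction time keeps out-of-scale scores from silently entering later averages.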

Results: ChatGPTv4 achieved near-perfect scores across all question types, with true/false questions consistently answered with 100% accuracy. In multiple-choice queries, accuracy improved from 98.2% to 100% at the two-month follow-up. Open-ended responses improved significantly, with mean accuracy scores increasing from 5.38 ± 0.71 to 5.74 ± 0.51 (max: 6.0) and completeness scores from 2.57 ± 0.52 to 2.85 ± 0.36 (max: 3.0). These improvements were statistically significant (p < 0.001), and initial and follow-up scores were positively correlated for both accuracy (r = 0.597) and completeness (r = 0.381).
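To illustrate the kind of paired comparison reported in the Results, the sketch below computes phase means, a Pearson correlation, and a paired t statistic for baseline and follow-up scores. The abstract does not name the statistical test used, so the paired t statistic is an assumption chosen for illustration; a rank-based test may be more appropriate for ordinal Likert data. The eight score pairs are invented placeholders, not study data, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def paired_t(x, y):
    """Paired t statistic for follow-up (y) minus baseline (x)."""
    diffs = [b - a for a, b in zip(x, y)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Invented 6-point accuracy ratings for eight questions (NOT study data).
baseline = [5, 6, 4, 5, 6, 5, 4, 6]
follow_up = [6, 6, 5, 5, 6, 6, 5, 6]

summary = {
    "baseline_mean": mean(baseline),
    "follow_up_mean": mean(follow_up),
    "r": pearson_r(baseline, follow_up),
    "t": paired_t(baseline, follow_up),
}
print(summary)
```

With real data, the t statistic would be compared against a t distribution with n - 1 degrees of freedom to obtain a p-value. That step is omitted here to keep the sketch within the standard library.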

Limitations: The study was limited by the reliance on a controlled, albeit simulated, setting that may not perfectly mirror real-world clinical interactions.

Conclusion: ChatGPTv4 demonstrated exceptional and improving accuracy and completeness in handling DOR-related guideline queries over the studied period. These findings highlight ChatGPTv4's potential as a reliable, adaptable AI tool in reproductive endocrinology, capable of augmenting clinical decision-making and guideline development.


Source journal

Endocrine (Medicine: Endocrinology and Metabolism)

CiteScore

6.40

Self-citation rate

5.40%

Number of articles published

About the journal: Well-established as a major journal in today’s rapidly advancing experimental and clinical research areas, Endocrine publishes original articles devoted to basic (including molecular, cellular and physiological studies), translational and clinical research in all the different fields of endocrinology and metabolism. Articles will be accepted based on peer review, priority, and editorial decision. Invited reviews, mini-reviews and viewpoints on relevant pathophysiological and clinical topics, as well as Editorials on articles appearing in the Journal, are published. Unsolicited Editorials will be evaluated by the editorial team. Outcomes of scientific meetings, as well as guidelines and position statements, may be submitted. The Journal also considers special feature articles in the field of endocrine genetics and epigenetics, as well as articles devoted to novel methods and techniques in endocrinology. Endocrine covers controversial, clinical endocrine issues. Meta-analyses on endocrine and metabolic topics are also accepted. Descriptions of single clinical cases and/or small patient studies are not published unless of exceptional interest. However, reports of novel imaging studies and endocrine side effects in single patients may be considered. Research letters and letters to the editor related or unrelated to recently published articles can be submitted. Endocrine covers leading topics in endocrinology such as neuroendocrinology, pituitary and hypothalamic peptides, thyroid physiological and clinical aspects, bone and mineral metabolism and osteoporosis, obesity, lipid and energy metabolism and food intake control, insulin, Type 1 and Type 2 diabetes, hormones of male and female reproduction, adrenal diseases, pediatric and geriatric endocrinology, endocrine hypertension and endocrine oncology.