Evaluating large language models for selection of statistical test for research: A pilot study

Himel Mondal, Shaikat Mondal, Prabhat Mittal
{"title":"评估大型语言模型,为研究选择统计测试:试点研究","authors":"Himel Mondal, Shaikat Mondal, Prabhat Mittal","doi":"10.4103/picr.picr_275_23","DOIUrl":null,"url":null,"abstract":"\n \n \n In contemporary research, selecting the appropriate statistical test is a critical and often challenging step. The emergence of large language models (LLMs) has offered a promising avenue for automating this process, potentially enhancing the efficiency and accuracy of statistical test selection.\n \n \n \n This study aimed to assess the capability of freely available LLMs – OpenAI’s ChatGPT3.5, Google Bard, Microsoft Bing Chat, and Perplexity in recommending suitable statistical tests for research, comparing their recommendations with those made by human experts.\n \n \n \n A total of 27 case vignettes were prepared for common research models with a question asking suitable statistical tests. The cases were formulated from previously published literature and reviewed by a human expert for their accuracy of information. The LLMs were asked the question with the case vignettes and the process was repeated with paraphrased cases. The concordance (if exactly matching the answer key) and acceptance (when not exactly matching with answer key, but can be considered suitable) were evaluated between LLM’s recommendations and those of human experts.\n \n \n \n Among the 27 case vignettes, ChatGPT3.5-suggested statistical test had 85.19% concordance and 100% acceptance; Bard experiment had 77.78% concordance and 96.3% acceptance; Microsoft Bing Chat had 96.3% concordance and 100% acceptance; and Perplexity had 85.19% concordance and 100% acceptance. The intra-class correction coefficient of average measure among the responses of LLMs was 0.728 (95% confidence interval [CI]: 0.51–0.86), P < 0.0001. The test–retest reliability of ChatGPT was r = 0.71 (95% CI: 0.44–0.86), P < 0.0001, Bard was r = −0.22 (95% CI: −0.56–0.18), P = 0.26, Bing was r = −0.06 (95% CI: −0.44–0.33), P = 0.73, and Perplexity was r = 0.52 (95% CI: 0.16–0.75), P = 0.0059.\n \n \n \n The LLMs, namely, ChatGPT, Google Bard, Microsoft Bing, and Perplexity all showed >75% concordance in suggesting statistical tests for research case vignettes with all having acceptance of >95%. The LLMs had a moderate level of agreement among them. While not a complete replacement for human expertise, these models can serve as effective decision support systems, especially in scenarios where rapid test selection is essential.\n","PeriodicalId":20015,"journal":{"name":"Perspectives in Clinical Research","volume":"83 2","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Evaluating large language models for selection of statistical test for research: A pilot study\",\"authors\":\"Himel Mondal, Shaikat Mondal, Prabhat Mittal\",\"doi\":\"10.4103/picr.picr_275_23\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"\\n \\n \\n In contemporary research, selecting the appropriate statistical test is a critical and often challenging step. 
The emergence of large language models (LLMs) has offered a promising avenue for automating this process, potentially enhancing the efficiency and accuracy of statistical test selection.\\n \\n \\n \\n This study aimed to assess the capability of freely available LLMs – OpenAI’s ChatGPT3.5, Google Bard, Microsoft Bing Chat, and Perplexity in recommending suitable statistical tests for research, comparing their recommendations with those made by human experts.\\n \\n \\n \\n A total of 27 case vignettes were prepared for common research models with a question asking suitable statistical tests. The cases were formulated from previously published literature and reviewed by a human expert for their accuracy of information. The LLMs were asked the question with the case vignettes and the process was repeated with paraphrased cases. The concordance (if exactly matching the answer key) and acceptance (when not exactly matching with answer key, but can be considered suitable) were evaluated between LLM’s recommendations and those of human experts.\\n \\n \\n \\n Among the 27 case vignettes, ChatGPT3.5-suggested statistical test had 85.19% concordance and 100% acceptance; Bard experiment had 77.78% concordance and 96.3% acceptance; Microsoft Bing Chat had 96.3% concordance and 100% acceptance; and Perplexity had 85.19% concordance and 100% acceptance. The intra-class correction coefficient of average measure among the responses of LLMs was 0.728 (95% confidence interval [CI]: 0.51–0.86), P < 0.0001. The test–retest reliability of ChatGPT was r = 0.71 (95% CI: 0.44–0.86), P < 0.0001, Bard was r = −0.22 (95% CI: −0.56–0.18), P = 0.26, Bing was r = −0.06 (95% CI: −0.44–0.33), P = 0.73, and Perplexity was r = 0.52 (95% CI: 0.16–0.75), P = 0.0059.\\n \\n \\n \\n The LLMs, namely, ChatGPT, Google Bard, Microsoft Bing, and Perplexity all showed >75% concordance in suggesting statistical tests for research case vignettes with all having acceptance of >95%. The LLMs had a moderate level of agreement among them. While not a complete replacement for human expertise, these models can serve as effective decision support systems, especially in scenarios where rapid test selection is essential.\\n\",\"PeriodicalId\":20015,\"journal\":{\"name\":\"Perspectives in Clinical Research\",\"volume\":\"83 2\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-04-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Perspectives in Clinical Research\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.4103/picr.picr_275_23\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"Medicine\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Perspectives in Clinical Research","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.4103/picr.picr_275_23","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"Medicine","Score":null,"Total":0}
引用次数: 0

Abstract

In contemporary research, selecting the appropriate statistical test is a critical and often challenging step. The emergence of large language models (LLMs) offers a promising avenue for automating this process, potentially improving the efficiency and accuracy of statistical test selection.

This study aimed to assess the capability of freely available LLMs, namely OpenAI's ChatGPT 3.5, Google Bard, Microsoft Bing Chat, and Perplexity, in recommending suitable statistical tests for research, comparing their recommendations with those made by human experts.

A total of 27 case vignettes covering common research designs were prepared, each ending with a question asking for the suitable statistical test. The cases were formulated from previously published literature and reviewed by a human expert for accuracy of information. Each LLM was asked the question with the case vignettes, and the process was repeated with paraphrased cases. Each LLM recommendation was evaluated against the human experts' answers for concordance (an exact match with the answer key) and acceptance (not an exact match with the answer key, but still considered suitable).

Among the 27 case vignettes, the statistical tests suggested by ChatGPT 3.5 had 85.19% concordance and 100% acceptance; Google Bard had 77.78% concordance and 96.3% acceptance; Microsoft Bing Chat had 96.3% concordance and 100% acceptance; and Perplexity had 85.19% concordance and 100% acceptance. The intra-class correlation coefficient (average measures) among the LLM responses was 0.728 (95% confidence interval [CI]: 0.51–0.86), P < 0.0001. The test–retest reliability was r = 0.71 (95% CI: 0.44–0.86), P < 0.0001 for ChatGPT; r = −0.22 (95% CI: −0.56 to 0.18), P = 0.26 for Bard; r = −0.06 (95% CI: −0.44 to 0.33), P = 0.73 for Bing; and r = 0.52 (95% CI: 0.16–0.75), P = 0.0059 for Perplexity.

The LLMs, namely ChatGPT, Google Bard, Microsoft Bing, and Perplexity, all showed >75% concordance in suggesting statistical tests for the research case vignettes, with acceptance >95% for all. The LLMs showed a moderate level of agreement among themselves. While not a complete replacement for human expertise, these models can serve as effective decision support systems, especially in scenarios where rapid test selection is essential.
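As a rough illustration of the metrics reported above, the sketch below is not taken from the paper: it assumes each LLM answer to a vignette is graded 2 for an exact match with the answer key, 1 for an acceptable but non-exact answer, and 0 otherwise, and the grade vectors are hypothetical, chosen only to reproduce the kind of figures reported for one model (85.19% concordance, 100% acceptance). The test–retest reliability is computed as a Pearson correlation, which matches the r values quoted in the abstract.

```python
# Minimal illustrative sketch (not the authors' code) of the evaluation metrics
# described in the abstract. Assumption: each LLM answer to a vignette is graded
# 2 (exact match with the answer key), 1 (not exact but acceptable), or 0 (unsuitable).
from scipy.stats import pearsonr

def concordance_and_acceptance(grades):
    """Return (concordance %, acceptance %) for a list of 0/1/2 grades."""
    n = len(grades)
    concordance = 100 * sum(g == 2 for g in grades) / n   # exact matches only
    acceptance = 100 * sum(g >= 1 for g in grades) / n    # exact or acceptable
    return concordance, acceptance

# Hypothetical grade vectors for one model over the 27 vignettes:
# run1 = original vignettes, run2 = paraphrased vignettes (illustration only).
run1 = [2] * 23 + [1] * 4
run2 = [2] * 21 + [1] * 5 + [2] * 1

c, a = concordance_and_acceptance(run1)
print(f"concordance = {c:.2f}%, acceptance = {a:.2f}%")   # 85.19%, 100.00%

# Test-retest reliability: Pearson correlation between the two runs.
r, p = pearsonr(run1, run2)
print(f"test-retest r = {r:.2f}, p = {p:.4f}")

# The inter-model agreement reported in the abstract (ICC, average measures) could be
# obtained from a vignette-by-model grade table, e.g. with pingouin.intraclass_corr.
```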
Source journal: Perspectives in Clinical Research (Medicine, all)
CiteScore: 2.90
Self-citation rate: 0.00%
Articles per year: 41
Review time: 36 weeks
About the journal: This peer review quarterly journal is positioned to build a learning clinical research community in India. This scientific journal will have a broad coverage of topics across clinical research disciplines including clinical research methodology, research ethics, clinical data management, training, data management, biostatistics, regulatory and will include original articles, reviews, news and views, perspectives, and other interesting sections. PICR will offer all clinical research stakeholders in India – academicians, ethics committees, regulators, and industry professionals – a forum for exchange of ideas, information and opinions.