Evaluating large language models for selection of statistical test for research: A pilot study

Himel Mondal, Shaikat Mondal, Prabhat Mittal
{"title":"评估大型语言模型,为研究选择统计测试:试点研究","authors":"Himel Mondal, Shaikat Mondal, Prabhat Mittal","doi":"10.4103/picr.picr_275_23","DOIUrl":null,"url":null,"abstract":"\n \n \n In contemporary research, selecting the appropriate statistical test is a critical and often challenging step. The emergence of large language models (LLMs) has offered a promising avenue for automating this process, potentially enhancing the efficiency and accuracy of statistical test selection.\n \n \n \n This study aimed to assess the capability of freely available LLMs – OpenAI’s ChatGPT3.5, Google Bard, Microsoft Bing Chat, and Perplexity in recommending suitable statistical tests for research, comparing their recommendations with those made by human experts.\n \n \n \n A total of 27 case vignettes were prepared for common research models with a question asking suitable statistical tests. The cases were formulated from previously published literature and reviewed by a human expert for their accuracy of information. The LLMs were asked the question with the case vignettes and the process was repeated with paraphrased cases. The concordance (if exactly matching the answer key) and acceptance (when not exactly matching with answer key, but can be considered suitable) were evaluated between LLM’s recommendations and those of human experts.\n \n \n \n Among the 27 case vignettes, ChatGPT3.5-suggested statistical test had 85.19% concordance and 100% acceptance; Bard experiment had 77.78% concordance and 96.3% acceptance; Microsoft Bing Chat had 96.3% concordance and 100% acceptance; and Perplexity had 85.19% concordance and 100% acceptance. The intra-class correction coefficient of average measure among the responses of LLMs was 0.728 (95% confidence interval [CI]: 0.51–0.86), P < 0.0001. The test–retest reliability of ChatGPT was r = 0.71 (95% CI: 0.44–0.86), P < 0.0001, Bard was r = −0.22 (95% CI: −0.56–0.18), P = 0.26, Bing was r = −0.06 (95% CI: −0.44–0.33), P = 0.73, and Perplexity was r = 0.52 (95% CI: 0.16–0.75), P = 0.0059.\n \n \n \n The LLMs, namely, ChatGPT, Google Bard, Microsoft Bing, and Perplexity all showed >75% concordance in suggesting statistical tests for research case vignettes with all having acceptance of >95%. The LLMs had a moderate level of agreement among them. While not a complete replacement for human expertise, these models can serve as effective decision support systems, especially in scenarios where rapid test selection is essential.\n","PeriodicalId":20015,"journal":{"name":"Perspectives in Clinical Research","volume":"83 2","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Evaluating large language models for selection of statistical test for research: A pilot study\",\"authors\":\"Himel Mondal, Shaikat Mondal, Prabhat Mittal\",\"doi\":\"10.4103/picr.picr_275_23\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"\\n \\n \\n In contemporary research, selecting the appropriate statistical test is a critical and often challenging step. 
The emergence of large language models (LLMs) has offered a promising avenue for automating this process, potentially enhancing the efficiency and accuracy of statistical test selection.\\n \\n \\n \\n This study aimed to assess the capability of freely available LLMs – OpenAI’s ChatGPT3.5, Google Bard, Microsoft Bing Chat, and Perplexity in recommending suitable statistical tests for research, comparing their recommendations with those made by human experts.\\n \\n \\n \\n A total of 27 case vignettes were prepared for common research models with a question asking suitable statistical tests. The cases were formulated from previously published literature and reviewed by a human expert for their accuracy of information. The LLMs were asked the question with the case vignettes and the process was repeated with paraphrased cases. The concordance (if exactly matching the answer key) and acceptance (when not exactly matching with answer key, but can be considered suitable) were evaluated between LLM’s recommendations and those of human experts.\\n \\n \\n \\n Among the 27 case vignettes, ChatGPT3.5-suggested statistical test had 85.19% concordance and 100% acceptance; Bard experiment had 77.78% concordance and 96.3% acceptance; Microsoft Bing Chat had 96.3% concordance and 100% acceptance; and Perplexity had 85.19% concordance and 100% acceptance. The intra-class correction coefficient of average measure among the responses of LLMs was 0.728 (95% confidence interval [CI]: 0.51–0.86), P < 0.0001. The test–retest reliability of ChatGPT was r = 0.71 (95% CI: 0.44–0.86), P < 0.0001, Bard was r = −0.22 (95% CI: −0.56–0.18), P = 0.26, Bing was r = −0.06 (95% CI: −0.44–0.33), P = 0.73, and Perplexity was r = 0.52 (95% CI: 0.16–0.75), P = 0.0059.\\n \\n \\n \\n The LLMs, namely, ChatGPT, Google Bard, Microsoft Bing, and Perplexity all showed >75% concordance in suggesting statistical tests for research case vignettes with all having acceptance of >95%. The LLMs had a moderate level of agreement among them. While not a complete replacement for human expertise, these models can serve as effective decision support systems, especially in scenarios where rapid test selection is essential.\\n\",\"PeriodicalId\":20015,\"journal\":{\"name\":\"Perspectives in Clinical Research\",\"volume\":\"83 2\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-04-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Perspectives in Clinical Research\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.4103/picr.picr_275_23\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"Medicine\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Perspectives in Clinical Research","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.4103/picr.picr_275_23","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"Medicine","Score":null,"Total":0}
引用次数: 0

Abstract

In contemporary research, selecting the appropriate statistical test is a critical and often challenging step. The emergence of large language models (LLMs) offers a promising avenue for automating this process, potentially improving the efficiency and accuracy of statistical test selection.

This study aimed to assess the capability of freely available LLMs, namely OpenAI's ChatGPT 3.5, Google Bard, Microsoft Bing Chat, and Perplexity, in recommending suitable statistical tests for research, comparing their recommendations with those made by human experts.

A total of 27 case vignettes covering common research designs were prepared, each ending with a question asking for the suitable statistical test. The cases were formulated from previously published literature and reviewed by a human expert for accuracy of information. Each LLM was asked the question with the case vignettes, and the process was repeated with paraphrased cases. Each LLM recommendation was evaluated against the human experts' answers for concordance (an exact match with the answer key) and acceptance (not an exact match with the answer key, but still considered suitable).

Among the 27 case vignettes, the statistical tests suggested by ChatGPT 3.5 had 85.19% concordance and 100% acceptance; Google Bard had 77.78% concordance and 96.3% acceptance; Microsoft Bing Chat had 96.3% concordance and 100% acceptance; and Perplexity had 85.19% concordance and 100% acceptance. The intra-class correlation coefficient (average measures) among the LLM responses was 0.728 (95% confidence interval [CI]: 0.51–0.86), P < 0.0001. The test–retest reliability was r = 0.71 (95% CI: 0.44–0.86), P < 0.0001 for ChatGPT; r = −0.22 (95% CI: −0.56 to 0.18), P = 0.26 for Bard; r = −0.06 (95% CI: −0.44 to 0.33), P = 0.73 for Bing; and r = 0.52 (95% CI: 0.16–0.75), P = 0.0059 for Perplexity.

The LLMs, namely ChatGPT, Google Bard, Microsoft Bing, and Perplexity, all showed >75% concordance in suggesting statistical tests for the research case vignettes, with acceptance >95% for all. The LLMs showed a moderate level of agreement among themselves. While not a complete replacement for human expertise, these models can serve as effective decision support systems, especially in scenarios where rapid test selection is essential.
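As a rough illustration of the metrics reported above, the sketch below is not taken from the paper: it assumes each LLM answer to a vignette is graded 2 for an exact match with the answer key, 1 for an acceptable but non-exact answer, and 0 otherwise, and the grade vectors are hypothetical, chosen only to reproduce the kind of figures reported for one model (85.19% concordance, 100% acceptance). The test–retest reliability is computed as a Pearson correlation, which matches the r values quoted in the abstract.

```python
# Minimal illustrative sketch (not the authors' code) of the evaluation metrics
# described in the abstract. Assumption: each LLM answer to a vignette is graded
# 2 (exact match with the answer key), 1 (not exact but acceptable), or 0 (unsuitable).
from scipy.stats import pearsonr

def concordance_and_acceptance(grades):
    """Return (concordance %, acceptance %) for a list of 0/1/2 grades."""
    n = len(grades)
    concordance = 100 * sum(g == 2 for g in grades) / n   # exact matches only
    acceptance = 100 * sum(g >= 1 for g in grades) / n    # exact or acceptable
    return concordance, acceptance

# Hypothetical grade vectors for one model over the 27 vignettes:
# run1 = original vignettes, run2 = paraphrased vignettes (illustration only).
run1 = [2] * 23 + [1] * 4
run2 = [2] * 21 + [1] * 5 + [2] * 1

c, a = concordance_and_acceptance(run1)
print(f"concordance = {c:.2f}%, acceptance = {a:.2f}%")   # 85.19%, 100.00%

# Test-retest reliability: Pearson correlation between the two runs.
r, p = pearsonr(run1, run2)
print(f"test-retest r = {r:.2f}, p = {p:.4f}")

# The inter-model agreement reported in the abstract (ICC, average measures) could be
# obtained from a vignette-by-model grade table, e.g. with pingouin.intraclass_corr.
```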
Source journal: Perspectives in Clinical Research (Medicine, all)
CiteScore: 2.90
Self-citation rate: 0.00%
Articles per year: 41
Review time: 36 weeks
About the journal: This peer review quarterly journal is positioned to build a learning clinical research community in India. This scientific journal will have a broad coverage of topics across clinical research disciplines including clinical research methodology, research ethics, clinical data management, training, data management, biostatistics, regulatory and will include original articles, reviews, news and views, perspectives, and other interesting sections. PICR will offer all clinical research stakeholders in India – academicians, ethics committees, regulators, and industry professionals – a forum for exchange of ideas, information and opinions.