Can AI Replace Human Subjects? A Large-Scale Replication of Psychological Experiments with LLMs

arXiv - ECON - General Economics, Pub Date: 2024-08-29, DOI: arxiv-2409.00128

Ziyan Cui, Ning Li, Huaikang Zhou

Citations: 0

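Abstract

Artificial Intelligence (AI) is increasingly being integrated into scientific research, particularly in the social sciences, where understanding human behavior is critical. Large Language Models (LLMs) like GPT-4 have shown promise in replicating human-like responses in various psychological experiments. However, the extent to which LLMs can effectively replace human subjects across diverse experimental contexts remains unclear. Here, we conduct a large-scale study replicating 154 psychological experiments from top social science journals, covering 618 main effects and 138 interaction effects, using GPT-4 as a simulated participant. We find that GPT-4 successfully replicates 76.0 percent of main effects and 47.0 percent of interaction effects observed in the original studies, closely mirroring human responses in both direction and significance. However, only 19.44 percent of GPT-4's replicated confidence intervals contain the original effect sizes, with the majority of replicated effect sizes exceeding the 95 percent confidence interval of the original studies. Additionally, there is a 71.6 percent rate of unexpected significant results where the original studies reported null findings, suggesting potential overestimation or false positives. Our results demonstrate the potential of LLMs as powerful tools in psychological research but also emphasize the need for caution in interpreting AI-driven findings. While LLMs can complement human studies, they cannot yet fully replace the nuanced insights provided by human subjects.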
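The abstract reports three summary metrics: the share of original effects replicated in direction and significance, the share of replicated confidence intervals that contain the original effect size, and the rate of significant replications where the original result was null. The following is a minimal sketch of how such metrics could be computed from per-effect records. The `Effect` schema, the 0.05 threshold, and the five toy records are assumptions made for illustration; they are not the authors' data, code, or exact criteria.

```python
from dataclasses import dataclass
from typing import List, Tuple

ALPHA = 0.05  # assumed significance threshold (not stated in the abstract)


@dataclass
class Effect:
    """One original effect and its LLM replication (hypothetical schema)."""
    orig_estimate: float         # effect size reported in the original study
    orig_p: float                # p-value reported in the original study
    rep_estimate: float          # effect size from the LLM replication
    rep_p: float                 # p-value from the LLM replication
    rep_ci: Tuple[float, float]  # 95% CI of the replicated effect size


def replication_rate(effects: List[Effect]) -> float:
    """Share of originally significant effects that are again significant
    and have the same sign in the replication."""
    sig = [e for e in effects if e.orig_p < ALPHA]
    hits = [
        e for e in sig
        if e.rep_p < ALPHA and (e.orig_estimate > 0) == (e.rep_estimate > 0)
    ]
    return len(hits) / len(sig) if sig else float("nan")


def ci_coverage(effects: List[Effect]) -> float:
    """Share of replicated CIs that contain the original effect size."""
    inside = [e for e in effects if e.rep_ci[0] <= e.orig_estimate <= e.rep_ci[1]]
    return len(inside) / len(effects) if effects else float("nan")


def unexpected_significance_rate(effects: List[Effect]) -> float:
    """Share of originally null effects that turn out significant in the replication."""
    null = [e for e in effects if e.orig_p >= ALPHA]
    flips = [e for e in null if e.rep_p < ALPHA]
    return len(flips) / len(null) if null else float("nan")


# Toy records, invented purely to exercise the functions.
effects = [
    Effect(0.30, 0.01, 0.55, 0.001, (0.40, 0.70)),   # replicates, but CI misses 0.30
    Effect(0.20, 0.03, 0.25, 0.02, (0.05, 0.45)),    # replicates, CI contains 0.20
    Effect(-0.15, 0.04, 0.10, 0.30, (-0.10, 0.30)),  # fails to replicate
    Effect(0.02, 0.60, 0.35, 0.004, (0.15, 0.55)),   # null originally, significant now
    Effect(0.05, 0.45, 0.04, 0.50, (-0.12, 0.20)),   # null in both
]

print(replication_rate(effects))
print(ci_coverage(effects))
print(unexpected_significance_rate(effects))
```

Keeping the three metrics as separate functions makes the pattern described in the abstract easy to see: an effect can count as replicated by direction and significance while its confidence interval still excludes the original effect size, which is how a high replication rate and low interval coverage can coexist.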
