Large language models reduce public knowledge sharing on online Q&A platforms.

PNAS Nexus (Multidisciplinary Sciences, Q2, IF 2.2) · Published: 2024-09-11 · eCollection: 2024-09-01 · DOI: 10.1093/pnasnexus/pgae400
R Maria Del Rio-Chanona, Nadzeya Laurentsyeva, Johannes Wachs

Abstract


Large language models (LLMs) are a potential substitute for human-generated data and knowledge resources. This substitution, however, can present a significant problem for the training data needed to develop future models if it leads to a reduction of human-generated content. In this work, we document a reduction in activity on Stack Overflow coinciding with the release of ChatGPT, a popular LLM. To test whether this reduction in activity is specific to the introduction of this LLM, we use counterfactuals involving similar human-generated knowledge resources that should not be affected by the introduction of ChatGPT to the same extent. Within 6 months of ChatGPT's release, activity on Stack Overflow decreased by 25% relative to its Russian and Chinese counterparts, where access to ChatGPT is limited, and to similar forums for mathematics, where ChatGPT is less capable. We interpret this estimate as a lower bound of the true impact of ChatGPT on Stack Overflow. The decline is larger for posts related to the most widely used programming languages. We find no significant change in post quality, measured by peer feedback, and observe similar decreases in content creation by more and less experienced users alike. Thus, LLMs are not only displacing duplicate, low-quality, or beginner-level content. Our findings suggest that the rapid adoption of LLMs reduces the production of public data needed to train them, with significant consequences.
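The counterfactual comparison described in the abstract resembles a difference-in-differences design: the change in activity on the treated platform (Stack Overflow) is netted against the change on control platforms unaffected, or less affected, by ChatGPT. A minimal sketch of that logic, using entirely hypothetical weekly post counts (not the paper's data; the variable names and numbers are illustrative only):

```python
import math

# Hypothetical weekly post counts before/after the ChatGPT release.
# "Treated": a platform where ChatGPT is accessible; "control": a
# counterpart forum where access is limited. Numbers are made up.
treated_before, treated_after = 1000.0, 700.0
control_before, control_after = 1000.0, 950.0

# Change on each platform in log points (approximates percentage change).
treated_change = math.log(treated_after) - math.log(treated_before)
control_change = math.log(control_after) - math.log(control_before)

# Difference-in-differences estimate: the treated platform's change
# net of the common trend captured by the control platform.
did = treated_change - control_change
print(f"DiD estimate: {did:.3f} log points "
      f"(~{(math.exp(did) - 1) * 100:.1f}% relative decline)")
```

In the paper this comparison is run on real posting data across several control forums; the sketch above only illustrates why "decreased by 25% relative to its counterparts" is a relative estimate that nets out platform-wide trends rather than a raw drop in posts.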
