A Comparison of Numeric Assessments of Ideas From Two Large Language Models: With Implications for Validating and Choosing LLMs
Author: Daniel E. O’Leary
Journal: IEEE Intelligent Systems (Impact Factor 5.6; JCR Q1, Computer Science, Artificial Intelligence)
Publication date: 2024-06-24
Publication type: Journal Article
DOI: https://doi.org/10.1109/mis.2024.3396371
Citations: 0
Abstract
This article compares numeric assessments generated by ChatGPT and Claude along four dimensions (novelty, feasibility, impact, and disruption) to study their ability to rate ideas. We find that the chatbots make numeric assessments that are consistent with the expected relationships between those dimensions; for example, novelty is negatively correlated with feasibility. We also find that the two chatbots make statistically significantly different numeric assessments of the same idea information. We suggest that this type of analysis can also be used to provide a type of validation of underlying chatbot capabilities. In addition, we suggest that, as part of their chatbot requirements analysis, enterprises use this approach to ensure that the chatbot appropriately “understands” the concepts in which they are directly interested.
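The two checks described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say which statistical tests the article uses, so the Pearson correlation and paired t statistic below are assumptions chosen for illustration, and the ratings are made-up numbers, not data from the article. The names `ratings_a`, `ratings_b`, `pearson`, and `paired_t` are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical 1-10 ratings of the same five ideas by two chatbots.
# These values are invented for illustration, not taken from the article.
ratings_a = {"novelty": [9, 7, 4, 8, 3], "feasibility": [3, 5, 8, 5, 9]}
ratings_b = {"novelty": [8, 6, 3, 6, 2], "feasibility": [4, 6, 9, 6, 9]}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def paired_t(xs, ys):
    """Paired t statistic for two raters scoring the same items."""
    diffs = [x - y for x, y in zip(xs, ys)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Check 1: within each model, does novelty move against feasibility?
r_a = pearson(ratings_a["novelty"], ratings_a["feasibility"])
r_b = pearson(ratings_b["novelty"], ratings_b["feasibility"])

# Check 2: do the two models score the same ideas differently on novelty?
t_novelty = paired_t(ratings_a["novelty"], ratings_b["novelty"])

print(f"novelty vs. feasibility: r_a={r_a:.2f}, r_b={r_b:.2f}")
print(f"paired t on novelty (df={len(ratings_a['novelty']) - 1}): {t_novelty:.2f}")
```

A negative `r_a` or `r_b` would match the expected relationship named in the abstract. To judge significance, the t statistic would be compared against a critical value for the given degrees of freedom, or a p-value would be obtained from a statistics library.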
About the journal:
IEEE Intelligent Systems serves users, managers, developers, researchers, and purchasers who are interested in intelligent systems and artificial intelligence, with particular emphasis on applications. Typically they are degreed professionals, with backgrounds in engineering, hard science, or business. The publication emphasizes current practice and experience, together with promising new ideas that are likely to be used in the near future. Sample topic areas for feature articles include knowledge-based systems, intelligent software agents, natural-language processing, technologies for knowledge management, machine learning, data mining, adaptive and intelligent robotics, knowledge-intensive processing on the Web, and social issues relevant to intelligent systems. Also encouraged are application features, covering practice at one or more companies or laboratories; full-length product stories (which require refereeing by at least three reviewers); tutorials; surveys; and case studies. Often issues are theme-based and collect articles around a contemporary topic under the auspices of a Guest Editor working with the EIC.