Chatbots Put to the Test in Math and Logic Problems: A Comparison and Assessment of ChatGPT-3.5, ChatGPT-4, and Google Bard

Journal: AI (Basel, Switzerland) | Impact Factor: 3.1 | JCR Quartile: Q2 (Computer Science, Artificial Intelligence) | Publication date: 2023-10-24 | DOI: 10.3390/ai4040048
Vagelis Plevris, George Papazafeiropoulos, Alejandro Jiménez Rios
{"title":"Chatbots Put to the Test in Math and Logic Problems: A Comparison and Assessment of ChatGPT-3.5, ChatGPT-4, and Google Bard","authors":"Vagelis Plevris, George Papazafeiropoulos, Alejandro Jiménez Rios","doi":"10.3390/ai4040048","DOIUrl":null,"url":null,"abstract":"In an age where artificial intelligence is reshaping the landscape of education and problem solving, our study unveils the secrets behind three digital wizards, ChatGPT-3.5, ChatGPT-4, and Google Bard, as they engage in a thrilling showdown of mathematical and logical prowess. We assess the ability of the chatbots to understand the given problem, employ appropriate algorithms or methods to solve it, and generate coherent responses with correct answers. We conducted our study using a set of 30 questions. These questions were carefully crafted to be clear, unambiguous, and fully described using plain text only. Each question has a unique and well-defined correct answer. The questions were divided into two sets of 15: Set A consists of “Original” problems that cannot be found online, while Set B includes “Published” problems that are readily available online, often with their solutions. Each question was presented to each chatbot three times in May 2023. We recorded and analyzed their responses, highlighting their strengths and weaknesses. Our findings indicate that chatbots can provide accurate solutions for straightforward arithmetic, algebraic expressions, and basic logic puzzles, although they may not be consistently accurate in every attempt. However, for more complex mathematical problems or advanced logic tasks, the chatbots’ answers, although they appear convincing, may not be reliable. Furthermore, consistency is a concern as chatbots often provide conflicting answers when presented with the same question multiple times. To evaluate and compare the performance of the three chatbots, we conducted a quantitative analysis by scoring their final answers based on correctness. Our results show that ChatGPT-4 performs better than ChatGPT-3.5 in both sets of questions. Bard ranks third in the original questions of Set A, trailing behind the other two chatbots. However, Bard achieves the best performance, taking first place in the published questions of Set B. This is likely due to Bard’s direct access to the internet, unlike the ChatGPT chatbots, which, due to their designs, do not have external communication capabilities.","PeriodicalId":93633,"journal":{"name":"AI (Basel, Switzerland)","volume":"30 3","pages":"0"},"PeriodicalIF":3.1000,"publicationDate":"2023-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"AI (Basel, Switzerland)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3390/ai4040048","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

In an age where artificial intelligence is reshaping the landscape of education and problem solving, our study unveils the secrets behind three digital wizards, ChatGPT-3.5, ChatGPT-4, and Google Bard, as they engage in a thrilling showdown of mathematical and logical prowess. We assess the ability of the chatbots to understand the given problem, employ appropriate algorithms or methods to solve it, and generate coherent responses with correct answers. We conducted our study using a set of 30 questions. These questions were carefully crafted to be clear, unambiguous, and fully described using plain text only. Each question has a unique and well-defined correct answer. The questions were divided into two sets of 15: Set A consists of “Original” problems that cannot be found online, while Set B includes “Published” problems that are readily available online, often with their solutions. Each question was presented to each chatbot three times in May 2023. We recorded and analyzed their responses, highlighting their strengths and weaknesses. Our findings indicate that chatbots can provide accurate solutions for straightforward arithmetic, algebraic expressions, and basic logic puzzles, although they may not be consistently accurate in every attempt. However, for more complex mathematical problems or advanced logic tasks, the chatbots’ answers, although they appear convincing, may not be reliable. Furthermore, consistency is a concern as chatbots often provide conflicting answers when presented with the same question multiple times. To evaluate and compare the performance of the three chatbots, we conducted a quantitative analysis by scoring their final answers based on correctness. Our results show that ChatGPT-4 performs better than ChatGPT-3.5 in both sets of questions. Bard ranks third in the original questions of Set A, trailing behind the other two chatbots. However, Bard achieves the best performance, taking first place in the published questions of Set B. This is likely due to Bard’s direct access to the internet, unlike the ChatGPT chatbots, which, due to their designs, do not have external communication capabilities.
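
The evaluation protocol the abstract describes (30 questions in two sets of 15, each posed three times per chatbot, with final answers scored for correctness) maps naturally onto a small aggregation script. The sketch below is illustrative only: the paper does not publish its code, and the binary per-trial scoring, the data layout, and names such as score_attempts are assumptions, not the authors' implementation.

```python
from collections import defaultdict

# Hypothetical reconstruction of the protocol described in the abstract:
# 30 questions (15 "Original" in Set A, 15 "Published" in Set B), each
# asked 3 times per chatbot, with each final answer marked correct or not.

def score_attempts(attempts: dict[tuple[str, str], list[bool]]) -> dict:
    """Aggregate correctness per (chatbot, question set) pair.

    Each value in `attempts` is one boolean per (question, trial), so a
    full set yields 45 trials (15 questions x 3 repetitions).
    """
    totals = defaultdict(lambda: {"correct": 0, "trials": 0})
    for (bot, qset), results in attempts.items():
        totals[(bot, qset)]["correct"] += sum(results)  # True counts as 1
        totals[(bot, qset)]["trials"] += len(results)
    return dict(totals)

# Toy example with made-up outcomes (not the paper's data):
demo = {
    ("ChatGPT-4", "A"): [True, True, False] * 15,  # 15 questions x 3 trials
    ("Bard", "A"): [True, False, False] * 15,
}
for key, t in score_attempts(demo).items():
    print(key, f'{t["correct"]}/{t["trials"]} correct')
```

The same per-trial structure would also support the consistency issue the abstract highlights, e.g., flagging questions where a chatbot's three answers disagree.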
Source journal: AI (Basel, Switzerland)
CiteScore: 7.20
Self-citation rate: 0.00%
Articles published: 0
Review time: 11 weeks