Comparison of generative AI performance on undergraduate and postgraduate written assessments in the biomedical sciences

International Journal of Educational Technology in Higher Education | IF 8.6 | Q1, Education & Educational Research (CAS Tier 1, Education) | Pub Date: 2024-09-13 | DOI: 10.1186/s41239-024-00485-y
Andrew Williams
{"title":"Comparison of generative AI performance on undergraduate and postgraduate written assessments in the biomedical sciences","authors":"Andrew Williams","doi":"10.1186/s41239-024-00485-y","DOIUrl":null,"url":null,"abstract":"<p>The value of generative AI tools in higher education has received considerable attention. Although there are many proponents of its value as a learning tool, many are concerned with the issues regarding academic integrity and its use by students to compose written assessments. This study evaluates and compares the output of three commonly used generative AI tools, ChatGPT, Bing and Bard. Each AI tool was prompted with an essay question from undergraduate (UG) level 4 (year 1), level 5 (year 2), level 6 (year 3) and postgraduate (PG) level 7 biomedical sciences courses. Anonymised AI generated output was then evaluated by four independent markers, according to specified marking criteria and matched to the Frameworks for Higher Education Qualifications (FHEQ) of UK level descriptors. Percentage scores and ordinal grades were given for each marking criteria across AI generated papers, inter-rater reliability was calculated using Kendall’s coefficient of concordance and generative AI performance ranked. Across all UG and PG levels, ChatGPT performed better than Bing or Bard in areas of scientific accuracy, scientific detail and context. All AI tools performed consistently well at PG level compared to UG level, although only ChatGPT consistently met levels of high attainment at all UG levels. ChatGPT and Bing did not provide adequate references, while Bing falsified references. In conclusion, generative AI tools are useful for providing scientific information consistent with the academic standards required of students in written assignments. These findings have broad implications for the design, implementation and grading of written assessments in higher education.</p>","PeriodicalId":13871,"journal":{"name":"International Journal of Educational Technology in Higher Education","volume":null,"pages":null},"PeriodicalIF":8.6000,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Educational Technology in Higher Education","FirstCategoryId":"95","ListUrlMain":"https://doi.org/10.1186/s41239-024-00485-y","RegionNum":1,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION & EDUCATIONAL RESEARCH","Score":null,"Total":0}
Citations: 0

Abstract

The value of generative AI tools in higher education has received considerable attention. Although there are many proponents of their value as learning tools, many are concerned about academic integrity and the use of these tools by students to compose written assessments. This study evaluates and compares the output of three commonly used generative AI tools: ChatGPT, Bing and Bard. Each AI tool was prompted with an essay question from undergraduate (UG) level 4 (year 1), level 5 (year 2), level 6 (year 3) and postgraduate (PG) level 7 biomedical sciences courses. Anonymised AI-generated output was then evaluated by four independent markers according to specified marking criteria and matched to the UK Frameworks for Higher Education Qualifications (FHEQ) level descriptors. Percentage scores and ordinal grades were given for each marking criterion across the AI-generated papers, inter-rater reliability was calculated using Kendall's coefficient of concordance, and generative AI performance was ranked. Across all UG and PG levels, ChatGPT performed better than Bing or Bard in scientific accuracy, scientific detail and context. All AI tools performed consistently well at PG level compared to UG level, although only ChatGPT consistently met levels of high attainment at all UG levels. ChatGPT and Bing did not provide adequate references, and Bing falsified references. In conclusion, generative AI tools are useful for providing scientific information consistent with the academic standards required of students in written assignments. These findings have broad implications for the design, implementation and grading of written assessments in higher education.
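The inter-rater reliability statistic used in the study, Kendall's coefficient of concordance (W), measures how consistently multiple markers rank the same set of papers, from 0 (no agreement) to 1 (perfect agreement). Below is a minimal sketch of the computation; the `kendalls_w` helper and the score matrix are illustrative assumptions, not the authors' code or data.

```python
# Minimal sketch (assumed, not the authors' code) of Kendall's
# coefficient of concordance (W) for inter-rater reliability.
import numpy as np
from scipy.stats import rankdata

def kendalls_w(scores: np.ndarray) -> float:
    """Kendall's W for an (m raters x n items) score matrix.

    Each rater's scores are converted to within-rater ranks (ties get
    average ranks); no tie correction is applied, so W is slightly
    conservative when ties are frequent.
    """
    m, n = scores.shape
    ranks = np.apply_along_axis(rankdata, 1, scores)  # rank within each rater
    rank_sums = ranks.sum(axis=0)                     # R_i for each item
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()   # squared deviations of rank sums
    return 12.0 * s / (m ** 2 * (n ** 3 - n))

# Hypothetical percentage scores: 4 markers x 3 AI-generated papers
scores = np.array([
    [72, 55, 48],
    [68, 58, 50],
    [75, 52, 45],
    [70, 60, 47],
])
print(f"Kendall's W = {kendalls_w(scores):.2f}")  # 1.00: all raters agree on the ordering
```

Because the scores are reduced to within-rater ranks, W reflects agreement on the ordering of the papers rather than on the absolute marks awarded.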


Source journal metrics: CiteScore 19.30 | Self-citation rate 4.70% | Articles published 59 | Review time 76.7 days
Journal description: This journal seeks to foster the sharing of critical scholarly works and information exchange across diverse cultural perspectives in the fields of technology-enhanced and digital learning in higher education. It aims to advance scientific knowledge on the human and personal aspects of technology use in higher education, while keeping readers informed about the latest developments in applying digital technologies to learning, training, research, and management.
Latest articles in this journal:
Comparison of generative AI performance on undergraduate and postgraduate written assessments in the biomedical sciences
Simple techniques to bypass GenAI text detectors: implications for inclusive education
Understanding college students' test anxiety in asynchronous online courses: the mediating role of emotional engagement
Rethinking assessment strategies to improve authentic representations of learning: using blogs as a creative assessment alternative to develop professional skills
Beyond content delivery: harnessing emotional intelligence for community building in fully online digital spaces