Detecting Artificial Intelligence-Generated Versus Human-Written Medical Student Essays: Semirandomized Controlled Study.

JMIR Medical Education, vol. 11, e62779 (IF 3.2; JCR Q1, Education, Scientific Disciplines). Pub Date: 2025-03-03. DOI: 10.2196/62779

Berin Doru, Christoph Maier, Johanna Sophie Busse, Thomas Lücke, Judith Schönhoff, Elena Enax-Krumova, Steffen Hessler, Maria Berger, Marianne Tokic



Abstract

Background: Large language models, exemplified by ChatGPT, have reached a level of sophistication that makes distinguishing between human- and artificial intelligence (AI)-generated texts increasingly challenging. This has raised concerns in academia, particularly in medicine, where the accuracy and authenticity of written work are paramount.

Objective: This semirandomized controlled study aimed to examine the ability of 2 blinded expert groups with different levels of content familiarity (medical professionals and humanities scholars with expertise in textual analysis) to distinguish between longer scientific texts in German written by medical students and those generated by ChatGPT. Additionally, the study sought to analyze the reasoning behind their identification choices, particularly the role of content familiarity and linguistic features.

Methods: Between May and August 2023, a total of 35 experts (medical: n=22; humanities: n=13) were each presented with 2 pairs of texts on different medical topics. Each pair had similar content and structure: 1 text was written by a medical student, and the other was generated by ChatGPT (version 3.5, March 2023). Experts were asked to identify the AI-generated text and justify their choice. These justifications were analyzed through a multistage, interdisciplinary qualitative analysis to identify relevant textual features. Before unblinding, experts rated each text on 6 characteristics: linguistic fluency and spelling/grammatical accuracy, scientific quality, logical coherence, expression of knowledge limitations, formulation of future research questions, and citation quality. Univariate tests and multivariate logistic regression analyses were used to examine associations between participants' characteristics, their stated reasons for author identification, and the likelihood of correctly determining a text's authorship.

Results: Overall, participants accurately identified the AI-generated text in 48 of 69 (70%) decision rounds, with minimal difference between groups (medical: 31/43, 72%; humanities: 17/26, 65%; odds ratio [OR] 1.37, 95% CI 0.5-3.9). While content errors had little impact on identification accuracy, stylistic features, particularly redundancy (OR 6.90, 95% CI 1.01-47.1), repetition (OR 8.05, 95% CI 1.25-51.7), and thread/coherence (OR 6.62, 95% CI 1.25-35.2), played a crucial role in participants' decisions to identify a text as AI-generated.
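
The group comparison above can be checked by hand from the reported counts. The following is a minimal sketch, not the authors' analysis: it assumes a simple 2x2 table with a Wald (log-scale) confidence interval and treats each decision round as independent, whereas the abstract does not say which interval method was used and each expert contributed more than one round. The function name `odds_ratio_wald` is hypothetical.

```python
import math
from statistics import NormalDist


def odds_ratio_wald(a, b, c, d, level=0.95):
    """Odds ratio and Wald CI for a 2x2 table.

    a, b: correct / incorrect decisions in group 1
    c, d: correct / incorrect decisions in group 2
    Assumption: independent observations (not stated in the abstract).
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the Wald approximation
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    log_or = math.log(odds_ratio)
    return odds_ratio, math.exp(log_or - z * se), math.exp(log_or + z * se)


# Counts reported in the Results paragraph
med_correct, med_total = 31, 43
hum_correct, hum_total = 17, 26

overall = (med_correct + hum_correct) / (med_total + hum_total)
or_, lo, hi = odds_ratio_wald(
    med_correct, med_total - med_correct,
    hum_correct, hum_total - hum_correct,
)

print(f"Overall accuracy: {overall:.0%}")          # Overall accuracy: 70%
print(f"OR {or_:.2f}, 95% CI {lo:.2f}-{hi:.2f}")   # OR 1.37, 95% CI 0.48-3.90
```

Under these assumptions the sketch reproduces the reported OR of 1.37 and an interval that rounds to 0.5-3.9. Because the interval spans 1, the data are compatible with no difference between the medical and humanities groups.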

Conclusions: The findings suggest that both medical and humanities experts were able to identify ChatGPT-generated texts in medical contexts, with their decisions largely based on linguistic attributes. The accuracy of identification appears to be independent of experts' familiarity with the text content. As the decision-making process primarily relies on linguistic attributes, such as stylistic features and text coherence, further quasi-experimental studies using texts from other academic disciplines should be conducted to determine whether instructions based on these features can enhance lecturers' ability to distinguish between student-authored and AI-generated work.
