Human vs machine: identifying ChatGPT-generated abstracts in Gynecology and Urogynecology

American Journal of Obstetrics and Gynecology · IF 8.7 · CAS Tier 1 (Medicine) · Q1 (Obstetrics & Gynecology) · Pub Date: 2024-08-01 · DOI: 10.1016/j.ajog.2024.04.045
{"title":"Human vs machine: identifying ChatGPT-generated abstracts in Gynecology and Urogynecology","authors":"","doi":"10.1016/j.ajog.2024.04.045","DOIUrl":null,"url":null,"abstract":"<div><h3>Background</h3><p>ChatGPT, a publicly available artificial intelligence large language model, has allowed for sophisticated artificial intelligence technology on demand. Indeed, use of ChatGPT has already begun to make its way into medical research. However, the medical community has yet to understand the capabilities and ethical considerations of artificial intelligence within this context, and unknowns exist regarding ChatGPT’s writing abilities, accuracy, and implications for authorship.</p></div><div><h3>Objective</h3><p>We hypothesize that human reviewers and artificial intelligence detection software differ in their ability to correctly identify original published abstracts and artificial intelligence-written abstracts in the subjects of Gynecology and Urogynecology. We also suspect that concrete differences in writing errors, readability, and perceived writing quality exist between original and artificial intelligence-generated text.</p></div><div><h3>Study Design</h3><p>Twenty-five articles published in high-impact medical journals and a collection of Gynecology and Urogynecology journals were selected. ChatGPT was prompted to write 25 corresponding artificial intelligence-generated abstracts, providing the abstract title, journal-dictated abstract requirements, and select original results. The original and artificial intelligence-generated abstracts were reviewed by blinded Gynecology and Urogynecology faculty and fellows to identify the writing as original or artificial intelligence-generated. All abstracts were analyzed by publicly available artificial intelligence detection software GPTZero, Originality, and Copyleaks, and were assessed for writing errors and quality by artificial intelligence writing assistant Grammarly.</p></div><div><h3>Results</h3><p>A total of 157 reviews of 25 original and 25 artificial intelligence-generated abstracts were conducted by 26 faculty and 4 fellows; 57% of original abstracts and 42.3% of artificial intelligence-generated abstracts were correctly identified, yielding an average accuracy of 49.7% across all abstracts. All 3 artificial intelligence detectors rated the original abstracts as less likely to be artificial intelligence-written than the ChatGPT-generated abstracts (GPTZero, 5.8% vs 73.3%; <em>P</em>&lt;.001; Originality, 10.9% vs 98.1%; <em>P</em>&lt;.001; Copyleaks, 18.6% vs 58.2%; <em>P</em>&lt;.001). The performance of the 3 artificial intelligence detection software differed when analyzing all abstracts (<em>P</em>=.03), original abstracts (<em>P</em>&lt;.001), and artificial intelligence-generated abstracts (<em>P</em>&lt;.001). 
Grammarly text analysis identified more writing issues and correctness errors in original than in artificial intelligence abstracts, including lower Grammarly score reflective of poorer writing quality (82.3 vs 88.1; <em>P</em>=.006), more total writing issues (19.2 vs 12.8; <em>P</em>&lt;.001), critical issues (5.4 vs 1.3; <em>P</em>&lt;.001), confusing words (0.8 vs 0.1; <em>P</em>=.006), misspelled words (1.7 vs 0.6; <em>P</em>=.02), incorrect determiner use (1.2 vs 0.2; <em>P</em>=.002), and comma misuse (0.3 vs 0.0; <em>P</em>=.005).</p></div><div><h3>Conclusion</h3><p>Human reviewers are unable to detect the subtle differences between human and ChatGPT-generated scientific writing because of artificial intelligence’s ability to generate tremendously realistic text. Artificial intelligence detection software improves the identification of artificial intelligence-generated writing, but still lacks complete accuracy and requires programmatic improvements to achieve optimal detection. Given that reviewers and editors may be unable to reliably detect artificial intelligence-generated texts, clear guidelines for reporting artificial intelligence use by authors and implementing artificial intelligence detection software in the review process will need to be established as artificial intelligence chatbots gain more widespread use.</p></div>","PeriodicalId":7574,"journal":{"name":"American journal of obstetrics and gynecology","volume":null,"pages":null},"PeriodicalIF":8.7000,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"American journal of obstetrics and gynecology","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0002937824005714","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"OBSTETRICS & GYNECOLOGY","Score":null,"Total":0}
引用次数: 0

Abstract

Background

ChatGPT, a publicly available artificial intelligence large language model, has allowed for sophisticated artificial intelligence technology on demand. Indeed, use of ChatGPT has already begun to make its way into medical research. However, the medical community has yet to understand the capabilities and ethical considerations of artificial intelligence within this context, and unknowns exist regarding ChatGPT’s writing abilities, accuracy, and implications for authorship.

Objective

We hypothesize that human reviewers and artificial intelligence detection software differ in their ability to correctly identify original published abstracts and artificial intelligence-written abstracts in the subjects of Gynecology and Urogynecology. We also suspect that concrete differences in writing errors, readability, and perceived writing quality exist between original and artificial intelligence-generated text.

Study Design

Twenty-five articles published in high-impact medical journals and a collection of Gynecology and Urogynecology journals were selected. ChatGPT was prompted to write 25 corresponding artificial intelligence-generated abstracts, providing the abstract title, journal-dictated abstract requirements, and select original results. The original and artificial intelligence-generated abstracts were reviewed by blinded Gynecology and Urogynecology faculty and fellows to identify the writing as original or artificial intelligence-generated. All abstracts were analyzed by publicly available artificial intelligence detection software GPTZero, Originality, and Copyleaks, and were assessed for writing errors and quality by artificial intelligence writing assistant Grammarly.
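
The abstract does not report the authors' prompts, model version, or tooling. Purely as an illustration of the generation step described above, here is a minimal sketch using the OpenAI Python client; the model name, prompt wording, and the `generate_abstract` helper are all assumptions, not the authors' method.

```python
# Hypothetical sketch of the abstract-generation step; the study's actual
# prompts, model version, and tooling are not reported in the abstract.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_abstract(title: str, requirements: str, results: str) -> str:
    """Prompt the model with the three inputs the authors describe:
    the abstract title, the journal-dictated abstract requirements,
    and selected original results."""
    prompt = (
        f"Write a structured scientific abstract titled: {title}\n\n"
        f"Journal abstract requirements: {requirements}\n\n"
        f"Incorporate these results: {results}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumption; the paper says only "ChatGPT"
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Each generated abstract would then be scored, alongside its original counterpart, by the three detectors and by Grammarly.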

Results

A total of 157 reviews of 25 original and 25 artificial intelligence-generated abstracts were conducted by 26 faculty and 4 fellows; 57% of original abstracts and 42.3% of artificial intelligence-generated abstracts were correctly identified, yielding an average accuracy of 49.7% across all abstracts. All 3 artificial intelligence detectors rated the original abstracts as less likely to be artificial intelligence-written than the ChatGPT-generated abstracts (GPTZero, 5.8% vs 73.3%; P<.001; Originality, 10.9% vs 98.1%; P<.001; Copyleaks, 18.6% vs 58.2%; P<.001). The performance of the 3 artificial intelligence detection programs differed when analyzing all abstracts (P=.03), original abstracts (P<.001), and artificial intelligence-generated abstracts (P<.001). Grammarly text analysis identified more writing issues and correctness errors in original than in artificial intelligence-generated abstracts, including a lower Grammarly score, reflective of poorer writing quality (82.3 vs 88.1; P=.006), more total writing issues (19.2 vs 12.8; P<.001), critical issues (5.4 vs 1.3; P<.001), confusing words (0.8 vs 0.1; P=.006), misspelled words (1.7 vs 0.6; P=.02), incorrect determiner use (1.2 vs 0.2; P=.002), and comma misuse (0.3 vs 0.0; P=.005).
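
Two of the reported figures can be sanity-checked directly. The sketch below is illustrative only: it assumes the 49.7% overall accuracy is the unweighted mean of the two group accuracies, and that roughly 78 of the 157 reviews were correct (0.497 × 157 ≈ 78); neither breakdown is reported in the abstract.

```python
# Sanity check of the reported reviewer accuracy, plus an illustrative
# test of whether pooled performance differs from chance (50%).
# Assumptions: the overall figure is the unweighted mean of the two group
# accuracies, and ~78 of 157 reviews were correct; neither breakdown is
# given in the abstract.
from scipy.stats import binomtest

acc_original = 0.57    # original abstracts correctly identified
acc_generated = 0.423  # AI-generated abstracts correctly identified

overall = (acc_original + acc_generated) / 2
print(f"Mean accuracy: {overall:.2%}")  # 49.65%, matching the reported 49.7%

# Is 78/157 distinguishable from coin-flipping? (illustrative only)
result = binomtest(k=78, n=157, p=0.5, alternative="two-sided")
print(f"Binomial test vs chance: p = {result.pvalue:.2f}")  # p = 1.0
```

Under these assumptions, pooled reviewer performance is statistically indistinguishable from guessing, which is the abstract's central point about human detection.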

Conclusion

Human reviewers are unable to detect the subtle differences between human and ChatGPT-generated scientific writing because of artificial intelligence’s ability to generate tremendously realistic text. Artificial intelligence detection software improves the identification of artificial intelligence-generated writing, but still lacks complete accuracy and requires programmatic improvements to achieve optimal detection. Given that reviewers and editors may be unable to reliably detect artificial intelligence-generated texts, clear guidelines for reporting artificial intelligence use by authors and implementing artificial intelligence detection software in the review process will need to be established as artificial intelligence chatbots gain more widespread use.

Source journal
CiteScore: 15.90
Self-citation rate: 7.10%
Annual publications: 2237
Review time: 47 days
Journal introduction: The American Journal of Obstetrics and Gynecology, known as "The Gray Journal," covers the entire spectrum of Obstetrics and Gynecology. It aims to publish original research (clinical and translational), reviews, opinions, video clips, podcasts, and interviews that contribute to understanding health and disease and have the potential to impact the practice of women's healthcare.

Focus areas: research on the diagnosis, treatment, prediction, and prevention of obstetrical and gynecological disorders, and on the biology of reproduction, including reproductive physiology and the mechanisms of obstetrical and gynecological diseases.

Content types: original research (clinical and translational), comprehensive reviews, opinion pieces, and multimedia content (video clips, podcasts, and interviews).

Peer review: all submissions undergo a rigorous peer review process to ensure quality and relevance to the field of obstetrics and gynecology.
Latest articles from this journal:
Cerebral infarcts, edema, hypoperfusion and vasospasm in preeclampsia and eclampsia.
Detection of endometrial cancer-related bleeding in virtual visits.
Prevalence and risk factors for postpartum depression two months after cesarean delivery: a prospective multicenter study.
Association between infertility and cervical insufficiency in nulliparous women: the contribution of fertility treatment.
Delayed cord clamping in preterm twin infants: a systematic review and meta-analysis.