{"title":"Human vs machine: identifying ChatGPT-generated abstracts in Gynecology and Urogynecology","authors":"","doi":"10.1016/j.ajog.2024.04.045","DOIUrl":null,"url":null,"abstract":"<div><h3>Background</h3><p>ChatGPT, a publicly available artificial intelligence large language model, has allowed for sophisticated artificial intelligence technology on demand. Indeed, use of ChatGPT has already begun to make its way into medical research. However, the medical community has yet to understand the capabilities and ethical considerations of artificial intelligence within this context, and unknowns exist regarding ChatGPT’s writing abilities, accuracy, and implications for authorship.</p></div><div><h3>Objective</h3><p>We hypothesize that human reviewers and artificial intelligence detection software differ in their ability to correctly identify original published abstracts and artificial intelligence-written abstracts in the subjects of Gynecology and Urogynecology. We also suspect that concrete differences in writing errors, readability, and perceived writing quality exist between original and artificial intelligence-generated text.</p></div><div><h3>Study Design</h3><p>Twenty-five articles published in high-impact medical journals and a collection of Gynecology and Urogynecology journals were selected. ChatGPT was prompted to write 25 corresponding artificial intelligence-generated abstracts, providing the abstract title, journal-dictated abstract requirements, and select original results. The original and artificial intelligence-generated abstracts were reviewed by blinded Gynecology and Urogynecology faculty and fellows to identify the writing as original or artificial intelligence-generated. All abstracts were analyzed by publicly available artificial intelligence detection software GPTZero, Originality, and Copyleaks, and were assessed for writing errors and quality by artificial intelligence writing assistant Grammarly.</p></div><div><h3>Results</h3><p>A total of 157 reviews of 25 original and 25 artificial intelligence-generated abstracts were conducted by 26 faculty and 4 fellows; 57% of original abstracts and 42.3% of artificial intelligence-generated abstracts were correctly identified, yielding an average accuracy of 49.7% across all abstracts. All 3 artificial intelligence detectors rated the original abstracts as less likely to be artificial intelligence-written than the ChatGPT-generated abstracts (GPTZero, 5.8% vs 73.3%; <em>P</em><.001; Originality, 10.9% vs 98.1%; <em>P</em><.001; Copyleaks, 18.6% vs 58.2%; <em>P</em><.001). The performance of the 3 artificial intelligence detection software differed when analyzing all abstracts (<em>P</em>=.03), original abstracts (<em>P</em><.001), and artificial intelligence-generated abstracts (<em>P</em><.001). 
Grammarly text analysis identified more writing issues and correctness errors in original than in artificial intelligence abstracts, including lower Grammarly score reflective of poorer writing quality (82.3 vs 88.1; <em>P</em>=.006), more total writing issues (19.2 vs 12.8; <em>P</em><.001), critical issues (5.4 vs 1.3; <em>P</em><.001), confusing words (0.8 vs 0.1; <em>P</em>=.006), misspelled words (1.7 vs 0.6; <em>P</em>=.02), incorrect determiner use (1.2 vs 0.2; <em>P</em>=.002), and comma misuse (0.3 vs 0.0; <em>P</em>=.005).</p></div><div><h3>Conclusion</h3><p>Human reviewers are unable to detect the subtle differences between human and ChatGPT-generated scientific writing because of artificial intelligence’s ability to generate tremendously realistic text. Artificial intelligence detection software improves the identification of artificial intelligence-generated writing, but still lacks complete accuracy and requires programmatic improvements to achieve optimal detection. Given that reviewers and editors may be unable to reliably detect artificial intelligence-generated texts, clear guidelines for reporting artificial intelligence use by authors and implementing artificial intelligence detection software in the review process will need to be established as artificial intelligence chatbots gain more widespread use.</p></div>","PeriodicalId":7574,"journal":{"name":"American journal of obstetrics and gynecology","volume":null,"pages":null},"PeriodicalIF":8.7000,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"American journal of obstetrics and gynecology","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0002937824005714","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"OBSTETRICS & GYNECOLOGY","Score":null,"Total":0}
Citations: 0
Abstract
Background
ChatGPT, a publicly available artificial intelligence large language model, has allowed for sophisticated artificial intelligence technology on demand. Indeed, use of ChatGPT has already begun to make its way into medical research. However, the medical community has yet to understand the capabilities and ethical considerations of artificial intelligence within this context, and unknowns exist regarding ChatGPT’s writing abilities, accuracy, and implications for authorship.
Objective
We hypothesize that human reviewers and artificial intelligence detection software differ in their ability to correctly identify original published abstracts and artificial intelligence-written abstracts in the subjects of Gynecology and Urogynecology. We also suspect that concrete differences in writing errors, readability, and perceived writing quality exist between original and artificial intelligence-generated text.
Study Design
Twenty-five articles published in high-impact medical journals and a collection of Gynecology and Urogynecology journals were selected. ChatGPT was prompted to write 25 corresponding artificial intelligence-generated abstracts, providing the abstract title, journal-dictated abstract requirements, and select original results. The original and artificial intelligence-generated abstracts were reviewed by blinded Gynecology and Urogynecology faculty and fellows to identify the writing as original or artificial intelligence-generated. All abstracts were analyzed by publicly available artificial intelligence detection software GPTZero, Originality, and Copyleaks, and were assessed for writing errors and quality by artificial intelligence writing assistant Grammarly.
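For illustration, the abstract-generation step could also be scripted rather than performed through the chat interface. The sketch below uses the OpenAI Python client; the model name, prompt wording, and settings are all assumptions, since the study only states that ChatGPT was given the abstract title, the journal's abstract requirements, and selected original results.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set


def generate_abstract(title: str, requirements: str, results: str) -> str:
    """Draft a structured abstract from a title, journal abstract
    requirements, and selected results (illustrative only; the study's
    actual prompts and model version are not reported in the abstract)."""
    prompt = (
        f"Write a structured scientific abstract titled: {title}\n"
        f"Follow these journal abstract requirements: {requirements}\n"
        f"Incorporate these results: {results}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```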
Results
A total of 157 reviews of 25 original and 25 artificial intelligence-generated abstracts were conducted by 26 faculty and 4 fellows; 57% of original abstracts and 42.3% of artificial intelligence-generated abstracts were correctly identified, yielding an average accuracy of 49.7% across all abstracts. All 3 artificial intelligence detectors rated the original abstracts as less likely to be artificial intelligence-written than the ChatGPT-generated abstracts (GPTZero, 5.8% vs 73.3%; P<.001; Originality, 10.9% vs 98.1%; P<.001; Copyleaks, 18.6% vs 58.2%; P<.001). The performance of the 3 artificial intelligence detection programs differed when analyzing all abstracts (P=.03), original abstracts (P<.001), and artificial intelligence-generated abstracts (P<.001). Grammarly text analysis identified more writing issues and correctness errors in original than in artificial intelligence-generated abstracts, including a lower Grammarly score reflecting poorer writing quality (82.3 vs 88.1; P=.006), more total writing issues (19.2 vs 12.8; P<.001), critical issues (5.4 vs 1.3; P<.001), confusing words (0.8 vs 0.1; P=.006), misspelled words (1.7 vs 0.6; P=.02), incorrect determiner use (1.2 vs 0.2; P=.002), and comma misuse (0.3 vs 0.0; P=.005).
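As an illustration of how such comparisons can be reproduced, the sketch below averages the two group accuracies and compares detector scores between original and generated abstracts with a rank-based test. The detector scores are placeholder values, and the abstract does not state which statistical test the authors actually used, so the Mann-Whitney U test here is an assumption.

```python
import numpy as np
from scipy import stats

# Placeholder "AI-written" probabilities from one detector for the
# 25 original and 25 ChatGPT-generated abstracts (not the study's data).
rng = np.random.default_rng(0)
original_scores = rng.uniform(0.0, 0.2, size=25)
generated_scores = rng.uniform(0.5, 1.0, size=25)

# Reviewer accuracy per group as reported (57% and 42.3%); the simple
# average approximates the 49.7% overall accuracy reported in the abstract.
accuracy_original, accuracy_generated = 0.570, 0.423
overall_accuracy = (accuracy_original + accuracy_generated) / 2
print(f"Overall reviewer accuracy (simple average): {overall_accuracy:.2%}")

# Compare detector scores between groups with a rank-based test
# (assumed test; the abstract does not name the one used).
statistic, p_value = stats.mannwhitneyu(original_scores, generated_scores)
print(f"Detector score comparison: U={statistic:.1f}, p={p_value:.3g}")
```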
Conclusion
Human reviewers are unable to detect the subtle differences between human and ChatGPT-generated scientific writing because of artificial intelligence’s ability to generate tremendously realistic text. Artificial intelligence detection software improves the identification of artificial intelligence-generated writing, but still lacks complete accuracy and requires programmatic improvements to achieve optimal detection. Given that reviewers and editors may be unable to reliably detect artificial intelligence-generated texts, clear guidelines for reporting artificial intelligence use by authors and implementing artificial intelligence detection software in the review process will need to be established as artificial intelligence chatbots gain more widespread use.
Journal Description:
The American Journal of Obstetrics and Gynecology, known as "The Gray Journal," covers the entire spectrum of Obstetrics and Gynecology. It aims to publish original research (clinical and translational), reviews, opinions, video clips, podcasts, and interviews that contribute to understanding health and disease and have the potential to impact the practice of women's healthcare.
Focus Areas:
Diagnosis, Treatment, Prediction, and Prevention: The journal focuses on research related to the diagnosis, treatment, prediction, and prevention of obstetrical and gynecological disorders.
Biology of Reproduction: AJOG publishes work on the biology of reproduction, including studies on reproductive physiology and mechanisms of obstetrical and gynecological diseases.
Content Types:
Original Research: Clinical and translational research articles.
Reviews: Comprehensive reviews providing insights into various aspects of obstetrics and gynecology.
Opinions: Perspectives and opinions on important topics in the field.
Multimedia Content: Video clips, podcasts, and interviews.
Peer Review Process:
All submissions undergo a rigorous peer review process to ensure quality and relevance to the field of obstetrics and gynecology.