Reader's digest version of scientific writing: comparative evaluation of summarization capacity between large language models and medical students in analyzing scientific writing in sleep medicine.

Frontiers in Artificial Intelligence · IF 3.0 · Q2 (Computer Science, Artificial Intelligence) · Pub Date: 2024-12-24 · eCollection Date: 2024-01-01 · DOI: 10.3389/frai.2024.1477535
Jacob Matalon, August Spurzem, Sana Ahsan, Elizabeth White, Ronik Kothari, Madhu Varma
{"title":"Reader's digest version of scientific writing: comparative evaluation of summarization capacity between large language models and medical students in analyzing scientific writing in sleep medicine.","authors":"Jacob Matalon, August Spurzem, Sana Ahsan, Elizabeth White, Ronik Kothari, Madhu Varma","doi":"10.3389/frai.2024.1477535","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>As artificial intelligence systems like large language models (LLM) and natural language processing advance, the need to evaluate their utility within medicine and medical education grows. As medical research publications continue to grow exponentially, AI systems offer valuable opportunities to condense and synthesize information, especially in underrepresented areas such as Sleep Medicine. The present study aims to compare summarization capacity between LLM generated summaries of sleep medicine research article abstracts, to summaries generated by Medical Student (humans) and to evaluate if the research content, and literary readability summarized is retained comparably.</p><p><strong>Methods: </strong>A collection of three AI-generated and human-generated summaries of sleep medicine research article abstracts were shared with 19 study participants (medical students) attending a sleep medicine conference. Participants were blind as to which summary was human or LLM generated. After reading both human and AI-generated research summaries participants completed a 1-5 Likert scale survey on the readability of the extracted writings. Participants also answered article-specific multiple-choice questions evaluating their comprehension of the summaries, as a representation of the quality of content retained by the AI-generated summaries.</p><p><strong>Results: </strong>An independent sample t-test between the AI-generated and human-generated summaries comprehension by study participants revealed no significant difference between the Likert readability ratings (<i>p</i> = 0.702). A chi-squared test of proportions revealed no significant association (<i>χ</i> <sup>2</sup> = 1.485, <i>p</i> = 0.223), and a McNemar test revealed no significant association between summary type and the proportion of correct responses to the comprehension multiple choice questions (<i>p</i> = 0.289).</p><p><strong>Discussion: </strong>Some limitations in this study were a small number of participants and user bias. Participants attended at a sleep conference and study summaries were all from sleep medicine journals. Lastly the summaries did not include graphs, numbers, and pictures, and thus were limited in material extraction. 
While the present analysis did not demonstrate a significant difference among the readability and content quality between the AI and human-generated summaries, limitations in the present study indicate that more research is needed to objectively measure, and further define strengths and weaknesses of AI models in condensing medical literature into efficient and accurate summaries.</p>","PeriodicalId":33315,"journal":{"name":"Frontiers in Artificial Intelligence","volume":"7 ","pages":"1477535"},"PeriodicalIF":3.0000,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11704966/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers in Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3389/frai.2024.1477535","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Introduction: As artificial intelligence systems such as large language models (LLMs) and natural language processing advance, the need to evaluate their utility in medicine and medical education grows. As medical research publications continue to grow exponentially, AI systems offer valuable opportunities to condense and synthesize information, especially in underrepresented areas such as sleep medicine. The present study aims to compare LLM-generated summaries of sleep medicine research article abstracts with summaries written by medical students (humans), and to evaluate whether the research content and readability of the source material are retained comparably.

Methods: A collection of three AI-generated and human-generated summaries of sleep medicine research article abstracts was shared with 19 study participants (medical students) attending a sleep medicine conference. Participants were blinded to whether each summary was human- or LLM-generated. After reading both the human- and AI-generated research summaries, participants completed a 1-5 Likert-scale survey on the readability of each summary. Participants also answered article-specific multiple-choice questions evaluating their comprehension of the summaries, as a proxy for the quality of content retained in the AI-generated summaries.

Results: An independent-samples t-test comparing participants' Likert readability ratings of the AI-generated and human-generated summaries revealed no significant difference (p = 0.702). A chi-squared test of proportions (χ² = 1.485, p = 0.223) and a McNemar test (p = 0.289) likewise revealed no significant association between summary type and the proportion of correct responses on the comprehension multiple-choice questions.
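For readers unfamiliar with these tests, below is a minimal sketch of how such an analysis could be run in Python with SciPy and statsmodels. All ratings and contingency counts are invented placeholders (the abstract does not report the underlying data), so the outputs will not reproduce the paper's statistics.

```python
# Illustrative re-creation of the abstract's statistical tests using
# invented placeholder data; the paper's raw data are not published here.
import numpy as np
from scipy import stats
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical 1-5 Likert readability ratings from 19 participants
# for the AI-generated and human-generated summaries.
ratings_ai = np.array([4, 3, 5, 4, 4, 3, 4, 5, 3, 4, 4, 3, 5, 4, 3, 4, 4, 3, 4])
ratings_human = np.array([4, 4, 5, 3, 4, 4, 4, 5, 3, 4, 3, 4, 5, 4, 4, 3, 4, 4, 4])

# Independent-samples t-test on the readability ratings.
t_stat, t_p = stats.ttest_ind(ratings_ai, ratings_human)
print(f"t-test: t = {t_stat:.3f}, p = {t_p:.3f}")

# Chi-squared test of proportions on comprehension answers.
# Rows: summary type (AI, human); columns: correct / incorrect counts.
answers = np.array([[40, 17],
                    [46, 11]])
chi2, chi_p, dof, _ = stats.chi2_contingency(answers, correction=False)
print(f"chi-squared: chi2 = {chi2:.3f}, p = {chi_p:.3f}")

# McNemar test on paired outcomes, since each participant answered
# questions about both summary types.
# Cells: [[both correct, correct on AI only],
#         [correct on human only, both incorrect]]
paired = np.array([[30, 10],
                   [5, 12]])
print(f"McNemar: p = {mcnemar(paired, exact=True).pvalue:.3f}")
```

Of the two comprehension analyses, the McNemar test is the better fit for this design, since each participant answered questions about both summary types, making the responses paired rather than independent.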

Discussion: Limitations of this study include the small number of participants and potential user bias: participants were attending a sleep medicine conference, and the study summaries were all drawn from sleep medicine journals. Lastly, the summaries did not include graphs, figures, or numerical data, which limited the material that could be extracted. While the present analysis did not demonstrate a significant difference in readability or content quality between the AI- and human-generated summaries, the study's limitations indicate that more research is needed to objectively measure, and further define, the strengths and weaknesses of AI models in condensing medical literature into efficient and accurate summaries.

Source journal: Frontiers in Artificial Intelligence
CiteScore: 6.10
Self-citation rate: 2.50%
Articles per year: 272
Review time: 13 weeks
Latest articles in this journal
- Examining the integration of artificial intelligence in supply chain management from Industry 4.0 to 6.0: a systematic literature review.
- The technology acceptance model and adopter type analysis in the context of artificial intelligence.
- An analysis of artificial intelligence automation in digital music streaming platforms for improving consumer subscription responses: a review.
- Prediction of outpatient rehabilitation patient preferences and optimization of graded diagnosis and treatment based on XGBoost machine learning algorithm.
- SineKAN: Kolmogorov-Arnold Networks using sinusoidal activation functions.