Frédéric Panthier, Hugh Crawford-Smith, Eduarda Alvarez, Alberto Melchionna, Daniela Velinova, Ikran Mohamed, Siobhan Price, Simon Choong, Vimoshan Arumuham, Sian Allen, Olivier Traxer, Daron Smith
{"title":"人工智能与人情味:人工智能能否准确生成激光技术文献综述?","authors":"Frédéric Panthier, Hugh Crawford-Smith, Eduarda Alvarez, Alberto Melchionna, Daniela Velinova, Ikran Mohamed, Siobhan Price, Simon Choong, Vimoshan Arumuham, Sian Allen, Olivier Traxer, Daron Smith","doi":"10.1007/s00345-024-05311-8","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>To compare the accuracy of open-source Artificial Intelligence (AI) Large Language Models (LLM) against human authors to generate a systematic review (SR) on the new pulsed-Thulium:YAG (p-Tm:YAG) laser.</p><p><strong>Methods: </strong>Five manuscripts were compared. The Human-SR on p-Tm:YAG (considered to be the \"ground truth\") was written by independent certified endourologists with expertise in lasers, accepted in a peer-review pubmed-indexed journal (but not yet available online, and therefore not accessible to the LLMs). The query to the AI LLMs was: \"write a systematic review on pulsed-Thulium:YAG laser for lithotripsy\" which was submitted to four LLMs (ChatGPT3.5/Vercel/Claude/Mistral-7b). The LLM-SR were uniformed and Human-SR reformatted to fit the general output appearance, to ensure blindness. Nine participants with various levels of endourological expertise (three Clinical Nurse Specialist nurses, Urology Trainees and Consultants) objectively assessed the accuracy of the five SRs using a bespoke 10 \"checkpoint\" proforma. A subjective assessment was recorded using a composite score including quality (0-10), clarity (0-10) and overall manuscript rank (1-5).</p><p><strong>Results: </strong>The Human-SR was objectively and subjectively more accurate than LLM-SRs (96 ± 7% and 86.8 ± 8.2% respectively; p < 0.001). The LLM-SRs did not significantly differ but ChatGPT3.5 presented greater subjective and objective accuracy scores (62.4 ± 15% and 29 ± 28% respectively; p > 0.05). Quality and clarity assessments were significantly impacted by SR type but not the expertise level (p < 0.001 and > 0.05, respectively).</p><p><strong>Conclusions: </strong>LLM generated data on highly technical topics present a lower accuracy than Key Opinion Leaders. LLMs, especially ChatGPT3.5, with human supervision could improve our practice.</p>","PeriodicalId":23954,"journal":{"name":"World Journal of Urology","volume":null,"pages":null},"PeriodicalIF":2.8000,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Artificial intelligence versus human touch: can artificial intelligence accurately generate a literature review on laser technologies?\",\"authors\":\"Frédéric Panthier, Hugh Crawford-Smith, Eduarda Alvarez, Alberto Melchionna, Daniela Velinova, Ikran Mohamed, Siobhan Price, Simon Choong, Vimoshan Arumuham, Sian Allen, Olivier Traxer, Daron Smith\",\"doi\":\"10.1007/s00345-024-05311-8\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Purpose: </strong>To compare the accuracy of open-source Artificial Intelligence (AI) Large Language Models (LLM) against human authors to generate a systematic review (SR) on the new pulsed-Thulium:YAG (p-Tm:YAG) laser.</p><p><strong>Methods: </strong>Five manuscripts were compared. The Human-SR on p-Tm:YAG (considered to be the \\\"ground truth\\\") was written by independent certified endourologists with expertise in lasers, accepted in a peer-review pubmed-indexed journal (but not yet available online, and therefore not accessible to the LLMs). 
The query to the AI LLMs was: \\\"write a systematic review on pulsed-Thulium:YAG laser for lithotripsy\\\" which was submitted to four LLMs (ChatGPT3.5/Vercel/Claude/Mistral-7b). The LLM-SR were uniformed and Human-SR reformatted to fit the general output appearance, to ensure blindness. Nine participants with various levels of endourological expertise (three Clinical Nurse Specialist nurses, Urology Trainees and Consultants) objectively assessed the accuracy of the five SRs using a bespoke 10 \\\"checkpoint\\\" proforma. A subjective assessment was recorded using a composite score including quality (0-10), clarity (0-10) and overall manuscript rank (1-5).</p><p><strong>Results: </strong>The Human-SR was objectively and subjectively more accurate than LLM-SRs (96 ± 7% and 86.8 ± 8.2% respectively; p < 0.001). The LLM-SRs did not significantly differ but ChatGPT3.5 presented greater subjective and objective accuracy scores (62.4 ± 15% and 29 ± 28% respectively; p > 0.05). Quality and clarity assessments were significantly impacted by SR type but not the expertise level (p < 0.001 and > 0.05, respectively).</p><p><strong>Conclusions: </strong>LLM generated data on highly technical topics present a lower accuracy than Key Opinion Leaders. LLMs, especially ChatGPT3.5, with human supervision could improve our practice.</p>\",\"PeriodicalId\":23954,\"journal\":{\"name\":\"World Journal of Urology\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":2.8000,\"publicationDate\":\"2024-10-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"World Journal of Urology\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1007/s00345-024-05311-8\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"UROLOGY & NEPHROLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"World Journal of Urology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1007/s00345-024-05311-8","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"UROLOGY & NEPHROLOGY","Score":null,"Total":0}
Artificial intelligence versus human touch: can artificial intelligence accurately generate a literature review on laser technologies?
Purpose: To compare the accuracy of open-source Artificial Intelligence (AI) Large Language Models (LLMs) with that of human authors in generating a systematic review (SR) on the new pulsed-Thulium:YAG (p-Tm:YAG) laser.
Methods: Five manuscripts were compared. The Human-SR on p-Tm:YAG (considered the "ground truth") was written by independent certified endourologists with expertise in lasers and accepted by a peer-reviewed, PubMed-indexed journal (but not yet available online, and therefore not accessible to the LLMs). The query "write a systematic review on pulsed-Thulium:YAG laser for lithotripsy" was submitted to four LLMs (ChatGPT3.5, Vercel, Claude, Mistral-7b). The LLM-SRs were standardized and the Human-SR reformatted to match the same general output appearance, to ensure blinding. Nine participants with varying levels of endourological expertise (Clinical Nurse Specialists, Urology Trainees, and Consultants; three of each) objectively assessed the accuracy of the five SRs using a bespoke 10-checkpoint proforma. A subjective assessment was recorded as a composite score comprising quality (0-10), clarity (0-10), and overall manuscript rank (1-5).
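The paper does not publish analysis code; the Python sketch below is purely illustrative of how the reported scoring scheme could be tallied. The Assessment class, the function names, the equal weighting of the composite score, and the assumption that rank 1 is best are all illustrative choices, not details taken from the study.

# Illustrative sketch only: names, weighting, and rank direction are assumptions.
from dataclasses import dataclass
from statistics import mean, stdev

@dataclass
class Assessment:
    checkpoints_met: int  # 0-10, from the bespoke 10-checkpoint proforma
    quality: int          # 0-10, subjective quality
    clarity: int          # 0-10, subjective clarity
    rank: int             # 1-5, overall manuscript rank (1 assumed best)

def objective_accuracy(a: Assessment) -> float:
    """Objective accuracy as the percentage of checkpoints satisfied."""
    return 100.0 * a.checkpoints_met / 10

def subjective_composite(a: Assessment) -> float:
    """Hypothetical composite: each component rescaled to 0-100, then averaged."""
    rank_score = 100.0 * (5 - a.rank) / 4  # rank 1 -> 100, rank 5 -> 0
    return mean([10.0 * a.quality, 10.0 * a.clarity, rank_score])

# One manuscript's nine assessments would be summarised as mean ± SD.
# Placeholder values below, not study data:
ratings = [Assessment(10, 9, 9, 1), Assessment(9, 8, 9, 2), Assessment(10, 9, 8, 1)]
objective = [objective_accuracy(r) for r in ratings]
print(f"objective accuracy: {mean(objective):.1f} ± {stdev(objective):.1f} %")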
Results: The Human-SR was objectively and subjectively more accurate than the LLM-SRs (96 ± 7% and 86.8 ± 8.2%, respectively; p < 0.001). The LLM-SRs did not differ significantly from one another, although ChatGPT3.5 achieved the highest subjective and objective accuracy scores (62.4 ± 15% and 29 ± 28%, respectively; p > 0.05). Quality and clarity assessments were significantly affected by SR type but not by assessor expertise level (p < 0.001 and p > 0.05, respectively).
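The abstract reports group-level means ± SD and p-values but does not name the statistical test used. The sketch below therefore shows only one plausible way such a comparison could be run (a Kruskal-Wallis test across the five manuscripts); the dictionary keys mirror the five SRs, and the empty score lists are placeholders, not study data.

# Illustrative only: the test choice and the data containers are assumptions.
from scipy import stats

objective_scores = {  # nine assessors' objective accuracy scores (%) per SR
    "Human-SR":   [],
    "ChatGPT3.5": [],
    "Vercel":     [],
    "Claude":     [],
    "Mistral-7b": [],
}

groups = list(objective_scores.values())
if all(groups):  # run only once the placeholder lists are filled in
    h_stat, p_value = stats.kruskal(*groups)
    print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_value:.4f}")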
Conclusions: LLM-generated content on highly technical topics is less accurate than that produced by Key Opinion Leaders. LLMs, especially ChatGPT3.5, could nonetheless improve our practice when used with human supervision.
Journal introduction:
The WORLD JOURNAL OF UROLOGY regularly conveys the essential results of urological research and their practical and clinical relevance to a broad audience of urologists in research and clinical practice. To guarantee a balanced program, articles are published that reflect developments in all fields of urology at an internationally advanced level. Each issue treats a main topic in review articles by invited international experts. Free papers are articles unrelated to the main topic.