Comparative analysis of BERT-based and generative large language models for detecting suicidal ideation: a performance evaluation study.

IF 1.9 | CAS Zone 4 (Medicine) | JCR Q3, Public, Environmental & Occupational Health | Cadernos de Saúde Pública | Pub Date: 2024-11-25 | eCollection Date: 2024-01-01 | DOI: 10.1590/0102-311XEN028824
Adonias Caetano de Oliveira, Renato Freitas Bessa, Ariel Soares Teles
{"title":"Comparative analysis of BERT-based and generative large language models for detecting suicidal ideation: a performance evaluation study.","authors":"Adonias Caetano de Oliveira, Renato Freitas Bessa, Ariel Soares Teles","doi":"10.1590/0102-311XEN028824","DOIUrl":null,"url":null,"abstract":"<p><p>Artificial intelligence can detect suicidal ideation manifestations in texts. Studies demonstrate that BERT-based models achieve better performance in text classification problems. Large language models (LLMs) answer free-text queries without being specifically trained. This work aims to compare the performance of three variations of BERT models and LLMs (Google Bard, Microsoft Bing/GPT-4, and OpenAI ChatGPT-3.5) for identifying suicidal ideation from nonclinical texts written in Brazilian Portuguese. A dataset labeled by psychologists consisted of 2,691 sentences without suicidal ideation and 1,097 with suicidal ideation, of which 100 sentences were selected for testing. We applied data preprocessing techniques, hyperparameter optimization, and hold-out cross-validation for training and testing BERT models. When evaluating LLMs, we used zero-shot prompting engineering. Each test sentence was labeled if it contained suicidal ideation, according to the chatbot's response. Bing/GPT-4 achieved the best performance, with 98% across all metrics. Fine-tuned BERT models outperformed the other LLMs: BERTimbau-Large performed the best with a 96% accuracy, followed by BERTimbau-Base with 94%, and BERT-Multilingual with 87%. Bard performed the worst with 62% accuracy, whereas ChatGPT-3.5 achieved 81%. The high recall capacity of the models suggests a low misclassification rate of at-risk patients, which is crucial to prevent missed interventions by professionals. However, despite their potential in supporting suicidal ideation detection, these models have not been validated in a patient monitoring clinical setting. Therefore, caution is advised when using the evaluated models as tools to assist healthcare professionals in detecting suicidal ideation.</p>","PeriodicalId":9398,"journal":{"name":"Cadernos de saude publica","volume":"40 10","pages":"e00028824"},"PeriodicalIF":1.9000,"publicationDate":"2024-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Cadernos de saude publica","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1590/0102-311XEN028824","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/1/1 0:00:00","PubModel":"eCollection","JCR":"Q3","JCRName":"PUBLIC, ENVIRONMENTAL & OCCUPATIONAL HEALTH","Score":null,"Total":0}
Citations: 0

Abstract

Artificial intelligence can detect manifestations of suicidal ideation in texts. Studies demonstrate that BERT-based models achieve better performance in text classification problems. Large language models (LLMs) answer free-text queries without being specifically trained. This work aims to compare the performance of three variations of BERT models and LLMs (Google Bard, Microsoft Bing/GPT-4, and OpenAI ChatGPT-3.5) for identifying suicidal ideation in nonclinical texts written in Brazilian Portuguese. A dataset labeled by psychologists consisted of 2,691 sentences without suicidal ideation and 1,097 with suicidal ideation, of which 100 sentences were selected for testing. We applied data preprocessing techniques, hyperparameter optimization, and hold-out cross-validation for training and testing the BERT models. When evaluating the LLMs, we used zero-shot prompt engineering: each test sentence was labeled as containing suicidal ideation or not according to the chatbot's response. Bing/GPT-4 achieved the best performance, with 98% across all metrics. The fine-tuned BERT models outperformed the other LLMs: BERTimbau-Large performed best with 96% accuracy, followed by BERTimbau-Base with 94% and BERT-Multilingual with 87%. Bard performed the worst, with 62% accuracy, whereas ChatGPT-3.5 achieved 81%. The high recall of the models suggests a low misclassification rate for at-risk patients, which is crucial to prevent missed interventions by professionals. However, despite their potential in supporting suicidal ideation detection, these models have not been validated in a clinical patient-monitoring setting. Therefore, caution is advised when using the evaluated models as tools to assist healthcare professionals in detecting suicidal ideation.
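The abstract describes fine-tuning three BERT variants on the labeled sentences. As a rough illustration of that setup, the sketch below fine-tunes a BERTimbau checkpoint for binary classification with Hugging Face Transformers; the checkpoint name, hyperparameters, and toy data are assumptions for illustration, not the authors' exact configuration.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# BERTimbau-Base checkpoint; assumed, since the paper does not list exact model IDs.
MODEL = "neuralmind/bert-base-portuguese-cased"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

# Hypothetical labeled sentences: 1 = suicidal ideation, 0 = none.
train = Dataset.from_dict({
    "text": ["frase de exemplo sem ideação suicida",
             "frase de exemplo com ideação suicida"],
    "label": [0, 1],
})

def tokenize(batch):
    # Tokenize and pad each sentence to a fixed length.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

train = train.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bertimbau-suicidal-ideation",
    num_train_epochs=3,               # illustrative values; the paper tuned hyperparameters
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

Trainer(model=model, args=args, train_dataset=train).train()
```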

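For the chatbots, the study used zero-shot prompting: each test sentence was submitted with a single instruction (no examples) and the free-text reply was mapped to a binary label. Below is a minimal sketch of that idea using the OpenAI API; the prompt wording, model name, and client are assumptions, since the study evaluated the Bard, Bing/GPT-4, and ChatGPT-3.5 chatbot interfaces rather than this API.

```python
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

# Hypothetical zero-shot instruction in Brazilian Portuguese.
PROMPT = ("Responda apenas 'sim' ou 'não': a frase a seguir expressa "
          "ideação suicida?\n\nFrase: {sentence}")

def label_sentence(sentence: str) -> int:
    """Return 1 if the model says the sentence contains suicidal ideation, else 0."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # stand-in for the chatbots evaluated in the paper
        messages=[{"role": "user", "content": PROMPT.format(sentence=sentence)}],
        temperature=0,
    )
    answer = response.choices[0].message.content.strip().lower()
    return 1 if answer.startswith("sim") else 0
```

Predicted labels collected this way can then be compared against the psychologists' annotations to compute accuracy, recall, and the other reported metrics.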
Source journal: Cadernos de Saúde Pública (Medicine - Public, Environmental & Occupational Health)
CiteScore: 5.30
Self-citation rate: 7.10%
Articles per year: 356
Review time: 3-6 weeks
About the journal: Cadernos de Saúde Pública/Reports in Public Health (CSP) is a monthly journal published by the Sergio Arouca National School of Public Health, Oswaldo Cruz Foundation (ENSP/FIOCRUZ). The journal is devoted to the publication of scientific articles focusing on the production of knowledge in Public Health. CSP also aims to foster critical reflection and debate on current themes related to public policies and factors that impact populations' living conditions and health care. All articles submitted to CSP are judiciously evaluated by the Editorial Board, composed of the Editors-in-Chief and Associate Editors, respecting the diversity of approaches, objects, and methods of the different disciplines characterizing the field of Public Health. Originality, relevance, and methodological rigor are the principal characteristics considered in the editorial evaluation. The article evaluation system practiced by CSP consists of two stages.