Comparative analysis of BERT-based and generative large language models for detecting suicidal ideation: a performance evaluation study.

IF 1.9 | Medicine, Zone 4 | Q3 Public, Environmental & Occupational Health | Cadernos de saude publica | Pub Date: 2024-11-25 | eCollection Date: 2024-01-01 | DOI: 10.1590/0102-311XEN028824
Adonias Caetano de Oliveira, Renato Freitas Bessa, Ariel Soares Teles
{"title":"基于 BERT 和生成式大语言模型检测自杀意念的比较分析:性能评估研究。","authors":"Adonias Caetano de Oliveira, Renato Freitas Bessa, Ariel Soares Teles","doi":"10.1590/0102-311XEN028824","DOIUrl":null,"url":null,"abstract":"<p><p>Artificial intelligence can detect suicidal ideation manifestations in texts. Studies demonstrate that BERT-based models achieve better performance in text classification problems. Large language models (LLMs) answer free-text queries without being specifically trained. This work aims to compare the performance of three variations of BERT models and LLMs (Google Bard, Microsoft Bing/GPT-4, and OpenAI ChatGPT-3.5) for identifying suicidal ideation from nonclinical texts written in Brazilian Portuguese. A dataset labeled by psychologists consisted of 2,691 sentences without suicidal ideation and 1,097 with suicidal ideation, of which 100 sentences were selected for testing. We applied data preprocessing techniques, hyperparameter optimization, and hold-out cross-validation for training and testing BERT models. When evaluating LLMs, we used zero-shot prompting engineering. Each test sentence was labeled if it contained suicidal ideation, according to the chatbot's response. Bing/GPT-4 achieved the best performance, with 98% across all metrics. Fine-tuned BERT models outperformed the other LLMs: BERTimbau-Large performed the best with a 96% accuracy, followed by BERTimbau-Base with 94%, and BERT-Multilingual with 87%. Bard performed the worst with 62% accuracy, whereas ChatGPT-3.5 achieved 81%. The high recall capacity of the models suggests a low misclassification rate of at-risk patients, which is crucial to prevent missed interventions by professionals. However, despite their potential in supporting suicidal ideation detection, these models have not been validated in a patient monitoring clinical setting. Therefore, caution is advised when using the evaluated models as tools to assist healthcare professionals in detecting suicidal ideation.</p>","PeriodicalId":9398,"journal":{"name":"Cadernos de saude publica","volume":"40 10","pages":"e00028824"},"PeriodicalIF":1.9000,"publicationDate":"2024-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Comparative analysis of BERT-based and generative large language models for detecting suicidal ideation: a performance evaluation study.\",\"authors\":\"Adonias Caetano de Oliveira, Renato Freitas Bessa, Ariel Soares Teles\",\"doi\":\"10.1590/0102-311XEN028824\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Artificial intelligence can detect suicidal ideation manifestations in texts. Studies demonstrate that BERT-based models achieve better performance in text classification problems. Large language models (LLMs) answer free-text queries without being specifically trained. This work aims to compare the performance of three variations of BERT models and LLMs (Google Bard, Microsoft Bing/GPT-4, and OpenAI ChatGPT-3.5) for identifying suicidal ideation from nonclinical texts written in Brazilian Portuguese. A dataset labeled by psychologists consisted of 2,691 sentences without suicidal ideation and 1,097 with suicidal ideation, of which 100 sentences were selected for testing. We applied data preprocessing techniques, hyperparameter optimization, and hold-out cross-validation for training and testing BERT models. When evaluating LLMs, we used zero-shot prompting engineering. 
Each test sentence was labeled if it contained suicidal ideation, according to the chatbot's response. Bing/GPT-4 achieved the best performance, with 98% across all metrics. Fine-tuned BERT models outperformed the other LLMs: BERTimbau-Large performed the best with a 96% accuracy, followed by BERTimbau-Base with 94%, and BERT-Multilingual with 87%. Bard performed the worst with 62% accuracy, whereas ChatGPT-3.5 achieved 81%. The high recall capacity of the models suggests a low misclassification rate of at-risk patients, which is crucial to prevent missed interventions by professionals. However, despite their potential in supporting suicidal ideation detection, these models have not been validated in a patient monitoring clinical setting. Therefore, caution is advised when using the evaluated models as tools to assist healthcare professionals in detecting suicidal ideation.</p>\",\"PeriodicalId\":9398,\"journal\":{\"name\":\"Cadernos de saude publica\",\"volume\":\"40 10\",\"pages\":\"e00028824\"},\"PeriodicalIF\":1.9000,\"publicationDate\":\"2024-11-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Cadernos de saude publica\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1590/0102-311XEN028824\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2024/1/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"Q3\",\"JCRName\":\"PUBLIC, ENVIRONMENTAL & OCCUPATIONAL HEALTH\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Cadernos de saude publica","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1590/0102-311XEN028824","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/1/1 0:00:00","PubModel":"eCollection","JCR":"Q3","JCRName":"PUBLIC, ENVIRONMENTAL & OCCUPATIONAL HEALTH","Score":null,"Total":0}
Citations: 0

Abstract


Artificial intelligence can detect suicidal ideation manifestations in texts. Studies demonstrate that BERT-based models achieve better performance in text classification problems. Large language models (LLMs) answer free-text queries without being specifically trained. This work aims to compare the performance of three variations of BERT models and LLMs (Google Bard, Microsoft Bing/GPT-4, and OpenAI ChatGPT-3.5) for identifying suicidal ideation in nonclinical texts written in Brazilian Portuguese. A dataset labeled by psychologists consisted of 2,691 sentences without suicidal ideation and 1,097 with suicidal ideation, of which 100 sentences were selected for testing. We applied data preprocessing techniques, hyperparameter optimization, and hold-out validation for training and testing the BERT models. When evaluating the LLMs, we used zero-shot prompt engineering: each test sentence was labeled as containing suicidal ideation or not, according to the chatbot's response. Bing/GPT-4 achieved the best performance, with 98% across all metrics. Fine-tuned BERT models outperformed the other LLMs: BERTimbau-Large performed best with 96% accuracy, followed by BERTimbau-Base with 94% and BERT-Multilingual with 87%. Bard performed the worst with 62% accuracy, whereas ChatGPT-3.5 achieved 81%. The high recall of the models suggests a low misclassification rate for at-risk patients, which is crucial to prevent missed interventions by professionals. However, despite their potential in supporting suicidal ideation detection, these models have not been validated in a clinical patient-monitoring setting. Therefore, caution is advised when using the evaluated models as tools to assist healthcare professionals in detecting suicidal ideation.
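To make the described pipeline concrete, below is a minimal sketch of how a BERTimbau model could be fine-tuned for binary suicidal-ideation classification with the Hugging Face Transformers library. This is not the authors' released code: the checkpoint name neuralmind/bert-base-portuguese-cased is the publicly available BERTimbau-Base identifier, the two-sentence dataset is a hypothetical stand-in for the psychologist-labeled corpus, and the hyperparameters are illustrative placeholders rather than the values found by the paper's hyperparameter optimization.

```python
# Minimal illustrative sketch (not the authors' code): fine-tuning BERTimbau-Base
# for binary suicidal-ideation classification with Hugging Face Transformers.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "neuralmind/bert-base-portuguese-cased"  # public BERTimbau-Base checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Hypothetical stand-in for the psychologist-labeled sentences:
# label 1 = suicidal ideation present, label 0 = absent.
train_ds = Dataset.from_dict({
    "text": [
        "exemplo de frase sem ideação suicida",
        "exemplo de frase com ideação suicida",
    ],
    "label": [0, 1],
})

def tokenize(batch):
    # Pad/truncate so the default collator can build fixed-size batches.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_ds = train_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bertimbau-suicidal-ideation",
    num_train_epochs=3,                  # illustrative; the paper tuned hyperparameters
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

Trainer(model=model, args=args, train_dataset=train_ds).train()
```

In the study's zero-shot setup, by contrast, no training was performed: each test sentence was sent to the chatbot in a prompt, and the free-text response was mapped to the same binary label.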

Source journal
Cadernos de saude publica (Medicine - Public, Environmental & Occupational Health)
CiteScore: 5.30
Self-citation rate: 7.10%
Annual article output: 356
Review time: 3-6 weeks
Journal description: Cadernos de Saúde Pública/Reports in Public Health (CSP) is a monthly journal published by the Sergio Arouca National School of Public Health, Oswaldo Cruz Foundation (ENSP/FIOCRUZ). The journal is devoted to the publication of scientific articles focusing on the production of knowledge in Public Health. CSP also aims to foster critical reflection and debate on current themes related to public policies and factors that impact populations' living conditions and health care. All articles submitted to CSP are judiciously evaluated by the Editorial Board, composed of the Editors-in-Chief and Associate Editors, respecting the diversity of approaches, objects, and methods of the different disciplines characterizing the field of Public Health. Originality, relevance, and methodological rigor are the principal characteristics considered in the editorial evaluation. The article evaluation system practiced by CSP consists of two stages.
Latest articles from this journal
Comparing diabetes prediction based on metabolic dysfunction-associated steatotic liver disease and nonalcoholic fatty liver disease: the ELSA-Brasil study.
Identifying high occurrence areas of hospitalization and mortality from respiratory diseases in the Brazilian Legal Amazon: a space-time analysis.
[Poorer countries have more pro-breastfeeding actions than rich countries: ecological study of 98 countries].
Comparative analysis of BERT-based and generative large language models for detecting suicidal ideation: a performance evaluation study.
[Analysis of the institutional capabilities of the Guatemalan Ministry of Health: democratic constraint, defunding, reforms, and model of care].