Transformer-Based Tool for Automated Fact-Checking of Online Health Information: Development Study
Azadeh Bayani, Alexandre Ayotte, Jean Noel Nikiema
JMIR Infodemiology, e56831. Published 2025-02-21. DOI: 10.2196/56831
Abstract
Background: Many people seek health-related information online. The potential dangers of misinformation have made the need for reliable information particularly evident, yet discerning true, reliable information from false information has become increasingly challenging.
Objective: This pilot study introduced a novel approach to automating the fact-checking process, leveraging PubMed resources as a source of truth and using natural language processing transformer models to enhance the process.
Methods: A total of 538 health-related web pages, covering 7 different disease subjects, were manually selected by Factually Health Company. The process included the following steps: (1) transformer models, namely bidirectional encoder representations from transformers (BERT), BioBERT, and SciBERT, as well as the traditional models of random forests and support vector machines, were used to classify the contents of web pages into 3 thematic categories (semiology, epidemiology, and management); (2) for each category in a web page, a PubMed query was automatically produced using a combination of the "WellcomeBertMesh" and "KeyBERT" models; (3) the 20 most related articles were automatically retrieved from PubMed; and finally, (4) the similarity-checking techniques of cosine similarity and Jaccard distance were applied to compare the content of the retrieved literature with that of the web pages.
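Step 2 of the pipeline turns model-extracted keywords into a PubMed search. As a minimal, hedged sketch of how such a query could be assembled and sent to NCBI's public E-utilities `esearch` endpoint (the keyword list here is a hand-written stand-in for the output of the WellcomeBertMesh and KeyBERT models used in the study; the exact query syntax the authors used is not given in the abstract):

```python
from urllib.parse import urlencode

def build_pubmed_query(keywords):
    """Combine extracted keywords into one PubMed search expression.

    `keywords` stands in for the terms a keyword-extraction model would
    return; this is an illustrative assumption, not the authors' code.
    """
    core = " AND ".join(f'"{k}"' for k in keywords)
    # Restrict to the article types reported in the study
    # (systematic reviews and meta-analyses).
    return core + " AND (systematic review[pt] OR meta-analysis[pt])"

def esearch_url(query, retmax=20):
    """Build an NCBI E-utilities esearch URL returning the top `retmax` PMIDs."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    return base + "?" + urlencode({"db": "pubmed", "term": query, "retmax": retmax})

# Example: one query for the "epidemiology" category of a hypothetical page.
q = build_pubmed_query(["type 2 diabetes", "epidemiology"])
url = esearch_url(q)
```

Fetching `url` (e.g., with `urllib.request`) would return the PMIDs of the 20 most relevant matches, mirroring step 3 of the pipeline.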
Results: The BERT model for the categorization of web page contents performed well, with F1-scores and recall of 93% and 94% for semiology and epidemiology, respectively, and 96% for both recall and F1-score for management. For each of the 3 categories in a web page, 1 PubMed query was generated, and for each query, the 20 most related open-access articles within the category of systematic reviews and meta-analyses were retrieved. Less than 10% of the retrieved articles were irrelevant; those were deleted. For each web page, an average of 23% of the sentences were found to be very similar to the literature. Moreover, during the evaluation, cosine similarity outperformed the Jaccard distance measure when comparing the similarity between sentences from web pages and academic papers vectorized by BERT. However, false positives were a significant issue in the retrieved sentences: some sentence pairs had a similarity score exceeding 80% yet could not be considered similar.
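The two similarity measures compared in the study differ in what they operate on: cosine similarity compares dense embedding vectors (here, BERT sentence vectors), while Jaccard distance compares token sets. A minimal sketch of both, on toy inputs (the vectors and tokens are illustrative, not the study's data):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def jaccard_distance(tokens_a, tokens_b):
    """1 minus the Jaccard index of two token sets (0.0 = identical sets)."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0
```

Cosine similarity on BERT vectors can credit paraphrases with little word overlap, which is one plausible reason it outperformed Jaccard distance here; the flip side, as the abstract notes, is high-scoring false positives, since directional closeness in embedding space does not guarantee factual equivalence.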
Conclusions: In this pilot study, we have proposed an approach to automate the fact-checking of health-related online information. Incorporating content from PubMed or other scientific article databases as trustworthy resources can automate the discovery of similar, credible information in the health domain.