CarD-T: Interpreting Carcinomic Lexicon via Transformers.

Jamey O'Neill, Gudur Ashrith Reddy, Nermeeta Dhillon, Osika Tripathi, Ludmil Alexandrov, Parag Katira
medRxiv : the preprint server for health sciences. Published 2024-08-31. DOI: 10.1101/2024.08.13.24311948 (https://doi.org/10.1101/2024.08.13.24311948). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11343268/pdf/

Abstract

The identification and classification of carcinogens is critical in cancer epidemiology and requires updated methods to keep pace with the rapidly growing biomedical literature. Current systems, such as those run by the International Agency for Research on Cancer (IARC) and the National Toxicology Program (NTP), are constrained by manual vetting and by disparities in carcinogen classification driven by the volume of emerging data. To address these issues, we introduce the Carcinogen Detection via Transformers (CarD-T) framework, a text analytics approach that combines transformer-based machine learning with probabilistic statistical analysis to efficiently nominate carcinogens from scientific texts. CarD-T uses Named Entity Recognition (NER) trained on PubMed abstracts featuring known carcinogens from IARC groups and includes a context classifier to enhance accuracy and manage computational demands. Using this method, we analyzed journal publication data from the last 25 years indexed with carcinogenicity and carcinogenesis Medical Subject Headings (MeSH) terms to identify potential carcinogens. When trained on 60% of established carcinogens (IARC Group 1 and 2A), CarD-T correctly identifies all of the remaining Group 1 and 2A carcinogens in the analyzed text. In addition, CarD-T nominates roughly 1500 more entities as potential carcinogens, each with at least two publications citing evidence of carcinogenicity. A comparative assessment of CarD-T against the GPT-4 model shows higher recall (0.857 vs 0.705) and F1 score (0.875 vs 0.792), with comparable precision (0.894 vs 0.903). Additionally, CarD-T highlights 554 entities with conflicting evidence for carcinogenicity. These are further analyzed using Bayesian temporal Probabilistic Carcinogenic Denomination (PCarD) to provide probabilistic evaluations of their carcinogenic status based on evolving evidence. Our findings indicate that the CarD-T framework is not only robust and effective in identifying and nominating potential carcinogens within the vast biomedical literature but also efficient on consumer GPUs. This integration of advanced NLP capabilities with epidemiological analysis enhances the agility of public health responses to carcinogen identification and sets a new benchmark for automated, scalable toxicological investigations.
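
The reported F1 scores can be checked against the reported precision and recall, since F1 is the harmonic mean of the two. The short sketch below is only an arithmetic consistency check on the figures quoted in the abstract; the function name `f1_score` and the dictionary layout are illustrative choices, not part of the CarD-T code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision, recall, and F1 exactly as quoted in the abstract.
reported = {
    "CarD-T": {"precision": 0.894, "recall": 0.857, "f1": 0.875},
    "GPT-4": {"precision": 0.903, "recall": 0.705, "f1": 0.792},
}

for model, m in reported.items():
    computed = f1_score(m["precision"], m["recall"])
    print(f"{model}: computed F1 = {computed:.3f}, reported F1 = {m['f1']:.3f}")
```

For both models the value recomputed from precision and recall agrees with the reported F1 to three decimal places.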
