将非常规建模框架 c-RASAR 中的化学相似性测量方法与降维技术一起应用于具有代表性的肝毒性数据集。

IF 3.8 2区 综合性期刊 Q1 MULTIDISCIPLINARY SCIENCES Scientific Reports Pub Date : 2024-09-06 DOI:10.1038/s41598-024-71892-4
Arkaprava Banerjee, Kunal Roy
{"title":"将非常规建模框架 c-RASAR 中的化学相似性测量方法与降维技术一起应用于具有代表性的肝毒性数据集。","authors":"Arkaprava Banerjee, Kunal Roy","doi":"10.1038/s41598-024-71892-4","DOIUrl":null,"url":null,"abstract":"<p><p>With the exponential progress in the field of cheminformatics, the conventional modeling approaches have so far been to employ supervised and unsupervised machine learning (ML) and deep learning models, utilizing the standard molecular descriptors, which represent the structural, physicochemical, and electronic properties of a particular compound. Deviating from the conventional approach, in this investigation, we have employed the classification Read-Across Structure-Activity Relationship (c-RASAR), which involves the amalgamation of the concepts of classification-based quantitative structure-activity relationship (QSAR) and Read-Across to incorporate Read-Across-derived similarity and error-based descriptors into a statistical and machine learning modeling framework. ML models developed from these RASAR descriptors use similarity-based information from the close source neighbors of a particular query compound. We have employed different classification modeling algorithms on the selected QSAR and RASAR descriptors to develop predictive models for efficient prediction of query compounds' hepatotoxicity. The predictivity of each of these models was evaluated on a large number of test set compounds. The best-performing model was also used to screen a true external data set. The concepts of explainable AI (XAI) coupled with Read-Across were used to interpret the contributions of the RASAR descriptors in the best c-RASAR model and to explain the chemical diversity in the dataset. The application of various unsupervised dimensionality reduction techniques like t-SNE and UMAP and the supervised ARKA framework showed the usefulness of the RASAR descriptors over the selected QSAR descriptors in their ability to group similar compounds, enhancing the modelability of the dataset and efficiently identifying activity cliffs. Furthermore, the activity cliffs were also identified from Read-Across by observing the nature of compounds constituting the nearest neighbors for a particular query compound. On comparing our simple linear c-RASAR model with the previously reported models developed using the same dataset derived from the US FDA Orange Book ( https://www.accessdata.fda.gov/scripts/cder/ob/index.cfm ), it was observed that our model is simple, reproducible, transferable, and highly predictive. The performance of the LDA c-RASAR model on the true external set supersedes that of the previously reported work. Therefore, the present simple LDA c-RASAR model can efficiently be used to predict the hepatotoxicity of query chemicals.</p>","PeriodicalId":21811,"journal":{"name":"Scientific Reports","volume":null,"pages":null},"PeriodicalIF":3.8000,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11379871/pdf/","citationCount":"0","resultStr":"{\"title\":\"The application of chemical similarity measures in an unconventional modeling framework c-RASAR along with dimensionality reduction techniques to a representative hepatotoxicity dataset.\",\"authors\":\"Arkaprava Banerjee, Kunal Roy\",\"doi\":\"10.1038/s41598-024-71892-4\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>With the exponential progress in the field of cheminformatics, the conventional modeling approaches have so far been to employ supervised and unsupervised machine learning (ML) and deep learning models, utilizing the standard molecular descriptors, which represent the structural, physicochemical, and electronic properties of a particular compound. Deviating from the conventional approach, in this investigation, we have employed the classification Read-Across Structure-Activity Relationship (c-RASAR), which involves the amalgamation of the concepts of classification-based quantitative structure-activity relationship (QSAR) and Read-Across to incorporate Read-Across-derived similarity and error-based descriptors into a statistical and machine learning modeling framework. ML models developed from these RASAR descriptors use similarity-based information from the close source neighbors of a particular query compound. We have employed different classification modeling algorithms on the selected QSAR and RASAR descriptors to develop predictive models for efficient prediction of query compounds' hepatotoxicity. The predictivity of each of these models was evaluated on a large number of test set compounds. The best-performing model was also used to screen a true external data set. The concepts of explainable AI (XAI) coupled with Read-Across were used to interpret the contributions of the RASAR descriptors in the best c-RASAR model and to explain the chemical diversity in the dataset. The application of various unsupervised dimensionality reduction techniques like t-SNE and UMAP and the supervised ARKA framework showed the usefulness of the RASAR descriptors over the selected QSAR descriptors in their ability to group similar compounds, enhancing the modelability of the dataset and efficiently identifying activity cliffs. Furthermore, the activity cliffs were also identified from Read-Across by observing the nature of compounds constituting the nearest neighbors for a particular query compound. On comparing our simple linear c-RASAR model with the previously reported models developed using the same dataset derived from the US FDA Orange Book ( https://www.accessdata.fda.gov/scripts/cder/ob/index.cfm ), it was observed that our model is simple, reproducible, transferable, and highly predictive. The performance of the LDA c-RASAR model on the true external set supersedes that of the previously reported work. Therefore, the present simple LDA c-RASAR model can efficiently be used to predict the hepatotoxicity of query chemicals.</p>\",\"PeriodicalId\":21811,\"journal\":{\"name\":\"Scientific Reports\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":3.8000,\"publicationDate\":\"2024-09-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11379871/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Scientific Reports\",\"FirstCategoryId\":\"103\",\"ListUrlMain\":\"https://doi.org/10.1038/s41598-024-71892-4\",\"RegionNum\":2,\"RegionCategory\":\"综合性期刊\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"MULTIDISCIPLINARY SCIENCES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Scientific Reports","FirstCategoryId":"103","ListUrlMain":"https://doi.org/10.1038/s41598-024-71892-4","RegionNum":2,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}
引用次数: 0

摘要

随着化学信息学领域的飞速发展,迄今为止,传统的建模方法一直是利用代表特定化合物结构、物理化学和电子特性的标准分子描述符,采用有监督和无监督的机器学习(ML)和深度学习模型。与传统方法不同的是,在这项研究中,我们采用了分类读交叉结构-活性关系(c-RASAR),它涉及基于分类的定量结构-活性关系(QSAR)和读交叉概念的融合,将读交叉衍生的相似性和基于误差的描述符纳入统计和机器学习建模框架。根据这些 RASAR 描述符开发的 ML 模型使用了来自特定查询化合物近源邻域的基于相似性的信息。我们在选定的 QSAR 和 RASAR 描述符上采用了不同的分类建模算法,以开发预测模型,从而有效预测查询化合物的肝毒性。在大量测试集化合物上对每个模型的预测能力进行了评估。表现最好的模型还被用于筛选真正的外部数据集。可解释人工智能(XAI)的概念与 "交叉阅读"(Read-Across)相结合,用于解释最佳 c-RASAR 模型中 RASAR 描述因子的贡献,并解释数据集中的化学多样性。各种无监督降维技术(如 t-SNE 和 UMAP)和有监督 ARKA 框架的应用表明,RASAR 描述因子比所选的 QSAR 描述因子更有用,因为它们能够对相似化合物进行分组,提高数据集的可建模性,并有效识别活性悬崖。此外,通过观察构成特定查询化合物近邻的化合物的性质,还可以从 "交叉阅读 "中识别出活性悬崖。将我们的简单线性 c-RASAR 模型与之前报道的使用源自美国 FDA 橘皮书(https://www.accessdata.fda.gov/scripts/cder/ob/index.cfm)的相同数据集开发的模型进行比较后发现,我们的模型简单、可重复、可转移且预测性高。LDA c-RASAR 模型在真实外部数据集上的表现优于之前报告的工作。因此,目前简单的 LDA c-RASAR 模型可以有效地用于预测查询化学品的肝毒性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
The application of chemical similarity measures in an unconventional modeling framework c-RASAR along with dimensionality reduction techniques to a representative hepatotoxicity dataset.

With the exponential progress in the field of cheminformatics, the conventional modeling approaches have so far been to employ supervised and unsupervised machine learning (ML) and deep learning models, utilizing the standard molecular descriptors, which represent the structural, physicochemical, and electronic properties of a particular compound. Deviating from the conventional approach, in this investigation, we have employed the classification Read-Across Structure-Activity Relationship (c-RASAR), which involves the amalgamation of the concepts of classification-based quantitative structure-activity relationship (QSAR) and Read-Across to incorporate Read-Across-derived similarity and error-based descriptors into a statistical and machine learning modeling framework. ML models developed from these RASAR descriptors use similarity-based information from the close source neighbors of a particular query compound. We have employed different classification modeling algorithms on the selected QSAR and RASAR descriptors to develop predictive models for efficient prediction of query compounds' hepatotoxicity. The predictivity of each of these models was evaluated on a large number of test set compounds. The best-performing model was also used to screen a true external data set. The concepts of explainable AI (XAI) coupled with Read-Across were used to interpret the contributions of the RASAR descriptors in the best c-RASAR model and to explain the chemical diversity in the dataset. The application of various unsupervised dimensionality reduction techniques like t-SNE and UMAP and the supervised ARKA framework showed the usefulness of the RASAR descriptors over the selected QSAR descriptors in their ability to group similar compounds, enhancing the modelability of the dataset and efficiently identifying activity cliffs. Furthermore, the activity cliffs were also identified from Read-Across by observing the nature of compounds constituting the nearest neighbors for a particular query compound. On comparing our simple linear c-RASAR model with the previously reported models developed using the same dataset derived from the US FDA Orange Book ( https://www.accessdata.fda.gov/scripts/cder/ob/index.cfm ), it was observed that our model is simple, reproducible, transferable, and highly predictive. The performance of the LDA c-RASAR model on the true external set supersedes that of the previously reported work. Therefore, the present simple LDA c-RASAR model can efficiently be used to predict the hepatotoxicity of query chemicals.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Scientific Reports
Scientific Reports Natural Science Disciplines-
CiteScore
7.50
自引率
4.30%
发文量
19567
审稿时长
3.9 months
期刊介绍: We publish original research from all areas of the natural sciences, psychology, medicine and engineering. You can learn more about what we publish by browsing our specific scientific subject areas below or explore Scientific Reports by browsing all articles and collections. Scientific Reports has a 2-year impact factor: 4.380 (2021), and is the 6th most-cited journal in the world, with more than 540,000 citations in 2020 (Clarivate Analytics, 2021). •Engineering Engineering covers all aspects of engineering, technology, and applied science. It plays a crucial role in the development of technologies to address some of the world''s biggest challenges, helping to save lives and improve the way we live. •Physical sciences Physical sciences are those academic disciplines that aim to uncover the underlying laws of nature — often written in the language of mathematics. It is a collective term for areas of study including astronomy, chemistry, materials science and physics. •Earth and environmental sciences Earth and environmental sciences cover all aspects of Earth and planetary science and broadly encompass solid Earth processes, surface and atmospheric dynamics, Earth system history, climate and climate change, marine and freshwater systems, and ecology. It also considers the interactions between humans and these systems. •Biological sciences Biological sciences encompass all the divisions of natural sciences examining various aspects of vital processes. The concept includes anatomy, physiology, cell biology, biochemistry and biophysics, and covers all organisms from microorganisms, animals to plants. •Health sciences The health sciences study health, disease and healthcare. This field of study aims to develop knowledge, interventions and technology for use in healthcare to improve the treatment of patients.
期刊最新文献
Floods of Egypt's Nile in the 21st century. Hypoxic postconditioning modulates neuroprotective glial reactivity in a 3D cortical ischemic-hypoxic injury model. A study on the changes in rice composition under reduced fertilization conditions using Raman spectroscopy technology. MethylCallR : a comprehensive analysis framework for Illumina Methylation Beadchip. Acute kidney injury predicts the risk of adverse cardio renal events and all cause death in southeast Asian people with type 2 diabetes.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1