Hybrid Class Balancing Approach for Chemical Compound Toxicity Prediction.

Felipe Santiago-Gonzalez, Jose L Martinez-Rodriguez, Carlos García-Perez, Alfredo Juárez-Saldivar, Hugo E Camacho-Cruz
Journal: Current Computer-Aided Drug Design
Publication type: Journal Article
Published: 2024-09-24
DOI: 10.2174/0115734099315538240909101737 (https://doi.org/10.2174/0115734099315538240909101737)

Abstract

Introduction: Computational methods are crucial for efficient and cost-effective drug toxicity prediction. Unfortunately, the data used for prediction is often imbalanced, resulting in biased models that favor the majority class. This paper proposes an approach to apply a hybrid class balancing technique and evaluate its performance on computational models for toxicity prediction in Tox21 datasets.

Methods: The process begins by converting chemical compound structures (SMILES strings) from various bioassay datasets into molecular descriptors that can be processed by algorithms. Subsequently, Undersampling and Oversampling techniques are applied to the training data under two different schemes. In the first scheme (Individual), only one balancing technique (Oversampling or Undersampling) is used. In the second scheme (Hybrid), the training data is divided according to a ratio (e.g., 90-10), and a different balancing technique is applied to each portion. We considered eight resampling techniques (four Oversampling and four Undersampling), six molecular descriptors (based on MACCS, ECFP, and Mordred), and five classification models (KNN, MLP, RF, XGB, and SVM) over 10 bioassay datasets to determine the configurations that yield the best performance.
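
The Hybrid scheme described above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. It assumes binary labels (0 = majority, 1 = minority), a stratified 90-10 split of the training data, random undersampling (RUS) on the larger portion, and a simplified SMOTE-style interpolation on the smaller portion. All function names (`split_by_class`, `random_undersample`, `smote_like_oversample`, `hybrid_balance`) and the toy data are hypothetical.

```python
import math
import random


def split_by_class(y):
    """Return (minority_indices, majority_indices) for binary 0/1 labels."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    return (pos, neg) if len(pos) <= len(neg) else (neg, pos)


def random_undersample(X, y, rng):
    """RUS: keep all minority rows and an equally sized random majority subset."""
    minority, majority = split_by_class(y)
    keep = sorted(minority + rng.sample(majority, len(minority)))
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_like_oversample(X, y, rng, k=5):
    """Simplified SMOTE: add points interpolated between minority neighbours."""
    minority, majority = split_by_class(y)
    X_out, y_out = list(X), list(y)
    if not minority:
        return X_out, y_out
    for _ in range(len(majority) - len(minority)):
        i = rng.choice(minority)
        # k nearest minority neighbours of row i (Euclidean distance)
        neighbours = sorted(
            (j for j in minority if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )[:k]
        j = rng.choice(neighbours) if neighbours else i
        gap = rng.random()
        X_out.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
        y_out.append(y[i])
    return X_out, y_out


def hybrid_balance(X, y, ratio=0.9, seed=0):
    """Split the training data by `ratio`, then balance each portion differently."""
    rng = random.Random(seed)
    part_a, part_b = [], []
    for label in (0, 1):  # stratified split so both portions contain both classes
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        cut = round(len(idx) * ratio)
        part_a += idx[:cut]
        part_b += idx[cut:]
    Xa, ya = random_undersample([X[i] for i in part_a], [y[i] for i in part_a], rng)
    Xb, yb = smote_like_oversample([X[i] for i in part_b], [y[i] for i in part_b], rng)
    return Xa + Xb, ya + yb


# Toy data (made up): 80 majority rows and 20 minority rows with two features.
X = [[float(i), float(i % 7)] for i in range(100)]
y = [0] * 80 + [1] * 20
X_bal, y_bal = hybrid_balance(X, y)
print(len(y_bal), sum(y_bal))  # 52 rows in total, 26 of them minority
```

In this toy run the 90% portion (72 majority, 18 minority) is reduced to 18 rows per class, and the 10% portion (8 majority, 2 minority) is raised to 8 rows per class. In practice, a maintained resampling library would likely be preferable to hand-written routines like these.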

Results: We defined three testing scenarios: without balancing techniques (baseline), Individual, and Hybrid. We found that using the ENN technique in the MACCS-MLP combination resulted in a 10.01% improvement in performance. The increase for ECFP6-2048 was 16.47% after incorporating a combination of the SMOTE (10%) and RUS (90%) techniques. Meanwhile, using the same combination of techniques, MORDRED-XGB showed the most significant increase in performance, achieving a 22.62% improvement.

Conclusion: Integrating any of the class balancing schemes resulted in a minimum of 10.01% improvement in prediction performance compared to the best baseline configuration. In this study, Undersampling techniques were more appropriate due to the significant overlap among samples. By eliminating specific samples from the predominant class that are close to the minority class, this overlap is greatly reduced.
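
The overlap-reduction idea in the conclusion can be sketched with an Edited Nearest Neighbours (ENN) style filter. This is a minimal illustration under stated assumptions, not the configuration used in the study: it uses Euclidean distance, k = 3, and a majority vote among neighbours, and it only ever removes majority-class rows. The function name `edited_nearest_neighbours` and the one-dimensional toy data are hypothetical.

```python
import math


def edited_nearest_neighbours(X, y, majority_label=0, k=3):
    """Drop majority-class rows whose k nearest neighbours mostly disagree with them."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != majority_label:
            keep.append(i)  # minority rows are never removed
            continue
        # indices of the k rows closest to row i, excluding row i itself
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )[:k]
        agree = sum(1 for j in neighbours if y[j] == yi)
        if agree * 2 > len(neighbours):  # most neighbours share the label
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (made up): one majority point, 5.0, sits inside the minority cluster.
X = [[0.0], [0.1], [0.2], [0.3], [5.0], [4.9], [5.1], [5.2]]
y = [0, 0, 0, 0, 0, 1, 1, 1]
X_res, y_res = edited_nearest_neighbours(X, y)
print(len(X_res), [5.0] in X_res)  # 7 False
```

The majority point at 5.0 is surrounded by minority points, so it is removed, while the four majority points near 0 are kept. This is the sense in which removing majority samples close to the minority class reduces overlap.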
