Compact Class-conditional Attribute Category Clustering: Amino Acid Grouping for Enhanced HIV-1 Protease Cleavage Classification.

IF 3.6 · Zone 3, Biology · JCR Q2 (Biochemical Research Methods) · IEEE/ACM Transactions on Computational Biology and Bioinformatics · Pub Date: 2024-08-23 · DOI: 10.1109/TCBB.2024.3448617
Jose A Saez, J Fernando Vera
Citations: 0

Abstract

Categorical attributes are common in many classification tasks and become harder to handle as the number of categories grows. A large number of categories can complicate data handling, increasing the building time and complexity of models and, ultimately, degrading their classification performance. To mitigate these issues, this research proposes a novel preprocessing technique for grouping attribute categories in classification datasets. The approach combines the exact representation of the association between categorical values in a Euclidean space, clustering methods and attribute quality metrics to group similar attribute categories based on their contribution to the classification task. To assess its effectiveness, the proposal is evaluated in the context of HIV-1 protease cleavage site prediction, where each attribute represents an amino acid that can take multiple possible values. The results obtained on real-world HIV-1 datasets show a significant reduction in the number of categories per attribute, with an average reduction percentage ranging from 74% to 81%. This reduction leads to simpler data representations and improved classification performance compared with no preprocessing. Specifically, improvements of up to 0.07 in accuracy and 0.19 in geometric mean are observed across different datasets and classification algorithms. Additionally, extensive simulations on synthetic datasets with varied characteristics are carried out, providing consistent and reliable results that validate the robustness of the proposal. These findings highlight the capability of the developed method to enhance cleavage prediction, which could potentially contribute to understanding viral processes and developing targeted therapeutic strategies.
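
The general idea of grouping categories by their relation to the class can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not specify the Euclidean representation, the clustering method or the quality metric, so the code below assumes a simple stand-in (each category is represented by its vector of class proportions, and categories are merged by complete-linkage agglomerative clustering). The function names `class_profiles`, `group_categories` and `geometric_mean`, and the toy residues and labels, are hypothetical. The geometric mean is computed here as the square root of sensitivity times specificity, the usual definition for binary tasks.

```python
from collections import Counter, defaultdict
from math import dist, sqrt


def class_profiles(values, labels):
    """Map each category to its vector of class proportions P(class | category)."""
    classes = sorted(set(labels))
    counts = defaultdict(Counter)
    for v, y in zip(values, labels):
        counts[v][y] += 1
    return {
        v: tuple(c[k] / sum(c.values()) for k in classes)
        for v, c in counts.items()
    }


def group_categories(values, labels, n_groups):
    """Merge categories with similar class profiles until n_groups remain.

    Assumption: complete linkage on Euclidean distance between profiles,
    used as a stand-in for the clustering step described in the abstract.
    """
    profiles = class_profiles(values, labels)
    clusters = [[v] for v in sorted(profiles)]

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(dist(profiles[x], profiles[y]) for x in a for y in b)

    while len(clusters) > n_groups:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return {v: g for g, members in enumerate(clusters) for v in members}


def geometric_mean(y_true, y_pred, positive=1):
    """Square root of sensitivity times specificity (binary labels).

    Requires at least one positive and one negative example in y_true.
    """
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    pos = sum(t == positive for t in y_true)
    neg = len(y_true) - pos
    return sqrt((tp / pos) * (tn / neg))


# Toy data (hypothetical): residues observed at one position and cleavage labels.
residues = ["A", "A", "G", "G", "L", "L", "F", "F"]
cleaved = [0, 0, 0, 0, 1, 1, 1, 1]

mapping = group_categories(residues, cleaved, n_groups=2)
n_before = len(set(residues))
n_after = len(set(mapping.values()))
reduction_pct = 100 * (n_before - n_after) / n_before
print(mapping, reduction_pct)
```

In this toy case, categories with identical class profiles end up in the same group, so four residue categories collapse into two. The reduction percentages reported in the abstract (74% to 81%) come from the authors' experiments and are not reproduced by this sketch.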

Source journal
CiteScore: 7.50
Self-citation rate: 6.70%
Articles published: 479
Review time: 3 months
Journal description: IEEE/ACM Transactions on Computational Biology and Bioinformatics emphasizes the algorithmic, mathematical, statistical and computational methods that are central in bioinformatics and computational biology; the development and testing of effective computer programs in bioinformatics; the development of biological databases; important biological results that are obtained from the use of these methods, programs and databases; and the emerging field of Systems Biology, where many forms of data are used to create a computer-based model of a complex biological system.
Latest articles from this journal
iAnOxPep: a machine learning model for the identification of anti-oxidative peptides using ensemble learning.
DeepLigType: Predicting Ligand Types of Protein-Ligand Binding Sites Using a Deep Learning Model.
Performance Comparison between Deep Neural Network and Machine Learning based Classifiers for Huntington Disease Prediction from Human DNA Sequence.
AI-based Computational Methods in Early Drug Discovery and Post Market Drug Assessment: A Survey.
Enhancing Single-Cell RNA-seq Data Completeness with a Graph Learning Framework.