Compact Class-Conditional Attribute Category Clustering: Amino Acid Grouping for Enhanced HIV-1 Protease Cleavage Classification

IF 3.4 3区生物学 Q2 BIOCHEMICAL RESEARCH METHODS IEEE/ACM Transactions on Computational Biology and Bioinformatics Pub Date : 2024-08-23 DOI:10.1109/TCBB.2024.3448617

José A. Sáez;J. Fernando Vera

{"title":"Compact Class-Conditional Attribute Category Clustering: Amino Acid Grouping for Enhanced HIV-1 Protease Cleavage Classification","authors":"José A. Sáez;J. Fernando Vera","doi":"10.1109/TCBB.2024.3448617","DOIUrl":null,"url":null,"abstract":"Categorical attributes are common in many classification tasks, presenting certain challenges as the number of categories grows. This situation can affect data handling, negatively impacting the building time of models, their complexity and, ultimately, their classification performance. In order to mitigate these issues, this research proposes a novel preprocessing technique for grouping attribute categories in classification datasets. This approach combines the exact representation of the association between categorical values in a Euclidean space, clustering methods and attribute quality metrics to group similar attribute categories based on their contribution to the classification task. To estimate its effectiveness, the proposal is evaluated within the context of HIV-1 protease cleavage site prediction, where each attribute represents an amino acid that can take multiple possible values. The results obtained on HIV-1 real-world datasets show a significant reduction in the number of categories per attribute, with an average reduction percentage ranging from 74% to 81%. This reduction leads to simplified data representations and improved classification performances compared to not preprocessing. Specifically, improvements of up to 0.07 in accuracy and 0.19 in geometric mean are observed across different datasets and classification algorithms. Additionally, extensive simulations on synthetic datasets with varied characteristics are carried out, providing consistent and reliable results that validate the robustness of the proposal. These findings highlight the capability of the developed method to enhance cleavage prediction, which could potentially contribute to understanding viral processes and developing targeted therapeutic strategies.","PeriodicalId":13344,"journal":{"name":"IEEE/ACM Transactions on Computational Biology and Bioinformatics","volume":"21 6","pages":"2167-2178"},"PeriodicalIF":3.4000,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10645313","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE/ACM Transactions on Computational Biology and Bioinformatics","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10645313/","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}

引用次数: 0

Abstract

Categorical attributes are common in many classification tasks, presenting certain challenges as the number of categories grows. This situation can affect data handling, negatively impacting the building time of models, their complexity and, ultimately, their classification performance. In order to mitigate these issues, this research proposes a novel preprocessing technique for grouping attribute categories in classification datasets. This approach combines the exact representation of the association between categorical values in a Euclidean space, clustering methods and attribute quality metrics to group similar attribute categories based on their contribution to the classification task. To estimate its effectiveness, the proposal is evaluated within the context of HIV-1 protease cleavage site prediction, where each attribute represents an amino acid that can take multiple possible values. The results obtained on HIV-1 real-world datasets show a significant reduction in the number of categories per attribute, with an average reduction percentage ranging from 74% to 81%. This reduction leads to simplified data representations and improved classification performances compared to not preprocessing. Specifically, improvements of up to 0.07 in accuracy and 0.19 in geometric mean are observed across different datasets and classification algorithms. Additionally, extensive simulations on synthetic datasets with varied characteristics are carried out, providing consistent and reliable results that validate the robustness of the proposal. These findings highlight the capability of the developed method to enhance cleavage prediction, which could potentially contribute to understanding viral processes and developing targeted therapeutic strategies.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

紧凑型类条件属性类别聚类：用于增强 HIV-1 蛋白酶裂解分类的氨基酸分组。

分类属性在许多分类任务中都很常见，随着分类数量的增加，会带来一定的挑战。这种情况会影响数据处理，对模型的构建时间、复杂性以及最终的分类性能产生负面影响。为了缓解这些问题，本研究提出了一种新颖的预处理技术，用于对分类数据集中的属性类别进行分组。这种方法结合了欧几里得空间中分类值之间关联的精确表示、聚类方法和属性质量度量，根据相似属性类别对分类任务的贡献对其进行分组。为了评估其有效性，我们在 HIV-1 蛋白酶裂解位点预测的背景下对该建议进行了评估，其中每个属性代表一个氨基酸，可以有多种可能的值。在 HIV-1 真实世界数据集上获得的结果显示，每个属性的类别数量显著减少，平均减少比例为 74% 至 81%。与不进行预处理相比，这种减少导致了数据表示的简化和分类性能的提高。具体来说，不同数据集和分类算法的准确率和几何平均数分别提高了 0.07 和 0.19。此外，还在具有不同特征的合成数据集上进行了大量模拟，得出了一致可靠的结果，验证了该建议的稳健性。这些发现凸显了所开发的方法在增强裂解预测方面的能力，这可能有助于理解病毒过程和开发有针对性的治疗策略。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

IEEE/ACM Transactions on Computational Biology and Bioinformatics 工程技术-计算机：跨学科应用

CiteScore

7.50

自引率

6.70%

发文量

479

审稿时长

3 months

期刊介绍： IEEE/ACM Transactions on Computational Biology and Bioinformatics emphasizes the algorithmic, mathematical, statistical and computational methods that are central in bioinformatics and computational biology; the development and testing of effective computer programs in bioinformatics; the development of biological databases; and important biological results that are obtained from the use of these methods, programs and databases; the emerging field of Systems Biology, where many forms of data are used to create a computer-based model of a complex biological system