RNA N6-methyladenosine (m6A) is a prevalent and abundant type of RNA modification that exerts significant influence on diverse biological processes. To date, numerous computational approaches have been developed for predicting methylation, but most of them ignore the correlations among different encoding strategies and do not explore how well various attention mechanisms suit methylation identification. To address these issues, we proposed a framework for predicting RNA m6A modification sites, termed BLAM6A-Merge. Specifically, it used a multimodal feature fusion strategy to combine the classification results of four features and the Blastn tool. In addition, different attention mechanisms were employed to extract higher-level features from specific features after a screening process. Extensive experiments on 12 benchmark datasets demonstrated that BLAM6A-Merge achieved superior performance (average AUC: 0.849 for the full transcript mode and 0.784 for the mature mRNA mode). Notably, the Blastn tool was employed for the first time in the identification of methylation sites. The data and code can be accessed at https://github.com/DoraemonXia/BLAM6A-Merge.
BLAM6A-Merge: Leveraging Attention Mechanisms and Feature Fusion Strategies to Improve the Identification of RNA N6-methyladenosine Sites. Yunpeng Xia, Ying Zhang, Dian Liu, Yi-Heng Zhu, Zhikang Wang, Jiangning Song, Dong-Jun Yu. IEEE/ACM Transactions on Computational Biology and Bioinformatics. Pub Date: 2024-06-24; DOI: 10.1109/TCBB.2024.3418490
The amount of genetic data generated by Next Generation Sequencing (NGS) technologies grows faster than Moore's law. This necessitates the development of efficient NGS data processing and analysis algorithms. A filter placed before the computationally costly analysis step can significantly reduce the run time of NGS data analysis. As GPUs are orders of magnitude more powerful than CPUs, this paper proposes a GPU-friendly pre-alignment filtering algorithm named SeedHit for the fast processing of NGS data. Inspired by BLAST, SeedHit counts seed hits between two sequences to determine their similarity. In SeedHit, each nucleotide in a gene sequence is represented in binary format. By packing the data and generating a lookup table that fits into the L1 cache, SeedHit is GPU-friendly and high-throughput. Using three 16S rRNA datasets from Greengenes as input, SeedHit can reject 84%-89% of dissimilar sequence pairs on average when the similarity threshold is 0.9-0.99. The throughput of SeedHit reached 1 T/s (tera bases per second) on a 3080 Ti GPU. Compared with two other GPU-based filtering algorithms, GateKeeper and SneakySnake, SeedHit has the highest rejection rate and throughput. By incorporating SeedHit into our in-house clustering algorithm nGIA, the modified nGIA achieved a 1.6-2.1 times speedup compared to the original version.
SeedHit: A GPU Friendly Pre-Align Filtering Algorithm. Zhen Ju, Jingjing Zhang, Xuelei Li, Jintao Meng, Yanjie Wei. IEEE/ACM Transactions on Computational Biology and Bioinformatics. Pub Date: 2024-06-21; DOI: 10.1109/TCBB.2024.3417517
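The seed-counting idea in the SeedHit abstract above can be illustrated with a minimal CPU sketch. This is not the authors' GPU implementation: the 2-bit encoding table, the seed length `k`, the `min_fraction` threshold, and all function names are assumptions chosen for illustration, and the sketch only accepts the characters A, C, G, and T.

```python
# Hypothetical sketch of seed-hit counting; not the authors' GPU code.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit code per nucleotide


def pack_seeds(seq, k=4):
    """Return the 2-bit packed integer code of every k-mer (seed) in seq."""
    mask = (1 << (2 * k)) - 1  # keeps only the last k nucleotides
    code, seeds = 0, []
    for i, base in enumerate(seq):
        code = ((code << 2) | ENCODE[base]) & mask
        if i >= k - 1:
            seeds.append(code)
    return seeds


def seed_hits(a, b, k=4):
    """Count the seeds of b that also occur in a, via a table indexed by seed code."""
    table = [False] * (4 ** k)  # small lookup table: 256 entries for k=4
    for s in pack_seeds(a, k):
        table[s] = True
    return sum(table[s] for s in pack_seeds(b, k))


def passes_filter(a, b, k=4, min_fraction=0.5):
    """Keep the pair for full alignment only if enough seeds of b hit a."""
    n_seeds = max(len(b) - k + 1, 1)
    return seed_hits(a, b, k) / n_seeds >= min_fraction
```

A pair that fails `passes_filter` would be rejected before the costly alignment step; the small, dense table is what makes this kind of lookup plausible to keep in a fast cache.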
Pub Date: 2024-06-21; DOI: 10.1109/TCBB.2024.3417715
Farhan Tanvir;Khaled Mohammed Saifuddin;Muhammad Ifte Khairul Islam;Esra Akbas
A drug-drug interaction (DDI) occurs when a particular drug's desired course of action is modified by taking it with one or more other drugs. DDIs may hamper, enhance, or reduce the expected effect of either drug or, in the worst case, cause an adverse side effect. While it is crucial to identify drug-drug interactions, it is practically impossible to detect all possible DDIs for a new drug during clinical trials. Therefore, many computational methods have been proposed for this task. This paper presents a novel method based on a heterogeneous information network (HIN), which consists of drugs and other biomedical entities such as proteins, pathways, and side effects. We extract the rich semantic relationships among these entities using different meta-path-based topological features and use them for DDI prediction. In addition, we present a heterogeneous graph attention network-based end-to-end model for DDI prediction in the heterogeneous graph. Experimental results show that our proposed method accurately predicts DDIs and outperforms the baselines significantly.
DDI Prediction With Heterogeneous Information Network - Meta-Path Based Approach. Farhan Tanvir; Khaled Mohammed Saifuddin; Muhammad Ifte Khairul Islam; Esra Akbas. IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 21, no. 5, pp. 1168-1179. Pub Date: 2024-06-21; DOI: 10.1109/TCBB.2024.3417715
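To make the notion of a meta-path-based topological feature in the DDI abstract above concrete, the following sketch counts instances of one meta-path, drug-protein-drug, on a toy network. The drug and protein names, the choice of path count as the feature, and the function names are all hypothetical; the abstract does not specify which meta-paths or features the authors use.

```python
# Hypothetical toy data: the set of proteins each drug targets.
drug_targets = {
    "drugA": {"P1", "P2"},
    "drugB": {"P2", "P3"},
    "drugC": {"P4"},
}


def metapath_count(targets, d1, d2):
    """Number of drug-protein-drug path instances linking d1 and d2."""
    return len(targets[d1] & targets[d2])


def metapath_features(targets):
    """Path counts for every unordered drug pair, keyed by (drug, drug)."""
    drugs = sorted(targets)
    return {
        (a, b): metapath_count(targets, a, b)
        for i, a in enumerate(drugs)
        for b in drugs[i + 1:]
    }
```

Each additional meta-path (for example through pathways or side effects) would contribute one more such number per drug pair, and the resulting vector could feed any standard classifier.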
Pub Date: 2024-06-19; DOI: 10.1109/TCBB.2024.3416877
Desmond S Lun, Catherine M Grgicak
The weight of DNA evidence for forensic applications is typically assessed through the calculation of the likelihood ratio (LR). In the standard workflow, DNA is extracted from a collection of cells where the cells of an unknown number of donors are mixed. The DNA is then genotyped, and the LR is calculated through well-established methods. Recently, a method for calculating the LR from single-cell data has been presented. Rather than extracting the DNA while the cells are still mixed, single-cell data is procured by first isolating each cell. Extraction and fragment analysis of relevant forensic loci follows such that individual cells are genotyped. This workflow leads to significantly stronger weights of evidence, but it does not account for extracellular DNA that could also be present in the sample. In this paper, we present a method for calculation of an LR that combines single-cell and extracellular data. We demonstrate the calculation on example data and show that the combined LR can lead to stronger conclusions than would be obtained from calculating LRs on the single-cell and extracellular DNA separately.
Calculation of the Weight of Evidence for Combined Single-Cell and Extracellular Forensic DNA. Desmond S Lun, Catherine M Grgicak. IEEE/ACM Transactions on Computational Biology and Bioinformatics. Pub Date: 2024-06-19; DOI: 10.1109/TCBB.2024.3416877
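For readers unfamiliar with the likelihood ratio mentioned in the forensic DNA abstract above, its conventional definition is written out below. The symbols are assumptions introduced here for illustration: $E$ is the observed evidence, $H_p$ the hypothesis that the person of interest is a contributor, $H_d$ the alternative hypothesis, and $E_{\mathrm{sc}}$, $E_{\mathrm{ec}}$ the single-cell and extracellular data. The second line only restates the definition for the joint evidence; it is not the authors' method for evaluating these probabilities.

```latex
\mathrm{LR} = \frac{\Pr(E \mid H_p)}{\Pr(E \mid H_d)},
\qquad
\mathrm{LR}_{\text{combined}}
  = \frac{\Pr(E_{\mathrm{sc}}, E_{\mathrm{ec}} \mid H_p)}
         {\Pr(E_{\mathrm{sc}}, E_{\mathrm{ec}} \mid H_d)}
```

Values of LR above 1 support $H_p$ over $H_d$, and values below 1 support $H_d$.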
Pub Date: 2024-06-19; DOI: 10.1109/TCBB.2024.3416341
Wenting Zhao, Gongping Xu, Long Wang, Zhen Cui, Tong Zhang, Jian Yang
Graph neural networks have drawn increasing attention and achieved remarkable progress recently due to their potential applications to large amounts of irregular data. A graph is a natural way to represent a protein. In this work, we focus on predicting protein-protein binding sites between ligand and receptor proteins. Previous work simply adopts graph convolution to learn residue representations of the ligand and receptor proteins, then concatenates them and feeds the concatenated representation into a fully connected layer to make predictions, which loses much of the information contained in complexes and fails to obtain an optimal prediction. In this paper, we present Intra-Inter Graph Representation Learning for protein-protein binding site prediction (IIGRL). Specifically, for intra-graph learning, we maximize the mutual information between the local node representation and the global graph summary to encourage node representations to embody the global information of the protein graph. We then explore fusing the two separate ligand and receptor graphs into a whole graph and learning affinities between their residues/nodes so that information propagates between them, which can effectively capture inter-protein information and further enhance the discrimination of residue pairs. Extensive experiments on multiple benchmarks demonstrate that the proposed IIGRL model outperforms state-of-the-art methods.
Intra-Inter Graph Representation Learning for Protein-Protein Binding Sites Prediction. Wenting Zhao, Gongping Xu, Long Wang, Zhen Cui, Tong Zhang, Jian Yang. IEEE/ACM Transactions on Computational Biology and Bioinformatics. Pub Date: 2024-06-19; DOI: 10.1109/TCBB.2024.3416341
Circular RNAs (circRNAs) play a significant role in cancer development and therapy resistance. There is substantial evidence that the expression of circRNAs affects the sensitivity of cells to drugs. Identifying circRNA-drug sensitivity associations (CDAs) is helpful for disease treatment and drug discovery. However, identifying CDAs through conventional biological experiments is both time-consuming and costly, so computational methods for predicting CDAs are urgently needed. In this study, we propose a new computational method, the subgraph-aware graph convolutional network (SAGCN), for predicting CDAs. SAGCN first constructs a heterogeneous network composed of a circRNA similarity network, a drug similarity network, and a circRNA-drug bipartite network. Then, a subgraph extractor is proposed to learn the latent subgraph structure of the heterogeneous network using a graph convolutional network. The extractor captures 1-hop and 2-hop information, and a fusing attention mechanism is designed to integrate them adaptively. Simultaneously, a novel subgraph-aware attention mechanism is proposed to detect intrinsic subgraph structure. The final node feature representation is then used to make the CDA prediction. Experimental results demonstrate that SAGCN obtained an average AUC of 0.9120 and AUPR of 0.8693, exceeding the performance of the most advanced models under 10-fold cross-validation. Case studies have demonstrated the potential of SAGCN in identifying associations between circRNAs and drug sensitivity.
SAGCN: Using graph convolutional network with subgraph-aware for circRNA-drug sensitivity identification. Weicheng Sun, Chengjuan Ren, Jinsheng Xu, Ping Zhang. IEEE/ACM Transactions on Computational Biology and Bioinformatics. Pub Date: 2024-06-17; DOI: 10.1109/TCBB.2024.3415058
Pub Date: 2024-06-17; DOI: 10.1109/TCBB.2024.3415352
Jongrae Kim, Woojeong Lee, Kwang-Hyun Cho
Boolean networks have been widely used in systems biology to study the dynamical characteristics of biological networks, such as steady states or cycles, yet little attention has been paid to the dynamic properties of network structures. Here, we systematically reveal the core network structures using a recursive self-composition of the logic update rules. We find that all Boolean update rules exhibit repeated cyclic logic structures, where each converged logic leads to the same states, defined as kernel states. Consequently, the period of state cycles is upper bounded by the number of logics in the converged logic cycle. To uncover the underlying dynamical characteristics by exploiting the repeating structures, we propose leaping and filling algorithms. The algorithms provide a way to avoid large string explosions during the self-composition procedures. Finally, we present three examples (a simple network with a long feedback structure, a T-cell receptor network, and a cancer network) to demonstrate the usefulness of the proposed algorithms.
Recursive Self-Composite Approach Towards Structural Understanding of Boolean Networks. Jongrae Kim, Woojeong Lee, Kwang-Hyun Cho. IEEE/ACM Transactions on Computational Biology and Bioinformatics. Pub Date: 2024-06-17; DOI: 10.1109/TCBB.2024.3415352
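The relationship between self-composition of update rules and state cycles in the Boolean network abstract above can be seen on a small example. The sketch below is a brute-force illustration only, not the authors' leaping and filling algorithms; the 3-node network and the function names are hypothetical.

```python
from itertools import product


# Hypothetical 3-node network with one negative feedback loop:
# x0 is inhibited by x2, x1 copies x0, x2 copies x1.
def update(state):
    x0, x1, x2 = state
    return (not x2, x0, x1)


def compose(f, n):
    """Return the n-fold self-composition of the update rule f."""
    def fn(state):
        for _ in range(n):
            state = f(state)
        return state
    return fn


def state_cycle(f, state):
    """Iterate f from state until a state repeats; return the cycle reached."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = f(state)
    return seen[seen.index(state):]


# Collect the cycle period reached from each of the 8 possible states.
periods = {len(state_cycle(update, s)) for s in product([False, True], repeat=3)}
```

For this toy network the periods found are 2 and 6, and the 6-fold self-composition of `update` maps every state to itself, so the composed logic has converged to a repeating structure whose length bounds every state-cycle period.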
Somatic tumor data are high-dimensional and sparse and have small sample sizes, making cancer subtype stratification based on somatic genomic data a challenge. Current methods for improving cancer clustering performance focus on dimension reduction, integrating multi-omics data, or generating realistic samples, yet ignore the associations between mutated genes within the patient-gene matrix. We refer to these associations as gene mutation structural information, which implicitly includes cancer subtype information and can enhance subtype clustering. We introduce a novel method for cancer subtype clustering called SIG (Structural Information within Graph). As cancer is driven by a combination of genes, we establish associations between mutated genes within the same patient sample, pair by pair, and use a graph to represent them. An association between two mutated genes corresponds to an edge in the graph. We then merge these associations among all mutated genes to obtain a structural information graph, which enriches the gene network and improves its relevance to cancer clustering. We integrate the somatic tumor genome with the enriched gene network and propagate it to cluster patients with mutations in similar network regions. Our method achieves superior clustering performance compared to state-of-the-art (SOTA) methods, as demonstrated by clustering experiments on ovarian and LUAD datasets. The code is available at https://github.com/ChangSIG/SIG.git.
SIG: Graph-Based Cancer Subtype Stratification With Gene Mutation Structural Information. Chengcheng Zhang, Wei Li, Ming Deng, Yizhang Jiang, Xiaohui Cui, Ping Chen. IEEE/ACM Transactions on Computational Biology and Bioinformatics. Pub Date: 2024-06-14; DOI: 10.1109/TCBB.2024.3414498
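The pairwise construction described in the SIG abstract above (one edge per pair of genes mutated in the same patient, merged over all patients) can be sketched as follows. The toy patient data, the function name, and the use of co-occurrence counts as edge weights are assumptions for illustration; the authors' repository should be consulted for the actual procedure.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy patient-gene data: patient -> set of mutated genes.
mutations = {
    "patient1": {"TP53", "KRAS", "EGFR"},
    "patient2": {"TP53", "KRAS"},
    "patient3": {"BRCA1"},
}


def structural_graph(mutations):
    """Merge pairwise co-mutation associations over all patients.

    Returns a Counter mapping each gene pair (sorted alphabetically) to the
    number of patients in which both genes are mutated.
    """
    edges = Counter()
    for genes in mutations.values():
        for g1, g2 in combinations(sorted(genes), 2):
            edges[(g1, g2)] += 1
    return edges
```

A patient with a single mutated gene, such as `patient3` here, contributes no edges, which reflects why sparse mutation profiles carry little structural information on their own.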
Pub Date: 2024-06-14; DOI: 10.1109/TCBB.2024.3414497
Xinyi Qin, Lu Zhang, Min Liu, Guangzhong Liu
Understanding the tertiary structures of proteins is of great benefit to understanding protein function, which bears on many aspects of human life. Protein fold recognition is a vital means of learning about protein structure. Researchers have proposed a variety of methods for protein fold recognition, but novel and effective computational methods are still needed as protein structure databases are continuously updated. In this study, we develop a new protein structure dataset named AT and propose the PRFold-TNN model for protein fold recognition. First, different feature extraction methods, including AAC, HMM, HMM-Bigram, and ACC, are used to extract the corresponding features from protein sequences. Then an ensemble feature selection method, based on the PageRank algorithm and integrating various tree-based algorithms, is used to screen the fused features. Finally, a classifier based on the Transformer model makes the final prediction. Experiments show that the prediction accuracy is 86.27% on the AT dataset and 88.91% on the independent test set, indicating that the model demonstrates superior performance and generalization ability in protein fold recognition. Furthermore, on the DD, EDD, and TG benchmark datasets, the model achieves prediction accuracies of 88.41%, 97.91%, and 95.16%, which are at least 3.0%, 0.8%, and 2.5% higher than those of the state-of-the-art methods. We conclude that the PRFold-TNN model outperforms existing methods on these benchmarks.
PRFold-TNN: Protein Fold Recognition With an Ensemble Feature Selection Method Using PageRank Algorithm Based on Transformer. Xinyi Qin, Lu Zhang, Min Liu, Guangzhong Liu. IEEE/ACM Transactions on Computational Biology and Bioinformatics. Publication date: 2024-06-14. DOI: 10.1109/TCBB.2024.3414497
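Of the feature types named in the PRFold-TNN abstract, amino acid composition (AAC) is the simplest to illustrate: it is commonly defined as the relative frequency of each of the 20 standard amino acids in a sequence. The following is a minimal sketch under that standard definition, not the authors' implementation; the function name `aac_features`, the handling of non-standard residues, and the example sequence are assumptions made for illustration.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature positions are stable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac_features(sequence: str) -> list[float]:
    """Return the 20-dimensional amino acid composition of a protein sequence.

    Residues outside the 20 standard letters (e.g. 'X') are ignored here;
    this is an illustrative choice, not something stated in the abstract.
    """
    seq = sequence.upper()
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # Empty or fully non-standard sequence: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical example sequence, used only to show the call.
features = aac_features("MKTAYIAKQR")
print(len(features))
```

A vector like this would typically be concatenated with the other feature types before any feature selection step; how PRFold-TNN fuses and screens its features is described only at the level of the abstract above.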
Pub Date : 2024-06-12 DOI: 10.1109/TCBB.2024.3413021
Ummu Gulsum Soylemez, Malik Yousef, Zulal Kesmen, Burcu Bakir-Gungor
Antimicrobial peptides (AMPs) have drawn the interest of researchers because they offer an alternative to traditional antibiotics in the fight against antibiotic resistance and exhibit additional pharmaceutically significant properties. Recently, computational approaches have attempted to reveal, from a machine learning perspective, how antibacterial activity is determined, and have aimed to identify the biological cues or characteristics that control antimicrobial activity by incorporating motif match scores. This study develops a machine learning framework for devising novel AMP sequences potentially effective against Gram-positive/Gram-negative bacteria. To design new sequences and classify them as either AMP or non-AMP, various classification models were trained. These novel sequences were validated using the "DBAASP: strain-specific antibacterial prediction based on machine learning approaches and data on AMP sequences" tool. The findings represent a significant step in this line of computational research, streamlining the process of AMP creation or modification in wet-lab environments.
Novel Antimicrobial Peptide Design Using Motif Match Score Representation. Ummu Gulsum Soylemez, Malik Yousef, Zulal Kesmen, Burcu Bakir-Gungor. IEEE/ACM Transactions on Computational Biology and Bioinformatics. Publication date: 2024-06-12. DOI: 10.1109/TCBB.2024.3413021