IEEE/ACM Transactions on Computational Biology and Bioinformatics: Latest Articles, Page 10

Analyzing Large-Scale Single-Cell RNA-Seq Data Using Coreset

IF 3.6 | Zone 3 (Biology) | Q2 BIOCHEMICAL RESEARCH METHODS

IEEE/ACM Transactions on Computational Biology and Bioinformatics

Pub Date : 2024-06-24 DOI: 10.1109/TCBB.2024.3418078

Khalid Usman;Fangping Wan;Dan Zhao;Jian Peng;Jianyang Zeng

The recent boom in single-cell sequencing technologies provides valuable insights into the transcriptomes of individual cells. Through single-cell data analyses, a number of biological discoveries, such as novel cell types, developmental cell lineage trajectories, and gene regulatory networks, have been uncovered. However, the massive and rapidly accumulating single-cell datasets also pose a serious computational and analytical challenge for researchers. To address this issue, one typically applies dimensionality reduction approaches to shrink the large-scale datasets. However, these approaches are generally computationally infeasible for tall matrices. In addition, downstream data analysis tasks such as clustering still have high time complexity even on the dimension-reduced datasets. We present single-cell Coreset (scCoreset), a data summarization framework that extracts a small weighted subset of cells from a huge sparse single-cell RNA-seq dataset to facilitate downstream data analysis tasks. Single-cell data analyses run on the extracted subset yield results similar to those derived from the original uncompressed data. Tests on various single-cell datasets show that scCoreset outperforms existing data summarization approaches on common downstream tasks such as visualization and clustering. We believe that scCoreset can serve as a useful plug-in tool to improve the efficiency of current single-cell RNA-seq data analyses.
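To make the idea of a "small weighted subset of cells" concrete, here is a minimal sketch of one generic coreset construction (importance sampling by squared distance to the data mean, in the style of lightweight coresets for k-means). It is an illustration only and is not claimed to be the scCoreset procedure; the function names and the toy input are assumptions.

```python
import random


def sampling_probabilities(cells):
    """Importance q(x) = 1/(2n) + d(x, mean)^2 / (2 * sum of all d^2)."""
    n = len(cells)
    dim = len(cells[0])
    mean = [sum(c[j] for c in cells) / n for j in range(dim)]
    sq_dist = [sum((c[j] - mean[j]) ** 2 for j in range(dim)) for c in cells]
    total = sum(sq_dist)
    if total == 0:  # all cells identical: fall back to uniform sampling
        return [1.0 / n] * n
    return [0.5 / n + 0.5 * d / total for d in sq_dist]


def lightweight_coreset(cells, m, seed=0):
    """Sample m cells (with replacement) and return (indices, weights).

    Each sampled cell gets weight 1 / (m * q), so weighted sums over the
    subset are unbiased estimates of sums over the full dataset.
    """
    rng = random.Random(seed)
    q = sampling_probabilities(cells)
    idx = rng.choices(range(len(cells)), weights=q, k=m)
    weights = [1.0 / (m * q[i]) for i in idx]
    return idx, weights
```

Downstream steps such as weighted k-means would then run on the sampled cells and their weights instead of the full matrix.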

Vol. 21, No. 6, pp. 1784-1793

Citations: 0

BLAM6A-Merge: Leveraging Attention Mechanisms and Feature Fusion Strategies to Improve the Identification of RNA N6-Methyladenosine Sites

IF 3.6 | Zone 3 (Biology) | Q2 BIOCHEMICAL RESEARCH METHODS

IEEE/ACM Transactions on Computational Biology and Bioinformatics

Pub Date : 2024-06-24 DOI: 10.1109/TCBB.2024.3418490

Yunpeng Xia;Ying Zhang;Dian Liu;Yi-Heng Zhu;Zhikang Wang;Jiangning Song;Dong-Jun Yu

RNA N6-methyladenosine (m6A) is a prevalent and abundant type of RNA modification that exerts significant influence on diverse biological processes. To date, numerous computational approaches have been developed for predicting methylation, but most of them ignore the correlations between different encoding strategies and fail to explore how well various attention mechanisms suit methylation identification. To address these issues, we propose an innovative framework for predicting RNA m6A modification sites, termed BLAM6A-Merge. Specifically, it uses a multimodal feature fusion strategy to combine the classification results of four features and the Blastn tool. In addition, different attention mechanisms are employed to extract higher-level features from specific features after the screening process. Extensive experiments on 12 benchmarking datasets demonstrate that BLAM6A-Merge achieves superior performance (average AUC: 0.849 for the full transcript mode and 0.784 for the mature mRNA mode). Notably, this is the first time the Blastn tool has been employed in the identification of methylation sites.
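Combining "the classification results" of several feature-specific models can be pictured as late fusion of per-site probabilities. The sketch below shows a plain weighted average; the abstract does not say how BLAM6A-Merge weights or merges its branches, so the function name, the weighting scheme and the inputs are all assumptions.

```python
def fuse_probabilities(per_model_probs, weights=None):
    """Weighted average of per-site probabilities from several classifiers.

    per_model_probs: one list of probabilities per model, all the same length.
    weights: optional per-model weights (hypothetical); defaults to uniform.
    """
    n_models = len(per_model_probs)
    if weights is None:
        weights = [1.0] * n_models
    total = sum(weights)
    n_sites = len(per_model_probs[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, per_model_probs)) / total
        for i in range(n_sites)
    ]
```

A site would then be called methylated when its fused probability exceeds a chosen threshold such as 0.5.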

Data and code are available at https://github.com/DoraemonXia/BLAM6A-Merge.

Vol. 21, No. 6, pp. 1803-1815

Citations: 0

SeedHit: A GPU Friendly Pre-Align Filtering Algorithm

IF 3.6 | Zone 3 (Biology) | Q2 BIOCHEMICAL RESEARCH METHODS

IEEE/ACM Transactions on Computational Biology and Bioinformatics

Pub Date : 2024-06-21 DOI: 10.1109/TCBB.2024.3417517

Zhen Ju;Jingjing Zhang;Xuelei Li;Jintao Meng;Yanjie Wei

The amount of genetic data generated by Next Generation Sequencing (NGS) technologies grows faster than Moore's law. This necessitates the development of efficient NGS data processing and analysis algorithms. A filter placed before the computationally costly analysis step can significantly reduce the run time of NGS data analysis. As GPUs are orders of magnitude more powerful than CPUs, this paper proposes a GPU-friendly pre-align filtering algorithm named SeedHit for the fast processing of NGS data. Inspired by BLAST, SeedHit counts seed hits between two sequences to determine their similarity. In SeedHit, each nucleotide in a gene sequence is represented in binary format. By packing data and generating a lookup table that fits into the L1 cache, SeedHit is GPU-friendly and high-throughput. Using three 16S rRNA datasets from Greengenes as input, SeedHit can reject 84%–89% of dissimilar sequence pairs on average when the similarity is 0.9–0.99. The throughput of SeedHit reached 1 T/s (tera-bases per second) on a 3080 Ti. Compared with the other two GPU-based filtering algorithms, GateKeeper and SneakySnake, SeedHit has the highest rejection rate and throughput. By incorporating SeedHit into our in-house clustering algorithm nGIA, the modified nGIA achieved a 1.6–2.1 times speedup compared to the original version.
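The two ingredients named above, a binary encoding of nucleotides and a count of seed hits between two sequences, can be sketched on the CPU as follows. This is a simplified illustration under assumed choices (2 bits per base, seed length `k`, a `min_hits` threshold, a Python set as the lookup table); it is not the SeedHit kernel and says nothing about its GPU memory layout.

```python
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # 2 bits per nucleotide


def packed_kmers(seq, k):
    """Yield every k-mer of seq as an integer holding 2 bits per base."""
    mask = (1 << (2 * k)) - 1
    value = 0
    filled = 0
    for base in seq:
        code = ENCODE.get(base)
        if code is None:  # ambiguous base such as N: restart the window
            value, filled = 0, 0
            continue
        value = ((value << 2) | code) & mask
        filled += 1
        if filled >= k:
            yield value


def count_seed_hits(query, target, k=4):
    """Count k-mer positions in target whose k-mer also occurs in query."""
    table = set(packed_kmers(query, k))  # stands in for the lookup table
    return sum(1 for kmer in packed_kmers(target, k) if kmer in table)


def passes_filter(query, target, k=4, min_hits=3):
    """Keep a pair for full alignment only if it shares enough seeds."""
    return count_seed_hits(query, target, k) >= min_hits
```

Pairs that fail `passes_filter` would be rejected before the expensive alignment step; the threshold controls the trade-off between rejection rate and false rejections.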

Vol. 21, No. 6, pp. 1794-1802

Citations: 0

DDI Prediction With Heterogeneous Information Network - Meta-Path Based Approach

IF 3.6 | Zone 3 (Biology) | Q2 BIOCHEMICAL RESEARCH METHODS

IEEE/ACM Transactions on Computational Biology and Bioinformatics

Pub Date : 2024-06-21 DOI: 10.1109/TCBB.2024.3417715

Farhan Tanvir;Khaled Mohammed Saifuddin;Muhammad Ifte Khairul Islam;Esra Akbas

A drug-drug interaction (DDI) occurs when a drug's intended course of action is modified by taking it together with one or more other drugs. DDIs may hamper, enhance, or reduce the expected effect of either drug or, in the worst case, cause an adverse side effect. While it is crucial to identify drug-drug interactions, it is practically impossible to detect all possible DDIs for a new drug during clinical trials. Therefore, many computational methods have been proposed for this task. This paper presents a novel method based on a heterogeneous information network (HIN), which consists of drugs and other biomedical entities such as proteins, pathways, and side effects. We then extract the rich semantic relationships among these entities using different meta-path-based topological features to facilitate DDI prediction. In addition, we present a heterogeneous graph attention network-based end-to-end model for DDI prediction in the heterogeneous graph. Experimental results show that our proposed method accurately predicts DDIs and outperforms the baselines significantly.
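A meta-path-based topological feature can be as simple as counting path instances of a given type between two drugs. The sketch below counts Drug-Protein-Drug paths and normalizes them with PathSim, a commonly used meta-path similarity; which meta-paths and which measures the paper actually uses is not stated in the abstract, so the data layout and function names here are assumptions.

```python
def metapath_counts(drug_targets):
    """Count Drug-Protein-Drug path instances for every pair of drugs.

    drug_targets maps a drug name to the set of proteins it is linked to.
    """
    drugs = sorted(drug_targets)
    counts = {}
    for i, a in enumerate(drugs):
        for b in drugs[i + 1:]:
            counts[(a, b)] = len(drug_targets[a] & drug_targets[b])
    return counts


def pathsim(drug_targets, a, b):
    """PathSim for the symmetric Drug-Protein-Drug meta-path."""
    shared = len(drug_targets[a] & drug_targets[b])
    denom = len(drug_targets[a]) + len(drug_targets[b])
    return 2.0 * shared / denom if denom else 0.0
```

Features like these, computed for several meta-paths, could be concatenated into one vector per drug pair and passed to any binary classifier.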

Citations: 0

Calculation of the Weight of Evidence for Combined Single-Cell and Extracellular Forensic DNA

IF 3.6 | Zone 3 (Biology) | Q2 BIOCHEMICAL RESEARCH METHODS

IEEE/ACM Transactions on Computational Biology and Bioinformatics

Pub Date : 2024-06-19 DOI: 10.1109/TCBB.2024.3416877

Desmond S. Lun;Catherine M. Grgicak

The weight of DNA evidence for forensic applications is typically assessed through the calculation of the likelihood ratio (LR). In the standard workflow, DNA is extracted from a collection of cells where the cells of an unknown number of donors are mixed. The DNA is then genotyped, and the LR is calculated through well-established methods. Recently, a method for calculating the LR from single-cell data has been presented. Rather than extracting the DNA while the cells are still mixed, single-cell data is procured by first isolating each cell. Extraction and fragment analysis of relevant forensic loci follows such that individual cells are genotyped. This workflow leads to significantly stronger weights of evidence, but it does not account for extracellular DNA that could also be present in the sample. In this paper, we present a method for calculation of an LR that combines single-cell and extracellular data. We demonstrate the calculation on example data and show that the combined LR can lead to stronger conclusions than would be obtained from calculating LRs on the single-cell and extracellular DNA separately.
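For readers less familiar with the quantity, the likelihood ratio has a standard general form. The notation below is assumed for illustration and is not taken from the paper: E is the observed evidence, H_p and H_d are the prosecution and defense hypotheses, and the subscripts sc and ex mark single-cell and extracellular data.

```latex
\mathrm{LR} = \frac{\Pr(E \mid H_p)}{\Pr(E \mid H_d)},
\qquad
\mathrm{LR}_{\text{combined}} =
\frac{\Pr(E_{\mathrm{sc}}, E_{\mathrm{ex}} \mid H_p)}
     {\Pr(E_{\mathrm{sc}}, E_{\mathrm{ex}} \mid H_d)}
```

The combined ratio factorizes into the product of the two separate ratios only if the single-cell and extracellular data are conditionally independent under both hypotheses; how the paper models their joint probability is not stated in the abstract.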

Vol. 21, No. 6, pp. 2587-2591

Citations: 0

Intra-Inter Graph Representation Learning for Protein-Protein Binding Sites Prediction

IF 3.6 | Zone 3 (Biology) | Q2 BIOCHEMICAL RESEARCH METHODS

IEEE/ACM Transactions on Computational Biology and Bioinformatics

Pub Date : 2024-06-19 DOI: 10.1109/TCBB.2024.3416341

Wenting Zhao;Gongping Xu;Long Wang;Zhen Cui;Tong Zhang;Jian Yang

Graph neural networks have drawn increasing attention and achieved remarkable progress recently due to their potential applications to large amounts of irregular data. A graph is a natural way to represent a protein. In this work, we focus on protein-protein binding sites prediction between ligand and receptor proteins. Previous work simply adopts graph convolution to learn residue representations of the ligand and receptor proteins, then concatenates them and feeds the concatenated representation into a fully connected layer to make predictions, losing much of the information contained in complexes and failing to obtain an optimal prediction. In this paper, we present Intra-Inter Graph Representation Learning for protein-protein binding sites prediction (IIGRL). Specifically, for intra-graph learning, we maximize the mutual information between the local node representation and the global graph summary to encourage the node representation to embody the global information of the protein graph. We then explore fusing the two separate ligand and receptor graphs into a whole graph and learning affinities between their residues/nodes to propagate information to each other, which can effectively capture inter-protein information and further enhance the discrimination of residue pairs. Extensive experiments on multiple benchmarks demonstrate that the proposed IIGRL model outperforms state-of-the-art methods.
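The inter-graph step, learning affinities between ligand and receptor residues and propagating information across them, resembles cross-attention between two node sets. The sketch below uses dot-product affinities and a softmax with no learned parameters; this is an assumed stand-in for illustration, not the IIGRL layer, and the function name is hypothetical.

```python
import math


def cross_graph_messages(ligand, receptor):
    """For each ligand residue, return an affinity-weighted average of
    receptor residue features (softmax over dot-product affinities)."""
    dim = len(receptor[0])
    messages = []
    for u in ligand:
        scores = [sum(a * b for a, b in zip(u, v)) for v in receptor]
        top = max(scores)  # subtract the max for numerical stability
        exp = [math.exp(s - top) for s in scores]
        z = sum(exp)
        attn = [e / z for e in exp]
        messages.append(
            [sum(attn[j] * receptor[j][d] for j in range(len(receptor)))
             for d in range(dim)]
        )
    return messages
```

Swapping the two arguments gives the messages flowing from ligand to receptor, so each protein's residue features can be updated with information from its partner.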

Vol. 21, No. 6, pp. 1685-1696

Citations: 0

SAGCN: Using Graph Convolutional Network With Subgraph-Aware for circRNA-Drug Sensitivity Identification

IF 3.6 | Zone 3 (Biology) | Q2 BIOCHEMICAL RESEARCH METHODS

IEEE/ACM Transactions on Computational Biology and Bioinformatics

Pub Date : 2024-06-17 DOI: 10.1109/TCBB.2024.3415058

Weicheng Sun;Chengjuan Ren;Jinsheng Xu;Ping Zhang

Circular RNAs (circRNAs) play a significant role in cancer development and therapy resistance. There is substantial evidence indicating that the expression of circRNAs affects the sensitivity of cells to drugs. Identifying circRNAs-drug sensitivity association (CDA) is helpful for disease treatment and drug discovery. However, the identification of CDA through conventional biological experiments is both time-consuming and costly. Therefore, it is urgent to develop computational methods to predict CDA. In this study, we propose a new computational method, the subgraph-aware graph convolutional network (SAGCN), for predicting CDA. SAGCN first constructs a heterogeneous network composed of circRNA similarity network, drug similarity network, and circRNA-drug bipartite network. Then, a subgraph extractor is proposed to learn the latent subgraph structure of the heterogeneous network using a graph convolutional network. The extractor can capture 1-hop and 2-hop information and then a fusing attention mechanism is designed to integrate them adaptively. Simultaneously, a novel subgraph-aware attention mechanism is proposed to detect intrinsic subgraph structure. The final node feature representation is obtained to make the CDA prediction. Experimental results demonstrate that SAGCN obtained an average AUC of 0.9120 and AUPR of 0.8693, exceeding the performance of the most advanced models under 10-fold cross-validation. Case studies have demonstrated the potential of SAGCN in identifying associations between circRNA and drug sensitivity.
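The difference between 1-hop and 2-hop information can be seen by applying a normalized adjacency matrix once and twice to the node features. The sketch below does this in plain Python and fuses the two views with fixed weights; the learned weight matrices, nonlinearities and attention-based fusion of SAGCN are omitted, and all names are assumptions.

```python
def normalize_with_self_loops(adj):
    """Row-normalize the adjacency matrix after adding self-loops."""
    n = len(adj)
    a = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return [[v / sum(row) for v in row] for row in a]


def matmul(a, b):
    """Dense matrix product for lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def one_and_two_hop(adj, features):
    """Return features aggregated over 1-hop and 2-hop neighborhoods."""
    a_hat = normalize_with_self_loops(adj)
    h1 = matmul(a_hat, features)
    h2 = matmul(a_hat, h1)
    return h1, h2


def fuse(h1, h2, w1=0.5, w2=0.5):
    """Fixed-weight fusion; an attention mechanism would learn w1 and w2."""
    return [[w1 * x + w2 * y for x, y in zip(r1, r2)]
            for r1, r2 in zip(h1, h2)]
```

On a three-node path graph where only the first node carries a signal, the third node sees nothing after one hop and a nonzero value after two, which is the extra reach the 2-hop branch contributes.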

Vol. 21, No. 6, pp. 1765-1774

Citations: 0

Recursive Self-Composite Approach Toward Structural Understanding of Boolean Networks

IF 3.6 | Zone 3 (Biology) | Q2 BIOCHEMICAL RESEARCH METHODS

IEEE/ACM Transactions on Computational Biology and Bioinformatics

Pub Date : 2024-06-17 DOI: 10.1109/TCBB.2024.3415352

Jongrae Kim;Woojeong Lee;Kwang-Hyun Cho

Boolean networks have been widely used in systems biology to study the dynamical characteristics of biological networks such as steady-states or cycles, yet there has been little attention to the dynamic properties of network structures. Here, we systematically reveal the core network structures using a recursive self-composite of the logic update rules. We find that all Boolean update rules exhibit repeated cyclic logic structures, where each converged logic leads to the same states, defined as kernel states. Consequently, the period of state cycles is upper bounded by the number of logics in the converged logic cycle. In order to uncover the underlying dynamical characteristics by exploiting the repeating structures, we propose leaping and filling algorithms. The algorithms provide a way to avoid large string explosions during the self-composition procedures. Finally, we present three examples–a simple network with a long feedback structure, a T-cell receptor network and a cancer network–to demonstrate the usefulness of the proposed algorithm.
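The claim that repeated self-composition of the update rule eventually cycles, and that this cycle bounds the period of state cycles, can be checked by brute force on a small network. The sketch below stores the update rule as a truth table over all 2^n states, which is exponential in n and is not the symbolic leaping and filling approach the paper proposes; the three-node ring and all function names are assumed examples.

```python
from itertools import product


def transition_table(update, n):
    """Tabulate a synchronous update rule as a map between state indices."""
    states = list(product((0, 1), repeat=n))
    index = {s: i for i, s in enumerate(states)}
    return states, tuple(index[update(s)] for s in states)


def compose(f, g):
    """Return the map x -> f(g(x)) for maps stored as index tuples."""
    return tuple(f[j] for j in g)


def power_cycle(table):
    """Compose F with itself until a power repeats.

    Returns (first_power_in_cycle, period) of the sequence F, F^2, F^3, ...
    """
    seen = {}
    current, k = table, 1
    while current not in seen:
        seen[current] = k
        current = compose(table, current)
        k += 1
    return seen[current], k - seen[current]


def state_cycle_length(table, start):
    """Length of the state cycle reached from the given start state."""
    seen = {}
    i, t = start, 0
    while i not in seen:
        seen[i] = t
        i = table[i]
        t += 1
    return t - seen[i]


def ring(state):
    """Example: a three-node ring with a single negation in the loop."""
    x1, x2, x3 = state
    return (1 - x3, x1, x2)
```

For the `ring` example, the powers of the update map repeat with period 6, and every state lies on a cycle of length 2 or 6, so no state cycle is longer than the cycle of composed maps.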

Citations: 0

SIG: Graph-Based Cancer Subtype Stratification With Gene Mutation Structural Information

IF 3.6 | Zone 3 (Biology) | Q2 BIOCHEMICAL RESEARCH METHODS

IEEE/ACM Transactions on Computational Biology and Bioinformatics

Pub Date : 2024-06-14 DOI: 10.1109/TCBB.2024.3414498

Chengcheng Zhang;Wei Li;Ming Deng;Yizhang Jiang;Xiaohui Cui;Ping Chen

Somatic tumor genomic data are high-dimensional, sparse, and small in sample size, making cancer subtype stratification based on them a challenge. Current methods for improving cancer clustering performance focus on dimension reduction, integrating multi-omics data, or generating realistic samples, yet ignore the associations between mutated genes within the patient-gene matrix. We refer to these associations as gene mutation structural information, which implicitly includes cancer subtype information and can enhance subtype clustering. We introduce a novel method for cancer subtype clustering called SIG (Structural Information within Graph). As cancer is driven by a combination of genes, we establish associations between mutated genes within the same patient sample, pair by pair, and use a graph to represent them. An association between two mutated genes corresponds to an edge in the graph. We then merge these associations among all mutated genes to obtain a structural information graph, which enriches the gene network and improves its relevance to cancer clustering. We integrate the somatic tumor genome with the enriched gene network and propagate it to cluster patients with mutations in similar network regions. Our method achieves superior clustering performance compared to SOTA methods, as demonstrated by clustering experiments on ovarian and LUAD datasets.
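The pairwise construction described above, one edge per pair of genes mutated in the same patient and then merged across patients, can be sketched in a few lines. Using the number of supporting patients as the edge weight is an assumption made here for illustration, as are the function name and the toy gene sets; the abstract does not specify how merged edges are weighted.

```python
from collections import Counter
from itertools import combinations


def structural_information_graph(patient_mutations):
    """Merge per-patient gene pairs into one weighted edge list.

    patient_mutations maps a patient id to the set of mutated genes.
    Returns a Counter keyed by alphabetically ordered gene pairs.
    """
    edges = Counter()
    for genes in patient_mutations.values():
        for a, b in combinations(sorted(genes), 2):
            edges[(a, b)] += 1
    return edges
```

The resulting edges could then be added to an existing gene interaction network before network propagation.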

体细胞肿瘤具有高维、稀疏和样本量小的特点，因此基于体细胞基因组数据进行癌症亚型分层是一项挑战。目前提高癌症聚类性能的方法主要集中在降维、整合多组学数据或生成真实样本等方面，但却忽略了患者-基因矩阵中突变基因之间的关联。我们将这些关联称为基因突变结构信息，其中隐含了癌症亚型信息，可以增强亚型聚类。我们引入了一种新的癌症亚型聚类方法，称为 SIG（图内结构信息）。由于癌症是由基因组合驱动的，因此我们在同一患者样本中逐一建立突变基因之间的关联，并用图来表示它们。两个突变基因之间的关联对应于图中的一条边。然后，我们合并所有突变基因之间的关联，得到一个结构信息图，从而丰富基因网络，提高其与癌症聚类的相关性。我们将体细胞肿瘤基因组与丰富的基因网络整合在一起，并将其传播到相似网络区域的突变患者群中。与 SOTA 方法相比，我们的方法实现了更优越的聚类性能，卵巢和 LUAD 数据集的聚类实验证明了这一点。代码见 https://github.com/ChangSIG/SIG.git。

Volume 21, Issue 6, Pages 1752-1764. Publication type: Journal Article.

Citations: 0

PRFold-TNN: Protein Fold Recognition With an Ensemble Feature Selection Method Using PageRank Algorithm Based on Transformer

IF 3.6 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS

IEEE/ACM Transactions on Computational Biology and Bioinformatics

Pub Date : 2024-06-14 DOI: 10.1109/TCBB.2024.3414497

Xinyi Qin;Lu Zhang;Min Liu;Guangzhong Liu

Understanding the tertiary structures of proteins is of great benefit to the study of their function and to many aspects of human life. Protein fold recognition is a vital means of determining protein structure. Researchers have proposed a variety of methods for protein fold recognition, but novel and effective computational methods are still needed as protein structure databases continue to be updated. In this study, we develop a new protein structure dataset named AT and propose the PRFold-TNN model for protein fold recognition. First, several feature extraction methods, including AAC, HMM, HMM-Bigram and ACC, are used to extract features from protein sequences. Then, an ensemble feature selection method based on the PageRank algorithm, which integrates various tree-based algorithms, is used to screen the fused features. Finally, a classifier based on the Transformer model produces the prediction. Experiments show a prediction accuracy of 86.27% on the AT dataset and 88.91% on the independent test set, indicating that the model offers superior performance and generalization ability in protein fold recognition. On the DD, EDD and TG benchmark datasets, the model achieves prediction accuracies of 88.41%, 97.91% and 95.16%, respectively, which are at least 3.0%, 0.8% and 2.5% higher than those of state-of-the-art methods. We conclude that the PRFold-TNN model compares favorably with existing methods.
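As an illustration of the first feature extraction step, the following minimal sketch computes an amino acid composition vector. It assumes that AAC denotes amino acid composition, as is usual in fold recognition work; the abstract does not expand the acronym. The function name `aac`, the residue ordering, and the example sequence are hypothetical and are not drawn from the PRFold-TNN implementation.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(sequence):
    """Return the 20-dimensional amino acid composition of a protein sequence.

    Each entry is the fraction of standard residues in the sequence that are
    the corresponding amino acid. Non-standard characters are ignored.
    """
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in sequence.upper():
        if residue in counts:
            counts[residue] += 1
    total = sum(counts.values())
    if total == 0:
        # No standard residues: return an all-zero vector instead of dividing by zero.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Made-up sequence, for illustration only.
features = aac("MKVLAAGLK")
print(len(features), round(sum(features), 6))
```

Vectors of this kind would be concatenated with the other feature types before any feature selection. The PageRank-based selection itself is not sketched here because the abstract does not describe how the feature graph is constructed.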

Volume 21, Issue 6, Pages 1740-1751. Publication type: Journal Article.

Citations: 0