Pub Date: 2024-06-19 DOI: 10.1109/TCBB.2024.3416877
Desmond S Lun, Catherine M Grgicak
The weight of DNA evidence for forensic applications is typically assessed through the calculation of the likelihood ratio (LR). In the standard workflow, DNA is extracted from a collection of cells where the cells of an unknown number of donors are mixed. The DNA is then genotyped, and the LR is calculated through well-established methods. Recently, a method for calculating the LR from single-cell data has been presented. Rather than extracting the DNA while the cells are still mixed, single-cell data is procured by first isolating each cell. Extraction and fragment analysis of relevant forensic loci follows such that individual cells are genotyped. This workflow leads to significantly stronger weights of evidence, but it does not account for extracellular DNA that could also be present in the sample. In this paper, we present a method for calculation of an LR that combines single-cell and extracellular data. We demonstrate the calculation on example data and show that the combined LR can lead to stronger conclusions than would be obtained from calculating LRs on the single-cell and extracellular DNA separately.
Title: Calculation of the Weight of Evidence for Combined Single-Cell and Extracellular Forensic DNA. Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics.
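The likelihood ratio referred to in this abstract compares the probability of the evidence under two competing hypotheses. The sketch below covers only the simplest textbook case: one clean single-source profile, no drop-out or drop-in, and Hardy-Weinberg genotype probabilities. It is not the combined single-cell and extracellular model of the paper, and the allele frequencies and the function name `single_source_lr` are made up for illustration.

```python
def single_source_lr(genotype, allele_freqs):
    """LR for H_p (the person of interest is the donor) versus
    H_d (an unrelated random person is the donor).

    Assumes the profile matches the person of interest exactly, so
    P(E | H_p) = 1 and P(E | H_d) is the genotype frequency.
    """
    lr = 1.0
    for locus, (a, b) in genotype.items():
        p_a = allele_freqs[locus][a]
        p_b = allele_freqs[locus][b]
        # Hardy-Weinberg: p^2 for a homozygote, 2pq for a heterozygote.
        genotype_prob = p_a * p_a if a == b else 2.0 * p_a * p_b
        lr *= 1.0 / genotype_prob  # loci are treated as independent
    return lr


# Hypothetical allele frequencies, chosen only to make the arithmetic easy.
freqs = {"D3S1358": {"15": 0.25, "16": 0.25}, "TH01": {"9.3": 0.5}}
profile = {"D3S1358": ("15", "16"), "TH01": ("9.3", "9.3")}
lr = single_source_lr(profile, freqs)
print(lr)
```

With these invented frequencies the heterozygous locus contributes 1 / (2 x 0.25 x 0.25) = 8 and the homozygous locus 1 / 0.5^2 = 4, so the overall LR is 32.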
Pub Date: 2024-06-19 DOI: 10.1109/TCBB.2024.3416341
Wenting Zhao, Gongping Xu, Long Wang, Zhen Cui, Tong Zhang, Jian Yang
Graph neural networks have drawn increasing attention and achieved remarkable progress recently due to their potential applications to large amounts of irregular data. Representing a protein as a graph is natural. In this work, we focus on predicting protein-protein binding sites between ligand and receptor proteins. Previous work simply adopts graph convolution to learn residue representations of the ligand and receptor proteins, then concatenates them and feeds the concatenated representation into a fully connected layer to make predictions, losing much of the information contained in complexes and failing to obtain an optimal prediction. In this paper, we present Intra-Inter Graph Representation Learning for protein-protein binding site prediction (IIGRL). Specifically, for intra-graph learning, we maximize the mutual information between the local node representation and the global graph summary to encourage the node representation to embody the global information of the protein graph. Then we explore fusing the two separate ligand and receptor graphs into a whole graph and learning affinities between their residues/nodes to propagate information to each other, which could effectively capture inter-protein information and further enhance the discrimination of residue pairs. Extensive experiments on multiple benchmarks demonstrate that the proposed IIGRL model outperforms state-of-the-art methods.
Title: Intra-Inter Graph Representation Learning for Protein-Protein Binding Sites Prediction. Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics.
Circular RNAs (circRNAs) play a significant role in cancer development and therapy resistance. There is substantial evidence indicating that the expression of circRNAs affects the sensitivity of cells to drugs. Identifying circRNA-drug sensitivity associations (CDAs) is helpful for disease treatment and drug discovery. However, the identification of CDAs through conventional biological experiments is both time-consuming and costly. Therefore, computational methods for predicting CDAs are urgently needed. In this study, we propose a new computational method, the subgraph-aware graph convolutional network (SAGCN), for predicting CDAs. SAGCN first constructs a heterogeneous network composed of a circRNA similarity network, a drug similarity network, and a circRNA-drug bipartite network. Then, a subgraph extractor is proposed to learn the latent subgraph structure of the heterogeneous network using a graph convolutional network. The extractor captures 1-hop and 2-hop information, and a fusing attention mechanism is designed to integrate them adaptively. Simultaneously, a novel subgraph-aware attention mechanism is proposed to detect the intrinsic subgraph structure. The final node feature representation is then used to make the CDA prediction. Experimental results demonstrate that SAGCN obtained an average AUC of 0.9120 and AUPR of 0.8693, exceeding the performance of the most advanced models under 10-fold cross-validation. Case studies have demonstrated the potential of SAGCN in identifying associations between circRNA and drug sensitivity.
Title: SAGCN: Using graph convolutional network with subgraph-aware for circRNA-drug sensitivity identification. Authors: Weicheng Sun, Chengjuan Ren, Jinsheng Xu, Ping Zhang. DOI: 10.1109/TCBB.2024.3415058. Pub Date: 2024-06-17. Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics.
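The idea of capturing 1-hop and 2-hop information and fusing the two adaptively can be illustrated with a small NumPy sketch. This is not the SAGCN architecture: there are no learned weight matrices or nonlinearities, the adjacency normalization is the standard GCN one, and `score_vector` is a hypothetical stand-in for whatever attention parameters the model actually learns.

```python
import numpy as np


def normalize_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 used by standard GCN layers."""
    a_hat = adj + np.eye(adj.shape[0])  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def fuse_hops(adj, features, score_vector):
    """Propagate features over 1 and 2 hops, then mix them per node
    with softmax attention weights."""
    a_norm = normalize_adjacency(adj)
    h1 = a_norm @ features  # 1-hop neighborhood information
    h2 = a_norm @ h1        # 2-hop neighborhood information
    stacked = np.stack([h1, h2], axis=1)   # shape (nodes, 2, dim)
    logits = stacked @ score_vector        # shape (nodes, 2)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    fused = (weights[:, :, None] * stacked).sum(axis=1)  # shape (nodes, dim)
    return fused, weights


# Toy path graph with 4 nodes and random 3-dimensional features.
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
features = rng.normal(size=(4, 3))
score_vector = rng.normal(size=3)
fused, weights = fuse_hops(adj, features, score_vector)
print(fused.shape, weights.sum(axis=1))
```

Each node ends up with its own pair of weights over the two hop levels, which is what "adaptive" integration means in this toy setting.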
Pub Date: 2024-06-17 DOI: 10.1109/TCBB.2024.3415352
Jongrae Kim, Woojeong Lee, Kwang-Hyun Cho
Boolean networks have been widely used in systems biology to study the dynamical characteristics of biological networks such as steady states or cycles, yet little attention has been paid to the dynamic properties of network structures. Here, we systematically reveal the core network structures using a recursive self-composite of the logic update rules. We find that all Boolean update rules exhibit repeated cyclic logic structures, where each converged logic leads to the same states, defined as kernel states. Consequently, the period of state cycles is upper bounded by the number of logics in the converged logic cycle. In order to uncover the underlying dynamical characteristics by exploiting the repeating structures, we propose leaping and filling algorithms. The algorithms provide a way to avoid large string explosions during the self-composition procedures. Finally, we present three examples (a simple network with a long feedback structure, a T-cell receptor network, and a cancer network) to demonstrate the usefulness of the proposed algorithms.
Title: Recursive Self-Composite Approach Towards Structural Understanding of Boolean Networks. Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics.
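The link between repeated self-composition of the update rule and the period of state cycles can be checked by brute force on a toy network. The sketch below enumerates the full truth table of each composed rule, which is exactly the kind of exponential blow-up the paper's leaping and filling algorithms are meant to avoid; it does not implement those algorithms. The three-node ring rule in `update` is an invented example.

```python
from itertools import product


def update(state):
    """Synchronous update of a toy 3-node feedback ring (illustrative rules)."""
    a, b, c = state
    return (c, a, b)


def logic_cycle(func, n, max_iter=64):
    """Self-compose func until its truth table repeats.

    Returns (index of first repeated composition, period of the logic cycle).
    """
    table = tuple(product((False, True), repeat=n))  # f^0 is the identity
    seen = {}
    for k in range(max_iter + 1):
        if table in seen:
            return seen[table], k - seen[table]
        seen[table] = k
        table = tuple(func(s) for s in table)  # f^(k+1) = f applied after f^k
    raise RuntimeError("no repetition found within max_iter compositions")


def state_cycle_length(func, state):
    """Length of the state cycle eventually reached from the given state."""
    seen = {}
    k = 0
    while state not in seen:
        seen[state] = k
        state = func(state)
        k += 1
    return k - seen[state]


start, period = logic_cycle(update, 3)
lengths = {s: state_cycle_length(update, s)
           for s in product((False, True), repeat=3)}
print(start, period, sorted(set(lengths.values())))
```

For this ring the composed rules repeat with period 3, and every state cycle has length 1 or 3, consistent with the upper bound stated in the abstract.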
Somatic tumors have a high-dimensional, sparse, and small sample size nature, making cancer subtype stratification based on somatic genomic data a challenge. Current methods for improving cancer clustering performance focus on dimension reduction, integrating multi-omics data, or generating realistic samples, yet ignore the associations between mutated genes within the patient-gene matrix. We refer to these associations as gene mutation structural information, which implicitly includes cancer subtype information and can enhance subtype clustering. We introduce a novel method for cancer subtype clustering called SIG (Structural Information within Graph). As cancer is driven by a combination of genes, we establish associations between mutated genes within the same patient sample, pair by pair, and use a graph to represent them. An association between two mutated genes corresponds to an edge in the graph. We then merge these associations among all mutated genes to obtain a structural information graph, which enriches the gene network and improves its relevance to cancer clustering. We integrate the somatic tumor genome with the enriched gene network and propagate it to cluster patients with mutations in similar network regions. Our method achieves superior clustering performance compared to SOTA methods, as demonstrated by clustering experiments on ovarian and LUAD datasets. The code is available at https://github.com/ChangSIG/SIG.git.
Title: SIG: Graph-Based Cancer Subtype Stratification With Gene Mutation Structural Information. Authors: Chengcheng Zhang, Wei Li, Ming Deng, Yizhang Jiang, Xiaohui Cui, Ping Chen. DOI: 10.1109/TCBB.2024.3414498. Pub Date: 2024-06-14. Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics.
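The pairwise association step described in this abstract (two genes mutated in the same patient give an edge, and edges are merged across patients) can be sketched with a single matrix product. Weighting each edge by the number of patients in which the pair co-occurs is an assumption made here for illustration; the abstract does not say how merged edges are weighted, and the authors' repository should be consulted for the real procedure. The mutation matrix is invented.

```python
import numpy as np


def co_mutation_graph(patient_gene):
    """Build a gene-by-gene adjacency matrix from a binary patient-by-gene matrix.

    Entry (i, j) counts the patients in which genes i and j are both mutated.
    """
    m = np.asarray(patient_gene, dtype=int)
    counts = m.T @ m              # co-occurrence counts for every gene pair
    np.fill_diagonal(counts, 0)   # drop self-associations
    return counts


# Rows are patients, columns are genes; 1 marks a mutated gene.
mutations = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
])
graph = co_mutation_graph(mutations)
print(graph)
```

Genes 0 and 1 are mutated together in two patients, so their edge has weight 2, while genes 0 and 3 never co-occur and stay unconnected.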
Pub Date: 2024-06-14 DOI: 10.1109/TCBB.2024.3414497
Xinyi Qin, Lu Zhang, Min Liu, Guangzhong Liu
Understanding the tertiary structures of proteins helps explain protein function and benefits many aspects of human life. Protein fold recognition is a vital means of determining protein structure. Researchers have proposed a variety of methods for protein fold recognition, but novel and effective computational methods are still needed as protein structure databases are continuously updated. In this study, we develop a new protein structure dataset named AT and propose the PRFold-TNN model for protein fold recognition. First, different feature extraction methods, including AAC, HMM, HMM-Bigram and ACC, are used to extract the corresponding features from protein sequences. Then an ensemble feature selection method, based on the PageRank algorithm and integrating various tree-based algorithms, is used to screen the fused features. Finally, a classifier based on the Transformer model makes the final prediction. Experiments show that the prediction accuracy is 86.27% on the AT dataset and 88.91% on the independent test set, indicating that the model demonstrates superior performance and generalization ability in protein fold recognition. Furthermore, we also evaluate the model on the DD, EDD and TG benchmark datasets, where it achieves prediction accuracies of 88.41%, 97.91% and 95.16%, which are at least 3.0%, 0.8% and 2.5% higher than those of the state-of-the-art methods. We conclude that the PRFold-TNN model compares favorably with existing methods.
Title: PRFold-TNN: Protein Fold Recognition With an Ensemble Feature Selection Method Using PageRank Algorithm Based on Transformer. Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics.
Pub Date: 2024-06-12 DOI: 10.1109/TCBB.2024.3413021
Ummu Gulsum Soylemez, Malik Yousef, Zulal Kesmen, Burcu Bakir-Gungor
Antimicrobial peptides (AMPs) have drawn the interest of researchers since they offer an alternative to traditional antibiotics in the fight against antibiotic resistance and they exhibit additional pharmaceutically significant properties. Recently, computational approaches have attempted to reveal how antibacterial activity is determined from a machine learning perspective, and they aim to find the biological cues or characteristics that control antimicrobial activity by incorporating motif match scores. This study is dedicated to the development of a machine learning framework aimed at devising novel antimicrobial peptide (AMP) sequences potentially effective against Gram-positive/Gram-negative bacteria. In order to design newly generated sequences classified as either AMP or non-AMP, various classification models were trained. These novel sequences underwent validation utilizing the "DBAASP: strain-specific antibacterial prediction based on machine learning approaches and data on AMP sequences" tool. The findings presented herein represent a significant stride in this computational research, streamlining the process of AMP creation or modification within wet lab environments.
Title: Novel Antimicrobial Peptide Design Using Motif Match Score Representation. Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics.
Pub Date: 2024-06-11 DOI: 10.1109/TCBB.2024.3412174
Yuanfei Dai, Bin Zhang, Shiping Wang
Biomedical relation extraction aims to identify underlying relationships among entities, such as gene associations and drug interactions, within biomedical texts. Despite advancements in relation extraction in general knowledge domains, the scarcity of labeled training data remains a significant challenge in the biomedical field. This paper provides a novel approach for biomedical relation extraction that leverages a noisy student self-training strategy combined with negative learning. This method addresses the challenge of data insufficiency by utilizing distantly supervised data to generate high-quality labeled samples. Negative learning, as opposed to traditional positive learning, offers a more robust mechanism to discern and relabel noisy samples, preventing model overfitting. The integration of these techniques ensures enhanced noise reduction and relabeling capabilities, leading to improved performance even with noisy datasets. Experimental results demonstrate the effectiveness of the proposed framework in mitigating the impact of noisy data and outperforming existing benchmarks.
Title: Distantly Supervised Biomedical Relation Extraction Via Negative Learning and Noisy Student Self-Training. Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics.
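The contrast between positive and negative learning mentioned in this abstract can be made concrete with the two loss functions below. Positive learning trains on "this sample is class k", while negative learning trains on a complementary label, "this sample is not class k". The negative loss uses the commonly used form -log(1 - p_k); the exact loss in the paper is not given in the abstract, so this is a sketch under that assumption, with invented logits.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def positive_loss(logits, labels):
    """Cross-entropy: push up the probability of the given label."""
    probs = softmax(logits)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(np.clip(picked, 1e-12, 1.0)).mean())


def negative_loss(logits, complementary_labels):
    """Negative learning: push down the probability of a class the sample
    is said NOT to belong to."""
    probs = softmax(logits)
    picked = probs[np.arange(len(complementary_labels)), complementary_labels]
    return float(-np.log(np.clip(1.0 - picked, 1e-12, 1.0)).mean())


# One sample that the model confidently assigns to class 0.
logits = np.array([[4.0, 0.0, 0.0]])
wrong_label = np.array([1])

pl = positive_loss(logits, wrong_label)  # treats the noisy label 1 as truth
nl = negative_loss(logits, wrong_label)  # only states "not class 1"
print(pl, nl)
```

A wrong positive label produces a large loss that drags the model toward the error, whereas the complementary label produces a small one, which is one way to see why negative learning is less sensitive to label noise.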
AlphaFold2 has achieved a major breakthrough in end-to-end prediction of static protein structures. However, protein conformational change is considered to be a key factor in protein biological function. Predicting multiple inter-residue distances is of great significance for exploring the multiple conformations of proteins. In this study, we propose an inter-residue multiple distances prediction method, DeepMDisPre, based on an improved network which integrates triangle update, axial attention and ResNet to predict multiple distances of residue pairs. We built a dataset which contains proteins with a single structure and proteins with multiple conformations to train the network. We tested DeepMDisPre on 114 proteins with multiple conformations. The results show that the inter-residue distance distribution predicted by DeepMDisPre is more likely to have multiple peaks for flexible residue pairs than for rigid residue pairs. On two cases of proteins with multiple conformations, we modeled the multiple conformations relatively accurately by using the predicted inter-residue multiple distances. In addition, we also tested the performance of DeepMDisPre on 279 proteins with a single structure. Experimental results demonstrate that the average contact accuracy of DeepMDisPre is higher than that of the comparative method. In terms of static protein modeling, the average TM-score of the 3D models built by DeepMDisPre is also improved compared with the comparative method. The executable program is freely available at https://github.com/iobio-zjut/DeepMDisPre.
Title: Prediction of Inter-residue Multiple Distances and Exploration of Protein Multiple Conformations by Deep Learning. Authors: Fujin Zhang, Zhangwei Li, Kailong Zhao, Pengxin Zhao, Guijun Zhang. DOI: 10.1109/TCBB.2024.3411825. Pub Date: 2024-06-10. Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics.
Pub Date: 2024-06-07 DOI: 10.1109/TCBB.2024.3411024
Mohamed Divan Masood, Manjula, Vijayan Sugumaran
Control of gene expression is one of the most important processes in a living organism, and understanding it makes it easier to identify diseases and their causes. However, it is very difficult to determine which factors control gene expression. A transcription factor (TF) is a protein that plays an important role in gene expression. Discovering transcription factors has immense biological significance; however, it is challenging to develop and evaluate novel techniques for studying regulatory processes in biological structures. In this research, we focus on 'sequence specificities' that can be ascertained from experimental data with 'deep learning' techniques, which offer a scalable, flexible and unified computational approach for predicting transcription factor binding. Specifically, we combine the Multiple EM for Motif Elicitation (MEME) technique with a Convolutional Neural Network (CNN), in an approach named CnNet, to discover the sequence specificities in a dataset of DNA gene sequences. The process involves two steps: a) discovering motifs that can identify useful TF binding sites using MEME, and b) computing a score that indicates the likelihood of a given sequence being a useful binding site using the CNN. The proposed CnNet approach predicts the TF binding score with much better accuracy than existing approaches. The source code and datasets used in this work are available at https://github.com/masoodbai/CnNet-Approach-for-TFBS.git.
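The scoring step described in the abstract can be sketched in miniature. This is not the CnNet implementation: it assumes a single hand-written convolution filter (standing in for a motif such as one a motif-discovery tool might report), followed by global max pooling and a sigmoid. The names `one_hot`, `scan`, `binding_score`, and `TGA_FILTER`, and all weights, are hypothetical.

```python
import math

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows in A, C, G, T order."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq]


def scan(seq, motif_filter):
    """Slide the filter along the sequence (a valid 1D cross-correlation)."""
    x = one_hot(seq)
    k = len(motif_filter)
    scores = []
    for start in range(len(x) - k + 1):
        s = 0.0
        for j in range(k):
            for c in range(4):
                s += x[start + j][c] * motif_filter[j][c]
        scores.append(s)
    return scores


def binding_score(seq, motif_filter, bias=0.0):
    """Global max pooling over positions, then a sigmoid to get a 0..1 score.

    Assumes len(seq) >= len(motif_filter); otherwise there is nothing to pool.
    """
    best = max(scan(seq, motif_filter))
    return 1.0 / (1.0 + math.exp(-(best + bias)))


# Hypothetical filter rewarding the 3-mer "TGA": +1 for a match, -1 otherwise.
TGA_FILTER = [
    [-1.0, -1.0, -1.0, 1.0],   # position 0 prefers T
    [-1.0, -1.0, 1.0, -1.0],   # position 1 prefers G
    [1.0, -1.0, -1.0, -1.0],   # position 2 prefers A
]

print(scan("ACTGAC", TGA_FILTER))
print(binding_score("ACTGAC", TGA_FILTER), binding_score("AAAAAA", TGA_FILTER))
```

In a trained CNN the filter weights would be learned from data rather than written by hand, and many filters and further layers would typically be combined; the sketch only shows why a convolution filter can act as a motif detector.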
Mohamed Divan Masood, Manjula, Vijayan Sugumaran. "Transcription Factor Binding Site Prediction Using CnNet Approach." IEEE/ACM Transactions on Computational Biology and Bioinformatics. Published 2024-06-07. DOI: 10.1109/TCBB.2024.3411024.