Pub Date : 2024-09-04DOI: 10.1186/s12859-024-05912-5
Siddhartha G Jena, Archit Verma, Barbara E Engelhardt
Genomics methods have uncovered patterns in a range of biological systems, but obscure important aspects of cell behavior: the shapes, relative locations, movement, and interactions of cells in space. Spatial technologies that collect genomic or epigenomic data while preserving spatial information have begun to overcome these limitations. These new data promise a deeper understanding of the factors that affect cellular behavior, and in particular the ability to directly test existing theories about cell state and variation in the context of morphology, location, motility, and signaling that could not be tested before. Rapid advancements in resolution, ease-of-use, and scale of spatial genomics technologies to address these questions also require an updated toolkit of statistical methods with which to interrogate these data. We present a framework to respond to this new avenue of research: four open biological questions that can now be answered using spatial genomics data paired with methods for analysis. We outline spatial data modalities for each open question that may yield specific insights, discuss how conflicting theories may be tested by comparing the data to conceptual models of biological behavior, and highlight statistical and machine learning-based tools that may prove particularly helpful to recover biological understanding.
{"title":"Answering open questions in biology using spatial genomics and structured methods.","authors":"Siddhartha G Jena, Archit Verma, Barbara E Engelhardt","doi":"10.1186/s12859-024-05912-5","DOIUrl":"10.1186/s12859-024-05912-5","url":null,"abstract":"<p><p>Genomics methods have uncovered patterns in a range of biological systems, but obscure important aspects of cell behavior: the shapes, relative locations, movement, and interactions of cells in space. Spatial technologies that collect genomic or epigenomic data while preserving spatial information have begun to overcome these limitations. These new data promise a deeper understanding of the factors that affect cellular behavior, and in particular the ability to directly test existing theories about cell state and variation in the context of morphology, location, motility, and signaling that could not be tested before. Rapid advancements in resolution, ease-of-use, and scale of spatial genomics technologies to address these questions also require an updated toolkit of statistical methods with which to interrogate these data. We present a framework to respond to this new avenue of research: four open biological questions that can now be answered using spatial genomics data paired with methods for analysis. We outline spatial data modalities for each open question that may yield specific insights, discuss how conflicting theories may be tested by comparing the data to conceptual models of biological behavior, and highlight statistical and machine learning-based tools that may prove particularly helpful to recover biological understanding.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":null,"pages":null},"PeriodicalIF":2.9,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11375982/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142131751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-09-03DOI: 10.1186/s12859-024-05860-0
Abdullah Asım Emül, Mehmet Arif Ergün, Rumeysa Aslıhan Ertürk, Ömer Çinal, Mehmet Baysan
Background: Advancements over the past decade in DNA sequencing technology and computing power have created the potential to revolutionize medicine. There has been a marked increase in genetic data available, allowing for the advancement of areas such as personalized medicine. A crucial type of data in this context is genetic variant data which is stored in variant call format (VCF) files. However, the rapid growth in genomics has presented challenges in analyzing and comparing VCF files.
Results: In response to the limitations of existing tools, this paper introduces a novel web application that provides a user-friendly solution for VCF file analyses and comparisons. The software tool enables researchers and clinicians to perform high-level analysis with ease and enhances productivity. The application's interface allows users to conveniently upload, analyze, and visualize their VCF files using simple drag-and-drop and point-and-click operations. Essential visualizations such as Venn diagrams, clustergrams, and precision-recall plots are provided to users. A key feature of the application is its support for metadata-based file grouping, accomplished through flexible data matrix uploads, streamlining organization and analysis of user-defined categories. Additionally, the application facilitates standardized benchmarking of VCF files by integrating user-provided ground truth regions and variant lists.
Conclusions: By providing a user-friendly interface and supporting essential visualizations, this software enhances the accessibility of VCF file analysis and assists researchers and clinicians in their scientific inquiries.
{"title":"VCF observer: a user-friendly software tool for preliminary VCF file analysis and comparison.","authors":"Abdullah Asım Emül, Mehmet Arif Ergün, Rumeysa Aslıhan Ertürk, Ömer Çinal, Mehmet Baysan","doi":"10.1186/s12859-024-05860-0","DOIUrl":"10.1186/s12859-024-05860-0","url":null,"abstract":"<p><strong>Background: </strong>Advancements over the past decade in DNA sequencing technology and computing power have created the potential to revolutionize medicine. There has been a marked increase in genetic data available, allowing for the advancement of areas such as personalized medicine. A crucial type of data in this context is genetic variant data which is stored in variant call format (VCF) files. However, the rapid growth in genomics has presented challenges in analyzing and comparing VCF files.</p><p><strong>Results: </strong>In response to the limitations of existing tools, this paper introduces a novel web application that provides a user-friendly solution for VCF file analyses and comparisons. The software tool enables researchers and clinicians to perform high-level analysis with ease and enhances productivity. The application's interface allows users to conveniently upload, analyze, and visualize their VCF files using simple drag-and-drop and point-and-click operations. Essential visualizations such as Venn diagrams, clustergrams, and precision-recall plots are provided to users. A key feature of the application is its support for metadata-based file grouping, accomplished through flexible data matrix uploads, streamlining organization and analysis of user-defined categories. Additionally, the application facilitates standardized benchmarking of VCF files by integrating user-provided ground truth regions and variant lists.</p><p><strong>Conclusions: </strong>By providing a user-friendly interface and supporting essential visualizations, this software enhances the accessibility of VCF file analysis and assists researchers and clinicians in their scientific inquiries.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":null,"pages":null},"PeriodicalIF":2.9,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11373448/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142124727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-09-03DOI: 10.1186/s12859-024-05875-7
Jennifer Li, Andy Yang, Benedito A Carneiro, Ece D Gamsiz Uzun, Lauren Massingham, Alper Uzun
Background: The variant call format (VCF) file is a structured and comprehensive text file crucial for researchers and clinicians in interpreting and understanding genomic variation data. It contains essential information about variant positions in the genome, along with alleles, genotype calls, and quality scores. Analyzing and visualizing these files, however, poses significant challenges due to the need for diverse resources and robust features for in-depth exploration.
Results: To address these challenges, we introduce variant graph craft (VGC), a VCF file visualization and analysis tool. VGC offers a wide range of features for exploring genetic variations, including extraction of variant data, intuitive visualization, and graphical representation of samples with genotype information. VGC is designed primarily for the analysis of patient cohorts, but it can also be adapted for use with individual probands or families. It integrates seamlessly with external resources, providing insights into gene function and variant frequencies in sample data. VGC includes gene function and pathway information from Molecular Signatures Database (MSigDB) for GO terms, KEGG, Biocarta, Pathway Interaction Database, and Reactome. Additionally, it dynamically links to gnomAD for variant information and incorporates ClinVar data for pathogenic variant information. VGC supports the Human Genome Assembly Hg37 and Hg38, ensuring compatibility with a wide range of data sets, and accommodates various approaches to exploring genetic variation data. It can be tailored to specific user needs with optional phenotype input data.
Conclusions: In summary, VGC provides a comprehensive set of features tailored to researchers working with genomic variation data. Its intuitive interface, rapid filtering capabilities, and the flexibility to perform queries using custom groups make it an effective tool in identifying variants potentially associated with diseases. VGC operates locally, ensuring data security and privacy by eliminating the need for cloud-based VCF uploads, making it a secure and user-friendly tool. It is freely available at https://github.com/alperuzun/VGC .
{"title":"Variant graph craft (VGC): a comprehensive tool for analyzing genetic variation and identifying disease-causing variants.","authors":"Jennifer Li, Andy Yang, Benedito A Carneiro, Ece D Gamsiz Uzun, Lauren Massingham, Alper Uzun","doi":"10.1186/s12859-024-05875-7","DOIUrl":"10.1186/s12859-024-05875-7","url":null,"abstract":"<p><strong>Background: </strong>The variant call format (VCF) file is a structured and comprehensive text file crucial for researchers and clinicians in interpreting and understanding genomic variation data. It contains essential information about variant positions in the genome, along with alleles, genotype calls, and quality scores. Analyzing and visualizing these files, however, poses significant challenges due to the need for diverse resources and robust features for in-depth exploration.</p><p><strong>Results: </strong>To address these challenges, we introduce variant graph craft (VGC), a VCF file visualization and analysis tool. VGC offers a wide range of features for exploring genetic variations, including extraction of variant data, intuitive visualization, and graphical representation of samples with genotype information. VGC is designed primarily for the analysis of patient cohorts, but it can also be adapted for use with individual probands or families. It integrates seamlessly with external resources, providing insights into gene function and variant frequencies in sample data. VGC includes gene function and pathway information from Molecular Signatures Database (MSigDB) for GO terms, KEGG, Biocarta, Pathway Interaction Database, and Reactome. Additionally, it dynamically links to gnomAD for variant information and incorporates ClinVar data for pathogenic variant information. VGC supports the Human Genome Assembly Hg37 and Hg38, ensuring compatibility with a wide range of data sets, and accommodates various approaches to exploring genetic variation data. It can be tailored to specific user needs with optional phenotype input data.</p><p><strong>Conclusions: </strong>In summary, VGC provides a comprehensive set of features tailored to researchers working with genomic variation data. Its intuitive interface, rapid filtering capabilities, and the flexibility to perform queries using custom groups make it an effective tool in identifying variants potentially associated with diseases. VGC operates locally, ensuring data security and privacy by eliminating the need for cloud-based VCF uploads, making it a secure and user-friendly tool. It is freely available at https://github.com/alperuzun/VGC .</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":null,"pages":null},"PeriodicalIF":2.9,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11370019/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142124726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-09-02DOI: 10.1186/s12859-024-05910-7
Sergey Dolgov, Dmitry Savostyanov
We consider a problem of inferring contact network from nodal states observed during an epidemiological process. In a black-box Bayesian optimisation framework this problem reduces to a discrete likelihood optimisation over the set of possible networks. The cardinality of this set grows combinatorially with the number of network nodes, which makes this optimisation computationally challenging. For each network, its likelihood is the probability for the observed data to appear during the evolution of the epidemiological process on this network. This probability can be very small, particularly if the network is significantly different from the ground truth network, from which the observed data actually appear. A commonly used stochastic simulation algorithm struggles to recover rare events and hence to estimate small probabilities and likelihoods. In this paper we replace the stochastic simulation with solving the chemical master equation for the probabilities of all network states. Since this equation also suffers from the curse of dimensionality, we apply tensor train approximations to overcome it and enable fast and accurate computations. Numerical simulations demonstrate efficient black-box Bayesian inference of the network.
我们考虑了从流行病学过程中观察到的节点状态推断接触网络的问题。在黑箱贝叶斯优化框架中,这一问题简化为对可能网络集合的离散似然优化。这个集合的可计算性随着网络节点数量的增加而增加,这使得优化工作在计算上极具挑战性。对于每个网络来说,其可能性是在该网络的流行病学过程演变过程中观察到的数据出现的概率。这个概率可能非常小,尤其是当网络与观察到的数据实际出现的基本真实网络有很大差异时。常用的随机模拟算法难以恢复罕见事件,因此也难以估计小概率和小可能性。在本文中,我们用求解所有网络状态概率的化学主方程来取代随机模拟。由于该方程也存在 "维度诅咒"(curse of dimensionality),我们采用张量列车近似来克服这一问题,从而实现快速、准确的计算。数值模拟证明了网络的高效黑箱贝叶斯推断。
{"title":"Tensor product algorithms for inference of contact network from epidemiological data.","authors":"Sergey Dolgov, Dmitry Savostyanov","doi":"10.1186/s12859-024-05910-7","DOIUrl":"10.1186/s12859-024-05910-7","url":null,"abstract":"<p><p>We consider a problem of inferring contact network from nodal states observed during an epidemiological process. In a black-box Bayesian optimisation framework this problem reduces to a discrete likelihood optimisation over the set of possible networks. The cardinality of this set grows combinatorially with the number of network nodes, which makes this optimisation computationally challenging. For each network, its likelihood is the probability for the observed data to appear during the evolution of the epidemiological process on this network. This probability can be very small, particularly if the network is significantly different from the ground truth network, from which the observed data actually appear. A commonly used stochastic simulation algorithm struggles to recover rare events and hence to estimate small probabilities and likelihoods. In this paper we replace the stochastic simulation with solving the chemical master equation for the probabilities of all network states. Since this equation also suffers from the curse of dimensionality, we apply tensor train approximations to overcome it and enable fast and accurate computations. Numerical simulations demonstrate efficient black-box Bayesian inference of the network.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":null,"pages":null},"PeriodicalIF":2.9,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11370089/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142118921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-09-02DOI: 10.1186/s12859-024-05914-3
J Ouyang, Y Gao, Y Yang
Background: Recently, the process of evolution information and the deep learning network has promoted the improvement of protein contact prediction methods. Nevertheless, still remain some bottleneck: (1) One of the bottlenecks is the prediction of orphans and other fewer evolution information proteins. (2) The other bottleneck is the method of predicting single-sequence-based proteins mainly focuses on selecting protein sequence features and tuning the neural network architecture, However, while the deeper neural networks improve prediction accuracy, there is still the problem of increasing the computational burden. Compared with other neural networks in the field of protein prediction, the graph neural network has the following advantages: due to the advantage of revealing the topology structure via graph neural network and being able to take advantage of the hierarchical structure and local connectivity of graph neural networks has certain advantages in capturing the features of different levels of abstraction in protein molecules. When using protein sequence and structure information for joint training, the dependencies between the two kinds of information can be better captured. And it can process protein molecular structures of different lengths and shapes, while traditional neural networks need to convert proteins into fixed-size vectors or matrices for processing.
Results: Here, we propose a single-sequence-based protein contact map predictor PCP-GC-LM, with dual-level graph neural networks and convolution networks. Our method performs better with other single-sequence-based predictors in different independent tests. In addition, to verify the validity of our method against complex protein structures, we will also compare it with other methods in two homodimers protein test sets (DeepHomo test dataset and CASP-CAPRI target dataset). Furthermore, we also perform ablation experiments to demonstrate the necessity of a dual graph network. In all, our framework presents new modules to accurately predict inter-chain contact maps in protein and it's also useful to analyze interactions in other types of protein complexes.
{"title":"PCP-GC-LM: single-sequence-based protein contact prediction using dual graph convolutional neural network and convolutional neural network.","authors":"J Ouyang, Y Gao, Y Yang","doi":"10.1186/s12859-024-05914-3","DOIUrl":"10.1186/s12859-024-05914-3","url":null,"abstract":"<p><strong>Background: </strong>Recently, the process of evolution information and the deep learning network has promoted the improvement of protein contact prediction methods. Nevertheless, still remain some bottleneck: (1) One of the bottlenecks is the prediction of orphans and other fewer evolution information proteins. (2) The other bottleneck is the method of predicting single-sequence-based proteins mainly focuses on selecting protein sequence features and tuning the neural network architecture, However, while the deeper neural networks improve prediction accuracy, there is still the problem of increasing the computational burden. Compared with other neural networks in the field of protein prediction, the graph neural network has the following advantages: due to the advantage of revealing the topology structure via graph neural network and being able to take advantage of the hierarchical structure and local connectivity of graph neural networks has certain advantages in capturing the features of different levels of abstraction in protein molecules. When using protein sequence and structure information for joint training, the dependencies between the two kinds of information can be better captured. And it can process protein molecular structures of different lengths and shapes, while traditional neural networks need to convert proteins into fixed-size vectors or matrices for processing.</p><p><strong>Results: </strong>Here, we propose a single-sequence-based protein contact map predictor PCP-GC-LM, with dual-level graph neural networks and convolution networks. Our method performs better with other single-sequence-based predictors in different independent tests. In addition, to verify the validity of our method against complex protein structures, we will also compare it with other methods in two homodimers protein test sets (DeepHomo test dataset and CASP-CAPRI target dataset). Furthermore, we also perform ablation experiments to demonstrate the necessity of a dual graph network. In all, our framework presents new modules to accurately predict inter-chain contact maps in protein and it's also useful to analyze interactions in other types of protein complexes.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":null,"pages":null},"PeriodicalIF":2.9,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11370006/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142118919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-09-02DOI: 10.1186/s12859-024-05909-0
Giovanni Marturano, Diego Carli, Claudio Cucini, Antonio Carapelli, Federico Plazzi, Francesco Frati, Marco Passamonti, Francesco Nardi
Background: SmithRNAs (Small MITochondrial Highly-transcribed RNAs) are a novel class of small RNA molecules that are encoded in the mitochondrial genome and regulate the expression of nuclear transcripts. Initial evidence for their existence came from the Manila clam Ruditapes philippinarum, where they have been described and whose activity has been biologically validated through RNA injection experiments. Current evidence on the existence of these RNAs in other species is based only on small RNA sequencing. As a preliminary step to characterize smithRNAs across different metazoan lineages, a dedicated, unified, analytical workflow is needed.
Results: We propose a novel workflow specifically designed for smithRNAs. Sequence data (from small RNA sequencing) uniquely mapping to the mitochondrial genome are clustered into putative smithRNAs and prefiltered based on their abundance, presence in replicate libraries and 5' and 3' transcription boundary conservation. The surviving sequences are subsequently compared to the untranslated regions of nuclear transcripts based on seed pairing, overall match and thermodynamic stability to identify possible targets. Ample collateral information and graphics are produced to help characterize these molecules in the species of choice and guide the operator through the analysis. The workflow was tested on the original Manila clam data. Under basic settings, the results of the original study are largely replicated. The effect of additional parameter customization (clustering threshold, stringency, minimum number of replicates, seed matching) was further evaluated.
Conclusions: The study of smithRNAs is still in its infancy and no dedicated analytical workflow is currently available. At its core, the SmithHunter workflow builds over the bioinformatic procedure originally applied to identify candidate smithRNAs in the Manila clam. In fact, this is currently the only evidence for smithRNAs that has been biologically validated and, therefore, the elective starting point for characterizing smithRNAs in other species. The original analysis was readapted using current software implementations and some minor issues were solved. Moreover, the workflow was improved by allowing the customization of different analytical parameters, mostly focusing on stringency and the possibility of accounting for a minimal level of genetic differentiation among samples.
{"title":"SmithHunter: a workflow for the identification of candidate smithRNAs and their targets.","authors":"Giovanni Marturano, Diego Carli, Claudio Cucini, Antonio Carapelli, Federico Plazzi, Francesco Frati, Marco Passamonti, Francesco Nardi","doi":"10.1186/s12859-024-05909-0","DOIUrl":"10.1186/s12859-024-05909-0","url":null,"abstract":"<p><strong>Background: </strong>SmithRNAs (Small MITochondrial Highly-transcribed RNAs) are a novel class of small RNA molecules that are encoded in the mitochondrial genome and regulate the expression of nuclear transcripts. Initial evidence for their existence came from the Manila clam Ruditapes philippinarum, where they have been described and whose activity has been biologically validated through RNA injection experiments. Current evidence on the existence of these RNAs in other species is based only on small RNA sequencing. As a preliminary step to characterize smithRNAs across different metazoan lineages, a dedicated, unified, analytical workflow is needed.</p><p><strong>Results: </strong>We propose a novel workflow specifically designed for smithRNAs. Sequence data (from small RNA sequencing) uniquely mapping to the mitochondrial genome are clustered into putative smithRNAs and prefiltered based on their abundance, presence in replicate libraries and 5' and 3' transcription boundary conservation. The surviving sequences are subsequently compared to the untranslated regions of nuclear transcripts based on seed pairing, overall match and thermodynamic stability to identify possible targets. Ample collateral information and graphics are produced to help characterize these molecules in the species of choice and guide the operator through the analysis. The workflow was tested on the original Manila clam data. Under basic settings, the results of the original study are largely replicated. The effect of additional parameter customization (clustering threshold, stringency, minimum number of replicates, seed matching) was further evaluated.</p><p><strong>Conclusions: </strong>The study of smithRNAs is still in its infancy and no dedicated analytical workflow is currently available. At its core, the SmithHunter workflow builds over the bioinformatic procedure originally applied to identify candidate smithRNAs in the Manila clam. In fact, this is currently the only evidence for smithRNAs that has been biologically validated and, therefore, the elective starting point for characterizing smithRNAs in other species. The original analysis was readapted using current software implementations and some minor issues were solved. Moreover, the workflow was improved by allowing the customization of different analytical parameters, mostly focusing on stringency and the possibility of accounting for a minimal level of genetic differentiation among samples.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":null,"pages":null},"PeriodicalIF":2.9,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11370224/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142118920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-08-30DOI: 10.1186/s12859-024-05917-0
Salman Khan, Salman A AlQahtani, Sumaiya Noor, Nijad Ahmad
Post-translational modifications (PTMs) are fundamental to essential biological processes, exerting significant influence over gene expression, protein localization, stability, and genome replication. Sumoylation, a PTM involving the covalent addition of a chemical group to a specific protein sequence, profoundly impacts the functional diversity of proteins. Notably, identifying sumoylation sites has garnered significant attention due to their crucial roles in proteomic functions and their implications in various diseases, including Parkinson's and Alzheimer's. Despite the proposal of several computational models for identifying sumoylation sites, their effectiveness could be improved by the limitations associated with conventional learning methodologies. In this study, we introduce pseudo-position-specific scoring matrix (PsePSSM), a robust computational model designed for accurately predicting sumoylation sites using an optimized deep learning algorithm and efficient feature extraction techniques. Moreover, to streamline computational processes and eliminate irrelevant and noisy features, sequential forward selection using a support vector machine (SFS-SVM) is implemented to identify optimal features. The multi-layer Deep Neural Network (DNN) is a robust classifier, facilitating precise sumoylation site prediction. We meticulously assess the performance of PSSM-Sumo through a tenfold cross-validation approach, employing various statistical metrics such as the Matthews Correlation Coefficient (MCC), accuracy, sensitivity, specificity, and the Area under the ROC Curve (AUC). Comparative analyses reveal that PSSM-Sumo achieves an exceptional average prediction accuracy of 98.71%, surpassing existing models. The robustness and accuracy of the proposed model position it as a promising tool for advancing drug discovery and the diagnosis of diverse diseases linked to sumoylation sites.
{"title":"PSSM-Sumo: deep learning based intelligent model for prediction of sumoylation sites using discriminative features.","authors":"Salman Khan, Salman A AlQahtani, Sumaiya Noor, Nijad Ahmad","doi":"10.1186/s12859-024-05917-0","DOIUrl":"https://doi.org/10.1186/s12859-024-05917-0","url":null,"abstract":"<p><p>Post-translational modifications (PTMs) are fundamental to essential biological processes, exerting significant influence over gene expression, protein localization, stability, and genome replication. Sumoylation, a PTM involving the covalent addition of a chemical group to a specific protein sequence, profoundly impacts the functional diversity of proteins. Notably, identifying sumoylation sites has garnered significant attention due to their crucial roles in proteomic functions and their implications in various diseases, including Parkinson's and Alzheimer's. Despite the proposal of several computational models for identifying sumoylation sites, their effectiveness could be improved by the limitations associated with conventional learning methodologies. In this study, we introduce pseudo-position-specific scoring matrix (PsePSSM), a robust computational model designed for accurately predicting sumoylation sites using an optimized deep learning algorithm and efficient feature extraction techniques. Moreover, to streamline computational processes and eliminate irrelevant and noisy features, sequential forward selection using a support vector machine (SFS-SVM) is implemented to identify optimal features. The multi-layer Deep Neural Network (DNN) is a robust classifier, facilitating precise sumoylation site prediction. We meticulously assess the performance of PSSM-Sumo through a tenfold cross-validation approach, employing various statistical metrics such as the Matthews Correlation Coefficient (MCC), accuracy, sensitivity, specificity, and the Area under the ROC Curve (AUC). Comparative analyses reveal that PSSM-Sumo achieves an exceptional average prediction accuracy of 98.71%, surpassing existing models. The robustness and accuracy of the proposed model position it as a promising tool for advancing drug discovery and the diagnosis of diverse diseases linked to sumoylation sites.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":null,"pages":null},"PeriodicalIF":2.9,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11363370/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142103931","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-08-29DOI: 10.1186/s12859-024-05874-8
Zahra Rahaie, Hamid R Rabiee, Hamid Alinejad-Rokny
Background: Copy number variants (CNVs) have become increasingly instrumental in understanding the etiology of all diseases and phenotypes, including Neurocognitive Disorders (NDs). Among the well-established regions associated with ND are small parts of chromosome 16 deletions (16p11.2) and chromosome 15 duplications (15q3). Various methods have been developed to identify associations between CNVs and diseases of interest. The majority of methods are based on statistical inference techniques. However, due to the multi-dimensional nature of the features of the CNVs, these methods are still immature. The other aspect is that regions discovered by different methods are large, while the causative regions may be much smaller.
Results: In this study, we propose a regularized deep learning model to select causal regions for the target disease. With the help of the proximal [20] gradient descent algorithm, the model utilizes the group LASSO concept and embraces a deep learning model in a sparsity framework. We perform the CNV analysis for 74,811 individuals with three types of brain disorders, autism spectrum disorder (ASD), schizophrenia (SCZ), and developmental delay (DD), and also perform cumulative analysis to discover the regions that are common among the NDs. The brain expression of genes associated with diseases has increased by an average of 20 percent, and genes with homologs in mice that cause nervous system phenotypes have increased by 18 percent (on average). The DECIPHER data source also seeks other phenotypes connected to the detected regions alongside gene ontology analysis. The target diseases are correlated with some unexplored regions, such as deletions on 1q21.1 and 1q21.2 (for ASD), deletions on 20q12 (for SCZ), and duplications on 8p23.3 (for DD). Furthermore, our method is compared with other machine learning algorithms.
Conclusions: Our model effectively identifies regions associated with phenotypic traits using regularized deep learning. Rather than attempting to analyze the whole genome, CNVDeep allows us to focus only on the causative regions of disease.
{"title":"CNVDeep: deep association of copy number variants with neurocognitive disorders.","authors":"Zahra Rahaie, Hamid R Rabiee, Hamid Alinejad-Rokny","doi":"10.1186/s12859-024-05874-8","DOIUrl":"https://doi.org/10.1186/s12859-024-05874-8","url":null,"abstract":"<p><strong>Background: </strong>Copy number variants (CNVs) have become increasingly instrumental in understanding the etiology of all diseases and phenotypes, including Neurocognitive Disorders (NDs). Among the well-established regions associated with ND are small parts of chromosome 16 deletions (16p11.2) and chromosome 15 duplications (15q3). Various methods have been developed to identify associations between CNVs and diseases of interest. The majority of methods are based on statistical inference techniques. However, due to the multi-dimensional nature of the features of the CNVs, these methods are still immature. The other aspect is that regions discovered by different methods are large, while the causative regions may be much smaller.</p><p><strong>Results: </strong>In this study, we propose a regularized deep learning model to select causal regions for the target disease. With the help of the proximal [20] gradient descent algorithm, the model utilizes the group LASSO concept and embraces a deep learning model in a sparsity framework. We perform the CNV analysis for 74,811 individuals with three types of brain disorders, autism spectrum disorder (ASD), schizophrenia (SCZ), and developmental delay (DD), and also perform cumulative analysis to discover the regions that are common among the NDs. The brain expression of genes associated with diseases has increased by an average of 20 percent, and genes with homologs in mice that cause nervous system phenotypes have increased by 18 percent (on average). The DECIPHER data source also seeks other phenotypes connected to the detected regions alongside gene ontology analysis. The target diseases are correlated with some unexplored regions, such as deletions on 1q21.1 and 1q21.2 (for ASD), deletions on 20q12 (for SCZ), and duplications on 8p23.3 (for DD). Furthermore, our method is compared with other machine learning algorithms.</p><p><strong>Conclusions: </strong>Our model effectively identifies regions associated with phenotypic traits using regularized deep learning. Rather than attempting to analyze the whole genome, CNVDeep allows us to focus only on the causative regions of disease.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":null,"pages":null},"PeriodicalIF":2.9,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11360772/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142103930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-08-28DOI: 10.1186/s12859-024-05876-6
Shan Shan Li, Zhao Ming Liu, Jiao Li, Yi Bo Ma, Ze Yuan Dong, Jun Wei Hou, Fu Jie Shen, Wei Bu Wang, Qi Ming Li, Ji Guo Su
Background: Thermostability is a fundamental property of proteins to maintain their biological functions. Predicting protein stability changes upon mutation is important for our understanding protein structure-function relationship, and is also of great interest in protein engineering and pharmaceutical design.
Results: Here we present mutDDG-SSM, a deep learning-based framework that uses the geometric representations encoded in protein structure to predict the mutation-induced protein stability changes. mutDDG-SSM consists of two parts: a graph attention network-based protein structural feature extractor that is trained with a self-supervised learning scheme using large-scale high-resolution protein structures, and an eXtreme Gradient Boosting model-based stability change predictor with an advantage of alleviating overfitting problem. The performance of mutDDG-SSM was tested on several widely-used independent datasets. Then, myoglobin and p53 were used as case studies to illustrate the effectiveness of the model in predicting protein stability changes upon mutations. Our results show that mutDDG-SSM achieved high performance in estimating the effects of mutations on protein stability. In addition, mutDDG-SSM exhibited good unbiasedness, where the prediction accuracy on the inverse mutations is as well as that on the direct mutations.
Conclusion: Meaningful features can be extracted from our pre-trained model to build downstream tasks and our model may serve as a valuable tool for protein engineering and drug design.
{"title":"Prediction of mutation-induced protein stability changes based on the geometric representations learned by a self-supervised method.","authors":"Shan Shan Li, Zhao Ming Liu, Jiao Li, Yi Bo Ma, Ze Yuan Dong, Jun Wei Hou, Fu Jie Shen, Wei Bu Wang, Qi Ming Li, Ji Guo Su","doi":"10.1186/s12859-024-05876-6","DOIUrl":"10.1186/s12859-024-05876-6","url":null,"abstract":"<p><strong>Background: </strong>Thermostability is a fundamental property of proteins to maintain their biological functions. Predicting protein stability changes upon mutation is important for our understanding protein structure-function relationship, and is also of great interest in protein engineering and pharmaceutical design.</p><p><strong>Results: </strong>Here we present mutDDG-SSM, a deep learning-based framework that uses the geometric representations encoded in protein structure to predict the mutation-induced protein stability changes. mutDDG-SSM consists of two parts: a graph attention network-based protein structural feature extractor that is trained with a self-supervised learning scheme using large-scale high-resolution protein structures, and an eXtreme Gradient Boosting model-based stability change predictor with an advantage of alleviating overfitting problem. The performance of mutDDG-SSM was tested on several widely-used independent datasets. Then, myoglobin and p53 were used as case studies to illustrate the effectiveness of the model in predicting protein stability changes upon mutations. Our results show that mutDDG-SSM achieved high performance in estimating the effects of mutations on protein stability. In addition, mutDDG-SSM exhibited good unbiasedness, where the prediction accuracy on the inverse mutations is as well as that on the direct mutations.</p><p><strong>Conclusion: </strong>Meaningful features can be extracted from our pre-trained model to build downstream tasks and our model may serve as a valuable tool for protein engineering and drug design.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":null,"pages":null},"PeriodicalIF":2.9,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11360314/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142092141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}