Pub Date: 2024-11-01 DOI: 10.1093/bioinformatics/btae653
Tasbiraha Athaya, Xiaoman Li, Haiyan Hu
Motivation: Extracellular miRNAs (exmiRs) and intracellular mRNAs can both serve as promising biomarkers and therapeutic targets for various diseases. However, exmiR expression data is often noisy, and obtaining intracellular mRNA expression data usually requires invasive procedures. To gain insight into disease mechanisms, it is therefore essential to improve the quality of exmiR expression data and to develop noninvasive methods for assessing intracellular mRNA expression.
Results: We developed CrossPred, a deep-learning multi-encoder model for the cross-prediction of exmiRs and mRNAs. Utilizing contrastive learning, we created a shared embedding space to integrate exmiRs and mRNAs. This shared embedding was then used to predict intracellular mRNA expression from noisy exmiR data and to predict exmiR expression from intracellular mRNA data. We evaluated CrossPred on three types of cancers and assessed its effectiveness in predicting the expression levels of exmiRs and mRNAs. CrossPred outperformed the baseline encoder-decoder model, exmiR or mRNA-based models, and variational autoencoder models. Moreover, the integration of exmiR and mRNA data uncovered important exmiRs and mRNAs associated with cancer. Our study offers new insights into the bidirectional relationship between mRNAs and exmiRs.
Availability and implementation: The datasets and tool are available at https://doi.org/10.5281/zenodo.13891508.
Title: A deep learning method to integrate extracelluar miRNA with mRNA for cancer studies. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11565234/pdf/
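The Results above describe a shared embedding space learned with contrastive learning, but the abstract does not state the loss. The following is a minimal sketch, assuming a symmetric InfoNCE-style objective over paired samples; this is an assumption for illustration, not a confirmed CrossPred detail, and the function names, temperature, and toy embeddings are all hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def info_nce(anchors, positives, temperature=0.1):
    """Mean cross-entropy of picking positives[i] for anchors[i] among all positives."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity of this anchor to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature for cand in p]
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


def symmetric_contrastive_loss(z_exmir, z_mrna, temperature=0.1):
    """Average the exmiR-to-mRNA and mRNA-to-exmiR directions."""
    return 0.5 * (info_nce(z_exmir, z_mrna, temperature)
                  + info_nce(z_mrna, z_exmir, temperature))


# Hypothetical toy embeddings: row i of each list belongs to the same sample.
z_exmir = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
z_mrna_aligned = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
# Rotating the rows breaks every pairing, so the loss should rise sharply.
z_mrna_misaligned = z_mrna_aligned[1:] + z_mrna_aligned[:1]

aligned_loss = symmetric_contrastive_loss(z_exmir, z_mrna_aligned)
misaligned_loss = symmetric_contrastive_loss(z_exmir, z_mrna_misaligned)
print(aligned_loss < misaligned_loss)
```

In a multi-encoder setting, `z_exmir` and `z_mrna` would be the outputs of the two modality encoders for the same samples. Minimizing a loss of this kind pulls paired embeddings together, which is what would let a decoder for one modality read an embedding produced from the other.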
Pub Date: 2024-11-01 DOI: 10.1093/bioinformatics/btae660
Alex C H Liu, Steven M Chan
Summary: We present ADTGP, an R package that uses Gaussian process regression to correct droplet-specific technical noise in single-cell protein sequencing data. ADTGP improves the interpretability of the data by modeling the distribution of protein expression, conditioned on equal isotype control counts across cells. ADTGP is written in R and needs only the protein raw counts, isotype control raw counts, and a design matrix to run.
Availability and implementation: ADTGP can be installed from https://github.com/northNomad/ADTGP. It depends on Stan and the R package 'cmdstanr'.
Title: ADTGP: correcting single-cell antibody sequencing data using Gaussian process regression. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11568108/pdf/
Pub Date: 2024-11-01 DOI: 10.1093/bioinformatics/btae618
Tobias H Olsen, Iain H Moal, Charlotte M Deane
Motivation: The versatile binding properties of antibodies have made them an extremely important class of biotherapeutics. However, therapeutic antibody development is a complex, expensive, and time-consuming task, with the final antibody needing not only to have strong and specific binding but also to be minimally impacted by developability issues. The success of transformer-based language models in protein sequence space and the availability of vast amounts of antibody sequences have led to the development of many antibody-specific language models to help guide antibody design. Antibody diversity primarily arises from V(D)J recombination, mutations within the CDRs, and/or from a few nongermline mutations outside the CDRs. Consequently, a significant portion of the variable domain of all natural antibody sequences remains germline. This affects the pre-training of antibody-specific language models, where this facet of the sequence data introduces a prevailing bias toward germline residues. This poses a challenge, as mutations away from the germline are often vital for generating specific and potent binding to a target, meaning that language models need to be able to suggest key mutations away from germline.
Results: In this study, we explore the implications of the germline bias, examining its impact on both general-protein and antibody-specific language models. We develop and train a series of new antibody-specific language models optimized for predicting nongermline residues. We then compare our final model, AbLang-2, with current models and show how it suggests a diverse set of valid mutations with high cumulative probability.
Availability and implementation: AbLang-2 is trained on both unpaired and paired data, and is freely available at https://github.com/oxpig/AbLang2.git.
Title: Addressing the antibody germline bias and its effect on language models for improved antibody design. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11543624/pdf/
Motivation: DNA methylation patterns provide precise and accurate estimates of biological age due to their robustness and predictable changes associated with aging processes. Although several methylation aging clocks have been developed in recent years, they are primarily designed for DNA methylation array data, which has limited CpG coverage and detection sensitivity compared to bisulfite sequencing data.
Results: Here, we present BS-clock, a novel DNA methylation clock for human aging based on bisulfite sequencing data. Using BS-seq data from 529 samples retrieved from four tissues, our BS-clock achieves higher correlations with chronological age in multiple tissue types compared to existing array-based clocks. Our study revealed age-dependent aging rates across different age stages and disease conditions, and overall low cross-tissue prediction capability when a model trained on one tissue type is applied to others. In summary, BS-clock overcomes limitations of array-based techniques, offering genome-wide CpG site coverage and more robust and accurate aging quantification. This research paves the way for advanced epigenetic studies of aging and holds promise for developing targeted interventions to promote healthy aging.
Availability and implementation: All analysis code for reproducing the results of the study is publicly available at https://github.com/hucongcong97/BS-clock.
Title: BS-clock, advancing epigenetic age prediction with high-resolution DNA methylation bisulfite sequencing data. Authors: Congcong Hu, Yunxiao Li, Longhui Li, Naiqian Zhang, Xiaoqi Zheng. Pub Date: 2024-11-01 DOI: 10.1093/bioinformatics/btae656. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11572488/pdf/
Pub Date: 2024-11-01 DOI: 10.1093/bioinformatics/btae644
Ifigenia Tsitsa, Izabella Krystkowiak, Norman E Davey
Motivation: Short linear motifs (SLiMs) are compact functional modules that mediate low-affinity protein-protein interactions. SLiMs direct the function of many dynamic signalling and regulatory complexes, playing a central role in most biological processes of the cell. Motif-binding determinants describe the contribution of each residue in a motif-containing peptide to the affinity and specificity of binding to the motif-binding partner. Motif-binding determinants are generally defined as a motif consensus pattern or a position-specific scoring matrix (PSSM) encoding quantitative preferences. Motif-binding determinant comparison is an important motif analysis task and can be applied to motif annotation, classification, clustering, discovery and benchmarking. Currently, binding determinant comparison is generally performed by analysing consensus similarity; however, this ignores important quantitative information in both the consensus and non-consensus positions.
Results: We have created a new tool, CompariPSSM, that quantifies the similarity between motif-binding determinants using sliding window PSSM-PSSM comparison and scores PSSM similarity using a randomisation-based probabilistic framework. The tool has been benchmarked on curated data from the eukaryotic linear motif database and experimental data from proteomic peptide-phage display. CompariPSSM can be used for peptide classification to validate motif classes, peptide clustering to group functionally related conserved disordered regions, and benchmarking experimental motif discovery methods.
Availability and implementation: CompariPSSM is available at https://slim.icr.ac.uk/projects/comparipssm.
Title: CompariPSSM: a PSSM-PSSM comparison tool for motif-binding determinant analysis. Journal: Bioinformatics (Oxford, England).
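The Results above mention a sliding window PSSM-PSSM comparison scored with a randomisation-based framework, without giving the scoring details. The sketch below is one way such a comparison could look, assuming per-position Pearson correlation as the similarity measure and within-position score shuffling as the null model. Both choices, the six-letter toy alphabet, and all names and values are assumptions for illustration, not the tool's documented method.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length score vectors (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def best_window_score(query, target, min_overlap=2):
    """Slide query along target; return (best mean per-position correlation, offset)."""
    best_score, best_offset = float("-inf"), None
    for offset in range(min_overlap - len(query), len(target) - min_overlap + 1):
        pairs = [(query[i], target[i + offset])
                 for i in range(len(query)) if 0 <= i + offset < len(target)]
        if len(pairs) < min_overlap:
            continue
        score = mean(pearson(q, t) for q, t in pairs)
        if score > best_score:
            best_score, best_offset = score, offset
    return best_score, best_offset


def empirical_p_value(query, target, n_random=200, seed=0):
    """Fraction of shuffled queries scoring at least as well as the real one."""
    rng = random.Random(seed)
    observed, _offset = best_window_score(query, target)
    hits = 0
    for _ in range(n_random):
        # Null model (assumed): permute residue scores within each position.
        shuffled = [rng.sample(position, len(position)) for position in query]
        score, _offset = best_window_score(shuffled, target)
        if score >= observed:
            hits += 1
    return (hits + 1) / (n_random + 1)


# Hypothetical PSSMs: one inner list per motif position, one score per residue type.
target = [
    [3.0, 0.0, -1.0, -2.0, 1.0, -1.0],
    [-2.0, 4.0, 0.0, -1.0, -1.0, 0.0],
    [0.0, -1.0, 5.0, -2.0, -1.0, -1.0],
    [-1.0, -1.0, -2.0, 4.0, 0.0, 0.0],
    [1.0, 0.0, -1.0, -1.0, 3.0, -2.0],
]
query = [list(position) for position in target[1:4]]  # embedded copy of positions 1..3

score, offset = best_window_score(query, target)
p_value = empirical_p_value(query, target)
print(offset, round(score, 3), p_value < 0.05)
```

Comparing full score vectors per position, rather than a consensus string, is what lets non-consensus preferences contribute to the similarity, which is the limitation of consensus-only comparison that the Motivation points out.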
Pub Date: 2024-11-01 DOI: 10.1093/bioinformatics/btae624
Sai Srikanth Lakkimsetty, Andreas Weber, Kylie A Bemis, Verena Stehl, Peter Bronsert, Melanie C Föll, Olga Vitek
Summary: Joint analysis of mass spectrometry images (MS images) and microscopy images of hematoxylin and eosin (H&E) stained tissues assists pathologists in characterizing the morphological structure of the tissues and in performing diagnosis. Unfortunately, the analysis is undermined by substantial differences between these modalities in terms of aspect ratio, spatial resolution, and number of channels in each image, as well as by large global or small local elastic spatial deformations of one image with respect to the other. Therefore, accurate coregistration of the images is a critical prerequisite for their joint interpretation. We introduce MSIreg, an open-source R package for coregistration of MSI and H&E images. MSIreg is designed for high-dimensional MSI experiments where each spatial location is represented by thousands of mass features. Unlike most existing coregistration methods, MSIreg implements a landmark-free workflow and quantitative metrics for performance evaluation. We evaluate the performance of MSIreg on six case studies, including coregistration of contiguous tissues with large deformations, as well as simultaneous coregistration of 29 tissue microarray cores.
Availability and implementation: The R package, installation instructions, and fully reproducible vignettes describing methods and Case Studies are available open-source under the GPL-3.0 license at https://github.com/sslakkimsetty/msireg/.
Title: MSIreg: an R package for unsupervised coregistration of mass spectrometry and H&E images. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11530229/pdf/
Pub Date: 2024-11-01 DOI: 10.1093/bioinformatics/btae639
Sumyyah Toonsi, Iris Ivy Gauran, Hernando Ombao, Paul N Schofield, Robert Hoehndorf
Motivation: Identifying causal relations between diseases allows for the study of shared pathways, biological mechanisms, and inter-disease risks. Such causal relations can facilitate the identification of potential disease precursors and candidates for drug re-purposing. However, computational methods often lack access to these causal relations. A few approaches have been developed to automatically extract causal relationships between diseases from unstructured text, but they often focus on only a small number of diseases, lack validation of the extracted causal relations, or do not make their data available.
Results: We automatically mined statements asserting a causal relation between diseases from the scientific literature by leveraging lexical patterns. Following automated mining of causal relations, we mapped the diseases to International Classification of Diseases (ICD) identifiers to allow direct application to clinical data. We provide quantitative and qualitative measures to evaluate the mined causal relations and compare them to UK Biobank diagnosis data as a completely independent data source. The validated causal associations were used to create a directed acyclic graph that can be used by causal inference frameworks. We demonstrate the utility of our causal network by performing causal inference using the do-calculus, using relations within the graph to construct and improve polygenic risk scores, and disentangling the pleiotropic effects of variants.
Availability and implementation: The data are available through https://github.com/bio-ontology-research-group/causal-relations-between-diseases.
Title: Causal relationships between diseases mined from the literature improve the use of polygenic risk scores. Journal: Bioinformatics (Oxford, England).
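The Results above say that validated cause-effect pairs were assembled into a directed acyclic graph for use by causal inference frameworks. The sketch below shows, under stated assumptions, how a list of such pairs could be checked for acyclicity and queried for upstream causes. The node labels `A` to `D` are placeholders standing in for ICD identifiers, the edges are invented toy input rather than mined relations, and the function names are hypothetical.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Turn (cause, effect) pairs into a node set and a cause -> effects map."""
    children = defaultdict(set)
    nodes = set()
    for cause, effect in edges:
        children[cause].add(effect)
        nodes.update((cause, effect))
    return nodes, children


def topological_order(nodes, children):
    """Kahn's algorithm: return a causal ordering, or None if the graph has a cycle."""
    indegree = {node: 0 for node in nodes}
    for cause in children:
        for effect in children[cause]:
            indegree[effect] += 1
    queue = deque(sorted(node for node in nodes if indegree[node] == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for effect in sorted(children.get(node, ())):
            indegree[effect] -= 1
            if indegree[effect] == 0:
                queue.append(effect)
    return order if len(order) == len(nodes) else None


def ancestors(node, children):
    """All direct and indirect causes of a node."""
    parents = defaultdict(set)
    for cause, effects in children.items():
        for effect in effects:
            parents[effect].add(cause)
    seen, stack = set(), [node]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Toy input: placeholder disease labels, not relations from the mined dataset.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("D", "C")]
nodes, children = build_graph(edges)
order = topological_order(nodes, children)
upstream_of_c = ancestors("C", children)

# Adding a feedback edge makes the graph cyclic, so no ordering exists.
cyclic_nodes, cyclic_children = build_graph(edges + [("C", "A")])
cyclic_order = topological_order(cyclic_nodes, cyclic_children)
print(order, sorted(upstream_of_c), cyclic_order)
```

An acyclicity check of this kind matters because mined pairwise statements can contradict each other in direction; a graph that fails the check would need edges removed or reviewed before it could serve as input to a causal inference framework.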
Pub Date: 2024-11-01 DOI: 10.1093/bioinformatics/btae645
Yuan Gao, Rob Patro, Peng Jiang
Motivation: A crucial component of intuitive data visualization is presenting a hierarchical tree structure with interactive functions. For example, single-cell transcriptomics studies may generate gene expression values with developmental trajectories or cell lineage structures. Two common visualization methods, t-Distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP), require two separate figures to depict the distribution of cell types and gene expression data, with low-dimension projections that may not capture the hierarchical structures among cells.
Results: Here, we present a JavaScript framework and an interactive web app named Collapsible Tree, which presents values jointly with interactive, expandable, and collapsible lineage structures. For example, the Collapsible Tree presents cellular states and gene expression from single-cell transcriptomics within a single hierarchical plot, enabling comparisons of gene expression across lineages and subtle patterns between sub-lineages. Our framework can facilitate the exploration of complicated value patterns that are not evident in UMAP or t-SNE plots.
Availability and implementation: The Collapsible Tree web interface is available at https://collapsibletree.data2in.net. The JavaScript library source code is available at https://github.com/data2intelligence/collapsible_tree.
Title: Collapsible tree: interactive web app to present collapsible hierarchies. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11543613/pdf/
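The Results above describe presenting values together with expandable and collapsible lineage structures. The published framework is written in JavaScript; the Python sketch below only illustrates the underlying data structure idea, assuming each node carries one value and a collapsed flag. The class, function names, lineage labels, and values are all hypothetical and do not reflect the library's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class TreeNode:
    name: str
    value: float  # e.g. an expression value attached to this cell state
    children: list = field(default_factory=list)
    collapsed: bool = False


def visible_rows(node, depth=0):
    """Yield (depth, name, value) for every node that is currently displayed."""
    yield depth, node.name, node.value
    if not node.collapsed:
        for child in node.children:
            yield from visible_rows(child, depth + 1)


def toggle(root, name):
    """Flip the collapsed flag of the first node with this name; return True if found."""
    stack = [root]
    while stack:
        node = stack.pop()
        if node.name == name:
            node.collapsed = not node.collapsed
            return True
        stack.extend(node.children)
    return False


# Hypothetical lineage with made-up values.
root = TreeNode("progenitor", 1.0, [
    TreeNode("lineage_1", 2.5, [
        TreeNode("sublineage_1a", 3.0),
        TreeNode("sublineage_1b", 0.5),
    ]),
    TreeNode("lineage_2", 0.2),
])

expanded = [name for _, name, _ in visible_rows(root)]
toggle(root, "lineage_1")
collapsed = [name for _, name, _ in visible_rows(root)]
print(expanded)
print(collapsed)
```

Because each displayed row keeps its value and its depth, lineage structure and expression can be read from one view, in contrast to the two separate figures that the Motivation says t-SNE and UMAP require.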
Pub Date: 2024-11-01 DOI: 10.1093/bioinformatics/btae655
Quancheng Liu, Chengxin Zhang, Lydia Freddolino
Motivation: Accurate protein function prediction is crucial for understanding biological processes and advancing biomedical research. However, the rapid growth of protein sequences far outpaces the experimental characterization of their functions, necessitating the development of automated computational methods.
Results: We present InterLabelGO+, a hybrid approach that integrates a deep learning-based method with an alignment-based method for improved protein function prediction. InterLabelGO+ incorporates a novel loss function that addresses label dependency and imbalance, and it further improves performance by dynamically weighting the alignment-based component. A preliminary version of InterLabelGO+ performed strongly in the CAFA5 challenge, ranking sixth out of 1625 participating teams. Comprehensive evaluations on large-scale protein function prediction tasks demonstrate that InterLabelGO+ accurately predicts Gene Ontology terms across various functional categories and evaluation metrics.
Availability and implementation: The source code and datasets for InterLabelGO+ are freely available on GitHub at https://github.com/QuanEvans/InterLabelGO. A web server is available at https://seq2fun.dcmb.med.umich.edu/InterLabelGO/. The software is implemented in Python and PyTorch and is supported on Linux and macOS.
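The abstract states only that the alignment-based component is weighted dynamically; it does not give the rule. As a minimal sketch of what such a blend could look like, the Python below assumes a hypothetical rule in which the alignment weight follows the best-hit sequence identity, capped at a maximum. The function name `combine_scores`, the parameter `max_aln_weight`, and the example scores are all illustrative assumptions, not the InterLabelGO+ implementation.

```python
from typing import Dict


def combine_scores(
    dl_scores: Dict[str, float],
    aln_scores: Dict[str, float],
    seq_identity: float,
    max_aln_weight: float = 0.8,
) -> Dict[str, float]:
    """Blend per-GO-term scores from two predictors.

    Hypothetical weighting rule (an assumption, not the published formula):
    the alignment-based scores are trusted in proportion to the best-hit
    sequence identity in [0, 1], capped at max_aln_weight.
    """
    if not 0.0 <= seq_identity <= 1.0:
        raise ValueError("seq_identity must be in [0, 1]")
    w = min(max_aln_weight, seq_identity)
    # A term missing from one predictor contributes a score of 0 from it.
    terms = set(dl_scores) | set(aln_scores)
    return {
        t: (1.0 - w) * dl_scores.get(t, 0.0) + w * aln_scores.get(t, 0.0)
        for t in terms
    }


# Made-up scores for two Gene Ontology terms, for illustration only.
dl = {"GO:0003824": 0.60, "GO:0005515": 0.30}
aln = {"GO:0003824": 0.90}
combined = combine_scores(dl, aln, seq_identity=0.5)
```

With a sequence identity of 0, this sketch falls back entirely on the deep-learning scores, which is the behavior one would want for proteins without usable homologs under the stated assumption.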
Title: InterLabelGO+: unraveling label correlations in protein function prediction. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11568131/pdf/
Pub Date: 2024-11-01 DOI: 10.1093/bioinformatics/btae622
Michael K B Ford, Ananth Hari, Qinghui Zhou, Ibrahim Numanagić, S Cenk Sahinalp
Summary: Natural killer (NK) cells are essential components of the innate immune system, and their activity is significantly regulated by killer cell immunoglobulin-like receptors (KIRs). The diversity and structural complexity of KIR genes make accurate genotyping difficult, yet accurate genotyping is essential for understanding NK cell function and its implications in health and disease. Traditional genotyping methods struggle with the variable nature of KIR genes, leading to inaccuracies that can impede immunogenetic research. These challenges extend to high-quality phased assemblies, which have recently been popularized by the Human Pangenome Reference Consortium (HPRC). This article introduces BAKIR (Biologically informed Annotator for KIR locus), a computational tool designed to address the challenges of KIR genotyping and annotation on high-quality, phased genome assemblies. BAKIR aims to improve the accuracy of KIR gene annotations by structuring its annotation pipeline around the identification of key functional mutations, thereby improving both the identification and the relevance of gene and allele calls. It uses a multi-stage mapping, alignment, and variant-calling process to achieve high-precision gene and allele identification while maintaining high recall for sequences that are significantly mutated or truncated relative to the known allele database. On a subset of the HPRC assemblies, BAKIR improved many of the associated annotations and called novel variants. BAKIR is freely available on GitHub, can be installed via pip, conda, or a Singularity container, and provides a command-line interface.
Availability and implementation: BAKIR is available at github.com/algo-cancer/bakir.
Supplementary information: Supplementary data are available at Bioinformatics online.
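The summary describes allele identification against a known allele database, with novel variants reported when a sequence differs from every known allele. The toy Python sketch below illustrates only that final comparison step, under a strong simplifying assumption: the query and all database alleles are already aligned to the same length, so a position-by-position comparison suffices. The function `closest_allele`, the allele names, and the sequences are hypothetical and are not part of BAKIR or its pipeline.

```python
from typing import Dict, List, Optional, Tuple

# A difference is (0-based position, reference base, query base).
Diff = Tuple[int, str, str]


def closest_allele(query: str, alleles: Dict[str, str]) -> Tuple[str, List[Diff]]:
    """Return the nearest known allele and the positions where the query differs.

    Toy assumption: sequences are pre-aligned and of equal length, so the
    distance is a simple mismatch count rather than a real alignment score.
    """
    best_name = ""
    best_diffs: Optional[List[Diff]] = None
    for name in sorted(alleles):  # sorted for deterministic tie-breaking
        ref = alleles[name]
        if len(ref) != len(query):
            raise ValueError(f"{name} is not aligned to the query length")
        diffs = [(i, r, q) for i, (r, q) in enumerate(zip(ref, query)) if r != q]
        if best_diffs is None or len(diffs) < len(best_diffs):
            best_name, best_diffs = name, diffs
    if best_diffs is None:
        raise ValueError("allele database is empty")
    return best_name, best_diffs


# Hypothetical allele names and sequences, for illustration only.
db = {"allele_A": "ACGTACGT", "allele_B": "ACGTTCGT"}
name, variants = closest_allele("ACGTACGA", db)
is_novel = len(variants) > 0  # no exact match in the database
```

In this example the query is one mismatch away from `allele_A`, so it would be reported as a novel variant of that allele. Real KIR sequences also contain insertions, deletions, and truncations, which this sketch deliberately ignores.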
Title: Biologically-informed killer cell immunoglobulin-like receptor gene annotation tool. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11549020/pdf/