Pub Date: 2023-05-15; eCollection Date: 2023-01-01; DOI: 10.3389/fbinf.2023.1120290
Xinling Li, Greg Gibson, Peng Qiu
One important characteristic of single-cell RNA sequencing (scRNA-seq) data is its high sparsity: the gene-cell count matrix contains a high proportion of zeros. This sparsity has motivated widespread discussion of dropouts and missing data, as well as imputation algorithms for scRNA-seq analysis. Here, we investigate whether some genes are more prone to under-detection in scRNA-seq and, if so, what those genes have in common. From public data sources, we gathered paired bulk RNA-seq and scRNA-seq data from 53 human samples generated in diverse biological contexts. We derived pseudo-bulk gene expression by averaging the scRNA-seq data across cells. Comparisons of the paired bulk and pseudo-bulk gene expression profiles revealed a collection of genes that are frequently under-detected in scRNA-seq compared to bulk RNA-seq. This result was robust to randomization when unpaired bulk and pseudo-bulk gene expression profiles were compared. We performed a motif search on the last 350 bp of the identified genes and observed an enrichment of poly(T) motifs. A poly(T) motif toward the tail of these genes may be able to form hairpin structures with the poly(A) tails of their mRNA transcripts, making the transcripts difficult to capture during scRNA-seq library preparation. This is a mechanistic conjecture for why certain genes may be more prone to under-detection in scRNA-seq.
{"title":"Gene representation in scRNA-seq is correlated with common motifs at the 3' end of transcripts.","journal":{"name":"Frontiers in bioinformatics","volume":"3","pages":"1120290"},"publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10226423/pdf/"}
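The pseudo-bulk comparison described in the abstract above (averaging scRNA-seq counts across cells, then comparing with the paired bulk profile) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the counts-per-million scaling, the pseudocount, and the threshold are all assumptions introduced here.

```python
import math


def pseudo_bulk(counts):
    """Average a cells x genes count matrix over cells (one value per gene)."""
    n_cells = len(counts)
    n_genes = len(counts[0])
    return [sum(cell[g] for cell in counts) / n_cells for g in range(n_genes)]


def to_cpm(profile):
    """Scale a profile to counts per million (assumed normalization)."""
    total = sum(profile)
    return [1e6 * x / total for x in profile]


def log2_ratio(pseudo, bulk, pseudocount=1.0):
    """Per-gene log2(pseudo-bulk / bulk); the pseudocount avoids log(0)."""
    return [
        math.log2((p + pseudocount) / (b + pseudocount))
        for p, b in zip(pseudo, bulk)
    ]


def under_detected(ratios, threshold=-1.0):
    """Indices of genes whose log2 ratio is below a hypothetical threshold."""
    return [g for g, r in enumerate(ratios) if r < threshold]
```

In practice both profiles would be brought to a common scale (for example with `to_cpm`) before calling `log2_ratio`, and a gene would be called frequently under-detected only if it is flagged across many paired samples.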
Pub Date: 2023-04-06; eCollection Date: 2023-01-01; DOI: 10.3389/fbinf.2023.1122559
Chang Li, Ian Hou, Mingjia Ma, Grace Wang, Yongsheng Bai, Xiaoming Liu
Introduction: Alzheimer's disease (AD) is one of the most prominent medical conditions in the world. Understanding the genetic component of the disease can greatly advance our knowledge of its progression, treatment and prognosis. Single amino-acid variants (SAVs) in the APOE gene have been widely investigated as a risk factor for AD. Studies, including genome-wide association studies, meta-analysis-based studies, and in-vivo animal studies, have been carried out to investigate the functional importance and pathogenic potential of APOE SAVs. However, given the high cost of such large-scale or experimental studies, only a handful of the reported variants have definite explanations. The recent development of in-silico analytical approaches, especially large-scale deep learning models, has opened new opportunities to probe the structural and functional importance of APOE variants extensively. Method: In this study, we take an ensemble approach that simultaneously uses large-scale protein sequence-based models, including Evolutionary Scale Model and AlphaFold, together with a few in-silico functional prediction web services, to investigate the known and possibly disease-causing SAVs in APOE and evaluate their likelihood of being functionally and structurally disruptive. Results: Using an ensemble approach with little to no prior field-specific knowledge, we report 5 SAVs in the APOE gene as potentially disruptive, one of which (C112R) was classified by previous studies as a key risk factor for AD. Discussion: Our study provides a novel framework to analyze and prioritize the functional and structural importance of SAVs for future experimental and functional validation.
{"title":"Orthogonal analysis of variants in APOE gene using in-silico approaches reveals novel disrupting variants.","journal":{"name":"Frontiers in bioinformatics","volume":"3","pages":"1122559"},"publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10117898/pdf/"}
Pub Date: 2023-03-29; DOI: 10.1101/2023.02.13.528380
Wanxin Li, Jules Mirone, Ashok Prasad, Nina Miolane, Carine Legrand, K. D. Duc
Conventional dimensionality reduction methods like Multidimensional Scaling (MDS) are sensitive to the presence of orthogonal outliers, which lead to significant defects in the embedding. We introduce a robust MDS method, called DeCOr-MDS (Detection and Correction of Orthogonal outliers using MDS), based on the geometry and statistics of simplices formed by data points, that detects orthogonal outliers and subsequently reduces dimensionality. We validate our method using synthetic datasets, and further show how it can be applied to a variety of large real biological datasets, including cancer image cell data and human microbiome project data.
{"title":"Orthogonal outlier detection and dimension estimation for improved MDS embedding of biological datasets","journal":{"name":"Frontiers in bioinformatics","volume":"3 1"},"publicationTypes":"Journal Article"}
Pub Date: 2023-03-29; eCollection Date: 2023-01-01; DOI: 10.3389/fbinf.2023.1112649
Emilia Ståhlbom, Jesper Molin, Anders Ynnerman, Claes Lundström
In this perspective article we discuss a certain type of research on visualization for bioinformatics data, namely, methods targeting clinical use. We argue that in this subarea additional complex challenges come into play, particularly so in genomics. We here describe four such challenge areas, elicited from a domain characterization effort in clinical genomics. We also list opportunities for visualization research to address clinical challenges in genomics that were uncovered in the case study. The findings are shown to have parallels with experiences from the diagnostic imaging domain.
{"title":"The thorny complexities of visualization research for clinical settings: A case study from genomics.","journal":{"name":"Frontiers in bioinformatics","volume":"3","pages":"1112649"},"publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10090312/pdf/"}
Pub Date: 2023-03-22; eCollection Date: 2023-01-01; DOI: 10.3389/fbinf.2023.1069487
Jover Lee, James Hadfield, Allison Black, Thomas R Sibley, Richard A Neher, Trevor Bedford, John Huddleston
Seasonal influenza vaccines must be updated regularly to account for mutations that allow influenza viruses to escape our existing immunity. A successful vaccine should represent the genetic diversity of recently circulating viruses and induce antibodies that effectively prevent infection by those recent viruses. Thus, linking the genetic composition of circulating viruses to the serological experimental results that measure antibody efficacy is crucial to the vaccine design decision. Historically, genetic and serological data have been presented separately, as static visualizations of phylogenetic trees and tabular serological results, to identify vaccine candidates. To simplify this decision-making process, we have created an interactive tool for visualizing serological data that has been integrated into Nextstrain's real-time phylogenetic visualization framework, Auspice. We show how decision makers may use the combined interactive visualizations to explore the relationships between complex data sets, both for prospective selection of vaccine viruses and for retrospective exploration of their performance.
{"title":"Joint visualization of seasonal influenza serology and phylogeny to inform vaccine composition.","journal":{"name":"Frontiers in bioinformatics","volume":"3","pages":"1069487"},"publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10073671/pdf/"}
Pub Date: 2023-03-08; eCollection Date: 2023-01-01; DOI: 10.3389/fbinf.2023.998991
Gwendal Fouché, Ferran Argelaguet, Emmanuel Faure, Charles Kervrann
The analysis of multidimensional time-varying datasets faces challenges, notably regarding the representation of the data and the visualization of temporal variations. We propose an extension of the well-known Space-Time Cube (STC) visualization technique in order to visualize time-varying 3D spatial data, taking advantage of the interaction capabilities of Virtual Reality (VR). First, we propose the Space-Time Hypercube (STH) as an abstraction for 3D temporal data, extended from the STC concept. Second, through the example of an embryo development imaging dataset, we detail the construction and visualization of an STC based on a user-driven projection of the spatial and temporal information. This projection yields a 3D STC visualization, which can also encode additional numerical and categorical data. Additionally, we propose a set of tools that let the user filter and manipulate the 3D STC, building on the visualization, exploration and interaction possibilities offered by VR. Finally, we evaluated the proposed visualization method in the context of 3D temporal cell imaging data analysis, through a user study (n = 5) reporting the feedback from five biologists. These domain experts also accompanied the application design as consultants, providing insights on how the STC visualization could be used for the exploration of complex 3D temporal morphogenesis data.
{"title":"Immersive and interactive visualization of 3D spatio-temporal data using a space time hypercube: Application to cell division and morphogenesis analysis.","journal":{"name":"Frontiers in bioinformatics","volume":"3","pages":"998991"},"publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10031126/pdf/"}
Pub Date: 2023-03-03; eCollection Date: 2023-01-01; DOI: 10.3389/fbinf.2023.1157956
Fotis A Baltoumas, Evangelos Karatzas, David Paez-Espino, Nefeli K Venetsianou, Eleni Aplakidou, Anastasis Oulas, Robert D Finn, Sergey Ovchinnikov, Evangelos Pafilis, Nikos C Kyrpides, Georgios A Pavlopoulos
Metagenomics has enabled accessing the genetic repertoire of natural microbial communities. Metagenome shotgun sequencing has become the method of choice for studying and classifying microorganisms from various environments. To this end, several methods have been developed to process and analyze the sequence data from raw reads to end-products such as predicted protein sequences or families. In this article, we provide a thorough review to simplify such processes and discuss the alternative methodologies that can be followed in order to explore biodiversity at the protein family level. We provide details for analysis tools and we comment on their scalability as well as their advantages and disadvantages. Finally, we report the available data repositories and recommend various approaches for protein family annotation related to phylogenetic distribution, structure prediction and metadata enrichment.
{"title":"Exploring microbial functional biodiversity at the protein family level-From metagenomic sequence reads to annotated protein clusters.","journal":{"name":"Frontiers in bioinformatics","volume":"3","pages":"1157956"},"publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10029925/pdf/"}
Pub Date: 2023-02-28; eCollection Date: 2023-01-01; DOI: 10.3389/fbinf.2023.1120370
Letícia M F Bertoline, Angélica N Lima, Jose E Krieger, Samantha K Teixeira
The three-dimensional structure of a protein is directly correlated with its function, and its determination is critical to understanding biological processes and addressing human health and life science problems in general. Although new protein structures are experimentally obtained over time, there is still a large gap between the number of protein sequences deposited in UniProt and the number with a resolved tertiary structure. In this context, studies have emerged to predict protein structures by template-based or free-modeling methods. In recent years, different methods have been combined to overcome their individual limitations, until the emergence of AlphaFold2, which demonstrated that predicting protein structure with high accuracy at unprecedented scale is possible. Despite its current impact in the field, AlphaFold2 has limitations. Recently, new methods based on protein language models have promised to revolutionize protein structural biology by allowing the discovery of protein structure and function from the evolutionary patterns present in protein sequences alone. Even though these methods do not reach AlphaFold2 accuracy, they have already addressed some of its limitations, being able to predict with high accuracy more than 200 million proteins from metagenomic databases. In this mini-review, we provide an overview of the breakthroughs in protein structure prediction before and after the emergence of AlphaFold2.
{"title":"Before and after AlphaFold2: An overview of protein structure prediction.","journal":{"name":"Frontiers in bioinformatics","volume":"3","pages":"1120370"},"publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10011655/pdf/"}
Pub Date: 2023-02-24; eCollection Date: 2023-01-01; DOI: 10.3389/fbinf.2023.1092853
Tristan Hoellinger, Camille Mestre, Hugues Aschard, Wilfried Le Goff, Sylvain Foissac, Thomas Faraut, Sarah Djebali
Differences in cells' functions arise from differential activity of regulatory elements, including enhancers. Enhancers are cis-regulatory elements that cooperate with promoters through transcription factors to activate the expression of one or several genes by getting physically close to them in the 3D space of the nucleus. There is increasing evidence that genetic variants associated with common diseases are enriched in enhancers active in cell types relevant to these diseases. Identifying the enhancers associated with genes and, conversely, the sets of genes activated by each enhancer (the so-called enhancer/gene or E/G relationships) across cell types can help in understanding the genetic mechanisms underlying human diseases. There are three broad approaches for the genome-wide identification of E/G relationships in a cell type: 1) genetic link methods or eQTL, 2) functional link methods based on 1D functional data such as open chromatin, histone marks or gene expression, and 3) spatial link methods based on 3D data such as HiC. Since 1) and 3) are costly, the current strategy is to develop functional link methods and to use data from 1) and 3) as a reference to evaluate them. However, there is still no consensus on the best functional link method to date, and method comparisons remain rare. Here, we compared the relative performances of three recent methods for the identification of enhancer-gene links, TargetFinder, Average-Rank, and the ABC model, using the three latest benchmarks from the field: a reference that combines 3D and eQTL data, called BENGI, and two genetic screening references, called CRiFF and CRISPRi. Overall, none of the three methods performed best on all three references. The CRiFF and CRISPRi reference sets are likely more reliable, but CRiFF is not genome-wide, and CRiFF and CRISPRi are mostly available for the K562 cancer cell line. The BENGI reference set is genome-wide but likely contains many false positives. This study therefore calls for new reliable and genome-wide E/G reference data rather than new functional link E/G identification methods.
{"title":"Enhancer/gene relationships: Need for more reliable genome-wide reference sets.","journal":{"name":"Frontiers in bioinformatics","volume":"3","pages":"1092853"},"publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9999192/pdf/"}
Pub Date: 2023-02-17; eCollection Date: 2023-01-01; DOI: 10.3389/fbinf.2023.1123993
Teng Ann Ng, Shamima Rashid, Chee Keong Kwoh
Several databases provide virus-host protein interactions. While most provide curated records of interacting virus-host protein pairs, information on the strain-specific virulence factors or protein domains involved is lacking. Some databases offer incomplete coverage of influenza strains because of the need to sift through vast amounts of literature (which also covers other major viruses such as HIV and Dengue). None has offered complete, strain-specific protein-protein interaction records for the influenza A group of viruses. In this paper, we present a comprehensive network of predicted domain-domain interactions (DDIs) between influenza A virus (IAV) and mouse host proteins that allows the systematic study of disease factors by taking virulence information (lethal dose) into account. From a previously published dataset of lethal dose studies of IAV infection in mice, we constructed an interacting domain network with mouse and viral protein domains as nodes, connected by weighted edges. The edges were scored with the Domain Interaction Statistical Potential (DISPOT) to indicate putative DDIs. The virulence network can be easily navigated via a web browser, with the associated virulence information (LD50 values) prominently displayed. The network will aid influenza A disease modeling by providing strain-specific virulence levels with interacting protein domains. It may also contribute to computational methods for uncovering influenza infection mechanisms mediated through protein domain interactions between viral and host proteins. It is available at https://iav-ppi.onrender.com/home.
{"title":"Virulence network of interacting domains of influenza a and mouse proteins.","authors":"Teng Ann Ng, Shamima Rashid, Chee Keong Kwoh","doi":"10.3389/fbinf.2023.1123993","DOIUrl":"10.3389/fbinf.2023.1123993","url":null,"PeriodicalId":73066,"journal":{"name":"Frontiers in bioinformatics","volume":"3 ","pages":"1123993"},"PeriodicalIF":2.8,"publicationDate":"2023-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9982101/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10849436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}