Pub Date: 2024-08-28; eCollection Date: 2024-01-01; DOI: 10.3389/fbinf.2024.1419274
Hermenegildo Taboada-Castro, Alfredo José Hernández-Álvarez, Juan Miguel Escorcia-Rodríguez, Julio Augusto Freyre-González, Edgardo Galán-Vásquez, Sergio Encarnación-Guevara
Mixed proteome-transcriptome data from Rhizobium etli CFN42 in exponential growth and as nitrogen-fixing bacteroids, together with transcriptome data from Sinorhizobium meliloti 1021 in growth and as nitrogen-fixing bacteroids, were integrated into transcriptional regulatory networks (TRNs). The one-step network construction consisted of a matrix-clustering analysis of the matrices of the gene profile and all matrices of the transcription factors (TFs) of each genome. The networks were constructed with the "prediction of regulatory network" application of the RhizoBindingSites database (http://rhizobindingsites.ccg.unam.mx/). The deduced free-living Rhizobium etli network contained 1,146 genes, including 380 TFs and 12 sigma factors. The bacteroid R. etli CFN42 network contained 884 genes, of which 364 were TFs and 12 were sigma factors; the deduced free-living Sinorhizobium meliloti 1021 network contained 643 genes, of which 259 were TFs and seven were sigma factors; and the bacteroid S. meliloti 1021 network contained 357 genes, of which 210 were TFs and six were sigma factors. These deduced condition-dependent networks resemble the biological, condition-independent E. coli and B. subtilis networks and segregate from random Erdős-Rényi networks. The deduced networks showed a low average clustering coefficient. They were not scale-free, showing a gradually diminishing hierarchy of TFs, in contrast to the hierarchical role of the sigma factor rpoD in the E. coli K12 network. For rhizobial networks, the partitioning of the genome into chromosome, chromids, and plasmids, among which essential genes are distributed, together with a symbiotic ability that is mostly plasmid-encoded, may alter the structure of these deduced condition-dependent networks. This work provides potential TF gene-target relationship data for constructing regulons, which are the basic units of a TRN.
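To make the network statistics mentioned in this abstract more concrete, the following is a minimal sketch of how an average clustering coefficient can be computed for a regulatory network and compared with a random Erdős-Rényi graph of the same size. It is not the authors' pipeline: the edge list, the gene names (`tfA`, `g1`, ...), and the helper names are hypothetical, the network is treated as undirected, and the random model used is the fixed-edge-count G(n, m) variant.

```python
import random
from itertools import combinations


def to_undirected(edges):
    """Turn (regulator, target) pairs into {node: set(neighbours)}, dropping self-loops."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # autoregulation does not contribute to clustering
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def average_clustering(adj):
    """Mean local clustering coefficient; nodes with fewer than 2 neighbours count as 0."""
    if not adj:
        return 0.0
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Count neighbour pairs that are themselves connected.
        linked = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * linked / (k * (k - 1))
    return total / len(adj)


def random_graph_like(adj, seed=0):
    """Erdos-Renyi G(n, m) graph with the same node and edge counts as `adj`."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    rand = {node: set() for node in nodes}
    for u, v in rng.sample(list(combinations(nodes, 2)), n_edges):
        rand[u].add(v)
        rand[v].add(u)
    return rand


# Hypothetical toy TRN: two TFs and three target genes.
toy_edges = [("tfA", "g1"), ("tfA", "g2"), ("tfA", "tfB"), ("tfB", "g2"), ("tfB", "g3")]
toy_adj = to_undirected(toy_edges)

if __name__ == "__main__":
    print("observed:", average_clustering(toy_adj))
    print("random  :", average_clustering(random_graph_like(toy_adj, seed=1)))
```

On real data one would average the random baseline over many seeds before drawing any conclusion about whether the observed clustering is low or high.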
Title: Rhizobium etli CFN42 and Sinorhizobium meliloti 1021 bioinformatic transcriptional regulatory networks from culture and symbiosis. Journal: Frontiers in Bioinformatics 4:1419274 (2024-08-28), Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11387232/pdf/
Pub Date: 2024-08-21; eCollection Date: 2024-01-01; DOI: 10.3389/fbinf.2024.1353807
Stuart G Jantzen, Gaël McGill, Jodie Jenkinson
Molecular visualization is a powerful way to represent the complex structure of molecules and their higher order assemblies, as well as the dynamics of their interactions. Although conventions for depicting static molecular structures and complexes are now well established and guide the viewer's attention to specific aspects of structure and function, little attention and design classification has been devoted to how molecular motion is depicted. As we continue to probe and discover how molecules move - including their internal flexibility, conformational changes and dynamic associations with binding partners and environments - we are faced with difficult design challenges that are relevant to molecular visualizations both for the scientific community and students of cell and molecular biology. To facilitate these design decisions, we have identified twelve molecular animation design principles that are important to consider when creating molecular animations. Many of these principles pertain to misconceptions that students have primarily regarding the agency of molecules, while others are derived from visual treatments frequently observed in molecular animations that may promote misconceptions. For each principle, we have created a pair of molecular animations that exemplify the principle by depicting the same content in the presence and absence of that design approach. Although not intended to be prescriptive, we hope this set of design principles can be used by the scientific, education, and scientific visualization communities to facilitate and improve the pedagogical effectiveness of molecular animation.
Title: Design principles for molecular animation. Journal: Frontiers in Bioinformatics 4:1353807 (2024-08-21), Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11371733/pdf/
Pub Date: 2024-08-16; eCollection Date: 2024-01-01; DOI: 10.3389/fbinf.2024.1358374
Jeremias Schebera, Dirk Zeckzer, Daniel Wiegreffe
Sequence alignments are often used to analyze genomic data. However, such alignments are often calculated and compared only on small sequence intervals. Longer sequences are usually divided into shorter intervals to obtain better alignment results, which usually means that the order context of the original sequence is lost. To prevent this, a graph structure can be used to represent the order of the original sequence on the alignment blocks. Visualizing these graph structures can provide insights into the structural variations of genomes in a semi-global context. In this paper, we propose a new graph drawing framework for representing genome-wide multiple sequence alignment (gMSA) data. We produce a hierarchical graph layout that supports the comparative analysis of genomes. Based on a reference, the differences and similarities of the different genome orders are visualized. We present a complete graph drawing framework for gMSA graphs together with the respective algorithms for each of the steps. Additionally, we provide a prototype and an example data set for analyzing gMSA graphs. Based on this data set, we demonstrate the functionalities of the framework using two examples.
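The idea of keeping the original sequence order on the alignment blocks can be sketched as a small graph construction. The snippet below is only an illustration of that underlying data structure, not the layout framework described in the abstract: the genome names, block identifiers, and function names are hypothetical, and each genome is assumed to be given as an ordered list of alignment block IDs.

```python
from collections import defaultdict


def build_block_graph(genome_orders):
    """Map each directed edge (block, next_block) to the set of genomes that contain it."""
    edges = defaultdict(set)
    for genome, blocks in genome_orders.items():
        # Consecutive blocks in a genome become one directed edge.
        for left, right in zip(blocks, blocks[1:]):
            edges[(left, right)].add(genome)
    return dict(edges)


def compare_to_reference(edges, reference):
    """Split edges into those supported by the reference genome and those that deviate."""
    shared, deviating = [], []
    for edge, genomes in sorted(edges.items()):
        (shared if reference in genomes else deviating).append(edge)
    return shared, deviating


# Hypothetical example: genomeX swaps blocks B2 and B3, genomeY lacks B3.
orders = {
    "ref": ["B1", "B2", "B3", "B4"],
    "genomeX": ["B1", "B3", "B2", "B4"],
    "genomeY": ["B1", "B2", "B4"],
}
graph = build_block_graph(orders)
shared_edges, deviating_edges = compare_to_reference(graph, "ref")

if __name__ == "__main__":
    print("follow the reference order:", shared_edges)
    print("deviate from the reference:", deviating_edges)
```

Edges that deviate from the reference are the ones a comparative layout would need to highlight, for example as rearrangements or deletions.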
Title: A layout framework for genome-wide multiple sequence alignment graphs. Journal: Frontiers in Bioinformatics 4:1358374 (2024-08-16), Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11362851/pdf/
Pub Date: 2024-07-25; eCollection Date: 2024-01-01; DOI: 10.3389/fbinf.2024.1425419
Sumeet Patiyal, Palak Tiwari, Mohit Ghai, Aman Dhapola, Anjali Dhall, Gajendra P S Raghava
Transcription factors are essential DNA-binding proteins that regulate the transcription rate of several genes and control the expression of genes inside a cell. The prediction of transcription factors with high precision is important for understanding biological processes such as cell differentiation, intracellular signaling, and cell-cycle control. In this study, we developed a hybrid method that combines alignment-based and alignment-free methods for predicting transcription factors with higher accuracy. All models have been trained, tested, and evaluated on a large dataset that contains 19,406 transcription factors and 523,560 non-transcription factor protein sequences. To avoid biases in evaluation, the datasets were divided into training and validation/independent datasets, where 80% of the data was used for training, and the remaining 20% was used for external validation. In the case of alignment-free methods, models were developed using machine learning techniques and the composition-based features of a protein. Our best alignment-free model obtained an AUC of 0.97 on an independent dataset. In the case of the alignment-based method, we used BLAST at different cut-offs to predict the transcription factors. Although the alignment-based method demonstrated excellent performance, it was unable to cover all transcription factors due to instances of no hits. To combine the strengths of both methods, we developed a hybrid method that combines alignment-free and alignment-based methods. In the hybrid method, we added the scores of the alignment-free and alignment-based methods and achieved a maximum AUC of 0.99 on the independent dataset. The method proposed in this study performs better than existing methods. We incorporated the best models in the webserver/Python Package Index/standalone package of "TransFacPred" (https://webs.iiitd.edu.in/raghava/transfacpred).
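The hybrid scheme described here, in which the scores of the alignment-free and alignment-based methods are added, can be sketched as follows. This is an assumption-laden illustration rather than the TransFacPred implementation: the function names, the ±0.5 contribution of a BLAST hit, the labels `"TF"` and `"nonTF"`, and the 0.5 decision threshold are all hypothetical. The only part taken from the abstract is that the two scores are summed and that a query with no BLAST hit falls back on the alignment-free score.

```python
def blast_score(top_hit_label, hit_bonus=0.5):
    """Score from the alignment-based step.

    top_hit_label is "TF", "nonTF", or None when BLAST returns no hit.
    The +/- hit_bonus values are illustrative assumptions.
    """
    if top_hit_label is None:
        return 0.0  # no hit: the alignment-based method is silent
    return hit_bonus if top_hit_label == "TF" else -hit_bonus


def hybrid_score(ml_probability, top_hit_label, hit_bonus=0.5):
    """Sum of the alignment-free (machine learning) and alignment-based scores."""
    return ml_probability + blast_score(top_hit_label, hit_bonus)


def predict_is_tf(ml_probability, top_hit_label, threshold=0.5):
    """Call a protein a transcription factor when the hybrid score reaches the threshold."""
    return hybrid_score(ml_probability, top_hit_label) >= threshold


if __name__ == "__main__":
    # A borderline ML score is rescued by a TF hit, and overruled by a non-TF hit.
    print(predict_is_tf(0.4, "TF"))
    print(predict_is_tf(0.8, "nonTF"))
    print(predict_is_tf(0.7, None))
```

The design point is coverage: the alignment-based term sharpens predictions when a close homolog exists, while the alignment-free term still produces a score for every query.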
Title: A hybrid approach for predicting transcription factors. Journal: Frontiers in Bioinformatics 4:1425419 (2024-07-25), Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11306938/pdf/
Pub Date: 2024-07-10; eCollection Date: 2024-01-01; DOI: 10.3389/fbinf.2024.1400003
Steven Weaver, Vanessa M Dávila Conn, Daniel Ji, Hannah Verdonk, Santiago Ávila-Ríos, Andrew J Leigh Brown, Joel O Wertheim, Sergei L Kosakovsky Pond
Molecular surveillance of viral pathogens and inference of transmission networks from genomic data play an increasingly important role in public health efforts, especially for HIV-1. For many methods, the genetic distance threshold used to connect sequences in the transmission network is a key parameter informing the properties of inferred networks. Using a distance threshold that is too high can result in a network with many spurious links, making it difficult to interpret. Conversely, a distance threshold that is too low can result in a network with too few links, which may not capture key insights into clusters of public health concern. Published research using the HIV-TRACE software package frequently uses the default threshold of 0.015 substitutions/site for HIV pol gene sequences, but in many cases, investigators heuristically select other threshold parameters to better capture the underlying dynamics of the epidemic they are studying. Here, we present a general heuristic scoring approach for tuning a distance threshold adaptively, which seeks to prevent the formation of giant clusters. We prioritize the ratio of the sizes of the largest and the second largest cluster, maximizing the number of clusters present in the network. We apply our scoring heuristic to outbreaks with different characteristics, such as regional or temporal variability, and demonstrate the utility of using the scoring mechanism's suggested distance threshold to identify clusters exhibiting risk factors that would have otherwise been more difficult to identify. For example, while we found that a 0.015 substitutions/site distance threshold is typical for US-like epidemics, recent outbreaks like the CRF07_BC subtype among men who have sex with men (MSM) in China have been found to have a lower optimal threshold of 0.005 to better capture the transition from injected drug use (IDU) to MSM as the primary risk factor. 
Alternatively, in communities surrounding Lake Victoria in Uganda, where there has been sustained heterosexual transmission for many years, we found that a larger distance threshold is necessary to capture a more risk factor-diverse population with sparse sampling over a longer period of time. Such identification may allow for more informed intervention action by respective public health officials.
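The threshold scan described in this abstract can be illustrated with a small sketch that, for each candidate distance threshold, links sequences whose pairwise distance is at or below the threshold and then reports the number of clusters and the ratio of the largest to the second-largest cluster. This is not the published AUTO-TUNE score: the function names, the toy distances, and the selection rule (most clusters first, then the smallest ratio) are assumptions made for illustration only.

```python
def cluster_sizes(distances, threshold):
    """Sizes of connected components when pairs with distance <= threshold are linked.

    Sequences without any link at this threshold are not counted as clusters.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), dist in distances.items():
        if dist <= threshold:
            parent[find(a)] = find(b)

    sizes = {}
    for node in list(parent):
        root = find(node)
        sizes[root] = sizes.get(root, 0) + 1
    return sorted(sizes.values(), reverse=True)


def scan_thresholds(distances, thresholds):
    """Return (threshold, n_clusters, largest/second-largest ratio) for each threshold."""
    rows = []
    for t in thresholds:
        sizes = cluster_sizes(distances, t)
        ratio = sizes[0] / sizes[1] if len(sizes) >= 2 else float("inf")
        rows.append((t, len(sizes), ratio))
    return rows


def pick_threshold(rows):
    """Illustrative rule: prefer many clusters, then a small largest/second-largest ratio."""
    return max(rows, key=lambda row: (row[1], -row[2]))[0]


# Hypothetical pairwise distances in substitutions/site.
toy_distances = {
    ("s1", "s2"): 0.004,
    ("s3", "s4"): 0.006,
    ("s5", "s6"): 0.012,
    ("s2", "s3"): 0.014,
    ("s4", "s5"): 0.020,
}
rows = scan_thresholds(toy_distances, [0.005, 0.010, 0.015, 0.025])
best = pick_threshold(rows)

if __name__ == "__main__":
    for row in rows:
        print(row)
    print("suggested threshold:", best)
```

In the toy data, the loosest threshold merges everything into one giant cluster, which is exactly the outcome the heuristic is meant to avoid.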
Title: AUTO-TUNE: selecting the distance threshold for inferring HIV transmission clusters. Journal: Frontiers in Bioinformatics 4:1400003 (2024-07-10), Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11289888/pdf/
Pub Date: 2024-07-08; eCollection Date: 2024-01-01; DOI: 10.3389/fbinf.2024.1417428
Mahnoor N Gondal, Saad Ur Rehman Shah, Arul M Chinnaiyan, Marcin Cieslik
Rapid advancements in high-throughput single-cell RNA-seq (scRNA-seq) technologies and experimental protocols have led to the generation of vast amounts of transcriptomic data that populate several online databases and repositories. Here, we systematically examined large-scale scRNA-seq databases, categorizing them by scope and purpose into general, tissue-specific, disease-specific, cancer-focused, and cell type-focused databases. Next, we discuss the technical and methodological challenges associated with curating large-scale scRNA-seq databases, along with current computational solutions. We argue that understanding scRNA-seq databases, including their limitations and assumptions, is crucial for effectively utilizing these data to make robust discoveries and identify novel biological insights. Such platforms can help bridge the gap between computational and wet lab scientists through the user-friendly web-based interfaces needed to democratize access to single-cell data. These platforms would facilitate interdisciplinary research, enabling researchers from various disciplines to collaborate effectively. This review underscores the importance of leveraging computational approaches to unravel the complexities of single-cell data and offers a promising direction for future research in the field.
Title: A systematic overview of single-cell transcriptomics databases, their use cases, and limitations. Journal: Frontiers in Bioinformatics 4:1417428 (2024-07-08), Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11260681/pdf/
Pub Date: 2024-07-03; eCollection Date: 2024-01-01; DOI: 10.3389/fbinf.2024.1392613
Trudy M Wassenaar, Terry Harville, Jonathan Chastain, Visanu Wanchai, David W Ussery
The major histocompatibility complex (MHC) locus, also known as the Human Leukocyte Antigen (HLA) genes, is located on the short arm of chromosome 6 and contains three regions (Class I, Class II and Class III). This 5 Mbp locus is one of the most variable regions of the human genome, yet it also encodes a set of highly conserved and important proteins related to immunological response. Genetic variations in this region are responsible for more diseases than those in the entire rest of the human genome. However, information on local structural features of the DNA is largely ignored. With recent advances in long-read sequencing technology, it is now becoming possible to sequence the entire 5 Mbp MHC locus, producing complete diploid haplotypes of the whole region. Here, we describe structural maps based on the complete sequences from six different homozygous HLA cell lines. We find long-range structural variability in the different sequences for DNA stacking energy, position preference and curvature, variation in repeats, as well as more local changes in regions forming open chromatin structures, likely to influence gene expression levels. These structural maps can be useful in visualizing large scale structural variation across HLA types, in particular when this can be complemented with epigenetic signals.
Title: DNA structural features and variability of complete MHC locus sequences. Journal: Frontiers in Bioinformatics 4:1392613 (2024-07-03), Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11251971/pdf/
Pub Date: 2024-07-01; eCollection Date: 2024-01-01; DOI: 10.3389/fbinf.2024.1391086
Broňa Brejová, Travis Gagie, Eva Herencsárová, Tomáš Vinař
We generalize the problem of finding maximum-scoring segment sets, previously studied by Csűrös (IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2004, 1, 139-150), from sequences to graphs. Namely, given a vertex-weighted graph G and a non-negative startup penalty c, the goal is to find a set of vertex-disjoint paths in G with maximum total score, where each path's score is the total weight of its vertices minus c. We call this new problem maximum-scoring path sets (MSPS). We present an algorithm that runs in linear time on graphs of constant treewidth. Generalizing from sequences to graphs allows the algorithm to be used on pangenome graphs representing several related genomes, and the problem can be seen as a common abstraction for several biological problems on pangenomes, including searching for CpG islands, ChIP-seq data analysis, analysis of region enrichment for functional elements, and simple chaining problems.
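To make the scoring rule concrete, the sketch below solves only the sequence special case, where the graph is a single path and each chosen "path" is a contiguous segment. It is a standard two-state dynamic program written for illustration, not the authors' treewidth-based algorithm; the function name `max_scoring_segments` and the example weights are assumptions introduced here.

```python
from typing import List, Tuple


def max_scoring_segments(weights: List[float], c: float) -> Tuple[float, List[Tuple[int, int]]]:
    """Sequence special case of MSPS (illustrative sketch, hypothetical helper).

    Picks disjoint segments maximizing sum(segment weight - c).
    Returns (best score, list of inclusive 0-based (start, end) segments).
    """
    if c < 0:
        raise ValueError("startup penalty c must be non-negative")
    n = len(weights)
    neg_inf = float("-inf")
    # best_out[i]: best score for the first i positions, position i-1 not in a segment
    # best_in[i]:  best score for the first i positions, position i-1 inside a segment
    best_out = [0.0] * (n + 1)
    best_in = [neg_inf] * (n + 1)
    starts_here = [False] * (n + 1)  # True if the segment covering i-1 starts at i-1
    for i in range(1, n + 1):
        w = weights[i - 1]
        extend = best_in[i - 1] + w        # continue the current segment
        open_new = best_out[i - 1] - c + w  # pay c to start a new segment
        starts_here[i] = open_new >= extend
        best_in[i] = max(extend, open_new)
        best_out[i] = max(best_out[i - 1], best_in[i - 1])

    # Trace back through the two states to recover the chosen segments.
    segments: List[Tuple[int, int]] = []
    inside = best_in[n] > best_out[n]
    end = None
    for i in range(n, 0, -1):
        if inside:
            if end is None:
                end = i - 1
            if starts_here[i]:
                segments.append((i - 1, end))
                end = None
                inside = False
        else:
            inside = best_in[i - 1] > best_out[i - 1]
    segments.reverse()
    return max(best_in[n], best_out[n]), segments


if __name__ == "__main__":
    # Segment 0..2 scores (2 - 1 + 3) - 2 = 2 and segment 4..4 scores 4 - 2 = 2.
    print(max_scoring_segments([2, -1, 3, -5, 4], 2))  # (4.0, [(0, 2), (4, 4)])
```

Each position is visited once, so the sketch runs in linear time. Because c is non-negative, opening a new segment directly after another is never better than extending it, which is why two states per position suffice here.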
Maximum-scoring path sets on pangenome graphs of constant treewidth. Broňa Brejová, Travis Gagie, Eva Herencsárová, Tomáš Vinař. Frontiers in bioinformatics, vol. 4, article 1391086. Published 2024-07-01. DOI: 10.3389/fbinf.2024.1391086. Impact factor: 2.8. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11246863/pdf/
Pub Date: 2024-06-24; eCollection Date: 2024-01-01; DOI: 10.3389/fbinf.2024.1381540
Alexander G Lucaci, William E Brew, Jason Lamanna, Avery Selberg, Vincenzo Carnevale, Anna R Moore, Sergei L Kosakovsky Pond
Rad And Gem-Like GTP-Binding Protein 2 (Rem2), a member of the RGK family of Ras-like GTPases, is implicated in Huntington's disease and Long QT Syndrome and is highly expressed in the brain and endocrine cells. We examine the evolutionary history of Rem2 identified in various mammalian species, focusing on the role of purifying selection and coevolution in shaping its sequence and protein structural constraints. Our analysis of Rem2 sequences across 175 mammalian species found evidence for strong purifying selection in 70% of non-invariant codon sites, which is characteristic of essential proteins that play critical roles in biological processes and is consistent with Rem2's role in the regulation of neuronal development and function. We inferred epistatic effects in 50 pairs of codon sites in Rem2, some of which are predicted to have deleterious effects on human health. Additionally, we reconstructed the ancestral evolutionary history of mammalian Rem2 using protein structure prediction of extinct and extant sequences, which revealed how substitutions that change the gene sequence of Rem2 can impact protein structure in variable regions while maintaining core functional mechanisms. By understanding the selective pressures and the protein and gene interactions that have shaped the sequence and structure of the Rem2 protein, we gain a stronger understanding of its biological and functional constraints.
The evolution of mammalian Rem2: unraveling the impact of purifying selection and coevolution on protein function, and implications for human disorders. Alexander G Lucaci, William E Brew, Jason Lamanna, Avery Selberg, Vincenzo Carnevale, Anna R Moore, Sergei L Kosakovsky Pond. Frontiers in bioinformatics, vol. 4, article 1381540. Published 2024-06-24. DOI: 10.3389/fbinf.2024.1381540. Impact factor: 2.8. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11228553/pdf/
Bioinformatics, the interdisciplinary field that combines biology, computer science, and data analysis, plays a pivotal role in advancing our understanding of life sciences. In the African context, where the diversity of biological resources and healthcare challenges is substantial, fostering bioinformatics literacy and proficiency among students is important. This perspective provides an overview of the state of bioinformatics literacy among African students, highlighting the significance of this critical educational gap, the challenges it poses, and potential solutions. It proposes various strategies to enhance bioinformatics literacy among African students. These include expanding educational resources, fostering collaboration between institutions, and engaging students in research projects. By addressing the current challenges and implementing comprehensive strategies, African students can harness the power of bioinformatics to contribute to innovative solutions in healthcare, agriculture, and biodiversity conservation, ultimately advancing the continent's scientific capabilities and improving the quality of life for its people. In conclusion, promoting bioinformatics literacy among African students is imperative for the continent's scientific development and for advancing the frontiers of biological research.
Bioinformatics proficiency among African students. Ashraf Akintayo Akintola, Abdullahi Tunde Aborode, Muhammed Taofiq Hamza, Augustine Amakiri, Benjamin Moore, Suliat Abdulai, Oluyinka Ajibola Iyiola, Lateef Adegboyega Sulaimon, Effiong Effiong, Adedeji Ogunyemi, Boluwatife Dosunmu, Abdulkadir Yusif Maigoro, Opeyemi Lawal, Kayode Raheem, Ui Wook Hwang. Frontiers in bioinformatics, vol. 4, article 1328714. Published 2024-06-20. DOI: 10.3389/fbinf.2024.1328714. Impact factor: 2.8. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11222312/pdf/