Klara Elisabeth Burger, Solveig Klepper, Ulrike von Luxburg, Franz Baumdicker
Understanding the genetic ancestry of populations is central to numerous scientific and societal fields. It contributes to a better understanding of human evolutionary history, advances personalized medicine, aids in forensic identification, and allows individuals to connect to their genealogical roots. Existing methods, such as ADMIXTURE, have significantly improved our ability to infer ancestries. However, these methods typically work with a fixed number of independent ancestral populations. As a result, they provide insight into genetic admixture, but do not include a hierarchical interpretation. In particular, the intricate ancestral population structures remain difficult to unravel. Alternative methods with a consistent inheritance structure, such as hierarchical clustering, may offer benefits in terms of interpreting the inferred ancestries. Here, we present tangleGen, a soft clustering tool that transfers the hierarchical machine learning framework Tangles, which leverages graph theoretical concepts, to the field of population genetics. The hierarchical perspective of tangleGen on the composition and structure of populations improves the interpretability of the inferred ancestral relationships. Moreover, tangleGen adds a new layer of explainability, as it allows identifying the SNPs that are responsible for the clustering structure. We demonstrate the capabilities and benefits of tangleGen for the inference of ancestral relationships, using both simulated data and data from the 1000 Genomes Project.
{"title":"Inferring ancestry with the hierarchical soft clustering approach tangleGen.","authors":"Klara Elisabeth Burger, Solveig Klepper, Ulrike von Luxburg, Franz Baumdicker","doi":"10.1101/gr.279399.124","DOIUrl":"https://doi.org/10.1101/gr.279399.124","journal":{"name":"Genome research"},"PeriodicalIF":6.2,"publicationDate":"2024-10-21","publicationTypes":"Journal Article","isOpenAccess":false,"RegionNum":2,"RegionCategory":"Biology"}
Artem Mikelov, George Nefedev, Aleksandr Tashkeev, Oscar L Rodriguez, Diego A Ortmans, Valeriia Skatova, Mark Izraelson, Alexey N Davydov, Stanislav Poslavsky, Souad Rahmouni, Corey T Watson, Dmitriy M Chudakov, Scott D Boyd, Dmitry A Bolotin
Allelic variability in the adaptive immune receptor loci, which harbor the gene segments that encode B cell and T cell receptors (BCR/TCR), is of critical importance for immune responses to pathogens and vaccines. Adaptive immune receptor repertoire sequencing (AIRR-seq) has become widespread in immunology research, making it the most readily available source of information about allelic diversity in immunoglobulin (IG) and T cell receptor (TR) loci. Here we present a novel algorithm for highly sensitive and specific variable (V) and joining (J) gene allele inference, allowing reconstruction of individual high-quality gene segment libraries. The approach can be applied to infer allelic variants from peripheral blood lymphocyte BCR and TCR repertoire sequencing data, including hypermutated isotype-switched BCR sequences, thus allowing high-throughput novel allele discovery from a wide variety of existing datasets. The algorithm is part of the MiXCR software. We demonstrate the accuracy of this approach using AIRR-seq paired with long-read genomic sequencing data, comparing it to a widely used algorithm, TIgGER. We applied the algorithm to a large set of IG heavy chain (IGH) AIRR-seq data from 450 donors of ancestrally diverse population groups, and to the largest reported full-length TCR alpha and beta chain (TRA; TRB) AIRR-seq dataset, representing 134 individuals. This allowed us to assess the genetic diversity within the IGH, TRA and TRB loci in different populations and to establish a freely accessible online database of V and J gene alleles inferred from AIRR-seq data, together with their population frequencies.
{"title":"Ultrasensitive allele inference from immune repertoire sequencing data with MiXCR.","authors":"Artem Mikelov, George Nefedev, Aleksandr Tashkeev, Oscar L Rodriguez, Diego A Ortmans, Valeriia Skatova, Mark Izraelson, Alexey N Davydov, Stanislav Poslavsky, Souad Rahmouni, Corey T Watson, Dmitriy M Chudakov, Scott D Boyd, Dmitry A Bolotin","doi":"10.1101/gr.278775.123","DOIUrl":"https://doi.org/10.1101/gr.278775.123","journal":{"name":"Genome research"},"PeriodicalIF":6.2,"publicationDate":"2024-10-21","publicationTypes":"Journal Article","isOpenAccess":false,"RegionNum":2,"RegionCategory":"Biology"}
Song Zhang, Chao Wang, Shenghua Qin, Choulin Chen, Yongzhou Bao, Yuanyuan Zhang, Lingna Xu, Qingyou Liu, Yunxiang Zhao, Kui Li, Zhonglin Tang, Yuwen Liu
Super-enhancers (SEs) govern the expression of genes defining cell identity. However, the dynamic landscape of SEs and their critical constituent enhancers involved in skeletal muscle development remains unclear. In this study, using pig as a model, we employed CUT&Tag to profile the enhancer-associated histone modification marker H3K27ac in skeletal muscle across two prenatal and three postnatal stages and investigated how SEs influence skeletal muscle development. We identified three SE families with distinct temporal dynamics: continuous (Con, 397), transient (TS, 434), and de novo (DN, 756). These SE families are associated with different temporal gene expression trajectories, biological functions, and DNA methylation levels. Notably, several lines of evidence suggest a potential prominent role of Con SEs in regulating porcine muscle development and meat traits. To pinpoint key cis-regulatory units in Con SEs, we developed an integrative approach that leverages information from eRNA annotation, GWAS signals and high-throughput capture STARR-seq experiments. Within Con SEs, we identified 20 candidate critical enhancers with meat and carcass-associated DNA variations that affect enhancer activity and inferred their upstream TFs and downstream target genes. As a proof of concept, we experimentally validated the role of one such enhancer and its potential target gene during myogenesis. Our findings reveal the dynamic regulatory features of SEs in skeletal muscle development and provide a general integrative framework for identifying critical enhancers underlying the formation of complex traits.
{"title":"Analyzing super-enhancer temporal dynamics reveals potential critical enhancers and their gene regulatory networks underlying skeletal muscle development.","authors":"Song Zhang, Chao Wang, Shenghua Qin, Choulin Chen, Yongzhou Bao, Yuanyuan Zhang, Lingna Xu, Qingyou Liu, Yunxiang Zhao, Kui Li, Zhonglin Tang, Yuwen Liu","doi":"10.1101/gr.278344.123","DOIUrl":"https://doi.org/10.1101/gr.278344.123","journal":{"name":"Genome research"},"PeriodicalIF":6.2,"publicationDate":"2024-10-21","publicationTypes":"Journal Article","isOpenAccess":false,"RegionNum":2,"RegionCategory":"Biology"}
Yueqiang Song, Fuyuan Li, Shangzi Wang, Yuntong Wang, Cong Lai, Lian Chen, Ning Jiang, Jin Li, Xingdong Chen, Swneke D. Bailey, Xiaoyang Zhang
As a major type of structural variant, tandem duplication plays a critical role in tumorigenesis by increasing oncogene dosage. Recent work has revealed that noncoding enhancers are also affected by duplications, leading to the activation of oncogenes that are inside or outside of the duplicated regions. However, the prevalence of enhancer duplications and the identity of their target genes remain largely unknown in the cancer genome. Here, by analyzing whole-genome sequencing data in a non-gene-centric manner, we identify 881 duplication hotspots in 13 major cancer types, most of which do not contain protein-coding genes. We show that the hotspots are enriched with distal enhancer elements and are highly lineage-specific. We develop a HiChIP-based methodology that navigates enhancer–promoter contact maps to prioritize the target genes for the duplication hotspots harboring enhancer elements. The methodology identifies many novel enhancer duplication events activating oncogenes such as ESR1, FOXA1, GATA3, GATA6, TP63, and VEGFA, as well as potentially novel oncogenes such as GRHL2, IRF2BP2, and CREB3L1. In particular, we identify a duplication hotspot on Chromosome 10p15 harboring a cluster of enhancers, which skips over two genes, through a long-range chromatin interaction, to activate an oncogenic isoform of the NET1 gene to promote migration of gastric cancer cells. Focusing on tandem duplications, our study substantially extends the catalog of noncoding driver alterations in multiple cancer types, revealing attractive targets for functional characterization and therapeutic intervention.
{"title":"Chromatin interaction maps identify oncogenic targets of enhancer duplications in cancer","authors":"Yueqiang Song, Fuyuan Li, Shangzi Wang, Yuntong Wang, Cong Lai, Lian Chen, Ning Jiang, Jin Li, Xingdong Chen, Swneke D. Bailey, Xiaoyang Zhang","doi":"10.1101/gr.278418.123","DOIUrl":"https://doi.org/10.1101/gr.278418.123","journal":{"name":"Genome research"},"PeriodicalIF":7.0,"publicationDate":"2024-10-18","publicationTypes":"Journal Article","isOpenAccess":false,"RegionNum":2,"RegionCategory":"Biology"}
Wankun Deng, Citu Citu, Andi Liu, Zhongming Zhao
Retrotransposable elements (RTEs) are common mobile genetic elements comprising ∼42% of the human genome. RTEs play critical roles in gene regulation and function, but how they are specifically involved in complex diseases is largely unknown. Here, we investigate the cellular heterogeneity of RTEs using 12 single-cell transcriptome profiles covering three neurodegenerative diseases: Alzheimer's disease (AD), Parkinson's disease, and multiple sclerosis. We identify cell type marker RTEs in neurons, astrocytes, oligodendrocytes, and oligodendrocyte precursor cells that are related to these diseases. The differential expression analysis reveals the landscape of dysregulated RTE expression, especially L1s, in excitatory neurons of multiple neurodegenerative diseases. Machine learning algorithms for predicting cell disease stage using a combination of RTE and gene expression features suggest dynamic regulation of RTEs in AD. Furthermore, we construct a single-cell atlas of retrotransposable elements in neurodegenerative disease (scARE) using these data sets and features. scARE has six feature analysis modules to explore RTE dynamics in a user-defined condition. To our knowledge, scARE represents the first systematic investigation of RTE dynamics at the single-cell level within the context of neurodegenerative diseases.
{"title":"Dynamic dysregulation of retrotransposons in neurodegenerative diseases at the single-cell level","authors":"Wankun Deng, Citu Citu, Andi Liu, Zhongming Zhao","doi":"10.1101/gr.279363.124","DOIUrl":"https://doi.org/10.1101/gr.279363.124","journal":{"name":"Genome research"},"PeriodicalIF":7.0,"publicationDate":"2024-10-18","publicationTypes":"Journal Article","isOpenAccess":false,"RegionNum":2,"RegionCategory":"Biology"}
Basanta Bista, Laura González-Rodelas, Lucía Álvarez-González, Zhi-qiang Wu, Eugenia E. Montiel, Ling Sze Lee, Daleen B. Badenhorst, Srihari Radhakrishnan, Robert Literman, Beatriz Navarro-Dominguez, John B. Iverson, Simon Orozco-Arias, Josefa González, Aurora Ruiz-Herrera, Nicole Valenzuela
Understanding the evolution of chromatin conformation among species is fundamental to elucidating the architecture and plasticity of genomes. Nonrandom interactions of linearly distant loci regulate gene function in species-specific patterns, affecting genome function, evolution, and, ultimately, speciation. Yet, data from nonmodel organisms are scarce. To capture the macroevolutionary diversity of vertebrate chromatin conformation, here we generate de novo genome assemblies for two cryptodiran (hidden-neck) turtles via Illumina sequencing, chromosome conformation capture, and RNA-seq: Apalone spinifera (ZZ/ZW, 2n = 66) and Staurotypus triporcatus (XX/XY, 2n = 54). We detected differences in the three-dimensional (3D) chromatin structure in turtles compared to other amniotes beyond the fusion/fission events detected in the linear genomes. Namely, whole-genome comparisons revealed distinct trends of chromosome rearrangements in turtles: (1) a low rate of genome reshuffling in Apalone (Trionychidae), whose karyotype is highly conserved when compared to chicken (likely ancestral for turtles), and (2) a moderate rate of fusions/fissions in Staurotypus (Kinosternidae) and Trachemys scripta (Emydidae). Furthermore, we identified a chromosome folding pattern that enables “centromere–telomere interactions” previously undetected in turtles. The combined turtle pattern of “centromere–telomere interactions” (discovered here) plus “centromere clustering” (previously reported in sauropsids) is novel for amniotes and counters previous hypotheses about amniote 3D chromatin structure. We hypothesize that the divergent pattern found in turtles originated from an amniote ancestral state defined by a nuclear configuration with extensive associations among microchromosomes that were preserved upon the reshuffling of the linear genome.
{"title":"De novo genome assemblies of two cryptodiran turtles with ZZ/ZW and XX/XY sex chromosomes provide insights into patterns of genome reshuffling and uncover novel 3D genome folding in amniotes","authors":"Basanta Bista, Laura González-Rodelas, Lucía Álvarez-González, Zhi-qiang Wu, Eugenia E. Montiel, Ling Sze Lee, Daleen B. Badenhorst, Srihari Radhakrishnan, Robert Literman, Beatriz Navarro-Dominguez, John B. Iverson, Simon Orozco-Arias, Josefa González, Aurora Ruiz-Herrera, Nicole Valenzuela","doi":"10.1101/gr.279443.124","DOIUrl":"https://doi.org/10.1101/gr.279443.124","journal":{"name":"Genome research"},"PeriodicalIF":7.0,"publicationDate":"2024-10-16","publicationTypes":"Journal Article","isOpenAccess":false,"RegionNum":2,"RegionCategory":"Biology"}
Guy Kelman, Roei Zucker, Nadav Brandes, Michal Linial
PWAS (proteome-wide association study) is an innovative genetic association approach that complements widely used methods like GWAS (genome-wide association study). The PWAS approach involves three consecutive phases. First, machine learning modeling and probabilistic considerations quantify the impact of genetic variants on protein-coding genes’ biochemical functions. Second, for each individual, aggregating the variants per gene determines a gene-damaging score. Finally, standard statistical tests are applied in the case-control setting to yield statistically significant genes per phenotype. The PWAS Hub offers a user-friendly interface for an in-depth exploration of gene–disease associations from the UK Biobank (UKB). Results from PWAS cover 99 common diseases and conditions, each with over 10,000 diagnosed individuals. Users can explore genes associated with these diseases, with separate analyses conducted for males and females. For each phenotype, the analyses account for sex-based genetic effects, inheritance modes (dominant and recessive), and the pleiotropic nature of associated genes. The PWAS Hub showcases its usefulness for asthma by navigating through proteomic-genetic analyses. Inspecting PWAS asthma-listed genes (a total of 27) provides insights into the underlying cellular and molecular mechanisms. Comparison of PWAS-statistically significant genes for common diseases to the Open Targets benchmark shows partial but significant overlap in gene associations for most phenotypes. Graphical tools facilitate comparing genetic effects between PWAS and coding GWAS results, aiding in understanding the sex-specific genetic impact on common diseases. This adaptable platform is attractive to clinicians, researchers, and individuals interested in delving into gene–disease associations and sex-specific genetic effects.
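The three phases described in the abstract can be illustrated with a small Python sketch. This is not the PWAS implementation: the function names, the convention that an effect score of 1.0 means "function intact", the multiplicative aggregation, and the permutation test are all assumptions chosen only to make the variant-to-gene-to-phenotype flow concrete.

```python
import random
from statistics import mean


def gene_damage_score(variant_effects, genotypes):
    """Aggregate per-variant effect scores into one per-gene score for one person.

    variant_effects: scores in [0, 1]; 1.0 means the variant leaves protein
        function intact (hypothetical convention for this sketch).
    genotypes: number of alternate alleles carried (0, 1, or 2) per variant.
    """
    score = 1.0
    for effect, dosage in zip(variant_effects, genotypes):
        # Each carried copy of a variant scales the remaining function.
        score *= effect ** dosage
    return score  # 1.0 = undamaged; values near 0.0 = strongly damaged


def permutation_p_value(case_scores, control_scores, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference in mean gene score."""
    rng = random.Random(seed)
    observed = abs(mean(case_scores) - mean(control_scores))
    pooled = list(case_scores) + list(control_scores)
    n_cases = len(case_scores)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_cases]) - mean(pooled[n_cases:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Toy data: cases carry more damaging variants in this hypothetical gene.
cases = [0.10, 0.20, 0.15, 0.10, 0.20, 0.12]
controls = [0.90, 0.95, 1.00, 0.85, 0.90, 0.97]
p_value = permutation_p_value(cases, controls)
```

A gene-level score of this kind lets the final test ask whether a gene's function, rather than any single variant, differs between cases and controls.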
{"title":"PWAS Hub for exploring gene-based associations of common complex diseases","authors":"Guy Kelman, Roei Zucker, Nadav Brandes, Michal Linial","doi":"10.1101/gr.278916.123","DOIUrl":"https://doi.org/10.1101/gr.278916.123","journal":{"name":"Genome research"},"PeriodicalIF":7.0,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","isOpenAccess":false,"RegionNum":2,"RegionCategory":"Biology"}
Romain Derelle, Johanna von Wachsmann, Tommi Mäklin, Joel Hellewell, Timothy Russell, Ajit Lalvani, Leonid Chindelevitch, Nicholas J. Croucher, Simon R. Harris, John A. Lees
Sequence variation observed in populations of pathogens can be used for important public health and evolutionary genomic analyses, especially outbreak analysis and transmission reconstruction. Identifying this variation is typically achieved by aligning sequence reads to a reference genome, but this approach is susceptible to reference biases and requires careful filtering of called genotypes. There is a need for tools that can process this growing volume of bacterial genome data, providing rapid results, but that remain simple so they can be used without highly trained bioinformaticians, expensive data analysis, and long-term storage and processing of large files. Here we describe split k-mer analysis (SKA2), a method that supports both reference-free and reference-based mapping to quickly and accurately genotype populations of bacteria using sequencing reads or genome assemblies. SKA2 is highly accurate for closely related samples, and in outbreak simulations, we show superior variant recall compared with reference-based methods, with no false positives. SKA2 can also accurately map variants to a reference and be used with recombination detection methods to rapidly reconstruct vertical evolutionary history. SKA2 is many times faster than comparable methods and can be used to add new genomes to an existing call set, allowing sequential use without the need to reanalyze entire collections. With an inherent absence of reference bias, high accuracy, and a robust implementation, SKA2 has the potential to become the tool of choice for genotyping bacteria. SKA2 is implemented in Rust and is freely available as open-source software.
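The idea of a split k-mer can be sketched in a few lines of Python: a pair of flanking sequences is used as a key and the base between them as the value, so two samples that share a key but differ in the middle base differ by a SNP, with no reference genome involved. This is a toy illustration under stated assumptions, not SKA2's implementation; the function names and the flank length are hypothetical, and the sketch ignores details a production tool would need, such as reverse-complement strands and compact encoding.

```python
def split_kmers(seq, flank=3):
    """Map each (left flank, right flank) pair to the base between them."""
    table = {}
    width = 2 * flank + 1
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        key = (window[:flank], window[flank + 1:])
        middle = window[flank]
        # A flank pair seen with two different middle bases is ambiguous;
        # mark it with None so it is skipped during comparison.
        if key in table and table[key] != middle:
            table[key] = None
        else:
            table[key] = middle
    return table


def compare_samples(sample_a, sample_b, flank=3):
    """Return (flank pair, base in A, base in B) for shared keys that differ."""
    a = split_kmers(sample_a, flank)
    b = split_kmers(sample_b, flank)
    variants = []
    for key in a.keys() & b.keys():
        if a[key] is not None and b[key] is not None and a[key] != b[key]:
            variants.append((key, a[key], b[key]))
    return sorted(variants)


# Two toy sequences that differ at a single position (A versus T).
snps = compare_samples("GATTACAGGCTTA", "GATTACTGGCTTA")
# snps == [(("TAC", "GGC"), "A", "T")]
```

Because each sample is reduced to its own table of keys, a new sample can be compared against existing tables without reprocessing the others, which is in the spirit of the sequential use the abstract describes.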
{"title":"Seamless, rapid, and accurate analyses of outbreak genomic data using split k-mer analysis","authors":"Romain Derelle, Johanna von Wachsmann, Tommi Mäklin, Joel Hellewell, Timothy Russell, Ajit Lalvani, Leonid Chindelevitch, Nicholas J. Croucher, Simon R. Harris, John A. Lees","doi":"10.1101/gr.279449.124","DOIUrl":"https://doi.org/10.1101/gr.279449.124","url":null,"PeriodicalId":12678,"journal":{"name":"Genome research","volume":"78 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142439676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
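The SKA2 abstract above does not spell out what a split k-mer is. The sketch below assumes the usual reading of the term: two flanking k-mers around a single variable middle base, with the flanks used as a lookup key. It is a toy illustration only, not SKA2's implementation; the function names (`split_kmers`, `call_snps`) and the example sequences are hypothetical, and strand handling (reverse complements) and sequencing-read filtering are deliberately left out.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (left flank, right flank)


def split_kmers(seq: str, flank: int) -> Dict[Key, str]:
    """Map each (left flank, right flank) pair to its middle base.

    Keys seen with more than one middle base inside the same sequence
    are dropped, because they cannot be genotyped unambiguously.
    """
    table: Dict[Key, str] = {}
    ambiguous = set()
    width = 2 * flank + 1
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        key = (window[:flank], window[flank + 1:])
        middle = window[flank]
        if key in table and table[key] != middle:
            ambiguous.add(key)
        table[key] = middle
    for key in ambiguous:
        del table[key]
    return table


def call_snps(a: str, b: str, flank: int) -> List[Tuple[str, str, str, str]]:
    """Return (left flank, base in a, base in b, right flank) for every
    split k-mer shared by both sequences whose middle base differs."""
    ta, tb = split_kmers(a, flank), split_kmers(b, flank)
    shared = ta.keys() & tb.keys()
    return sorted(
        (left, ta[(left, right)], tb[(left, right)], right)
        for (left, right) in shared
        if ta[(left, right)] != tb[(left, right)]
    )


# Two hypothetical samples differing by one substitution (C -> T).
sample_a = "ACGTACGGTCA"
sample_b = "ACGTATGGTCA"
print(call_snps(sample_a, sample_b, flank=3))
```

No reference genome appears anywhere in the comparison: the two samples are matched to each other through their shared flanks, which is the sense in which such an approach can be reference-free.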
Automated telomere-to-telomere (T2T) de novo assembly of diploid and polyploid genomes remains a formidable task. A string graph is a commonly used assembly graph representation in the assembly algorithms. The string graph formulation employs graph simplification heuristics, which drastically reduce the count of vertices and edges. One of these heuristics involves removing the reads contained in longer reads. In practice, this heuristic occasionally introduces gaps in the assembly by removing all reads that cover one or more genome intervals. The factors contributing to such gaps remain poorly understood. In this work, we mathematically derived the frequency of observing a gap near a germline and a somatic heterozygous variant locus. Our analysis shows that (i) an assembly gap due to contained read deletion is an order of magnitude more frequent in Oxford Nanopore reads than PacBio HiFi reads due to differences in their read-length distributions, and (ii) this frequency decreases with an increase in the sequencing depth. Drawing cues from these observations, we addressed the weakness of the string graph formulation by developing the RAFT assembly algorithm. RAFT addresses the issue of contained reads by fragmenting reads and producing a more uniform read-length distribution. The algorithm retains spanned repeats in the reads during the fragmentation. We empirically demonstrate that RAFT significantly reduces the number of gaps using simulated datasets. Using real Oxford Nanopore and PacBio HiFi datasets of the HG002 human genome, we achieved a twofold increase in the contig NG50 and the number of haplotype-resolved T2T contigs compared to Hifiasm.
{"title":"Telomere-to-telomere assembly by preserving contained reads","authors":"Sudhanva Shyam Kamath, Mehak Bindra, Debnath Pal, Chirag Jain","doi":"10.1101/gr.279311.124","DOIUrl":"https://doi.org/10.1101/gr.279311.124","url":null,"PeriodicalId":12678,"journal":{"name":"Genome research","volume":"10 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142439687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
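The gap mechanism described in the contained-reads abstract above can be illustrated with a toy model. The sketch assumes reads are error-free intervals on a diploid genome and that a shorter read "looks contained" in a longer one when its interval lies inside the longer read and their sequences agree, which here means either the same haplotype or no heterozygous site under the shorter read. All names, coordinates, and the `Read` type are hypothetical; this shows only the failure mode, not the RAFT algorithm.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Read(NamedTuple):
    name: str
    hap: str    # haplotype of origin, "A" or "B"
    start: int  # inclusive
    end: int    # exclusive


def looks_contained(short: Read, long: Read, het_sites: Sequence[int]) -> bool:
    """True if `short` is indistinguishable from a substring of `long`."""
    if (long.end - long.start) <= (short.end - short.start):
        return False  # only strictly longer reads can contain another read
    if not (long.start <= short.start and short.end <= long.end):
        return False
    covers_het = any(short.start <= s < short.end for s in het_sites)
    return short.hap == long.hap or not covers_het


def remove_contained(reads: List[Read], het_sites: Sequence[int]) -> List[Read]:
    """Drop every read that looks contained in some longer read."""
    return [
        r for r in reads
        if not any(looks_contained(r, other, het_sites) for other in reads)
    ]


def gaps(reads: List[Read], hap: str, lo: int, hi: int) -> List[Tuple[int, int]]:
    """Intervals of [lo, hi) not covered by any read from haplotype `hap`."""
    out, pos = [], lo
    for r in sorted((r for r in reads if r.hap == hap), key=lambda r: r.start):
        if r.start > pos:
            out.append((pos, min(r.start, hi)))
        pos = max(pos, r.end)
    if pos < hi:
        out.append((pos, hi))
    return out


het_sites = [50, 120]
reads = [
    Read("A1", "A", 0, 100),
    Read("A2", "A", 70, 160),
    Read("B1", "B", 20, 60),   # spans het site 50, so it stays distinct
    Read("B2", "B", 55, 90),   # no het site: looks contained in A1
    Read("B3", "B", 85, 150),  # spans het site 120, so it stays distinct
]

kept = remove_contained(reads, het_sites)
print([r.name for r in kept])    # B2 is discarded
print(gaps(reads, "B", 20, 150)) # haplotype B before removal
print(gaps(kept, "B", 20, 150))  # haplotype B after removal
```

In this example the only read covering positions 60 to 85 on haplotype B sits in a homozygous stretch, so it is absorbed by a longer read from haplotype A and the haplotype-resolved path for B is broken.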
High-throughput sequencing (HTS) technologies have been instrumental in investigating biological questions at the bulk and single-cell levels. Comparative analysis of two HTS data sets often relies on testing the statistical significance for the difference of two negative binomial distributions (DOTNB). Although negative binomial distributions are well studied, the theoretical results for DOTNB remain largely unexplored. Here, we derive basic analytical results for DOTNB and examine its asymptotic properties. As a state-of-the-art application of DOTNB, we introduce DEGage, a computational method for detecting differentially expressed genes (DEGs) in scRNA-seq data. DEGage calculates the mean of the sample-wise differences of gene expression levels as the test statistic and determines significant differential expression by computing the P-value with DOTNB. Extensive validation using simulated and real scRNA-seq data sets demonstrates that DEGage outperforms five popular DEG analysis tools: DEGseq2, DEsingle, edgeR, Monocle3, and scDD. DEGage is robust against high dropout levels and exhibits superior sensitivity when applied to balanced and imbalanced data sets, even with small sample sizes. We utilize DEGage to analyze prostate cancer scRNA-seq data sets and identify marker genes for 17 cell types. Furthermore, we apply DEGage to scRNA-seq data sets of mouse neurons with and without fear memory and reveal eight potential memory-related genes overlooked in previous analyses. The theoretical results and supporting software for DOTNB can be widely applied to comparative analyses of dispersed count data in HTS and broad research questions.
{"title":"Theoretical framework for the difference of two negative binomial distributions and its application in comparative analysis of sequencing data","authors":"Alicia Petrany, Ruoyu Chen, Shaoqiang Zhang, Yong Chen","doi":"10.1101/gr.278843.123","DOIUrl":"https://doi.org/10.1101/gr.278843.123","url":null,"PeriodicalId":12678,"journal":{"name":"Genome research","volume":"57 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142439679","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
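As a rough numerical illustration of the DOTNB quantity named in the abstract above, the probability mass function of Z = X - Y for two independent negative binomial variables can be evaluated by a truncated convolution sum. This is a minimal sketch under stated assumptions: the (r, p) parametrization (number of failures before the r-th success, success probability p), the truncation length `terms`, and the function names are choices made here, and the code reproduces neither the paper's analytical results nor the DEGage test statistic.

```python
import math


def nb_pmf(k: int, r: float, p: float) -> float:
    """P(X = k) for X ~ NB(r, p): k failures before the r-th success."""
    if k < 0:
        return 0.0
    log_pmf = (
        math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
        + r * math.log(p) + k * math.log1p(-p)
    )
    return math.exp(log_pmf)


def dotnb_pmf(z: int, r1: float, p1: float, r2: float, p2: float,
              terms: int = 500) -> float:
    """P(X - Y = z) for independent X ~ NB(r1, p1) and Y ~ NB(r2, p2).

    Uses P(Z = z) = sum over y of P(X = y + z) * P(Y = y), truncated
    after `terms` summands; the truncation is an approximation.
    """
    start = max(0, -z)
    return sum(
        nb_pmf(y + z, r1, p1) * nb_pmf(y, r2, p2)
        for y in range(start, start + terms)
    )


# Hypothetical parameters: E[X] = 3 * 0.6 / 0.4 = 4.5, E[Y] = 5 * 0.5 / 0.5 = 5.
support = range(-100, 101)
pmf = {z: dotnb_pmf(z, 3, 0.4, 5, 0.5) for z in support}
total = sum(pmf.values())
mean = sum(z * q for z, q in pmf.items())
print(round(total, 6), round(mean, 6))
```

The printed mean should match E[X] - E[Y], which is a quick sanity check that the truncation range is wide enough for the chosen parameters; heavier-tailed settings (small p) would need a larger `terms` and a wider support.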