Modern pangenome graphs are built using haplotype-resolved genome assemblies. When mapping reads to a pangenome graph, prioritizing alignments that are consistent with the known haplotypes improves genotyping accuracy. However, the existing rigorous formulations for colinear chaining and alignment problems do not consider the haplotype paths in a pangenome graph. This often leads to spurious read alignments to those paths that are unlikely recombinations of the known haplotypes. In this paper, we develop novel formulations and algorithms for sequence-to-graph alignment and chaining problems. Inspired by the genotype imputation models, we assume that a query sequence is an imperfect mosaic of reference haplotypes. Accordingly, we introduce a recombination penalty in the scoring functions for each haplotype switch. First, we solve haplotype-aware sequence-to-graph alignment in [Formula: see text] time, where Q is the query sequence, E is the set of edges, and H is the set of haplotypes represented in the graph. To complement our solution, we prove that an algorithm significantly faster than [Formula: see text] is impossible under the strong exponential time hypothesis (SETH). Second, we propose a haplotype-aware chaining algorithm that runs in [Formula: see text] time after graph preprocessing, where N is the count of input anchors. We then establish that a chaining algorithm significantly faster than [Formula: see text] is impossible under SETH. As a proof-of-concept, we implemented our chaining algorithm in the Minichain aligner. By aligning sequences sampled from the human major histocompatibility complex (MHC) to a pangenome graph of 60 MHC haplotypes, we demonstrate that our algorithm achieves better consistency with ground-truth recombinations compared with a haplotype-agnostic algorithm.
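The recombination penalty described above can be illustrated with a deliberately simplified sketch. It is not the paper's graph algorithm: it assumes the haplotypes are plain strings of the same length as the query, allows only mismatches (no insertions or deletions), and charges a fixed, hypothetical `switch_penalty` each time the mosaic jumps from one haplotype to another. The function name and all parameters are illustrative assumptions.

```python
def mosaic_cost(query, haplotypes, switch_penalty, mismatch_cost=1):
    """Minimum cost of explaining `query` as a mosaic of `haplotypes`.

    Toy model (an assumption of this sketch, not the paper's formulation):
    no graph, no indels, every haplotype has the same length as the query.
    """
    if any(len(h) != len(query) for h in haplotypes):
        raise ValueError("toy model: every haplotype must match the query length")

    # cost[h] = best cost of the query prefix seen so far, ending on haplotype h.
    # Starting on any haplotype is free.
    cost = [0] * len(haplotypes)
    for i, base in enumerate(query):
        best_prev = min(cost)  # cheapest haplotype to switch away from
        cost = [
            # either stay on haplotype h, or pay the penalty to switch onto it
            min(cost[h], best_prev + switch_penalty)
            + (0 if hap[i] == base else mismatch_cost)
            for h, hap in enumerate(haplotypes)
        ]
    return min(cost)


haps = ["AAAAAA", "CCCCCC"]
print(mosaic_cost("AAACCC", haps, switch_penalty=2))  # one switch is cheaper
print(mosaic_cost("AAACCC", haps, switch_penalty=5))  # mismatches are cheaper
```

With a low penalty the cheapest explanation switches haplotypes once; with a high penalty it stays on one haplotype and pays for the mismatches, which is the trade-off the scoring functions in the abstract are designed to capture.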
Haplotype-aware sequence alignment to pangenome graphs. Ghanshyam Chandra, Daniel Gibney, Chirag Jain. Genome Research, published 2024-10-11. DOI: 10.1101/gr.279143.124. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11529843/pdf/
Daniel Sens, Liubov Shilova, Ludwig Gräf, Maria Grebenshchikova, Bjoern M. Eskofier, Francesco Paolo Casale
Accurate predictive models of future disease onset are crucial for effective preventive healthcare, yet longitudinal data sets linking early risk factors to subsequent health outcomes are limited. To overcome this challenge, we introduce a novel framework, Predictive Risk modeling using Mendelian Randomization (PRiMeR), which utilizes genetic effects as supervisory signals to learn disease risk predictors without relying on longitudinal data. To do so, PRiMeR leverages risk factors and genetic data from a healthy cohort, along with results from genome-wide association studies of diseases of interest. After training, the learned predictor can be used to assess risk for new patients solely based on risk factors. We validate PRiMeR through comprehensive simulations and in future type 2 diabetes predictions in UK Biobank participants without diabetes, using follow-up onset labels for validation. Moreover, we apply PRiMeR to predict future Alzheimer's disease onset from brain imaging biomarkers and future Parkinson's disease onset from accelerometer-derived traits. Overall, with PRiMeR we offer a new perspective in predictive modeling, showing it is possible to learn risk predictors leveraging genetics rather than longitudinal data.
Genetics-driven risk predictions leveraging the Mendelian randomization framework. Genome Research, published 2024-09-27. DOI: 10.1101/gr.279252.124.
Taishan Hu, Timothy L. Mosbruger, Nikolaos G. Tairis, Amalia Dinou, Pushkala Jayaraman, Mahdi Sarmady, Kingham Brewster, Yang Li, Tristan J. Hayeck, Jamie L. Duke, Dimitri S. Monos
The human Major Histocompatibility Complex (MHC) is an approximately 4 Mb genomic segment on Chromosome 6 that plays a pivotal role in the immune response. Despite its importance in various traits and diseases, its complexity makes it challenging to characterize accurately on a routine basis. We present a novel approach that allows targeted sequencing and de novo haplotypic assembly of the MHC region in heterozygous samples, using long-read sequencing technologies. Our approach is validated using two reference samples, two family trios, and an African-American sample. We achieved excellent coverage (96.6-99.9% with at least 30× depth) and high accuracy (99.89-99.99%) for the different haplotypes. This approach offers a reliable and cost-effective way to sequence and fully characterize the MHC without the need for whole-genome sequencing, facilitating broader studies of this important genomic segment, with significant implications for immunology, genetics, and medicine.
Targeted and complete genomic sequencing of the Major Histocompatibility Complex in haplotypic form of individual heterozygous samples. Genome Research, published 2024-09-26. DOI: 10.1101/gr.278588.123.
Marcin P Sajek, Danielle Y Bilodeau, Michael A Beer, Emma Horton, Yukiko Miyamoto, Katrina B Velle, Lars Eckmann, Lillian Fritz-Laylin, Olivia S Rissland, Neelanjan Mukherjee
The poly(A) signal, together with auxiliary elements, directs cleavage of a pre-mRNA and thus determines the 3' end of the mature transcript. In many species, including humans, the poly(A) signal is an AAUAAA hexamer, but we recently found that the deeply branching eukaryote Giardia lamblia uses a distinct hexamer (AGURAA) and lacks any known auxiliary elements. Our discovery prompted us to explore the evolutionary dynamics of poly(A) signals and auxiliary elements in the eukaryotic kingdom. We used direct RNA sequencing to determine poly(A) signals for four protists within the Metamonada clade (which also contains Giardia lamblia) and two outgroup protists. These experiments revealed that the AAUAAA hexamer serves as the poly(A) signal in at least four different eukaryotic clades, indicating that it is likely the ancestral signal, whereas the unusual Giardia version is derived. We found that the use and relative strengths of auxiliary elements are also surprisingly plastic; in fact, within Metamonada, species like Giardia lamblia make use of a previously unrecognized auxiliary element where nucleotides flanking the poly(A) signal itself specify genuine cleavage sites. Thus, despite the fundamental nature of pre-mRNA cleavage for the expression of all protein-coding genes, the motifs controlling this process are dynamic on evolutionary timescales, providing motivation for future biochemical and structural studies as well as new therapeutic angles to target eukaryotic pathogens.
Evolutionary dynamics of polyadenylation signals and their recognition strategies in protists. Genome Research, published 2024-09-26. DOI: 10.1101/gr.279526.124.
Avantika Lal, David Garfield, Tommaso Biancalani, Gokcen Eraslan
Cis-regulatory elements (CREs), such as promoters and enhancers, are DNA sequences that regulate the expression of genes. The activity of a CRE is influenced by the order, composition, and spacing of sequence motifs that are bound by proteins called transcription factors (TFs). Synthetic CREs with specific properties are needed for biomanufacturing as well as for many therapeutic applications including cell and gene therapy. Here, we present regLM, a framework to design synthetic CREs with desired properties, such as high, low, or cell type–specific activity, using autoregressive language models in conjunction with supervised sequence-to-function models. We used our framework to design synthetic yeast promoters and cell type–specific human enhancers. We demonstrate that the synthetic CREs generated by our approach are not only predicted to have the desired functionality but also contain biological features similar to experimentally validated CREs. regLM thus facilitates the design of realistic regulatory DNA elements while providing insights into the cis-regulatory code.
Designing realistic regulatory DNA with autoregressive language models. Genome Research, published 2024-09-25. DOI: 10.1101/gr.279142.124.
James L Shepherdson, David M Granas, Jie Li, Zara Shariff, Stephen P Plassmeyer, Alex S Holehouse, Michael A White, Barak A Cohen
The transcription factor (TF) cone-rod homeobox (CRX) is essential for the differentiation and maintenance of photoreceptor cell identity. Several human CRX variants cause degenerative retinopathies, but most are variants of uncertain significance (VUS). We performed a deep mutational scan (DMS) of nearly all possible single amino acid substitutions in CRX using a cell-based transcriptional reporter assay, curating a high-confidence list of nearly 2,000 variants with altered transcriptional activity. In the structured homeodomain, activity scores closely aligned to a predicted structure and demonstrated position-specific constraints on amino acid substitution. By contrast, the intrinsically disordered transcriptional effector domain displayed a qualitatively different pattern of substitution effects, following compositional constraints without specific residue position requirements in the peptide chain. These compositional constraints were consistent with the acidic exposure model of transcriptional activation. We evaluated the performance of the DMS assay as a clinical variant classification tool using gold-standard classified human variants from ClinVar, identifying pathogenic variants with high specificity and moderate sensitivity. That this performance could be achieved using a synthetic reporter assay in a foreign cell type, even for a highly cell type-specific TF like CRX, suggests that this approach shows promise for DMS of other TFs that function in cell types that are not easily accessible. Together, the results of the CRX DMS identify molecular features of the CRX effector domain and demonstrate utility for integration into the clinical variant classification pipeline.
Mutational scanning of CRX classifies clinical variants and reveals biochemical properties of the transcriptional effector domain. Genome Research, published 2024-09-25. DOI: 10.1101/gr.279415.124.
Stefania Fornezza, Vincenza Simona Delvecchio, William T Harvey, Philip C Dishuck, Evan E Eichler, Giuliana Giannuzzi
The 10q11.22 chromosomal region is a duplication-rich interval of the human genome and one of the last to be fully assembled. It carries copy-number variable genes associated with intellectual disability, bipolar disorder, and obesity. In this study, we characterized the structural diversity at this locus by analyzing 64 haploid assemblies produced by the Human Pangenome Reference Consortium. We identified eleven alternative haplotypes that differ in the copy number and/or orientation of large genomic segments, ranging from hundreds of kilobase pairs (kbp) to over one megabase pair (Mbp). We uncovered a 2.4 Mbp size difference between the shortest and longest haplotypes. Breakpoint analysis revealed that genomic instability results from nonallelic homologous recombination between segmental duplication (SD) pairs with varying similarity (94.4-99.6%). Nonetheless, these pairs generally recombine at positions where their identity is higher (>99.6%). Recurrent inversions occur with varying breakpoints within the same inverted SD pair. Inversion polymorphisms shuffle the entire SD arrangement, creating new predispositions to copy-number variations. The SD architecture is associated with a catarrhine-specific subgroup of the AGAP gene family, which likely triggered the accumulation of SDs at this locus over the past 25 million years of human evolution. Our results reveal extensive structural diversity and genomic instability at the 10q11.22 locus and expand the general understanding of the mutational mechanisms behind SD-mediated rearrangements.
AGAP duplicons associate with structural diversity at Chromosome 10q11.22. Genome Research, published 2024-09-25. DOI: 10.1101/gr.279454.124.
Apurva Narechania, Dean Bobo, Kevin Deitz, Rob DeSalle, Paul Planet, Barun Mathema
The COVID-19 pandemic has highlighted the critical role of genomic surveillance for guiding policy and control. Timeliness is key, but sequence alignment and phylogeny slow most surveillance techniques. Millions of SARS-CoV-2 genomes have been assembled. Phylogenetic methods are ill-equipped to handle this sheer scale. We introduce a pangenomic measure that examines the information diversity of a k-mer library drawn from a country's complete set of clinical, pooled, or wastewater sequence. Quantifying diversity is central to ecology. Hill numbers, or the effective number of species in a sample, provide a simple metric for comparing species diversity across environments. The more diverse the sample, the higher the Hill number. We adopt this ecological approach and consider each k-mer an individual and each genome a transect in the pangenome of the species. Structured in this way, Hill numbers summarize the temporal trajectory of pandemic variants, collapsing each day's assemblies into genome equivalents. For pooled or wastewater sequence, we instead compare days using survey sequence divorced from individual infections. Across data from the UK, USA, and South Africa, we trace the ascendance of new variants of concern as they emerge in local populations well before these variants are named and added to phylogenetic databases. Using data from San Diego wastewater, we monitor these same population changes from raw, unassembled sequence. This history of emerging variants senses all available data as it is sequenced, intimating variant sweeps to dominance or declines to extinction at the leading edge of the COVID-19 pandemic.
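As a rough illustration of the diversity measure mentioned above, the following sketch computes a Hill number from k-mer counts using the standard ecological definition, D_q = (sum of p_i^q)^(1/(1-q)), with the q = 1 case taken as the exponential of Shannon entropy. It treats each distinct k-mer as a "species" and pools all input sequences; the paper's treatment of genomes as transects and its conversion to genome equivalents are not reproduced here, and the function names are assumptions made for this example.

```python
import math
from collections import Counter


def kmer_counts(sequences, k):
    """Count every overlapping k-mer across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def hill_number(counts, q):
    """Hill number (effective number of types) of order q for a count table."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no observations to compute diversity from")
    props = [c / total for c in counts.values() if c > 0]
    if q == 1:
        # Limit as q -> 1: exponential of the Shannon entropy.
        return math.exp(-sum(p * math.log(p) for p in props))
    return sum(p ** q for p in props) ** (1 / (1 - q))


counts = kmer_counts(["ACGTACGT", "ACGTTTTT"], k=3)
for order in (0, 1, 2):
    print(order, round(hill_number(counts, order), 3))
```

Order 0 simply counts distinct k-mers, while higher orders weight abundant k-mers more heavily, so a sample dominated by a few k-mers yields a lower value at q = 2 than at q = 0.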
Rapid SARS-COV2 surveillance using clinical, pooled, or wastewater sequence as a sensor for population change. Genome Research, published 2024-09-25. DOI: 10.1101/gr.278594.123.
Seong W Han, San Jewell, Andrei Thomas-Tikhonenko, Yoseph Barash
Mapping transcriptomic variations using either short- or long-read RNA sequencing is a staple of genomic research. Long reads are able to capture entire isoforms and overcome repetitive regions, while short reads still provide improved coverage and error rates. Yet how to compare the technologies quantitatively, whether they can be combined, and what benefit such a combined view would bring remain open questions. We tackle these questions by first creating a pipeline to assess matched long- and short-read data using a variety of transcriptome statistics. We find that across datasets, algorithms, and technologies, matched short-read data detect roughly 30% more splice junctions, such that 10%-30% of the splice junctions included at 20% or more by short reads are missed by long reads. In contrast, long reads detect many more intron retention events and can detect full isoforms, pointing to the benefit of combining the technologies. We introduce MAJIQ-L, an extension of the MAJIQ software that enables a unified view of transcriptome variations from both technologies, and demonstrate its benefits. Our software can be used to assess any future long-read technology or algorithm and to combine it with short-read data for improved transcriptome analysis.
{"title":"Contrasting and combining transcriptome complexity captured by short and long RNA sequencing reads","authors":"Seong W Han, San Jewell, Andrei Thomas-Tikhonenko, Yoseph Barash","doi":"10.1101/gr.278659.123","DOIUrl":"https://doi.org/10.1101/gr.278659.123","url":null,"abstract":"Mapping transcriptomic variations using either short- or long-reads RNA sequencing is a staple of genomic research. Long reads are able to capture entire isoforms and overcome repetitive regions, while short reads still provide improved coverage and error rates. Yet how to quantitatively compare the technologies, can we combine those, and what may be the benefit of such a combined view remain open questions. We tackle these questions by first creating a pipeline to assess matched long and short reads data using a variety of transcriptome statistics. We find that across datasets, algorithms, and technologies, matched short reads data detects roughly 30% more splice junctions such that 10-30% of the splice junctions included at 20% or more by short reads are missed by long reads. In contrast, long reads detect many more intron retention events and can detect full isoforms, pointing to the benefit of combining the technologies. We introduce MAJIQ-L, an extension of the MAJIQ software to enable a unified view of transcriptome variations from both technologies and demonstrate its benefits. 
Our software can be used to assess any future long-read technology or algorithm, and combine it with short reads data for improved transcriptome analysis.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":null,"pages":null},"PeriodicalIF":7.0,"publicationDate":"2024-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142321737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Borislav H Hristov, William Stafford Noble, Alessandro Bertero
Most studies of genome organization have focused on intrachromosomal (cis) contacts because they harbor key features such as DNA loops and topologically associating domains. Interchromosomal (trans) contacts have received much less attention, and tools for interrogating potential biologically relevant trans structures are lacking. Here, we develop a computational framework that uses Hi-C data to identify sets of loci that jointly interact in trans. This method, trans-C, initiates probabilistic random walks with restarts from a set of seed loci to traverse an input Hi-C contact network, thereby identifying sets of trans-contacting loci. We validate trans-C in three increasingly complex models of established trans contacts: the Plasmodium falciparum var genes, the mouse olfactory receptor "Greek islands", and the human RBM20 cardiac splicing factory. We then apply trans-C to systematically test the hypothesis that genes coregulated by the same trans-acting element (i.e., a transcription or splicing factor) colocalize in three dimensions to form "RNA factories" that maximize the efficiency and accuracy of RNA biogenesis. We find that many loci with multiple binding sites of the same DNA-binding proteins interact with one another in trans, especially those bound by factors with intrinsically disordered domains. Similarly, clustered binding of a subset of RNA-binding proteins correlates with trans interaction of the encoding loci. Intriguingly, we observe that these trans-interacting loci are close to nuclear speckles. Our findings support the existence of trans-interacting chromatin domains (TIDs) driven by RNA biogenesis. Trans-C provides an efficient computational framework for studying these and other types of trans interactions, empowering studies of a poorly understood aspect of genome architecture.
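The random walk with restart mentioned in this abstract can be sketched generically as follows. This is not the trans-C implementation: the function name `random_walk_with_restart`, the `restart` probability of 0.5, the handling of loci without contacts, and the toy matrix are all assumptions made for illustration. The sketch assumes a symmetric, non-negative contact matrix in which only interchromosomal contacts have been kept.

```python
def random_walk_with_restart(contacts, seeds, restart=0.5, tol=1e-12, max_iter=10_000):
    """Stationary visiting probabilities of a walk that restarts at the seeds.

    contacts: square list of lists of non-negative contact weights.
    seeds:    indices of the seed loci (restart is uniform over them).
    """
    n = len(contacts)
    seed_set = sorted(set(seeds))
    start = [0.0] * n
    for s in seed_set:
        start[s] = 1.0 / len(seed_set)
    row_sums = [sum(row) for row in contacts]

    p = start[:]
    for _ in range(max_iter):
        # With probability `restart` the walker jumps back to a seed locus.
        nxt = [restart * s for s in start]
        for j in range(n):
            if p[j] == 0.0:
                continue
            move = (1.0 - restart) * p[j]
            if row_sums[j] == 0:
                # Assumption: a locus with no contacts returns its mass to the seeds.
                for i in range(n):
                    nxt[i] += move * start[i]
                continue
            # Otherwise step to a neighbour in proportion to contact weight.
            for i, w in enumerate(contacts[j]):
                if w:
                    nxt[i] += move * w / row_sums[j]
        delta = sum(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if delta < tol:
            break
    return p


if __name__ == "__main__":
    # Toy network: loci 0 and 1 contact strongly, 1 and 2 weakly, 3 is isolated.
    toy = [
        [0, 10, 0, 0],
        [10, 0, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0],
    ]
    scores = random_walk_with_restart(toy, seeds=[0])
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    print(ranked, scores)
```

Loci with high stationary probability are those most tightly connected to the seeds in the contact network, which is the intuition behind using such walks to nominate sets of jointly interacting loci.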
{"title":"Systematic identification of interchromosomal interaction networks supports the existence of specialized RNA factories","authors":"Borislav H Hristov, William Stafford Noble, Alessandro Bertero","doi":"10.1101/gr.278327.123","DOIUrl":"https://doi.org/10.1101/gr.278327.123","url":null,"abstract":"Most studies of genome organization have focused on intrachromosomal (<em>cis</em>) contacts because they harbor key features such as DNA loops and topologically associating domains. Interchromosomal (<em>trans</em>) contacts have received much less attention, and tools for interrogating potential biologically relevant <em>trans</em> structures are lacking. Here, we develop a computational framework that uses Hi-C data to identify sets of loci that jointly interact in <em>trans</em>. This method, trans-C, initiates probabilistic random walks with restarts from a set of seed loci to traverse an input Hi-C contact network, thereby identifying sets of <em>trans</em>-contacting loci. We validate trans-C in three increasingly complex models of established <em>trans</em> contacts: the <em>Plasmodium falciparum</em> <em>var</em> genes, the mouse olfactory receptor \"Greek islands\", and the human RBM20 cardiac splicing factory. We then apply trans-C to systematically test the hypothesis that genes coregulated by the same <em>trans</em>-acting element (i.e., a transcription or splicing factor) colocalize in three dimensions to form \"RNA factories\" that maximize the efficiency and accuracy of RNA biogenesis. We find that many loci with multiple binding sites of the same DNA binding proteins interact with one another in <em>trans</em>, especially those bound by factors with intrinsically disordered domains. Similarly, clustered binding of a subset of RNA-binding proteins correlates with <em>trans</em> interaction of the encoding loci. Intriguingly, we observe that these <em>trans</em>-interacting loci are close to nuclear speckles. 
Our findings support the existence of <em>trans</em> interacting chromatin domains (TIDs) driven by RNA biogenesis. Trans-C provides an efficient computational framework for studying these and other types of <em>trans</em> interactions, empowering studies of a poorly understood aspect of genome architecture.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":null,"pages":null},"PeriodicalIF":7.0,"publicationDate":"2024-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142321734","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}