
Genome Research: latest articles

Inferring ancestry with the hierarchical soft clustering approach tangleGen.
IF 6.2, Zone 2 (Biology), Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY, Pub Date: 2024-10-21, DOI: 10.1101/gr.279399.124
Klara Elisabeth Burger, Solveig Klepper, Ulrike von Luxburg, Franz Baumdicker

Understanding the genetic ancestry of populations is central to numerous scientific and societal fields. It contributes to a better understanding of human evolutionary history, advances personalized medicine, aids in forensic identification, and allows individuals to connect to their genealogical roots. Existing methods, such as ADMIXTURE, have significantly improved our ability to infer ancestries. However, these methods typically work with a fixed number of independent ancestral populations. As a result, they provide insight into genetic admixture, but do not include a hierarchical interpretation. In particular, the intricate ancestral population structures remain difficult to unravel. Alternative methods with a consistent inheritance structure, such as hierarchical clustering, may offer benefits in terms of interpreting the inferred ancestries. Here, we present tangleGen, a soft clustering tool that transfers the hierarchical machine learning framework Tangles, which leverages graph theoretical concepts, to the field of population genetics. The hierarchical perspective of tangleGen on the composition and structure of populations improves the interpretability of the inferred ancestral relationships. Moreover, tangleGen adds a new layer of explainability, as it allows identifying the SNPs that are responsible for the clustering structure. We demonstrate the capabilities and benefits of tangleGen for the inference of ancestral relationships, using both simulated data and data from the 1000 Genomes Project.

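To make the contrast with a fixed number of independent ancestral populations concrete, the following minimal sketch shows one generic way soft memberships can be nested in a hierarchy: each split assigns a fraction of an individual's ancestry to either side, and leaf memberships are products along the path. This is an illustrative assumption, not tangleGen's algorithm; the tree encoding, the function `leaf_memberships`, and the example numbers are hypothetical.

```python
def leaf_memberships(tree, weight=1.0):
    """Return soft memberships for every leaf of a binary hierarchy.

    `tree` is either a leaf name (str) or a tuple
    (p_left, left_subtree, right_subtree), where p_left is the fraction
    of the individual's ancestry assigned to the left branch at that split.
    This encoding is a hypothetical one chosen for illustration.
    """
    if isinstance(tree, str):
        return {tree: weight}
    p_left, left, right = tree
    memberships = leaf_memberships(left, weight * p_left)
    memberships.update(leaf_memberships(right, weight * (1.0 - p_left)))
    return memberships


# Hypothetical individual: 0.8 of the ancestry falls on branch A at the
# first split; within A, 0.25 of that share falls on sub-branch A1.
individual = (0.8, (0.25, "A1", "A2"), "B")
memberships = leaf_memberships(individual)
```

Because the memberships of A1 and A2 always add up to the share assigned to A, the coarse and fine levels of the hierarchy stay consistent with one another, which is the kind of interpretation a flat clustering with a fixed number of populations does not provide.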
Citations: 0
Ultrasensitive allele inference from immune repertoire sequencing data with MiXCR.
IF 6.2, Zone 2 (Biology), Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY, Pub Date: 2024-10-21, DOI: 10.1101/gr.278775.123
Artem Mikelov, George Nefedev, Aleksandr Tashkeev, Oscar L Rodriguez, Diego A Ortmans, Valeriia Skatova, Mark Izraelson, Alexey N Davydov, Stanislav Poslavsky, Souad Rahmouni, Corey T Watson, Dmitriy M Chudakov, Scott D Boyd, Dmitry A Bolotin

Allelic variability in the adaptive immune receptor loci, which harbor the gene segments that encode B cell and T cell receptors (BCR/TCR), is of critical importance for immune responses to pathogens and vaccines. Adaptive immune receptor repertoire sequencing (AIRR-seq) has become widespread in immunology research making it the most readily available source of information about allelic diversity in immunoglobulin (IG) and T cell receptor (TR) loci. Here we present a novel algorithm for extra-sensitive and specific variable (V) and joining (J) gene allele inference, allowing reconstruction of individual high-quality gene segment libraries. The approach can be applied for inferring allelic variants from peripheral blood lymphocyte BCR and TCR repertoire sequencing data, including hypermutated isotype-switched BCR sequences, thus allowing high-throughput novel allele discovery from a wide variety of existing datasets. The developed algorithm is a part of the MiXCR software. We demonstrate the accuracy of this approach using AIRR-seq paired with long-read genomic sequencing data, comparing it to a widely used algorithm, TIgGER. We applied the algorithm to a large set of IG heavy chain (IGH) AIRR-seq data from 450 donors of ancestrally diverse population groups, and to the largest reported full-length TCR alpha and beta chain (TRA; TRB) AIRR-seq dataset, representing 134 individuals. This allowed us to assess the genetic diversity within the IGH, TRA and TRB loci in different populations and to establish a database of alleles of V and J genes inferred from AIRR-seq data and their population frequencies with free public access through an online database.

Citations: 0
Analyzing super-enhancer temporal dynamics reveals potential critical enhancers and their gene regulatory networks underlying skeletal muscle development.
IF 6.2, Zone 2 (Biology), Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY, Pub Date: 2024-10-21, DOI: 10.1101/gr.278344.123
Song Zhang, Chao Wang, Shenghua Qin, Choulin Chen, Yongzhou Bao, Yuanyuan Zhang, Lingna Xu, Qingyou Liu, Yunxiang Zhao, Kui Li, Zhonglin Tang, Yuwen Liu

Super-enhancers (SEs) govern the expression of genes defining cell identity. However, the dynamic landscape of SEs and their critical constituent enhancers involved in skeletal muscle development remains unclear. In this study, using pig as a model, we employed CUT&Tag to profile the enhancer-associated histone modification marker H3K27ac in skeletal muscle across two prenatal and three postnatal stages and investigated how SEs influence skeletal muscle development. We identified three SE families with distinct temporal dynamics: continuous (Con, 397), transient (TS, 434), and de novo (DN, 756). These SE families are associated with different temporal gene expression trajectories, biological functions, and DNA methylation levels. Notably, several lines of evidence suggest a potential prominent role of Con SEs in regulating porcine muscle development and meat traits. To pinpoint key cis-regulatory units in Con SEs, we developed an integrative approach that leverages information from eRNA annotation, GWAS signals and high-throughput capture STARR-seq experiments. Within Con SEs, we identified 20 candidate critical enhancers with meat and carcass-associated DNA variations that affect enhancer activity and inferred their upstream TFs and downstream target genes. As a proof of concept, we experimentally validated the role of one such enhancer and its potential target gene during myogenesis. Our findings reveal the dynamic regulatory features of SEs in skeletal muscle development and provide a general integrative framework for identifying critical enhancers underlying the formation of complex traits.

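The three temporal families can be illustrated with a small classification sketch over per-stage SE calls. The abstract does not state the exact rules, so the ones below are assumptions made for illustration: Con means called at all five stages, DN means absent at the first stage and present at the last, and TS covers every other pattern with at least one call. The stage labels and the function `classify_se` are hypothetical.

```python
from collections import Counter

# Hypothetical labels for the two prenatal and three postnatal stages.
STAGES = ["prenatal_1", "prenatal_2", "postnatal_1", "postnatal_2", "postnatal_3"]


def classify_se(presence):
    """Assign one SE to a temporal family from its per-stage presence calls.

    `presence` is a list of booleans, one per entry in STAGES.
    The rules are illustrative assumptions, not the paper's definitions.
    """
    if len(presence) != len(STAGES):
        raise ValueError("expected one presence call per stage")
    if not any(presence):
        return "absent"
    if all(presence):
        return "Con"   # called at every stage
    if not presence[0] and presence[-1]:
        return "DN"    # missing early, established by the last stage
    return "TS"        # present at some stages only


# Hypothetical SE calls.
calls = {
    "SE_1": [True, True, True, True, True],
    "SE_2": [True, True, False, False, False],
    "SE_3": [False, False, True, True, True],
    "SE_4": [False, True, True, False, False],
}
families = {name: classify_se(p) for name, p in calls.items()}
counts = Counter(families.values())
```

Tallying `counts` over all called SEs would give per-family totals analogous to the Con, TS, and DN numbers reported in the abstract, subject to the assumed rules.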
Citations: 0
Chromatin interaction maps identify oncogenic targets of enhancer duplications in cancer
IF 7, Zone 2 (Biology), Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY, Pub Date: 2024-10-18, DOI: 10.1101/gr.278418.123
Yueqiang Song, Fuyuan Li, Shangzi Wang, Yuntong Wang, Cong Lai, Lian Chen, Ning Jiang, Jin Li, Xingdong Chen, Swneke D. Bailey, Xiaoyang Zhang
As a major type of structural variant, tandem duplication plays a critical role in tumorigenesis by increasing oncogene dosage. Recent work has revealed that noncoding enhancers are also affected by duplications, leading to the activation of oncogenes that are inside or outside of the duplicated regions. However, the prevalence of enhancer duplications and the identity of their target genes remain largely unknown in the cancer genome. Here, by analyzing whole-genome sequencing data in a non-gene-centric manner, we identify 881 duplication hotspots in 13 major cancer types, most of which do not contain protein-coding genes. We show that the hotspots are enriched with distal enhancer elements and are highly lineage-specific. We develop a HiChIP-based methodology that navigates enhancer–promoter contact maps to prioritize the target genes for the duplication hotspots harboring enhancer elements. The methodology identifies many novel enhancer duplication events activating oncogenes such as ESR1, FOXA1, GATA3, GATA6, TP63, and VEGFA, as well as potentially novel oncogenes such as GRHL2, IRF2BP2, and CREB3L1. In particular, we identify a duplication hotspot on Chromosome 10p15 harboring a cluster of enhancers, which skips over two genes, through a long-range chromatin interaction, to activate an oncogenic isoform of the NET1 gene to promote migration of gastric cancer cells. Focusing on tandem duplications, our study substantially extends the catalog of noncoding driver alterations in multiple cancer types, revealing attractive targets for functional characterization and therapeutic intervention.
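The idea of using enhancer–promoter contact maps to prioritize target genes for a duplication hotspot can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' HiChIP pipeline: the scoring rule (sum of contact counts from enhancers lying inside the hotspot), the function `prioritize_targets`, and all coordinates, enhancer IDs, and gene names are hypothetical.

```python
from collections import defaultdict


def prioritize_targets(hotspot, enhancers, loops):
    """Rank genes by total contact count with enhancers inside a hotspot.

    hotspot:   (chrom, start, end) of the duplicated region
    enhancers: {enhancer_id: (chrom, start, end)}
    loops:     iterable of (enhancer_id, gene, contact_count) taken from
               an enhancer-promoter contact map
    The scoring rule is an assumption made for illustration.
    """
    chrom, start, end = hotspot
    inside = {
        eid
        for eid, (e_chrom, e_start, e_end) in enhancers.items()
        if e_chrom == chrom and e_start >= start and e_end <= end
    }
    scores = defaultdict(int)
    for eid, gene, count in loops:
        if eid in inside:
            scores[gene] += count
    # Highest-scoring gene first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical inputs.
hotspot = ("chr1", 1_000, 5_000)
enhancers = {
    "enh_1": ("chr1", 1_200, 1_800),
    "enh_2": ("chr1", 3_000, 3_600),
    "enh_3": ("chr1", 9_000, 9_500),  # outside the hotspot
}
loops = [
    ("enh_1", "GENE_A", 12),
    ("enh_2", "GENE_A", 7),
    ("enh_2", "GENE_B", 5),
    ("enh_3", "GENE_C", 40),          # ignored: enhancer not duplicated
]
ranking = prioritize_targets(hotspot, enhancers, loops)
```

Because the score depends on contacts rather than linear distance, a gene outside the duplicated region can outrank a nearer one, which matches the abstract's point that target oncogenes may lie inside or outside the duplicated regions.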
Citations: 0
Dynamic dysregulation of retrotransposons in neurodegenerative diseases at the single-cell level
IF 7, Zone 2 (Biology), Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY, Pub Date: 2024-10-18, DOI: 10.1101/gr.279363.124
Wankun Deng, Citu Citu, Andi Liu, Zhongming Zhao
Retrotransposable elements (RTEs) are common mobile genetic elements comprising ∼42% of the human genome. RTEs play critical roles in gene regulation and function, but how they are specifically involved in complex diseases is largely unknown. Here, we investigate the cellular heterogeneity of RTEs using 12 single-cell transcriptome profiles covering three neurodegenerative diseases: Alzheimer's disease (AD), Parkinson's disease, and multiple sclerosis. We identify cell type marker RTEs in neurons, astrocytes, oligodendrocytes, and oligodendrocyte precursor cells that are related to these diseases. The differential expression analysis reveals the landscape of dysregulated RTE expression, especially L1s, in excitatory neurons of multiple neurodegenerative diseases. Machine learning algorithms for predicting cell disease stage using a combination of RTE and gene expression features suggest dynamic regulation of RTEs in AD. Furthermore, we construct a single-cell atlas of retrotransposable elements in neurodegenerative disease (scARE) using these data sets and features. scARE has six feature analysis modules to explore RTE dynamics in a user-defined condition. To our knowledge, scARE represents the first systematic investigation of RTE dynamics at the single-cell level within the context of neurodegenerative diseases.
Citations: 0
De novo genome assemblies of two cryptodiran turtles with ZZ/ZW and XX/XY sex chromosomes provide insights into patterns of genome reshuffling and uncover novel 3D genome folding in amniotes
IF 7, Zone 2 (Biology), Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY, Pub Date: 2024-10-16, DOI: 10.1101/gr.279443.124
Basanta Bista, Laura González-Rodelas, Lucía Álvarez-González, Zhi-qiang Wu, Eugenia E. Montiel, Ling Sze Lee, Daleen B. Badenhorst, Srihari Radhakrishnan, Robert Literman, Beatriz Navarro-Dominguez, John B. Iverson, Simon Orozco-Arias, Josefa González, Aurora Ruiz-Herrera, Nicole Valenzuela
Understanding the evolution of chromatin conformation among species is fundamental to elucidate the architecture and plasticity of genomes. Nonrandom interactions of linearly distant loci regulate gene function in species-specific patterns, affecting genome function, evolution, and, ultimately, speciation. Yet, data from nonmodel organisms are scarce. To capture the macroevolutionary diversity of vertebrate chromatin conformation, here we generate de novo genome assemblies for two cryptodiran (hidden-neck) turtles via Illumina sequencing, chromosome conformation capture, and RNA-seq: Apalone spinifera (ZZ/ZW, 2n = 66) and Staurotypus triporcatus (XX/XY, 2n = 54). We detected differences in the three-dimensional (3D) chromatin structure in turtles compared to other amniotes beyond the fusion/fission events detected in the linear genomes. Namely, whole-genome comparisons revealed distinct trends of chromosome rearrangements in turtles: (1) a low rate of genome reshuffling in Apalone (Trionychidae) whose karyotype is highly conserved when compared to chicken (likely ancestral for turtles), and (2) a moderate rate of fusions/fissions in Staurotypus (Kinosternidae) and Trachemys scripta (Emydidae). Furthermore, we identified a chromosome folding pattern that enables “centromere–telomere interactions” previously undetected in turtles. The combined turtle pattern of “centromere–telomere interactions” (discovered here) plus “centromere clustering” (previously reported in sauropsids) is novel for amniotes and it counters previous hypotheses about amniote 3D chromatin structure. We hypothesize that the divergent pattern found in turtles originated from an amniote ancestral state defined by a nuclear configuration with extensive associations among microchromosomes that were preserved upon the reshuffling of the linear genome.
Citations: 0
PWAS Hub for exploring gene-based associations of common complex diseases
IF 7, Zone 2 (Biology), Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY, Pub Date: 2024-10-15, DOI: 10.1101/gr.278916.123
Guy Kelman, Roei Zucker, Nadav Brandes, Michal Linial
PWAS (proteome-wide association study) is an innovative genetic association approach that complements widely used methods like GWAS (genome-wide association study). The PWAS approach involves three consecutive phases. First, machine learning modeling and probabilistic considerations quantify the impact of genetic variants on protein-coding genes’ biochemical functions. Second, for each individual, aggregating the variants per gene determines a gene-damaging score. Finally, standard statistical tests are applied in the case-control setting to yield statistically significant genes per phenotype. The PWAS Hub offers a user-friendly interface for an in-depth exploration of gene–disease associations from the UK Biobank (UKB). Results from PWAS cover 99 common diseases and conditions, each with over 10,000 diagnosed individuals per phenotype. Users can explore genes associated with these diseases, with separate analyses conducted for males and females. For each phenotype, the analyses account for sex-based genetic effects, inheritance modes (dominant and recessive), and the pleiotropic nature of associated genes. The PWAS Hub showcases its usefulness for asthma by navigating through proteomic-genetic analyses. Inspecting PWAS asthma-listed genes (a total of 27) provides insights into the underlying cellular and molecular mechanisms. Comparison of PWAS-statistically significant genes for common diseases to the Open Targets benchmark shows partial but significant overlap in gene associations for most phenotypes. Graphical tools facilitate comparing genetic effects between PWAS and coding GWAS results, aiding in understanding the sex-specific genetic impact on common diseases. This adaptable platform is attractive to clinicians, researchers, and individuals interested in delving into gene–disease associations and sex-specific genetic effects.
Citations: 0
Seamless, rapid, and accurate analyses of outbreak genomic data using split k-mer analysis
IF 7 Zone 2 Biology Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY Pub Date: 2024-10-15 DOI: 10.1101/gr.279449.124
Romain Derelle, Johanna von Wachsmann, Tommi Mäklin, Joel Hellewell, Timothy Russell, Ajit Lalvani, Leonid Chindelevitch, Nicholas J. Croucher, Simon R. Harris, John A. Lees
Sequence variation observed in populations of pathogens can be used for important public health and evolutionary genomic analyses, especially outbreak analysis and transmission reconstruction. Identifying this variation is typically achieved by aligning sequence reads to a reference genome, but this approach is susceptible to reference biases and requires careful filtering of called genotypes. There is a need for tools that can process this growing volume of bacterial genome data, providing rapid results, but that remain simple so they can be used without highly trained bioinformaticians, expensive data analysis, and long-term storage and processing of large files. Here we describe split k-mer analysis (SKA2), a method that supports both reference-free and reference-based mapping to quickly and accurately genotype populations of bacteria using sequencing reads or genome assemblies. SKA2 is highly accurate for closely related samples, and in outbreak simulations, we show superior variant recall compared with reference-based methods, with no false positives. SKA2 can also accurately map variants to a reference and be used with recombination detection methods to rapidly reconstruct vertical evolutionary history. SKA2 is many times faster than comparable methods and can be used to add new genomes to an existing call set, allowing sequential use without the need to reanalyze entire collections. With an inherent absence of reference bias, high accuracy, and a robust implementation, SKA2 has the potential to become the tool of choice for genotyping bacteria. SKA2 is implemented in Rust and is freely available as open-source software.
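The idea of a split k-mer can be sketched in a few lines: two flanking k-mers act as a key and the base between them is the value, so two samples that share a key but disagree on the middle base differ by a SNP at that position. The function names, the choice of `k`, and the toy sequences below are assumptions for illustration; this is not the SKA2 implementation, and it ignores reverse complements and repeated flanks, which a real tool has to handle.

```python
def split_kmers(seq, k=3):
    """Map (left flank, right flank) -> middle base for each window of length 2k+1."""
    table = {}
    for i in range(len(seq) - 2 * k):
        left = seq[i:i + k]
        middle = seq[i + k]
        right = seq[i + k + 1:i + 2 * k + 1]
        table[(left, right)] = middle
    return table


def snps(a, b, k=3):
    """Return (left, right, base_in_a, base_in_b) for shared flanks with different middles."""
    ta, tb = split_kmers(a, k), split_kmers(b, k)
    shared = ta.keys() & tb.keys()
    return sorted((l, r, ta[(l, r)], tb[(l, r)]) for (l, r) in shared if ta[(l, r)] != tb[(l, r)])


# Two toy sequences that differ by a single substitution (C -> A at index 5).
sample_a = "ACGTACGGTCA"
sample_b = "ACGTAAGGTCA"
print(snps(sample_a, sample_b))
```

Because the comparison is made directly between samples, no reference genome is involved, which is the reference-free mode the abstract refers to.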
Citations: 0
Telomere-to-telomere assembly by preserving contained reads
IF 7 Zone 2 Biology Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY Pub Date: 2024-10-15 DOI: 10.1101/gr.279311.124
Sudhanva Shyam Kamath, Mehak Bindra, Debnath Pal, Chirag Jain
Automated telomere-to-telomere (T2T) de novo assembly of diploid and polyploid genomes remains a formidable task. A string graph is a commonly used assembly graph representation in the assembly algorithms. The string graph formulation employs graph simplification heuristics, which drastically reduce the count of vertices and edges. One of these heuristics involves removing the reads contained in longer reads. In practice, this heuristic occasionally introduces gaps in the assembly by removing all reads that cover one or more genome intervals. The factors contributing to such gaps remain poorly understood. In this work, we mathematically derived the frequency of observing a gap near a germline and a somatic heterozygous variant locus. Our analysis shows that (i) an assembly gap due to contained read deletion is an order of magnitude more frequent in Oxford Nanopore reads than PacBio HiFi reads due to differences in their read-length distributions, and (ii) this frequency decreases with an increase in the sequencing depth. Drawing cues from these observations, we addressed the weakness of the string graph formulation by developing the RAFT assembly algorithm. RAFT addresses the issue of contained reads by fragmenting reads and producing a more uniform read-length distribution. The algorithm retains spanned repeats in the reads during the fragmentation. We empirically demonstrate that RAFT significantly reduces the number of gaps using simulated datasets. Using real Oxford Nanopore and PacBio HiFi datasets of the HG002 human genome, we achieved a twofold increase in the contig NG50 and the number of haplotype-resolved T2T contigs compared to Hifiasm.
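How removing contained reads can open a gap near a heterozygous locus can be shown with a toy model. The sketch below is an assumption-laden illustration, not the paper's derivation and not RAFT: reads are plain intervals tagged with a haplotype, and a short read is treated as contained in a longer one when it is nested and the two are indistinguishable, meaning they share a haplotype or the short read covers no heterozygous site.

```python
from typing import NamedTuple


class Read(NamedTuple):
    start: int  # inclusive
    end: int    # exclusive
    hap: int    # haplotype of origin (unknown to a real assembler)


def looks_contained(short, long, het_sites):
    """True if `short` is nested in a strictly longer read with identical sequence."""
    nested = (long.start <= short.start and short.end <= long.end
              and (long.end - long.start) > (short.end - short.start))
    if not nested:
        return False
    # Reads from different haplotypes differ only where the short read
    # overlaps a heterozygous site.
    distinguishable = short.hap != long.hap and any(
        short.start <= s < short.end for s in het_sites)
    return not distinguishable


def drop_contained(reads, het_sites):
    """Apply the contained-read removal heuristic."""
    return [r for r in reads
            if not any(looks_contained(r, o, het_sites) for o in reads if o is not r)]


def uncovered(reads, hap, length):
    """Positions of haplotype `hap` that no remaining read covers."""
    covered = set()
    for r in reads:
        if r.hap == hap:
            covered.update(range(r.start, r.end))
    return [p for p in range(length) if p not in covered]


het_sites = [10, 90]
reads = [Read(0, 100, 0), Read(0, 40, 1), Read(30, 70, 1), Read(60, 100, 1)]
kept = drop_contained(reads, het_sites)
gap = uncovered(kept, 1, 100)
print(kept, (gap[0], gap[-1]) if gap else None)
```

In this example the middle read of haplotype 1 covers no heterozygous site, so it looks identical to part of the long haplotype 0 read and is discarded, leaving positions 40 to 59 of haplotype 1 without coverage.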
Citations: 0
Theoretical framework for the difference of two negative binomial distributions and its application in comparative analysis of sequencing data
IF 7 Zone 2 Biology Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY Pub Date: 2024-10-15 DOI: 10.1101/gr.278843.123
Alicia Petrany, Ruoyu Chen, Shaoqiang Zhang, Yong Chen
High-throughput sequencing (HTS) technologies have been instrumental in investigating biological questions at the bulk and single-cell levels. Comparative analysis of two HTS data sets often relies on testing the statistical significance for the difference of two negative binomial distributions (DOTNB). Although negative binomial distributions are well studied, the theoretical results for DOTNB remain largely unexplored. Here, we derive basic analytical results for DOTNB and examine its asymptotic properties. As a state-of-the-art application of DOTNB, we introduce DEGage, a computational method for detecting differentially expressed genes (DEGs) in scRNA-seq data. DEGage calculates the mean of the sample-wise differences of gene expression levels as the test statistic and determines significant differential expression by computing the P-value with DOTNB. Extensive validation using simulated and real scRNA-seq data sets demonstrates that DEGage outperforms five popular DEG analysis tools: DEGseq2, DEsingle, edgeR, Monocle3, and scDD. DEGage is robust against high dropout levels and exhibits superior sensitivity when applied to balanced and imbalanced data sets, even with small sample sizes. We utilize DEGage to analyze prostate cancer scRNA-seq data sets and identify marker genes for 17 cell types. Furthermore, we apply DEGage to scRNA-seq data sets of mouse neurons with and without fear memory and reveal eight potential memory-related genes overlooked in previous analyses. The theoretical results and supporting software for DOTNB can be widely applied to comparative analyses of dispersed count data in HTS and broad research questions.
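For readers who want a feel for the distribution of a difference of two negative binomial variables, its probability mass function can be approximated by a truncated convolution, P(X - Y = d) = Σ_y P(X = d + y) P(Y = y). The sketch below is a brute-force numerical illustration under assumed conventions (independent X and Y, each parameterized by a size `r` and a success probability `p`, counting failures); it does not reproduce the analytical results of the paper or the DEGage software, and the function names and `terms` cutoff are hypothetical.

```python
from math import exp, lgamma, log


def nb_pmf(k, r, p):
    """P(K = k) for a negative binomial with size r > 0 and success probability 0 < p < 1."""
    if k < 0:
        return 0.0
    return exp(lgamma(k + r) - lgamma(k + 1) - lgamma(r)
               + r * log(p) + k * log(1 - p))


def dotnb_pmf(d, r1, p1, r2, p2, terms=2000):
    """Approximate P(X - Y = d) for independent X ~ NB(r1, p1), Y ~ NB(r2, p2).

    The infinite sum over y is truncated at `terms`; the result is only
    accurate when the tails beyond that cutoff are negligible.
    """
    return sum(nb_pmf(d + y, r1, p1) * nb_pmf(y, r2, p2)
               for y in range(max(0, -d), terms))


# With identical parameters the difference is symmetric around zero.
total = sum(dotnb_pmf(d, 3, 0.5, 3, 0.5, terms=300) for d in range(-100, 101))
print(round(total, 6), dotnb_pmf(2, 3, 0.5, 3, 0.5, terms=300))
```

Summing the tail of this mass function beyond an observed difference gives a P-value in the same spirit as the test described in the abstract, though a production method would rely on the closed-form and asymptotic results rather than brute-force summation.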
Citations: 0