Genome Research: Latest Articles

PWAS Hub for exploring gene-based associations of common complex diseases
IF 7 | Zone 2, Biology | Q1 Biochemistry & Molecular Biology | Pub Date: 2024-10-15 | DOI: 10.1101/gr.278916.123
Guy Kelman, Roei Zucker, Nadav Brandes, Michal Linial
PWAS (proteome-wide association study) is an innovative genetic association approach that complements widely used methods like GWAS (genome-wide association study). The PWAS approach involves three consecutive phases. First, machine learning modeling and probabilistic considerations quantify the impact of genetic variants on protein-coding genes’ biochemical functions. Second, for each individual, aggregating the variants per gene determines a gene-damaging score. Finally, standard statistical tests are applied in the case-control setting to yield statistically significant genes per phenotype. The PWAS Hub offers a user-friendly interface for an in-depth exploration of gene–disease associations from the UK Biobank (UKB). Results from PWAS cover 99 common diseases and conditions, each with over 10,000 diagnosed individuals per phenotype. Users can explore genes associated with these diseases, with separate analyses conducted for males and females. For each phenotype, the analyses account for sex-based genetic effects, inheritance modes (dominant and recessive), and the pleiotropic nature of associated genes. The PWAS Hub showcases its usefulness for asthma by navigating through proteomic-genetic analyses. Inspecting PWAS asthma-listed genes (a total of 27) provides insights into the underlying cellular and molecular mechanisms. Comparison of statistically significant PWAS genes for common diseases to the Open Targets benchmark shows partial but significant overlap in gene associations for most phenotypes. Graphical tools facilitate comparing genetic effects between PWAS and coding GWAS results, aiding in understanding the sex-specific genetic impact on common diseases. This adaptable platform is attractive to clinicians, researchers, and individuals interested in delving into gene–disease associations and sex-specific genetic effects.
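The second and third phases described in the abstract (per-gene aggregation, then a case-control comparison) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the per-variant effect scores are taken to be probabilities in [0, 1] that protein function is retained, the product rule in `gene_score` is one hypothetical aggregation, and the comparison is a plain difference of means rather than the statistical tests PWAS actually uses.

```python
from math import prod
from statistics import mean
from typing import Dict, List, Set


def gene_score(variant_effects: List[float]) -> float:
    """Hypothetical aggregation: multiply per-variant 'function retained' scores.

    An individual carrying no variants in the gene gets a score of 1.0.
    """
    return prod(variant_effects, start=1.0)


def mean_score_difference(scores: Dict[str, float], cases: Set[str]) -> float:
    """Mean gene score in cases minus mean gene score in controls."""
    case_scores = [s for person, s in scores.items() if person in cases]
    control_scores = [s for person, s in scores.items() if person not in cases]
    return mean(case_scores) - mean(control_scores)


# Toy cohort: variant effect scores carried by each person for ONE gene.
carried = {"p1": [0.2], "p2": [0.5, 0.8], "p3": [], "p4": [0.9]}
scores = {person: gene_score(effects) for person, effects in carried.items()}
diff = mean_score_difference(scores, cases={"p1", "p2"})
print(scores, diff)
```

A negative difference in this toy setting would mean that cases carry, on average, a more damaged copy of the gene than controls; a real analysis would replace the difference of means with a proper test that adjusts for covariates.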
Citations: 0
Telomere-to-telomere assembly by preserving contained reads
IF 7 | Zone 2, Biology | Q1 Biochemistry & Molecular Biology | Pub Date: 2024-10-15 | DOI: 10.1101/gr.279311.124
Sudhanva Shyam Kamath, Mehak Bindra, Debnath Pal, Chirag Jain
Automated telomere-to-telomere (T2T) de novo assembly of diploid and polyploid genomes remains a formidable task. A string graph is a commonly used assembly graph representation in the assembly algorithms. The string graph formulation employs graph simplification heuristics, which drastically reduce the count of vertices and edges. One of these heuristics involves removing the reads contained in longer reads. In practice, this heuristic occasionally introduces gaps in the assembly by removing all reads that cover one or more genome intervals. The factors contributing to such gaps remain poorly understood. In this work, we mathematically derived the frequency of observing a gap near a germline and a somatic heterozygous variant locus. Our analysis shows that (i) an assembly gap due to contained read deletion is an order of magnitude more frequent in Oxford Nanopore reads than PacBio HiFi reads due to differences in their read-length distributions, and (ii) this frequency decreases with an increase in the sequencing depth. Drawing cues from these observations, we addressed the weakness of the string graph formulation by developing the RAFT assembly algorithm. RAFT addresses the issue of contained reads by fragmenting reads and producing a more uniform read-length distribution. The algorithm retains spanned repeats in the reads during the fragmentation. We empirically demonstrate that RAFT significantly reduces the number of gaps using simulated datasets. Using real Oxford Nanopore and PacBio HiFi datasets of the HG002 human genome, we achieved a twofold increase in the contig NG50 and the number of haplotype-resolved T2T contigs compared to Hifiasm.
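The gap mechanism described in the abstract can be sketched in a deliberately simplified model. The assumptions are ours, not the paper's: reads are half-open intervals on a shared coordinate system with a known haplotype of origin, the two haplotypes differ only at a single heterozygous site `het`, and containment is decided from coordinates instead of from sequence overlaps as a real assembler would. The sketch shows the heuristic only; it does not implement RAFT's read fragmentation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Read:
    name: str
    hap: str    # haplotype of origin, "A" or "B"
    start: int  # half-open interval [start, end)
    end: int


def spans(r: Read, pos: int) -> bool:
    return r.start <= pos < r.end


def is_contained(r: Read, s: Read, het: int) -> bool:
    """True if r's sequence is a substring of the strictly longer read s."""
    inside = s.start <= r.start and r.end <= s.end
    longer = (s.end - s.start) > (r.end - r.start)
    # Sequences agree if the haplotypes match or r avoids the variant site.
    same_sequence = r.hap == s.hap or not spans(r, het)
    return inside and longer and same_sequence


def drop_contained(reads: List[Read], het: int) -> List[Read]:
    return [r for r in reads if not any(is_contained(r, s, het) for s in reads)]


def uncovered(reads: List[Read], hap: str, het: int, lo: int, hi: int) -> List[Tuple[int, int]]:
    """Sub-intervals of [lo, hi) with no read whose sequence matches haplotype `hap`."""
    usable = [r for r in reads if r.hap == hap or not spans(r, het)]
    gaps, pos = [], lo
    for r in sorted(usable, key=lambda r: r.start):
        if r.start > pos and pos < hi:
            gaps.append((pos, min(r.start, hi)))
        pos = max(pos, r.end)
    if pos < hi:
        gaps.append((pos, hi))
    return gaps


het = 50
reads = [
    Read("B1", "B", 0, 100),  # long read carrying the B allele
    Read("A0", "A", 10, 45),  # A reads that avoid the variant site ...
    Read("A2", "A", 55, 90),  # ... look identical to substrings of B1
    Read("A1", "A", 40, 60),  # the only read carrying the A allele
]
kept = drop_contained(reads, het)
print([r.name for r in kept])
print(uncovered(reads, "A", het, 10, 90), uncovered(kept, "A", het, 10, 90))
```

In this toy case the two haplotype-A reads flanking the variant are discarded as contained in the long haplotype-B read, so the read carrying the A allele is left without neighbours on its own haplotype.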
Citations: 0
Seamless, rapid, and accurate analyses of outbreak genomic data using split k-mer analysis
IF 7 | Zone 2, Biology | Q1 Biochemistry & Molecular Biology | Pub Date: 2024-10-15 | DOI: 10.1101/gr.279449.124
Romain Derelle, Johanna von Wachsmann, Tommi Mäklin, Joel Hellewell, Timothy Russell, Ajit Lalvani, Leonid Chindelevitch, Nicholas J. Croucher, Simon R. Harris, John A. Lees
Sequence variation observed in populations of pathogens can be used for important public health and evolutionary genomic analyses, especially outbreak analysis and transmission reconstruction. Identifying this variation is typically achieved by aligning sequence reads to a reference genome, but this approach is susceptible to reference biases and requires careful filtering of called genotypes. There is a need for tools that can process this growing volume of bacterial genome data, providing rapid results, but that remain simple so they can be used without highly trained bioinformaticians, expensive data analysis, and long-term storage and processing of large files. Here we describe split k-mer analysis (SKA2), a method that supports both reference-free and reference-based mapping to quickly and accurately genotype populations of bacteria using sequencing reads or genome assemblies. SKA2 is highly accurate for closely related samples, and in outbreak simulations, we show superior variant recall compared with reference-based methods, with no false positives. SKA2 can also accurately map variants to a reference and be used with recombination detection methods to rapidly reconstruct vertical evolutionary history. SKA2 is many times faster than comparable methods and can be used to add new genomes to an existing call set, allowing sequential use without the need to reanalyze entire collections. With an inherent absence of reference bias, high accuracy, and a robust implementation, SKA2 has the potential to become the tool of choice for genotyping bacteria. SKA2 is implemented in Rust and is freely available as open-source software.
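The general idea behind a split k-mer can be sketched in a few lines: each window of odd length k is stored as its two flanks (the key) plus its middle base (the value), so two samples that share a key but disagree on the middle base point to a candidate SNP without any reference alignment. This is a minimal illustration under our own assumptions, not SKA2's implementation: the function names are hypothetical, reverse complements are ignored, and keys with ambiguous middle bases are simply skipped.

```python
from typing import Dict, List, Set, Tuple


def split_kmers(seq: str, k: int = 7) -> Dict[str, Set[str]]:
    """Map each (left flank + right flank) key to the set of middle bases seen."""
    if k % 2 == 0 or k < 3:
        raise ValueError("k must be odd and at least 3")
    half = k // 2
    table: Dict[str, Set[str]] = {}
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        key = window[:half] + window[half + 1:]
        table.setdefault(key, set()).add(window[half])
    return table


def candidate_snps(a: str, b: str, k: int = 7) -> List[Tuple[str, str, str]]:
    """Return (key, base in a, base in b) for shared keys with differing middle bases."""
    ta, tb = split_kmers(a, k), split_kmers(b, k)
    out = []
    for key in sorted(ta.keys() & tb.keys()):
        if len(ta[key]) == 1 and len(tb[key]) == 1 and ta[key] != tb[key]:
            out.append((key, next(iter(ta[key])), next(iter(tb[key]))))
    return out


sample_a = "ACGTTGCAAGGCTAC"
sample_b = "ACGTTGCTAGGCTAC"  # differs from sample_a at one position (A -> T)
print(candidate_snps(sample_a, sample_b))
```

Because a variant is only detected when both flanks match exactly, this style of comparison works best for closely related samples, which is consistent with the outbreak setting the abstract describes.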
Citations: 0
Global characterization of somatic mutations and DNA methylation changes during vegetative propagation in strawberries
IF 7 | Zone 2, Biology | Q1 Biochemistry & Molecular Biology | Pub Date: 2024-10-15 | DOI: 10.1101/gr.279378.124
Shaoqiang Hu, Xiangguo Zeng, Yuguo Liu, Yongping Li, Minghao Qu, Wen-Biao Jiao, Yongchao Han, Chunying Kang
Somatic mutations arise and accumulate during tissue culture and vegetative propagation, potentially affecting various traits in horticultural crops, but their characteristics are still unclear. Here, somatic mutations in regenerated woodland strawberry derived from tissue culture of shoot tips under different conditions and 12 cultivated strawberry individuals are analyzed by whole genome sequencing. The mutation frequency of single nucleotide variants is significantly increased with increased hormone levels or prolonged culture time in the range of 3.3 × 10⁻⁸ to 3.0 × 10⁻⁶ mutations per site. CG methylation shows a stable reduction (0.71%–8.03%) in regenerated plants, and hypoCG-DMRs are more heritable after sexual reproduction. A high-quality haplotype-resolved genome is assembled for the strawberry cultivar “Beni hoppe.” The 12 “Beni hoppe” individuals randomly selected from different locations show 4731–6005 mutations relative to the reference genome, and the mutation frequency varies among the subgenomes. Our study has systematically characterized the genetic and epigenetic variants in regenerated woodland strawberry plants and different individuals of the same strawberry cultivar, providing an accurate assessment of somatic mutations at the genomic scale and nucleotide resolution in plants.
Citations: 0
Theoretical framework for the difference of two negative binomial distributions and its application in comparative analysis of sequencing data
IF 7 | Zone 2, Biology | Q1 Biochemistry & Molecular Biology | Pub Date: 2024-10-15 | DOI: 10.1101/gr.278843.123
Alicia Petrany, Ruoyu Chen, Shaoqiang Zhang, Yong Chen
High-throughput sequencing (HTS) technologies have been instrumental in investigating biological questions at the bulk and single-cell levels. Comparative analysis of two HTS data sets often relies on testing the statistical significance for the difference of two negative binomial distributions (DOTNB). Although negative binomial distributions are well studied, the theoretical results for DOTNB remain largely unexplored. Here, we derive basic analytical results for DOTNB and examine its asymptotic properties. As a state-of-the-art application of DOTNB, we introduce DEGage, a computational method for detecting differentially expressed genes (DEGs) in scRNA-seq data. DEGage calculates the mean of the sample-wise differences of gene expression levels as the test statistic and determines significant differential expression by computing the P-value with DOTNB. Extensive validation using simulated and real scRNA-seq data sets demonstrates that DEGage outperforms five popular DEG analysis tools: DEGseq2, DEsingle, edgeR, Monocle3, and scDD. DEGage is robust against high dropout levels and exhibits superior sensitivity when applied to balanced and imbalanced data sets, even with small sample sizes. We utilize DEGage to analyze prostate cancer scRNA-seq data sets and identify marker genes for 17 cell types. Furthermore, we apply DEGage to scRNA-seq data sets of mouse neurons with and without fear memory and reveal eight potential memory-related genes overlooked in previous analyses. The theoretical results and supporting software for DOTNB can be widely applied to comparative analyses of dispersed count data in HTS and broad research questions.
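To make the object of study concrete, the distribution of the difference Z = X − Y of two independent negative binomial variables can be evaluated numerically by brute-force convolution. The sketch below is not the paper's analytical result and not DEGage's code; it assumes the (r, p) parameterization in which X counts failures before the r-th success, and the truncation length `max_terms` is an arbitrary choice. The closed-form moments used as a cross-check follow from independence: the means subtract and the variances add.

```python
import math
from typing import Tuple


def nb_pmf(k: int, r: float, p: float) -> float:
    """P(X = k) for X ~ NB(r, p), the number of failures before the r-th success."""
    if k < 0:
        return 0.0
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log1p(-p))
    return math.exp(log_pmf)


def diff_nb_pmf(z: int, r1: float, p1: float, r2: float, p2: float,
                max_terms: int = 500) -> float:
    """P(X - Y = z) for independent X ~ NB(r1, p1), Y ~ NB(r2, p2), truncated sum."""
    start = max(0, -z)
    return sum(nb_pmf(y + z, r1, p1) * nb_pmf(y, r2, p2)
               for y in range(start, start + max_terms))


def diff_nb_moments(r1: float, p1: float, r2: float, p2: float) -> Tuple[float, float]:
    """Mean and variance of X - Y under independence."""
    mean = r1 * (1 - p1) / p1 - r2 * (1 - p2) / p2
    var = r1 * (1 - p1) / p1 ** 2 + r2 * (1 - p2) / p2 ** 2
    return mean, var


params = (5, 0.5, 3, 0.4)
support = range(-150, 151)
pmf = {z: diff_nb_pmf(z, *params) for z in support}
total = sum(pmf.values())
numeric_mean = sum(z * q for z, q in pmf.items())
print(total, numeric_mean, diff_nb_moments(*params))
```

A P-value for an observed difference could then be approximated by summing the relevant tail of `pmf`, although for large counts an asymptotic approximation would be far cheaper than this direct summation.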
Citations: 0
Complete genomes of Asgard archaea reveal diverse integrated and mobile genetic elements
IF 7 | Zone 2, Biology | Q1 Biochemistry & Molecular Biology | Pub Date: 2024-10-15 | DOI: 10.1101/gr.279480.124
Luis E. Valentin-Alvarado, Ling-Dong Shi, Kathryn E. Appler, Alexander Crits-Christoph, Valerie De Anda, Benjamin A. Adler, Michael L. Cui, Lynn Ly, Pedro Leão, Richard J. Roberts, Rohan Sachdeva, Brett J. Baker, David F. Savage, Jillian F. Banfield
Asgard archaea are of great interest as the progenitors of Eukaryotes, but little is known about the mobile genetic elements (MGEs) that may shape their ongoing evolution. Here, we describe MGEs that replicate in Atabeyarchaeia, a wetland Asgard archaea lineage represented by two complete genomes. We used soil depth–resolved population metagenomic data sets to track 18 MGEs for which genome structures were defined and precise chromosome integration sites could be identified for confident host linkage. Additionally, we identified a complete 20.67 kbp circular plasmid and two family-level groups of viruses linked to Atabeyarchaeia, via CRISPR spacer targeting. Closely related 40 kbp viruses possess a hypervariable genomic region encoding combinations of specific genes for small cysteine-rich proteins structurally similar to restriction-homing endonucleases. One 10.9 kbp integrative conjugative element (ICE) integrates genomically into the Atabeyarchaeum deiterrae-1 chromosome and has a 2.5 kbp circularizable element integrated within it. The 10.9 kbp ICE encodes an expressed Type IIG restriction-modification system with a sequence specificity matching an active methylation motif identified by Pacific Biosciences (PacBio) high-accuracy long-read (HiFi) metagenomic sequencing. Restriction-modification of Atabeyarchaeia differs from that of another coexisting Asgard archaea, Freyarchaeia, which has few identified MGEs but possesses diverse defense mechanisms, including DISARM and Hachiman, not found in Atabeyarchaeia. Overall, defense systems and methylation mechanisms of Asgard archaea likely modulate their interactions with MGEs, and integration/excision and copy number variation of MGEs in turn enable host genetic versatility.
Citations: 0
Characterising tandem repeat complexities across long-read sequencing platforms with TREAT and otter
IF 7 | Zone 2, Biology | Q1 Biochemistry & Molecular Biology | Pub Date: 2024-10-15 | DOI: 10.1101/gr.279351.124
Niccolo Tesi, Alex Salazar, Yaran Zhang, Sven van der Lee, Marc Hulsman, Lydian Knoop, Sanduni Wijesekera, Jana Krizova, Anne-Fleur Schneider, Maartje Pennings, Kristel Sleegers, Erik-Jan Kamsteeg, Marcel Reinders, Henne Holstege
Tandem repeats (TRs) play important roles in genomic variation and disease risk in humans. Long-read sequencing allows for the accurate characterization of TRs; however, the underlying bioinformatics remains challenging. We present otter and TREAT: otter is a fast targeted local assembler, cross-compatible across different sequencing platforms. It is integrated in TREAT, an end-to-end workflow for TR characterization, visualization and analysis across multiple genomes. In a comparison with existing tools based on long-read sequencing data from both Oxford Nanopore Technology (ONT, Simplex and Duplex) and PacBio (Sequel 2 and Revio), otter and TREAT achieved state-of-the-art genotyping and motif characterisation accuracy. Applied to clinically relevant TRs, TREAT/otter significantly identified individuals with pathogenic TR expansions. When applied to a case-control setting, we significantly replicated previously reported associations of TRs with Alzheimer's Disease, including those near or within APOC1 (p = 2.63 × 10⁻⁹), SPI1 (p = 6.5 × 10⁻³) and ABCA7 (p = 0.04) genes. We used TREAT/otter to systematically evaluate potential biases when genotyping TRs using diverse ONT and PacBio long-read sequencing datasets. We showed that, in rare cases (0.06%), long-read sequencing suffers from coverage drops in TRs, including the disease-associated TRs in ABCA7 and RFC1 genes. Such coverage drops can lead to TR misgenotyping, hampering the accurate characterization of TR alleles. Taken together, our tools can accurately genotype TRs across different sequencing technologies and with minimal requirements, allowing end-to-end analysis and comparisons of TRs in human genomes, with broad applications in research and clinical fields.
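As a small illustration of what motif characterisation means for a tandem repeat allele, the sketch below finds the shortest unit that tiles a sequence exactly and reports its copy number. It is a hypothetical helper written for this listing, not part of TREAT or otter, and it assumes a perfect repeat; real alleles often contain interruptions and mixed motifs, which exact tiling does not handle.

```python
from typing import Tuple


def smallest_motif(seq: str) -> Tuple[str, int]:
    """Return (motif, copies) for the shortest unit that tiles seq exactly.

    A sequence with no shorter exact period is its own motif with one copy.
    """
    n = len(seq)
    for size in range(1, n + 1):
        if n % size == 0 and seq[:size] * (n // size) == seq:
            return seq[:size], n // size
    return seq, 1  # only reached for the empty string


print(smallest_motif("CAGCAGCAG"))
print(smallest_motif("AAGGGAAGGG"))
print(smallest_motif("ACGT"))
```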
Citations: 0
Detecting m6A RNA modification from nanopore sequencing using a semi-supervised learning framework
IF 7 Zone 2 Biology Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY Pub Date: 2024-10-15 DOI: 10.1101/gr.278960.124
Haotian Teng, Marcus Stoiber, Ziv Bar-Joseph, Carl Kingsford
Direct nanopore-based RNA sequencing can be used to detect post-transcriptional base modifications, such as m6A methylation, based on the electric current signals produced by the distinct chemical structures of modified bases. A key challenge is the scarcity of adequate training data with known methylation modifications. We present Xron, a hybrid encoder-decoder framework that delivers a direct methylation-distinguishing basecaller by training on synthetic RNA data and immunoprecipitation-based experimental data in two steps. First, we generate data with more diverse modification combinations through in silico cross-linking. Second, we use this dataset to train an end-to-end neural network basecaller followed by fine-tuning on immunoprecipitation-based experimental data with label-smoothing. The trained neural network basecaller outperforms existing methylation detection methods on both read-level and site-level prediction scores. Xron is a standalone, end-to-end m6A-distinguishing basecaller capable of detecting methylated bases directly from raw sequencing signals, enabling de novo methylome assembly.
Citations: 0
CoRAL accurately resolves extrachromosomal DNA genome structures with long-read sequencing.
IF 6.2 Zone 2 Biology Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY Pub Date: 2024-10-11 DOI: 10.1101/gr.279131.124
Kaiyuan Zhu, Matthew G Jones, Jens Luebeck, Xinxin Bu, Hyerim Yi, King L Hung, Ivy Tsz-Lo Wong, Shu Zhang, Paul S Mischel, Howard Y Chang, Vineet Bafna

Extrachromosomal DNA (ecDNA) is a central mechanism for focal oncogene amplification in cancer, occurring in ∼15% of early-stage cancers and ∼30% of late-stage cancers. ecDNAs drive tumor formation, evolution, and drug resistance by dynamically modulating oncogene copy number and rewiring gene-regulatory networks. Elucidating the genomic architecture of ecDNA amplifications is critical for understanding tumor pathology and developing more effective therapies. Paired-end short-read (Illumina) sequencing and mapping have been utilized to represent ecDNA amplifications using a breakpoint graph, in which the inferred architecture of ecDNA is encoded as a cycle in the graph. Traversals of breakpoint graphs have been used to successfully predict ecDNA presence in cancer samples. However, short-read technologies are intrinsically limited in the identification of breakpoints, phasing together complex rearrangements and internal duplications, and deconvolution of cell-to-cell heterogeneity of ecDNA structures. Long-read technologies, such as from Oxford Nanopore Technologies, have the potential to improve inference as the longer reads are better at mapping structural variants and are more likely to span rearranged or duplicated regions. Here, we propose Complete Reconstruction of Amplifications with Long reads (CoRAL) for reconstructing ecDNA architectures using long-read data. CoRAL reconstructs likely cyclic architectures using quadratic programming that simultaneously optimizes parsimony of reconstruction, explained copy number, and consistency of long-read mapping. CoRAL substantially improves reconstructions in extensive simulations and 10 data sets from previously characterized cell lines compared with previous short- and long-read-based tools. As long-read usage becomes widespread, we anticipate that CoRAL will be a valuable tool for profiling the landscape and evolution of focal amplifications in tumors.

Citations: 0
Bayesian inference of sample-specific coexpression networks.
IF 6.2 Zone 2 Biology Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY Pub Date: 2024-10-11 DOI: 10.1101/gr.279117.124
Enakshi Saha, Viola Fanfani, Panagiotis Mandros, Marouen Ben Guebila, Jonas Fischer, Katherine H Shutta, Dawn L DeMeo, Camila M Lopes-Ramos, John Quackenbush

Gene regulatory networks (GRNs) are effective tools for inferring complex interactions between molecules that regulate biological processes and hence can provide insights into drivers of biological systems. Inferring coexpression networks is a critical element of GRN inference, as the correlation between expression patterns may indicate that genes are coregulated by common factors. However, methods that estimate coexpression networks generally derive an aggregate network representing the mean regulatory properties of the population and so fail to fully capture population heterogeneity. Bayesian optimized networks obtained by assimilating omic data (BONOBO) is a scalable Bayesian model for deriving individual sample-specific coexpression matrices that recognizes variations in molecular interactions across individuals. For each sample, BONOBO assumes a Gaussian distribution on the log-transformed centered gene expression and a conjugate prior distribution on the sample-specific coexpression matrix constructed from all other samples in the data. Combining the sample-specific gene coexpression with the prior distribution, BONOBO yields a closed-form solution for the posterior distribution of the sample-specific coexpression matrices, thus allowing the analysis of large data sets. We demonstrate BONOBO's utility in several contexts, including analyzing gene regulation in yeast transcription factor knockout studies, the prognostic significance of miRNA-mRNA interaction in human breast cancer subtypes, and sex differences in gene regulation within human thyroid tissue. We find that BONOBO outperforms other methods that have been used for sample-specific coexpression network inference and provides insight into individual differences in the drivers of biological processes.
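
The closed-form update described in the abstract above can be illustrated with a minimal sketch. This is not BONOBO's implementation: it assumes an inverse-Wishart prior (the standard conjugate prior for a Gaussian covariance with known mean) whose mean is the covariance of all other samples. The function name `posterior_mean_covariance`, the degrees-of-freedom parameter `nu` and its default are hypothetical choices made for illustration only.

```python
import numpy as np


def posterior_mean_covariance(X, i, nu=None):
    """Posterior mean covariance for sample i (illustrative sketch only).

    X  : (n_samples, n_genes) log-transformed expression matrix.
    i  : index of the sample of interest.
    nu : prior degrees of freedom (hypothetical parameter); must exceed
         n_genes + 1 so that the inverse-Wishart prior mean exists.
    """
    n, p = X.shape
    if nu is None:
        nu = p + 2  # smallest integer for which the prior mean exists
    if nu <= p + 1:
        raise ValueError("nu must exceed p + 1 for the prior mean to exist")

    # Centre the sample of interest on the gene-wise mean of all samples.
    x = X[i] - X.mean(axis=0)

    # Prior mean: covariance estimated from all other samples.
    S = np.atleast_2d(np.cov(np.delete(X, i, axis=0), rowvar=False))

    # Inverse-Wishart scale matrix chosen so that the prior mean equals S.
    psi = (nu - p - 1) * S

    # One Gaussian observation updates IW(psi, nu) to IW(psi + x x^T, nu + 1),
    # whose mean is (psi + x x^T) / (nu + 1 - p - 1).
    return (psi + np.outer(x, x)) / (nu - p)


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))  # toy data: 20 samples, 4 genes
post = posterior_mean_covariance(X, i=0)
print(post.shape)  # (4, 4)
```

In this sketch, a larger `nu` pulls the estimate toward the covariance of the other samples, while a smaller `nu` gives more weight to the individual sample's outer product. No sampling or iterative optimization is needed, which is the property the abstract credits for scaling to large data sets.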

Citations: 0