bioRxiv - Genomics: latest articles, page 10

Iso-Seq enables discovery of novel isoform variants in human retina at single cell resolution

bioRxiv - Genomics

Pub Date : 2024-08-09 DOI: 10.1101/2024.08.08.607267

Luozixian Wang, Daniel Urrutia-Cabrera, Sandy Shen-Chi Hung, Alex W Hewitt, Samuel W Lukowski, Careen Foord, Peng-Yuan Wang, Hagen Tilgner, Raymond Wong

Recent single-cell transcriptomic profiling of the human retina has provided important insights into the genetic signals in the heterogeneous retinal cell populations that enable vision. However, conventional single-cell RNA-seq with 3' short-read sequencing is not well suited to identifying isoform variants. Here we used Iso-Seq with full-length sequencing to profile the human retina at single-cell resolution for isoform discovery. We generated a retina transcriptome dataset consisting of 25,302 nuclei from three donor retinas, and detected 49,710 known transcripts and 241,949 novel transcripts across major retinal cell types. We surveyed the use of alternative promoters to drive transcript variant expression, and showed that 1-8% of genes used multiple promoters across major retinal cell types. Our results also enabled expression profiling of novel transcript variants for inherited retinal disease (IRD) genes, and identified differential exon usage in major retinal cell types. Altogether, we generated a human retina transcriptome dataset at single-cell resolution with full-length sequencing. Our study highlights the potential of Iso-Seq to map isoform diversity in the human retina, providing an expanded view of its complex transcriptomic landscape.


Citations: 0

A Draft Pacific Ancestry Pangenome Reference

bioRxiv - Genomics

Pub Date : 2024-08-09 DOI: 10.1101/2024.08.07.606392

Connor C Littlefield, Jose M Lazaro-Guevara, Devorah Stucki, Michael Lansford, Melissa H Pezzolesi, Emma J Taylor, Etoni Ma'asi C Wolfgramm, Jacob Taloa, Kime Lao, C Dave Dumaguit, Perry G Ridge, Justina P Tavana, William L Holland, Kalani L Raphael, Marcus G. Pezzolesi

Individuals of Pacific ancestry experience some of the greatest health disparities yet remain vastly underrepresented in genomic research, including in currently available linear and pangenome references. To begin addressing this, we developed the first Pacific ancestry pangenome reference using 23 individuals with diverse Pacific ancestry. We assembled 46 haploid genomes from these 23 individuals, resulting in highly accurate and contiguous genome assemblies with an average quality value of 55.0 and an average N50 of 40.7 Mb, marking the first de novo assembly of highly accurate Pacific ancestry genomes. We combined these assemblies to create a pangenome reference, which added 30.6 Mb of novel sequence missing from the Human Pangenome Reference Consortium (HPRC) reference. Mapping short reads to this pangenome reduced variant-calling errors and yielded more true-positive variants than mapping to the HPRC and T2T-CHM13 references. This Pacific ancestry pangenome reference serves as a resource to enhance genetic analyses for this underserved population.
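
The two assembly metrics quoted in this abstract, N50 and quality value, can be illustrated with a minimal sketch. It assumes the standard definitions (N50 as the contig length at which the cumulative sum of length-sorted contigs reaches half of the assembly, and quality value on the Phred scale); the abstract does not say which tools the authors used to compute them, and the function names and contig lengths below are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths.

    N50 is the largest length L such that contigs of length >= L
    together cover at least half of the total assembly length.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must be non-empty")


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error rate."""
    return 10 ** (-qv / 10)


# Hypothetical toy assembly; these are not data from the preprint.
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_contigs))        # contig length at the half-way point
print(qv_to_error_rate(55.0))  # error rate implied by a QV of 55.0
```

Under the Phred definition, a quality value of 55.0 corresponds to roughly three errors per million bases.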


Citations: 0

The repertoire of short tandem repeats across the tree of life

bioRxiv - Genomics

Pub Date : 2024-08-09 DOI: 10.1101/2024.08.08.607201

Nikol Chantzi, Ilias Georgakopoulos-Soares

Short tandem repeats (STRs) are widespread, dynamic repetitive elements with a number of biological functions and relevance to human diseases. However, their prevalence across taxa remains poorly characterized. Here we examined the impact of STRs in the genomes of 117,253 organisms spanning the tree of life. We find large differences in STR frequency between organismal genomes, and these differences are largely driven by the taxonomic group an organism belongs to. Using simulated genomes, we find that on average there is no enrichment of STRs in bacterial and archaeal genomes, suggesting that these genomes are not particularly repetitive. In contrast, we find that eukaryotic genomes are orders of magnitude more repetitive than expected. STRs are preferentially located at functional loci in specific taxa. Finally, we use the recently completed Telomere-to-Telomere genomes of human and other great apes, and find that STRs are highly abundant and variable between primate species, particularly in peri/centromeric regions. We conclude that STRs have expanded in eukaryotic and viral lineages but not in archaea or bacteria, resulting in large discrepancies in genomic composition.
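
To make the notion of an STR concrete, here is a minimal sketch of tandem-repeat detection with a regular expression. The abstract does not describe the detection method or thresholds used in the preprint, so the function name, the 1-6 bp motif range, and the minimum of three copies are all assumptions chosen for illustration.

```python
import re


def find_strs(sequence, min_unit=1, max_unit=6, min_copies=3):
    """Return (start, motif, copies) for each perfect tandem repeat found.

    Assumed thresholds: motifs of min_unit..max_unit bases repeated at
    least min_copies times. The lazy quantifier prefers the shortest motif.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        copies = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, copies))
    return hits


# Hypothetical toy sequence containing one dinucleotide repeat.
print(find_strs("TTGACACACACGG"))
```

This only finds perfect repeats; tolerating interruptions or mismatches would need a different approach.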


Citations: 0

Examining Sex-Specific DNA Methylation and Variability Post In Vitro Fertilization

bioRxiv - Genomics

Pub Date : 2024-08-09 DOI: 10.1101/2024.08.08.604307

Melanie Lemaire, Keaton Warrick Smith, Samantha L Wilson

Infertility impacts up to 17.5% of reproductive-aged couples worldwide. To aid in conception, many couples turn to assisted reproductive technology, such as in vitro fertilization (IVF). IVF can introduce both physical and environmental stressors that may alter DNA methylation regulation, an important and dynamic process during early fetal development. This meta-analysis aims to assess the differences in the placental DNA methylome between spontaneous and IVF pregnancies. We identified three studies from NCBI GEO that measured DNA methylation with an Illumina Infinium microarray in post-delivery placental tissue from both IVF and spontaneous pregnancies, with a total of 575 samples for analysis (n = 96 IVF, n = 479 spontaneous). While there were no significant or differentially methylated CpGs in mixed or female-stratified populations, we identified 9 CpGs that reached statistical significance (FDR < 0.05) between IVF (n = 56) and spontaneous (n = 238) placentae. Seven autosomal CpGs and one X-chromosome CpG were hypermethylated, and two autosomal CpGs were hypomethylated, in the IVF placentae compared to spontaneous. Autosomal CpGs closest to LIPJ, EEF1A2, and FBRSL1 also met our criteria to be classified as biologically differentially methylated CpGs (FDR < 0.05; |δβ| > 0.05). When analyzing variability differences in δβ values between IVF females, IVF males, spontaneous females, and spontaneous males, we found a significant shift to greater variability in both IVF males and females compared to spontaneous (p < 2.2e-16, p < 2.2e-16). Trends of variability were further analyzed in the biologically differentially methylated autosomal CpGs near LIPJ, EEF1A2, and FBRSL1, and while these regions were statistically significant in males, the female δβ and δCoVs followed a similar trend that differed in magnitude. In males and females there was a statistically significant difference in the proportions of endothelial cells, Hofbauer cells, stromal cells, and syncytiotrophoblasts between spontaneous and IVF populations. We also observed significant differences between sexes within reproduction type in syncytiotrophoblasts and trophoblasts. The results of this study are critical to further understanding the impact of IVF on tissue epigenetics, which may help to investigate the connections between IVF and negative pregnancy outcomes. Additionally, our study supports that sex-specific differences in placental DNA methylation and cell composition should be considered as factors in future placental DNA methylation analyses.
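
The two-part criterion quoted in this abstract (FDR < 0.05 and |δβ| > 0.05) can be sketched as follows. The abstract does not state which FDR procedure was used, so the Benjamini-Hochberg adjustment here is an assumption, and the function names and input values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def differentially_methylated(pvalues, delta_betas,
                              fdr_cutoff=0.05, effect_cutoff=0.05):
    """Return indices of CpGs passing both the FDR and |delta beta| cutoffs."""
    fdr = benjamini_hochberg(pvalues)
    return [
        i for i in range(len(pvalues))
        if fdr[i] < fdr_cutoff and abs(delta_betas[i]) > effect_cutoff
    ]


# Hypothetical p-values and methylation differences; not data from the study.
p = [0.001, 0.02, 0.03, 0.5]
db = [0.10, 0.01, -0.08, 0.20]
print(differentially_methylated(p, db))
```

Requiring an effect-size cutoff alongside the FDR cutoff separates CpGs that are merely statistically significant from those whose methylation shift is large enough to be of biological interest, which matches the distinction the abstract draws.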


Citations: 0

Epigenetic clock and lifespan prediction in the short-lived killifish Nothobranchius furzeri

bioRxiv - Genomics

Pub Date : 2024-08-09 DOI: 10.1101/2024.08.07.606986

Chiara Giannuzzi, Mario Baumgart, Francesco Neri, Alessandro Cellerino

Aging, characterized by a gradual decline in organismal fitness, is the primary risk factor for numerous diseases including cancer, cardiovascular, and neurodegenerative disorders. The inter-individual variability in aging and disease susceptibility has led to the concept of biological age, an indirect measure of an individual's relative fitness. Epigenetic changes, particularly DNA methylation, have emerged as reliable biomarkers for estimating biological age, leading to the development of predictive models known as epigenetic clocks. Initially created for humans, these clocks have been extended to various mammalian species. Here we set out to extend these tools to the short-lived killifish, Nothobranchius furzeri. This species, with its remarkably short lifespan and expression of canonical aging hallmarks, offers a unique model for experimental aging studies. We developed an epigenetic clock for N. furzeri using reduced-representation bisulfite sequencing (RRBS) to analyze DNA methylation in brain and caudal fin tissues across different ages. Our study involved generating comprehensive DNA methylation datasets and employing machine learning to create predictive models based on individual CpG sites and co-methylation modules. These models demonstrated high accuracy in estimating chronological age, with a median absolute error of 3 weeks (7.5% of median lifespan) for a clock based on methylation of individual CpGs and 1.5 weeks (3.7% of median lifespan) for an eigenvector-based clock. Our investigation extended to assessing epigenetic age acceleration in different strains and the potential resetting effect of regeneration on fin tissue. Notably, our models indicated that a shorter-lived strain has accelerated epigenetic aging and that regeneration does not reset, but may decelerate, epigenetic aging. Additionally, we used longitudinal data to develop an "epigenetic timer" for direct prediction of individual lifespan based on fin biopsies and an eigenvector-based method, achieving a median absolute error of 4.5 weeks in the prediction of actual age of death. This surprising result underscores the existence of intrinsic determinants of lifespan established early in life. This study presents the first epigenetic clocks and lifespan predictors for N. furzeri, highlights their potential as aging biomarkers, and sets the stage for future research on life-extending interventions in this model organism.
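
The accuracy figures in this abstract are median absolute errors expressed as a share of median lifespan. A minimal sketch of that arithmetic follows; the median lifespan of 40 weeks is not stated in the abstract and is inferred here from its own figures (3 weeks being 7.5% of the median lifespan), and the function names and toy ages are hypothetical.

```python
from statistics import median


def median_absolute_error(true_ages, predicted_ages):
    """Median of the absolute differences between true and predicted ages."""
    return median(abs(t - p) for t, p in zip(true_ages, predicted_ages))


def percent_of_lifespan(error_weeks, median_lifespan_weeks):
    """Express an error in weeks as a percentage of median lifespan."""
    return 100 * error_weeks / median_lifespan_weeks


# Hypothetical ages in weeks; not data from the preprint.
true_ages = [5, 10, 20, 30]
predicted = [7, 9, 24, 27]
print(median_absolute_error(true_ages, predicted))

# Assumed median lifespan of 40 weeks, inferred from 3 weeks = 7.5%.
print(percent_of_lifespan(3, 40))
print(percent_of_lifespan(1.5, 40))
```

With a 40-week median lifespan, the 1.5-week error works out to 3.75%, consistent with the 3.7% quoted in the abstract.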


Citations: 0

Transposable elements impact the regulatory landscape through cell type specific epigenomic associations

bioRxiv - Genomics

Pub Date : 2024-08-07 DOI: 10.1101/2024.08.07.606967

Jeffrey Hyacinthe, Guillaume Bourque

Transposable elements (TEs) are DNA sequences able to create copies of themselves within the genome. Despite their limited expression due to silencing, TEs still manage to impact the host genome. For instance, some TEs have been shown to act as cis-regulatory elements and to be co-opted in the human genome. This suggests that the contributions of TEs to the host might come from their relationship with the epigenome rather than from their expression. However, a systematic analysis that relates TEs in the human genome directly to chromatin histone marks across distinct cell types remains lacking. Here we leverage a new dataset from the International Human Epigenome Consortium with 4867 uniformly processed ChIP-seq experiments for 6 histone marks across 175 annotated cell labels and show that TEs have drastically different enrichment levels across marks. Overall, we find that TEs are generally depleted in the H3K9me3 histone modification, except for L1s, while MIRs are highly enriched in H3K4me1, H3K27ac, and H3K27me3, and Alus are enriched in H3K36me3. Furthermore, we present a generalised profile of the relationship between TE enrichment and TE age, which reveals a few TE families (Alu, MIR, L2) as diverging from expected dynamics. We also find some significant differences in TE enrichment between cell types and that, in 20% of cases, these enrichments are cell-type specific. We report that at least 4% of cell types with healthy and cancer samples feature significant differences. Notably, we identify 456 TE-cell type-histone triplet candidates with the strongest cell-type-specific enrichments. We show that many of these candidates are associated with relevant biological processes and genes expressed in the relevant cell type. These results further support a role for TEs in genome regulation and highlight novel associations between TEs and histone marks across cell types.


Citations: 0

Genome-Wide Association Study and transcriptome analysis reveals a complex gene network that regulates opsin gene expression and cell fate determination in Drosophila R7 photoreceptor cells

bioRxiv - Genomics

Pub Date : 2024-08-07 DOI: 10.1101/2024.08.05.606616

John C. Aldrich, Lauren A. Vanderlinden, Thomas L. Jacobsen, Cheyret Wood, Laura M. Saba, Steven G. Britt

Background: An animal’s ability to discriminate between differing wavelengths of light (i.e., color vision) is mediated, in part, by a subset of photoreceptor cells that express opsins with distinct absorption spectra. In Drosophila R7 photoreceptors, expression of the rhodopsin molecules Rh3 or Rh4 is determined by a stochastic process mediated by the transcription factor spineless. The goal of this study was to identify additional factors that regulate R7 cell fate and opsin choice using a genome-wide association study (GWAS) paired with transcriptome analysis via RNA-Seq.

Citations: 0

Extensively acquired antimicrobial resistant bacteria restructure the individual microbial community in post-antibiotic conditions

bioRxiv - Genomics

Pub Date : 2024-08-07 DOI: 10.1101/2024.08.07.606955

Jae Woo Baek, Songwon Lim, Nayeon Park, Byeongsop Song, Nikhil Kirtipal, Jens Nielsen, Adil Mardinoglu, Saeed Shoaie, Jae-il Kim, Jang Won Son, Ara Koh, Sunjae Lee

In recent years, the overuse of antibiotics has led to the emergence of antimicrobial resistant (AMR) bacteria. To evaluate the spread of AMR bacteria, the reservoir of AMR genes (resistome) has traditionally been identified from environmental samples, hospital environments, and human populations; however, the functional role of AMR bacteria in the human gut microbiome and their persistence within individuals have not been fully investigated. Here, we performed a strain-resolved in-depth analysis of resistome changes by reconstructing a large number of metagenome-assembled genomes (MAGs) from the gut microbiomes of antibiotic-treated individuals. Interestingly, we identified two bacterial populations with different resistome profiles, extensively acquired antimicrobial resistant bacteria (EARB) and sporadically acquired antimicrobial resistant bacteria (SARB), and found that EARB showed broader drug resistance and a significant functional role in shaping individual microbiome composition after antibiotic treatment. Furthermore, longitudinal strain analysis revealed that EARB were inherently carried by individuals and can reemerge through strain switching in the human gut microbiome. Our data on the presence of AMR bacteria in the human gut microbiome provide a new avenue for controlling the spread of AMR bacteria in the human community.

Citations: 0

Scalable imaging-free spatial genomics through computational reconstruction

bioRxiv - Genomics

Pub Date : 2024-08-07 DOI: 10.1101/2024.08.05.606465

Chenlei Hu, Mehdi Borji, Giovanni J. Marrero, Vipin Kumar, Jackson A. Weir, Sachin V. Kammula, Evan Z. Macosko, Fei Chen

Tissue organization arises from the coordinated molecular programs of cells. Spatial genomics maps cells and their molecular programs within the spatial context of tissues. However, current methods measure spatial information through imaging or direct registration, which often require specialized equipment and are limited in scale. Here, we developed an imaging-free spatial transcriptomics method that uses molecular diffusion patterns to computationally reconstruct spatial data. To do so, we use a simple experimental protocol on two-dimensional barcode arrays to establish an interaction network between barcodes via molecular diffusion. Sequencing these interactions generates a high-dimensional matrix of interactions between different spatial barcodes. We then perform dimensionality reduction to regenerate a two-dimensional manifold, which represents the spatial locations of the barcode arrays. Surprisingly, we found that the UMAP algorithm, with minimal modifications, can faithfully reconstruct the arrays. We demonstrated that this method is compatible, with high fidelity, with the capture-array-based spatial transcriptomics/genomics methods Slide-seq and Slide-tags. We systematically explore the fidelity of the reconstruction through comparisons with experimentally derived ground truth data, and demonstrate that reconstruction generates high-quality spatial genomics data. We also scaled this technique to reconstruct high-resolution spatial information over areas up to 1.2 centimeters. This computational reconstruction method effectively converts spatial genomics measurements to molecular biology, enabling spatial transcriptomics with high accessibility and scalability.
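The reconstruction idea in this abstract can be illustrated with a toy sketch. It assumes an idealized, noise-free Gaussian diffusion kernel between barcodes and swaps in classical multidimensional scaling (MDS) for the UMAP step the abstract describes, so it is not the authors' pipeline. The function names and the `sigma` and `scale` parameters are hypothetical.

```python
import numpy as np


def simulate_diffusion_counts(coords, sigma=1.5, scale=1000.0):
    """Expected barcode-barcode interaction counts under an assumed Gaussian kernel."""
    sq_dist = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    return scale * np.exp(-sq_dist / (2.0 * sigma ** 2))


def reconstruct_positions(counts, n_dims=2):
    """Recover relative barcode positions from an interaction matrix (classical MDS)."""
    # Invert the assumed kernel: the log-ratio of each count to the
    # self-interaction level is proportional to the squared distance.
    self_level = np.sqrt(np.outer(np.diag(counts), np.diag(counts)))
    sq_dist = -np.log(counts / self_level)
    # Double-center the squared distances to obtain a Gram matrix.
    n = counts.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ sq_dist @ centering
    # The leading eigenvectors, scaled by sqrt(eigenvalue), give the embedding.
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:n_dims]
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))


def procrustes_error(truth, estimate):
    """RMS error after optimally translating, rotating/reflecting and scaling the estimate."""
    a = truth - truth.mean(axis=0)
    b = estimate - estimate.mean(axis=0)
    u, s, vt = np.linalg.svd(b.T @ a)
    rotation = u @ vt
    scaling = s.sum() / (b ** 2).sum()
    aligned = scaling * b @ rotation
    return float(np.sqrt(((a - aligned) ** 2).sum(axis=1).mean()))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_coords = rng.uniform(0.0, 10.0, size=(200, 2))  # toy barcode array
    counts = simulate_diffusion_counts(true_coords)
    estimate = reconstruct_positions(counts)
    print("alignment error:", procrustes_error(true_coords, estimate))
```

Interaction counts carry only relative information, so the recovered layout is defined up to rotation, reflection, and scale. That is why the check aligns the estimate to the true coordinates before measuring error. Real data would be sparse and noisy, which is where a neighbor-graph method such as UMAP becomes attractive.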

Citations: 0

Latent generative modeling of long genetic sequences with GANs

bioRxiv - Genomics

Pub Date : 2024-08-07 DOI: 10.1101/2024.08.07.607012

Antoine Szatkownik, Cyril Furtlehner, Guillaume Charpiat, Burak Yelmen, Flora Jay

Synthetic data generation via generative modeling has recently become a prominent research field in genomics, with applications ranging from functional sequence design to high-quality, privacy-preserving artificial in silico genomes. Following a body of work on Artificial Genomes (AGs) created via various generative models trained with raw genomic input, we propose a conceptually different approach to address the issues of scalability and complexity of genomic data generation in very high dimensions. Our method combines dimensionality reduction, achieved by Principal Component Analysis (PCA), and a Generative Adversarial Network (GAN) learning in this reduced space. Using this framework, we generated genomic proxy datasets for very diverse human populations around the world. We compared the quality of AGs generated by our approach with AGs generated by the established models and report improvements in capturing population structure, linkage disequilibrium, and metrics related to privacy leakage. Furthermore, we developed a frugal model with orders of magnitude fewer parameters and comparable performance to larger models. For quality assessment, we also implemented a new evaluation metric based on information theory to measure local haplotypic diversity, showing that generative models yield higher diversity than real genomes. In addition, we addressed the shrinkage issue associated with PCA and generative modeling, examined its relation to the nearest neighbor resemblance metric, and proposed a resolution. Finally, we evaluated the effect of different binarization methods on the quality of the output AGs.
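The overall shape of such a pipeline (reduce with PCA, generate in the reduced space, map back, binarize) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' model: a multivariate Gaussian fitted to the PC scores stands in for the GAN generator, binarization is a plain 0.5 threshold, and the toy data and all function names are hypothetical.

```python
import numpy as np


def fit_pca(genotypes, n_components):
    """PCA via SVD; returns the per-site mean, principal axes and PC scores."""
    mean = genotypes.mean(axis=0)
    _, _, vt = np.linalg.svd(genotypes - mean, full_matrices=False)
    components = vt[:n_components]
    scores = (genotypes - mean) @ components.T
    return mean, components, scores


def sample_latent(scores, n_samples, rng):
    """Stand-in for a GAN generator: sample from a Gaussian fitted to the PC scores."""
    cov = np.atleast_2d(np.cov(scores, rowvar=False))
    return rng.multivariate_normal(scores.mean(axis=0), cov, size=n_samples)


def decode_and_binarize(latent, mean, components, threshold=0.5):
    """Map latent points back to site space and threshold them to 0/1 alleles."""
    continuous = latent @ components + mean
    return (continuous > threshold).astype(np.int8)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy haplotypes: 100 samples x 40 sites, two groups with different allele frequencies.
    freqs = np.vstack([np.full((50, 40), 0.2), np.full((50, 40), 0.7)])
    real = (rng.random((100, 40)) < freqs).astype(float)

    mean, components, scores = fit_pca(real, n_components=5)
    latent = sample_latent(scores, n_samples=30, rng=rng)
    synthetic = decode_and_binarize(latent, mean, components)
    print(synthetic.shape, synthetic.mean())
```

The generator only ever sees the low-dimensional scores, which is what keeps the parameter count small. The choice of binarization rule in the final step is one of the design decisions the abstract says was evaluated; the fixed threshold here is only the simplest option.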

Citations: 0