A fundamental goal of genetics is to identify which genetic variants are associated with a trait, and how, often using the regression results from genome-wide association (GWA) studies. Important methodological challenges include accounting for inflation in GWA effect estimates and investigating more than one trait simultaneously. We leverage machine learning approaches to address these two challenges, developing a computationally efficient method called ML-MAGES. First, we use neural networks to shrink the inflation in GWA effect sizes caused by nonindependence among variants. We then cluster variant associations among multiple traits via variational inference. We compare the performance of shrinkage via neural networks to regularized regression and fine-mapping, two approaches that address inflated effects but handle variants in focal regions of different sizes. Our neural network shrinkage outperforms both methods in approximating the true effect sizes in simulated data. Our infinite mixture clustering approach offers a flexible, data-driven way to distinguish different types of associations (trait-specific, shared across traits, or nonprioritized) among multiple traits based on their regularized effects. Clustering applied to our neural network shrinkage results also produces consistently higher precision and recall for distinguishing gene-level associations in simulations. We demonstrate the application of ML-MAGES on association analyses of two quantitative traits and two binary traits in the UK Biobank. The associated genes we identify from single-trait enrichment tests overlap with genes involved in biological processes known to be relevant to the traits. Besides trait-specific associations, ML-MAGES identifies several variants with shared multitrait associations, suggesting putative shared genetic architecture.
ML-MAGES enables multivariate genetic association analyses with genes and effect size shrinkage. Xiran Liu, Lorin Crawford, Sohini Ramachandran. Genome Research, published 2025-11-19. DOI: 10.1101/gr.280440.125 (https://doi.org/10.1101/gr.280440.125)
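The inflation that the ML-MAGES abstract refers to can be illustrated with a small numerical sketch. It assumes the common approximation that, for standardized genotypes, noiseless marginal GWA effects equal the linkage disequilibrium (LD) matrix times the true effects. The toy LD structure, the function names, and the ridge-style solve are illustrative assumptions only; the ridge step stands in for the regularized-regression baseline and is not the neural network shrinkage described in the abstract.

```python
import numpy as np

def ld_matrix(n_variants, rho):
    """Toy LD matrix: correlation rho**|i-j| between variants i and j (assumption)."""
    idx = np.arange(n_variants)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def ridge_shrink(marginal, R, lam):
    """Solve (R + lam*I) b = marginal, a ridge-style stand-in for shrinkage."""
    return np.linalg.solve(R + lam * np.eye(len(marginal)), marginal)

# One truly associated variant (index 4) among ten.
true_beta = np.zeros(10)
true_beta[4] = 1.0

R = ld_matrix(10, rho=0.8)
marginal = R @ true_beta                      # neighbors inherit signal through LD
shrunk = ridge_shrink(marginal, R, lam=1e-3)  # most of the inherited signal is removed

print(np.round(marginal, 2))
print(np.round(shrunk, 2))
```

In this toy setting the variants adjacent to the causal one show a marginal effect of 0.8 despite having no true effect, which is the kind of inflation a shrinkage step is meant to remove.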
Robin Kosch, Katharina Limm, Annette M. Staiger, Nadine S. Kurz, Nicole Seifert, Bence Oláh, Stefan Solbrig, Viola Poeschel, Gerhard Held, Marita Ziepert, Norbert Schmitz, Emil Chteinberg, Reiner Siebert, Rainer Spang, Helena U. Zacharias, German Ott, Peter J. Oefner, Michael Altenbuchinger
High-throughput bottom-up proteomics data cover 1,000s of proteins and related co- and post-translational modifications (CTMs/PTMs). Yet, it remains an open question how to holistically explore such data and their relationship to complementary omics/phenotypical information. Graphical models are particularly suited to study molecular networks and underlying regulatory mechanisms, as they can distinguish direct from indirect relationships, aside from their generalizability to diverse data types. We propose PriOmics to integrate proteomics data with complementary omics and phenotypical data. PriOmics models intensities of individual proteotypic peptides and incorporates their protein affiliation as prior knowledge to resolve statistical relationships between proteins and CTMs/PTMs. This was verified in simulation studies, which also demonstrate that PriOmics can disentangle regulatory effects of protein modifications from those of respective protein abundances. These findings were substantiated in a Diffuse Large B-Cell Lymphoma (DLBCL) dataset where we integrated SWATH-MS-based proteomics with transcriptomic and phenotypic data.
Integration of high-throughput proteomic data and complementary omics layers with PriOmics. Robin Kosch, Katharina Limm, Annette M. Staiger, Nadine S. Kurz, Nicole Seifert, Bence Oláh, Stefan Solbrig, Viola Poeschel, Gerhard Held, Marita Ziepert, Norbert Schmitz, Emil Chteinberg, Reiner Siebert, Rainer Spang, Helena U. Zacharias, German Ott, Peter J. Oefner, Michael Altenbuchinger. Genome Research, published 2025-11-18. DOI: 10.1101/gr.279487.124 (https://doi.org/10.1101/gr.279487.124)
Aging compromises intestinal integrity, yet the chromatin changes driving this decline remain unclear. Polycomb-mediated repression is essential for silencing developmental genes, but this regulatory mechanism becomes dysregulated with age. Although shifts in Polycomb regulation within intestinal stem cells have been linked to gut aging, the Polycomb landscape of differentiated cell types remains unexplored. Differentiated cells comprise the majority of the gut epithelium and directly impact both tissue and whole organismal aging. Using single-cell chromatin profiling of the Drosophila intestine, we identify cell type-specific chromatin landscape changes during aging. We find that old enterocytes aberrantly repress genes essential for transmembrane transport and chitin metabolism, contributing to intestinal barrier decline – an example of antagonistic pleiotropy in a regenerative tissue. Barrier decline leads to derepression of JAK/STAT ligands in all cell types and increased proliferation of aging stem cells, with elevated RNA Polymerase II (RNAPII) at S-phase-dependent histone genes. Specific upregulation of histone genes during aging stem cell proliferation resembles RNAPII hypertranscription of histone genes in aggressive human cancers. Our work reveals that misregulation of the Polycomb-mediated H3K27me3 histone modification in differentiated cells during aging not only underlies tissue decline but also mirrors transcriptional changes in cancer, suggesting a common mechanism linking aging and cancer progression.
Polycomb misregulation in enterocytes drives tissue decline in the aging Drosophila intestine. Sarah Leichter, Kami Ahmad, Steve Henikoff. Genome Research, published 2025-11-17. DOI: 10.1101/gr.281058.125 (https://doi.org/10.1101/gr.281058.125)
Spatially resolved transcriptomics (SRT) technologies measure gene expression across thousands of spatial locations within a tissue slice. Multiple SRT technologies are currently available and others are in active development, with each technology having varying spatial resolution (subcellular, single-cell, or multicellular regions), gene coverage (targeted vs. whole-transcriptome), and sequencing depth per location. For example, the widely used 10x Genomics Visium platform measures whole transcriptomes from multiple-cell-sized spots, whereas the 10x Genomics Xenium platform measures a few hundred genes at subcellular resolution. A number of studies apply multiple SRT technologies to slices that originate from the same biological tissue. Integration of data from different SRT technologies can overcome limitations of the individual technologies, enabling the imputation of expression from unmeasured genes in targeted technologies and/or the deconvolution of admixed expression from technologies with lower spatial resolution. Here, we introduce Spatial Integration for Imputation and Deconvolution (SIID), an algorithm to reconstruct a latent spatial gene expression matrix from a pair of observations from different SRT technologies. SIID leverages a spatial alignment and uses a joint nonnegative factorization model to accurately impute missing gene expression and infer gene expression signatures of cell types from admixed SRT data. In simulations involving paired SRT data sets from different technologies (e.g., Xenium and Visium), SIID shows superior performance in reconstructing spot-to-cell-type assignments, recovering cell type–specific gene expression and imputing missing data compared to contemporary tools. When applied to real-world 10x Xenium-Visium pairs from human breast and colon cancer tissues, SIID achieves highest performance in imputing holdout gene expression.
Joint imputation and deconvolution of gene expression across spatial transcriptomics platforms. Hongyu Zheng, Hirak Sarkar, Benjamin J. Raphael. Genome Research, published 2025-11-17. DOI: 10.1101/gr.280555.125 (https://doi.org/10.1101/gr.280555.125)
Siavash Raeisi Dehkordi, Zhaoyang Jia, Joey Estabrook, Jen Hauenstein, Neil Miller, Naz Güleray-Lafci, Jürgen Neesen, Alex Hastie, Alka Chaubey, Andy Wing Chun Pang, Paul Dremsek, Vineet Bafna
The whole-genome karyotype refers to the sequence of large chromosomal segments comprising an individual's genotype. Karyotype analysis, which includes identifying aneuploidies and structural rearrangements, is essential for understanding genetic risk factors, informing diagnosis and treatment, and guiding genetic counseling in constitutional disorders. The current karyotyping standard relies on microscopic chromosome examination, a complex and expertise-dependent process with megabase-scale resolution. Optical genome mapping (OGM) technology offers an efficient approach to detect large-scale genomic lesions. Here, we introduce OMKar, a computational method that generates virtual karyotypes from OGM data. OMKar integrates structural variants (SVs) and copy number (CN) variants into a breakpoint graph representation. It re-estimates CNs using integer linear programming to enforce CN balance and then identifies constrained Eulerian paths corresponding to full chromosome structures. OMKar is evaluated on 38 whole-genome simulations of constitutional disorders, achieving 88% precision and 95% recall for SV concordance and a 95% Jaccard score for CN concordance. We further apply OMKar to 154 clinical samples including 50 prenatal, 41 postnatal, and 63 parental genomes collected across 10 sites. It correctly reconstructs the karyotype in 144 cases, including 25 of 25 aneuploidies, 32 of 32 balanced translocations, and 72 of 82 unbalanced rearrangements. Identified disorders include cri-du-chat, Wolf–Hirschhorn, Prader–Willi, Down, and Turner syndromes. Notably, OMKar uncovers plausible genetic mechanisms in five previously unexplained cases. These results demonstrate the accuracy and utility of OMKar for OGM-based constitutional karyotyping.
OMKar automates genome karyotyping using optical maps to identify constitutional abnormalities. Siavash Raeisi Dehkordi, Zhaoyang Jia, Joey Estabrook, Jen Hauenstein, Neil Miller, Naz Güleray-Lafci, Jürgen Neesen, Alex Hastie, Alka Chaubey, Andy Wing Chun Pang, Paul Dremsek, Vineet Bafna. Genome Research, published 2025-11-14. DOI: 10.1101/gr.280536.125 (https://doi.org/10.1101/gr.280536.125)
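The copy number (CN) balance condition mentioned in the OMKar abstract can be made concrete with a small check. The sketch below assumes a simplified breakpoint graph in which each segment has a tail ("t") and a head ("h") extremity, and in which a non-telomeric extremity is balanced when its segment CN equals the total multiplicity of the adjacencies touching it. The data layout and the name `cn_imbalance` are hypothetical. The code only detects imbalance; it does not perform the integer linear programming re-estimation or the Eulerian path search that the abstract describes.

```python
from collections import defaultdict

def cn_imbalance(segment_cn, adjacencies, telomeres):
    """Return {extremity: segment CN minus adjacency CN} for unbalanced extremities.

    segment_cn:  {segment name: copy number}
    adjacencies: {((seg, end), (seg, end)): multiplicity}
    telomeres:   set of extremities exempt from the balance condition
    """
    incident = defaultdict(int)
    for (u, v), mult in adjacencies.items():
        incident[u] += mult
        incident[v] += mult  # a fold-back adjacency (u == v) therefore counts twice
    imbalance = {}
    for seg, cn in segment_cn.items():
        for end in ("t", "h"):
            node = (seg, end)
            if node in telomeres:
                continue
            diff = cn - incident[node]
            if diff != 0:
                imbalance[node] = diff
    return imbalance

# Toy chromosome A-B-C with a heterozygous deletion of segment B.
adjacencies = {
    (("A", "h"), ("B", "t")): 1,  # reference adjacency on the intact copy
    (("B", "h"), ("C", "t")): 1,  # reference adjacency on the intact copy
    (("A", "h"), ("C", "t")): 1,  # deletion junction on the other copy
}
telomeres = {("A", "t"), ("C", "h")}

balanced = cn_imbalance({"A": 2, "B": 1, "C": 2}, adjacencies, telomeres)
unbalanced = cn_imbalance({"A": 2, "B": 2, "C": 2}, adjacencies, telomeres)
print(balanced)
print(unbalanced)
```

When the CN of segment B is overestimated as 2, both of its extremities are reported as carrying one unexplained copy, which is the kind of inconsistency a CN re-estimation step would need to resolve before chromosome paths can be traced.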
Enzo Battistella, Anant Maheshwari, Barış Ekim, Bonnie Berger, Victoria Popic
Haplotype assembly is the problem of reconstructing the combination of alleles on the maternally and paternally inherited chromosome copies. Individual haplotypes are essential to our understanding of how combinations of different variants impact phenotype. In this work, we focus on read-based haplotype assembly of individual diploid genomes, which reconstructs the two haplotypes directly from read alignments at variant loci. We introduce Ralphi, a novel deep reinforcement learning framework for haplotype assembly, which integrates the representational power of deep learning with reinforcement learning to accurately partition read fragments into their respective haplotype sets. To set the reward objective for reinforcement learning, our approach uses the classic reduction of the problem to the maximum fragment cut formulation on fragment graphs, in which nodes correspond to reads and edge weights capture the conflict or agreement of the reads at shared variant sites. We train Ralphi on a diverse data set of fragment graph topologies derived from genomes in the 1000 Genomes Project. We show that Ralphi achieves lower error rates at comparable or longer haplotype block lengths over the state of the art for short and long reads at varying coverage in standard human genome benchmarks.
Graph-based deep reinforcement learning for haplotype assembly with Ralphi. Enzo Battistella, Anant Maheshwari, Barış Ekim, Bonnie Berger, Victoria Popic. Genome Research, published 2025-11-14. DOI: 10.1101/gr.280569.125 (https://doi.org/10.1101/gr.280569.125)
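The maximum fragment cut formulation named in the Ralphi abstract can be sketched on a toy example. The code below assumes each read is a mapping from variant site index to observed allele (0 or 1) and weights an edge as the number of conflicting shared sites minus the number of agreeing ones; this weighting and all function names are illustrative assumptions. The partition is found by brute-force enumeration, which is feasible only for a handful of reads and is not the reinforcement learning policy that Ralphi trains.

```python
from itertools import combinations, product

def edge_weight(read_a, read_b):
    """Conflicts minus agreements over the variant sites two reads share."""
    shared = read_a.keys() & read_b.keys()
    conflicts = sum(read_a[s] != read_b[s] for s in shared)
    return conflicts - (len(shared) - conflicts)

def fragment_graph(reads):
    """Edges between every pair of reads that overlap at one or more variant sites."""
    return {
        (i, j): edge_weight(reads[i], reads[j])
        for i, j in combinations(range(len(reads)), 2)
        if reads[i].keys() & reads[j].keys()
    }

def cut_value(edges, side):
    """Total weight of edges whose endpoints fall on different sides."""
    return sum(w for (i, j), w in edges.items() if side[i] != side[j])

def best_cut(reads):
    """Exhaustively find the partition with maximum cut value (read 0 fixed on side 0)."""
    edges = fragment_graph(reads)
    candidates = ((0,) + rest for rest in product((0, 1), repeat=len(reads) - 1))
    return max(candidates, key=lambda side: cut_value(edges, side))

# Four error-free reads drawn from haplotypes 0,0,1 and 1,1,0 over sites 0..2.
reads = [
    {0: 0, 1: 0},
    {1: 0, 2: 1},
    {0: 1, 1: 1},
    {1: 1, 2: 0},
]
print(fragment_graph(reads))
print(best_cut(reads))
```

Reads placed on the same side of the cut are interpreted as coming from the same haplotype, so a high cut value means that conflicting reads have been separated and agreeing reads kept together.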
Spatial transcriptomics (ST) has transformed our understanding of tissue architecture and cellular interactions, but integrating ST data across platforms remains challenging due to differences in gene panels, data sparsity, and technical variability. Here, we introduce LLOKI, a novel framework for integrating imaging-based ST data from diverse platforms without requiring shared gene panels. LLOKI addresses ST integration through two key alignment tasks: feature alignment across technologies and batch alignment across data sets. Optimal transport-guided feature propagation adjusts data sparsity to match scRNA-seq references through graph-based imputation, enabling single-cell foundation models such as scGPT to generate unified features. Batch alignment then refines scGPT-transformed embeddings, mitigating batch effects while preserving biological variability. Evaluations on mouse brain samples from five different technologies demonstrate that LLOKI outperforms existing methods and is effective for cross-technology spatial gene program identification and tissue slice alignment. Applying LLOKI to five ovarian cancer data sets, we identify an integrated gene program indicative of tumor-infiltrating T cells across gene panels. Together, these results show that LLOKI provides a robust foundation for cross-platform ST studies, with the potential to scale to large atlas data sets, enabling deeper insights into cellular organization and tissue environments.
Unified integration of spatial transcriptomics across platforms with LLOKI. Ellie Haber, Ajinkya Deshpande, Jian Ma, Spencer Krieger. Genome Research, published 2025-11-13. DOI: 10.1101/gr.280803.125 (https://doi.org/10.1101/gr.280803.125)
Laura Blanco-Berdugo, Alexis Garretson, Beth L Dumont
Approximately 2.6% of live births in the United States are conceived using assisted reproductive technologies (ART). While some ART, including in vitro fertilization (IVF) and intracytoplasmic sperm injection, are known to alter the epigenetic landscape of early embryonic development, their impact on DNA sequence stability is unclear. Here, we leverage the strengths of the laboratory mouse model system to investigate whether a standard ART series (ovarian hyperstimulation, gamete isolation, IVF, embryo culture, and embryo transfer) affects genome stability. Age-matched cohorts of 12 ART-derived and 16 naturally conceived C57BL/6J inbred mice were reared in a controlled setting and whole-genome sequenced to ~50× coverage. Using a rigorous pipeline for de novo single nucleotide variant (dnSNV) discovery, we observe a ~30% (95% CI: 4.5% - 56%) increase in the dnSNV rate in ART compared to naturally conceived mice (P = 0.017). Analysis of the dnSNV mutation spectrum identified signatures attributable to germline DNA repair activity but revealed no differentially enriched signatures between cohorts. We observe no enrichment of dnSNVs in specific genomic contexts, suggesting that the observed rate increase in ART-derived mice is a general genome-wide phenomenon. Together, our findings show that ART is moderately mutagenic in house mice and motivate future work to define the procedure(s) associated with this increased mutational vulnerability. While we caution that our findings cannot be immediately translated to humans, they nonetheless emphasize a pressing need for investigations on the potential mutagenicity of ART in our species.
Modest increase in the de novo single nucleotide mutation rate in house mice born by assisted reproduction. Laura Blanco-Berdugo, Alexis Garretson, Beth L Dumont. Genome Research, published 2025-11-13. DOI: 10.1101/gr.281180.125 (https://doi.org/10.1101/gr.281180.125)
An Wang, Stephanie Hicks, Donald Geman, Laurent Younes
The selection of marker gene panels is critical for capturing the cellular and spatial heterogeneity in the expanding atlases of single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics data. Most current approaches to marker gene selection operate in a label-based framework, which is inherently limited by its dependency on predefined cell type labels or clustering results. In contrast, existing label-free methods often struggle to identify genes that characterize rare cell types or subtle spatial patterns, and they frequently fail to scale efficiently with large data sets. Here, we introduce geneCover, a label-free combinatorial method that selects an optimal panel of minimally redundant marker genes based on gene-gene correlations. Our method demonstrates excellent scalability to large data sets and identifies marker gene panels that capture distinct correlation structures across the transcriptome. This allows geneCover to distinguish cell states in various tissues of living organisms effectively, including those associated with rare or otherwise difficult-to-identify cell types. We evaluate the performance of geneCover across various scRNA-seq and spatial transcriptomics data sets, comparing it to other label-free algorithms to highlight its utility and potential in diverse biological contexts.
Label-free selection of marker genes in single-cell and spatial transcriptomics with geneCover. An Wang, Stephanie Hicks, Donald Geman, Laurent Younes. Genome Research. Published 2025-11-12. DOI: 10.1101/gr.280539.125. https://doi.org/10.1101/gr.280539.125
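The geneCover abstract above describes choosing a small, minimally redundant marker panel from gene-gene correlations. The sketch below is one generic way to picture that idea, not the authors' algorithm: it treats a gene as "covering" every gene whose absolute Pearson correlation with it reaches a threshold, then greedily picks genes until all are covered. The function name `greedy_cover_panel`, the `threshold` parameter, and the greedy set-cover heuristic are all assumptions made for illustration.

```python
import numpy as np


def greedy_cover_panel(expr, threshold=0.5):
    """Pick gene indices so that every gene is correlated with a picked gene.

    expr is a (cells x genes) expression matrix. This is a hypothetical
    greedy set-cover heuristic, not the geneCover implementation.
    """
    # Gene-gene Pearson correlations (columns are genes).
    corr = np.corrcoef(expr, rowvar=False)
    # covers[i, j] is True when gene i is taken to "cover" gene j.
    covers = np.abs(corr) >= threshold
    # Every gene covers itself, which also guarantees termination.
    np.fill_diagonal(covers, True)

    uncovered = np.ones(covers.shape[0], dtype=bool)
    panel = []
    while uncovered.any():
        # Number of still-uncovered genes each candidate would cover.
        gains = (covers & uncovered).sum(axis=1)
        best = int(np.argmax(gains))
        panel.append(best)
        uncovered &= ~covers[best]
    return panel
```

Calling `greedy_cover_panel(expr, threshold=0.5)` on a matrix with two groups of mutually correlated genes would return one representative index per group. A lower threshold lets each marker cover more genes and so yields a smaller panel.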
Natnatee Dokmai, Kaiyuan Zhu, S. Cenk Sahinalp, Hyunghoon Cho
Genotype imputation servers enable researchers with limited resources to extract valuable insights from their data with enhanced accuracy and ease. However, the utility of these services is limited for those with sensitive study cohorts or those in restrictive regulatory environments owing to data privacy concerns. Although privacy-preserving analysis tools have been developed to broaden access to these servers, none of the existing methods support haplotype phasing, a critical component of the imputation workflow. The complexity of phasing algorithms poses a significant challenge in maintaining practical performance under privacy constraints. Here, we introduce TX-Phase, a secure haplotype phasing method based on the framework of trusted execution environments (TEEs). TX-Phase allows users’ private genomic data to be phased while ensuring data confidentiality and integrity of the computation. We introduce novel data-oblivious algorithmic techniques based on compressed reference panels and dynamic fixed-point arithmetic that comprehensively mitigate side-channel leakages in TEEs to provide robust protection of users’ genomic data throughout the analysis. Our experiments on a range of data sets from the UK Biobank and Haplotype Reference Consortium demonstrate the state-of-the-art phasing accuracy and practical runtimes of TX-Phase. Our work enables secure phasing of private genomes, opening access to large reference genomic data sets for a broader scientific community.
Secure phasing of private genomes in a trusted execution environment with TX-Phase. Natnatee Dokmai, Kaiyuan Zhu, S. Cenk Sahinalp, Hyunghoon Cho. Genome Research. Published 2025-11-12. DOI: 10.1101/gr.280558.125. https://doi.org/10.1101/gr.280558.125
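The TX-Phase abstract above mentions data-oblivious techniques and fixed-point arithmetic. The sketch below illustrates those two general ideas only and is not taken from TX-Phase: a lookup that reads every table entry in the same order whatever the secret index is, and a basic fixed-point multiply that uses integers instead of floats. All names (`oblivious_lookup`, `to_fixed`, `fixed_mul`, `from_fixed`, `FRAC_BITS`) are hypothetical. Python itself gives no constant-time guarantees, so this shows only the access pattern, not a secure implementation.

```python
FRAC_BITS = 16  # assumed number of fractional bits, chosen for illustration


def oblivious_lookup(table, secret_index):
    """Return table[secret_index] while touching every entry in order.

    The sequence of reads does not depend on secret_index. Python does not
    guarantee constant-time execution, so this is only a conceptual sketch.
    """
    result = 0
    for i, value in enumerate(table):
        # mask is -1 (all bits set) at the secret position, otherwise 0.
        mask = -int(i == secret_index)
        result |= value & mask
    return result


def to_fixed(x):
    """Encode a real number as an integer with FRAC_BITS fractional bits."""
    return int(round(x * (1 << FRAC_BITS)))


def fixed_mul(a, b):
    """Multiply two fixed-point values and rescale back to FRAC_BITS."""
    return (a * b) >> FRAC_BITS


def from_fixed(a):
    """Decode a fixed-point integer back to a float."""
    return a / (1 << FRAC_BITS)
```

For example, `oblivious_lookup([10, 20, 30, 40], 2)` scans all four entries and returns the third. A real enclave implementation would also need constant-time primitives in a lower-level language.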