Paul Ashford, Alexander M Frankell, Zofia Piszka, Camilla S M Pang, Mahnaz Abbasian, Maise Al Bakir, Mariam Jamal-Hanjani, Nicholas McGranahan, Charles Swanton, Christine A Orengo
Tumors evolve through a process of selection on somatic mutations, driving cell division and tissue growth through aberrations in cell-cycle control. In non-small-cell lung cancer (NSCLC), genome instability occurs early in tumor growth, resulting in pronounced intratumor heterogeneity, including changes in gene copy number, and whole-genome doubling (WGD) in ∼75% of tumors. Gene duplication, genetic drift, and selection mediate functional diversification during evolution. In this study, we seek to identify the diversification and potential gene neofunctionalization of lung tumors in the TRACERx cohort. We develop a novel computational protocol to identify preduplication and postduplication mutations predicted to affect protein function. Mutations are analyzed using paralogs grouped into functional families with highly similar functions, identifying 355 functional impact events (FIEs) through their proximity to, and clustering near, functional sites. The use of functional family paralogs to map mutations to protein structures from the PDB helps predict putative rare driver events in lung tumors. By extending the analysis with high-quality structural models from AlphaFold using The Encyclopedia of Domains (TED), we find a significant increase in the diversity of both genes and functional families with postduplication FIEs in lung adenocarcinomas, including some metabolic enzymes with the potential to be neofunctional. The postduplication diversification of driver genes and functions may indicate selection for somatic copy number changes in lung tumors and an increased scope for tumor adaptations.
{"title":"Gene duplication is associated with gene diversification and potential neofunctionalization in lung cancer evolution.","doi":"10.1101/gr.278663.123","journal":{"name":"Genome research","pages":"561-577"},"publicationDate":"2026-03-02","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12951968/pdf/"}
Domitille Chalopin, Carine Rey, Jeremy Ganofsky, Juliana Blin, Pascale Chevret, Marion Mouginot, Laurent Guéguen, Bastien Boussau, Sophie Pantalacci, Marie Sémon
Species adapting to a similar lifestyle may undergo convergent changes in organ structure and cellular function, which may or may not rely on convergent genetic changes. The extent of genomic convergence is thus debated and may further depend on the interplay between temporal factors, such as species relatedness or the age of the transition. Rodents have repeatedly adapted to life in arid conditions, notably with altered renal morphology and physiology. By analyzing kidney transcriptomes from 33 species, we find convergence at all examined biological levels, from the whole kidney transcriptome down to the coding sequences and expression level of individual genes. Transcriptome-level signatures reflect convergent changes in cell proportions, suggesting convergent structural adaptations of the kidney. A large proportion of genes show convergent substitutions, but these occurred in small subsets of species, showing that there are multiple genetic paths repeatedly taken in a mosaic manner. A similar mosaic signal of convergence is found when comparing gene expression in species spanning the Rodentia order, but convergence is more widely shared at the lower level of the Murinae family. Therefore, we test more directly the influence of temporal factors. We observe more convergent changes when we select species that adapted independently from more closely related, rather than more distantly related, ancestors and when we select older transitions rather than recent transitions. Our study shows that there are many different, yet repeatedly selected, ways to adapt to aridity and that the degree of convergent evolution increases with both the age of the transitions and species relatedness.
{"title":"Degrees of convergent evolution in rodent adaptations to arid environments.","doi":"10.1101/gr.280089.124","journal":{"name":"Genome research","pages":"472-486"},"publicationDate":"2026-03-02","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12951959/pdf/"}
Aayush Grover, Till Muser, Liine Kasak, Lin Zhang, Ekaterina Krymova, Valentina Boeva
Fine-grained prediction of chromatin accessibility from DNA sequence is a foundational step in modeling gene expression changes resulting from sequence variants. Yet, few methods operate at the resolution necessary to capture subtle effects of single-nucleotide changes. Furthermore, it remains unclear which architectural components, such as residual connections, normalization strategies, or attention mechanisms, drive performance in these high-resolution predictions. To address these knowledge gaps, we systematically evaluate classic architectural choices and introduce ConvNeXt V2 blocks, originally developed for computer vision, as high-resolution feature extractors in deep learning models for genomic data. Integrated into diverse architectures such as convolutional neural networks (CNNs), long short-term memory (LSTM) networks, dilated CNNs, and transformers, ConvNeXt V2 blocks consistently improve performance, leading to similar prediction accuracy across these different model types. This reveals that early feature extraction, rather than downstream architecture, is the primary determinant of prediction accuracy. A comprehensive evaluation of these models on ATAC-seq signal prediction at 4-bp resolution in a cell type-specific manner identifies the ConvNeXt-based dilated CNN as the most robust performer, better preserving the signal's shape. Our codebase and benchmarks provide practical tools for high-resolution chromatin modeling.
{"title":"Early feature extraction drives model performance in high-resolution chromatin accessibility prediction.","doi":"10.1101/gr.281042.125","journal":{"name":"Genome research","pages":"619-629"},"publicationDate":"2026-03-02","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12951969/pdf/"}
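The abstract above credits ConvNeXt V2 blocks with the gain in early feature extraction. The component that distinguishes ConvNeXt V2 from its predecessor is Global Response Normalization (GRN), and the sketch below shows that one operation on a one-dimensional, channels-by-positions feature map. It is an illustration only: the function name `grn_1d`, the plain-list data layout, and the translation from images to sequence positions are assumptions made here, not details taken from the authors' codebase.

```python
import math

def grn_1d(x, gamma, beta, eps=1e-6):
    """Global Response Normalization on a channels x positions feature map.

    x is a list of channels, each a list of activations along the sequence.
    gamma and beta are per-channel learnable parameters (hypothetical values here).
    """
    # Global aggregation: L2 norm of each channel over all sequence positions.
    norms = [math.sqrt(sum(v * v for v in channel)) for channel in x]
    # Divisive normalization: each channel's norm relative to the mean norm.
    mean_norm = sum(norms) / len(norms)
    scale = [n / (mean_norm + eps) for n in norms]
    # Calibrate the input and keep a residual path (the trailing "+ v").
    return [
        [g * (v * s) + b + v for v in channel]
        for channel, s, g, b in zip(x, scale, gamma, beta)
    ]

# Toy input with two channels and two positions (made-up numbers).
features = [[3.0, 4.0], [0.0, 0.0]]
untrained = grn_1d(features, gamma=[0.0, 0.0], beta=[0.0, 0.0])
calibrated = grn_1d(features, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With `gamma` and `beta` at zero, which is how these parameters are initialized in ConvNeXt V2, the layer is an identity map; training then learns how strongly channels with large global responses are amplified relative to the others.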
Kevin S O'Leary, Meng-Yen Li, Kevyn Jackson, Lijie Shi, Elena Ezhkova, Bernice E Morrow, Deyou Zheng
The XIST RNA is known for its critical roles in X Chromosome inactivation (XCI). It is thought to be expressed exclusively from one copy of the X Chromosome and silence it by recruiting various chromatin factors in female cells. In this study, we find XIST expression in male peripheral glia after integrated analyses of single-cell RNA-seq data from multiple human tissues and organs. Single-cell epigenomic data further indicate that the expression is likely driven by an alternative promoter at the end of the first exon, resulting in at least one shorter transcript (referred to as sXIST) that is active in Schwann cells and, moreover, at a higher level in nonmyelinating Schwann cells. This promoter exhibits similar activity in female glia. Multiple lines of evidence from bulk transcriptomic and epigenomic data from peripheral nerve tissues further support these findings. Genes coexpressed positively and strongly with sXIST in male glia show functional enrichment in axon assembly and cilia signaling, with many of them sharing putative miRNA binding sites with sXIST, whereas the negatively correlated genes are enriched for processes important for neuromuscular junctions. This suggests possible functions of sXIST in modulating glia-neuron interactions, perhaps via competitive miRNA binding. This idea is also supported by overexpression analysis of a partial sXIST sequence and the finding of significant XIST expression changes in human cardiomyopathy and polyneuropathy. In summary, the current study suggests a novel, non-XCI role of XIST in peripheral Schwann cells that is mediated by a newly recognized transcript.
{"title":"Transcription and potential functions of a novel XIST isoform in male peripheral glia.","doi":"10.1101/gr.280832.125","journal":{"name":"Genome research","pages":"257-274"},"publicationDate":"2026-02-03","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12863056/pdf/"}
Meilong Shi, Chuanqi Teng, Shan Zhang, Xiaobo He, Lingyun Xu, Fengxian Han, Rongqi Wen, Ganjun Yu, Jingwen Liu, Yang Feng, Yanfeng Wu, Yan Ren, Gang Jin, Jing Li
Eukaryotic genomes contain numerous transposable elements (TEs), whose dysregulation threatens genome stability and may contribute to cancer. Pancreatic adenocarcinoma (PAAD) is among the deadliest cancers, marked by abundant stroma that obscures tumor-specific molecular signals, complicating bulk-tissue analyses. Here, using 71 patient-derived PAAD organoids, we show that TE activities may promote tumorigenesis and provide a source of novel immunotherapeutic targets. We identify 16 new TE-derived transcripts fused with 15 known oncogenes, exhibiting potential oncogenic function and prognostic value. Notably, LTR7-PLAAT4, present in 29% of tumors, encodes a protein variant transcriptionally regulated by FOXM1 binding to the LTR7 promoter. LTR7-PLAAT4 isoform 2 is associated with increased cholesterol ester accumulation and lipid droplet formation mediated through BSCL2 coexpression, potentially fostering tumor progression. On the immunogenic front, HLA-I immunopeptidomics of AsPC-1 cells and DAC13 organoids each identify over 11,000 peptides. Although mutation-derived neoantigens are rare, several peptides originate from TE-chimeric transcripts, including four predicted by TEprof2. The peptide FLIQHLPLV, detected in 27% of organoids, exhibits robust immunogenicity, validated by T2 binding, mass spectrometry, and ELISPOT assays with HLA-genotyped PBMCs. Together, these findings suggest that TE activities may contribute to PAAD progression and diversify its immunopeptidome, providing new opportunities for molecular subtyping and potential immunotherapeutic intervention.
{"title":"Transcriptomic landscape of transposable elements reveals LTR7-PLAAT4 as a potential oncogene and therapeutic target in pancreatic adenocarcinoma.","doi":"10.1101/gr.280528.125","journal":{"name":"Genome research","pages":"275-290"},"publicationDate":"2026-02-03","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12863058/pdf/"}
Chen Ye, Xiaoyan Li, Na Cheng, Yansen Su, Junfeng Xia
Synonymous single-nucleotide variants (sSNVs) are increasingly recognized as contributors to disease, yet existing variant annotation databases offer limited functional insights for sSNVs. Here, we present SynMall, a comprehensive resource designed to decipher the functional impact of synonymous variation. SynMall catalogs 25 million potential human sSNVs and integrates evolutionary and population information of sSNVs from 45 non-human species. For each human sSNV, SynMall provides multilevel annotations that combine American College of Medical Genetics and Genomics (ACMG)-aligned variant interpretation information, such as allele frequencies and functional effects, with more than 100 descriptors at the DNA, RNA, and protein levels. These include both handcrafted features and embeddings from large language models to support advanced representation learning. To prioritize pathogenic sSNVs, we have developed SynScore, a machine learning framework that integrates ACMG guidelines and diverse biological characteristics. Benchmark comparisons show that SynScore achieves state-of-the-art performance, validating its effectiveness for genome-wide pathogenicity inference. Furthermore, SynMall enables mechanistic exploration by drawing on in silico assessments and curated literature evidence to evaluate sSNV effects on miRNA-mRNA interactions, mRNA splicing, mRNA stability, and codon usage. By consolidating these features into a unified platform, we anticipate that SynMall will serve as a valuable resource for elucidating the functional role of synonymous mutations.
{"title":"The SynMall resource for characterizing the functional impact of synonymous variation.","doi":"10.1101/gr.281257.125","journal":{"name":"Genome research","pages":"421-431"},"publicationDate":"2026-02-03","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12863188/pdf/"}
Samin Rahman Khan, M Saifur Rahman, M Sohel Rahman, Md Abul Hassan Samee
The surge in single-cell data sets and reference atlases has enabled the comparison of cell states across conditions, yet a gap persists in quantifying pathological shifts from healthy cell states. To address this gap, we introduce single-cell Pathological Shift Scoring (scPSS), which provides a statistical measure of how far a "query" cell from a diseased sample has shifted away from a reference group of healthy cells. In scPSS, the distance from a cell to its k-th nearest reference cell is taken as its pathological shift score. Distances between cells are Euclidean distances in the space of the top n principal components of gene expression. The distribution of shift scores of the reference cells forms a null model. This allows a P-value to be assigned to each query cell's shift score, quantifying how significantly the cell departs from the reference cell group. This makes our method both simple and statistically rigorous. The key strength of scPSS is its applicability in a "semisupervised" setting, where only healthy reference cells are known and diseased-labeled data are not provided for model training. As existing methods do not support cell-level pathological progression measurement in this setting, we adapt state-of-the-art supervised pathological prediction and contrastive models for benchmarking. Comparative evaluations against these adapted models demonstrate our method's superiority in accuracy and efficiency. Additionally, we show that the aggregation of cell-level pathological scores from scPSS can be used to predict health conditions at the individual level.
{"title":"Quantifying pathological progression from single-cell transcriptomic data with scPSS.","doi":"10.1101/gr.280411.125","journal":{"name":"Genome research","pages":"375-386"},"publicationDate":"2026-02-03","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12863187/pdf/"}
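The scoring scheme in the scPSS abstract above (distance to the k-th nearest reference cell, with reference-cell scores as the null) is compact enough to sketch. The version below is a minimal illustration, not the scPSS package: it assumes cells have already been projected onto the top principal components, the function names are hypothetical, and the right-tailed empirical P-value with a +1 correction is one plausible reading of "a P-value assigned from the null", chosen here as an assumption.

```python
import math

def kth_nearest_distance(point, reference, k, skip_self=False):
    """Euclidean distance from `point` to its k-th nearest cell in `reference`."""
    dists = sorted(math.dist(point, ref) for ref in reference)
    # When the point is itself a reference cell, drop its zero self-distance.
    if skip_self:
        dists = dists[1:]
    return dists[k - 1]

def shift_scores_and_pvalues(reference, query, k=2):
    """Return (shift score, empirical P-value) for every query cell."""
    # Null model: each reference cell scored against the other reference cells.
    null = [kth_nearest_distance(r, reference, k, skip_self=True) for r in reference]
    results = []
    for q in query:
        score = kth_nearest_distance(q, reference, k)
        # Fraction of null scores at least as large, with a +1 correction
        # (an assumption made for this sketch).
        p = (1 + sum(s >= score for s in null)) / (1 + len(null))
        results.append((score, p))
    return results

# Toy coordinates standing in for cells in principal component space.
healthy = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
queries = [(0.5, 0.4), (5.0, 5.0)]
scored = shift_scores_and_pvalues(healthy, queries, k=2)
```

In this toy example the first query cell sits inside the healthy cluster and the second lies far from it, so the second receives the larger shift score and the smaller P-value. With only five reference cells the smallest attainable P-value is 1/6, which shows why a large reference set is needed for a usable null distribution.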
Maureen Pittman, Kihyun Lee, Franco Felix, Yu Huang, Adrienne Lam, Mauro W Costa, Deepak Srivastava, Katherine S Pollard
Exome sequencing of thousands of families has revealed many risk genes for congenital heart defects (CHDs), yet most cases cannot be explained by a single causal mutation. Even within the same family, individuals carrying a particular mutation in a known risk gene often demonstrate variable phenotypes, suggesting the presence of genetic modifiers. To explore oligogenic causes of CHD without assessing billions of variant combinations, we develop an efficient, simulation-based method to detect gene sets that carry co-occurring damaging variants in probands at a higher rate than expected given parental genotypes. We implement this approach in software called Gene Combinations in Oligogenic Disease (GCOD) and apply it to a cohort of 3377 CHD trios with exome sequencing. This analysis detects 160 gene pairs in which damaging variants are transmitted with higher-than-expected frequency to CHD probands but rarely or never appear in combination in their unaffected parents. Stratifying by specific phenotypes and considering gene combinations of higher orders yields an additional 6026 gene sets. Genes found in oligogenic sets are overrepresented in pathways related to heart development and often co-occur in sets of cell type marker genes from single-cell expression data. Compound heterozygosity of the newly identified digenic pair Gata6-Por leads to higher CHD incidence in mice compared with single hemizygotes, validating predicted genetic interactions. As genome sequencing is applied to more families and other disorders, GCOD will enable detection of increasingly large, novel gene combinations, shedding light on combinatorial causes of genetic diseases.
{"title":"The oligogenic inheritance test GCOD detects risk genes and their interactions in congenital heart defects.","authors":"Maureen Pittman, Kihyun Lee, Franco Felix, Yu Huang, Adrienne Lam, Mauro W Costa, Deepak Srivastava, Katherine S Pollard","doi":"10.1101/gr.281141.125","DOIUrl":"10.1101/gr.281141.125","url":null,"abstract":"<p><p>Exome sequencing of thousands of families has revealed many risk genes for congenital heart defects (CHDs), yet most cases cannot be explained by a single causal mutation. Even within the same family, individuals carrying a particular mutation in a known risk gene often demonstrate variable phenotypes, suggesting the presence of genetic modifiers. To explore oligogenic causes of CHD without assessing billions of variant combinations, we develop an efficient, simulation-based method to detect gene sets that carry co-occurring damaging variants in probands at a higher rate than expected given parental genotypes. We implement this approach in software called Gene Combinations in Oligogenic Disease (GCOD) and apply it to a cohort of 3377 CHD trios with exome sequencing. This analysis detects 160 gene pairs in which damaging variants are transmitted with higher-than-expected frequency to CHD probands but rarely or never appear in combination in their unaffected parents. Stratifying by specific phenotypes and considering gene combinations of higher orders yields an additional 6026 gene sets. Genes found in oligogenic sets are overrepresented in pathways related to heart development and often co-occur in sets of cell type marker genes from single-cell expression data. Compound heterozygosity of the newly identified digenic pair <i>Gata6-Por</i> leads to higher CHD incidence in mice compared with single hemizygotes, validating predicted genetic interactions. 
As genome sequencing is applied to more families and other disorders, GCOD will enable detection of increasingly large, novel gene combinations, shedding light on combinatorial causes of genetic diseases.</p>","PeriodicalId":12678,"journal":{"name":"Genome research","volume":" ","pages":"330-347"},"PeriodicalIF":5.5,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12863180/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145742090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Marcel Tarbier, Sebastian D Mackowiak, Vaishnovi Sekar, Franziska Bonath, Etka Yapar, Bastian Fromm, Omid R Faridani, Inna Biryukova, Marc R Friedländer
microRNAs are small RNA molecules that can repress the expression of protein-coding genes post-transcriptionally. Previous studies have shown that microRNAs can also have alternative functions, including influencing target expression variation and covariation, but these observations have been limited to a few microRNAs. Here we systematically study microRNA alternative functions in mouse embryonic stem cells (mESCs) by genetically deleting Drosha, leading to global loss of microRNAs. We apply complementary single-cell RNA-seq methods to study the variation of the targets and the microRNAs themselves, and transcriptional inhibition to measure target half-lives. We find that microRNAs form four distinct coexpression groups across single cells. In particular, the mir-290 and the mir-182 genome clusters are abundantly, variably, and inversely expressed. Some cells have global biases toward specific miRNAs originating from either end of the hairpin precursor, suggesting the presence of unknown regulatory cofactors. We find that microRNAs generally increase variation and covariation of their targets at the RNA level, but we also find microRNAs such as miR-182 that appear to have opposite functions. In particular, microRNAs that are themselves variable in expression, such as miR-291a, are more likely to induce covariations. In summary, we apply genetic perturbation and multiomics to give the first global picture of microRNA dynamics at the single-cell level.
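To make "variation" and "covariation" concrete, the sketch below computes a coefficient of variation for one target and a Pearson correlation between two targets across single cells. These are common choices, assumed here for illustration; the abstract does not state which statistics the authors use, and the expression values are invented.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (NaN if mean is 0)."""
    m = mean(values)
    return pstdev(values) / m if m else float("nan")

def pearson(x, y):
    """Pearson correlation of two equally long expression vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else float("nan")

# Invented per-cell expression of two hypothetical targets in two conditions.
wild_type = {"target1": [5.0, 9.0, 2.0, 8.0], "target2": [4.0, 10.0, 3.0, 7.0]}
knockout = {"target1": [6.0, 6.5, 5.5, 6.0], "target2": [6.0, 5.5, 6.5, 6.0]}

for label, cells in (("wild type", wild_type), ("knockout", knockout)):
    cv = coefficient_of_variation(cells["target1"])
    r = pearson(cells["target1"], cells["target2"])
    print(label, round(cv, 3), round(r, 3))
```

Comparing the two statistics between cells with and without microRNAs is one simple way to ask whether microRNAs raise or lower target variation and covariation.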
{"title":"Landscape of microRNA and target expression variation and covariation in single mouse embryonic stem cells.","authors":"Marcel Tarbier, Sebastian D Mackowiak, Vaishnovi Sekar, Franziska Bonath, Etka Yapar, Bastian Fromm, Omid R Faridani, Inna Biryukova, Marc R Friedländer","doi":"10.1101/gr.279914.124","DOIUrl":"10.1101/gr.279914.124","url":null,"abstract":"<p><p>microRNAs are small RNA molecules that can repress the expression of protein-coding genes post-transcriptionally. Previous studies have shown that microRNAs can also have alternative functions, including influencing target expression variation and covariation, but these observations have been limited to a few microRNAs. Here we systematically study microRNA alternative functions in mouse embryonic stem cells (mESCs) by genetically deleting <i>Drosha</i>, leading to global loss of microRNAs. We apply complementary single-cell RNA-seq methods to study the variation of the targets and the microRNAs themselves, and transcriptional inhibition to measure target half-lives. We find that microRNAs form four distinct coexpression groups across single cells. In particular, the <i>mir-290</i> and the <i>mir-182</i> genome clusters are abundantly, variably, and inversely expressed. Some cells have global biases toward specific miRNAs originating from either end of the hairpin precursor, suggesting the presence of unknown regulatory cofactors. We find that microRNAs generally increase variation and covariation of their targets at the RNA level, but we also find microRNAs such as miR-182 that appear to have opposite functions. In particular, microRNAs that are themselves variable in expression, such as miR-291a, are more likely to induce covariations. 
In summary, we apply genetic perturbation and multiomics to give the first global picture of microRNA dynamics at the single-cell level.</p>","PeriodicalId":12678,"journal":{"name":"Genome research","volume":" ","pages":"291-302"},"PeriodicalIF":5.5,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12863184/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145959287","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wenhai Zhang, Yuansheng Liu, Guangyi Li, Jialu Xu, Enlian Chen, Alexander Schönhuth, Xiao Luo
Microbes are omnipresent, thriving in a range of habitats, from oceans to soils, and even within our gastrointestinal tracts. They play a vital role in maintaining ecological equilibrium and promoting the health of their hosts. Consequently, understanding strain-level diversity in microbial communities is crucial, as variations between strains can lead to different phenotypic expressions or diverse biological functions. However, current methods for taxonomic classification from metagenomic sequencing data have several limitations: they resolve taxa only to the species level, support either short or long reads but not both, or are confined to a single given species. Most notably, the majority of existing strain-level taxonomic classifiers represent their references as multiple linear genomes, which fails to capture the sequence correlations among these genomes and can introduce ambiguity and bias into metagenomic profiling. Here, we present PanTax, a pangenome graph-based taxonomic profiler that overcomes the shortcomings of sequence-based approaches, because pangenome graphs can depict the full range of genetic variability present across multiple evolutionarily or environmentally related genomes. PanTax provides a comprehensive solution to taxonomic classification, offering strain resolution, compatibility with both short and long reads, and support for single or multiple species. Extensive benchmarking results demonstrate that PanTax drastically outperforms state-of-the-art approaches, primarily evidenced by its significantly higher F1 score at the strain level, while maintaining comparable or better performance in other aspects across various data sets.
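For readers unfamiliar with the metric, a strain-level F1 score can be computed from the set of strains a profiler reports and the set truly present in the sample. The sketch below uses the standard presence/absence definition; the exact benchmarking protocol used for PanTax is not given in the abstract, and the strain identifiers and function name are hypothetical.

```python
def strain_f1(predicted, truth):
    """Harmonic mean of precision and recall over strain identifiers."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: three strains reported, four truly present,
# two of them shared.
reported = ["strain_1", "strain_2", "strain_3"]
present = ["strain_1", "strain_2", "strain_4", "strain_5"]
print(strain_f1(reported, present))
```

Because F1 penalizes both missed strains and spurious calls, it is a stricter summary than precision or recall alone when closely related strains are easily confused.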
{"title":"Strain-level metagenomic profiling using pangenome graphs with PanTax.","authors":"Wenhai Zhang, Yuansheng Liu, Guangyi Li, Jialu Xu, Enlian Chen, Alexander Schönhuth, Xiao Luo","doi":"10.1101/gr.280858.125","DOIUrl":"10.1101/gr.280858.125","url":null,"abstract":"<p><p>Microbes are omnipresent, thriving in a range of habitats, from oceans to soils, and even within our gastrointestinal tracts. They play a vital role in maintaining ecological equilibrium and promoting the health of their hosts. Consequently, understanding the diversity in terms of strains in microbial communities is crucial, as variations between strains can lead to different phenotypic expressions or diverse biological functions. However, current methods for taxonomic classification from metagenomic sequencing data have several limitations, including their reliance solely on species resolution, support for either short or long reads, or their confinement to a given single species. Most notably, most existing strain-level taxonomic classifiers rely on the sequence representation of multiple linear reference genomes, which fails to capture the sequence correlations among these genomes, potentially introducing ambiguity and biases in metagenomic profiling. Here, we present PanTax, a pangenome graph-based taxonomic profiler that overcomes the shortcomings of sequence-based approaches, because pangenome graphs possess the capability to depict the full range of genetic variability present across multiple evolutionarily or environmentally related genomes. PanTax provides a comprehensive solution to taxonomic classification for strain resolution, compatibility with both short and long reads, and compatibility with single or multiple species. 
Extensive benchmarking results demonstrate that PanTax drastically outperforms state-of-the-art approaches, primarily evidenced by its significantly higher F1 score at the strain level, while maintaining comparable or better performance in other aspects across various data sets.</p>","PeriodicalId":12678,"journal":{"name":"Genome research","volume":" ","pages":"405-420"},"PeriodicalIF":5.5,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12863173/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145984796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}