Meilong Shi, Chuanqi Teng, Shan Zhang, Xiaobo He, Lingyun Xu, Fengxian Han, Rongqi Wen, Ganjun Yu, Jingwen Liu, Yang Feng, Yanfeng Wu, Yan Ren, Gang Jin, Jing Li
Eukaryotic genomes contain numerous transposable elements (TEs), whose dysregulation threatens genome stability and may contribute to cancer. Pancreatic adenocarcinoma (PAAD) is among the deadliest cancers, marked by abundant stroma that obscures tumor-specific molecular signals, complicating bulk-tissue analyses. Here, using 71 patient-derived PAAD organoids, we show that TE activities may promote tumorigenesis and provide a source of novel immunotherapeutic targets. We identify 16 new TE-derived transcripts fused with 15 known oncogenes, exhibiting potential oncogenic function and prognostic value. Notably, LTR7-PLAAT4, present in 29% of tumors, encodes a protein variant transcriptionally regulated by FOXM1 binding to the LTR7 promoter. LTR7-PLAAT4 isoform 2 is associated with increased cholesterol ester accumulation and lipid droplet formation mediated through BSCL2 coexpression, potentially fostering tumor progression. On the immunogenic front, HLA-I immunopeptidomics of AsPC-1 cells and DAC13 organoids each identify over 11,000 peptides. Although mutation-derived neoantigens are rare, several peptides originate from TE-chimeric transcripts, including four predicted by TEprof2. The peptide FLIQHLPLV, detected in 27% of organoids, exhibits robust immunogenicity, validated by T2 binding, mass spectrometry and ELISPOT assays with HLA-genotyped PBMCs. Together, these findings suggest that TE activities may contribute to PAAD progression and diversify its immunopeptidome, providing new opportunities for molecular subtyping and potential immunotherapeutic intervention.
{"title":"Transcriptomic landscape of transposable elements reveals <i>LTR7</i>-<i>PLAAT4</i> as a potential oncogene and therapeutic target in pancreatic adenocarcinoma.","authors":"Meilong Shi, Chuanqi Teng, Shan Zhang, Xiaobo He, Lingyun Xu, Fengxian Han, Rongqi Wen, Ganjun Yu, Jingwen Liu, Yang Feng, Yanfeng Wu, Yan Ren, Gang Jin, Jing Li","doi":"10.1101/gr.280528.125","DOIUrl":"10.1101/gr.280528.125","url":null,"abstract":"<p><p>Eukaryotic genomes contain numerous transposable elements (TEs), whose dysregulation threatens genome stability and may contribute to cancer. Pancreatic adenocarcinoma (PAAD) is among the deadliest cancers, marked by abundant stroma that obscures tumor-specific molecular signals, complicating bulk-tissue analyses. Here, using 71 patient-derived PAAD organoids, we show that TE activities may promote tumorigenesis and provide a source of novel immunotherapeutic targets. We identify 16 new TE-derived transcripts fused with 15 known oncogenes, exhibiting potential oncogenic function and prognostic value. Notably, <i>LTR7</i>-<i>PLAAT4</i>, present in 29% of tumors, encodes a protein variant transcriptionally regulated by <i>FOXM1</i> binding to the <i>LTR7</i> promoter. <i>LTR7</i>-<i>PLAAT4</i> isoform 2 is associated with increased cholesterol ester accumulation and lipid droplet formation mediated through <i>BSCL2</i> coexpression, potentially fostering tumor progression. On the immunogenic front, HLA-I immunopeptidomics of AsPC-1 cells and DAC13 organoids each identify over 11,000 peptides. Although mutation-derived neoantigens are rare, several peptides originate from TE-chimeric transcripts, including four predicted by TEprof2. The peptide FLIQHLPLV, detected in 27% of organoids, exhibits robust immunogenicity, validated by T2 binding, mass spectrometry and ELISPOT assays with HLA-genotyped PBMCs. 
Together, these findings suggest that TE activities may contribute to PAAD progression and diversify its immunopeptidome, providing new opportunities for molecular subtyping and potential immunotherapeutic intervention.</p>","PeriodicalId":12678,"journal":{"name":"Genome research","volume":" ","pages":"275-290"},"PeriodicalIF":5.5,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12863058/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145932905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kevin S O'Leary, Meng-Yen Li, Kevyn Jackson, Lijie Shi, Elena Ezhkova, Bernice E Morrow, Deyou Zheng
The XIST RNA is known for its critical roles in X Chromosome inactivation (XCI). It is thought to be expressed exclusively from one copy of the X Chromosome and silence it by recruiting various chromatin factors in female cells. In this study, we find XIST expression in male peripheral glia after integrated analyses of single-cell RNA-seq data from multiple human tissues and organs. Single-cell epigenomic data further indicate that the expression is likely driven by an alternative promoter at the end of the first exon, resulting in at least one shorter transcript (referred to as sXIST) that is active in Schwann cells and, moreover, at a higher level in nonmyelinating Schwann cells. This promoter exhibits similar activity in female glia. Multiple lines of evidence from bulk transcriptomic and epigenomic data from peripheral nerve tissues further support these findings. Genes coexpressed positively and strongly with sXIST in male glia show functional enrichment in axon assembly and cilia signaling, with many of them sharing putative miRNA binding sites with sXIST, whereas the negatively correlated genes are enriched for processes important for neuromuscular junctions. This suggests possible functions of sXIST in modulating glia-neuron interactions, perhaps via competitive miRNA binding. This idea is also supported by overexpression analysis of a partial sXIST sequence and the finding of significant XIST expression changes in human cardiomyopathy and polyneuropathy. In summary, the current study suggests a novel, non-XCI role of XIST in peripheral Schwann cells that is mediated by a newly recognized transcript.
{"title":"Transcription and potential functions of a novel <i>XIST</i> isoform in male peripheral glia.","authors":"Kevin S O'Leary, Meng-Yen Li, Kevyn Jackson, Lijie Shi, Elena Ezhkova, Bernice E Morrow, Deyou Zheng","doi":"10.1101/gr.280832.125","DOIUrl":"10.1101/gr.280832.125","url":null,"abstract":"<p><p>The <i>XIST</i> RNA is known for its critical roles in X Chromosome inactivation (XCI). It is thought to be expressed exclusively from one copy of the X Chromosome and silence it by recruiting various chromatin factors in female cells. In this study, we find <i>XIST</i> expression in male peripheral glia after integrated analyses of single-cell RNA-seq data from multiple human tissues and organs. Single-cell epigenomic data further indicate that the expression is likely driven by an alternative promoter at the end of the first exon, resulting in at least one shorter transcript (referred to as <i>sXIST</i>) that is active in Schwann cells and, moreover, at a higher level in nonmyelinating Schwann cells. This promoter exhibits similar activity in female glia. Multiple lines of evidence from bulk transcriptomic and epigenomic data from peripheral nerve tissues further support these findings. Genes coexpressed positively and strongly with <i>sXIST</i> in male glia show functional enrichment in axon assembly and cilia signaling, with many of them sharing putative miRNA binding sites with <i>sXIST</i>, whereas the negatively correlated genes are enriched for processes important for neuromuscular junctions. This suggests possible functions of <i>sXIST</i> in modulating glia-neuron interactions, perhaps via competitive miRNA binding. This idea is also supported by overexpression analysis of a partial <i>sXIST</i> sequence and the finding of significant <i>XIST</i> expression changes in human cardiomyopathy and polyneuropathy. 
In summary, the current study suggests a novel, non-XCI role of <i>XIST</i> in peripheral Schwann cells that is mediated by a newly recognized transcript.</p>","PeriodicalId":12678,"journal":{"name":"Genome research","volume":" ","pages":"257-274"},"PeriodicalIF":5.5,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12863056/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145742207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chen Ye, Xiaoyan Li, Na Cheng, Yansen Su, Junfeng Xia
Synonymous single-nucleotide variants (sSNVs) are increasingly recognized as contributors to disease, yet existing variant annotation databases offer limited functional insights for sSNVs. Here, we present SynMall, a comprehensive resource designed to decipher the functional impact of synonymous variation. SynMall catalogs 25 million potential human sSNVs and integrates evolutionary and population information of sSNVs from 45 non-human species. For each human sSNV, SynMall provides multilevel annotations that combine American College of Medical Genetics and Genomics (ACMG)-aligned variant interpretation information, such as allele frequencies and functional effects, with more than 100 descriptors at the DNA, RNA, and protein levels. These include both handcrafted features and embeddings from large language models to support advanced representation learning. To prioritize pathogenic sSNVs, we have developed SynScore, a machine learning framework that integrates ACMG guidelines and diverse biological characteristics. Benchmark comparisons show that SynScore achieves state-of-the-art performance, validating its effectiveness for genome-wide pathogenicity inference. Furthermore, SynMall enables mechanistic exploration by investigating in silico assessments and curated literature evidence to evaluate sSNV effects on miRNA-mRNA interactions, mRNA splicing, mRNA stability, and codon usage. By consolidating these features into a unified platform, we anticipate that SynMall will serve as a valuable resource for elucidating the functional role of synonymous mutations.
{"title":"The SynMall resource for characterizing the functional impact of synonymous variation.","authors":"Chen Ye, Xiaoyan Li, Na Cheng, Yansen Su, Junfeng Xia","doi":"10.1101/gr.281257.125","DOIUrl":"10.1101/gr.281257.125","url":null,"abstract":"<p><p>Synonymous single-nucleotide variants (sSNVs) are increasingly recognized as contributors to disease, yet existing variant annotation databases offer limited functional insights for sSNVs. Here, we present SynMall, a comprehensive resource designed to decipher the functional impact of synonymous variation. SynMall catalogs 25 million potential human sSNVs and integrates evolutionary and population information of sSNVs from 45 non-human species. For each human sSNV, SynMall provides multilevel annotations that combine American College of Medical Genetics and Genomics (ACMG)-aligned variant interpretation information, such as allele frequencies and functional effects, with more than 100 descriptors at the DNA, RNA, and protein levels. These include both handcrafted features and embeddings from large language models to support advanced representation learning. To prioritize pathogenic sSNVs, we have developed SynScore, a machine learning framework that integrates ACMG guidelines and diverse biological characteristics. Benchmark comparisons show that SynScore achieves state-of-the-art performance, validating its effectiveness for genome-wide pathogenicity inference. Furthermore, SynMall enables mechanistic exploration by investigating in silico assessments and curated literature evidence to evaluate sSNV effects on miRNA-mRNA interactions, mRNA splicing, mRNA stability, and codon usage. 
By consolidating these features into a unified platform, we anticipate that SynMall will serve as a valuable resource for elucidating the functional role of synonymous mutations.</p>","PeriodicalId":12678,"journal":{"name":"Genome research","volume":" ","pages":"421-431"},"PeriodicalIF":5.5,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12863188/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145774461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Samin Rahman Khan, M Saifur Rahman, M Sohel Rahman, Md Abul Hassan Samee
The surge in single-cell data sets and reference atlases has enabled the comparison of cell states across conditions, yet a gap persists in quantifying pathological shifts from healthy cell states. To address this gap, we introduce single-cell Pathological Shift Scoring (scPSS), which provides a statistical measure for how much a "query" cell from a diseased sample has shifted away from a reference group of healthy cells. In scPSS, the distance of a cell to its k-th nearest reference cell is considered as its pathological shift score. Euclidean distances in the top n principal component space of the gene expressions are used to measure distances between cells. The distribution of shift scores of the reference cells forms a null model. This allows a P-value to be assigned to each query cell's shift score, quantifying its statistical significance of being in the reference cell group. This makes our method both simple and statistically rigorous. The key strength of scPSS is its applicability in a "semisupervised" setting, where only healthy reference cells are known and diseased-labeled data are not provided for model training. As existing methods do not support cell-level pathological progression measurement in this setting, we adapt state-of-the-art supervised pathological prediction and contrastive models for benchmarking. Comparative evaluations against these adapted models demonstrate our method's superiority in accuracy and efficiency. Additionally, we show that the aggregation of cell-level pathological scores from scPSS can be used to predict health conditions at the individual level.
{"title":"Quantifying pathological progression from single-cell transcriptomic data with scPSS.","authors":"Samin Rahman Khan, M Saifur Rahman, M Sohel Rahman, Md Abul Hassan Samee","doi":"10.1101/gr.280411.125","DOIUrl":"10.1101/gr.280411.125","url":null,"abstract":"<p><p>The surge in single-cell data sets and reference atlases has enabled the comparison of cell states across conditions, yet a gap persists in quantifying pathological shifts from healthy cell states. To address this gap, we introduce <u>s</u>ingle-<u>c</u>ell <u>P</u>athological <u>S</u>hift <u>S</u>coring (scPSS), which provides a statistical measure for how much a \"query\" cell from a diseased sample has shifted away from a reference group of healthy cells. In scPSS, the distance of a cell to its <i>k</i>-th nearest reference cell is considered as its pathological shift score. Euclidean distances in the top <i>n</i> principal component space of the gene expressions are used to measure distances between cells. The distribution of shift scores of the reference cells forms a null model. This allows a <i>P</i>-value to be assigned to each query cell's shift score, quantifying its statistical significance of being in the reference cell group. This makes our method both simple and statistically rigorous. The key strength of scPSS is its applicability in a \"semisupervised\" setting, where only healthy reference cells are known and diseased-labeled data are not provided for model training. As existing methods do not support cell-level pathological progression measurement in this setting, we adapt state-of-the-art supervised pathological prediction and contrastive models for benchmarking. Comparative evaluations against these adapted models demonstrate our method's superiority in accuracy and efficiency. 
Additionally, we show that the aggregation of cell-level pathological scores from scPSS can be used to predict health conditions at the individual level.</p>","PeriodicalId":12678,"journal":{"name":"Genome research","volume":" ","pages":"375-386"},"PeriodicalIF":5.5,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12863187/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145984736","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
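The scoring idea described in the scPSS abstract (distance to the k-th nearest reference cell, with the reference cells' own scores forming a null distribution) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the cells have already been projected onto the top n principal components, it uses a simple empirical P-value with a +1 correction (the abstract does not specify how the P-value is computed), and the function names `kth_nn_distance` and `shift_scores_and_pvalues` are hypothetical.

```python
import math
import random


def kth_nn_distance(point, reference, k, skip_self=False):
    """Euclidean distance from `point` to its k-th nearest cell in `reference`."""
    dists = sorted(math.dist(point, r) for r in reference)
    if skip_self:
        # The point is itself a reference cell: drop its zero self-distance.
        dists = dists[1:]
    return dists[k - 1]


def shift_scores_and_pvalues(reference, query, k=3):
    """Return (shift score, empirical P-value) for every query cell.

    Assumption: coordinates are already in principal-component space.
    """
    # Null model: shift scores of the reference cells among themselves.
    null = [kth_nn_distance(r, reference, k, skip_self=True) for r in reference]
    results = []
    for q in query:
        score = kth_nn_distance(q, reference, k)
        # Fraction of reference scores at least as large as the query score.
        p = (1 + sum(s >= score for s in null)) / (1 + len(null))
        results.append((score, p))
    return results


# Toy data (made up): 200 "healthy" cells in a 5-dimensional PC space.
rng = random.Random(0)
reference = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(200)]
# One query cell inside the healthy cloud, one far away from it.
query = [[0.0] * 5, [6.0] * 5]
results = shift_scores_and_pvalues(reference, query, k=3)
for score, p in results:
    print(f"shift score = {score:.2f}, P = {p:.4f}")
```

In this toy setting the distant query cell receives a larger shift score and a smaller P-value than the cell sitting inside the reference cloud, which is the behavior the abstract describes.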
Maureen Pittman, Kihyun Lee, Franco Felix, Yu Huang, Adrienne Lam, Mauro W Costa, Deepak Srivastava, Katherine S Pollard
Exome sequencing of thousands of families has revealed many risk genes for congenital heart defects (CHDs), yet most cases cannot be explained by a single causal mutation. Even within the same family, individuals carrying a particular mutation in a known risk gene often demonstrate variable phenotypes, suggesting the presence of genetic modifiers. To explore oligogenic causes of CHD without assessing billions of variant combinations, we develop an efficient, simulation-based method to detect gene sets that carry co-occurring damaging variants in probands at a higher rate than expected given parental genotypes. We implement this approach in software called Gene Combinations in Oligogenic Disease (GCOD) and apply it to a cohort of 3377 CHD trios with exome sequencing. This analysis detects 160 gene pairs in which damaging variants are transmitted with higher-than-expected frequency to CHD probands but rarely or never appear in combination in their unaffected parents. Stratifying by specific phenotypes and considering gene combinations of higher orders yields an additional 6026 gene sets. Genes found in oligogenic sets are overrepresented in pathways related to heart development and often co-occur in sets of cell type marker genes from single-cell expression data. Compound heterozygosity of the newly identified digenic pair Gata6-Por leads to higher CHD incidence in mice compared with single hemizygotes, validating predicted genetic interactions. As genome sequencing is applied to more families and other disorders, GCOD will enable detection of increasingly large, novel gene combinations, shedding light on combinatorial causes of genetic diseases.
{"title":"The oligogenic inheritance test GCOD detects risk genes and their interactions in congenital heart defects.","authors":"Maureen Pittman, Kihyun Lee, Franco Felix, Yu Huang, Adrienne Lam, Mauro W Costa, Deepak Srivastava, Katherine S Pollard","doi":"10.1101/gr.281141.125","DOIUrl":"10.1101/gr.281141.125","url":null,"abstract":"<p><p>Exome sequencing of thousands of families has revealed many risk genes for congenital heart defects (CHDs), yet most cases cannot be explained by a single causal mutation. Even within the same family, individuals carrying a particular mutation in a known risk gene often demonstrate variable phenotypes, suggesting the presence of genetic modifiers. To explore oligogenic causes of CHD without assessing billions of variant combinations, we develop an efficient, simulation-based method to detect gene sets that carry co-occurring damaging variants in probands at a higher rate than expected given parental genotypes. We implement this approach in software called Gene Combinations in Oligogenic Disease (GCOD) and apply it to a cohort of 3377 CHD trios with exome sequencing. This analysis detects 160 gene pairs in which damaging variants are transmitted with higher-than-expected frequency to CHD probands but rarely or never appear in combination in their unaffected parents. Stratifying by specific phenotypes and considering gene combinations of higher orders yields an additional 6026 gene sets. Genes found in oligogenic sets are overrepresented in pathways related to heart development and often co-occur in sets of cell type marker genes from single-cell expression data. Compound heterozygosity of the newly identified digenic pair <i>Gata6-Por</i> leads to higher CHD incidence in mice compared with single hemizygotes, validating predicted genetic interactions. 
As genome sequencing is applied to more families and other disorders, GCOD will enable detection of increasingly large, novel gene combinations, shedding light on combinatorial causes of genetic diseases.</p>","PeriodicalId":12678,"journal":{"name":"Genome research","volume":" ","pages":"330-347"},"PeriodicalIF":5.5,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12863180/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145742090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
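To make the phrase "at a higher rate than expected given parental genotypes" concrete, the following is a rough sketch of a simulation-based transmission test for one gene pair. It is not GCOD's actual algorithm. It assumes every damaging variant is heterozygous in the parent and transmitted independently with probability 0.5, and the trio layout, the gene names `GENE_A` and `GENE_B`, and the function names are all hypothetical.

```python
import random


def simulate_child(mother, father, rng):
    """Transmit each parental damaging variant with probability 0.5."""
    child = set()
    for parent in (mother, father):
        for gene in sorted(parent):  # sorted for reproducible random draws
            if rng.random() < 0.5:
                child.add(gene)
    return child


def pair_count(children, gene_a, gene_b):
    """Number of children carrying damaging variants in both genes."""
    return sum(1 for c in children if gene_a in c and gene_b in c)


def cooccurrence_pvalue(trios, gene_a, gene_b, n_sim=2000, seed=0):
    """Compare the observed co-occurrence in probands with simulated offspring."""
    rng = random.Random(seed)
    observed = pair_count([t["proband"] for t in trios], gene_a, gene_b)
    hits = 0
    for _ in range(n_sim):
        simulated = [simulate_child(t["mother"], t["father"], rng) for t in trios]
        if pair_count(simulated, gene_a, gene_b) >= observed:
            hits += 1
    # Empirical P-value with a +1 correction so it is never exactly zero.
    return observed, (hits + 1) / (n_sim + 1)


def make_trios(n_both, n_total):
    """Toy cohort: mother carries GENE_A, father carries GENE_B."""
    trios = []
    for i in range(n_total):
        proband = {"GENE_A", "GENE_B"} if i < n_both else {"GENE_A"}
        trios.append({"mother": {"GENE_A"}, "father": {"GENE_B"}, "proband": proband})
    return trios


# Every proband inherits both variants: far more than the ~25% expected.
obs_enriched, p_enriched = cooccurrence_pvalue(make_trios(40, 40), "GENE_A", "GENE_B")
# A quarter of the probands inherit both variants: consistent with chance.
obs_typical, p_typical = cooccurrence_pvalue(make_trios(10, 40), "GENE_A", "GENE_B")
print(obs_enriched, p_enriched)
print(obs_typical, p_typical)
```

Simulating offspring from the parents of each trio keeps the null model matched to the cohort's own genotypes, which is the general idea the abstract points to. The published method may differ in how variants, gene sets, and phenotype strata are handled.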
The analysis of spatial transcriptomics is hindered by high noise levels and missing gene measurements, challenges that are further compounded by the higher cost of spatial data compared to traditional single-cell data. To overcome these challenges, we introduce spRefine, a deep learning framework that leverages genomic language models to jointly denoise and impute spatial transcriptomic data. Our results demonstrate that spRefine yields more robust cell- and spot-level representations after denoising and imputation, substantially improving data integration. In addition, spRefine serves as a strong framework for model pretraining and the discovery of novel biological signals, as highlighted by multiple downstream applications across datasets of varying scales. Notably, spRefine enhances the accuracy of spatial aging clock estimations and uncovers new aging-related relationships associated with key biological processes, such as neuronal function loss, offering new insights for analyzing aging effects with spatial transcriptomics.
{"title":"spRefine denoises and imputes spatial transcriptomics with a reference-free framework powered by genomic language model.","authors":"Tianyu Liu, Tinglin Huang, Wengong Jin, Tinyi Chu, Rex Ying, Hongyu Zhao","doi":"10.1101/gr.281001.125","DOIUrl":"10.1101/gr.281001.125","url":null,"abstract":"<p><p>The analysis of spatial transcriptomics is hindered by high noise levels and missing gene measurements, challenges that are further compounded by the higher cost of spatial data compared to traditional single-cell data. To overcome this challenge, we introduce spRefine, a deep learning framework that leverages genomic language models to jointly denoise and impute spatial transcriptomic data. Our results demonstrate that spRefine yields more robust cell- and spot-level representations after denoising and imputation, substantially improving data integration. In addition, spRefine serves as a strong framework for model pretraining and the discovery of novel biological signals, as highlighted by multiple downstream applications across datasets of varying scales. Notably, spRefine enhances the accuracy of spatial ageing clock estimations and uncovers new aging-related relationships associated with key biological processes, such as neuronal function loss, which offers new insights for analyzing ageing effect with spatial transcriptomics.</p>","PeriodicalId":12678,"journal":{"name":"Genome research","volume":" ","pages":""},"PeriodicalIF":5.5,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146112803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Marcel Tarbier, Sebastian D Mackowiak, Vaishnovi Sekar, Franziska Bonath, Etka Yapar, Bastian Fromm, Omid R Faridani, Inna Biryukova, Marc R Friedländer
microRNAs are small RNA molecules that can repress the expression of protein-coding genes post-transcriptionally. Previous studies have shown that microRNAs can also have alternative functions, including influencing target expression variation and covariation, but these observations have been limited to a few microRNAs. Here we systematically study microRNA alternative functions in mouse embryonic stem cells (mESCs) by genetically deleting Drosha, leading to global loss of microRNAs. We apply complementary single-cell RNA-seq methods to study the variation of the targets and the microRNAs themselves, and transcriptional inhibition to measure target half-lives. We find that microRNAs form four distinct coexpression groups across single cells. In particular, the mir-290 and the mir-182 genome clusters are abundantly, variably, and inversely expressed. Some cells have global biases toward specific miRNAs originating from either end of the hairpin precursor, suggesting the presence of unknown regulatory cofactors. We find that microRNAs generally increase variation and covariation of their targets at the RNA level, but we also find microRNAs such as miR-182 that appear to have opposite functions. In particular, microRNAs that are themselves variable in expression, such as miR-291a, are more likely to induce covariations. In summary, we apply genetic perturbation and multiomics to give the first global picture of microRNA dynamics at the single-cell level.
{"title":"Landscape of microRNA and target expression variation and covariation in single mouse embryonic stem cells.","authors":"Marcel Tarbier, Sebastian D Mackowiak, Vaishnovi Sekar, Franziska Bonath, Etka Yapar, Bastian Fromm, Omid R Faridani, Inna Biryukova, Marc R Friedländer","doi":"10.1101/gr.279914.124","DOIUrl":"10.1101/gr.279914.124","url":null,"abstract":"<p><p>microRNAs are small RNA molecules that can repress the expression of protein-coding genes post-transcriptionally. Previous studies have shown that microRNAs can also have alternative functions, including influencing target expression variation and covariation, but these observations have been limited to a few microRNAs. Here we systematically study microRNA alternative functions in mouse embryonic stem cells (mESCs) by genetically deleting <i>Drosha</i>, leading to global loss of microRNAs. We apply complementary single-cell RNA-seq methods to study the variation of the targets and the microRNAs themselves, and transcriptional inhibition to measure target half-lives. We find that microRNAs form four distinct coexpression groups across single cells. In particular, the <i>mir-290</i> and the <i>mir-182</i> genome clusters are abundantly, variably, and inversely expressed. Some cells have global biases toward specific miRNAs originating from either end of the hairpin precursor, suggesting the presence of unknown regulatory cofactors. We find that microRNAs generally increase variation and covariation of their targets at the RNA level, but we also find microRNAs such as miR-182 that appear to have opposite functions. In particular, microRNAs that are themselves variable in expression, such as miR-291a, are more likely to induce covariations. 
In summary, we apply genetic perturbation and multiomics to give the first global picture of microRNA dynamics at the single-cell level.</p>","PeriodicalId":12678,"journal":{"name":"Genome research","volume":" ","pages":"291-302"},"PeriodicalIF":5.5,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12863184/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145959287","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wenhai Zhang, Yuansheng Liu, Guangyi Li, Jialu Xu, Enlian Chen, Alexander Schönhuth, Xiao Luo
Microbes are omnipresent, thriving in a range of habitats, from oceans to soils, and even within our gastrointestinal tracts. They play a vital role in maintaining ecological equilibrium and promoting the health of their hosts. Consequently, understanding strain-level diversity in microbial communities is crucial, as variations between strains can lead to different phenotypic expressions or diverse biological functions. However, current methods for taxonomic classification from metagenomic sequencing data have several limitations, including their reliance solely on species resolution, support for either short or long reads but not both, or their confinement to a single given species. Notably, most existing strain-level taxonomic classifiers rely on the sequence representation of multiple linear reference genomes, which fails to capture the sequence correlations among these genomes, potentially introducing ambiguity and biases in metagenomic profiling. Here, we present PanTax, a pangenome graph-based taxonomic profiler that overcomes the shortcomings of sequence-based approaches, because pangenome graphs can depict the full range of genetic variability present across multiple evolutionarily or environmentally related genomes. PanTax provides a comprehensive solution to taxonomic classification, offering strain-level resolution, compatibility with both short and long reads, and support for single or multiple species. Extensive benchmarking results demonstrate that PanTax drastically outperforms state-of-the-art approaches, primarily evidenced by its significantly higher F1 score at the strain level, while maintaining comparable or better performance in other aspects across various data sets.
{"title":"Strain-level metagenomic profiling using pangenome graphs with PanTax.","authors":"Wenhai Zhang, Yuansheng Liu, Guangyi Li, Jialu Xu, Enlian Chen, Alexander Schönhuth, Xiao Luo","doi":"10.1101/gr.280858.125","DOIUrl":"10.1101/gr.280858.125","url":null,"abstract":"<p><p>Microbes are omnipresent, thriving in a range of habitats, from oceans to soils, and even within our gastrointestinal tracts. They play a vital role in maintaining ecological equilibrium and promoting the health of their hosts. Consequently, understanding the diversity in terms of strains in microbial communities is crucial, as variations between strains can lead to different phenotypic expressions or diverse biological functions. However, current methods for taxonomic classification from metagenomic sequencing data have several limitations, including their reliance solely on species resolution, support for either short or long reads, or their confinement to a given single species. Most notably, most existing strain-level taxonomic classifiers rely on the sequence representation of multiple linear reference genomes, which fails to capture the sequence correlations among these genomes, potentially introducing ambiguity and biases in metagenomic profiling. Here, we present PanTax, a pangenome graph-based taxonomic profiler that overcomes the shortcomings of sequence-based approaches, because pangenome graphs possess the capability to depict the full range of genetic variability present across multiple evolutionarily or environmentally related genomes. PanTax provides a comprehensive solution to taxonomic classification for strain resolution, compatibility with both short and long reads, and compatibility with single or multiple species. 
Extensive benchmarking results demonstrate that PanTax drastically outperforms state-of-the-art approaches, primarily evidenced by its significantly higher F1 score at the strain level, while maintaining comparable or better performance in other aspects across various data sets.</p>","PeriodicalId":12678,"journal":{"name":"Genome research","volume":" ","pages":"405-420"},"PeriodicalIF":5.5,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12863173/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145984796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Margarita Geleta, Daniel Mas Montserrat, Xavier Giro-I-Nieto, Alexander G Ioannidis
Modern biobanks are providing numerous high-resolution genomic sequences of diverse populations. To account for diverse and admixed populations, new algorithmic tools are needed that properly capture the genetic composition of populations. Here, we explore deep learning techniques, namely, variational autoencoders (VAEs), to process genomic data from a population perspective. We show the power of VAEs for a variety of tasks relating to the interpretation, compression, classification, and simulation of genomic data with several worldwide whole genome data sets from both humans and canids, and evaluate the performance of the proposed applications with and without ancestry conditioning. The unsupervised setting of autoencoders allows for the detection and learning of granular population structure and the inference of informative latent factors. The learned latent spaces of VAEs are able to capture and represent differentiated Gaussian-like clusters of samples with similar genetic composition on a fine scale from single nucleotide polymorphisms (SNPs), enabling applications in dimensionality reduction and data simulation. Individual genotype sequences can then be decomposed into latent representations and reconstruction errors (residuals), which provide a sparse representation useful for lossless compression. We show that different populations have differentiated compression ratios and classification accuracies. Additionally, we analyze the entropy of the SNP data, its effect on compression across populations, and its relation to historical migrations, and we show how to introduce autoencoders into existing compression pipelines.
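The decomposition into a latent representation plus residuals can be illustrated with a small sketch. This is not the authors' model: a linear projection obtained by SVD stands in for the VAE encoder and decoder, the genotype data are simulated, and all names (`encode`, `decode`, `components`) are hypothetical. The point is only that storing the latent code and the XOR residual is enough to recover the genotypes exactly.

```python
import numpy as np


def encode(x, components, mean):
    """Project centered genotypes onto a few latent directions."""
    return (x - mean) @ components.T


def decode(z, components, mean):
    """Reconstruct binary genotypes by thresholding the linear decoding."""
    return ((z @ components + mean) > 0.5).astype(np.uint8)


rng = np.random.default_rng(0)

# Simulated binary SNP matrix: two populations with different allele
# frequencies over 40 sites (50 samples each).
freqs = np.array([[0.9] * 20 + [0.1] * 20,
                  [0.1] * 20 + [0.9] * 20])
labels = np.repeat([0, 1], 50)
x = (rng.random((100, 40)) < freqs[labels]).astype(np.uint8)

# Stand-in for a trained autoencoder: top-2 right singular vectors.
mean = x.mean(axis=0)
_, _, vt = np.linalg.svd(x - mean, full_matrices=False)
components = vt[:2]

z = encode(x, components, mean)        # compact latent representation
x_hat = decode(z, components, mean)    # lossy reconstruction
residual = np.bitwise_xor(x, x_hat)    # positions where reconstruction fails

# Lossless recovery: the decoder output corrected by the residual.
restored = np.bitwise_xor(x_hat, residual)
```

When the reconstruction is good, the residual contains few nonzero entries, so a general-purpose entropy coder can store it cheaply. How sparse it is depends on how well the model fits each population, which is consistent with the abstract's observation that compression ratios differ across populations.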
{"title":"Autoencoders for genomic variation analysis.","authors":"Margarita Geleta, Daniel Mas Montserrat, Xavier Giro-I-Nieto, Alexander G Ioannidis","doi":"10.1101/gr.280086.124","DOIUrl":"10.1101/gr.280086.124","url":null,"abstract":"<p><p>Modern biobanks are providing numerous high-resolution genomic sequences of diverse populations. In order to account for diverse and admixed populations, new algorithmic tools are needed in order to properly capture the genetic composition of populations. Here, we explore deep learning techniques, namely, variational autoencoders (VAEs), to process genomic data from a population perspective. We show the power of VAEs for a variety of tasks relating to the interpretation, compression, classification, and simulation of genomic data with several worldwide whole genome data sets from both humans and canids, and evaluate the performance of the proposed applications with and without ancestry conditioning. The unsupervised setting of autoencoders allows for the detection and learning of granular population structure and inferring of informative latent factors. The learned latent spaces of VAEs are able to capture and represent differentiated Gaussian-like clusters of samples with similar genetic composition on a fine scale from single nucleotide polymorphisms (SNPs), enabling applications in dimensionality reduction and data simulation. These individual genotype sequences can then be decomposed into latent representations and reconstruction errors (residuals), which provide a sparse representation useful for lossless compression. We show that different populations have differentiated compression ratios and classification accuracies. 
Additionally, we analyze the entropy of the SNP data, its effect on compression across populations, and its relation to historical migrations, and we show how to introduce autoencoders into existing compression pipelines.</p>","PeriodicalId":12678,"journal":{"name":"Genome research","volume":" ","pages":"348-360"},"PeriodicalIF":5.5,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12863191/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146010065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhitao Huang, Ruiqing Zheng, Pengzhen Jia, Xuhua Yan, Jinmiao Chen, Min Li
With the emergence of abundant single-cell multiomics data, labels are increasingly transferred from well-annotated scRNA-seq data to less-annotated omics data, such as scATAC-seq. This approach leverages the gene expression profiles available in scRNA-seq to help annotate common cell types and even novel cell types for other omics data. However, the heterogeneous features between scRNA-seq and scATAC-seq pose challenges for identifying different cell types, which hinders the discovery of novel types. In this study, we propose a new label transfer tool, scSHEFT, which simultaneously considers gene expression count data, peak count data, and Gene Activity Scores as inputs to bridge the gap between heterogeneous features. Specifically, we transform scATAC-seq data into Gene Activity Scores based on prior knowledge to harmonize heterogeneous features. Because this feature transformation results in information loss, we introduce the raw ATAC-seq embeddings to preserve the original information. To achieve a balance between interomics alignment and intraomics heterogeneity, we propose a dual alignment strategy: scSHEFT employs an anchor-based approach to align interomics anchor pairs and a contrastive-based strategy to preserve cellular heterogeneity within each omics layer. Benchmarking scSHEFT against 11 state-of-the-art methods across seven data sets demonstrates its superiority in handling data sets of varying scales and technical noise.
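The abstract does not specify how Gene Activity Scores are derived beyond "prior knowledge". A minimal sketch of one widely used convention, assumed here rather than taken from scSHEFT, sums the counts of all peaks overlapping a gene body extended by a fixed window upstream of the transcription start site. The function name `gene_activity`, the 2 kb window, and the coordinates below are hypothetical.

```python
import numpy as np


def gene_activity(peak_counts, peaks, genes, upstream=2000):
    """Aggregate a cells x peaks matrix into a cells x genes matrix.

    peaks: list of (chrom, start, end)
    genes: dict mapping gene name -> (chrom, start, end, strand)
    Assumption: a peak contributes to a gene if it overlaps the gene body
    extended `upstream` bases before the strand-aware start.
    """
    names = list(genes)
    scores = np.zeros((peak_counts.shape[0], len(names)))
    for g, name in enumerate(names):
        chrom, start, end, strand = genes[name]
        if strand == "+":
            lo, hi = start - upstream, end
        else:
            lo, hi = start, end + upstream
        for p, (p_chrom, p_start, p_end) in enumerate(peaks):
            if p_chrom == chrom and p_start < hi and p_end > lo:
                scores[:, g] += peak_counts[:, p]
    return names, scores


# Hypothetical toy input: 2 cells, 3 peaks, 2 genes.
peaks = [("chr1", 500, 900), ("chr1", 5000, 5400), ("chr2", 100, 300)]
genes = {"G1": ("chr1", 2000, 4000, "+"), "G2": ("chr1", 3000, 4500, "-")}
counts = np.array([[1, 2, 3],
                   [4, 0, 1]])
names, scores = gene_activity(counts, peaks, genes)
```

Collapsing peaks to genes this way puts scATAC-seq into the same feature space as scRNA-seq, but peaks that fall outside every gene window (the chr2 peak above) are discarded. That loss is the motivation the abstract gives for also keeping the raw peak-level embeddings.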
{"title":"scSHEFT enables multiomics label transfer from scRNA-seq to scATAC-seq through dual alignment.","authors":"Zhitao Huang, Ruiqing Zheng, Pengzhen Jia, Xuhua Yan, Jinmiao Chen, Min Li","doi":"10.1101/gr.280410.125","DOIUrl":"10.1101/gr.280410.125","url":null,"abstract":"<p><p>Currently, with the emergence of abundant single-cell multiomics data, there is a trend where labels are transferred from well-annotated scRNA-seq data to less-annotated omics data, such as scATAC-seq. This approach leverages the gene expression profiles available in scRNA-seq to help annotate common cell types and even novel cell types for other omics data. However, the heterogeneous features between scRNA-seq and scATAC-seq pose challenges for identifying different cell types, which hinders the discovery of novel types. In this study, we propose a new label transfer tool scSHEFT, which simultaneously considers gene expression count data, peak count data, and Gene Activity Scores as inputs to bridge the gap of heterogeneous features. Specifically, we transform scATAC-seq data into Gene Activity Scores based on prior knowledge to harmonize heterogeneous features. As the feature transformation would result in information loss, we introduce the raw ATAC-seq embeddings to preserve the original information. To achieve a balance between interomics alignment and intraomics heterogeneity, we propose a dual alignment strategy. Specifically, scSHEFT employs an anchor-based approach to align interomics anchor pairs and a contrastive-based strategy to preserve cellular heterogeneity within each omics layer. 
Benchmarking scSHEFT against 11 state-of-the-art methods across seven data sets demonstrates its superiority in handling data sets of varying scales and technical noises.</p>","PeriodicalId":12678,"journal":{"name":"Genome research","volume":" ","pages":"387-396"},"PeriodicalIF":5.5,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12863186/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146010073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}