Pub Date: 2025-12-08; eCollection Date: 2025-12-01; DOI: 10.1093/nargab/lqaf173
Mark A Santcroos, Walter A Kosters, Mihai Lefter, Jeroen F J Laros, Jonathan K Vis
Accurate variant descriptions are of paramount importance in the field of genomics. The domain is confronted with increasingly complex variants, e.g. combinations of multiple indels, making it challenging to generate proper variant descriptions directly from chromosomal sequences. We present a graph based on all minimal alignments that is a complete representation of a variant and gives more insight into the nature of a variant than a single variant description. We provide three complementary extraction methods to derive variant descriptions from this graph, including one that yields domain-specific constructs from the HGVS nomenclature. Our experiments show that, compared with dbSNP, the authoritative variant database from the NCBI, our methods yield identical HGVS descriptions for simple variants and more meaningful descriptions for complex variants, in particular for repeat expansions and contractions.
Title: A graph-based approach to variant description extraction from sequences.
Journal: NAR Genomics and Bioinformatics, volume 7, issue 4, article lqaf173. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12684398/pdf/
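To make the idea of "minimal alignments" in the variant-description abstract above more concrete, the following is a minimal sketch, not the authors' graph construction. It assumes an indel-only cost model (edit distance derived from the longest common subsequence), which the abstract does not state, and it derives a single HGVS-like description by trimming the shared prefix and suffix. The function names `lcs_length`, `indel_distance`, and `describe` are hypothetical.

```python
def lcs_length(ref: str, obs: str) -> int:
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(obs) + 1)
    for r in ref:
        cur = [0]
        for j, o in enumerate(obs, start=1):
            cur.append(prev[j - 1] + 1 if r == o else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_distance(ref: str, obs: str) -> int:
    """Minimal number of single-base insertions and deletions (assumed cost model)."""
    return len(ref) + len(obs) - 2 * lcs_length(ref, obs)


def describe(ref: str, obs: str) -> str:
    """One HGVS-like description (1-based positions, no reference prefix).

    Trimming the prefix first shifts indels to the 3' end. Duplications,
    repeat units, and insertions before the first base are NOT handled here.
    """
    p = 0  # length of the shared prefix
    while p < len(ref) and p < len(obs) and ref[p] == obs[p]:
        p += 1
    s = 0  # length of the shared suffix that does not overlap the prefix
    while s < len(ref) - p and s < len(obs) - p and ref[-1 - s] == obs[-1 - s]:
        s += 1
    deleted = ref[p:len(ref) - s]
    inserted = obs[p:len(obs) - s]
    if not deleted and not inserted:
        return "="
    start, end = p + 1, len(ref) - s
    if not deleted:
        return f"{p}_{p + 1}ins{inserted}"
    span = str(start) if start == end else f"{start}_{end}"
    if not inserted:
        return f"{span}del"
    if len(deleted) == 1 and len(inserted) == 1:
        return f"{start}{deleted}>{inserted}"
    return f"{span}delins{inserted}"
```

A single trimmed description like this collapses every complex variant into one `delins`, which is exactly the loss of detail that a representation built from all minimal alignments is meant to avoid.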
Precision oncology, enabled by next-generation sequencing (NGS), has shown tremendous potential for clinical use in cancer diagnosis and treatment. Its biggest promise is to make treatment more precise and tailored to individual patients, departing from the one-size-fits-all approach. However, the translation of genomic panels into clinical practice and their wider implementation face challenges. Currently, only patients who carry mutations frequently observed in their cancer type benefit from the NGS approach. There is an urgent need to extend this benefit to all patients, which requires new methods that identify key actionable gene panels in every patient. We address this need and present a new algorithm, PrOPs (Precision Onco Panels), that identifies short actionable driver panels by integrating genomics, transcriptomics, genome-wide protein-protein interactions, and precision network construction and analysis. We tested the algorithm on 2180 patients from six cancer types from TCGA (BRCA, COAD, GBM, LIHC, LUAD, and SKCM) and predicted patient-specific cancer driver genes. PrOPs outperforms existing network-based methods that identify personalized drivers and also captures rare and patient-specific cancer drivers. Among the clinical cohorts, PrOPs identified clinically relevant actionable panels in 93% of patient cases. The extensive testing of our algorithm and its demonstrated generalizability across six different cancers indicate its usefulness in precision oncology.
Title: A new algorithm Precision OncoPanels (PrOPs) identifies short individualized actionable panels that can guide cancer treatment: a pan-cancer analysis of TCGA cohorts.
Authors: Shrisruti Sriraman, Debajyoti Das, Nagasuma Chandra
Pub Date: 2025-12-08; DOI: 10.1093/nargab/lqaf177
Journal: NAR Genomics and Bioinformatics, volume 7, issue 4, article lqaf177. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12684385/pdf/
Pub Date: 2025-12-08; eCollection Date: 2025-12-01; DOI: 10.1093/nargab/lqaf171
Kazuteru Yamamura, Goro Terai, Kiyoshi Asai
The structure of RNA is deeply related to its function, and information about RNA substructure energy parameters is useful for predicting its structure from its sequence. RNA in cells is often modified, and these various types of modifications affect its structure and function. In recent years, the use of pseudouridine modifications in RNA vaccines has increased the importance of predicting structures that include modified bases. However, energy parameters of substructures involving modified bases have not yet been sufficiently determined. Therefore, in this paper, we propose a method for inversely calculating energy parameters from base-pairing probabilities. This method optimizes energy parameters using the same mechanism as gradient descent in deep learning. We also propose efficient computational approaches, including the calculation of the derivative of the partition function using a dynamic programming method following computations with the McCaskill algorithm. Because base-pairing probabilities can be obtained by adjusting them through chemical probing methods, it is expected that parameter estimation can be performed without relying on labor-intensive experiments or molecular dynamics simulations.
Title: A method for estimating energy parameters of RNAs by differentiating base-pairing probabilities.
Journal: NAR Genomics and Bioinformatics, volume 7, issue 4, article lqaf171. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12684387/pdf/
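The gradient-based estimation described in the RNA energy-parameter abstract above can be illustrated with a short derivation. This is a sketch under two assumptions that the abstract does not state: the energy of a structure $s$ is linear in the parameters, $E_\theta(s) = \sum_k \theta_k\, n_k(s)$, where $n_k(s)$ counts occurrences of substructure $k$, and the fit uses a squared-error loss against target probabilities $\hat{P}_{ij}$. The symbols $\theta$, $n_k$, $\hat{P}_{ij}$, and the step size $\eta$ are introduced here for illustration only.

```latex
\begin{align}
Z(\theta) &= \sum_{s} \exp\!\left(-\frac{E_\theta(s)}{RT}\right),
\qquad
P_{ij}(\theta) = \frac{1}{Z(\theta)} \sum_{s \ni (i,j)} \exp\!\left(-\frac{E_\theta(s)}{RT}\right), \\
\frac{\partial \ln Z}{\partial \theta_k} &= -\frac{1}{RT}\,\langle n_k \rangle, \\
\frac{\partial P_{ij}}{\partial \theta_k} &= -\frac{1}{RT}\Bigl(\langle \mathbf{1}_{ij}\, n_k \rangle - P_{ij}\,\langle n_k \rangle\Bigr), \\
L(\theta) &= \sum_{i<j} \bigl(P_{ij}(\theta) - \hat{P}_{ij}\bigr)^2,
\qquad
\theta \leftarrow \theta - \eta\, \nabla_\theta L(\theta).
\end{align}
```

Here $\langle \cdot \rangle$ denotes the Boltzmann-weighted average over structures and $\mathbf{1}_{ij}(s)$ indicates whether bases $i$ and $j$ pair in $s$. Under these assumptions, each gradient component is a covariance between a base pair and a substructure count, which is the kind of quantity a dynamic programming pass over the partition function can accumulate.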
Pub Date: 2025-12-03; eCollection Date: 2025-12-01; DOI: 10.1093/nargab/lqaf178
Reethika Veluri, Gareth Pollin, Jessica B Wagenknecht, Raul Urrutia, Michael T Zimmermann
Miniproteins, defined as polypeptides containing fewer than 50 amino acids, have recently elicited significant interest due to an emerging understanding of their diverse roles in fundamental biological processes. In addition, miniprotein dysregulation underlies human diseases and is a considerable focus for biotechnology and drug development. The Human Genome Project revealed many miniproteins, most of which remain uncharacterized. This study reports an approach for analyzing and scoring previously uncharacterized miniproteins by integrating knowledge from classic sequence-based bioinformatics, computational biophysics, and systems biology annotations. We identified 85 human miniproteins using this simple multi-tier approach. Then, we predicted miniprotein three-dimensional structures using AI-based methods and peptide modeling to determine the relative yields of these approaches for these understudied polymers. We find that structural propensity is not strictly dependent on polymer length, and that peptide-based algorithms may have advantages over AI-based algorithms for certain groups of miniproteins. Subsequently, we used several computational biophysics methods and structure-based calculations to annotate and evaluate results from both algorithms. We propose novel structure-function relationships for miniproteins, which expand our understanding of their potential roles in cellular processes. Finally, we identify which sequence- and structure-based tools provide the most information in practice, aiding future studies of miniproteins, with emphasis on their biomedical relevance.
Title: An integrative multitiered computational analysis for better understanding the structure and function of 85 miniproteins.
Journal: NAR Genomics and Bioinformatics, volume 7, issue 4, article lqaf178. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12673850/pdf/
Pub Date: 2025-12-03; eCollection Date: 2025-12-01; DOI: 10.1093/nargab/lqaf169
Nathan J Clement, Nobuko Hamasaki-Katagiri, Brian Lin, Anton A Komar, Michael DiCuccio, Haim Bar, Chava Kimchi-Sarfaty
Current strategies for optimizing gene therapeutics and recombinant protein production typically rely on universal host codon usage indices. However, there is a growing shift toward incorporating gene-specific traits to enhance therapeutic characteristics. In this study, we investigate position-specific variations in codon and adjacent codon-pair usage biases (CPUBs), offering potential for more tailored gene engineering approaches. We focus our analysis on the coding sequences of four coagulation factors: ADAMTS13, von Willebrand factor, factor VIII, and factor IX, which have been used in therapeutic applications. By aligning transcript homologs with human sequences for each gene using Discontiguous Megablast and MACSE, we assess "sequence-position-specific" codon usage biases and CPUBs, using 157 homologous sequences for ADAMTS13, 148 for F8, 96 for F9, and 202 for VWF. Species with homologs ranged from Primates and Artiodactyla (Even-toed Ungulates) to Testudines. Statistically significant, position-specific positive CPUBs were observed that contrasted with conventional, alignment-specific negative CPUBs. Moreover, we observed that codon and codon-pair usages are highly associated at sequence positions despite little or no association in conventional, position-agnostic analyses. The distinct biases observed at different positions and functionally critical domains in coding sequences highlight the importance of considering position-specific effects in codon optimization strategies.
Title: Uncovering position-specific patterns in codon and codon-pair usage in candidate genes associated with blood coagulation diseases.
Journal: NAR Genomics and Bioinformatics, volume 7, issue 4, article lqaf169. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12673856/pdf/
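As an illustration of what a position-specific codon-pair statistic could look like for the codon-usage abstract above, the sketch below scores adjacent codon pairs column by column across aligned homologs. It is an assumption-laden toy, not the authors' statistic: the score is a plain log ratio of the observed pair frequency to the product of the two per-column codon frequencies, gaps are assumed to be written as `---`, and the name `position_pair_scores` is hypothetical.

```python
import math
from collections import Counter

GAP = "---"  # assumed gap symbol for a whole missing codon


def split_codons(seq: str) -> list:
    """Split an in-frame sequence into codons, dropping any trailing partial codon."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def position_pair_scores(aligned_cds: list) -> list:
    """For each alignment column i, score every codon pair seen at (i, i + 1).

    score = ln( f(pair) / (f_i(first codon) * f_{i+1}(second codon)) ),
    with frequencies taken over homologs that have no gap at either column.
    Positive values mean the pair occurs more often than expected if the
    two columns were independent.
    """
    rows = [split_codons(s) for s in aligned_cds]
    width = min(len(r) for r in rows)
    scores = []
    for i in range(width - 1):
        pairs = [(r[i], r[i + 1]) for r in rows if GAP not in (r[i], r[i + 1])]
        n = len(pairs)
        pair_counts = Counter(pairs)
        first = Counter(a for a, _ in pairs)
        second = Counter(b for _, b in pairs)
        scores.append({
            pair: math.log((c / n) / ((first[pair[0]] / n) * (second[pair[1]] / n)))
            for pair, c in pair_counts.items()
        })
    return scores
```

A position-agnostic analysis would instead pool pair counts over all columns, which is why the two views can disagree in sign at individual positions.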
Pub Date: 2025-12-03; eCollection Date: 2025-12-01; DOI: 10.1093/nargab/lqaf176
Izabela Mamede, Lucio R Queiroz, Carlos Mata-Machado, Júlia Teixeira Rodrigues, Thomaz Luscher-Dias, Nayara E de Toledo, Paulo P Amaral, Luigi Marchionni, Gloria R Franco
Transcriptome analysis is one of the bases of modern biology, yet it is typically performed at the gene level, ignoring the complexity of alternative splicing and differential transcription initiation/termination events. Over 95% of mammalian genes produce multiple transcripts, yet most RNA-seq analyses rely on short-read data, for which transcript-level interpretation remains challenging. Current tools suffer from low accuracy and inconsistency with annotations, and they lack quick solutions for downstream biological interpretation. Here, we present Isoformic, a customizable R pipeline for transcript-level analysis of short-read RNA-seq data, available on GitHub. Isoformic processes differential expression results to detect genes with transcript-level changes, visualize exon-intron structures, and perform functional enrichment stratified by transcript type. Validated on diverse datasets, including preeclampsia, SARS-CoV-2 infection, and murine anxiety models, Isoformic reveals biologically relevant transcript variants and their possible phenotypic associations. Compatible with the GENCODE reference transcriptome, Isoformic enhances the resolution of RNA-seq studies, enabling researchers to uncover the regulatory roles of alternative transcription events.
Title: Isoformic: a workflow for transcript-level RNA-seq interpretation.
Journal: NAR Genomics and Bioinformatics, volume 7, issue 4, article lqaf176. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12673842/pdf/
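The step "detect genes with transcript-level changes" in the Isoformic abstract above can be pictured with a small filter over differential expression tables. The sketch below is written in Python rather than R and does not use or reproduce the Isoformic API. The record layout (`gene`, `transcript`, `log2fc`, `padj`), the thresholds, and the function name are all assumptions made for illustration.

```python
def genes_with_masked_transcript_changes(gene_de, transcript_de,
                                         alpha=0.05, min_abs_lfc=1.0):
    """Return genes that look unchanged at the gene level but have at least
    one significantly changed transcript.

    gene_de:       {gene_id: {"log2fc": float, "padj": float}}
    transcript_de: iterable of {"gene", "transcript", "log2fc", "padj"} records
    """
    def significant(rec):
        return rec["padj"] < alpha and abs(rec["log2fc"]) >= min_abs_lfc

    # Genes whose aggregate signal does not pass the thresholds.
    quiet_genes = {g for g, rec in gene_de.items() if not significant(rec)}

    hits = {}
    for rec in transcript_de:
        if rec["gene"] in quiet_genes and significant(rec):
            hits.setdefault(rec["gene"], []).append(rec["transcript"])
    return hits
```

Genes returned here are candidates for isoform switching, where opposite changes in two transcripts cancel out in the gene-level summary.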
Pub Date: 2025-11-28; eCollection Date: 2025-12-01; DOI: 10.1093/nargab/lqaf155
Alexander Coulter, Chun Yip Tong, Yang Ni, Yuchao Jiang
Mapping expression quantitative trait loci (eQTLs) is a powerful method to study how genetic variation influences gene expression. Traditional bulk eQTL methods rely on averaged gene expression across a possibly heterogeneous mixture of cells, which can obscure underlying regulatory heterogeneity. Single-cell eQTL methods circumvent the averaging artifacts, providing an immense opportunity to interrogate transcriptional regulation at a much finer resolution. Recent developments in metric space regression methods allow the use of full empirical distributions as response objects instead of simple summary statistics such as the mean. Here, we leverage Fréchet regression to identify distribution QTLs (distQTLs) using population-scale single-cell RNA-sequencing (scRNA-seq) data. We apply distQTL to the OneK1K cohort, consisting of scRNA-seq data of peripheral blood mononuclear cells from 982 donors, and compare results to various eQTL approaches based on summary statistics and mixed effects modeling. We demonstrate the superior performance of distQTL across different gene expression contexts compared to other methods and benchmark our results against findings from the Genotype-Tissue Expression Project. Finally, we orthogonally validate calls from distQTL using cell-type-specific epigenomic profiles.
Title: distQTL: distribution quantitative trait loci identification by population-scale single-cell data.
Journal: NAR Genomics and Bioinformatics, volume 7, issue 4, article lqaf155. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12661327/pdf/
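To show what "distributions as response objects" in the distQTL abstract above can mean in practice, the sketch below treats each donor's single-cell expression values as an empirical distribution and works with its quantile function. It assumes the 2-Wasserstein metric, which the abstract does not name. It relies on two standard facts for one-dimensional distributions: the 2-Wasserstein distance is the L2 distance between quantile functions, and the Fréchet mean has the averaged quantile function. Grouping donors by genotype is a simplification standing in for full Fréchet regression, and all function names are hypothetical.

```python
import math


def quantile_function(values, grid):
    """Empirical quantile function evaluated at each probability u in grid."""
    xs = sorted(values)
    n = len(xs)
    return [xs[min(n - 1, max(0, math.ceil(u * n) - 1))] for u in grid]


def midpoint_grid(m):
    """m evenly spaced probabilities in (0, 1)."""
    return [(k + 0.5) / m for k in range(m)]


def wasserstein2(a, b, m=100):
    """Approximate 2-Wasserstein distance between two 1-D samples."""
    grid = midpoint_grid(m)
    qa, qb = quantile_function(a, grid), quantile_function(b, grid)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(qa, qb)) / m)


def frechet_mean_quantiles(samples, m=100):
    """Quantile function of the Wasserstein Frechet mean: the pointwise average."""
    grid = midpoint_grid(m)
    qs = [quantile_function(s, grid) for s in samples]
    return [sum(col) / len(qs) for col in zip(*qs)]


def genotype_mean_quantiles(expr_by_donor, genotype_by_donor, m=100):
    """Mean expression distribution per genotype group (e.g. 0, 1, 2 alt alleles)."""
    groups = {}
    for donor, values in expr_by_donor.items():
        groups.setdefault(genotype_by_donor[donor], []).append(values)
    return {g: frechet_mean_quantiles(s, m) for g, s in groups.items()}
```

Comparing the per-genotype quantile curves shows shifts in spread or shape that a comparison of means alone would miss.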
Pub Date: 2025-11-22; eCollection Date: 2025-12-01; DOI: 10.1093/nargab/lqaf156
Guy M Twa, Robert A Phillips, Nathaniel J Robinson, Jeremy J Day
Single-nucleus RNA sequencing (snRNA-seq) technology offers unprecedented resolution for studying cell type-specific gene expression patterns. However, snRNA-seq poses high costs and technical limitations, often requiring the pooling of independent biological samples and the loss of individual sample-level data. Deconvolution of sample identity using inherent features would enable the incorporation of pooled barcoding and sequencing protocols, thereby increasing data throughput and analytical sample size without requiring increases in experimental sample size and sequencing costs. In this study, we demonstrate a proof of concept that sex-dependent gene expression patterns can be leveraged for the deconvolution of pooled snRNA-seq data. Using previously published snRNA-seq data from the rat ventral tegmental area, we trained a range of machine learning models to classify cell sex using genes differentially expressed in cells from male and female rats. Models that used sex-dependent gene expression predicted cell sex with high accuracy (93%-95%) and generalizability and outperformed simple classification models using only sex chromosome gene expression (88%-90%). This work provides a model for future snRNA-seq studies to perform sample deconvolution using a two-sex pooled sample sequencing design and benchmarks the performance of various machine learning approaches to deconvolve sample identification from inherent sample features.
Title: Accurate sample deconvolution of pooled snRNA-seq using sex-dependent gene expression patterns.
Journal: NAR Genomics and Bioinformatics, volume 7, issue 4, article lqaf156. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12639244/pdf/
Pub Date: 2025-11-21; eCollection Date: 2025-12-01; DOI: 10.1093/nargab/lqaf168
Dylan De Groote, Daniele Pepe, Xander Janssens, Kim De Keersmaecker
Accurate annotation of genetic variants, distinguishing whether they affect protein-coding or noncoding genomic regions, is crucial for evaluating their potential role in disease development. Prominent examples have been identified of variants that for many years had been considered to be coding missense or synonymous mutations targeting one gene, and that recently turned out to be noncoding variants, sometimes even modulating a shared regulatory region of multiple genes. These errors were caused by annotating to a canonical reference transcript when, in reality, an alternative transcript was expressed, with respect to which the mutations have a different annotation. Unfortunately, this practice of annotating genetic variants to a reference transcript, without verifying whether this transcript is expressed or whether the mutation causes a change of expressed transcript, is still widespread. However, the implementation of RNA sequencing and the availability of these data in online portals allow verification of the transcripts expressed in relevant tissues. Integration of DNA- and RNA-sequencing data, in which detected DNA mutations are annotated with respect to the transcripts that are expressed in the corresponding tissue or disease sample as detected by RNA sequencing, avoids misinterpretation of noncoding variants as coding and vice versa, thereby improving the functional interpretation of genetic variants.
{"title":"To be or not to be a protein coding mutation, that's the question!","authors":"Dylan De Groote, Daniele Pepe, Xander Janssens, Kim De Keersmaecker","doi":"10.1093/nargab/lqaf168","DOIUrl":"10.1093/nargab/lqaf168","url":null,"abstract":"<p><p>Accurate annotation of genetic variants-distinguishing whether they affect protein-coding or noncoding genomic regions-is crucial for evaluating their potential role in disease development. Prominent examples have been identified of variants that for many years had been considered to be coding missense or synonymous mutations targeting one gene, and that recently turned out to be noncoding variants, sometimes even modulating a shared regulatory region of multiple genes. These errors were caused by annotating to a canonical reference transcript, whereas an alternative transcript was in reality expressed in respect to which the mutations have a different annotation. Unfortunately, this practice of annotating genetic variants to a reference transcript, without verifying whether this transcript is expressed or whether the mutation causes a change of expressed transcript, is still widespread. However, the implementation of RNA sequencing and availability of these data in online portals allow to verify expressed transcripts in relevant tissues. 
Integration of DNA- and RNA-sequencing data, in which detected DNA mutations are annotated in respect to the transcripts that are expressed in the corresponding tissue or disease sample as detected by RNA sequencing, avoids misinterpretation of noncoding variants as coding and <i>vice versa</i>, thereby improving the functional interpretation of genetic variants.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 4","pages":"lqaf168"},"PeriodicalIF":2.8,"publicationDate":"2025-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12634407/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145589150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
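The idea of annotating a variant against the expressed transcript rather than the canonical one can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the article: the `Transcript` record, the `classify` and `annotate` helpers, the 1.0 TPM expression threshold, and the toy coordinates and expression values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    name: str
    cds: tuple          # half-open (start, end) genomic intervals of coding sequence
    tpm: float          # expression in the tissue of interest (hypothetical values)
    canonical: bool = False


def classify(pos, transcript):
    """Return 'coding' if pos lies inside any CDS interval, else 'noncoding'."""
    inside = any(start <= pos < end for start, end in transcript.cds)
    return "coding" if inside else "noncoding"


def annotate(pos, transcripts, min_tpm=1.0):
    """Annotate against the most highly expressed transcript.

    Falls back to the canonical transcript when nothing passes the
    (assumed) expression threshold.
    """
    canonical = next(t for t in transcripts if t.canonical)
    expressed = [t for t in transcripts if t.tpm >= min_tpm]
    chosen = max(expressed, key=lambda t: t.tpm) if expressed else canonical
    return chosen.name, classify(pos, chosen)


# Toy gene: the canonical transcript is silent, the alternative one is expressed.
transcripts = [
    Transcript("canonical_tx", cds=((100, 200),), tpm=0.0, canonical=True),
    Transcript("alternative_tx", cds=((300, 400),), tpm=25.0),
]

canonical_call = classify(150, transcripts[0])      # annotation against canonical only
expression_aware_call = annotate(150, transcripts)  # annotation against expressed transcript
print(canonical_call, expression_aware_call)
```

In this toy case the same position is called coding against the canonical transcript and noncoding against the expressed alternative transcript, which is the kind of discrepancy the abstract describes. A real pipeline would also need strand, splice-site and UTR handling, which this sketch omits.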
Benchmarking genetic interaction scoring methods for identifying synthetic lethality from combinatorial CRISPR screens.
Pub Date: 2025-09-26; eCollection Date: 2025-09-01; DOI: 10.1093/nargab/lqaf129
Hamda Ajmal, Sutanu Nandi, Narod Kebabci, Colm J Ryan
NAR Genomics and Bioinformatics, volume 7, issue 3, lqaf129. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12464814/pdf/
Synthetic lethality (SL) is an extreme form of negative genetic interaction in which simultaneous disruption of two non-essential genes causes cell death. SL can be exploited to develop cancer therapies that target tumour cells with specific mutations, potentially limiting toxicity. Pooled combinatorial CRISPR screens, in which two genes are simultaneously perturbed and the resulting impact on fitness is estimated, are now widely used to identify SL targets in cancer. Various scoring methods have been developed to infer SL genetic interactions from these screens, but there has been no systematic comparison of these approaches. Here, we performed a comprehensive analysis of five scoring methods for SL detection using five combinatorial CRISPR datasets. We assessed the performance of each algorithm on each screen dataset using two different benchmarks of paralog SL. We find that no single method performs best across all screens, but we identify two methods that perform well across most datasets. Of these two scores, Gemini-Sensitive is available as an R package that can be applied to most screen designs, making it a reasonable first choice.
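To make the notion of a genetic interaction score concrete, the following minimal sketch uses a generic additive null model: the expected double-knockout effect is the sum of the two single-knockout log fold changes, and a strongly negative deviation from that expectation suggests a candidate SL pair. This is not Gemini-Sensitive or any other method benchmarked in the article; the function names, gene labels, and log fold change values are hypothetical.

```python
def gi_score(lfc_a, lfc_b, lfc_ab):
    """Observed double-knockout LFC minus the additive expectation."""
    expected = lfc_a + lfc_b
    return lfc_ab - expected


def rank_pairs(screen):
    """Sort gene pairs from most negative (candidate SL) to most positive score."""
    scored = {pair: gi_score(*lfcs) for pair, lfcs in screen.items()}
    return sorted(scored.items(), key=lambda item: item[1])


# Hypothetical screen: pair -> (LFC gene A alone, LFC gene B alone, LFC both)
screen = {
    ("A1", "A2"): (-0.25, -0.5, -2.0),   # much worse than expected
    ("B1", "B2"): (-0.5, -0.5, -1.0),    # exactly additive
    ("C1", "C2"): (0.0, -0.25, -0.75),   # mildly worse than expected
}

ranking = rank_pairs(screen)
for pair, score in ranking:
    print(pair, score)
```

Published scoring methods differ mainly in how they estimate the expected effect and how they handle guide-level variability, which is part of why a benchmark across several datasets is informative.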