Pub Date : 2025-12-17 DOI: 10.1186/s12859-025-06350-7
Jack Freeman, Robert J Millikin, Leo Xu, Ishaan Sharma, Bethany Moore, Cannon Lock, Kevin Shine George, Aviral Bal, Chitrasen Mohanty, Ron Stewart
{"title":"SKiM-GPT: combining biomedical literature-based discovery with large language model hypothesis evaluation.","authors":"Jack Freeman, Robert J Millikin, Leo Xu, Ishaan Sharma, Bethany Moore, Cannon Lock, Kevin Shine George, Aviral Bal, Chitrasen Mohanty, Ron Stewart","doi":"10.1186/s12859-025-06350-7","DOIUrl":"10.1186/s12859-025-06350-7","url":null,"abstract":"","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":" ","pages":"16"},"PeriodicalIF":3.3,"publicationDate":"2025-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12829140/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145773432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-12 DOI: 10.1186/s12859-025-06332-9
Jacob Pfeil, Liqian Ma, Hin Ching Lo, Tolga Turan, R Tyler McLaughlin, Xu Shi, Severiano Villarruel, Stephen Wilson, Xi Zhao, Josue Samayoa, Kyle Halliwill
{"title":"Pairwise ratio transformation of gene expression data leads to improved checkpoint response prediction in lung cancer patients.","authors":"Jacob Pfeil, Liqian Ma, Hin Ching Lo, Tolga Turan, R Tyler McLaughlin, Xu Shi, Severiano Villarruel, Stephen Wilson, Xi Zhao, Josue Samayoa, Kyle Halliwill","doi":"10.1186/s12859-025-06332-9","DOIUrl":"10.1186/s12859-025-06332-9","url":null,"abstract":"","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":" ","pages":"15"},"PeriodicalIF":3.3,"publicationDate":"2025-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12809930/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145740262","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-09 DOI: 10.1186/s12859-025-06343-6
Jie Kang, Melanie K Hess, Ken G Dodds, Rudiger Brauning, John C McEwan, Barry J Foote, Judy F Foote, Agnieszka Konkolewska, Shannon M Clarke, Andrew S Hess
{"title":"SimGBS: a rapid method for simulating large-scale genotyping-by-sequencing data.","authors":"Jie Kang, Melanie K Hess, Ken G Dodds, Rudiger Brauning, John C McEwan, Barry J Foote, Judy F Foote, Agnieszka Konkolewska, Shannon M Clarke, Andrew S Hess","doi":"10.1186/s12859-025-06343-6","DOIUrl":"10.1186/s12859-025-06343-6","url":null,"abstract":"","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":" ","pages":"12"},"PeriodicalIF":3.3,"publicationDate":"2025-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12801997/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145712957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-09 DOI: 10.1186/s12859-025-06313-y
Jiahui Sun, Shengli Wu, Xiangjun Shen, Chris Nugent, Hu Lu
To improve the effectiveness and efficiency of biomedical information retrieval, we propose three ranking-based methods for selecting an optimal subset of retrieval systems for data fusion: SFS (Sequential Forward Search), D&P (Diversity & Performance), and P&D (Performance & Diversity). These methods were applied in combination with the Reciprocal Rank Fusion (RRF) technique. Experiments were conducted on four medical datasets from TREC, using between 62 and 125 candidate retrieval systems and selecting up to 15 for fusion. The proposed subset selection methods significantly improved retrieval performance: fusing the selected systems using RRF yielded improvements ranging from 10% to over 60% compared to the best individual retrieval system across the datasets. They also outperformed the state-of-the-art technology by a large margin. In summary, our subset selection approach offers a practical and cost-efficient solution for biomedical information retrieval, achieving substantial performance gains while reducing computational overhead.
{"title":"Subset selection based fusion for biomedical information retrieval tasks.","authors":"Jiahui Sun, Shengli Wu, Xiangjun Shen, Chris Nugent, Hu Lu","doi":"10.1186/s12859-025-06313-y","DOIUrl":"10.1186/s12859-025-06313-y","url":null,"abstract":"<p><p>To improve the effectiveness and efficiency of biomedical information retrieval by proposing ranking-based methods for selecting an optimal subset of retrieval systems for data fusion, we propose three ranking-based subset selection methods SFS (Sequential Forward Search), D&P (Diversity & Performance), and P&D (Performance & Diversity). These methods were applied in combination with the Reciprocal Rank Fusion technique. Experiments were conducted on four medical datasets from TREC, using between 62 and 125 candidate retrieval systems, and selecting up to 15 for fusion. The proposed subset selection methods significantly improved retrieval performance. Fusing the selected systems using RRF yielded improvements ranging from 10% to over 60% compared to the best individual retrieval system across the datasets. They also outperform the state-of-the-art technology by a large margin. In summary, our subset selection approach offers a practical and cost-efficient solution for biomedical information retrieval, achieving substantial performance gains while reducing computational overhead.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":" ","pages":"11"},"PeriodicalIF":3.3,"publicationDate":"2025-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12801601/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145712972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
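As a rough illustration of the fusion step named in the abstract above, the following minimal sketch implements standard Reciprocal Rank Fusion over several ranked lists. The function name, the toy document IDs, and the smoothing constant k = 60 (a commonly used default) are assumptions made for illustration; the abstract does not state the settings used in the paper, and the subset selection methods (SFS, D&P, P&D) are not reproduced here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank starts at 1); documents are returned by descending total score.
    k=60 is an assumed default, not a value taken from the abstract.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Ties are broken by document ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical output of three retrieval systems for one query.
runs = [
    ["d1", "d2", "d3"],
    ["d2", "d3", "d1"],
    ["d2", "d1", "d4"],
]
fused = reciprocal_rank_fusion(runs)
print(fused)
```

Because RRF uses only rank positions, it needs no score normalization across systems, which is one reason it pairs naturally with a step that first chooses which systems to fuse.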
Pub Date : 2025-12-09 DOI: 10.1186/s12859-025-06333-8
Tazin Rahman, Ananth Kalyanaraman
Sequence sketching, a class of techniques aimed at generating compact representations of longer sequences, has become widely used in numerous long-read applications, including assembly and mapping. Instead of comparing full sequences, sketches allow us to sample from a subspace of k-mers and use those samples for comparison, saving both time and memory in the end application. One important metric that determines the performance of a sketch is the sketch density, which refers to the fraction of the sampled k-mers retained by the sketch. While a lower density is preferable for space considerations, it could also affect the sensitivity of the mapping process. In this work, we address the problem of reducing sketch density while preserving accuracy in the context of long-read mapping. We present an efficient algorithm called MHsketch that uses Jaccard estimators to reduce sketch density in mapping applications. Starting from an initial ground set of k-mers generated through a sketching method of choice, the approach applies MinHashing to derive a smaller sketch and uses that for mapping. In addition to reducing density, this approach is also easily parallelizable. To demonstrate the efficacy of our method, we modified a recently developed long-read mapping tool (JEM-mapper) to adopt different sketching schemes, including Syncmer and Strobemer, and incorporated MHsketch to evaluate the effectiveness of downsampling. Experimental evaluation demonstrates the ability of our approach to significantly reduce density and reap performance benefits from it. In particular, our experiments reveal that MHsketch (syncmers) achieves high-quality mapping while reducing time-to-solution (speedups between [Formula: see text] and [Formula: see text]) and drastically reducing memory usage ([Formula: see text] savings) compared to state-of-the-art tools. Availability: https://github.com/TazinRahman1105050/MHsketch
{"title":"Density-reducing Jaccard estimators for sketch-based long read applications.","authors":"Tazin Rahman, Ananth Kalyanaraman","doi":"10.1186/s12859-025-06333-8","DOIUrl":"10.1186/s12859-025-06333-8","url":null,"abstract":"<p><p>Sequence sketching-a class of techniques aimed at generating compact representations of longer sequences-has become widely used in numerous long read applications, including assembly and mapping. Instead of comparing sequences, sketches allow us to sample from a subspace of k-mers and use those samples for comparison, saving both time and memory in the end application. One of the important metrics that determines the performance of a sketch is the sketch density, which refers to the fraction of the sampled k-mers retained by the sketch. While a lower density is preferable for space considerations, it could also impact the sensitivity of the mapping process. In this work, we visit the problem of reducing sketch density while preserving accuracy in the context of long-read mapping. We present an efficient algorithm called MHsketch that uses Jaccard estimators to reduce sketch density in mapping applications. Starting from an initial ground set of k-mers generated through a sketching method of choice, the approach applies MinHashing to derive a smaller sketch and uses that for mapping. In addition to reducing density, this approach is also easily parallelizable. To demonstrate the efficacy of our method, we modified a recently developed long read mapping tool (JEM-mapper) to adopt different sketching schemes, including Syncmer and Strobemer, and incorporated MHsketch to evaluate the effectiveness of downsampling. Experimental evaluation demonstrates the ability of our approach to significantly reduce density and reap performance benefits from it. 
In particular, our experiments reveal that MHsketch (syncmers) achieves high-quality mapping while reducing time-to-solution (speedups between [Formula: see text] to [Formula: see text]), and drastically reducing memory usage ([Formula: see text] savings) compared to state-of-the-art tools. Availability: https://github.com/TazinRahman1105050/MHsketch .</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":" ","pages":"5"},"PeriodicalIF":3.3,"publicationDate":"2025-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12781685/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145712912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
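The general idea of shrinking a k-mer set with MinHashing and then estimating Jaccard similarity from the smaller sketch can be illustrated with a textbook bottom-s MinHash. This is a generic sketch, not the MHsketch algorithm itself: the function names, the use of BLAKE2b as the hash, and the toy sequences are all assumptions, and the ground set here is simply every k-mer rather than syncmers or strobemers.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers of a sequence (the ground set in this toy)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Deterministic 64-bit hash of a k-mer (BLAKE2b is an arbitrary choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_s_sketch(kmer_set, s):
    """Keep only the s smallest hash values: a bottom-s MinHash sketch."""
    return sorted(hash64(x) for x in kmer_set)[:s]


def jaccard_estimate(sketch_a, sketch_b, s):
    """Estimate Jaccard similarity from two bottom-s sketches.

    Take the s smallest hashes of the merged sketches and count how many
    of them occur in both sketches.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


a = kmers("ACGTACGTGG", 4)
b = kmers("ACGTACGTCC", 4)
s = 100  # larger than both sets, so the estimate equals the exact Jaccard here
estimate = jaccard_estimate(bottom_s_sketch(a, s), bottom_s_sketch(b, s), s)
exact = len(a & b) / len(a | b)
print(estimate, exact)
```

Lowering `s` reduces the number of retained hashes (the density) at the cost of a noisier estimate, which is the trade-off the abstract describes.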
Pub Date : 2025-12-08 DOI: 10.1186/s12859-025-06334-7
Szabolcs Makai, Diána Makai, Erika Chonata-Jiménez, Ildikó Karsai, Péter Mikó, Adél Sepsi, András Cseh
Background: Crossovers are essential for genome stability and genetic diversity, yet in plants they occur infrequently, typically restricted to only one to three per chromosome pair. Genotyping approaches such as SNP arrays or genotyping-by-sequencing (GBS) enable high-resolution detection of crossover frequency, a critical step for elucidating the mechanisms that regulate meiotic recombination and for exploiting it in plant breeding. Despite their widespread use and the availability of highly reproducible marker sets, user-friendly tools for reliable recombination analysis remain scarce.
Results: Here we present X-cross/over, a web-based platform that applies a graph-theoretical algorithm to estimate crossover frequencies from SNP datasets in HapMap format. The platform was evaluated using publicly available barley backcross inbred populations and newly developed wheat doubled haploid lines. Across both datasets, X-cross/over detected crossover events with high accuracy and sensitivity, yielding results consistent with published genotyping and cytological analyses. Importantly, the tool produces outcomes comparable to expert analyses while remaining accessible to users without bioinformatics expertise.
Conclusions: X-cross/over provides a consistent and transparent framework for detecting crossover sites and quantifying their frequency. Implemented in a platform-independent environment, the application is freely available at https://insilicolabdesk.atk.kinin.hu , making it a versatile resource for exploring the genetic and epigenetic regulation of meiotic recombination across plant species.
{"title":"X-cross/over: a web tool for graph-based estimation of meiotic crossover events in plants.","authors":"Szabolcs Makai, Diána Makai, Erika Chonata-Jiménez, Ildikó Karsai, Péter Mikó, Adél Sepsi, András Cseh","doi":"10.1186/s12859-025-06334-7","DOIUrl":"10.1186/s12859-025-06334-7","url":null,"abstract":"<p><strong>Background: </strong>Crossovers are essential for genome stability and genetic diversity, yet in plants they occur infrequently, typically restricted to only one to three per chromosome pair. Genotyping approaches such as SNP arrays or genotyping-by-sequencing (GBS) enable high-resolution detection of crossover frequency, a critical step for elucidating the mechanisms that regulate meiotic recombination and for exploiting it in plant breeding. Despite their widespread use and the availability of highly reproducible marker sets, user-friendly tools for reliable recombination analysis remain scarce.</p><p><strong>Results: </strong>Here we present X-cross/over, a web-based platform that applies a graph-theoretical algorithm to estimate crossover frequencies from SNP datasets in HapMap format. The platform was evaluated using publicly available barley backcross inbred populations and newly developed wheat doubled haploid lines. Across both datasets, X-cross/over detected crossover events with high accuracy and sensitivity, yielding results consistent with published genotyping and cytological analyses. Importantly, the tool produces outcomes comparable to expert analyses while remaining accessible to users without bioinformatics expertise.</p><p><strong>Conclusions: </strong>X-cross/over provides a consistent and transparent framework for detecting crossover sites and quantifying their frequency. 
Implemented in a platform-independent environment, the application is freely available at https://insilicolabdesk.atk.kinin.hu , making it a versatile resource for exploring the genetic and epigenetic regulation of meiotic recombination across plant species.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":" ","pages":"21"},"PeriodicalIF":3.3,"publicationDate":"2025-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12849401/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145707288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
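To make the notion of counting crossovers from marker data concrete, here is a deliberately naive baseline: it counts switches of parental origin along one chromosome of a single line. It is not the graph-theoretical algorithm used by X-cross/over, which the abstract does not describe. The function name and the encoding are assumptions: markers are taken to be ordered by position and already recoded as "A" or "B" for the two parents, with "N" for missing calls, rather than raw HapMap nucleotide calls.

```python
def count_crossovers(calls, missing="N"):
    """Count parental-origin switches along one chromosome.

    `calls` is a sequence of per-marker codes ordered by position
    (assumed encoding: "A" / "B" for the two parents, "N" for missing).
    Missing calls are skipped, so "A N B" counts as a single switch.
    """
    informative = [c for c in calls if c != missing]
    return sum(1 for prev, cur in zip(informative, informative[1:]) if prev != cur)


# Hypothetical doubled haploid line with two switches on this chromosome.
print(count_crossovers("AAAANBBBBAA"))
```

A baseline like this treats every isolated miscalled marker as two crossovers, which is one reason more robust approaches are needed in practice.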
Pub Date : 2025-12-08 DOI: 10.1186/s12859-025-06341-8
Martin Engst, Martin Brokeš, Tereza Čalounová, Raman Samusevich, Roman Bushuiev, Anton Bushuiev, Ratthachat Chatpatanasiri, Adéla Tajovská, Safa Mert Akmeşe, Milana Perković, Matouš Soldát, Josef Sivic, Tomáš Pluskal
Background: Terpene synthases (TPSs) are enzymes that catalyze some of the most complex reactions in nature: the cyclizations of terpenes, which form the carbon backbones of the largest group of natural products, the terpenoids. On average, more than half of the carbon atoms in a terpene scaffold undergo a change in connectivity or configuration during these enzymatic cascades. Understanding TPS reaction mechanisms remains challenging, often requiring intricate computational modeling and isotopic labelling studies. Moreover, the relationship between TPS sequence and catalytic function is difficult to decipher, while data-driven approaches remain limited due to a lack of comprehensive, high-quality data sources.
Main: We introduce the Mechanisms And Reactions of Terpene Synthases DataBase (MARTS-DB), a manually curated, structured, and searchable database that integrates TPS enzymes, the terpenes they produce, and their detailed reaction mechanisms. MARTS-DB includes over 2850 reactions catalyzed by 1432 annotated enzymes from across all domains of life, with reaction mechanisms mapped as stepwise cascades for more than 500 terpenes. Accessible at https://www.marts-db.org, the database provides advanced search functionality and supports full dataset downloads in machine-readable formats. It also encourages community contributions to promote continuous growth.
Conclusion: User-friendly and comprehensive, MARTS-DB enables the systematic exploration of TPS catalysis, opening new avenues for computational analysis and machine learning, as recently demonstrated in the prediction of novel TPSs.
{"title":"MARTS-DB: a database of mechanisms and reactions of terpene synthases.","authors":"Martin Engst, Martin Brokeš, Tereza Čalounová, Raman Samusevich, Roman Bushuiev, Anton Bushuiev, Ratthachat Chatpatanasiri, Adéla Tajovská, Safa Mert Akmeşe, Milana Perković, Matouš Soldát, Josef Sivic, Tomáš Pluskal","doi":"10.1186/s12859-025-06341-8","DOIUrl":"10.1186/s12859-025-06341-8","url":null,"abstract":"<p><strong>Background: </strong>Terpene synthases (TPSs) are enzymes that catalyze some of the most complex reactions in nature-the cyclizations of terpenes, which form the carbon backbones to the largest group of natural products, the terpenoids. On average, more than half of the carbon atoms in a terpene scaffold undergo a change in connectivity or configuration during these enzymatic cascades. Understanding TPS reaction mechanisms remains challenging, often requiring intricate computational modeling and isotopic labelling studies. Moreover, the relationship between TPS sequence and catalytic function is difficult to decipher, while data-driven approaches remain limited due to a lack of comprehensive, high-quality data sources. MAIN: We introduce the Mechanisms And Reactions of Terpene Synthases DataBase (MARTS-DB)-a manually curated, structured, and searchable database that integrates TPS enzymes, the terpenes they produce, and their detailed reaction mechanisms. MARTS-DB includes over 2850 reactions catalyzed by 1432 annotated enzymes from across all domains of life, with reaction mechanisms mapped as stepwise cascades for more than 500 terpenes. Accessible at https://www.marts-db.org , the database provides advanced search functionality and supports full dataset downloads in machine-readable formats. 
It also encourages community contributions to promote continuous growth.</p><p><strong>Conclusion: </strong>User-friendly and comprehensive, MARTS-DB enables the systematic exploration of TPS catalysis, opening new avenues for computational analysis and machine learning, as recently demonstrated in the prediction of novel TPSs.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":" ","pages":"10"},"PeriodicalIF":3.3,"publicationDate":"2025-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12797696/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145707303","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-07 DOI: 10.1186/s12859-025-06342-7
Chun Hing She, Sophelia Hoi-Shan Chan, Wanling Yang
Background: Accurate structural variant detection from short-read sequencing data remains challenged by false positives, particularly for heterozygous deletions, where reduced allelic support makes coverage-based detection ambiguous. Existing SV genotyping and filtering approaches suffer from significant recall reductions, dependencies on additional pre-computed resources, or restriction to depth-based signals that overlook read-level evidence.
Results: Here we present SVhet, a novel computational framework that leverages the heterozygosity patterns detected from different types of read evidence to identify false heterozygous deletions. Comprehensive benchmarking using 31 Human Genome Structural Variation Consortium Phase 3 samples demonstrated SVhet's ability to further reduce false positives while maintaining baseline recall. A hybrid approach combining duphold and SVhet achieved up to a 60% reduction in false positive counts while preserving recall. We also showed SVhet to be computationally efficient: it can process a whole-genome structural variant callset in under 5 min using 4 CPU cores. SVhet is available under a permissive MIT license via https://github.com/snakesch/SVhet .
Conclusion: SVhet provides an accurate and efficient solution for evaluating heterozygous deletions derived from short read sequencing data. SVhet can be used as a standalone tool or in conjunction with other filtering tools such as duphold. Importantly, it does not require additional variant sets, and can operate with minimal compute. Altogether, SVhet adds to the current effort to achieve accurate structural variant detection using short reads.
{"title":"SVhet: towards accurate detection of germline heterozygous deletions using short reads.","authors":"Chun Hing She, Sophelia Hoi-Shan Chan, Wanling Yang","doi":"10.1186/s12859-025-06342-7","DOIUrl":"10.1186/s12859-025-06342-7","url":null,"abstract":"<p><strong>Background: </strong>Accurate structural variant detection from short-read sequencing data remains challenged by false positives, particularly for heterozygous deletions where reduced allelic support and coverage-based detection methods are ambiguous. Existing SV genotyping and filtering approaches suffer from significant recall reductions, dependencies on additional pre-computed resources, or restriction to depth-based signals that overlook read level evidence.</p><p><strong>Results: </strong>Here we present SVhet, a novel computational framework that leverages the heterozygosity patterns detected from different read evidences to identify false heterozygous deletions. Comprehensive benchmarking using 31 Human Genome Structural Variation Consortium Phase 3 samples demonstrated SVhet's ability to further reduce false positives while maintaining baseline recall. Hybrid approach of duphold and SVhet achieved up to 60% reduction in false positive counts while preserving recall. We also showed SVhet to be computationally efficient that can complete a whole genome structural variant callset under 5 min using 4 CPU cores. SVhet is available under a permissive MIT license via https://github.com/snakesch/SVhet .</p><p><strong>Conclusion: </strong>SVhet provides an accurate and efficient solution for evaluating heterozygous deletions derived from short read sequencing data. SVhet can be used as a standalone tool or in conjunction with other filtering tools such as duphold. Importantly, it does not require additional variant sets, and can operate with minimal compute. 
Altogether, SVhet adds to the current effort to achieve accurate structural variant detection using short reads.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":" ","pages":"9"},"PeriodicalIF":3.3,"publicationDate":"2025-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12798059/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145699631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-07 DOI: 10.1186/s12859-025-06335-6
Annekathrin Silvia Nedwed, Arsenij Ustjanzew, Najla Abassi, Leon Dammer, Alicia Schulze, Sara Salome Helbich, Michael Delacher, Konstantin Strauch, Federico Marini
{"title":"GeDi: simplifying gene set distances for enhanced omics interpretation in R/Bioconductor.","authors":"Annekathrin Silvia Nedwed, Arsenij Ustjanzew, Najla Abassi, Leon Dammer, Alicia Schulze, Sara Salome Helbich, Michael Delacher, Konstantin Strauch, Federico Marini","doi":"10.1186/s12859-025-06335-6","DOIUrl":"10.1186/s12859-025-06335-6","url":null,"abstract":"","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":" ","pages":"14"},"PeriodicalIF":3.3,"publicationDate":"2025-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12809992/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145699681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-07 DOI: 10.1186/s12859-025-06329-4
Yaxin Fan, Yichao Mei, Shengbao Bao, Jianyong Wang, Junxiang Gao
Background: Transcription factors and their target genes form regulatory modules known as regulons, which exhibit significant specificity across various cell types. The integration of single-cell transcriptome data, transcription factor motif data, and ChIP-seq data presents a challenging task in identifying cell-type-specific regulons and examining their activities.
Results: In response, this study presents a Deep Structured Semantic Model for inferring and prioritizing cell-type-specific Regulons (DSSMReg). This approach utilizes single-cell transcriptome and transcription factor motif data to map transcription factors and target genes into a low-dimensional semantic space, resulting in the generation of feature vectors. The model then computes the cosine similarity between transcription factors and target genes to evaluate their regulatory strength and subsequently infers cell-type-specific regulons based on this assessment. Moreover, DSSMReg employs the AUCell algorithm to rank the importance of regulons for each cell type.
Conclusions: We compared DSSMReg against five representative gene regulatory inference algorithms using scRNA-seq data from five cell lines, with DSSMReg achieving the highest evaluation metrics for both AUROC and AUPRC. Furthermore, we applied DSSMReg to infer cell-type-specific regulons from scRNA-seq data of triple-negative breast cancer and human bone marrow hematopoietic stem cells. Our results indicated that regulons with high AUCell scores possess significant biological relevance. The source code of DSSMReg is freely available at https://github.com/YaxinF/DSSMReg .
{"title":"A DSSM network for inferring and prioritizing cell-type-specific regulons using single-cell RNA-seq data.","authors":"Yaxin Fan, Yichao Mei, Shengbao Bao, Jianyong Wang, Junxiang Gao","doi":"10.1186/s12859-025-06329-4","DOIUrl":"10.1186/s12859-025-06329-4","url":null,"abstract":"<p><strong>Background: </strong>Transcription factors and their target genes form regulatory modules known as regulons, which exhibit significant specificity across various cell types. The integration of single-cell transcriptome data, transcription factor motif data, and ChIP-seq data presents a challenging task in identifying cell-type-specific regulons and examining their activities.</p><p><strong>Results: </strong>In response, this study presents a Deep Structured Semantic Model for inferring and prioritizing cell-type-specific Regulons (DSSMReg). This approach utilizes single-cell transcriptome and transcription factor motif data to map transcription factors and target genes into a low-dimensional semantic space, resulting in the generation of feature vectors. The model then computes the cosine similarity between transcription factors and target genes to evaluate their regulatory strength and subsequently infers cell-type-specific regulons based on this assessment. Moreover, DSSMReg employs the AUCell algorithm to rank the importance of regulons for each cell type.</p><p><strong>Conclusions: </strong>We compared DSSMReg against five representative gene regulatory inference algorithms using scRNA-seq data from five cell lines, with DSSMReg achieving the highest evaluation metrics for both AUROC and AUPRC. Furthermore, we applied DSSMReg to infer cell-type-specific regulons from scRNA-seq data of triple-negative breast cancer and human bone marrow hematopoietic stem cells. Our results indicated that regulons with high AUCell scores possess significant biological relevance. 
The source code of DSSMReg is freely available at https://github.com/YaxinF/DSSMReg .</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":" ","pages":"8"},"PeriodicalIF":3.3,"publicationDate":"2025-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12798040/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145699660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
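The scoring step described in the DSSMReg abstract, cosine similarity between a transcription factor vector and target gene vectors in a shared low-dimensional space, can be sketched as follows. Only the similarity computation is shown; the embedding network itself is not reproduced. The function names, the two-dimensional toy vectors, and the gene labels are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_targets(tf_vector, gene_vectors):
    """Rank candidate target genes by similarity to one transcription factor.

    `gene_vectors` maps a gene name to its embedding; higher similarity is
    read here as stronger putative regulation (an illustrative assumption).
    """
    scored = [(gene, cosine_similarity(tf_vector, vec)) for gene, vec in gene_vectors.items()]
    return sorted(scored, key=lambda pair: -pair[1])


# Hypothetical 2-D embeddings for one TF and three candidate targets.
tf_embedding = [1.0, 0.0]
gene_embeddings = {
    "geneA": [1.0, 0.1],
    "geneB": [0.0, 1.0],
    "geneC": [-1.0, 0.0],
}
ranking = rank_targets(tf_embedding, gene_embeddings)
print([gene for gene, _ in ranking])
```

Cosine similarity depends only on direction, so two embeddings with very different magnitudes can still score as strongly related.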