Pub Date: 2026-01-30 | DOI: 10.1186/s12859-026-06383-6
Guojing Cong, Frank Chao, Daniel L Svoboda, Jeremy N Erickson, Michele R Balik-Meisner, Deepak Mav, Dhiral P Phadke, Elizabeth H Scholl, Ruchir R Shah, Scott S Auerbach
Background: Transcriptomic profiling technologies have advanced the analysis of biological and toxicological responses. However, substantial differences in probe design, dynamic range, gene coverage, and preprocessing pipelines across platforms introduce artifacts that limit cross-study integration and hinder the reuse of historical datasets. We aim to develop computational methods for accurate cross-platform translation to maximize the value of legacy resources.
Results: We present TransPlatformer, a deep learning framework for translating gene expression profiles across heterogeneous toxicogenomics platforms. TransPlatformer employs a novel attention-based architecture to map high-dimensional fold-change vectors from legacy microarray technologies to current platforms. Models are trained and evaluated on DrugMatrix, a dataset spanning three technological generations. We investigate mixed-tissue, single-tissue, and cross-tissue training paradigms and benchmark performance against multilayer perceptron and matrix-completion baselines. In mixed-tissue training, TransPlatformer achieves a greater than 50% reduction in mean absolute error (0.043 vs. 0.09) and nearly doubles Pearson correlation (≈ 0.71 vs. 0.37) relative to baseline methods. Importantly, TransPlatformer preserves rare but biologically meaningful over- and under-expressed signals, with mean absolute error below 0.22. Single-tissue models yield further improvements for well-represented organs, such as a 10% reduction in liver mean absolute error, while underscoring the need for data augmentation strategies in low-sample tissues.
Conclusions: TransPlatformer provides an effective and scalable computational solution for cross-platform transcriptomic translation. By enabling biologically faithful harmonization of gene expression data, the proposed approach facilitates the reuse of legacy toxicogenomics datasets, enhances downstream biomarker discovery, and supports more reproducible predictive modeling in toxicology.
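The headline numbers here (mean absolute error and Pearson correlation between translated and measured fold-change profiles) are standard metrics. A minimal sketch of how such an evaluation is computed, using a hypothetical `evaluate_translation` helper on synthetic data (not the paper's code or data):

```python
import numpy as np

def evaluate_translation(predicted, target):
    """Compare translated fold-change vectors against measured ones.

    Both inputs are (n_profiles, n_genes) arrays of log fold changes.
    Returns the overall mean absolute error and the mean per-profile
    Pearson correlation.
    """
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    mae = np.mean(np.abs(predicted - target))
    # Pearson correlation computed per profile, then averaged
    corrs = [np.corrcoef(p, t)[0, 1] for p, t in zip(predicted, target)]
    return float(mae), float(np.mean(corrs))

# Toy example: "translations" are the targets plus small noise
rng = np.random.default_rng(0)
target = rng.normal(0.0, 1.0, size=(5, 100))
predicted = target + rng.normal(0.0, 0.1, size=target.shape)
mae, r = evaluate_translation(predicted, target)
```

On this toy data the translation is nearly perfect, so the MAE is small and the correlation is close to 1; the paper's values reflect the much harder real cross-platform setting.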
Title: "Transplatformer: translating toxicogenomic profiles between generations of platforms." BMC Bioinformatics.
Pub Date: 2026-01-29 | DOI: 10.1186/s12859-025-06339-2
Fang Liu, Rui He, Thomas Sheeley, David A Scheiblin, Stephen J Lockett, Lisa A Ridnour, David A Wink, Mark Jensen, Janelle Cortner, George Zaki
Background: Advancements in spatially resolved single-cell technologies are transforming our understanding of tissue architecture and disease microenvironments. However, analyzing the resulting high-dimensional, gigabyte-scale datasets remains challenging due to fragmented workflows, intensive computational requirements, and a lack of accessible, user-friendly tools for non-technical researchers.
Results: We introduce SPAC (analysis of SPAtial single-Cell datasets), a scalable, web-based platform for efficient and reproducible single-cell spatial analysis. SPAC employs a four-tier architecture that includes a modular Python-based analysis engine, seamless integration with high-performance computing (HPC) and GPU acceleration, an interactive browser interface for no-code workflow configuration, and a real-time visualization layer powered by Shiny for Python dashboards. This design empowers distinct user roles: data scientists can extend and customize analysis modules, while bench scientists can execute complete workflows and interactively explore results without coding. Built-in reproducibility features and collaborative workflow support ensure that analyses are transparent and easily shared across research teams. Using a 2.6-million-cell multiplex imaging dataset from a 4T1 breast tumor model as a benchmark, SPAC reduced unsupervised clustering time from ~3 hours on a CPU to under 10 minutes with GPU acceleration, achieving more than a 20-fold speedup. It also enabled fine-grained spatial profiling of distinct tumor microenvironment compartments, demonstrating the platform's scalability and performance.
Conclusions: SPAC addresses major barriers in single-cell spatial analysis by uniting an intuitive, user-friendly interface with scalable, high-performance computation in a robust and reproducible framework. By streamlining complex analyses and bridging the gap between experimental and computational researchers, SPAC fosters collaborative workflows and accelerates the transformation of large-scale spatial datasets into actionable biological insights.
Title: "SPAC: a scalable and integrated enterprise platform for single-cell spatial analysis." BMC Bioinformatics 27(1): 25. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12857135/pdf/
Pub Date: 2026-01-28 | DOI: 10.1186/s12859-025-06348-1
Nabil Rahiman, Michael A Ochsenkühn, Shady A Amin, Kristin C Gunsalus
Title: "MIMI: Molecular Isotope Mass Identifier for stable isotope-labeled Fourier transform ultra-high mass resolution data analysis." BMC Bioinformatics 27(1): 41. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12879430/pdf/
Pub Date: 2026-01-24 | DOI: 10.1186/s12859-026-06367-6
Saman Zabihi, Sattar Hashemi, Eghbal Mansoori
Background: DNA sequences are fundamental carriers of genetic information, and their accurate classification is essential for understanding gene regulation, disease mechanisms, and translational genomics. Existing encoding methods often fail to capture both local and long-range dependencies simultaneously.
Results: We introduce EDEN (Expected Density of Nucleotide Encoding), a unified multiscale encoding framework based on kernel density estimation. EDEN captures position-specific and context-dependent nucleotide patterns and integrates them into a hybrid deep learning architecture. Across sixteen benchmark datasets covering promoter detection, core promoter detection, and transcription factor binding prediction, EDEN achieves the best average performance while using orders of magnitude fewer parameters compared with state-of-the-art models. All source code, pretrained models, and datasets are publicly available at https://github.com/zabihis/EDEN.
Conclusions: EDEN provides an efficient, biologically informed, and interpretable multiscale representation for genomic sequence classification. Its favorable parameter-performance ratio and robust consistency across tasks underscore its practicality for large-scale genomic applications.
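To make the idea of a kernel-density positional encoding concrete, here is a rough sketch under assumed design choices (Gaussian kernels centred at each base occurrence, per-position normalisation) — an illustration of the general technique, not the authors' EDEN implementation:

```python
import numpy as np

def kde_nucleotide_encoding(seq, bandwidth=2.0):
    """Encode a DNA sequence as per-base kernel density profiles.

    For each of A, C, G, T, a Gaussian kernel is centred at every
    position where that base occurs; the value at position i is the
    summed kernel mass, giving a smoothed view of local base
    composition whose scale is set by `bandwidth`.
    Returns an array of shape (len(seq), 4) with rows summing to 1.
    """
    bases = "ACGT"
    positions = np.arange(len(seq), dtype=float)
    enc = np.zeros((len(seq), 4))
    for j, b in enumerate(bases):
        hits = positions[[c == b for c in seq]]
        if hits.size:
            # sum of Gaussian kernels centred at each occurrence of base b
            d = positions[:, None] - hits[None, :]
            enc[:, j] = np.exp(-0.5 * (d / bandwidth) ** 2).sum(axis=1)
    # normalise each position to a distribution over the four bases
    enc /= enc.sum(axis=1, keepdims=True)
    return enc

profile = kde_nucleotide_encoding("ACGTACGTAAAA", bandwidth=1.5)
```

Varying the bandwidth yields coarser or finer density profiles, which is one natural way to obtain a multiscale representation that a downstream network can consume.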
Title: "EDEN: multiscale expected density of nucleotide encoding for enhanced DNA sequence classification with hybrid deep learning." BMC Bioinformatics, p. 40. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12879454/pdf/
Pub Date: 2026-01-24 | DOI: 10.1186/s12859-026-06373-8
Charlie Bayne, Brianna Hurysz, David J Gonzalez, Anthony O'Donoghue
Background: Multiplex Substrate Profiling by Mass Spectrometry (MSP-MS) is a powerful method for determining the substrate specificity of proteolytic enzymes, which is essential for developing protease inhibitors, diagnostics, and protease-activated therapeutics. However, the complex datasets generated by MSP-MS pose significant analytical challenges and have limited accessibility for non-specialist users.
Results: We developed mspms, a Bioconductor R package with an accompanying graphical interface, to streamline the analysis of MSP-MS data. The mspms package standardizes workflows for data preparation, processing, statistical analysis, and visualization. The tool is designed for accessibility, serving advanced users through the R package and broader audiences through a web-based interface. We validated mspms using data from four well-characterized cathepsins (A-D), demonstrating that it reliably captures expected substrate specificities.
Conclusions: mspms is the first publicly available, comprehensive platform for MSP-MS data analysis downstream of peptide identification and quantification. It integrates preprocessing, normalization, statistical testing, and visualization into a single, transparent, and user-friendly framework, making it a valuable resource for the protease research community. The package is distributed via Bioconductor, and a graphical interface is available online for interactive use.
Title: "mspms: an R package and GUI for multiplex substrate profiling by mass spectrometry." BMC Bioinformatics.
Pub Date: 2026-01-23 | DOI: 10.1186/s12859-025-06364-1
Artem Ershov, Renpeng Ding, Qian Fu, Ivan Kozlov, Ekaterina Fadeeva, Evgeniy Mozheiko, Ming Ni, Yong Hou, Yan Zhou
Background: High-throughput sequencing technologies generate massive amounts of FASTQ data comprising nucleotide sequences, quality scores, and read identifiers, necessitating efficient compression to alleviate storage and transmission burdens. Compared to general-purpose compressors, specialized FASTQ compressors achieve higher compression performance by exploiting the inherent redundancy in FASTQ files. However, existing FASTQ-specialized compressors often suffer from limited data applicability and tend to over-optimize either compression ratio or compression speed at the expense of the other.
Results: We present zDUR, a reference-free FASTQ compressor designed for efficient and scalable handling of next-generation sequencing data across diverse platforms and sequencing data types. Benchmarking against six reference-free compressors on 15 representative datasets spanning four sequencing data types demonstrates that zDUR achieves a favorable overall balance between compression ratio and speed, with broad applicability across data types. In particular, on single-cell RNA-seq and spatial transcriptomics datasets, zDUR achieves over a tenfold increase in runtime performance while maintaining higher compression ratios than SPRING, one of the state-of-the-art reference-free FASTQ compressors.
Conclusions: zDUR offers a scalable and efficient solution for reference-free FASTQ compression, balancing performance, speed, and usability across diverse datasets.
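The ratio-versus-speed trade-off discussed above can be measured per compressor. A minimal sketch using Python's general-purpose `gzip` as a stand-in compressor on a tiny synthetic FASTQ block (real benchmarks use full files and the specialized tools themselves):

```python
import gzip
import time

def compression_stats(data: bytes, level: int = 6):
    """Return (compression ratio, throughput in MB/s) for one setting."""
    t0 = time.perf_counter()
    compressed = gzip.compress(data, compresslevel=level)
    elapsed = time.perf_counter() - t0
    ratio = len(data) / len(compressed)          # higher is better
    mb_per_s = len(data) / 1e6 / max(elapsed, 1e-9)
    return ratio, mb_per_s

# A synthetic FASTQ record repeated many times; FASTQ files interleave
# identifiers, sequences, and quality strings, each with its own redundancy.
record = b"@read1\nACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIII\n"
ratio, speed = compression_stats(record * 10000)
```

Specialized FASTQ compressors gain over this baseline by modeling the three streams (identifiers, bases, qualities) separately rather than as one undifferentiated byte stream.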
Title: "zDUR: reference-free FASTQ compressor with high compression ratio and speed." BMC Bioinformatics.
Pub Date: 2026-01-23 | DOI: 10.1186/s12859-026-06370-x
Grazia Gargano, Flavia Esposito, Nicoletta Del Buono, Sabino Ciavarella, Maria Carmela Vegliante
Title: "Identification of differentially expressed genes in RNA-seq data via semi-rigid orthogonal sparse KL-NMTF." BMC Bioinformatics, p. 39. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12874962/pdf/
Pub Date: 2026-01-23 | DOI: 10.1186/s12859-026-06375-6
Simone Montalbano, G Bragi Walters, Gudbjorn F Jonsson, Jesper R Gådin, Thomas Werge, Daniel F Gudbjartsson, Hreinn Stefansson, Andrés Ingason
Background: Large, rare copy number variants (CNVs) are a main source of genetic variation in the genome and are important in both evolution and disease risk. CNVs can be detected using different data sources, including genome sequencing, genotyping arrays and quantitative PCR experiments, but in most large cohorts, genotyping arrays remain the most prevalent source. Current methods to call CNVs from genotyping array data suffer from high false positive rates and while multiple approaches, including QC filtering, visual inspection of intensity tracks, and wet-lab validation are commonly applied to counter this problem, such methods are often non-specific (QC filtering) or inefficient (visual and wet-lab validations) at a genome-wide scale.
Results: We have assembled the largest collection of human-verified CNV calls using visual validation, totalling almost 60,000 calls from 22,500 samples from three cohorts genotyped on several different arrays. Across all cohorts our visual validation found the majority of CNV calls to be false positive (53.7%) or unclear (9.7%). The false positive fraction varied substantially across datasets and genomic regions, and we show that existing filtering methods based on QC metrics are inefficient in removing false calls. Given the superiority of visual validation over existing filtering methods in controlling the false positive fraction, we used a subset of our visual validation dataset to train a convolutional neural network to automate the validation of CNVs through machine vision. We tested the efficacy of the model using the remainder of the dataset and found the performance exceeded 90% in most measures, approximating that of a human analyst. Orthogonal validation with genome sequencing data found our visual validation to be highly accurate, with only 1.7% of calls supported by the sequencing dataset deemed as false by the human analyst, and a further 7.5% deemed as unclear.
Conclusions: Visual inspection is the only effective validation approach for CNV calls. Our model is capable of automating this task at scale with very high accuracy, as shown by testing both within-sample and out-of-sample. The software is available as an R package at https://github.com/SinomeM/CNValidatron_fl.
Title: "CNValidatron: accurate and efficient validation of PennCNV calls using computer vision." BMC Bioinformatics.
Pub Date: 2026-01-22 | DOI: 10.1186/s12859-025-06362-3
Hyotae Kim, Nazema Y Siddiqui, Lisa Karstens, Li Ma
Background: Microbiome sequencing data are often collected from several body sites and exhibit dependencies. Our objective is to develop a model that enables joint analysis of data from different sites by capturing the underlying cross-site dependencies. The proposed model incorporates (i) latent factors shared across sites to explain common subject effects and to serve as the source of correlation between the sites and (ii) mixtures of latent factors to allow heterogeneity among the subjects in cross-site associations.
Results: Our simulation studies demonstrate that stronger associations between two sites lead to greater efficiency loss in regression analysis when such dependence is ignored in modeling. In a case study involving samples collected from a study on the female urogenital microbiome with aging, our model leads to the detection of covariate associations of the vaginal and urine microbiomes that are otherwise not statistically significant under a similar regression model applied to the two sites separately.
Conclusions: We propose a latent factor model for microbiome sequencing data collected from multiple sites. It captures the presumptive underlying cross-site associations without compromising estimation accuracy or inference efficiency in the absence of such associations. In addition, our proposed model improves predictive performance by enabling the prediction of microbial abundance at one site based on observations from another. We also provide an extended framework that allows for clustering of subjects (samples) and cluster-specific levels of paired association. Under this extended framework, clusters can be classified according to their association strengths.
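A generic negative binomial latent factor regression of this kind can be written as follows — a sketch with assumed notation (counts, depth offset, shared factors), not the authors' exact specification:

```latex
% y_{ijs}: count of taxon j in subject i at body site s;
% f_i: latent factor vector shared across sites (source of cross-site
% correlation); N_{is}: sequencing depth; notation assumed for illustration.
\begin{aligned}
y_{ijs} &\sim \mathrm{NB}\!\left(\mu_{ijs},\, \phi_{js}\right), \\
\log \mu_{ijs} &= \log N_{is} + x_i^{\top} \beta_{js} + \lambda_{js}^{\top} f_i,
\end{aligned}
```

where \(\beta_{js}\) are site- and taxon-specific covariate effects and \(\lambda_{js}\) are factor loadings. Because the same \(f_i\) enters the mean at every site, observing one site is informative about the others, which is what enables the cross-site prediction described above; replacing the single \(f_i\) with a mixture over latent factors gives the clustered, heterogeneous-association extension.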
Title: "A negative binomial latent factor model for paired microbiome sequencing data." BMC Bioinformatics.