BMC Bioinformatics: Latest Articles

Transplatformer: translating toxicogenomic profiles between generations of platforms.
IF 3.3, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2026-01-30, DOI: 10.1186/s12859-026-06383-6
Guojing Cong, Frank Chao, Daniel L Svoboda, Jeremy N Erickson, Michele R Balik-Meisner, Deepak Mav, Dhiral P Phadke, Elizabeth H Scholl, Ruchir R Shah, Scott S Auerbach

Background: Transcriptomic profiling technologies have advanced the analysis of biological and toxicological responses. However, substantial differences in probe design, dynamic range, gene coverage, and preprocessing pipelines across platforms introduce artifacts that limit cross-study integration and hinder the reuse of historical datasets. We aim to develop computational methods for accurate cross-platform translation to maximize the value of legacy resources.

Results: We present TransPlatformer, a deep learning framework for translating gene expression profiles across heterogeneous toxicogenomics platforms. TransPlatformer employs a novel attention-based architecture to map high-dimensional fold-change vectors from legacy microarray technologies to current platforms. Models are trained and evaluated using DrugMatrix, spanning three technological generations. We investigate mixed-tissue, single-tissue, and cross-tissue training paradigms and benchmark performance against multilayer perceptron and matrix-completion baselines. In mixed-tissue training, TransPlatformer achieves a greater than 50% reduction in mean absolute error (0.043 vs. 0.09) and nearly doubles Pearson correlation (≈ 0.71 vs. 0.37) relative to baseline methods. Importantly, TransPlatformer preserves rare but biologically meaningful over- and under-expressed signals, with mean absolute error below 0.22. Single-tissue models yield further improvements for well-represented organs, such as a 10% reduction in liver mean absolute error, while underscoring the need for data augmentation strategies in low-sample tissues.

Conclusions: TransPlatformer provides an effective and scalable computational solution for cross-platform transcriptomic translation. By enabling biologically faithful harmonization of gene expression data, the proposed approach facilitates the reuse of legacy toxicogenomics datasets, enhances downstream biomarker discovery, and supports more reproducible predictive modeling in toxicology.
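
The two headline metrics in this abstract can be checked with a little arithmetic. The sketch below, in plain Python, shows how mean absolute error and Pearson correlation are typically computed for a pair of fold-change vectors, and then recomputes the relative improvements from the figures quoted above. The function names and the toy vectors are illustrative assumptions; this is not the authors' evaluation code.

```python
import math


def mean_absolute_error(predicted, measured):
    """Average absolute difference between two equal-length profiles."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("profiles must be non-empty and of equal length")
    return sum(abs(p - m) for p, m in zip(predicted, measured)) / len(predicted)


def pearson_correlation(predicted, measured):
    """Pearson r between two equal-length profiles."""
    n = len(predicted)
    if n != len(measured) or n < 2:
        raise ValueError("need two profiles of equal length, at least 2 values")
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    if var_p == 0 or var_m == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_p * var_m)


def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# Hypothetical toy fold-change values, not data from the paper.
measured = [0.0, 0.5, -1.0, 2.0]
predicted = [0.1, 0.4, -0.8, 1.9]
toy_mae = mean_absolute_error(predicted, measured)
toy_r = pearson_correlation(predicted, measured)

# Figures quoted in the abstract for mixed-tissue training.
mae_reduction = relative_reduction(0.09, 0.043)  # roughly 0.52
correlation_gain = 0.71 / 0.37                   # roughly 1.92
```

With the quoted numbers, the error reduction works out to roughly 52% and the correlation ratio to roughly 1.9, which is consistent with "greater than 50%" and "nearly doubles".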

Citations: 0
SPAC: a scalable and integrated enterprise platform for single-cell spatial analysis.
IF 3.3, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2026-01-29, DOI: 10.1186/s12859-025-06339-2
Fang Liu, Rui He, Thomas Sheeley, David A Scheiblin, Stephen J Lockett, Lisa A Ridnour, David A Wink, Mark Jensen, Janelle Cortner, George Zaki

Background: Advancements in spatially resolved single-cell technologies are transforming our understanding of tissue architecture and disease microenvironments. However, analyzing the resulting high-dimensional, gigabyte-scale datasets remains challenging due to fragmented workflows, intensive computational requirements, and a lack of accessible, user-friendly tools for non-technical researchers.

Results: We introduce SPAC (analysis of SPAtial single-Cell datasets), a scalable, web-based platform for efficient and reproducible single-cell spatial analysis. SPAC employs a four-tier architecture that includes a modular Python-based analysis engine, seamless integration with high-performance computing (HPC) and GPU acceleration, an interactive browser interface for no-code workflow configuration, and a real-time visualization layer powered by Shiny for Python dashboards. This design empowers distinct user roles: data scientists can extend and customize analysis modules, while bench scientists can execute complete workflows and interactively explore results without coding. Built-in reproducibility features and collaborative workflow support ensure that analyses are transparent and easily shared across research teams. Using a 2.6-million-cell multiplex imaging dataset from a 4T1 breast tumor model as a benchmark, SPAC reduced unsupervised clustering time from ~3 hours on a CPU to under 10 minutes with GPU acceleration, achieving more than a 20-fold speedup. It also enabled fine-grained spatial profiling of distinct tumor microenvironment compartments, demonstrating the platform's scalability and performance.

Conclusions: SPAC addresses major barriers in single-cell spatial analysis by uniting an intuitive, user-friendly interface with scalable, high-performance computation in a robust and reproducible framework. By streamlining complex analyses and bridging the gap between experimental and computational researchers, SPAC fosters collaborative workflows and accelerates the transformation of large-scale spatial datasets into actionable biological insights.
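
The reported timings can be sanity-checked with a short calculation. The sketch below treats "~3 hours" as exactly 180 minutes, which is an assumption, and the helper names are illustrative only.

```python
def speedup(baseline_minutes, accelerated_minutes):
    """Ratio of baseline runtime to accelerated runtime."""
    if accelerated_minutes <= 0:
        raise ValueError("accelerated runtime must be positive")
    return baseline_minutes / accelerated_minutes


def max_runtime_for_speedup(baseline_minutes, target_speedup):
    """Longest accelerated runtime that still reaches the target speedup."""
    if target_speedup <= 0:
        raise ValueError("target speedup must be positive")
    return baseline_minutes / target_speedup


cpu_minutes = 3 * 60  # "~3 hours", assumed here to be exactly 180 minutes
speedup_at_ten_minutes = speedup(cpu_minutes, 10)           # 18.0
runtime_for_20x = max_runtime_for_speedup(cpu_minutes, 20)  # 9.0
```

Under that assumption, a speedup above 20-fold corresponds to a GPU runtime below 9 minutes, which falls within the reported "under 10 minutes". Because the CPU time is approximate, the exact threshold is approximate too.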

Citations: 0
MIMI: Molecular Isotope Mass Identifier for stable isotope-labeled Fourier transform ultra-high mass resolution data analysis.
IF 3.3, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2026-01-28, DOI: 10.1186/s12859-025-06348-1
Nabil Rahiman, Michael A Ochsenkühn, Shady A Amin, Kristin C Gunsalus
Citations: 0
Integration of bulk RNA-seq pipeline metrics for assessing low-quality samples.
IF 3.3, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2026-01-27, DOI: 10.1186/s12859-025-06298-8
Samuel Hamilton, Gaurav Gadhvi, Tyler Therron, Deborah R Winter
Citations: 0
EDEN: multiscale expected density of nucleotide encoding for enhanced DNA sequence classification with hybrid deep learning.
IF 3.3, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2026-01-24, DOI: 10.1186/s12859-026-06367-6
Saman Zabihi, Sattar Hashemi, Eghbal Mansoori

Background: DNA sequences are fundamental carriers of genetic information, and their accurate classification is essential for understanding gene regulation, disease mechanisms, and translational genomics. Existing encoding methods often fail to capture both local and long-range dependencies simultaneously.

Results: We introduce EDEN (Expected Density of Nucleotide Encoding), a unified multiscale encoding framework based on kernel density estimation. EDEN captures position-specific and context-dependent nucleotide patterns and integrates them into a hybrid deep learning architecture. Across sixteen benchmark datasets covering promoter detection, core promoter detection, and transcription factor binding prediction, EDEN achieves the best average performance while using orders of magnitude fewer parameters compared with state-of-the-art models. All source code, pretrained models, and datasets are publicly available at: https://github.com/zabihis/EDEN .

Conclusions: EDEN provides an efficient, biologically informed, and interpretable multiscale representation for genomic sequence classification. Its favorable parameter-performance ratio and robust consistency across tasks underscore its practicality for large-scale genomic applications.
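
The abstract does not spell out the encoding itself, so the following is only a generic illustration of a kernel-density view of nucleotide positions at several scales. The Gaussian kernel, the bandwidth values, the normalization by sequence length, and the function name `kde_encode` are all assumptions made for this sketch; the authors' actual implementation is in the repository linked above.

```python
import math

NUCLEOTIDES = "ACGT"


def kde_encode(sequence, bandwidths=(1.0, 4.0)):
    """Return one feature row per position.

    Each row holds a smoothed density for every (nucleotide, bandwidth)
    pair, ordered A, C, G, T with bandwidths nested inside each base.
    Illustrative only; not the EDEN algorithm itself.
    """
    sequence = sequence.upper()
    n = len(sequence)
    rows = [[] for _ in range(n)]
    for base in NUCLEOTIDES:
        # Positions where this nucleotide occurs act as kernel centres.
        centres = [i for i, ch in enumerate(sequence) if ch == base]
        for h in bandwidths:
            # Gaussian kernel normalised by sequence length, so that
            # frequent bases receive proportionally more total density.
            norm = n * h * math.sqrt(2.0 * math.pi)
            for x in range(n):
                total = sum(math.exp(-0.5 * ((x - c) / h) ** 2) for c in centres)
                rows[x].append(total / norm)
    return rows


# Hypothetical toy sequence.
encoding = kde_encode("ACGTACGT")
only_a = kde_encode("AAAA")
```

A small bandwidth keeps the signal close to the exact positions of a base, while a large one spreads it over a wider context, which is one simple way to expose both local and longer-range patterns to a downstream model.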

Citations: 0
mspms: an R package and GUI for multiplex substrate profiling by mass spectrometry.
IF 3.3, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2026-01-24, DOI: 10.1186/s12859-026-06373-8
Charlie Bayne, Brianna Hurysz, David J Gonzalez, Anthony O'Donoghue

Background: Multiplex Substrate Profiling by Mass Spectrometry (MSP-MS) is a powerful method for determining the substrate specificity of proteolytic enzymes, which is essential for developing protease inhibitors, diagnostics, and protease-activated therapeutics. However, the complex datasets generated by MSP-MS pose significant analytical challenges and have limited accessibility for non-specialist users.

Results: We developed mspms, a Bioconductor R package with an accompanying graphical interface, to streamline the analysis of MSP-MS data. mspms standardizes workflows for data preparation, processing, statistical analysis, and visualization. The tool is designed for accessibility, serving advanced users through the R package and broader audiences through a web-based interface. We validated mspms using data from four well-characterized cathepsins (A-D), demonstrating that it reliably captures expected substrate specificities.

Conclusions: mspms is the first publicly available, comprehensive platform for MSP-MS data analysis downstream of peptide identification and quantification. It integrates preprocessing, normalization, statistical testing, and visualization into a single, transparent, and user-friendly framework, making it a valuable resource for the protease research community. The package is distributed via Bioconductor, and a graphical interface is available online for interactive use.

Citations: 0
zDUR: reference-free FASTQ compressor with high compression ratio and speed.
IF 3.3, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2026-01-23, DOI: 10.1186/s12859-025-06364-1
Artem Ershov, Renpeng Ding, Qian Fu, Ivan Kozlov, Ekaterina Fadeeva, Evgeniy Mozheiko, Ming Ni, Yong Hou, Yan Zhou

Background: High-throughput sequencing technologies generate massive amounts of FASTQ data comprising nucleotide sequences, quality scores, and read identifiers, necessitating efficient compression to alleviate storage and transmission burdens. Compared to general-purpose compressors, specialized FASTQ compressors achieve higher compression performance by exploiting the inherent redundancy in FASTQ files. However, existing FASTQ-specialized compressors often suffer from limited data applicability and tend to over-optimize either compression ratio or compression speed at the expense of the other.

Results: We present zDUR, a reference-free FASTQ compressor designed for efficient and scalable handling of next-generation sequencing data across diverse platforms and sequencing data types. Benchmarking against six reference-free compressors on 15 representative datasets spanning four sequencing data types demonstrates that zDUR achieves a favorable overall balance between compression ratio and speed, with broad applicability across data types. In particular, on single-cell RNA-seq and spatial transcriptomics datasets, zDUR achieves over a tenfold increase in runtime performance while maintaining higher compression ratios than SPRING, one of the state-of-the-art reference-free FASTQ compressors.

Conclusions: zDUR offers a scalable and efficient solution for reference-free FASTQ compression, balancing performance, speed, and usability across diverse datasets.
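
To make the "inherent redundancy" point concrete, the sketch below splits FASTQ records into their three streams (read identifiers, nucleotide sequences, quality scores) and compresses each one separately. It uses Python's general-purpose `zlib` as a stand-in, which is an assumption for illustration; zDUR's own codecs are not described in the abstract, and the helper names are hypothetical.

```python
import zlib


def split_fastq_streams(text):
    """Split four-line FASTQ records into identifier, sequence, and quality lists."""
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    ids, seqs, quals = [], [], []
    for i in range(0, len(lines), 4):
        header, seq, separator, qual = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        ids.append(header[1:])  # drop the leading '@'
        seqs.append(seq)
        quals.append(qual)
    return ids, seqs, quals


def compress_streams(ids, seqs, quals):
    """Compress each stream on its own with zlib (a stand-in, not zDUR's codec)."""
    return [
        zlib.compress("\n".join(stream).encode("ascii"))
        for stream in (ids, seqs, quals)
    ]


def compression_ratio(original_size, compressed_size):
    """Original size divided by compressed size; higher is better."""
    return original_size / compressed_size


# Hypothetical two-read sample.
sample = (
    "@read1\nACGTACGTAC\n+\nIIIIIIIIII\n"
    "@read2\nACGTACGTAA\n+\nIIIIIHHHHH\n"
)
ids, seqs, quals = split_fastq_streams(sample)
compressed = compress_streams(ids, seqs, quals)
```

Separating the streams lets each one be modelled according to its own statistics, for example a four-letter alphabet for sequences versus a wider symbol range for quality scores. On a sample this small the ratios are not meaningful, so none are reported here.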

Citations: 0
Identification of differentially expressed genes in RNA-seq data via semi-rigid orthogonal sparse KL-NMTF.
IF 3.3, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2026-01-23, DOI: 10.1186/s12859-026-06370-x
Grazia Gargano, Flavia Esposito, Nicoletta Del Buono, Sabino Ciavarella, Maria Carmela Vegliante
Citations: 0
CNValidatron: accurate and efficient validation of PennCNV calls using computer vision.
IF 3.3, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2026-01-23, DOI: 10.1186/s12859-026-06375-6
Simone Montalbano, G Bragi Walters, Gudbjorn F Jonsson, Jesper R Gådin, Thomas Werge, Daniel F Gudbjartsson, Hreinn Stefansson, Andrés Ingason

Background: Large, rare copy number variants (CNVs) are a main source of genetic variation in the genome and are important in both evolution and disease risk. CNVs can be detected using different data sources, including genome sequencing, genotyping arrays, and quantitative PCR experiments, but in most large cohorts, genotyping arrays remain the most prevalent source. Current methods to call CNVs from genotyping array data suffer from high false positive rates. Multiple approaches, including QC filtering, visual inspection of intensity tracks, and wet-lab validation, are commonly applied to counter this problem, but such methods are often non-specific (QC filtering) or inefficient (visual and wet-lab validations) at a genome-wide scale.

Results: We have assembled the largest collection of human-verified CNV calls using visual validation, totalling almost 60,000 calls from 22,500 samples from three cohorts genotyped on several different arrays. Across all cohorts our visual validation found the majority of CNV calls to be false positive (53.7%) or unclear (9.7%). The false positive fraction varied substantially across datasets and genomic regions, and we show that existing filtering methods based on QC metrics are inefficient in removing false calls. Given the supremacy of visual validation over existing filtering methods in controlling the false positive fraction, we used a subset of our visual validation dataset to train a convolutional neural network to automate the validation of CNVs through machine vision. We tested the efficacy of the model using the remainder of the dataset and found the performance exceeded 90% in most measures, approximating that of a human analyst. Orthogonal validation with genome sequencing data found our visual validation to be highly accurate, with only 1.7% of calls supported by the sequencing dataset deemed as false by the human analyst, and a further 7.5% deemed as unclear.

Conclusions: Visual inspection is the only effective validation approach for CNV calls. Our model is capable of automating this task at scale with very high accuracy, as shown by testing both within-sample and out-of-sample. The software is available as an R package at https://github.com/SinomeM/CNValidatron_fl.
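
The percentages in the Results imply a third category that the abstract does not state directly: calls that were neither false nor unclear. The sketch below derives that share and shows how common classifier measures are computed from a confusion matrix. The abstract does not say which measures exceeded 90%, so the metric choice and the confusion counts are hypothetical and serve only to illustrate the formulas.

```python
def remaining_percentage(*reported_percentages):
    """Share left over once the reported categories are subtracted from 100%."""
    return 100.0 - sum(reported_percentages)


def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, and accuracy from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# From the abstract: 53.7% false positive and 9.7% unclear.
validated_true_pct = remaining_percentage(53.7, 9.7)  # about 36.6

# Hypothetical counts, not results from the paper.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
```

By subtraction, roughly 36.6% of the visually inspected calls were judged to be true CNVs.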

Citations: 0
A negative binomial latent factor model for paired microbiome sequencing data.
IF 3.3 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2026-01-22 DOI: 10.1186/s12859-025-06362-3
Hyotae Kim, Nazema Y Siddiqui, Lisa Karstens, Li Ma

Background: Microbiome sequencing data are often collected from several body sites and exhibit dependencies. Our objective is to develop a model that enables joint analysis of data from different sites by capturing the underlying cross-site dependencies. The proposed model incorporates (i) latent factors shared across sites to explain common subject effects and to serve as the source of correlation between the sites and (ii) mixtures of latent factors to allow heterogeneity among the subjects in cross-site associations.
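As a rough illustration of idea (i), the toy simulation below shows how a single latent factor shared by two sites can induce correlation between their negative binomial counts. It is only a sketch under stated assumptions, not the authors' model: the log link, the gamma-Poisson construction, the function names (`poisson`, `neg_binomial`, `simulate_pair_correlation`) and all parameter values are hypothetical choices made for this example.

```python
import math
import random


def poisson(rng, lam):
    """Draw a Poisson count with Knuth's method (fine for the modest means used here)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def neg_binomial(rng, mean, dispersion):
    """Draw a negative binomial count as a gamma-Poisson mixture."""
    rate = rng.gammavariate(dispersion, mean / dispersion)
    return poisson(rng, rate)


def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_pair_correlation(loading, n_subjects=2000, base_mean=10.0,
                              dispersion=5.0, seed=0):
    """Correlation of log counts at two sites that share one latent factor.

    Assumed toy model: each subject has a factor z ~ N(0, 1); both sites have
    mean base_mean * exp(loading * z) and are drawn independently given z.
    """
    rng = random.Random(seed)
    site_a, site_b = [], []
    for _ in range(n_subjects):
        z = rng.gauss(0.0, 1.0)
        mean = base_mean * math.exp(loading * z)
        site_a.append(math.log1p(neg_binomial(rng, mean, dispersion)))
        site_b.append(math.log1p(neg_binomial(rng, mean, dispersion)))
    return pearson(site_a, site_b)


if __name__ == "__main__":
    print("shared factor   :", round(simulate_pair_correlation(0.7), 3))
    print("no shared factor:", round(simulate_pair_correlation(0.0), 3))
```

With a nonzero loading the two sites' counts are clearly correlated, while a loading of zero leaves them essentially uncorrelated; this is the sense in which a shared factor can act as the source of cross-site dependence.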

Results: Our simulation studies demonstrate that stronger associations between two sites lead to greater efficiency loss in regression analysis when such dependence is ignored in modeling. In a case study involving samples collected from a study on the female urogenital microbiome with aging, our model leads to the detection of covariate associations of the vaginal and urine microbiomes that are otherwise not statistically significant under a similar regression model applied to the two sites separately.

Conclusions: We propose a latent factor model for microbiome sequencing data collected from multiple sites. It captures the presumptive underlying cross-site associations without compromising estimation accuracy or inference efficiency in the absence of such associations. In addition, our proposed model improves predictive performance by enabling the prediction of microbial abundance at one site based on observations from another. We also provide an extended framework that allows for clustering of subjects (samples) and cluster-specific levels of paired association. Under this extended framework, clusters can be classified according to their association strengths.

Citations: 0