
BMC Bioinformatics: Latest Articles

VESNA: an open-source tool for automated 3D vessel segmentation and network analysis.
IF 3.3 | Zone 3 (Biology) | Q2 Biochemical Research Methods | Pub Date: 2025-10-21 | DOI: 10.1186/s12859-025-06270-6
Magdalena Schüttler, Leyla Doğan, Jana Kirchner, Süleyman Ergün, Philipp Wörsdörfer, Sabine C Fischer

Background: Vasculature is an essential part of all tissues and organs and is involved in a wide range of different diseases. However, available software for blood vessel image analysis is often limited: Some only process two-dimensional data, others lack batch processing, putting a time burden on the user, while still others require tightly defined culturing methods and experimental conditions. This highlights the need for software that has the ability to batch process three-dimensional image data and requires few and simple experimental preparation steps.

Results: We present VESNA, a Fiji (ImageJ) macro for automated segmentation and skeletonization of three-dimensional fluorescence images, enabling quantitative vascular network analysis. It requires only basic experimental preparation, making it highly adaptable to a wide range of applications across experimental goals and tissue culturing methods. The macro's potential is demonstrated on a range of image data sets, from organoids of varying size, network complexity, and growth conditions to other 3D tissue culturing methods, exemplified by hydrogel-based cultures.

Conclusions: With its ability to process large amounts of 3D image data and its flexibility across experimental conditions, VESNA fulfills previously unmet needs in image processing of vascular structures and can be a valuable tool for a variety of experimental setups around three-dimensional vasculature, such as drug screening, research in tissue development and disease mechanisms.

BMC Bioinformatics 26(1):254. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12539100/pdf/ Semantic Scholar paper ID: 145343022.
Citations: 0
ASET: an end-to-end pipeline for quantification and visualization of allele specific expression.
IF 3.3 | Zone 3 (Biology) | Q2 Biochemical Research Methods | Pub Date: 2025-10-21 | DOI: 10.1186/s12859-025-06282-2
Weisheng Wu, Kerby Shedden, Claudius Vincenz, Chris Gates, Beverly Strassmann

Allele-specific expression (ASE) analyses from RNA-Seq data provide quantitative insights into genomic imprinting and the genetic variants that affect transcription. Robust ASE analysis requires the integration of multiple computational steps, including read alignment, read counting, data visualization, and statistical testing; this complexity creates challenges for reproducibility, scalability, and ease of use. Here, we present ASE Toolkit (ASET), an end-to-end pipeline that streamlines SNP-level ASE data generation, visualization, and testing for parent-of-origin (PofO) effect. ASET includes a modular pipeline built with Nextflow for ASE quantification from short-read transcriptome sequencing reads, an R library for data visualization, and a Julia script for PofO testing. ASET performs comprehensive read quality control, SNP-tolerant alignment to reference genomes, read counting with allele and strand resolution, annotation with genes and exons, and estimation of contamination. In sum, ASET provides a complete and easy-to-use solution for molecular and biomedical scientists to identify and interpret patterns of ASE from RNA-Seq data.
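To make the idea of SNP-level ASE quantification concrete, the sketch below computes an allelic ratio from reference and alternate read counts and tests it against balanced expression with an exact two-sided binomial test. This is an assumption for illustration only: the abstract does not say which statistic ASET uses, and the function names `allelic_ratio` and `binomial_two_sided_p` are hypothetical, not part of ASET.

```python
from math import comb


def allelic_ratio(ref: int, alt: int) -> float:
    """Fraction of reads at a SNP that carry the reference allele."""
    total = ref + alt
    if total == 0:
        raise ValueError("no reads cover this SNP")
    return ref / total


def binomial_two_sided_p(ref: int, alt: int) -> float:
    """Exact two-sided binomial p-value against a balanced ratio of 0.5.

    With p = 0.5 every outcome k has probability comb(n, k) / 2**n, so the
    outcomes "at least as extreme" as the observation are exactly those
    whose binomial coefficient does not exceed the observed one.
    """
    n = ref + alt
    observed = comb(n, ref)
    extreme = sum(comb(n, k) for k in range(n + 1) if comb(n, k) <= observed)
    return extreme / 2 ** n


if __name__ == "__main__":
    # Hypothetical counts at a single SNP, not taken from the paper.
    print(allelic_ratio(15, 5), binomial_two_sided_p(15, 5))
```

A real analysis would also need to account for mapping bias, overdispersion, and multiple testing, which this baseline ignores.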

BMC Bioinformatics 26(1):257. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12539063/pdf/ Semantic Scholar paper ID: 145343009.
Citations: 0
Big data dimensionality reduction-based supervised machine learning algorithms for NASH diagnosis.
IF 3.3 | Zone 3 (Biology) | Q2 Biochemical Research Methods | Pub Date: 2025-10-21 | DOI: 10.1186/s12859-025-06263-5
Onder Tutsoy, Huseyin Ali Ozturk, Hilmi Erdem Sumbul

Background: Identifying non-alcoholic steatohepatitis (NASH), which can lead to liver failure-based morbidity, remains a challenging research problem because there is not yet a confirmed and effective approach for its early and accurate diagnosis. A large amount of medical data is collected to diagnose NASH, but most of it is redundant.

Methods: This paper first selects the most informative blood test data from the collected big data using the Pearson correlation statistical approach and a modified Particle Swarm Optimization with Artificial Neural Networks (PSO-ANN) machine learning algorithm. Then, a gradient-based Batch Least Squares (BLS) algorithm and a search-based Artificial Bee Colony (ABC) algorithm are implemented to optimize the NASH prediction models. Confirmed operational NASH diagnoses supervise the statistical and machine learning algorithms to develop accurate prediction models.

Results: The two machine learning algorithms were trained and validated with varying numbers of selected input features. The trained BLS model diagnosed benign and malignant cases with 100% and 98% accuracy, respectively. The trained ABC model diagnosed benign and malignant cases with 90.5% and 94.3% accuracy, respectively.
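The two-stage idea in the Methods (rank features by Pearson correlation, then fit a least-squares model on the selected ones) can be sketched as follows. This is a minimal illustration under stated assumptions: the closed-form `numpy.linalg.lstsq` solve stands in for the paper's gradient-based BLS, the PSO-ANN and ABC steps are not covered, the function names are hypothetical, and the toy matrix is made up rather than taken from the study.

```python
import numpy as np


def select_by_pearson(X: np.ndarray, y: np.ndarray, top_k: int) -> np.ndarray:
    """Return indices of the top_k features with the largest |Pearson r| to y."""
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.argsort(-np.abs(r))[:top_k]


def fit_least_squares(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Fit weights (plus intercept) minimising the squared error over the batch."""
    A = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w


def predict(X: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Threshold the linear output at 0.5 to obtain a 0/1 class label."""
    A = np.column_stack([X, np.ones(len(X))])
    return (A @ w >= 0.5).astype(int)


# Toy data (illustrative only): column 0 tracks the label, column 1 barely does.
X = np.array([[1.0, 5.0], [1.2, 3.0], [3.1, 4.0],
              [2.9, 5.5], [0.8, 4.5], [3.3, 3.5]])
y = np.array([0, 0, 1, 1, 0, 1])

selected = select_by_pearson(X, y, top_k=1)
weights = fit_least_squares(X[:, selected], y)
labels = predict(X[:, selected], weights)
```

A correlation filter of this kind only captures linear, one-feature-at-a-time relevance, which is presumably why the paper pairs it with a second, model-based selection step.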

BMC Bioinformatics 26(1):256. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12538836/pdf/ Semantic Scholar paper ID: 145342994.
Citations: 0
SMILES alignment: a dynamic programming approach for the alignment of metabolites and other small organic molecules.
IF 3.3 | Zone 3 (Biology) | Q2 Biochemical Research Methods | Pub Date: 2025-10-17 | DOI: 10.1186/s12859-025-06278-y
Alexis L Tang, David A Liberles

Background: There is a need for computational approaches to compare small organic molecules based on chemical similarity or to evaluate biochemical transformations. No tool currently exists to generate global molecular alignments for small organic molecules. The study introduces a new approach to molecular alignment in the Simplified Molecular Input Line Entry System (SMILES) format. This method leverages dynamic programming and scores alignments to minimize differences in electronegativity, here using a measure of atomic partial charges, to address the challenge of understanding structural transformations in reaction pathways. This can be applied to study transitions from linear to cyclical pathways.

Results: The proposed method is based on the Needleman-Wunsch algorithm for sequence alignment, but it uses a modified scoring function for different input data. Validation against a benchmarked dataset from the Krebs cycle, based on the known chemical transformations in the pathway, confirmed the efficacy of the approach in aligning atoms that are known to be the same across the transformation. The algorithm also quantified each transformation of metabolites in the Pentose Phosphate Pathway and in Glycolysis. The method was used to study the difference in chemical similarity over transformations between linear and cyclical pathways. The study found a midpoint dissimilarity peak in cyclical pathways (particularly the Krebs Cycle) and a progressive decrease in molecular similarity in linear pathways, consistent with expectations.

Conclusions: The study introduces an algorithm that quantifies molecular transformations in metabolic pathways. The algorithm effectively highlights structural changes and was applied to a hypothesis about the transition from linear to cyclical structures. The software, which provides valuable insights into molecular transformations, is available at: https://github.com/24atang/SMILES-Alignment.git.
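The core idea, Needleman-Wunsch global alignment with a scoring function based on atomic partial charges, can be sketched as below. This is a sketch under assumptions, not the authors' implementation: the pair score `-abs(q_a - q_b)`, the linear gap penalty, the pre-tokenised atom lists, and the charge values are all hypothetical placeholders, and no SMILES parsing is attempted.

```python
from typing import List, Optional, Tuple

Atom = Tuple[str, float]  # (element symbol, partial charge); charges are placeholders


def pair_score(a: Atom, b: Atom) -> float:
    """Assumed scoring: penalise the absolute difference in partial charge."""
    return -abs(a[1] - b[1])


def align(a: List[Atom], b: List[Atom], gap: float = -0.5):
    """Global alignment of two atom sequences by Needleman-Wunsch."""
    n, m = len(a), len(b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Borders are accumulated so the traceback can reuse the same arithmetic.
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs: List[Tuple[Optional[str], Optional[str]]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]):
            pairs.append((a[i - 1][0], b[j - 1][0]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1][0], None))  # atom in a aligned to a gap
            i -= 1
        else:
            pairs.append((None, b[j - 1][0]))  # atom in b aligned to a gap
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


# Hypothetical three-atom and two-atom fragments with made-up charges.
total, pairs = align([("C", -0.1), ("C", 0.3), ("O", -0.4)],
                     [("C", -0.1), ("O", -0.4)])
```

In this toy case the middle carbon of the first fragment is aligned to a gap, because pairing it with either atom of the second fragment would cost more than the gap penalty.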

BMC Bioinformatics 26(1):251. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12534939/pdf/ Semantic Scholar paper ID: 145312332.
Citations: 0
GRiNS: a python library for simulating gene regulatory network dynamics.
IF 3.3 | Zone 3 (Biology) | Q2 Biochemical Research Methods | Pub Date: 2025-10-17 | DOI: 10.1186/s12859-025-06268-0
Pradyumna Harlapur, Harshavardhan Bv, Mohit Kumar Jolly

Background: The emergent dynamics of complex gene regulatory networks govern various cellular processes. However, understanding these dynamics is challenging due to the difficulty of parameterizing the computational models for these networks, especially as the network size increases. Here, we introduce a simulation library, Gene Regulatory Interaction Network Simulator (GRiNS), to address these challenges.

Results: GRiNS integrates popular parameter-agnostic simulation frameworks, RACIPE and Boolean Ising formalism, into a single Python library capable of leveraging GPU acceleration for efficient and scalable simulations. GRiNS extends the ordinary differential equations (ODE) based RACIPE framework with a more modular design, allowing users to choose parameters, initial conditions, and time-series outputs for greater customisability and accuracy in simulations. For large networks, where ODE-based simulation formalisms do not scale well, GRiNS implements Boolean Ising formalism, providing a simplified, coarse-grained alternative, significantly reducing the computational cost while capturing key dynamical behaviours of large regulatory networks.

Conclusion: GRiNS enables parameter-agnostic modeling of gene regulatory networks to study their dynamic and steady-state behaviors in a scalable and efficient manner. The documentation and installation instructions for GRiNS can be found at https://moltenecdysone09.github.io/GRiNS/ .
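For readers unfamiliar with the Boolean Ising formalism mentioned in the Results, the sketch below shows one common formulation in plain NumPy: each node is +1 (on) or -1 (off) and takes the sign of its summed regulatory input, keeping its state when the input is zero. It does not use the GRiNS API; the function names, the synchronous update scheme, and the two-gene toggle switch are assumptions for illustration.

```python
import numpy as np


def ising_step(J: np.ndarray, s: np.ndarray) -> np.ndarray:
    """One synchronous update of all nodes.

    J[i, j] is +1 if node i activates node j, -1 if it inhibits it, 0 otherwise.
    A node with zero net input keeps its previous state.
    """
    field = J.T @ s
    return np.where(field > 0, 1, np.where(field < 0, -1, s))


def simulate(J: np.ndarray, s0: np.ndarray, steps: int) -> np.ndarray:
    """Return the trajectory of states as an array of shape (steps + 1, n_nodes)."""
    states = [np.asarray(s0)]
    for _ in range(steps):
        states.append(ising_step(J, states[-1]))
    return np.array(states)


# Toggle switch: A inhibits B and B inhibits A.
J = np.array([[0, -1],
              [-1, 0]])
trajectory = simulate(J, np.array([1, -1]), steps=3)
```

Starting from (A on, B off) the toggle switch stays put, which is one of its two expected steady states. Because no kinetic parameters appear anywhere, this kind of update is far cheaper than integrating ODEs for each sampled parameter set.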

BMC Bioinformatics 26(1):250. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12535162/pdf/ Semantic Scholar paper ID: 145312397.
Citations: 0
Direct construction of sparse suffix arrays with Libsais.
IF 3.3 | Zone 3 (Biology) | Q2 Biochemical Research Methods | Pub Date: 2025-10-17 | DOI: 10.1186/s12859-025-06277-z
Simon Van de Vyver, Tibo Vande Moortele, Peter Dawyndt, Bart Mesuere, Pieter Verschaffelt

Background: Pattern matching is a fundamental challenge in bioinformatics, especially in the fields of genomics, transcriptomics and proteomics. Efficient indexing structures, such as suffix arrays, are critical for searching large datasets. A sparse suffix array (SSA) retains only suffixes at every k-th position in the text, where k is the sparseness factor. While sparse suffix arrays offer significant memory savings compared to full suffix arrays, they typically still require the construction of a full suffix array prior to a sampling step, resulting in substantial memory overhead during the construction phase.

Results: We present an alternative method to directly construct the sparse suffix array using a simple, yet powerful text encoding. This encoding reduces the input text length by grouping characters, thereby enabling direct SSA construction by extending the widely used Libsais library. This approach bypasses the need to construct a full suffix array, reducing memory usage and construction time by 50 to 75% when building a sparse suffix array with sparseness factor 3 or 4 for various nucleotide and amino acid datasets. Depending on the alphabet size, similar gains can be achieved for sparseness factors up to 8. For higher sparseness factors, comparable performance improvements can be obtained by constructing the SSA using a suitable divisor of the desired sparseness factor, followed by a subsampling step. The method is particularly effective for applications with small alphabets, such as a nucleotide or amino acid alphabet. An open-source implementation of this method is available on GitHub, enabling easy adoption for large-scale bioinformatics applications.

Conclusions: We introduce an efficient method for the construction of sparse suffix arrays for large datasets. Central to this approach is the introduction of a simple text transformation, which then serves as input to Libsais. This method reduces the length of both the input text and the resulting suffix array by a factor of k, which improves execution time and memory usage significantly.
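The grouping trick can be illustrated with a small sketch that compares the conventional route (build the full suffix array, then keep every k-th position) with a direct route that packs each block of k characters into one integer symbol and sorts the suffixes of the shorter packed text. Everything here is an assumption for illustration: plain Python sorting stands in for Libsais, the order-preserving base-(σ+1) packing with zero padding is one plausible encoding rather than the paper's exact scheme, and the function names are hypothetical.

```python
from typing import List


def sparse_sa_by_sampling(text: str, k: int) -> List[int]:
    """Baseline: build the full suffix array, then keep positions divisible by k."""
    full = sorted(range(len(text)), key=lambda i: text[i:])
    return [i for i in full if i % k == 0]


def sparse_sa_by_grouping(text: str, k: int) -> List[int]:
    """Direct route: pack k characters per symbol, sort suffixes of the packed text."""
    # Ranks start at 1 so that the padding code 0 sorts before every real character.
    rank = {c: r + 1 for r, c in enumerate(sorted(set(text)))}
    codes = [rank[c] for c in text]
    codes += [0] * (-len(codes) % k)  # pad to a multiple of k
    base = len(rank) + 1
    packed = []
    for start in range(0, len(codes), k):
        value = 0
        for code in codes[start:start + k]:
            value = value * base + code  # big-endian packing preserves order
        packed.append(value)
    order = sorted(range(len(packed)), key=lambda i: packed[i:])
    return [i * k for i in order]  # map packed positions back to text positions


text = "ACGTACGTTGCA"
direct = sparse_sa_by_grouping(text, 3)
```

The packed text is k times shorter than the input, which is where the memory and time savings come from. The number of distinct packed symbols grows as (σ+1)^k, which is consistent with the abstract's remark that the achievable sparseness factor depends on the alphabet size.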

BMC Bioinformatics 26(1):252. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12535041/pdf/ Semantic Scholar paper ID: 145312395.
Citations: 0
JINet: easy and secure private data analysis for everyone.
IF 3.3 | Zone 3 (Biology) | Q2 Biochemical Research Methods | Pub Date: 2025-10-16 | DOI: 10.1186/s12859-025-06244-8
Giada Lalli, James Collier, Yves Moreau, Daniele Raimondi

Background: The barriers to effective data analysis are sometimes insurmountable. Concerns ranging from privacy and security to complexity can prevent researchers from using existing data analysis tools.

Results: JINet is a web browser-based platform intended to democratise access to advanced clinical and genomic data analysis software. It hosts numerous data analysis applications that are run in the safety of each User's web browser, without the data ever leaving their machine.

Conclusions: JINet promotes collaboration, standardisation and reproducibility by sharing scripts rather than data, and by creating a self-sustaining community in which Users and data analysis tool Developers interact through JINet's interoperability primitives.

背景:有效数据分析的障碍有时是不可逾越的。对隐私、安全性和复杂性的担忧可能会阻止研究人员使用现有的数据分析工具。JINet是一个基于web浏览器的平台,旨在使高级临床和基因组数据分析软件的访问民主化。它承载了大量的数据分析应用程序,运行在每个用户的网络浏览器的安全,没有数据离开他们的机器。结论:JINet通过共享脚本而不是数据来促进协作、标准化和可再现性,并围绕它创建了一个自我维持的社区,在这个社区中,用户和数据分析工具开发人员通过JINet的互操作性原语进行交互。
{"title":"JINet: easy and secure private data analysis for everyone.","authors":"Giada Lalli, James Collier, Yves Moreau, Daniele Raimondi","doi":"10.1186/s12859-025-06244-8","DOIUrl":"10.1186/s12859-025-06244-8","url":null,"abstract":"<p><strong>Background: </strong>The barriers to effective data analysis are sometimes insurmountable. Concerns ranging from privacy, security, and complexity can prevent researchers from using existing data analysis tools.</p><p><strong>Results: </strong>JINet is a web browser-based platform intended to democratise access to advanced clinical and genomic data analysis software. It hosts numerous data analysis applications that are run in the safety of each User's web browser, without the data ever leaving their machine.</p><p><strong>Conclusions: </strong>JINet promotes collaboration, standardisation and reproducibility by sharing scripts rather than data and creating a self-sustaining community around it in which Users and data analysis tools Developers interact thanks to JINet's interoperability primitives.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"248"},"PeriodicalIF":3.3,"publicationDate":"2025-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12532838/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145306782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Generalized probabilistic canonical correlation analysis for multi-modal data integration with full or partial observations.
IF 3.3, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2025-10-16, DOI: 10.1186/s12859-025-06227-9
Tianjian Yang, Wei Vivian Li

The integration and analysis of multi-modal data are increasingly essential across various domains including bioinformatics. As the volume and complexity of such data grow, there is a pressing need for computational models that not only integrate diverse modalities but also leverage their complementary information to improve clustering accuracy and insights, especially when dealing with partial observations with missing data. We propose Generalized Probabilistic Canonical Correlation Analysis (GPCCA), an unsupervised method for the integration and joint dimensionality reduction of multi-modal data. GPCCA addresses key challenges in multi-modal data analysis by handling missing values within the model, enabling the integration of more than two modalities, and identifying informative features while accounting for correlations within individual modalities. GPCCA demonstrates robustness to various missing data patterns and provides low-dimensional embeddings that facilitate downstream clustering and analysis. In a range of simulation settings, GPCCA outperforms existing methods in capturing essential patterns across modalities. Additionally, we demonstrate its applicability to multi-omics data from TCGA cancer datasets and a multi-view image dataset. GPCCA offers a useful framework for multi-modal data integration, effectively handling missing data and providing informative low-dimensional embeddings. Its performance across cancer genomics and multi-view image data highlights its robustness and potential for broad application. To make the method accessible to the wider research community, we have released an R package, GPCCA, which is available at https://github.com/Kaversoniano/GPCCA .

BMC Bioinformatics, volume 26, issue 1, article 249. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12533326/pdf/
Citations: 0
DCMF-PPI: a protein-protein interaction predictor based on dynamic condition and multi-feature fusion.
IF 3.3, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2025-10-15, DOI: 10.1186/s12859-025-06272-4
Siqi Chen, Anhong Zheng, Weichi Yu, Chao Zhan

Background: The identification of protein-protein interaction (PPI) plays a crucial role in understanding the mechanisms of complex biological processes. Current research in predicting PPI has shown remarkable progress by integrating protein information with PPI topology structure. Nevertheless, these approaches frequently overlook the dynamic nature of protein and PPI structures during cellular processes, including conformational alterations and variations in binding affinities under diverse environmental circumstances. Additionally, the insufficient availability of comprehensive protein data hinders accurate protein representation. Consequently, these shortcomings restrict the model's generalizability and predictive precision.

Results: To address this, we introduce DCMF-PPI (Dynamic condition and multi-feature fusion framework for PPI), a novel hybrid framework that integrates dynamic modeling, multi-scale feature extraction, and probabilistic graph representation learning. DCMF-PPI comprises three core modules: (1) PortT5-GAT Module: The protein language model PortT5 is utilized to extract residue-level protein features, which are integrated with dynamic temporal dependencies. Graph attention networks are then employed to capture context-aware structural variations in protein interactions; (2) MPSWA Module: Employs parallel convolutional neural networks combined with wavelet transform to extract multi-scale features from diverse protein residue types, enhancing the representation of sequence and structural heterogeneity; (3) VGAE Module: Utilizes a Variational Graph Autoencoder to learn probabilistic latent representations, facilitating dynamic modeling of PPI graph structures and capturing uncertainty in interaction dynamics.

Conclusion: We conducted comprehensive experiments on benchmark datasets demonstrating that DCMF-PPI outperforms state-of-the-art methods in PPI prediction, achieving significant improvements in accuracy, precision, and recall. The framework's ability to fuse dynamic conditions and multi-level features highlights its effectiveness in modeling real-world biological complexities, positioning it as a robust tool for advancing PPI research and downstream applications in systems biology and drug discovery.

BMC Bioinformatics, volume 26, issue 1, article 247. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12522320/pdf/
Citations: 0
SAGPEK: fast and flexible approach to identify genotypes of Sanger sequencing data.
IF 3.3, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2025-10-14, DOI: 10.1186/s12859-025-06271-5
Jinpeng Wang, Shuo Sun, Yaran Zhang, Ning Huang, Chunhong Yang, Yaping Gao, Xiuge Wang, Zhihua Ju, Qiang Jiang, Yao Xiao, Xiaochao Wei, Wenhao Liu, Jinming Huang

Background: Although Sanger sequencing remains widely used in human genetic disease diagnosis and livestock breeding, software packages for analyzing such data have seen little innovation over time. Determining the genotypes of tens to hundreds of loci across hundreds or thousands of samples still typically relies on manual visual confirmation with traditional software, a process that is both time-consuming and prone to error.

Results: We present SAGPEK, a tool that automatically identifies genotypes at target loci from hundreds to thousands of ABI-format Sanger sequencing files and directly outputs the results. SAGPEK extracts the signal intensities for A, G, C, and T bases, performs base calling, and determines each site's homozygous or heterozygous status. It then generates a primary sequence composed of the bases with the highest signal intensities and records secondary bases for heterozygous sites. Using either built-in or user-provided anchor sequences, SAGPEK maps the coordinates of target loci, reports their genotypes, and, when applicable, annotates the corresponding amino acid changes.
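
The genotyping steps described above (pick the strongest base per site, record a secondary base at heterozygous sites, then locate a target locus via an anchor sequence) can be illustrated with a minimal Python sketch. This is not SAGPEK's actual implementation: the function names, the `het_ratio` threshold of 0.3, and the input layout (one dict of A/G/C/T peak intensities per called position, already extracted from the ABI file) are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

BASES = "AGCT"


def call_site(intensities: Dict[str, float],
              het_ratio: float = 0.3) -> Tuple[str, Optional[str]]:
    """Return (primary, secondary) base for one peak position.

    het_ratio is a hypothetical cutoff: the site is treated as
    heterozygous when the second-strongest signal reaches this
    fraction of the strongest one.
    """
    ranked = sorted(BASES, key=lambda b: intensities.get(b, 0.0), reverse=True)
    primary, runner_up = ranked[0], ranked[1]
    top = intensities.get(primary, 0.0)
    second = intensities.get(runner_up, 0.0)
    if top > 0 and second / top >= het_ratio:
        return primary, runner_up
    return primary, None


def call_trace(peaks: List[Dict[str, float]],
               het_ratio: float = 0.3) -> Tuple[str, Dict[int, str]]:
    """Build the primary sequence and a map of position -> secondary base."""
    calls = [call_site(p, het_ratio) for p in peaks]
    primary_seq = "".join(c[0] for c in calls)
    secondary = {i: c[1] for i, c in enumerate(calls) if c[1] is not None}
    return primary_seq, secondary


def genotype_after_anchor(primary_seq: str,
                          secondary: Dict[int, str],
                          anchor: str,
                          offset: int = 0) -> Optional[str]:
    """Report the genotype at the site `offset` bases after the anchor.

    Assumption: the anchor is matched exactly against the primary
    sequence; None is returned if it is absent or the site is out of range.
    """
    start = primary_seq.find(anchor)
    if start < 0:
        return None
    pos = start + len(anchor) + offset
    if pos >= len(primary_seq):
        return None
    first = primary_seq[pos]
    second = secondary.get(pos, first)  # homozygous: both alleles equal
    return "".join(sorted((first, second)))


if __name__ == "__main__":
    # Made-up intensities for five positions; position 3 has mixed T/C signal.
    peaks = [
        {"A": 900, "G": 20, "C": 15, "T": 10},
        {"A": 10, "G": 30, "C": 850, "T": 25},
        {"A": 12, "G": 800, "C": 20, "T": 18},
        {"A": 15, "G": 10, "C": 400, "T": 520},
        {"A": 870, "G": 25, "C": 10, "T": 30},
    ]
    seq, sec = call_trace(peaks)
    print(seq, sec, genotype_after_anchor(seq, sec, "ACG"))
```

Exact anchor matching is the simplest choice for a sketch; a real tool would likely need to tolerate low-quality bases near the anchor, which is outside what this example attempts.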

Conclusions: SAGPEK provides an efficient, flexible, and user-friendly solution for analyzing ABI-format Sanger sequencing data, enabling simultaneous genotyping of tens of loci across hundreds of samples. Its innovation lies not in introducing new base-calling methods, but in integrating versatile functionalities (batch genotyping, customizable anchor sequences, amino acid alteration reporting, chromatogram visualization, and local execution) into a single open-source package. This makes SAGPEK well suited for applications such as human genetic disease screening, drug-resistance mutation detection, and functional mutation identification in livestock and other organisms.

BMC Bioinformatics, volume 26, issue 1, article 246. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12523141/pdf/
Citations: 0