Pub Date: 2025-10-21, DOI: 10.1186/s12859-025-06270-6
Magdalena Schüttler, Leyla Doğan, Jana Kirchner, Süleyman Ergün, Philipp Wörsdörfer, Sabine C Fischer
Background: Vasculature is an essential part of all tissues and organs and is involved in a wide range of diseases. However, available software for blood vessel image analysis is often limited: some tools only process two-dimensional data; others lack batch processing, placing a time burden on the user; still others require tightly defined culturing methods and experimental conditions. This highlights the need for software that can batch process three-dimensional image data while requiring only a few simple experimental preparation steps.
Results: We present VESNA, a Fiji (ImageJ) macro for automated segmentation and skeletonization of three-dimensional fluorescence images, enabling quantitative vascular network analysis. It requires only basic experimental preparation, making it highly adaptable to a wide range of applications across experimental goals and tissue culturing methods. The macro's potential is demonstrated on a range of image data sets, from organoids with varying sizes, network complexities, and growth conditions to other 3D tissue culturing methods, with hydrogel-based cultures as an example.
Conclusions: With its ability to process large amounts of 3D image data and its flexibility across experimental conditions, VESNA fulfills previously unmet needs in image processing of vascular structures and can be a valuable tool for a variety of experimental setups around three-dimensional vasculature, such as drug screening and research into tissue development and disease mechanisms.
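VESNA itself is a Fiji (ImageJ) macro, so the sketch below is only a hedged, hypothetical illustration in Python/NumPy of the first step of a segment-then-skeletonize pipeline: binarizing a 3D fluorescence stack with a global Otsu threshold. Function names are assumptions, and the skeletonization step is omitted; this is not VESNA's actual implementation.

```python
import numpy as np

def otsu_threshold(img, nbins=256):
    """Otsu's method: choose the threshold maximizing between-class variance."""
    hist, edges = np.histogram(img, bins=nbins)
    hist = hist.astype(float) / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(hist)                  # probability of the background class
    w1 = 1.0 - w0                         # probability of the foreground class
    mu0 = np.cumsum(hist * centers)       # cumulative (unnormalized) class mean
    mu_t = mu0[-1]                        # global mean intensity
    with np.errstate(divide="ignore", invalid="ignore"):
        var_between = (mu_t * w0 - mu0) ** 2 / (w0 * w1)
    return centers[np.nanargmax(var_between)]

def segment_stack(stack):
    """Binarize a 3D fluorescence stack with a single global Otsu threshold."""
    return stack > otsu_threshold(stack)
```

In a batch setting, a macro would simply loop this segmentation over every stack in a folder before skeletonizing and measuring the resulting network.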
VESNA: an open-source tool for automated 3D vessel segmentation and network analysis. BMC Bioinformatics 26(1):254. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12539100/pdf/
Pub Date: 2025-10-21, DOI: 10.1186/s12859-025-06282-2
Weisheng Wu, Kerby Shedden, Claudius Vincenz, Chris Gates, Beverly Strassmann
Allele-specific expression (ASE) analyses from RNA-Seq data provide quantitative insights into genomic imprinting and the genetic variants that affect transcription. Robust ASE analysis requires the integration of multiple computational steps, including read alignment, read counting, data visualization, and statistical testing; this complexity creates challenges for reproducibility, scalability, and ease of use. Here, we present the ASE Toolkit (ASET), an end-to-end pipeline that streamlines SNP-level ASE data generation, visualization, and testing for parent-of-origin (PofO) effects. ASET includes a modular pipeline built with Nextflow for ASE quantification from short-read transcriptome sequencing reads, an R library for data visualization, and a Julia script for PofO testing. ASET performs comprehensive read quality control, SNP-tolerant alignment to reference genomes, read counting with allele and strand resolution, annotation with genes and exons, and estimation of contamination. In sum, ASET provides a complete and easy-to-use solution for molecular and biomedical scientists to identify and interpret patterns of ASE from RNA-Seq data.
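ASET performs its PofO testing in a dedicated Julia script whose exact statistic is not described in this abstract. As a hedged stand-in, the sketch below shows the kind of exact two-sided binomial test commonly used to flag allelic imbalance at a heterozygous SNP; the function and the 9-of-10-reads example are illustrative, not ASET's code.

```python
from math import comb

def binom_two_sided(x, n, p=0.5):
    """Exact two-sided binomial test: total probability of all outcomes
    no more likely than the observed count x under Binomial(n, p)."""
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    cutoff = pmf[x] * (1 + 1e-12)   # tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= cutoff))

# e.g. 9 reference reads out of 10 at a heterozygous SNP:
# binom_two_sided(9, 10) -> 22/1024, about 0.0215
```

A small p-value indicates the two alleles are not expressed equally, the basic signal behind imprinting and PofO analyses.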
ASET: an end-to-end pipeline for quantification and visualization of allele specific expression. BMC Bioinformatics 26(1):257. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12539063/pdf/
Pub Date: 2025-10-21, DOI: 10.1186/s12859-025-06263-5
Onder Tutsoy, Huseyin Ali Ozturk, Hilmi Erdem Sumbul
Background: Identifying Non-Alcoholic Steatohepatitis (NASH), which can lead to liver failure and associated morbidity, remains a challenging research problem, since no confirmed and effective approach for its early and accurate diagnosis exists yet. Large amounts of medical data are collected to diagnose NASH, but much of this data is redundant.
Methods: This paper first selects the most informative blood test features from the collected data using the Pearson correlation statistical approach and a modified Particle Swarm Optimization with Artificial Neural Networks (PSO-ANN) machine learning algorithm. Then, a gradient-based Batch Least Squares (BLS) algorithm and a search-based Artificial Bee Colony (ABC) algorithm are implemented to optimize the NASH prediction models. Confirmed clinical NASH diagnoses supervise the statistical and machine learning algorithms to develop accurate prediction models.
Results: Both machine learning algorithms were trained and validated with varying numbers of selected input features. The trained BLS model diagnoses benign and malignant cases with 100% and 98% accuracy, respectively, while the trained ABC algorithm achieves 90.5% and 94.3% accuracy, respectively.
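The paper's feature selection combines Pearson correlation with a PSO-ANN algorithm; the sketch below illustrates only the correlation-ranking step on hypothetical data, with |r| used to order blood-test features by how strongly they track the diagnosis label.

```python
import numpy as np

def pearson_rank_features(X, y):
    """Rank the columns of X by |Pearson correlation| with the label vector y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))
    order = np.argsort(-np.abs(r))        # most informative feature first
    return order, r
```

Keeping only the top-ranked features shrinks the redundant "big data" down to the inputs the downstream BLS/ABC models are trained on.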
Big data dimensionality reduction-based supervised machine learning algorithms for NASH diagnosis. BMC Bioinformatics 26(1):256. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12538836/pdf/
Pub Date: 2025-10-17, DOI: 10.1186/s12859-025-06278-y
Alexis L Tang, David A Liberles
Background: There is a need for computational approaches to compare small organic molecules based on chemical similarity and to evaluate biochemical transformations. No tool currently exists to generate global molecular alignments for small organic molecules. This study introduces a new approach to aligning molecules in the Simplified Molecular Input Line Entry System (SMILES) format. The method uses dynamic programming, scoring alignments to minimize differences in electronegativity, here via a measure of atomic partial charges, to address the challenge of understanding structural transformations in reaction pathways. It can be applied to study transitions from linear to cyclical pathways.
Results: The proposed method is based on the Needleman-Wunsch algorithm for sequence alignment, but it uses a modified scoring function for different input data. Validation against a benchmarked dataset from the Krebs cycle, based on the known chemical transformations in the pathway, confirmed the efficacy of the approach in aligning atoms that are known to be the same across the transformation. The algorithm also quantified each transformation of metabolites in the Pentose Phosphate Pathway and in Glycolysis. The method was used to study the difference in chemical similarity over transformations between linear and cyclical pathways. The study found a midpoint dissimilarity peak in cyclical pathways (particularly the Krebs Cycle) and a progressive decrease in molecular similarity in linear pathways, consistent with expectations.
Conclusions: The study introduces an algorithm that quantifies molecular transformations in metabolic pathways. The algorithm effectively highlights structural changes and was applied to a hypothesis about the transition from linear to cyclical structures. The software, which provides valuable insights into molecular transformations, is available at: https://github.com/24atang/SMILES-Alignment.git.
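The method adapts the Needleman-Wunsch recurrence with a scoring function over atomic partial charges. A minimal sketch of that global-alignment DP, with a toy charge-based match score; the particular score function and gap penalty here are illustrative assumptions, not the published parameters.

```python
def needleman_wunsch(a, b, score, gap=-1.0):
    """Needleman-Wunsch global alignment: fill the DP table and return
    the optimal alignment score for sequences a and b."""
    n, m = len(a), len(b)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap                 # leading gaps in b
    for j in range(1, m + 1):
        F[0][j] = j * gap                 # leading gaps in a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    return F[n][m]

# toy "atoms" represented only by a partial charge; matches score higher
# when the two charges are close
def charge_score(qa, qb):
    return 1.0 - abs(qa - qb)
```

With such a score, the optimal alignment pairs atoms whose electronic environments are most similar across a transformation, which is what lets the method track conserved atoms through a pathway.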
SMILES alignment: a dynamic programming approach for the alignment of metabolites and other small organic molecules. BMC Bioinformatics 26(1):251. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12534939/pdf/
Background: The emergent dynamics of complex gene regulatory networks govern various cellular processes. However, understanding these dynamics is challenging due to the difficulty of parameterizing the computational models for these networks, especially as the network size increases. Here, we introduce a simulation library, Gene Regulatory Interaction Network Simulator (GRiNS), to address these challenges.
Results: GRiNS integrates popular parameter-agnostic simulation frameworks, RACIPE and Boolean Ising formalism, into a single Python library capable of leveraging GPU acceleration for efficient and scalable simulations. GRiNS extends the ordinary differential equations (ODE) based RACIPE framework with a more modular design, allowing users to choose parameters, initial conditions, and time-series outputs for greater customisability and accuracy in simulations. For large networks, where ODE-based simulation formalisms do not scale well, GRiNS implements Boolean Ising formalism, providing a simplified, coarse-grained alternative, significantly reducing the computational cost while capturing key dynamical behaviours of large regulatory networks.
Conclusion: GRiNS enables parameter-agnostic modeling of gene regulatory networks to study their dynamic and steady-state behaviors in a scalable and efficient manner. The documentation and installation instructions for GRiNS can be found at https://moltenecdysone09.github.io/GRiNS/ .
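GRiNS's Boolean Ising formalism coarse-grains each gene to an on/off spin updated by the signed influences of its regulators. The sketch below is a minimal numpy illustration of that idea with synchronous updates; GRiNS's actual update scheme, tie handling, and GPU implementation may differ.

```python
import numpy as np

def ising_steady_state(J, s0, max_steps=100):
    """Synchronous Boolean Ising updates: s_i <- sign(sum_j J[i, j] * s_j),
    where J[i, j] is the signed influence of gene j on gene i and states
    are +1 (on) / -1 (off). Ties keep the current state."""
    s = s0.copy()
    for _ in range(max_steps):
        h = J @ s                                        # local field
        s_new = np.where(h > 0, 1, np.where(h < 0, -1, s))
        if np.array_equal(s_new, s):
            return s_new                                 # fixed point reached
        s = s_new
    return s                                             # no fixed point found
```

Running this from many random initial states and tallying the fixed points gives the coarse-grained steady-state landscape that, for large networks, stands in for the far costlier ODE simulations.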
GRiNS: a python library for simulating gene regulatory network dynamics. Pradyumna Harlapur, Harshavardhan Bv, Mohit Kumar Jolly. Pub Date: 2025-10-17, DOI: 10.1186/s12859-025-06268-0. BMC Bioinformatics 26(1):250. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12535162/pdf/
Pub Date: 2025-10-17, DOI: 10.1186/s12859-025-06277-z
Simon Van de Vyver, Tibo Vande Moortele, Peter Dawyndt, Bart Mesuere, Pieter Verschaffelt
Background: Pattern matching is a fundamental challenge in bioinformatics, especially in the fields of genomics, transcriptomics and proteomics. Efficient indexing structures, such as suffix arrays, are critical for searching large datasets. A sparse suffix array (SSA) retains only suffixes at every k-th position in the text, where k is the sparseness factor. While sparse suffix arrays offer significant memory savings compared to full suffix arrays, they typically still require the construction of a full suffix array prior to a sampling step, resulting in substantial memory overhead during the construction phase.
Results: We present an alternative method to directly construct the sparse suffix array using a simple, yet powerful text encoding. This encoding reduces the input text length by grouping characters, thereby enabling direct SSA construction by extending the widely used Libsais library. This approach bypasses the need to construct a full suffix array, reducing memory usage and construction time by 50 to 75% when building a sparse suffix array with sparseness factor 3 or 4 for various nucleotide and amino acid datasets. Depending on the alphabet size, similar gains can be achieved for sparseness factors up to 8. For higher sparseness factors, comparable performance improvements can be obtained by constructing the SSA using a suitable divisor of the desired sparseness factor, followed by a subsampling step. The method is particularly effective for applications with small alphabets, such as a nucleotide or amino acid alphabet. An open-source implementation of this method is available on GitHub, enabling easy adoption for large-scale bioinformatics applications.
Conclusions: We introduce an efficient method for the construction of sparse suffix arrays for large datasets. Central to this approach is the introduction of a simple text transformation, which then serves as input to Libsais. This method reduces the length of both the input text and the resulting suffix array by a factor of k, which improves execution time and memory usage significantly.
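The central trick is that comparing suffixes block-by-block in k-character groups gives the same order as comparing them character by character, so a suffix-array builder run on the "encoded" text (k-grams as meta-characters) yields the sparse suffix array directly. The sketch below demonstrates this equivalence with a naive sort standing in for Libsais; the padding character and function names are illustrative.

```python
def sparse_suffix_array(text, k):
    """Sort only the suffixes starting at positions i with i % k == 0.
    Grouping the text into k-character blocks does not change the relative
    order of these suffixes, so building the suffix array of the encoded
    text produces the sparse suffix array directly."""
    pad = (-len(text)) % k
    t = text + "\x00" * pad                     # pad to a multiple of k
    starts = range(0, len(t), k)
    return sorted(starts, key=lambda i: t[i:])  # naive sort stands in for Libsais

def sparse_via_full(text, k):
    """Reference method: build the full suffix array, then subsample it."""
    full = sorted(range(len(text)), key=lambda i: text[i:])
    return [i for i in full if i % k == 0]
```

Both routes give the same array, but the direct route never materializes the k-times-larger full suffix array, which is where the reported memory and time savings come from.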
Direct construction of sparse suffix arrays with Libsais. BMC Bioinformatics 26(1):252. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12535041/pdf/
Pub Date: 2025-10-16, DOI: 10.1186/s12859-025-06244-8
Giada Lalli, James Collier, Yves Moreau, Daniele Raimondi
Background: The barriers to effective data analysis are sometimes insurmountable. Concerns about privacy, security, and complexity can prevent researchers from using existing data analysis tools.
Results: JINet is a web browser-based platform intended to democratise access to advanced clinical and genomic data analysis software. It hosts numerous data analysis applications that run entirely in each user's web browser, so the data never leave their machine.
Conclusions: JINet promotes collaboration, standardisation, and reproducibility by sharing scripts rather than data, fostering a self-sustaining community in which users and developers of data analysis tools interact through JINet's interoperability primitives.
JINet: easy and secure private data analysis for everyone. BMC Bioinformatics 26(1):248. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12532838/pdf/
Pub Date: 2025-10-16, DOI: 10.1186/s12859-025-06227-9
Tianjian Yang, Wei Vivian Li
The integration and analysis of multi-modal data are increasingly essential across various domains including bioinformatics. As the volume and complexity of such data grow, there is a pressing need for computational models that not only integrate diverse modalities but also leverage their complementary information to improve clustering accuracy and insights, especially when dealing with partial observations with missing data. We propose Generalized Probabilistic Canonical Correlation Analysis (GPCCA), an unsupervised method for the integration and joint dimensionality reduction of multi-modal data. GPCCA addresses key challenges in multi-modal data analysis by handling missing values within the model, enabling the integration of more than two modalities, and identifying informative features while accounting for correlations within individual modalities. GPCCA demonstrates robustness to various missing data patterns and provides low-dimensional embeddings that facilitate downstream clustering and analysis. In a range of simulation settings, GPCCA outperforms existing methods in capturing essential patterns across modalities. Additionally, we demonstrate its applicability to multi-omics data from TCGA cancer datasets and a multi-view image dataset. GPCCA offers a useful framework for multi-modal data integration, effectively handling missing data and providing informative low-dimensional embeddings. Its performance across cancer genomics and multi-view image data highlights its robustness and potential for broad application. To make the method accessible to the wider research community, we have released an R package, GPCCA, which is available at https://github.com/Kaversoniano/GPCCA .
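GPCCA's probabilistic model is what allows missing values and more than two modalities; as background, the classical two-view, fully observed special case reduces to an SVD of the whitened cross-covariance. The sketch below shows that classical computation only and is not the GPCCA algorithm.

```python
import numpy as np

def cca_first_correlation(X, Y, eps=1e-8):
    """First canonical correlation between two fully observed views,
    via an SVD of the whitened cross-covariance matrix."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    n = len(X) - 1
    Sxx, Syy = Xc.T @ Xc / n, Yc.T @ Yc / n    # within-view covariances
    Sxy = Xc.T @ Yc / n                        # cross-view covariance

    def inv_sqrt(S):
        w, V = np.linalg.eigh(S)               # S is symmetric PSD
        return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

    M = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    # singular values of M are the canonical correlations, largest first
    return np.linalg.svd(M, compute_uv=False)[0]
```

The probabilistic formulation replaces these closed-form whitening steps with a latent-variable model fitted by EM, which is what makes partially observed rows usable instead of being dropped.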
Generalized probabilistic canonical correlation analysis for multi-modal data integration with full or partial observations. BMC Bioinformatics 26(1):249. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12533326/pdf/
Pub Date : 2025-10-15DOI: 10.1186/s12859-025-06272-4
Siqi Chen, Anhong Zheng, Weichi Yu, Chao Zhan
Background: The identification of protein-protein interaction (PPI) plays a crucial role in understanding the mechanisms of complex biological processes. Current research in predicting PPI has shown remarkable progress by integrating protein information with PPI topology structure. Nevertheless, these approaches frequently overlook the dynamic nature of protein and PPI structures during cellular processes, including conformational alterations and variations in binding affinities under diverse environmental circumstances. Additionally, the insufficient availability of comprehensive protein data hinders accurate protein representation. Consequently, these shortcomings restrict the model's generalizability and predictive precision.
Results: To address this, we introduce DCMF-PPI (Dynamic condition and multi-feature fusion framework for PPI), a novel hybrid framework that integrates dynamic modeling, multi-scale feature extraction, and probabilistic graph representation learning. DCMF-PPI comprises three core modules: (1) ProtT5-GAT Module: The protein language model ProtT5 is utilized to extract residue-level protein features, which are integrated with dynamic temporal dependencies. Graph attention networks are then employed to capture context-aware structural variations in protein interactions; (2) MPSWA Module: Employs parallel convolutional neural networks combined with wavelet transform to extract multi-scale features from diverse protein residue types, enhancing the representation of sequence and structural heterogeneity; (3) VGAE Module: Utilizes a Variational Graph Autoencoder to learn probabilistic latent representations, facilitating dynamic modeling of PPI graph structures and capturing uncertainty in interaction dynamics.
Conclusion: We conducted comprehensive experiments on benchmark datasets demonstrating that DCMF-PPI outperforms state-of-the-art methods in PPI prediction, achieving significant improvements in accuracy, precision, and recall. The framework's ability to fuse dynamic conditions and multi-level features highlights its effectiveness in modeling real-world biological complexities, positioning it as a robust tool for advancing PPI research and downstream applications in systems biology and drug discovery.
{"title":"DCMF-PPI: a protein-protein interaction predictor based on dynamic condition and multi-feature fusion.","authors":"Siqi Chen, Anhong Zheng, Weichi Yu, Chao Zhan","doi":"10.1186/s12859-025-06272-4","DOIUrl":"10.1186/s12859-025-06272-4","url":null,"abstract":"<p><strong>Background: </strong>The identification of protein-protein interaction (PPI) plays a crucial role in understanding the mechanisms of complex biological processes. Current research in predicting PPI has shown remarkable progress by integrating protein information with PPI topology structure. Nevertheless, these approaches frequently overlook the dynamic nature of protein and PPI structures during cellular processes, including conformational alterations and variations in binding affinities under diverse environmental circumstances. Additionally, the insufficient availability of comprehensive protein data hinders accurate protein representation. Consequently, these shortcomings restrict the model's generalizability and predictive precision.</p><p><strong>Results: </strong>To address this, we introduce DCMF-PPI (Dynamic condition and multi-feature fusion framework for PPI), a novel hybrid framework that integrates dynamic modeling, multi-scale feature extraction, and probabilistic graph representation learning. DCMF-PPI comprises three core modules: (1) ProtT5-GAT Module: The protein language model ProtT5 is utilized to extract residue-level protein features, which are integrated with dynamic temporal dependencies. 
Graph attention networks are then employed to capture context-aware structural variations in protein interactions; (2) MPSWA Module: Employs parallel convolutional neural networks combined with wavelet transform to extract multi-scale features from diverse protein residue types, enhancing the representation of sequence and structural heterogeneity; (3) VGAE Module: Utilizes a Variational Graph Autoencoder to learn probabilistic latent representations, facilitating dynamic modeling of PPI graph structures and capturing uncertainty in interaction dynamics.</p><p><strong>Conclusion: </strong>We conducted comprehensive experiments on benchmark datasets demonstrating that DCMF-PPI outperforms state-of-the-art methods in PPI prediction, achieving significant improvements in accuracy, precision, and recall. The framework's ability to fuse dynamic conditions and multi-level features highlights its effectiveness in modeling real-world biological complexities, positioning it as a robust tool for advancing PPI research and downstream applications in systems biology and drug discovery.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"247"},"PeriodicalIF":3.3,"publicationDate":"2025-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12522320/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145298374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
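The framework's first module applies graph attention networks to a PPI graph. As a rough, generic illustration of what one graph-attention layer computes (a Veličković-style GAT in plain numpy; not the authors' implementation, and the toy graph and dimensions are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(1)

def gat_layer(H, A, W, a_vec):
    """One dense graph-attention layer.

    H: (n, d_in) node features, A: (n, n) adjacency with self-loops,
    W: (d_in, d_out) projection, a_vec: (2*d_out,) attention vector.
    """
    Z = H @ W                                   # project node features
    n, d = Z.shape
    # Attention logits e_ij = LeakyReLU(a^T [z_i || z_j]).
    e = (Z @ a_vec[:d])[:, None] + (Z @ a_vec[d:])[None, :]
    e = np.where(e > 0, e, 0.2 * e)             # LeakyReLU, slope 0.2
    e = np.where(A > 0, e, -1e9)                # mask non-edges
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)   # softmax over neighbors
    return alpha @ Z                            # attention-weighted aggregation

# Toy "PPI graph": 4 proteins in a chain, 8-dim features, self-loops added.
n, d_in, d_out = 4, 8, 3
H = rng.normal(size=(n, d_in))
A = np.eye(n) + np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
W = rng.normal(size=(d_in, d_out))
a_vec = rng.normal(size=(2 * d_out,))
H_out = gat_layer(H, A, W, a_vec)
print(H_out.shape)  # (4, 3)
```

Each output row is a convex combination of its neighbors' projected features, with weights learned from the features themselves; stacking such layers lets the model capture the context-aware structural variation the abstract refers to.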
Background: Although Sanger sequencing remains widely used in human genetic disease diagnosis and livestock breeding, software packages for analyzing such data have seen little innovation over time. Determining the genotypes of tens to hundreds of loci across hundreds or thousands of samples still typically relies on manual visual confirmation with traditional software, a process that is both time-consuming and prone to error.
Results: We present SAGPEK, a tool that automatically identifies genotypes at target loci from hundreds to thousands of ABI-format Sanger sequencing files and directly outputs the results. SAGPEK extracts the signal intensities for A, G, C, and T bases, performs base calling, and determines each site's homozygous or heterozygous status. It then generates a primary sequence composed of the bases with the highest signal intensities and records secondary bases for heterozygous sites. Using either built-in or user-provided anchor sequences, SAGPEK maps the coordinates of target loci, reports their genotypes, and, when applicable, annotates the corresponding amino acid changes.
Conclusions: SAGPEK provides an efficient, flexible, and user-friendly solution for analyzing ABI-format Sanger sequencing data, enabling simultaneous genotyping of tens of loci across hundreds of samples. Its innovation lies not in introducing new base-calling methods, but in integrating versatile functionalities (batch genotyping, customizable anchor sequences, amino acid alteration reporting, chromatogram visualization, and local execution) into a single open-source package. This makes SAGPEK well suited for applications such as human genetic disease screening, drug-resistance mutation detection, and functional mutation identification in livestock and other organisms.
{"title":"SAGPEK: fast and flexible approach to identify genotypes of Sanger sequencing data.","authors":"Jinpeng Wang, Shuo Sun, Yaran Zhang, Ning Huang, Chunhong Yang, Yaping Gao, Xiuge Wang, Zhihua Ju, Qiang Jiang, Yao Xiao, Xiaochao Wei, Wenhao Liu, Jinming Huang","doi":"10.1186/s12859-025-06271-5","DOIUrl":"10.1186/s12859-025-06271-5","url":null,"abstract":"<p><strong>Background: </strong>Although Sanger sequencing remains widely used in human genetic disease diagnosis and livestock breeding, software packages for analyzing such data have seen little innovation over time. Determining the genotypes of tens to hundreds of loci across hundreds or thousands of samples still typically relies on manual visual confirmation with traditional software, a process that is both time-consuming and prone to error.</p><p><strong>Results: </strong>We present SAGPEK, a tool that automatically identifies genotypes at target loci from hundreds to thousands of ABI-format Sanger sequencing files and directly outputs the results. SAGPEK extracts the signal intensities for A, G, C, and T bases, performs base calling, and determines each site's homozygous or heterozygous status. It then generates a primary sequence composed of the bases with the highest signal intensities and records secondary bases for heterozygous sites. Using either built-in or user-provided anchor sequences, SAGPEK maps the coordinates of target loci, reports their genotypes, and, when applicable, annotates the corresponding amino acid changes.</p><p><strong>Conclusions: </strong>SAGPEK provides an efficient, flexible, and user-friendly solution for analyzing ABI-format Sanger sequencing data, enabling simultaneous genotyping of tens of loci across hundreds of samples. 
Its innovation lies not in introducing new base-calling methods, but in integrating versatile functionalities (batch genotyping, customizable anchor sequences, amino acid alteration reporting, chromatogram visualization, and local execution) into a single open-source package. This makes SAGPEK well suited for applications such as human genetic disease screening, drug-resistance mutation detection, and functional mutation identification in livestock and other organisms.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"246"},"PeriodicalIF":3.3,"publicationDate":"2025-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12523141/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145290903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
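The genotype-calling step the SAGPEK abstract describes (call the strongest base per position, and flag heterozygous sites by recording a secondary base) can be sketched as follows. This is a minimal illustration, not SAGPEK's code; the 0.5 heterozygosity ratio is an invented threshold for the example:

```python
BASES = "AGCT"

def call_site(intensities, het_ratio=0.5):
    """Call one chromatogram position from per-base signal intensities.

    intensities: dict mapping 'A','G','C','T' to signal strength.
    Returns (primary_base, secondary_base or None). A site is flagged
    heterozygous when the second-strongest signal reaches het_ratio of
    the strongest (illustrative threshold, not SAGPEK's parameter).
    """
    ranked = sorted(BASES, key=lambda b: intensities[b], reverse=True)
    primary, second = ranked[0], ranked[1]
    if intensities[primary] > 0 and \
       intensities[second] / intensities[primary] >= het_ratio:
        return primary, second        # heterozygous call, e.g. G/A
    return primary, None              # homozygous call

# A homozygous G peak, then an overlapping A/G heterozygous peak.
print(call_site({"A": 30, "G": 900, "C": 20, "T": 15}))   # ('G', None)
print(call_site({"A": 480, "G": 520, "C": 25, "T": 10}))  # ('G', 'A')
```

Applied along a whole trace, the primary calls form the main sequence and the secondary calls mark heterozygous sites, which anchor-sequence matching can then map onto target loci.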