Pub Date: 2025-10-11; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf245
Muhammed Talo, Serdar Bozdag
Motivation: Understanding protein functions facilitates the identification of the underlying causes of many diseases and guides the search for new therapeutic targets and medications. With the advancement of high-throughput technologies, obtaining novel protein sequences has become routine. However, determining protein functions experimentally is cost- and labor-prohibitive. Therefore, it is crucial to develop computational methods for automatic protein function prediction.
Results: In this study, we propose a multimodal deep learning architecture called ProtFun to predict protein functions. ProtFun integrates protein large language model embeddings as node features in a protein family network. Employing graph attention networks on this protein family network, ProtFun learns protein embeddings, which are integrated with protein signature representations from InterPro to train a protein function prediction model. We evaluated our architecture using three benchmark datasets. Our results showed that our proposed approach outperformed current state-of-the-art methods in most cases. An ablation study also highlighted the importance of different components of ProtFun.
Availability and implementation: The data and source code of ProtFun are available at https://github.com/bozdaglab/ProtFun under the Creative Commons Attribution-NonCommercial 4.0 International Public License.
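To make the graph attention step concrete, the following is a minimal single-head sketch in plain Python. It is an illustration only, not the ProtFun implementation: the function name `gat_layer`, the toy two-dimensional features standing in for language model embeddings, and the three-node network are all assumptions made for this example.

```python
import math


def leaky_relu(x, slope=0.2):
    """LeakyReLU non-linearity applied to raw attention scores."""
    return x if x > 0 else slope * x


def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def gat_layer(features, neighbors, W, a):
    """One single-head graph attention layer (hypothetical helper).

    features:  dict node -> feature vector (e.g. a protein embedding)
    neighbors: dict node -> list of adjacent nodes in the network
    W:         weight matrix, out_dim x in_dim
    a:         attention vector of length 2 * out_dim
    """
    z = {n: matvec(W, h) for n, h in features.items()}
    out = {}
    for i in features:
        # Each node attends to itself and to its neighbors.
        nbrs = [i] + [j for j in neighbors.get(i, []) if j != i]
        scores = [
            leaky_relu(sum(c * v for c, v in zip(a, z[i] + z[j])))
            for j in nbrs
        ]
        # Numerically stable softmax over the neighborhood.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        out[i] = [
            sum(al * z[j][k] for al, j in zip(alphas, nbrs))
            for k in range(len(z[i]))
        ]
    return out


# Toy network: three proteins, P2 linked to both P1 and P3.
features = {"P1": [1.0, 0.0], "P2": [0.0, 1.0], "P3": [1.0, 1.0]}
neighbors = {"P1": ["P2"], "P2": ["P1", "P3"], "P3": ["P2"]}
W = [[1.0, 0.0], [0.0, 1.0]]   # identity weights, for illustration
a = [0.0, 0.0, 0.0, 0.0]       # zero attention vector -> uniform weights
embeddings = gat_layer(features, neighbors, W, a)
```

With the zero attention vector used here, every neighbor receives the same weight, so each output is simply the mean of the neighborhood features; learned values of `W` and `a` would weight neighbors unequally.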
Title: ProtFun: a protein function prediction model using graph attention networks with a protein large language model. Journal: Bioinformatics Advances, volume 5, issue 1, article vbaf245. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12571506/pdf/
Pub Date: 2025-10-10; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf252
Bishwa Ghimire, Nicholas Booth, Tapio Lönnberg, Tero Aittokallio
Motivation: High-throughput genomic data analysis consists of the inextricably intertwined inputs and outputs of a vast array of bioinformatic analysis tools. To guarantee streamlined and reproducible analyses, the often complex data analysis pipelines need to be run using workflow management tools. Nextflow is one popular tool commonly used to automate such pipelines. Nextflow records key pipeline data, such as the submission time, start time, completion time, CPU usage, memory usage, and disk usage for each task run. These data are stored in log files, often scattered across a file system. Therefore, aggregating the resource usage information that is critical for optimizing Nextflow pipelines and improving reproducibility, as well as parsing and managing such log data, can quickly become cumbersome.
Results: Here, we present a web-based tool, Nextpie, which provides both a database and a reporting tool for Nextflow pipelines. Nextpie stores comprehensive resource usage information in a relational database, thus facilitating and accelerating the performance of a variety of data analyses and interactive visualizations, providing an easily comprehensible overview of a pipeline's resource usage.
Availability and implementation: The Nextpie source code, user documentation, an SQLite database with test data, and a Nextflow example pipeline are available at GitHub (https://github.com/bishwaG/Nextpie).
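The idea of moving per-task resource records from scattered log files into a relational database can be sketched with Python's built-in `sqlite3` module. This is a hedged illustration, not Nextpie's schema or API: the table layout, the column names `realtime_ms` and `peak_rss_mb`, and the inline sample data are simplified assumptions, and a real Nextflow trace file contains more columns and may use different units.

```python
import csv
import io
import sqlite3

# Hypothetical, simplified tab-separated trace with one row per task.
TRACE_TSV = (
    "task_id\tname\tstatus\trealtime_ms\tpeak_rss_mb\n"
    "1\tFASTQC (s1)\tCOMPLETED\t42000\t512\n"
    "2\tFASTQC (s2)\tCOMPLETED\t38000\t498\n"
    "3\tALIGN (s1)\tCOMPLETED\t600000\t8192\n"
)


def load_trace(conn, run_name, tsv_text):
    """Insert every task row of one pipeline run; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS task ("
        "run TEXT, task_id INTEGER, process TEXT, status TEXT, "
        "realtime_ms INTEGER, peak_rss_mb REAL)"
    )
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = [
        (
            run_name,
            int(r["task_id"]),
            r["name"].split(" (")[0],  # "FASTQC (s1)" -> "FASTQC"
            r["status"],
            int(r["realtime_ms"]),
            float(r["peak_rss_mb"]),
        )
        for r in reader
    ]
    conn.executemany("INSERT INTO task VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def usage_by_process(conn):
    """Aggregate task count, total runtime and peak memory per process."""
    return conn.execute(
        "SELECT process, COUNT(*), SUM(realtime_ms), MAX(peak_rss_mb) "
        "FROM task GROUP BY process ORDER BY process"
    ).fetchall()


conn = sqlite3.connect(":memory:")
load_trace(conn, "run1", TRACE_TSV)
summary = usage_by_process(conn)
```

Once the records sit in one table, per-process or per-run summaries become single SQL queries instead of ad hoc log parsing, which is the kind of aggregation a reporting front end can build on.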
Title: Nextpie: a web-based reporting tool and database for reproducible nextflow pipelines. Journal: Bioinformatics Advances, volume 5, issue 1, article vbaf252. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12574971/pdf/
Pub Date: 2025-10-09; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf250
Bradley T Martin, Zachery D Zbinden, Michael E Douglas, Marlis R Douglas, Tyler K Chafin
Motivation: Determining the geographic origin of samples is a common objective in wildlife management, forensics, and conservation. Current methods often assume evolutionary models or require extensive reference datasets that are costly and difficult to develop, and they perform poorly with uneven or biased sampling. Supervised deep learning offers a promising alternative by learning complex patterns without prior model specifications. Combined with novel geo-genetic data augmentation and preprocessing techniques, it can reduce reference panel demands and improve performance across diverse sampling schemes, broadening accurate provenance determination to more study systems.
Results: We present GeoGenIE, an open-source software package powered by PyTorch for geographic provenance prediction from genomic data. GeoGenIE implements a multilayer perceptron architecture within an automated hyperparameter tuning framework, incorporating preprocessing, geo-genetic outlier detection, and data augmentation to improve accuracy in sparsely sampled regions. Benchmarking against a comparable approach with White-tailed deer (Odocoileus virginianus) double digest restriction-site associated DNA sequencing data, GeoGenIE achieved substantially improved geolocation accuracy with less spatial bias using a smaller SNP panel. Gains were most evident in undersampled regions, underscoring effectiveness under challenging conditions. Its parallelized execution also produced fast runtimes, promoting its application to large datasets.
Availability and implementation: Open-source at https://github.com/btmartin721/geogenie and https://pypi.org/project/GeoGenIE/.
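Geolocation accuracy is commonly summarized as the great-circle distance between predicted and true sampling coordinates. The sketch below assumes that metric for illustration; it is not taken from the GeoGenIE code base, and the function names and the spherical Earth radius of 6371 km are choices made for this example.

```python
import math
import statistics

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def median_error_km(predicted, actual):
    """Median prediction error over paired (lat, lon) tuples."""
    return statistics.median(
        haversine_km(p[0], p[1], a[0], a[1])
        for p, a in zip(predicted, actual)
    )


# Hypothetical predictions versus true sampling sites.
predicted = [(35.0, -92.0), (36.1, -94.2), (34.5, -91.0)]
actual = [(35.2, -92.3), (36.0, -94.0), (34.9, -91.8)]
error_km = median_error_km(predicted, actual)
```

Reporting the median rather than the mean keeps a few badly placed samples from dominating the summary, which matters when some regions are sparsely sampled.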
Title: GeoGenIE: a deep learning approach to predict geographic provenance of biodiversity samples from genomic SNPs. Journal: Bioinformatics Advances, volume 5, issue 1, article vbaf250. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12596584/pdf/
Pub Date: 2025-10-09; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf247
Monica-Andreea Baciu-Drăgan, Niko Beerenwinkel
Summary: In recent years, developments in single-cell next-generation sequencing technology and computational methodology have made it possible to reconstruct, with increasing precision, the evolutionary history of tumors and their cell phylogenies, represented as mutation trees. Many mutation tree inference tools exist, but they support neither detailed visual tree inspection nor tree comparison and analysis at the cohort level, an important task in computational oncology. We developed oncotreeVIS, an interactive graphical user interface for visualizing mutation tree cohorts and tree posterior distributions obtained from mutation tree inference tools. OncotreeVIS can display mutation trees that encode single or joint genetic events, such as point mutations and copy number changes, and highlight matching subclones, conserved trajectories and drug-gene interactions at the cohort level. OncotreeVIS facilitates the visual inspection of mutation tree clusters and pairwise tree distances. It is available both as a JavaScript library that can be used locally and as a web application that can be accessed online. It includes seven default datasets of public mutation tree cohorts for visualization, and new mutation trees can be supplied in a predefined JSON format.
Availability and implementation: https://cbg-ethz.github.io/oncotreeVIS.
Title: OncotreeVIS-an interactive graphical user interface for visualizing mutation tree cohorts. Journal: Bioinformatics Advances, volume 5, issue 1, article vbaf247. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12596139/pdf/
Pub Date: 2025-10-09; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf249
Jichen Zhang, Yutaka Saito
Motivation: Accurate detection of 5-methylcytosine (5mC) from PacBio single-molecule real-time (SMRT) sequencing remains challenging in prokaryotes due to weak kinetic signals and motif diversity.
Results: Here, we present KinMethyl, a generalizable deep learning framework that integrates sequence and kinetic signals to improve methylation detection across diverse bacterial genomes. Central to our approach is a regression model trained on whole-genome amplified samples to estimate the expected kinetic signals of unmethylated sequences. These predicted signals are incorporated into a downstream classifier to enhance performance under low signal-to-noise conditions. KinMethyl outperforms existing tools such as kineticstools and ccsmeth across multiple bacterial species, methylation motifs, and modification types, including not only 5mC but also N6-methyladenine (6mA) and N4-methylcytosine (4mC). In 5mC classification, KinMethyl improved the AUC by up to 0.20 compared to the existing method (0.6165 to 0.8190) with statistical significance (DeLong's test, P < 1e-10). The improvements were consistently observed in cross-species evaluations as well as across different sequencing platforms, including RSII and Sequel. This work highlights the utility of kinetic signal modeling and feature integration for robust and motif-independent methylation analysis in prokaryotic epigenomics.
Availability and implementation: The source code is available at https://github.com/ZhangBio/KinMethyl.
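The two-stage idea, first estimating what an unmethylated site should look like and then feeding the deviation to a classifier, can be illustrated with a deliberately simple stand-in. In the sketch below a per-context mean learned from control data replaces KinMethyl's regression model; that substitution, the function names, the toy sequence contexts, and the signal values are all assumptions made for this example.

```python
import math
from collections import defaultdict


def fit_expected_signal(control):
    """Mean kinetic signal per sequence context from unmethylated controls.

    control: iterable of (context, signal) pairs, e.g. from a
    whole-genome amplified sample. A per-context mean is used here as a
    stand-in for a trained regression model.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for context, signal in control:
        sums[context][0] += signal
        sums[context][1] += 1
    return {ctx: total / n for ctx, (total, n) in sums.items()}


def log_ratio_features(native, expected, default):
    """log2(observed / expected) per site; larger values suggest modification.

    Contexts absent from the control data fall back to `default`.
    """
    return [
        math.log2(signal / expected.get(context, default))
        for context, signal in native
    ]


# Toy control (unmethylated) and native observations.
control = [("GATC", 1.0), ("GATC", 3.0), ("CCWGG", 2.0)]
native = [("GATC", 4.0), ("CCWGG", 2.0)]
expected = fit_expected_signal(control)
features = log_ratio_features(native, expected, default=1.0)
```

Normalizing by the expected signal removes the sequence-dependent baseline, so a downstream classifier sees the deviation attributable to modification rather than the raw kinetics.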
Title: KinMethyl: robust methylation detection in prokaryotic SMRT sequencing via kinetic signal modeling and deep feature integration. Journal: Bioinformatics Advances, volume 5, issue 1, article vbaf249. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12552093/pdf/
Pub Date: 2025-10-08; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf223
Athina Gavriilidou, Alexandros Stamatakis, Anne Kupczok, Iliana Bista, Chris D Jiggins, Rosa Fernández, Eirini Skourtanioti, Grigoris Amoutzias, Daniela Delneri, Nikos Kyrpides, Christoforos Nikolaou, Alexandros A Pittis, Tereza Manousaki, Nikolaos Vakirlis
This perspective outlines emerging trends, key challenges, and future opportunities in evolutionary and comparative genomics. Our starting point is the set of topics presented at the 2024 EMBO Early Career Lecture Course "Evolutionary and Comparative Genomics", which highlighted recent conceptual and methodological advances in areas including microbial pangenomes, protein evolution, hybrid speciation, novel gene origination and transposon dynamics. Here, we emphasize the role of computational and molecular approaches, providing a forward-looking view on where the field is headed and how it is being reshaped by new technologies and approaches.
Title: Advances and challenges in understanding evolution through genome comparison: meeting report of the European Molecular Biology Organization (EMBO) lecture course "Evolutionary and Comparative Genomics". Journal: Bioinformatics Advances, volume 5, issue 1, article vbaf223. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12571500/pdf/
Motivation: Peptides containing non-proteinogenic amino acids (PNPAs) are promising targets in drug development because of their unique pharmacological properties. PNPA research has been hindered by the lack of mass spectra and retention data, because accurate assessment of retention time in chromatography is crucial for identifying structures and characterizing functions. Conventional methods are often ineffective because of the limited amount of data. This study aims to predict the retention order of PNPAs, not their absolute retention times, from their structures by using data from peptides and small molecules. This approach can advance natural product identification and drug research.
Results: Our model uses the Ranking Support Vector Machine and successfully predicted the retention order of PNPAs with an accuracy of over 0.9. Counting fingerprints and the MIX fingerprint, which combines four types of fingerprints, were used as explanatory variables. To suppress multicollinearity, principal component analysis was applied to reduce spurious fingerprints. SHAP value analysis revealed that one component, derived from methyl groups, contributed most to the prediction. Overall, order prediction can effectively find candidate compounds in LC/MS data from non-conventional biological extracts.
Availability and implementation: https://github.com/ShoheiNakamukai/RO_prediction_of_PNPA/tree/main.
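A ranking SVM learns an ordering by training a linear classifier on feature differences between pairs of items. The sketch below shows that pairwise transform with a small subgradient solver in plain Python. It is not the authors' implementation: the solver, the hyperparameters, and the two-dimensional toy descriptors (standing in for fingerprint-derived principal components) are assumptions made for this example.

```python
def pairwise_differences(X, retention):
    """For every pair with retention[i] < retention[j], emit x_j - x_i.

    A correct ranking model w must satisfy w . (x_j - x_i) > 0.
    """
    diffs = []
    for i in range(len(X)):
        for j in range(len(X)):
            if retention[i] < retention[j]:
                diffs.append([b - a for a, b in zip(X[i], X[j])])
    return diffs


def train_rank_svm(X, retention, epochs=100, lr=0.1, lam=0.01):
    """Linear ranking SVM: hinge loss on pair differences, L2 penalty."""
    w = [0.0] * len(X[0])
    diffs = pairwise_differences(X, retention)
    for _ in range(epochs):
        for d in diffs:
            margin = sum(wi * di for wi, di in zip(w, d))
            if margin < 1.0:
                # Subgradient step for a violated margin.
                w = [wi - lr * (lam * wi - di) for wi, di in zip(w, d)]
            else:
                # Only the regularization term contributes.
                w = [wi - lr * lam * wi for wi in w]
    return w


def pairwise_accuracy(w, X, retention):
    """Fraction of ordered pairs whose predicted order matches."""
    diffs = pairwise_differences(X, retention)
    correct = sum(
        1 for d in diffs if sum(wi * di for wi, di in zip(w, d)) > 0
    )
    return correct / len(diffs)


# Toy descriptors; elution order follows the first feature.
X = [[0.0, 1.0], [1.0, 0.5], [2.0, 0.8], [3.0, 0.1]]
retention = [1.2, 3.4, 5.0, 7.7]
w = train_rank_svm(X, retention)
accuracy = pairwise_accuracy(w, X, retention)
```

Because only differences between compounds enter the loss, the model never has to reproduce absolute retention times, which is what makes it possible to pool data measured under different chromatographic conditions.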
Title: Retention order prediction of peptides containing non-proteinogenic amino acids. Authors: Shohei Nakamukai, Eisuke Hayakawa, Tetsuya Mori, Yuji Ise, Masami Yokota Hirai, Masanori Arita. Pub Date: 2025-10-08; DOI: 10.1093/bioadv/vbaf246. Journal: Bioinformatics Advances, volume 5, issue 1, article vbaf246. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12643230/pdf/
Pub Date: 2025-10-06; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf238
Gene Myers, Richard Durbin, Chenxi Zhou
Motivation: FastGA finds alignments between two genome sequences more than an order of magnitude faster than previous methods that have comparable sensitivity. Its speed is due to (i) a fully cache-local architecture involving only MSD radix sorts and merges, (ii) an algorithm for finding adaptive seed hits in a linear merge of sorted k-mer tables, and (iii) a variant of the Myers adaptive wave algorithm to find alignments around a chain of seed hits. It further stores alignments in a fraction of the space of a conventional CIGAR string using a trace-point encoding and our ONEcode data system introduced here.
Results: For example, two 2 Gbp bat genomes are compared in 2.1 min with eight threads on an Apple laptop using 5.7 GB of memory and producing 1.05 million alignments covering 60% of each genome. Our ALN format file occupies 66 MB and in just 6 s can be converted to a standard 1.03 GB PAF file.
Availability and implementation: FastGA is freely available at GitHub: http://www.github.com/thegenemyers/FASTGA along with utilities for viewing inputs, intermediates, and outputs and transforming ALN files to PSL or PAF with or without CIGAR strings and common formats. There is also a utility to chain FastGA's alignments and display them in a dot-plot view in PostScript files.
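Finding seed hits by merging two sorted k-mer tables can be sketched in a few lines. This is a simplified illustration, not FastGA's algorithm: it uses a fixed k rather than adaptive seeds, Python's built-in sort rather than an MSD radix sort, and hypothetical function names.

```python
def kmer_table(seq, k):
    """Sorted list of (k-mer, position) pairs for one sequence."""
    return sorted((seq[i:i + k], i) for i in range(len(seq) - k + 1))


def merge_seed_hits(table_a, table_b):
    """Linear merge of two sorted k-mer tables.

    Returns (pos_a, pos_b) for every k-mer occurring in both tables.
    Each table is scanned once, front to back.
    """
    hits = []
    i = j = 0
    while i < len(table_a) and j < len(table_b):
        ka, kb = table_a[i][0], table_b[j][0]
        if ka < kb:
            i += 1
        elif ka > kb:
            j += 1
        else:
            # Collect the run of equal k-mers on both sides.
            i_end = i
            while i_end < len(table_a) and table_a[i_end][0] == ka:
                i_end += 1
            j_end = j
            while j_end < len(table_b) and table_b[j_end][0] == ka:
                j_end += 1
            hits.extend(
                (pa, pb)
                for _, pa in table_a[i:i_end]
                for _, pb in table_b[j:j_end]
            )
            i, j = i_end, j_end
    return hits


# Toy sequences; hits on the same diagonal (pos_a - pos_b) can be chained.
hits = merge_seed_hits(kmer_table("ACGTAC", 3), kmer_table("TACGT", 3))
```

Because both tables are read sequentially, the merge touches memory in order, which is the access pattern a cache-local design aims for; repeated k-mers are the case where the number of hits can exceed the table sizes.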
Title: FastGA: fast genome alignment. Journal: Bioinformatics Advances, volume 5, issue 1, article vbaf238. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12624442/pdf/
Pub Date: 2025-10-06; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf244
Changhoe Hwang, Krishna Prasad Adhikari, Gyeongin Oh, Sunshin Kim
Motivation: A major limitation in the development of fetal chromosomal aneuploidy detection technologies lies in the scarcity of real positive data. To address this issue, we propose a novel methodology to generate virtually unlimited synthetic negative and positive datasets with >99.9% similarity to real data, enabling accurate detection of both autosomal chromosome aneuploidies (ACA) and sex chromosome aneuploidies (SCA). In terms of methods, blood samples from 15 999 pregnant women were analyzed, including 186 clinically confirmed positive cases. Using 701 high-confidence negatives as a reference, we designed algorithms for synthetic data generation. For negatives, multiple real FASTQ files were randomly merged, and fetal fraction (FF) was recalculated to reflect biological variability. For positives, chromosome-specific read counts were adjusted using numerical equations: ACAs were simulated by increasing the target chromosome reads, and SCAs were generated by adjusting sex chromosome read counts using regression models that account for FF and total read count, with the GC distribution preserved. Logistic regression (LR) models were then trained using features including FF, GC content, and chromosomal read counts. Performance was evaluated against conventional z-score methods and real positive cases.
Results: From high-confidence negative samples, ∼160 000 synthetic training datasets were generated for major ACA and ∼35 000 for each SCA. While z-score methods showed declines in sensitivity (T13) or positive predictive value (PPV) (T18, T21) under low prevalence, LR models consistently maintained 100% sensitivity and PPV for ACAs, achieved ≥99.6% sensitivity and PPV for SCAs on synthetic evaluation datasets, and demonstrated 100% accuracy on real positive samples.
Availability and implementation: All formulas and procedures required for synthetic data generation and model development are implemented in Python and are available at https://github.com/genomecare-rnd/SyntheticData-NIPT.
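The abstract describes simulating an autosomal aneuploidy by increasing the read count of the target chromosome, and compares against a conventional z-score test. The sketch below illustrates that idea under stated assumptions and is not the authors' implementation: the scaling factor 1 + FF/2 is the standard expectation for a fetal trisomy at fetal fraction FF (the abstract does not give the paper's equations), and the function names, the two-bin count dictionary, and the toy reference set are all hypothetical.

```python
import random
import statistics


def simulate_trisomy(counts, target, fetal_fraction):
    """Return a copy of `counts` with the target chromosome's reads scaled
    by 1 + FF/2 (assumed model: the fetus carries three copies, not two)."""
    adjusted = dict(counts)
    adjusted[target] = round(counts[target] * (1 + fetal_fraction / 2))
    return adjusted


def chrom_fraction(counts, target):
    """Fraction of all reads that map to the target chromosome."""
    return counts[target] / sum(counts.values())


def z_score(value, reference_values):
    """Conventional z-score of a value against a negative reference set."""
    mean = statistics.mean(reference_values)
    sd = statistics.stdev(reference_values)
    return (value - mean) / sd


# Toy negative reference set (made-up numbers, for illustration only).
rng = random.Random(0)
reference = [
    {"chr21": rng.randint(14_900, 15_100), "other": 985_000} for _ in range(50)
]
ref_fracs = [chrom_fraction(c, "chr21") for c in reference]

sample = {"chr21": 15_000, "other": 985_000}
positive = simulate_trisomy(sample, "chr21", fetal_fraction=0.10)

z_neg = z_score(chrom_fraction(sample, "chr21"), ref_fracs)
z_pos = z_score(chrom_fraction(positive, "chr21"), ref_fracs)
```

In this toy setting the simulated positive scores well above the unmodified sample. A logistic regression model of the kind the abstract mentions would take such fractions, together with FF and GC content, as input features rather than applying a fixed z-score threshold.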
{"title":"Synthetic data-driven AI approach for fetal chromosomal aneuploidies detection.","authors":"Changhoe Hwang, Krishna Prasad Adhikari, Gyeongin Oh, Sunshin Kim","doi":"10.1093/bioadv/vbaf244","DOIUrl":"10.1093/bioadv/vbaf244","url":null,"abstract":"<p><strong>Motivation: </strong>A major limitation in the development of fetal chromosomal aneuploidy detection technologies lies in the scarcity of real positive data. To address this issue, we propose a novel methodology to generate virtually unlimited synthetic negative and positive datasets with >99.9% similarity to real data, enabling accurate detection of both autosomal chromosome aneuploidies (ACA) and sex chromosome aneuploidies (SCA). In terms of methods, blood samples from 15 999 pregnant women were analyzed, including 186 clinically confirmed positive cases. Using 701 high-confidence negatives as a reference, we designed algorithms for synthetic data generation. For negatives, multiple real FASTQ files were randomly merged, and fetal fraction (FF) was recalculated to reflect biological variability. For positives, chromosome-specific read counts were adjusted using numerical equations: ACAs were simulated by increasing the target chromosome reads, and SCAs were generated by adjusting sex chromosome read counts using regression models that account for FF and total read count, with the GC distribution preserved. Logistic regression (LR) models were then trained using features including FF, GC content, and chromosomal read counts. Performance was evaluated against conventional <i>z</i>-score methods and real positive cases.</p><p><strong>Results: </strong>From high-confidence negative samples, ∼160 000 synthetic training datasets were generated for major ACA and ∼35 000 for each SCA. 
While <i>z</i>-score methods showed declines in sensitivity (T13) or positive predictive value (PPV) (T18, T21) under low prevalence, LR models consistently maintained 100% sensitivity and PPV for ACAs, achieved ≥99.6% sensitivity and PPV for SCAs on synthetic evaluation datasets, and demonstrated 100% accuracy on real positive samples.</p><p><strong>Availability and implementation: </strong>All formulas and procedures required for synthetic data generation and model development are implemented in Python and are available at https://github.com/genomecare-rnd/SyntheticData-NIPT.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf244"},"PeriodicalIF":2.8,"publicationDate":"2025-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12557104/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145395710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-06 | eCollection Date: 2025-01-01 | DOI: 10.1093/bioadv/vbaf224
Daniel J van Zyl, Marcel Dunaiski, Houriiyah Tegally, Cheryl Baxter, Tulio de Oliveira, Joicymara S Xavier
Motivation: The dengue virus poses a major global health threat, with nearly 390 million infections annually. A recently proposed hierarchical dengue nomenclature system enhances spatial resolution by defining major and minor lineages within genotypes, aiding efforts to track viral evolution. While current subtyping tools (Genome Detective, GLUE, and Nextclade) rely on computationally intensive sequence alignment and phylogenetic inference, machine learning presents a promising alternative for achieving accurate and rapid classification.
Results: We present Craft (Chaos Random Forest), a machine learning framework for dengue subtyping. We demonstrate that Craft classifies sequences faster than existing tools while matching or surpassing their accuracy. Craft achieves 99.5% accuracy on a hold-out test set formed from a consensus of predictions from existing tools, and it processes over 140 000 sequences per minute. Notably, Craft maintains high accuracy even when classifying sequence segments as short as 700 nucleotides.
Availability and implementation: Source code is available at: https://github.com/INFORM-Africa/AI-viral-lineage-classification.
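The abstract does not describe Craft's features. The name "Chaos Random Forest" suggests alignment-free features from the chaos game representation (CGR), so the sketch below assumes that and should not be read as the authors' pipeline. It computes a frequency CGR, a 2^k by 2^k grid of k-mer frequencies, using integer arithmetic. The function name `fcgr`, the corner assignment, and the default `k` are illustrative choices.

```python
# Corner bits (x, y) for each nucleotide; this assignment is one common
# convention and is an assumption here.
CORNER_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(sequence, k=3):
    """Return a flattened 2^k x 2^k frequency chaos game representation.

    Each cell counts one k-mer. The CGR "move halfway to the corner" rule is
    applied on integer cell indices: the newest base sets the most
    significant bit and older bases shift toward the least significant bit.
    """
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    x_idx = y_idx = 0
    run = 0  # consecutive unambiguous bases seen so far
    for base in sequence.upper():
        bits = CORNER_BITS.get(base)
        if bits is None:  # ambiguous base (e.g. N): restart the k-mer window
            x_idx = y_idx = run = 0
            continue
        x_idx = (x_idx >> 1) | (bits[0] << (k - 1))
        y_idx = (y_idx >> 1) | (bits[1] << (k - 1))
        run += 1
        if run >= k:
            grid[y_idx][x_idx] += 1
    total = sum(sum(row) for row in grid)
    # Normalise to frequencies so sequences of different lengths are comparable.
    return [count / total if total else 0.0 for row in grid for count in row]


features = fcgr("ACGTACGTTTGACCA", k=3)
```

Because the vector has a fixed length regardless of sequence length, it can be passed directly to a random forest classifier (for example scikit-learn's `RandomForestClassifier`) without aligning sequences first.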
{"title":"Craft: a machine learning approach to dengue subtyping.","authors":"Daniel J van Zyl, Marcel Dunaiski, Houriiyah Tegally, Cheryl Baxter, Tulio de Oliveira, Joicymara S Xavier","doi":"10.1093/bioadv/vbaf224","DOIUrl":"10.1093/bioadv/vbaf224","url":null,"abstract":"<p><strong>Motivation: </strong>The dengue virus poses a major global health threat, with nearly 390 million infections annually. A recently proposed hierarchical dengue nomenclature system enhances spatial resolution by defining major and minor lineages within genotypes, aiding efforts to track viral evolution. While current subtyping tools-Genome Detective, GLUE, and Nextclade-rely on computationally intensive sequence alignment and phylogenetic inference, machine learning presents a promising alternative for achieving accurate and rapid classification.</p><p><strong>Results: </strong>We present Craft (<b>C</b>haos <b>Ra</b>ndom <b>F</b>ores<b>t</b>), a machine learning framework for dengue subtyping. We demonstrate that Craft is capable of faster classification speeds while matching or surpassing the accuracy of existing tools. Craft achieves 99.5% accuracy on a hold-out test set formed from a consensus of predictions from existing tools and processes over 140 000 sequences per minute. 
Notably, Craft maintains remarkably high accuracy even when classifying sequence segments as short as 700 nucleotides.</p><p><strong>Availability and implementation: </strong>Source code is available at: https://github.com/INFORM-Africa/AI-viral-lineage-classification.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf224"},"PeriodicalIF":2.8,"publicationDate":"2025-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12527244/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145309982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}