Pub Date: 2025-09-04 | eCollection Date: 2025-01-01 | DOI: 10.3389/fbinf.2025.1576317
Grigorios Koulouras, Yingrong Xu
Proteolytic digestion is an essential process in mass spectrometry-based proteomics for converting proteins into peptides, and is hence crucial for protein identification and quantification. In a typical proteomics experiment, digestion reagents are selected without prior evaluation of their optimality for detecting the proteins or peptides of interest, partly due to the lack of comprehensive and user-friendly predictive tools. In this work, we introduce Protein Cleaver, a web-based application that systematically assesses regions of proteins that are likely or unlikely to be identified, along with extensive sequence and structure annotation and visualization features. We showcase practical examples of Protein Cleaver's usability in drug discovery and highlight proteins that are typically difficult to detect using the most common proteolytic enzymes. We evaluate trypsin and chymotrypsin for identifying G-protein-coupled receptors and find that chymotrypsin produces significantly more identifiable peptides than trypsin. We perform a bulk digestion analysis and assess 36 proteolytic enzymes for their ability to detect most cysteine-containing peptides in the human proteome. We anticipate that Protein Cleaver will be a valuable auxiliary tool for proteomics scientists.
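The abstract does not spell out Protein Cleaver's cleavage engine, but the rule it models for trypsin is well established: cleave C-terminal to K or R, except before proline. A minimal, illustrative sketch (the function name and interface are invented for this example, not taken from the tool):

```python
import re

def digest(sequence, missed_cleavages=0):
    """Naive in silico tryptic digestion: cleave after K or R unless the
    next residue is P (the classic 'Keil rule'). Illustrative only; not
    Protein Cleaver's actual implementation."""
    # Cleavage points sit immediately after each K/R not followed by P.
    sites = [m.end() for m in re.finditer(r"[KR](?!P)", sequence)]
    bounds = [0] + sites
    if bounds[-1] != len(sequence):
        bounds.append(len(sequence))
    peptides = []
    for i in range(len(bounds) - 1):
        # Also emit peptides spanning up to `missed_cleavages` skipped sites.
        for j in range(i + 1, min(i + 2 + missed_cleavages, len(bounds))):
            peptides.append(sequence[bounds[i]:bounds[j]])
    return peptides
```

For "MAGKRPLEK" this yields "MAGK" and "RPLEK": the bond after the first K is cleaved, while the R-P bond is protected.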
Title: Protein cleaver: an interactive web interface for in silico prediction and systematic annotation of protein digestion-derived peptides. Frontiers in Bioinformatics 5:1576317. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12445168/pdf/
Pub Date: 2025-09-04 | eCollection Date: 2025-01-01 | DOI: 10.3389/fbinf.2025.1528515
Tim Breitenbach, Thomas Dandekar
How can we be sure that there is sufficient data for our model, such that the predictions remain reliable on unseen data and the conclusions drawn from the fitted model would not vary significantly when using a different sample of the same size? We answer these and related questions through a systematic approach that examines the data size and the corresponding gains in accuracy. Assuming the sample data are drawn from a data pool with no data drift, the law of large numbers ensures that a model converges to its ground truth accuracy. Our approach provides a heuristic method for investigating the speed of convergence with respect to the size of the data sample. This relationship is estimated using sampling methods, which introduces variation in the convergence-speed results across different runs. To stabilize the results (so that conclusions do not depend on the particular run) and to extract the most reliable information about convergence speed encoded in the available data, the presented method automatically determines a sufficient number of repetitions to reduce sampling deviations below a predefined threshold, thereby ensuring the reliability of conclusions about the required amount of data.
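The repeat-until-stable idea can be sketched in a few lines: for a candidate sample size, keep redrawing training sets and refitting until the standard error of the mean accuracy falls below a threshold. Everything here (the nearest-centroid model, names, tolerances) is an illustrative assumption, not the authors' exact procedure:

```python
import numpy as np

def centroid_accuracy(X, y, Xte, yte):
    # Fit a nearest-centroid classifier and score it on held-out data.
    classes = np.unique(y)
    cents = np.array([X[y == c].mean(axis=0) for c in classes])
    d = ((Xte[:, None, :] - cents[None, :, :]) ** 2).sum(axis=-1)
    return (classes[d.argmin(axis=1)] == yte).mean()

def accuracy_at_size(pool_X, pool_y, test_X, test_y, n, tol=0.005,
                     min_reps=5, max_reps=500, rng=None):
    """Redraw size-n training samples until the standard error of the mean
    accuracy drops below `tol`; return (mean accuracy, repetitions used)."""
    rng = np.random.default_rng(rng)
    accs = []
    while len(accs) < max_reps:
        idx = rng.choice(len(pool_X), size=n, replace=False)
        if len(np.unique(pool_y[idx])) < len(np.unique(pool_y)):
            continue  # redraw: a class was missed entirely
        accs.append(centroid_accuracy(pool_X[idx], pool_y[idx], test_X, test_y))
        if len(accs) >= min_reps:
            sem = np.std(accs, ddof=1) / np.sqrt(len(accs))
            if sem < tol:
                break  # results are stable across runs: stop repeating
    return float(np.mean(accs)), len(accs)
```

Sweeping `n` and plotting the stabilized means gives the accuracy-versus-sample-size curve whose convergence speed the paper analyzes.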
Title: Adaptive sampling methods facilitate the determination of reliable dataset sizes for evidence-based modeling. Frontiers in Bioinformatics 5:1528515. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12444090/pdf/
Pub Date: 2025-09-04 | eCollection Date: 2025-01-01 | DOI: 10.3389/fbinf.2025.1577324
Anas Al-Okaily, Abdelghani Tbakhi
Suffix trees are fundamental data structures in stringology and have wide applications across various domains. In this work, we propose two linear-time algorithms for indexing strings under each internal node in a suffix tree while preserving the ability to track similarities and redundancies across different internal nodes. This is achieved through a novel tree structure derived from the suffix tree, along with new indexing concepts. The resulting indexes offer practical solutions in several areas, including DNA sequence analysis and approximate pattern matching.
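The paper's node-level indexes are not reproduced in the abstract. As a simpler stand-in for suffix-based indexing, here is a hedged sketch using a suffix array, whose sorted order of suffixes matches the left-to-right order of leaves in a suffix tree:

```python
from bisect import bisect_left, bisect_right

def build_suffix_array(s):
    # Naive construction: sort suffix start positions lexicographically.
    # O(n^2 log n); fine for a sketch, not for genome-scale input.
    return sorted(range(len(s)), key=lambda i: s[i:])

def find_occurrences(s, sa, pattern):
    """All start positions of `pattern` in `s`, via binary search over the
    sorted suffixes (every occurrence is a prefix of some suffix)."""
    suffixes = [s[i:] for i in sa]  # materialized only for clarity
    lo = bisect_left(suffixes, pattern)
    hi = bisect_right(suffixes, pattern + "\uffff")  # sentinel above any char
    return sorted(sa[lo:hi])
```

For the DNA text "GATTACA", searching "A" returns positions 1, 4, and 6; the authors' contribution is indexing such strings under every internal node while tracking cross-node redundancy, which this sketch does not attempt.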
Title: A novel linear indexing method for strings under all internal nodes in a suffix tree. Frontiers in Bioinformatics 5:1577324. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12443692/pdf/
Pub Date: 2025-09-02 | eCollection Date: 2025-01-01 | DOI: 10.3389/fbinf.2025.1685992
Derek L Thompson, Hsiang-Yun Wu, Christopher W Bartlett, William C Ray
Title: Editorial: Networks and graphs in biological data: current methods, opportunities and challenges. Frontiers in Bioinformatics 5:1685992. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12437696/pdf/
Pub Date: 2025-09-02 | DOI: 10.3389/fbinf.2025.1620025
Sonar Soni Panigoro, Rafika Indah Paramita, Fadilah Fadilah, Septelia Inawati Wanandi, Aisyah Fitriannisa Prawiningrum, Linda Erlina, Wahyu Dian Utari, Ajeng Megawati Fajrin
Title: Germline mutation profiling of breast cancer patients using a non-BRCA sequencing panel. Frontiers in Bioinformatics 5:1620025. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12436446/pdf/
Pub Date: 2025-09-01 | eCollection Date: 2025-01-01 | DOI: 10.3389/fbinf.2025.1630078
Rafael Pereira Lemos, Diego Mariano, Sabrina De Azevedo Silveira, Raquel C de Melo-Minardi
Protein interatomic contacts, defined by spatial proximity and physicochemical complementarity at atomic resolution, are fundamental to characterizing molecular interactions and bonding. Methods for calculating contacts are generally categorized as cutoff-dependent, which rely on Euclidean distances, or cutoff-independent, which utilize Delaunay and Voronoi tessellations. While cutoff-dependent methods are recognized for their simplicity, completeness, and reliability, traditional implementations remain computationally expensive, posing significant scalability challenges in the current Big Data era of bioinformatics. Here, we introduce COCαDA (COntact search pruning by Cα Distance Analysis), a Python-based command-line tool for improving search pruning in large-scale interatomic protein contact analysis using alpha-carbon (Cα) distance matrices. COCαDA detects intra- and inter-chain contacts and classifies them into seven different types: hydrogen and disulfide bonds; hydrophobic effects; attractive, repulsive, and salt-bridge interactions; and aromatic stacking. To evaluate our tool, we compared it with three traditional approaches from the literature: all-against-all atom distance calculation ("brute force"), a static Cα distance cutoff (SC), and Biopython's NeighborSearch class (NS). COCαDA demonstrated superior performance compared to the other methods, achieving on average 6x faster computation times than advanced data structures such as the k-d trees used by NS, in addition to being simpler to implement and fully customizable. The presented tool facilitates exploratory and large-scale analyses of interatomic contacts in proteins in a simple and efficient manner, and also enables the integration of results with other tools and pipelines. The COCαDA tool is freely available at https://github.com/LBS-UFMG/COCaDA.
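The core pruning idea (only residue pairs whose Cα atoms are already close can share close atoms) can be sketched as follows. The cutoffs, data layout, and function name are illustrative assumptions; COCαDA itself also classifies contacts by type rather than merely detecting them:

```python
import numpy as np

def contacts_with_ca_pruning(residues, ca_cutoff=20.0, atom_cutoff=4.0):
    """Return residue index pairs with any atom-atom distance below
    `atom_cutoff`, testing atoms only for pairs whose Ca-Ca distance is
    below `ca_cutoff`. `residues` is a list of {'ca': xyz, 'atoms': [xyz,...]}."""
    cas = np.array([r["ca"] for r in residues], dtype=float)
    hits = []
    for i in range(len(residues)):
        for j in range(i + 1, len(residues)):
            if np.linalg.norm(cas[i] - cas[j]) >= ca_cutoff:
                continue  # pruned: Ca atoms too far apart for any atomic contact
            ai = np.asarray(residues[i]["atoms"], dtype=float)
            aj = np.asarray(residues[j]["atoms"], dtype=float)
            d = np.linalg.norm(ai[:, None, :] - aj[None, :, :], axis=-1)
            if (d < atom_cutoff).any():
                hits.append((i, j))
    return hits
```

The prune replaces an all-against-all atom comparison with one cheap Cα check per residue pair, which is the flavor of saving the reported speedups exploit.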
Title: COCαDA - a fast and scalable algorithm for interatomic contact detection in proteins using Cα distance matrices. Frontiers in Bioinformatics 5:1630078. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12433948/pdf/
Pub Date: 2025-08-29 | eCollection Date: 2025-01-01 | DOI: 10.3389/fbinf.2025.1610015
Patricia Agudelo-Romero, Talya Conradie, Jose Antonio Caparros-Martin, David Jimmy Martino, Anthony Kicic, Stephen Michael Stick, Christopher Hakkaart, Abhinav Sharma
The increasing adoption of high-throughput "omics" technologies has heightened the demand for standardized, scalable, and reproducible bioinformatics workflows. Nextflow and nf-core provide a robust framework for researchers, particularly early- and mid-career researchers (EMCRs), to navigate complex data analysis. At The Kids Research Institute Australia, we implemented a structured approach to bioinformatics capacity building using these tools. This perspective presents nine practical rules derived from lessons learnt, which facilitated the successful adoption of Nextflow and nf-core, addressing implementation challenges, knowledge gaps, resource allocation, and community support. Our experience serves as a guide for institutions aiming to establish sustainable bioinformatics capabilities and empower EMCRs.
Title: Advancing bioinformatics capacity through Nextflow and nf-core: lessons from an early- to mid-career researchers-focused program at The Kids Research Institute Australia. Frontiers in Bioinformatics 5:1610015. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12425987/pdf/
Pub Date: 2025-08-26 | eCollection Date: 2025-01-01 | DOI: 10.3389/fbinf.2025.1613985
Jingmin Zhang, Tianwei Meng, Weiqi Gao, Xinghua Li, Juan Xu
Background: Non-alcoholic fatty liver disease (NAFLD) is a prevalent condition with limited effective treatments, necessitating novel therapeutic strategies. Bioinformatics offers a promising approach to identify new targets by analyzing gene expression and drug interactions.
Objective: This study aims to identify novel therapeutic targets for NAFLD through bioinformatics, focusing on drug repositioning and traditional Chinese medicine (TCM) components.
Methods: Three NAFLD-related gene expression datasets (GSE260666, GSE126848, GSE135251) were analyzed to identify differentially expressed genes. Protein-protein interaction networks were constructed using STRING and visualized with Cytoscape. Pathway enrichment analysis was performed, and drug-gene interactions were explored using the DGIdb database. TCM components were screened via the HERB database, with molecular docking conducted to assess binding affinities.
Results: Key hub genes (CXCL2, CDKN1A, TNFRSF12A, HGFAC) were identified, with significant enrichment in cell proliferation and PI3K-Akt signaling pathways. Cyclosporine emerged as a potential repurposed drug, while TCM components (curcumin, resveratrol, berberine) showed strong binding affinities to NAFLD targets.
Conclusion: Cyclosporine and TCM compounds are promising candidates for NAFLD treatment, warranting further experimental validation to confirm their therapeutic potential.
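The pathway enrichment step in the Methods is commonly a one-sided hypergeometric (Fisher) test. As a hedged illustration of that calculation (not the authors' specific pipeline or tools):

```python
from math import comb

def enrichment_pvalue(hits, deg_total, pathway_size, background):
    """P(X >= hits) for hypergeometric X: the probability that a pathway of
    `pathway_size` genes contains at least `hits` of the `deg_total`
    differentially expressed genes, out of `background` genes overall."""
    numer = sum(
        comb(pathway_size, k) * comb(background - pathway_size, deg_total - k)
        for k in range(hits, min(deg_total, pathway_size) + 1)
    )
    return numer / comb(background, deg_total)
```

With 5 of 5 DEGs falling inside a 5-gene pathway drawn from a 10-gene background, the p-value is 1/C(10,5) = 1/252; real analyses additionally correct such p-values for multiple testing across pathways.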
Title: Identifying novel therapeutic targets for non-alcoholic fatty liver disease using bioinformatics approaches: from drug repositioning to traditional Chinese medicine. Frontiers in Bioinformatics 5:1613985. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12417881/pdf/
Pub Date: 2025-08-20 | eCollection Date: 2025-01-01 | DOI: 10.3389/fbinf.2025.1633623
Kleber Padovani, Rafael Cabral Borges, Roberto Xavier, André Carlos Carvalho, Anna Reali, Annie Chateau, Ronnie Alves
Genome assembly remains an unsolved problem, and de novo strategies (i.e., those run without a reference) are relevant but computationally complex tasks in genomics. Although de novo assemblers have been previously successfully applied in genomic projects, there is still no "best assembler", and the choice and setup of assemblers still rely on bioinformatics experts. Thus, as with other computationally complex problems, machine learning has emerged as an alternative (or complementary) way to develop accurate, fast and autonomous assemblers. Reinforcement learning has proven promising for solving complex activities without supervision, such as games, and there is a pressing need to understand the limits of this approach to "real-life" problems, such as the DNA fragment assembly problem. In this study, we analyze the boundaries of applying machine learning via reinforcement learning (RL) for genome assembly. We expand upon the previous approach found in the literature to solve this problem by carefully exploring the learning aspects of the proposed intelligent agent, which uses the Q-learning algorithm. We improved the reward system and optimized the exploration of the state space based on pruning and in collaboration with evolutionary computing (>300% improvement). We tested the new approaches on 23 environments. Our results suggest the unsatisfactory performance of the approaches, both in terms of assembly quality and execution time, providing strong evidence for the poor scalability of the studied reinforcement learning approaches to the genome assembly problem. Finally, we discuss the existing proposal, complemented by attempts at improvement that also proved insufficient. In doing so, we contribute to the scientific community by offering a clear mapping of the limitations and challenges that should be taken into account in future attempts to apply reinforcement learning to genome assembly.
Finally, we discuss the existing proposal, complemented by attempts at improvement that also proved insufficient. In doing so, we contribute to the scientific community by offering a clear mapping of the limitations and challenges that should be taken into account in future attempts to apply reinforcement learning to genome assembly.
Title: Using reinforcement learning in genome assembly: in-depth analysis of a Q-learning assembler. Frontiers in Bioinformatics 5:1633623. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12405310/pdf/
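The tabular Q-learning loop at the heart of such an agent can be shown on a toy episodic task. This chain environment is a stand-in illustration, not the paper's assembly-state formulation:

```python
import random

def q_learning_chain(n_states=6, episodes=400, alpha=0.5, gamma=0.9,
                     eps=0.2, seed=0):
    """Tabular Q-learning on a chain: states 0..n_states-1, actions 0=left,
    1=right, reward 1.0 only on entering the rightmost (terminal) state."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # epsilon-greedy action selection
            a = rng.randrange(2) if rng.random() < eps else int(Q[s][1] > Q[s][0])
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == n_states - 1 else 0.0
            # one-step Q-learning update toward the bootstrapped target
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q

Q = q_learning_chain()
```

After training, the greedy policy moves right from every non-terminal state, with values discounted by distance to the reward; the paper's point is that this kind of agent, even with improved rewards and pruned state spaces, does not scale to assembly-sized state spaces.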
Pub Date: 2025-08-20 | eCollection Date: 2025-01-01 | DOI: 10.3389/fbinf.2025.1567219
Sumaiya Binte Shahid, Maleeha Kaikaus, Md Hasanul Kabir, Mohammad Abu Yousuf, A K M Azad, A S Al-Moisheer, Naif Alotaibi, Salem A Alyami, Touhid Bhuiyan, Mohammad Ali Moni
Introduction: Alzheimer's disease (AD) is one of the most common neurodegenerative disabilities that often leads to memory loss, confusion, difficulty in language and trouble with motor coordination. Although several machine learning (ML) and deep learning (DL) algorithms have been utilized to identify Alzheimer's disease (AD) from MRI scans, precise classification of AD categories remains challenging as neighbouring categories share common features.
Methods: This study proposes transfer learning-based methods for extracting features from MRI scans for multi-class classification of different AD categories. Four transfer learning-based feature extractors, namely, ResNet152V2, VGG16, InceptionV3, and MobileNet have been employed on two publicly available datasets (i.e., ADNI and OASIS) and a Merged dataset combining ADNI and OASIS, each having four categories: Moderate Demented (MoD), Mild Demented (MD), Very Mild Demented (VMD), and Non Demented (ND).
Results: Results suggest the Modified ResNet152V2 as the optimal feature extractor among the four transfer learning methods. Next, by utilizing the modified ResNet152V2 as a feature extractor, a Convolutional Neural Network based model, namely, the 'IncepRes', is proposed by fusing the Inception and ResNet architectures for multiclass classification of AD categories. The results indicate that our proposed model achieved a standard accuracy of 96.96%, 98.35% and 97.13% for ADNI, OASIS, and Merged datasets, respectively, outperforming other competing DL structures.
Discussion: We hope that our proposed framework may automate the precise classifications of various AD categories, and thereby can offer the prompt management and treatment of cognitive and functional impairments associated with AD.
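The frozen-extractor-plus-trained-head recipe the Methods describe can be illustrated without any deep-learning stack. Here a fixed random projection stands in for the pretrained backbone (an assumption for this sketch, in place of something like ResNet152V2), and only a logistic head is trained:

```python
import numpy as np

def train_linear_head(feats, labels, lr=0.1, epochs=300):
    """Train a logistic-regression head on frozen features: the transfer-
    learning recipe keeps the extractor fixed and fits only this head."""
    w, b = np.zeros(feats.shape[1]), 0.0
    for _ in range(epochs):
        z = np.clip(feats @ w + b, -30.0, 30.0)   # clip to avoid exp overflow
        p = 1.0 / (1.0 + np.exp(-z))
        grad = p - labels                          # dLoss/dz for log-loss
        w -= lr * feats.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b

rng = np.random.default_rng(0)
proj = rng.normal(size=(64, 8))                    # frozen "backbone" weights
X = np.vstack([rng.normal(-2, 1, (50, 64)),        # toy class-0 inputs
               rng.normal(2, 1, (50, 64))])        # toy class-1 inputs
y = np.array([0] * 50 + [1] * 50)
feats = np.maximum(X @ proj, 0.0)                  # ReLU features, extractor fixed
w, b = train_linear_head(feats, y)
acc = (((feats @ w + b) > 0).astype(int) == y).mean()
```

The paper's pipeline differs in scale and detail (a modified ResNet152V2 extractor feeding the proposed IncepRes classifier over four dementia classes), but the division of labor, fixed feature extractor plus trained classifier, is the same.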
Title: Novel deep learning for multi-class classification of Alzheimer's in disability using MRI datasets. Frontiers in Bioinformatics 5:1567219. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12405159/pdf/