Background: Pacific Biosciences (PacBio) circular consensus sequencing (CCS), also known as high fidelity (HiFi) technology, has revolutionized modern genomics by producing long (10+ kb) and highly accurate reads. This is achieved by sequencing a circularized DNA molecule multiple times and combining the resulting passes into a consensus sequence. Currently, the accuracy and quality value estimation provided by HiFi technology are more than sufficient for applications such as genome assembly and germline variant calling. However, the estimated quality scores are not accurate enough for somatic variant calling on single reads.
Results: To address the challenge of inaccurate quality scores for somatic variant calling, we introduce TopoQual, a novel tool designed to enhance the accuracy of base quality predictions. TopoQual leverages techniques including partial order alignments (POA), topologically parallel bases, and deep learning algorithms to polish consensus sequences. Our results demonstrate that TopoQual corrects approximately 31.9% of errors in PacBio consensus sequences. Additionally, it validates base qualities up to q59, which corresponds to one error in 0.9 million bases. These improvements will significantly enhance the reliability of somatic variant calling using HiFi data.
Conclusion: TopoQual represents a significant advancement in genomics by improving the accuracy of base quality predictions for PacBio HiFi sequencing data. By correcting a substantial proportion of errors and achieving high base quality validation, TopoQual enables confident and accurate somatic variant calling. This tool not only addresses a critical limitation of current HiFi technology but also opens new possibilities for precise genomic analysis in various research and clinical applications.
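The q59 figure in the Results is a Phred-scaled quality value. As a minimal sketch of what that scale means, assuming the standard definition Q = -10 log10(p) (the function names below are illustrative and not part of TopoQual), the conversion between quality values and error rates can be written as:

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability p implied by a Phred-scaled quality value Q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred-scaled quality value Q implied by an error probability p."""
    return -10 * math.log10(p)


def bases_per_error(q: float) -> float:
    """Expected number of bases per single error at quality Q."""
    return 1 / phred_to_error_prob(q)


if __name__ == "__main__":
    for q in (20, 30, 40, 59):
        p = phred_to_error_prob(q)
        print(f"Q{q}: p = {p:.3g}, about 1 error in {bases_per_error(q):,.0f} bases")
```

Under this nominal definition, Q59 works out to roughly one error in 8 × 10^5 bases, the same order of magnitude as the figure quoted in the abstract; an empirically validated rate need not match the nominal value exactly.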
TopoQual polishes circular consensus sequencing data and accurately predicts quality scores. Minindu Weerakoon, Sangjin Lee, Emily Mitchell, Haynes Heaton. BMC Bioinformatics 26(1):17. Pub Date: 2025-01-16; DOI: 10.1186/s12859-024-06020-0; PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11737182/pdf/
Pub Date: 2025-01-16; DOI: 10.1186/s12859-024-06029-5
Xiaomin Liu, Yi-Ju Li, Qiao Fan
Background: With advances in next-generation sequencing, various gene-based rare variant association tests have been developed, particularly for binary and continuous phenotypes. In contrast, fewer methods are available for traits that do not follow binomial or normal distributions. To address this, we previously proposed a set of burden- and kernel-based rare variant tests for count data following zero-inflated Poisson (ZIP) distributions, referred to as the ZIP-b and ZIP-k tests. We sought to extend these methods to accommodate the negative binomial distribution and to implement the tests in a new R package.
Results: We introduce ZIM4rv, an R package designed to analyze the association of rare variants with zero-inflated count outcomes. The package offers two sets of tests developed by our team: our previously proposed ZIP-b and ZIP-k tests, and the newly derived Negative Binomial Burden and Kernel Tests (ZINB-b and ZINB-k). Additionally, we include an ad hoc two-stage analysis that first tests zero versus non-zero as a binary outcome and then tests the non-zero counts as a continuous outcome. To showcase the utility of the package, we applied it to neuritic plaque count data from the ROSMAP cohort.
Conclusion: The R package ZIM4rv presents an integrated workflow for conducting association tests on a set of rare variants with zero-inflated count data.
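For readers unfamiliar with zero-inflated counts, the following is a minimal sketch of the zero-inflated Poisson (ZIP) distribution that the ZIP-b and ZIP-k tests are built on. It is not the ZIM4rv API and it contains no association test; the function names and the toy counts are assumptions made purely for illustration. A zero-inflated negative binomial model has the same mixture form, with the Poisson component replaced by a negative binomial one.

```python
import math


def zip_pmf(y: int, lam: float, pi: float) -> float:
    """P(Y = y) under a zero-inflated Poisson model.

    With probability pi the outcome is a structural zero; otherwise it is
    drawn from a Poisson distribution with mean lam.
    """
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    if y == 0:
        return pi + (1 - pi) * poisson
    return (1 - pi) * poisson


def zip_log_likelihood(counts, lam: float, pi: float) -> float:
    """Log-likelihood of observed counts for fixed (lam, pi)."""
    return sum(math.log(zip_pmf(y, lam, pi)) for y in counts)


# Made-up counts with an excess of zeros (not ROSMAP data).
counts = [0, 0, 0, 0, 2, 5, 0, 1, 0, 3]

plain_poisson = zip_log_likelihood(counts, lam=2.5, pi=0.0)
zero_inflated = zip_log_likelihood(counts, lam=2.5, pi=0.5)
```

On data like these, the zero-inflated model attains a higher log-likelihood than the plain Poisson model with the same mean, which is the motivation for modelling the excess zeros explicitly.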
Zim4rv: an R package to modeling zero-inflated count phenotype on regional-based rare variants. Xiaomin Liu, Yi-Ju Li, Qiao Fan. BMC Bioinformatics 26(1):18. Pub Date: 2025-01-16; DOI: 10.1186/s12859-024-06029-5; PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11740424/pdf/
Pub Date: 2025-01-15; DOI: 10.1186/s12859-024-06028-6
Leixia Tian, Qi Wang, Zhiheng Zhou, Xiya Liu, Ming Zhang, Guiying Yan
In recent years, combination drug screening has played an important role in drug discovery, and synergistic drug combinations are crucial in the treatment of many diseases. However, the toxic side effects of a combination are likely to increase with the number of drugs it contains, so accurate prediction of these side effects is equally important. In this paper, we build a Metapath-based Aggregated Embedding Model on Single Drug-Side Effect Heterogeneous Information Network (MAEM-SSHIN), which extracts features from a heterogeneous information network of single-drug side effects, and a Graph Convolutional Network on Combinatorial drugs and Side effect Heterogeneous Information Network (GCN-CSHIN), which transforms the complex task of predicting multiple side effects between drug pairs into the more manageable prediction of relationships between combinatorial drugs and individual side effects. Together, MAEM-SSHIN and GCN-CSHIN provide a unified framework for predicting potential side effects in combinatorial drug therapies. This integration enhances prediction accuracy, efficiency, and scalability. Our experimental results demonstrate that the combined framework outperforms existing methods in predicting side effects and marks a significant advance in pharmaceutical research.
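The reformulation described above, from "many side effects per drug pair" to "one relationship per (combination, side effect)", can be illustrated with a small sketch. This shows only the data reshaping, not MAEM-SSHIN or GCN-CSHIN themselves; the drug names, side effects, and the function `to_combination_edges` are hypothetical.

```python
# Made-up toy data: each drug pair maps to its set of observed side effects.
pair_side_effects = {
    ("drugA", "drugB"): {"nausea", "headache"},
    ("drugA", "drugC"): {"nausea"},
}


def to_combination_edges(pairs):
    """Flatten {drug pair -> side effects} into binary link examples.

    Each drug pair becomes a single "combination" node, and every
    (combination, side effect) candidate gets a 0/1 label, so a
    multi-label problem over pairs becomes link prediction between
    combination nodes and side-effect nodes.
    """
    all_effects = sorted(set().union(*pairs.values()))
    examples = []
    for pair, effects in pairs.items():
        combo = "+".join(sorted(pair))  # order-independent node name
        for effect in all_effects:
            examples.append((combo, effect, int(effect in effects)))
    return examples


examples = to_combination_edges(pair_side_effects)
```

Each resulting triple is one candidate edge with a binary label, which is the kind of input a graph-based link predictor can be trained on.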
Predicting drug combination side effects based on a metapath-based heterogeneous graph neural network. Leixia Tian, Qi Wang, Zhiheng Zhou, Xiya Liu, Ming Zhang, Guiying Yan. BMC Bioinformatics 26(1):16. Pub Date: 2025-01-15; DOI: 10.1186/s12859-024-06028-6; PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11734363/pdf/
Pub Date: 2025-01-14; DOI: 10.1186/s12859-025-06038-y
Lujun Luo, Tarikul I Milon, Elijah K Tandoh, Walter J Galdamez, Andrei Y Chistoserdov, Jianping Yu, Jan Kern, Yingchun Wang, Wu Xu
Background: All chemical forms of energy and oxygen on Earth are generated via photosynthesis where light energy is converted into redox energy by two photosystems (PS I and PS II). There is an increasing number of PS I 3D structures deposited in the Protein Data Bank (PDB). The Triangular Spatial Relationship (TSR)-based algorithm converts 3D structures into integers (TSR keys). A comprehensive study was conducted, by taking advantage of the PS I 3D structures and the TSR-based algorithm, to answer three questions: (i) Are electron cofactors including P700, A-1 and A0, which are chemically identical chlorophylls, structurally different? (ii) There are two electron transfer chains (A and B branches) in PS I. Are the cofactors on both branches structurally different? (iii) Are the amino acids in cofactor binding sites structurally different from those not in cofactor binding sites?
Results: The key contributions and important findings include: (i) a novel TSR-based method for representing 3D structures of pigments as well as for quantifying pigment structures was developed; (ii) the results revealed that the redox cofactor P700 is structurally conserved and different from the other redox cofactors, and similar situations were also observed for both A-1 and A0; (iii) the results demonstrated structural differences between the A and B branches for the redox cofactors P700, A-1, A0 and A1 as well as for their cofactor binding sites; (iv) the tryptophan residues close to A0 and A1 are structurally conserved; (v) the TSR-based method outperforms the Root Mean Square Deviation (RMSD) and the Ultrafast Shape Recognition (USR) methods.
Conclusions: The structural analyses of redox cofactors and their binding sites provide a foundation for understanding the unique chemical and physical properties of each redox cofactor in PS I, which are essential for modulating the rate and direction of energy and electron transfers.
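The Background states that the TSR-based algorithm converts 3D structures into integers. The abstract does not give the key formula, so the sketch below is explicitly not the published TSR key definition. It only illustrates the general idea, under assumed bin counts and an assumed packing scheme, of turning one labelled triangle into a single integer that is unchanged by translation and by reordering of the vertices.

```python
import math

# All constants below are hypothetical choices for illustration only.
NUM_LABELS = 20    # number of distinct vertex labels, encoded 0..19
ANGLE_BINS = 30    # bins for the largest interior angle (0 to 180 degrees)
DIST_BINS = 40     # bins for the longest side
MAX_DIST = 20.0    # longest side is capped at this length before binning


def triangle_key(points, labels):
    """Pack a labelled triangle into one integer (illustrative encoding).

    points: three (x, y, z) tuples, assumed pairwise distinct
    labels: three integer labels in [0, NUM_LABELS)
    """
    pairs = ((0, 1), (0, 2), (1, 2))
    a, b, c = sorted(math.dist(points[i], points[j]) for i, j in pairs)
    # Largest interior angle, opposite the longest side c (law of cosines).
    cos_theta = max(-1.0, min(1.0, (a * a + b * b - c * c) / (2 * a * b)))
    theta = math.degrees(math.acos(cos_theta))
    theta_bin = min(int(theta / 180.0 * ANGLE_BINS), ANGLE_BINS - 1)
    dist_bin = min(int(c / MAX_DIST * DIST_BINS), DIST_BINS - 1)
    l1, l2, l3 = sorted(labels)  # sorting makes the key order-independent
    label_code = (l1 * NUM_LABELS + l2) * NUM_LABELS + l3
    return (label_code * ANGLE_BINS + theta_bin) * DIST_BINS + dist_bin


triangle = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (1.0, 3.0, 0.0)]
key = triangle_key(triangle, [3, 7, 1])
```

Because the key depends only on side lengths, one angle, and sorted labels, two structures can be compared by counting shared keys rather than by superimposing coordinates, which is the broad contrast with RMSD drawn in the Results.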
Development of a TSR-based method for understanding structural relationships of cofactors and local environments in photosystem I. Lujun Luo, Tarikul I Milon, Elijah K Tandoh, Walter J Galdamez, Andrei Y Chistoserdov, Jianping Yu, Jan Kern, Yingchun Wang, Wu Xu. BMC Bioinformatics 26(1):15. Pub Date: 2025-01-14; DOI: 10.1186/s12859-025-06038-y; PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11731568/pdf/
Drug-target interactions (DTIs) are pivotal in drug discovery and development, and their accurate identification can significantly expedite the process. Numerous DTI prediction methods have emerged, yet many fail to fully exploit the feature information of drugs and targets or to address feature redundancy. We aim to improve DTI prediction accuracy by eliminating redundant features and using node topological structure to enhance feature extraction. To achieve this, we introduce a PCA-augmented multi-layer heterogeneous graph-based network that concentrates on key features throughout the encoding-decoding phase. Our approach begins with the construction of a heterogeneous graph from various similarity metrics, which is then encoded via a graph neural network. We concatenate and integrate the resulting representation vectors to merge multi-level information. Subsequently, principal component analysis is applied to distill the most informative features, and a random forest performs the final decoding of the integrated data. Extensive experiments show that our method outperforms six baseline models in terms of accuracy. Comprehensive ablation studies, visualization of results, and in-depth case analyses further validate our framework's efficacy and interpretability, providing a novel tool for drug discovery that integrates multimodal features.
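The principal component analysis step mentioned above can be sketched without any external library. This is a minimal illustration of extracting the leading principal component by power iteration, assuming a tiny made-up feature matrix; it is not the DTI-MHAPR implementation, it omits the graph encoder and the random forest decoder, and all function names are hypothetical.

```python
import math


def center(rows):
    """Subtract the column means from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_component(cov, iters=200):
    """Leading eigenvector of a covariance matrix via power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(centered, v):
    """Score of each row along direction v."""
    return [sum(x * y for x, y in zip(r, v)) for r in centered]


# Made-up features: columns 1 and 2 are nearly redundant (col2 is about 2 * col1).
features = [[1.0, 2.0, 0.1], [2.0, 4.1, 0.0], [3.0, 5.9, 0.1], [4.0, 8.0, 0.0]]
centered = center(features)
component = top_component(covariance(centered))
scores = project(centered, component)
```

Here two redundant columns collapse onto a single direction, so one score per row carries almost all of the variance. In a full pipeline, scores like these would be the reduced features passed to the downstream classifier.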
DTI-MHAPR: optimized drug-target interaction prediction via PCA-enhanced features and heterogeneous graph attention networks. Guang Yang, Yinbo Liu, Sijian Wen, Wenxi Chen, Xiaolei Zhu, Yongmei Wang. BMC Bioinformatics 26(1):11. Pub Date: 2025-01-13; DOI: 10.1186/s12859-024-06021-z; PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11726937/pdf/
Pub Date: 2025-01-13; DOI: 10.1186/s12859-025-06036-0
Gabriele Moro, Rossano Atzeni, Ali Al-Subhi, Maria Giovanna Marche
Background: The increasing availability of sequenced genomes has enabled comparative analyses of various organisms. Numerous tools and online platforms have been developed for this purpose, facilitating the identification of unique features within selected organisms. However, choosing the most appropriate tools can be unclear during the initial stages of analysis, often requiring multiple attempts to match the specific characteristics of the data. Here, we introduce CompàreGenome, a command-line tool specifically designed for genomic diversity estimation analyses. Suitable for both prokaryotes and eukaryotes, this tool is particularly valuable in the early stages of studies when little information is available about the genetic differences or similarities among compared organisms.
Results: In all the tests conducted, CompàreGenome successfully identified specific genetic features of the selected organisms, detected the most conserved genes, pinpointed highly divergent ones, and functionally annotated these genes. This provided insights into biological processes, molecular functions, and cellular components associated with each gene. The tool also distinguished organisms at the strain level and quantified genetic distances using three distinct analytical methods.
Conclusion: CompàreGenome empowers users to explore genomic differences among organisms, translating technical outputs from various tools into actionable insights for biologists. While primarily tested on small microbial genomes, the tool has potential applications for larger genomes. CompàreGenome is implemented in Bash, R, and Python and is freely available under an LGPL-2.1 license.
CompàreGenome: a command-line tool for genomic diversity estimation in prokaryotes and eukaryotes. Gabriele Moro, Rossano Atzeni, Ali Al-Subhi, Maria Giovanna Marche. BMC Bioinformatics 26(1):14. Pub Date: 2025-01-13; DOI: 10.1186/s12859-025-06036-0; PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11731138/pdf/
Pub Date: 2025-01-13; DOI: 10.1186/s12859-024-06005-z
Timo Saratto, Kerkko Visuri, Jonatan Lehtinen, Irene Ortega-Sanz, Jacob L Steenwyk, Samuel Sihvonen
Background: Genomic surveillance is extensively used for tracking public health outbreaks and healthcare-associated pathogens. Despite advancements in bioinformatics pipelines, there are still significant challenges in terms of infrastructure, expertise, and security when it comes to continuous surveillance. The existing pipelines often require the user to set up and manage their own infrastructure and are not designed for continuous surveillance that demands integration of new and regularly generated sequencing data with previous analyses. Additionally, academic projects often do not meet the privacy requirements of healthcare providers.
Results: We present Solu, a cloud-based platform that integrates genomic data into a real-time, privacy-focused surveillance system.
Evaluation: Solu's accuracy for taxonomy assignment, antimicrobial resistance genes, and phylogenetics was comparable to established pathogen surveillance pipelines. In some cases, Solu identified antimicrobial resistance genes that were previously undetected. Together, these findings demonstrate the efficacy of our platform.
Conclusions: By enabling reliable, user-friendly, and privacy-focused genomic surveillance, Solu has the potential to bridge the gap between cutting-edge research and practical, widespread application in healthcare settings. The platform is available for free academic use at https://platform.solugenomics.com.
Solu: a cloud platform for real-time genomic pathogen surveillance. Timo Saratto, Kerkko Visuri, Jonatan Lehtinen, Irene Ortega-Sanz, Jacob L Steenwyk, Samuel Sihvonen. BMC Bioinformatics 26(1):12. Pub Date: 2025-01-13; DOI: 10.1186/s12859-024-06005-z; PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11731562/pdf/
Background: MicroRNAs (miRNAs) are pivotal in the initiation and progression of complex human diseases and have been identified as targets for small molecule (SM) drugs. However, the expensive and time-intensive characteristics of conventional experimental techniques for identifying SM-miRNA associations highlight the necessity for efficient computational methodologies in this field.
Results: In this study, we proposed a deep learning method called Multi-source Data Fusion and Graph Neural Networks for Small Molecule-MiRNA Association (MDFGNN-SMMA) to predict potential SM-miRNA associations. Firstly, MDFGNN-SMMA extracted features of Atom Pairs fingerprints and Molecular ACCess System fingerprints to derive fusion feature vectors for small molecules (SMs). The K-mer features were employed to generate the initial feature vectors for miRNAs. Secondly, cosine similarity measures were computed to construct the adjacency matrices for SMs and miRNAs, respectively. Thirdly, these feature vectors and adjacency matrices were input into a model comprising GAT and GraphSAGE, which were utilized to generate the final feature vectors for SMs and miRNAs. Finally, the averaged final feature vectors were utilized as input for a multilayer perceptron to predict the associations between SMs and miRNAs.
Conclusions: MDFGNN-SMMA was assessed using 10-fold cross-validation and outperformed four state-of-the-art models in terms of both AUC and AUPR. Moreover, results on an independent test set confirmed the model's generalization capability. Additionally, the efficacy of MDFGNN-SMMA was substantiated through three case studies. Among the top 50 predicted miRNAs associated with Cisplatin, 5-Fluorouracil, and Doxorubicin, 42, 36, and 36 miRNAs, respectively, were corroborated by existing literature and the RNAInter database.
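Two of the preprocessing steps named in the Results, K-mer features for miRNAs and cosine-similarity adjacency matrices, can be sketched as follows. This is a minimal illustration rather than the MDFGNN-SMMA code: the value of k, the similarity threshold, the toy sequences, and the function names are all assumptions, and the fingerprint features, GAT/GraphSAGE layers, and multilayer perceptron are not shown.

```python
import math
from collections import Counter
from itertools import product


def kmer_vector(seq, k=3, alphabet="ACGU"):
    """Normalized k-mer frequency vector of an RNA sequence."""
    windows = max(len(seq) - k + 1, 1)
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] / windows for p in product(alphabet, repeat=k)]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def adjacency(vectors, threshold=0.5):
    """Binary adjacency matrix: connect items whose similarity >= threshold."""
    n = len(vectors)
    return [[1 if i != j and cosine(vectors[i], vectors[j]) >= threshold else 0
             for j in range(n)]
            for i in range(n)]


# Made-up toy sequences, not real miRNAs.
sequences = ["ACGUACGUAC", "ACGUACGUAA", "GGGGGGGGGG"]
vectors = [kmer_vector(s, k=2) for s in sequences]
adj = adjacency(vectors)
```

The resulting matrix links the two similar sequences and leaves the dissimilar one isolated; a graph neural network would then propagate features along these edges.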
{"title":"MDFGNN-SMMA: prediction of potential small molecule-miRNA associations based on multi-source data fusion and graph neural networks.","authors":"Jianwei Li, Xukun Zhang, Bing Li, Ziyu Li, Zhenzhen Chen","doi":"10.1186/s12859-025-06040-4","DOIUrl":"10.1186/s12859-025-06040-4","url":null,"PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"13"},"PeriodicalIF":2.9,"publicationDate":"2025-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11730471/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142977518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-11 DOI: 10.1186/s12859-024-06007-x
Olga Fourkioti, Matt De Vries, Reed Naidoo, Chris Bakal
Background: Deep learning (DL) has set new standards in cancer diagnosis, significantly enhancing the accuracy of automated classification of whole slide images (WSIs) derived from biopsied tissue samples. To enable DL models to process these large images, WSIs are typically divided into thousands of smaller tiles, each containing 10-50 cells. Multiple Instance Learning (MIL) is a commonly used approach, where WSIs are treated as bags comprising numerous tiles (instances) and only bag-level labels are provided during training. The model learns from these broad labels to extract more detailed, instance-level insights. However, biopsied sections often exhibit high intra- and inter-phenotypic heterogeneity, presenting a significant challenge for classification. To address this, many graph-based methods have been proposed, where each WSI is represented as a graph with tiles as nodes and edges defined by specific spatial relationships.
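The bag/instance relationship in MIL can be illustrated with a small pooling function. This is a generic sketch rather than the method used in the study: it assumes each tile already has a score in [0, 1] from some tile-level model, and it uses max pooling (the classical MIL assumption that a bag is positive if at least one instance is positive), with mean pooling as an alternative. The function name and the `pooling` parameter are hypothetical.

```python
def bag_score(instance_scores, pooling="max"):
    """Aggregate per-tile scores into a single slide-level (bag-level) score.

    "max" follows the classical MIL assumption that one positive tile makes
    the whole bag positive; "mean" averages evidence over all tiles.
    """
    if not instance_scores:
        raise ValueError("a bag must contain at least one instance")
    if pooling == "max":
        return max(instance_scores)
    if pooling == "mean":
        return sum(instance_scores) / len(instance_scores)
    raise ValueError(f"unknown pooling: {pooling}")
```

During training, only the bag-level score is compared with the slide label, so tile-level predictions are learned indirectly.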
Results: In this study, we investigate how different graph configurations, varying in connectivity and neighborhood structure, affect the performance of MIL models. We developed a novel pipeline, K-MIL, to evaluate the impact of contextual information on cell classification performance. By incorporating neighboring tiles into the analysis, we examined whether contextual information improves or impairs the network's ability to identify patterns and features critical for accurate classification. Our experiments were conducted on two datasets: COLON cancer and UCSB.
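One simple way to vary the amount of spatial context is to connect each tile to its k nearest neighbours and mix neighbour features into the tile's own representation. The sketch below assumes this k-nearest-neighbour construction over tile centre coordinates and plain feature averaging; the abstract does not describe how K-MIL actually builds its graphs or aggregates neighbours, so both choices and all names here are illustrative.

```python
from math import dist


def knn_edges(coords, k):
    """For each tile, list the indices of its k nearest other tiles (Euclidean).

    Ties in distance are broken by tile index so the result is deterministic.
    """
    neighbours = []
    for i, c in enumerate(coords):
        others = sorted(
            (j for j in range(len(coords)) if j != i),
            key=lambda j: (dist(c, coords[j]), j),
        )
        neighbours.append(others[:k])
    return neighbours


def add_context(features, neighbours):
    """Replace each tile's feature vector by the mean over itself and its neighbours."""
    out = []
    for i, nbrs in enumerate(neighbours):
        group = [features[i]] + [features[j] for j in nbrs]
        out.append([sum(col) / len(group) for col in zip(*group)])
    return out
```

Setting `k=0` leaves every tile's features unchanged (no context), while larger `k` blends in progressively more of the neighbourhood, which is the kind of comparison the study describes.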
Conclusions: Our results indicate that while incorporating more spatial context information generally improves model accuracy at both the bag and tile levels, the improvement at the tile level is not linear. In some instances, increasing spatial context leads to misclassification, suggesting that more context is not always beneficial. This finding highlights the need for careful consideration when incorporating spatial context information in digital pathology classification tasks.
{"title":"Not seeing the trees for the forest. The impact of neighbours on graph-based configurations in histopathology.","authors":"Olga Fourkioti, Matt De Vries, Reed Naidoo, Chris Bakal","doi":"10.1186/s12859-024-06007-x","DOIUrl":"10.1186/s12859-024-06007-x","url":null,"PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"9"},"PeriodicalIF":2.9,"publicationDate":"2025-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11724494/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142963688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}