Pub Date: 2025-01-13 DOI: 10.1186/s12859-024-06005-z
Timo Saratto, Kerkko Visuri, Jonatan Lehtinen, Irene Ortega-Sanz, Jacob L Steenwyk, Samuel Sihvonen
Background: Genomic surveillance is extensively used for tracking public health outbreaks and healthcare-associated pathogens. Despite advancements in bioinformatics pipelines, there are still significant challenges in terms of infrastructure, expertise, and security when it comes to continuous surveillance. The existing pipelines often require the user to set up and manage their own infrastructure and are not designed for continuous surveillance that demands integration of new and regularly generated sequencing data with previous analyses. Additionally, academic projects often do not meet the privacy requirements of healthcare providers.
Results: We present Solu, a cloud-based platform that integrates genomic data into a real-time, privacy-focused surveillance system.
Evaluation: Solu's accuracy for taxonomy assignment, antimicrobial resistance genes, and phylogenetics was comparable to established pathogen surveillance pipelines. In some cases, Solu identified antimicrobial resistance genes that were previously undetected. Together, these findings demonstrate the efficacy of our platform.
Conclusions: By enabling reliable, user-friendly, and privacy-focused genomic surveillance, Solu has the potential to bridge the gap between cutting-edge research and practical, widespread application in healthcare settings. The platform is available for free academic use at https://platform.solugenomics.com .
Title: Solu: a cloud platform for real-time genomic pathogen surveillance. Journal: BMC Bioinformatics 26(1):12. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11731562/pdf/
Background: MicroRNAs (miRNAs) are pivotal in the initiation and progression of complex human diseases and have been identified as targets for small molecule (SM) drugs. However, the expensive and time-intensive characteristics of conventional experimental techniques for identifying SM-miRNA associations highlight the necessity for efficient computational methodologies in this field.
Results: In this study, we proposed a deep learning method called Multi-source Data Fusion and Graph Neural Networks for Small Molecule-MiRNA Association (MDFGNN-SMMA) to predict potential SM-miRNA associations. Firstly, MDFGNN-SMMA extracted features of Atom Pairs fingerprints and Molecular ACCess System fingerprints to derive fusion feature vectors for small molecules (SMs). The K-mer features were employed to generate the initial feature vectors for miRNAs. Secondly, cosine similarity measures were computed to construct the adjacency matrices for SMs and miRNAs, respectively. Thirdly, these feature vectors and adjacency matrices were input into a model comprising GAT and GraphSAGE, which were utilized to generate the final feature vectors for SMs and miRNAs. Finally, the averaged final feature vectors were utilized as input for a multilayer perceptron to predict the associations between SMs and miRNAs.
Conclusions: The performance of MDFGNN-SMMA was assessed using 10-fold cross-validation, demonstrating superior performance compared to four state-of-the-art models in terms of both AUC and AUPR. Moreover, the results on an independent test set confirmed the model's generalization capability. Additionally, the efficacy of MDFGNN-SMMA was substantiated through three case studies. The findings indicated that among the top 50 predicted miRNAs associated with Cisplatin, 5-Fluorouracil, and Doxorubicin, 42, 36, and 36 miRNAs, respectively, were corroborated by existing literature and the RNAInter database.
Title: MDFGNN-SMMA: prediction of potential small molecule-miRNA associations based on multi-source data fusion and graph neural networks. Authors: Jianwei Li, Xukun Zhang, Bing Li, Ziyu Li, Zhenzhen Chen. Pub Date: 2025-01-13 DOI: 10.1186/s12859-025-06040-4. Journal: BMC Bioinformatics 26(1):13. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11730471/pdf/
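The feature-construction steps described in the MDFGNN-SMMA abstract above (K-mer features for miRNAs, then cosine similarity to build an adjacency matrix) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the value of k, the 0.5 similarity threshold, the function names, and the toy sequences are all assumptions, since the abstract does not state how similarities are turned into edges.

```python
from itertools import product
from math import sqrt

def kmer_vector(seq, k=2, alphabet="ACGU"):
    """Count overlapping k-mers in an RNA sequence, in a fixed k-mer order."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
    return [counts[km] for km in kmers]

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def adjacency(features, threshold=0.5):
    """Binary adjacency matrix: edge when similarity >= threshold (assumed rule)."""
    n = len(features)
    return [[1 if i != j and cosine(features[i], features[j]) >= threshold else 0
             for j in range(n)] for i in range(n)]

# Toy sequences, made up for illustration only.
seqs = ["UGAGGUAGUAGGUUGU", "UGAGGUAGUAGGUUGC", "CCCCAAAACCCCAAAA"]
feats = [kmer_vector(s) for s in seqs]
adj = adjacency(feats)
```

Here the first two sequences differ in a single base and end up connected, while the third shares no 2-mers with them and stays isolated.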
Pub Date: 2025-01-11 DOI: 10.1186/s12859-024-06007-x
Olga Fourkioti, Matt De Vries, Reed Naidoo, Chris Bakal
Background: Deep learning (DL) has set new standards in cancer diagnosis, significantly enhancing the accuracy of automated classification of whole slide images (WSIs) derived from biopsied tissue samples. To enable DL models to process these large images, WSIs are typically divided into thousands of smaller tiles, each containing 10-50 cells. Multiple Instance Learning (MIL) is a commonly used approach, where WSIs are treated as bags comprising numerous tiles (instances) and only bag-level labels are provided during training. The model learns from these broad labels to extract more detailed, instance-level insights. However, biopsied sections often exhibit high intra- and inter-phenotypic heterogeneity, presenting a significant challenge for classification. To address this, many graph-based methods have been proposed, where each WSI is represented as a graph with tiles as nodes and edges defined by specific spatial relationships.
Results: In this study, we investigate how different graph configurations, varying in connectivity and neighborhood structure, affect the performance of MIL models. We developed a novel pipeline, K-MIL, to evaluate the impact of contextual information on cell classification performance. By incorporating neighboring tiles into the analysis, we examined whether contextual information improves or impairs the network's ability to identify patterns and features critical for accurate classification. Our experiments were conducted on two datasets: COLON cancer and UCSB datasets.
Conclusions: Our results indicate that while incorporating more spatial context information generally improves model accuracy at both the bag and tile levels, the improvement at the tile level is not linear. In some instances, increasing spatial context leads to misclassification, suggesting that more context is not always beneficial. This finding highlights the need for careful consideration when incorporating spatial context information in digital pathology classification tasks.
Title: Not seeing the trees for the forest. The impact of neighbours on graph-based configurations in histopathology. Journal: BMC Bioinformatics 26(1):9. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11724494/pdf/
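One way to picture the graph configurations discussed in the abstract above is a k-nearest-neighbour graph over tile coordinates, where each tile's features are averaged with those of its neighbours. This is a sketch under stated assumptions: the mean aggregation, the coordinates, the feature values, and the function names are hypothetical and are not taken from K-MIL.

```python
from math import dist

def knn_graph(coords, k):
    """Map each tile index to the indices of its k nearest tiles (Euclidean)."""
    graph = {}
    for i, p in enumerate(coords):
        others = sorted((j for j in range(len(coords)) if j != i),
                        key=lambda j: (dist(p, coords[j]), j))
        graph[i] = others[:k]
    return graph

def neighbour_mean(features, graph):
    """Average each tile's feature vector with those of its neighbours."""
    out = []
    for i, neigh in graph.items():
        group = [features[i]] + [features[j] for j in neigh]
        out.append([sum(col) / len(group) for col in zip(*group)])
    return out

# Three clustered tiles and one distant tile with a distinct feature value.
coords = [(0, 0), (0, 1), (1, 0), (5, 5)]
features = [[1.0], [1.0], [1.0], [9.0]]
graph = knn_graph(coords, k=2)
smoothed = neighbour_mean(features, graph)
```

In this toy case the distant tile's feature is pulled towards the values of tiles far away from it, which gives one intuition for why adding spatial context can hurt tile-level predictions.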
Pub Date: 2025-01-11 DOI: 10.1186/s12859-025-06033-3
Zixin Chen, Chengming Ji, Wenwen Xu, Jianfeng Gao, Ji Huang, Huanliang Xu, Guoliang Qian, Junxian Huang
Antimicrobial peptides (AMPs) have been widely recognized as a promising solution to the antimicrobial resistance of microorganisms that results from the increasing abuse of antibiotics in medicine and agriculture around the globe. In this study, we propose UniAMP, a systematic prediction framework for discovering AMPs. We observe that the feature vectors used in various existing studies, constructed from peptide information such as sequence, composition, and structure, can be augmented and even replaced by information inferred by deep learning models. Specifically, we use a feature vector with 2924 values inferred by two deep learning models, UniRep and ProtT5, to demonstrate that such inferred information suffices for the task when combined with our proposed deep neural network model, which is composed of fully connected layers and transformer encoders and predicts the antibacterial activity of peptides. Evaluation results demonstrate superior performance of our proposed model on both balanced benchmark datasets and imbalanced test datasets compared with existing studies. Subsequently, we analyze the relations among peptide sequences, manually extracted features, and information automatically inferred by deep learning models, leading to the observation that the inferred information is more comprehensive and non-redundant for the task of predicting AMPs. Moreover, this approach alleviates the impact of the scarcity of positive data and demonstrates great potential in future research and applications.
Title: UniAMP: enhancing AMP prediction using deep neural networks with inferred information of peptides. Journal: BMC Bioinformatics 26(1):10. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11725221/pdf/
Pub Date: 2025-01-10 DOI: 10.1186/s12859-024-06022-y
Cristian Iperi, Álvaro Fernández-Ochoa, Guillermo Barturen, Jacques-Olivier Pers, Nathan Foulquier, Eleonore Bettacchioli, Marta Alarcón-Riquelme, Divi Cornec, Anne Bordron, Christophe Jamin
Background: Interpreting biological system changes requires interpreting vast amounts of multi-omics data. While user-friendly tools exist for single-omics analysis, integrating multiple omics still requires bioinformatics expertise, limiting accessibility for the broader scientific community.
Results: BiomiX tackles the bottleneck in high-throughput omics data analysis, enabling efficient and integrated analysis of multi-omics data obtained from two cohorts. BiomiX incorporates diverse omics data, using the DESeq2/Limma packages for transcriptomics and quantifying metabolomics peak differences, evaluated via the Wilcoxon test with False Discovery Rate correction. The annotation of untargeted Liquid Chromatography-Mass Spectrometry metabolomics is additionally supported using the mass-to-charge ratio in the CEU Mass Mediator database and fragmentation spectra in the TidyMass package. Methylomics analysis is performed using the ChAMP R package. Finally, Multi-Omics Factor Analysis (MOFA) integration identifies shared sources of variation across omics data. BiomiX also generates statistics and report figures, and integrates EnrichR and GSEA for biological process exploration and for subgroup analysis based on user-defined gene panels, enhancing condition subtyping. BiomiX fine-tunes MOFA models to optimize the selection of the number of factors, distinguishing between cohorts and providing tools to interpret discriminative MOFA factors. The interpretation relies on an innovative bibliography search on PubMed, which provides the articles most related to the discriminant factor contributors. Furthermore, discriminant MOFA factors are correlated with clinical data, and the top contributing pathways are explored, all with the aim of guiding the user in factor interpretation.
Conclusions: The analysis of single-omics and multi-omics integration in a standalone tool, along with MOFA implementation and its interpretability via literature, represents significant progress in the multi-omics field in line with the "Findable, Accessible, Interoperable, and Reusable" data principles. BiomiX offers a wide range of parameters and interactive data visualization, allowing for personalized analysis tailored to user needs. This R-based, user-friendly tool is compatible with multiple operating systems and aims to make multi-omics analysis accessible to non-experts in bioinformatics.
Title: BiomiX, a user-friendly bioinformatic tool for democratized analysis and integration of multiomics data. Journal: BMC Bioinformatics 26(1):8. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11721463/pdf/
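The abstract above mentions a False Discovery Rate correction applied to Wilcoxon test p-values. A common choice for this is the Benjamini-Hochberg procedure, sketched below in plain Python. Whether BiomiX uses this exact procedure is an assumption, and the p-values and the 0.05 cut-off are made up for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-peak p-values, e.g. from a two-cohort Wilcoxon test.
pvals = [0.01, 0.04, 0.03, 0.20]
qvals = benjamini_hochberg(pvals)
significant = [q <= 0.05 for q in qvals]
```

With these toy values only the first peak remains significant after correction, although three of the four raw p-values were below 0.05.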
Pub Date: 2025-01-08 DOI: 10.1186/s12859-024-05949-6
Leonardo Campillos-Llanos, Ana Valverde-Mateos, Adrián Capllonch-Carrión
Background: Natural language processing (NLP) enables the extraction of information embedded within unstructured texts, such as clinical case reports and trial eligibility criteria. By identifying relevant medical concepts, NLP facilitates the generation of structured and actionable data, supporting complex tasks like cohort identification and the analysis of clinical records. To accomplish those tasks, we introduce a deep learning-based and lexicon-based named entity recognition (NER) tool for texts in Spanish. It performs medical NER and normalization, medication information extraction and detection of temporal entities, negation and speculation, and temporality or experiencer attributes (Age, Contraindicated, Negated, Speculated, Hypothetical, Future, Family_member, Patient and Other). We built the tool with a dedicated lexicon and rules adapted from NegEx and HeidelTime. Using these resources, we annotated a corpus of 1200 texts, with high inter-annotator agreement (average F1 = 0.841 ± 0.045 for entities, and average F1 = 0.881 ± 0.032 for attributes). We used this corpus to train Transformer-based models (RoBERTa-based models, mBERT and mDeBERTa). We integrated them with the dictionary-based system in a hybrid tool, and we distribute the models via the Hugging Face hub. For an internal validation, we used a held-out test set and conducted an error analysis. For an external validation, eight medical professionals evaluated the system by revising the annotation of 200 new texts not used in development.
Results: In the internal validation, the models yielded F1 values up to 0.915. In the external validation with 100 clinical trials, the tool achieved an average F1 score of 0.858 (± 0.032); and in 100 anonymized clinical cases, it achieved an average F1 score of 0.910 (± 0.019).
Conclusions: The tool is available at https://claramed.csic.es/medspaner . We also release the code ( https://github.com/lcampillos/medspaner ) and the annotated corpus to train the models.
Title: Hybrid natural language processing tool for semantic annotation of medical texts in Spanish. Journal: BMC Bioinformatics 26(1):7. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11708069/pdf/
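The F1 values reported in the abstract above combine precision and recall over annotated entities. The sketch below computes an entity-level F1 with exact span-and-label matching. The matching criterion, the label names, and the example spans are assumptions for illustration; the abstract does not specify how matches were counted.

```python
def f1_score(gold, predicted):
    """Entity-level F1 where an entity counts only on an exact (start, end, label) match."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical annotations as (start offset, end offset, label) triples.
gold = {(0, 8, "DISO"), (15, 26, "CHEM"), (30, 37, "PROC")}
pred = {(0, 8, "DISO"), (15, 26, "CHEM"), (40, 45, "ANAT")}
score = f1_score(gold, pred)
```

Two of three predictions match two of three gold entities here, so precision, recall, and F1 are all 2/3.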
Pub Date: 2025-01-07 DOI: 10.1186/s12859-024-06032-w
Xianfang Tang, Yawen Hou, Yajie Meng, Zhaojing Wang, Changcheng Lu, Juan Lv, Xinrong Hu, Junlin Xu, Jialiang Yang
New drug development is a complex process, whereas drug-disease association (DDA) prediction aims to identify new therapeutic uses for existing medications. However, existing graph contrastive learning approaches typically rely on single-view contrastive learning, which struggles to fully capture drug-disease relationships. To address this, we introduce a novel multi-view contrastive learning framework, named CDPMF-DDA, which enhances the model's ability to capture drug-disease associations by incorporating diverse information representations from different views. First, we decompose the original drug-disease association matrix into drug and disease feature matrices, which are then used to reconstruct the drug-disease association network, as well as the drug-drug and disease-disease similarity networks. This process effectively reduces noise in the data, establishing a reliable foundation for the networks produced. Next, we generate multiple contrastive views from both the original and generated networks. These views effectively capture hidden feature associations, significantly enhancing the model's ability to represent complex relationships. Extensive cross-validation experiments on three standard datasets show that CDPMF-DDA achieves an average AUC of 0.9475 and an AUPR of 0.5009, outperforming existing models. Additionally, case studies on Alzheimer's disease and epilepsy further validate the model's effectiveness, demonstrating its high accuracy and robustness in drug-disease association prediction. Based on a multi-view contrastive learning framework, CDPMF-DDA is capable of integrating multi-source information and effectively capturing complex drug-disease associations, making it a powerful tool for drug repositioning and the discovery of new therapeutic strategies.
Title: CDPMF-DDA: contrastive deep probabilistic matrix factorization for drug-disease association prediction. Journal: BMC Bioinformatics 26(1):5. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11708303/pdf/
Pub Date: 2025-01-07 DOI: 10.1186/s12859-024-06027-7
James P Long, Yumeng Yang, Shohei Shimizu, Thong Pham, Kim-Anh Do
In cell line perturbation experiments, a collection of cells is perturbed with external agents and responses such as protein expression are measured. Due to cost constraints, only a small fraction of all possible perturbations can be tested in vitro. This has led to the development of computational models that can predict cellular responses to perturbations in silico. A central challenge for these models is to predict the effect of new perturbations that were not used in the training data. Here we propose causal structural equations for modeling how perturbations affect cells. From this model, we derive two estimators for predicting responses: a Linear Regression (LR) estimator and a causal structure learning estimator that we term Causal Structure Regression (CSR). The CSR estimator requires more assumptions than LR, but can predict the effects of drugs that were not applied in the training data. Next, we present Cellbox, a recently proposed model based on a system of ordinary differential equations (ODEs) that obtained the best prediction performance on a melanoma cell line perturbation data set (Yuan et al. in Cell Syst 12:128-140, 2021). We derive analytic results that show a close connection between CSR and Cellbox, providing a new causal interpretation for the Cellbox model. We compare LR and CSR/Cellbox in simulations, highlighting the strengths and weaknesses of the two approaches. Finally, we compare the performance of LR and CSR/Cellbox on the benchmark melanoma data set. We find that the LR model has comparable or slightly better performance than Cellbox.
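A small simulation may help show why a plain regression estimator struggles with drugs that never appear in training. This is a sketch under stated assumptions, not the paper's LR estimator: it assumes a linear, intercept-free response model `Y = X @ B`, simulated doses and noise, and a hypothetical helper name `fit_linear_response`.

```python
import numpy as np

def fit_linear_response(X, Y):
    """Least-squares fit of Y ~= X @ B.

    X: (n_conditions, n_drugs) dose matrix.
    Y: (n_conditions, n_proteins) measured responses.
    Returns B with shape (n_drugs, n_proteins).
    """
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return B

rng = np.random.default_rng(0)
n_conditions, n_drugs, n_proteins = 40, 3, 5
B_true = rng.normal(size=(n_drugs, n_proteins))  # simulated true effects

# The last drug (column index 2) is never applied in the training conditions.
X_train = rng.uniform(0, 1, size=(n_conditions, n_drugs))
X_train[:, 2] = 0.0
noise = 0.01 * rng.normal(size=(n_conditions, n_proteins))
Y_train = X_train @ B_true + noise

B_hat = fit_linear_response(X_train, Y_train)

# Effects of the two applied drugs are recovered closely, while the
# minimum-norm solution assigns the unseen drug an all-zero effect row.
unseen_drug_effect = B_hat[2]
```

Because the training data carry no information about the unseen drug, the regression can only default to a zero effect for it. This is consistent with the abstract's point that predicting untested drugs requires the additional structural assumptions made by CSR.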
{"title":"Causal models and prediction in cell line perturbation experiments.","authors":"James P Long, Yumeng Yang, Shohei Shimizu, Thong Pham, Kim-Anh Do","doi":"10.1186/s12859-024-06027-7","DOIUrl":"https://doi.org/10.1186/s12859-024-06027-7","journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"4"},"PeriodicalIF":2.9,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11707890/pdf/"}
Pub Date: 2025-01-07 DOI: 10.1186/s12859-024-06023-x
Weijie Yang, Jingsi Ji, Gang Fang
Background: Ortholog prediction, essential for various genomic research areas, faces growing inconsistencies amidst the expanding array of ortholog databases. The common strategy of computing consensus orthologs introduces additional arbitrariness, emphasizing the need to examine the causes of such inconsistencies and identify proteins susceptible to prediction errors.
Results: We introduce the Signal Jaccard Index (SJI), a novel metric rooted in unsupervised genome context clustering, designed to assess protein similarity. Leveraging SJI, we construct a protein network and reveal that peripheral proteins within the network are the primary contributors to inconsistencies in orthology predictions. Furthermore, we show that a protein's degree centrality in the network serves as a strong predictor of its reliability in consensus sets.
Conclusions: We present an objective, unsupervised SJI-based network encompassing all proteins, whose topological features elucidate ortholog prediction inconsistencies. The degree centrality (DC) effectively identifies error-prone orthology assignments without relying on arbitrary parameters. Notably, DC is stable, unaffected by species selection, and well-suited for ortholog benchmarking. This approach transcends the limitations of universal thresholds, offering a robust and quantitative framework to explore protein evolution and functional relationships.
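The idea of building a similarity network and reading reliability off degree centrality can be sketched as follows. The abstract does not define the Signal Jaccard Index, so the ordinary Jaccard index is used here as a stand-in; the genome-context cluster labels, the protein names, and the rule "connect two proteins if their similarity exceeds `min_similarity`" are all assumptions for illustration.

```python
from itertools import combinations

def jaccard(a, b):
    """Ordinary Jaccard index of two sets (stand-in for the paper's SJI)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def degree_centrality(contexts, min_similarity=0.0):
    """Degree centrality of each protein in a similarity network.

    contexts maps a protein name to a set of (hypothetical) genome-context
    cluster labels. Two proteins are linked when their Jaccard index is
    greater than min_similarity.
    """
    names = list(contexts)
    degree = {name: 0 for name in names}
    for p, q in combinations(names, 2):
        if jaccard(contexts[p], contexts[q]) > min_similarity:
            degree[p] += 1
            degree[q] += 1
    n = len(names)
    # Normalize by the maximum possible number of neighbors.
    return {name: (d / (n - 1) if n > 1 else 0.0) for name, d in degree.items()}

# Hypothetical toy input: protD shares no context with the others.
contexts = {
    "protA": {"c1", "c2", "c3"},
    "protB": {"c1", "c2"},
    "protC": {"c2", "c3"},
    "protD": {"c9"},
}
centrality = degree_centrality(contexts)
```

In this toy network, `protD` ends up on the periphery with a centrality of zero. Peripheral proteins of this kind are the ones the abstract reports as the main source of inconsistent orthology predictions.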
{"title":"A metric and its derived protein network for evaluation of ortholog database inconsistency.","authors":"Weijie Yang, Jingsi Ji, Gang Fang","doi":"10.1186/s12859-024-06023-x","DOIUrl":"https://doi.org/10.1186/s12859-024-06023-x","journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"6"},"PeriodicalIF":2.9,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11707888/pdf/"}
Background: Accurate taxonomic classification in genome databases is essential for reliable biological research and effective data sharing. Mislabeling or inaccuracies in genome annotations can lead to incorrect scientific conclusions and hinder the reproducibility of research findings. Despite advances in genome analysis techniques, challenges persist in ensuring precise and reliable taxonomic assignments. Existing tools for genome verification often involve extensive computational resources or lengthy processing times, which can limit their accessibility and scalability for large-scale projects. There is a need for more efficient, user-friendly solutions that can handle diverse datasets and provide accurate results with minimal computational demands. This work aimed to address these challenges by introducing a novel tool that enhances taxonomic accuracy, offers a user-friendly interface, and supports large-scale analyses.
Results: We introduce a novel tool for quality control and taxonomic classification of prokaryotic genomes, called DFAST_QC, which is available as both a command-line tool and a web service. DFAST_QC can quickly identify species based on NCBI and GTDB taxonomies by combining genome-distance calculations using Mash with ANI calculations using skani. We evaluated DFAST_QC's performance in species identification and found it to be highly consistent with existing taxonomic standards, successfully identifying species across diverse datasets. In several cases, DFAST_QC identified potential mislabeling of species names in public databases and highlighted discrepancies in current classifications, demonstrating its capability to uncover errors and enhance taxonomic accuracy. Additionally, the tool's efficient design allows it to operate smoothly on local machines with minimal computational requirements, making it a practical choice for large-scale genome projects.
Conclusions: DFAST_QC is a reliable and efficient tool for accurate taxonomic identification and genome quality control, well-suited for large-scale genomic studies. Its compatibility with limited-resource environments, combined with its user-friendly design, ensures seamless integration into existing workflows. DFAST_QC's ability to refine species assignments in public databases highlights its value as a complementary tool for maintaining and enhancing the accuracy of taxonomic data in genomic research. The web version is available at https://dfast.ddbj.nig.ac.jp/dqc/submit/ , and the source code for local use can be found at https://github.com/nigyta/dfast_qc .
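To make the genome-distance step more concrete, the sketch below computes a Mash-style distance from the Jaccard index of two k-mer sets, using the published Mash formula D = -(1/k) ln(2j / (1 + j)), and uses it to shortlist reference genomes. Several things here are assumptions rather than DFAST_QC behavior: exact k-mer sets replace MinHash sketches, the tiny `k`, the `max_distance` cutoff, and the function names are hypothetical, and treating the distance as a pre-filter ahead of an ANI calculation is one plausible reading of "combining" the two tools. No ANI is computed here.

```python
import math

def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def mash_distance(seq_a, seq_b, k=4):
    """Mash-style distance from the exact Jaccard index of k-mer sets.

    Real Mash estimates the Jaccard index from MinHash sketches and uses a
    much larger k; exact sets and k=4 keep this toy example readable.
    """
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    union = a | b
    j = len(a & b) / len(union) if union else 0.0
    if j == 0.0:
        return 1.0  # no shared k-mers: maximum distance
    d = -math.log(2 * j / (1 + j)) / k
    return min(1.0, max(0.0, d))

def shortlist(query, references, max_distance=0.1, k=4):
    """Names of reference sequences within max_distance, closest first."""
    hits = sorted((mash_distance(query, seq, k), name)
                  for name, seq in references.items())
    return [name for d, name in hits if d <= max_distance]

# Hypothetical toy sequences standing in for genomes.
references = {"close": "ACGTACGTTGCA", "far": "TTTTTTTTTTTT"}
candidates = shortlist("ACGTACGTTGCA", references)
```

Only the shortlisted candidates would then need a more expensive ANI comparison, which is one way a tool could stay fast on a local machine.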
{"title":"DFAST_QC: quality assessment and taxonomic identification tool for prokaryotic Genomes.","authors":"Mohamed Elmanzalawi, Takatomo Fujisawa, Hiroshi Mori, Yasukazu Nakamura, Yasuhiro Tanizawa","doi":"10.1186/s12859-024-06030-y","DOIUrl":"https://doi.org/10.1186/s12859-024-06030-y","journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"3"},"PeriodicalIF":2.9,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11705978/pdf/"}