Pub Date: 2024-08-27 DOI: 10.1186/s12859-024-05882-8
Chrisostomos Drogaris, Yanlin Zhang, Eric Zhang, Elena Nazarova, Roman Sarrazin-Gendron, Sélik Wilhelm-Landry, Yan Cyr, Jacek Majewski, Mathieu Blanchette, Jérôme Waldispühl
Over the past two decades, scientists have increasingly realized the importance of the three-dimensional (3D) genome organization in regulating cellular activity. Hi-C and related experiments yield 2D contact matrices that can be used to infer 3D models of chromosome structure. Visualizing and analyzing genomes in 3D space remains challenging. Here, we present ARGV, an augmented reality 3D Genome Viewer. ARGV contains more than 350 pre-computed and annotated genome structures inferred from Hi-C and imaging data. It offers interactive and collaborative visualization of genomes in 3D space, using standard mobile phones or tablets. A user study comparing ARGV to existing tools demonstrates its benefits.
Title: ARGV: 3D genome structure exploration using augmented reality. Journal: BMC Bioinformatics. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11348660/pdf/
Pub Date: 2024-08-27 DOI: 10.1186/s12859-024-05913-4
Mondher Khdhiri, Ella Thomas, Chanel de Smet, Priyanka Chandar, Induja Chandrakumar, Jean M Davidson, Paul Anderson, Samuel D Chorlton
Background: Commonly used approaches for genomic investigation of bacterial outbreaks, including SNP and gene-by-gene approaches, are limited by the requirement for background genomes and curated allele schemes, respectively. As a result, they only work on a select subset of known organisms, and fail on novel or less studied pathogens. We introduce refMLST, a gene-by-gene approach using the reference genome of a bacterium to form a scalable, reproducible and robust method to perform outbreak investigation.
Results: When applied to multiple outbreak-causing bacteria, including 1263 Salmonella enterica, 331 Yersinia enterocolitica and 6526 Campylobacter jejuni genomes, refMLST enabled consistent clustering, improved resolution, and faster processing compared with commonly used tools such as chewieSnake.
Conclusions: refMLST is a novel multilocus sequence typing approach that is applicable to any bacterial species with a public reference genome, does not require a curated scheme, and automatically accounts for genetic recombination.
Availability and implementation: refMLST is freely available for academic use at https://bugseq.com/academic .
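The gene-by-gene idea described above can be illustrated with a minimal sketch. It assumes each isolate has already been reduced to an allele profile (one allele identifier per locus) and simply counts the loci at which two isolates differ. The isolate names, locus names, and allele numbers are hypothetical, and this is a generic allele-distance calculation, not refMLST's own implementation.

```python
from itertools import combinations


def allele_distance(profile_a, profile_b):
    """Count loci where both isolates have a called allele and the alleles differ."""
    shared_loci = profile_a.keys() & profile_b.keys()
    return sum(
        1
        for locus in shared_loci
        if profile_a[locus] is not None
        and profile_b[locus] is not None
        and profile_a[locus] != profile_b[locus]
    )


def pairwise_distances(profiles):
    """Return the allele distance for every unordered pair of isolates."""
    return {
        (a, b): allele_distance(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }


# Hypothetical allele profiles; None marks a locus with no allele call.
profiles = {
    "isolate_1": {"gene_1": 1, "gene_2": 4, "gene_3": 2},
    "isolate_2": {"gene_1": 1, "gene_2": 4, "gene_3": 3},
    "isolate_3": {"gene_1": 2, "gene_2": None, "gene_3": 3},
}

distances = pairwise_distances(profiles)
for pair, distance in distances.items():
    print(pair, distance)
```

Ignoring loci without a call in either isolate is one common convention; a stricter pipeline might instead count a missing call as a difference.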
Title: refMLST: reference-based multilocus sequence typing enables universal bacterial typing. Journal: BMC Bioinformatics. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11351335/pdf/
Background: Mining the vast pool of biomedical literature to extract accurate responses and relevant references is challenging due to the domain's interdisciplinary nature, specialized jargon, and continuous evolution. Early natural language processing (NLP) approaches often led to incorrect answers as they failed to comprehend the nuances of natural language. However, transformer models have significantly advanced the field by enabling the creation of large language models (LLMs), enhancing question-answering (QA) tasks. Despite these advances, current LLM-based solutions for specialized domains like biology and biomedicine still struggle to generate up-to-date responses while avoiding "hallucination" or generating plausible but factually incorrect responses.
Results: Our work focuses on enhancing prompts using a retrieval-augmented architecture to guide LLMs in generating meaningful responses for biomedical QA tasks. We evaluated two approaches: one relying on text embedding and vector similarity in a high-dimensional space, and our proposed method, which uses explicit signals in user queries to extract meaningful contexts. For robust evaluation, we tested these methods on 50 specific and challenging questions from diverse biomedical topics, comparing their performance against a baseline model, BM25. Retrieval performance of our method was significantly better than that of the others, achieving a median Precision@10 (the fraction of the top 10 retrieved chunks that are relevant) of 0.95. We used GPT-4, OpenAI's most advanced LLM, to maximize the answer quality and manually assessed LLM-generated responses. Our method achieved a median answer quality score of 2.5, surpassing both the baseline model and the text embedding-based approach. We developed a QA bot, WeiseEule ( https://github.com/wasimaftab/WeiseEule-LocalHost ), which utilizes these methods for comparative analysis and also offers advanced features for review writing and identifying relevant articles for citation.
Conclusions: Our findings highlight the importance of prompt enhancement methods that utilize explicit signals in user queries over traditional text embedding-based approaches to improve LLM-generated responses for specialized queries in specialized domains such as biology and biomedicine. By providing users complete control over the information fed into the LLM, our approach addresses some of the major drawbacks of existing web-based chatbots and LLM-based QA systems, including hallucinations and the generation of irrelevant or outdated responses.
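The Precision@10 metric reported in the Results is the fraction of the top 10 retrieved chunks that are relevant. A minimal sketch of that calculation, and of taking its median over several questions, follows. The function names, chunk identifiers, and relevance judgments are hypothetical and are not taken from the WeiseEule code base.

```python
import statistics


def precision_at_k(retrieved, relevant, k=10):
    """Fraction of the top-k retrieved chunk IDs that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    # Dividing by k (not by the number retrieved) penalizes short result lists.
    return hits / k


def median_precision_at_k(runs, k=10):
    """Median Precision@k over (retrieved, relevant) pairs, one pair per question."""
    return statistics.median(
        precision_at_k(retrieved, relevant, k) for retrieved, relevant in runs
    )


# Hypothetical ranked results for one question: 12 chunks, 9 of the top 10 relevant.
retrieved = [f"chunk_{i}" for i in range(1, 13)]
relevant = {f"chunk_{i}" for i in range(1, 10)}
print(precision_at_k(retrieved, relevant))
```

Reporting the median rather than the mean keeps a few questions with very poor retrieval from dominating the summary.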
Title: Optimizing biomedical information retrieval with a keyword frequency-driven prompt enhancement strategy. Authors: Wasim Aftab, Zivkos Apostolou, Karim Bouazoune, Tobias Straub. Journal: BMC Bioinformatics. Pub Date: 2024-08-27 DOI: 10.1186/s12859-024-05902-7. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11351623/pdf/
Pub Date: 2024-08-27 DOI: 10.1186/s12859-024-05776-9
Ravikiran Donthu, Jose A P Marcelino, Rosanna Giordano, Yudong Tao, Everett Weber, Arian Avalos, Mark Band, Tatsiana Akraiko, Shu-Ching Chen, Maria P Reyes, Haiping Hao, Yarira Ortiz-Alvarado, Charles A Cuff, Eddie Pérez Claudio, Felipe Soto-Adames, Allan H Smith-Pardo, William G Meikle, Jay D Evans, Tugrul Giray, Faten B Abdelkader, Mike Allsopp, Daniel Ball, Susana B Morgado, Shalva Barjadze, Adriana Correa-Benitez, Amina Chakir, David R Báez, Nabor H M Chavez, Anne Dalmon, Adrian B Douglas, Carmen Fraccica, Hermógenes Fernández-Marín, Alberto Galindo-Cardona, Ernesto Guzman-Novoa, Robert Horsburgh, Meral Kence, Joseph Kilonzo, Mert Kükrer, Yves Le Conte, Gaetana Mazzeo, Fernando Mota, Elliud Muli, Devrim Oskay, José A Ruiz-Martínez, Eugenia Oliveri, Igor Pichkhaia, Abderrahmane Romane, Cesar Guillen Sanchez, Evans Sikombwa, Alberto Satta, Alejandra A Scannapieco, Brandi Stanford, Victoria Soroker, Rodrigo A Velarde, Monica Vercelli, Zachary Huang
Background: Honey bees are the principal commercial pollinators. Along with other arthropods, they are increasingly under threat from anthropogenic factors such as the incursion of invasive honey bee subspecies, pathogens and parasites. Better tools are needed to identify bee subspecies. Genomic data for economically and ecologically important organisms are increasing, but in their basic form their practical application to ecological problems is limited.
Results: We introduce HBeeID, a means to identify honey bees. The tool utilizes a knowledge-based network and diagnostic SNPs identified by discriminant analysis of principal components and hierarchical agglomerative clustering. Tests of HBeeID showed that it identifies African, Americas-Africanized, Asian, and European honey bees with a high degree of certainty even when samples lack the full 272 SNPs of HBeeID. Its prediction capacity decreases with highly admixed samples.
Conclusion: HBeeID is a high-resolution, SNP-based genomic tool that can be used to identify honey bees and screen species that are invasive. Its flexible design allows for future improvements via sample data additions from other localities.
Title: HBeeID: a molecular tool that identifies honey bee subspecies from different geographic populations. Journal: BMC Bioinformatics. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11348773/pdf/
Pub Date: 2024-08-27 DOI: 10.1186/s12859-024-05885-5
O J Charles, C Venturini, R A Goldstein, J Breuer
The prevention and treatment of many herpesvirus-associated diseases rely on antiviral therapies; however, therapeutic success is limited by the development of drug resistance. Currently, no single database cataloguing resistance mutations exists, which hampers the use of sequence data for patient management. We therefore developed HerpesDRG, a drug resistance mutation database that incorporates all the known resistance genes and current treatment options, built from a systematic review of the available genotype-to-phenotype literature. The database is released along with an R package that provides a simple approach to resistance variant annotation and clinical implication analysis from common Sanger and next-generation sequencing data. This represents the first openly available and community-maintainable database of drug resistance mutations for the human herpesviruses (HHV), developed for the community of researchers and clinicians tackling HHV drug resistance.
Title: HerpesDRG: a comprehensive resource for human herpesvirus antiviral drug resistance genotyping. Journal: BMC Bioinformatics. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11350968/pdf/
Pub Date: 2024-08-24 DOI: 10.1186/s12859-024-05900-9
Weixuan Liu, Thao Vu, Iain R Konigsberg, Katherine A Pratte, Yonghua Zhuang, Katerina J Kechris
Sparse multiple canonical correlation network analysis (SmCCNet) is a machine learning technique for integrating omics data along with a variable of interest (e.g., phenotype of complex disease), and reconstructing multi-omics networks that are specific to this variable. We present the second-generation SmCCNet (SmCCNet 2.0), which integrates single or multiple omics data types along with a quantitative or binary phenotype of interest. In addition, this new package offers a streamlined setup process that can be configured manually or automatically, ensuring a flexible and user-friendly experience. Availability: This package is available from both CRAN ( https://cran.r-project.org/web/packages/SmCCNet/index.html ) and GitHub ( https://github.com/KechrisLab/SmCCNet ) under the MIT license. The network visualization tool is available at https://smccnet.shinyapps.io/smccnetnetwork/ .
Title: SmCCNet 2.0: a comprehensive tool for multi-omics network inference with shiny visualization. Journal: BMC Bioinformatics. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11344457/pdf/
Pub Date: 2024-08-23 DOI: 10.1186/s12859-024-05904-5
Beiyi Zhang, Dongjiang Niu, Lianwei Zhang, Qiang Zhang, Zhen Li
Background: The rise of network pharmacology has led to the widespread use of network-based computational methods in predicting drug target interaction (DTI). However, existing DTI prediction models typically rely on a limited amount of data to extract drug and target features, potentially affecting the comprehensiveness and robustness of features. In addition, although multiple networks are used for DTI prediction, the integration of heterogeneous information often involves simplistic aggregation and attention mechanisms, which may impose certain limitations.
Results: MSH-DTI, a deep learning model for predicting drug-target interactions, is proposed in this paper. The model uses self-supervised learning methods to obtain drug and target structure features. A Heterogeneous Interaction-enhanced Feature Fusion Module is designed for multi-graph construction, and the graph convolutional networks are used to extract node features. With the help of an attention mechanism, the model focuses on the important parts of different features for prediction. Experimental results show that the AUROC and AUPR of MSH-DTI are 0.9620 and 0.9605 respectively, outperforming other models on the DTINet dataset.
Conclusion: The proposed MSH-DTI is a helpful tool to discover drug-target interactions, which is also validated through case studies in predicting new DTIs.
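For readers unfamiliar with the AUROC figure quoted in the Results, the metric can be read as the probability that a randomly chosen true interaction receives a higher score than a randomly chosen non-interaction. The sketch below computes it by direct pairwise comparison, under the assumption of binary labels (1 for an interacting pair, 0 otherwise). The labels and scores are made up for illustration and have no connection to the DTINet dataset or to MSH-DTI's outputs.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predictions for six drug-target pairs.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auroc(labels, scores))
```

The quadratic loop is fine for a small illustration; on large test sets a rank-based formulation gives the same value far more cheaply.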
Title: MSH-DTI: multi-graph convolution with self-supervised embedding and heterogeneous aggregation for drug-target interaction prediction. Journal: BMC Bioinformatics. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11342675/pdf/
Pub Date: 2024-08-22 DOI: 10.1186/s12859-024-05852-0
Rick Z Li, Claudia Z Han, Christopher K Glass
Background: Growing evidence suggests that distal regulatory elements are essential for cellular function and states. The sequences within these distal elements, especially motifs for transcription factor binding, provide critical information about the underlying regulatory programs. However, cooperativities between transcription factors that recognize these motifs are nonlinear and multiplexed, rendering traditional modeling methods insufficient to capture the underlying mechanisms. Recently developed attention mechanisms exhibit superior performance in capturing dependencies across input sequences, which makes them well suited to uncovering and deciphering intricate dependencies between regulatory elements.
Result: We present Transcription factors cooperativity Inference Analysis with Neural Attention (TIANA), a deep learning framework that focuses on interpretability. In this study, we demonstrated that TIANA could discover biologically relevant insights into co-occurring pairs of transcription factor motifs. Compared with existing tools, TIANA showed superior interpretability and robust performance in identifying putative transcription factor cooperativities from co-occurring motifs.
Conclusion: Our results suggest that TIANA can be an effective tool to decipher transcription factor cooperativities from distal sequence data. TIANA can be accessed through: https://github.com/rzzli/TIANA .
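To make the attention idea in the Background concrete, here is a minimal sketch of scaled dot-product attention weights, the standard formulation in which each position scores every other position and the scores are normalized with a softmax. It is a generic illustration only: the motif embeddings are hypothetical two-dimensional vectors, and nothing here reflects TIANA's actual architecture or training.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Scaled dot-product attention weights: softmax(q . k / sqrt(d)) per query."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(q_i * k_i for q_i, k_i in zip(q, k)) / scale for k in keys])
        for q in queries
    ]


# Hypothetical embeddings for three motif positions; the first two are similar.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights = attention_weights(embeddings, embeddings)
for row in weights:
    print([round(w, 3) for w in row])
```

In this toy setting, position 0 places more weight on the similar position 1 than on position 2, which is the kind of pairwise dependency an interpretability analysis would inspect.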
Title: TIANA: transcription factors cooperativity inference analysis with neural attention. Journal: BMC Bioinformatics. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11342676/pdf/
Pub Date: 2024-08-21 DOI: 10.1186/s12859-024-05887-3
Lorenzo Arcioni, Manuel Arcieri, Jessica Di Martino, Franco Liberati, Paolo Bottoni, Tiziana Castrignanò
Background: The availability of transcriptomic data for species without a reference genome enables the construction of de novo transcriptome assemblies as alternative reference resources from RNA-Seq data. A transcriptome provides direct information about a species' protein-coding genes under specific experimental conditions. The de novo assembly process produces a unigenes file in FASTA format, subsequently targeted for the annotation. Homology-based annotation, a method to infer the function of sequences by estimating similarity with other sequences in a reference database, is a computationally demanding procedure.
Results: To mitigate the computational burden, we introduce HPC-T-Annotator, a tool for de novo transcriptome homology annotation on high performance computing (HPC) infrastructures, designed for straightforward configuration via a Web interface. Once the configuration data are given, the entire parallel computing software for annotation is automatically generated and can be launched on a supercomputer using a simple command line. The output data can then be easily viewed using post-processing utilities in the form of Python notebooks integrated in the proposed software.
Conclusions: HPC-T-Annotator expedites homology-based annotation in de novo transcriptome assemblies. Its efficient parallelization strategy on HPC infrastructures significantly reduces computational load and execution times, enabling large-scale transcriptome analysis and comparison projects, while its intuitive graphical interface extends accessibility to users without IT skills.
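As an illustration of the kind of data decomposition that makes homology annotation parallelizable, the following Python sketch splits FASTA records into balanced parts, one per worker job. It is not taken from HPC-T-Annotator: the function names `parse_fasta` and `partition_records` and the greedy longest-first balancing are assumptions made for this example only.

```python
from typing import Iterator, List, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def parse_fasta(text: str) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def partition_records(records: List[Record], n_parts: int) -> List[List[Record]]:
    """Split records into n_parts lists with roughly equal total length.

    Hypothetical balancing rule: place each record, longest first,
    into the part that currently holds the fewest residues.
    """
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    parts: List[List[Record]] = [[] for _ in range(n_parts)]
    loads = [0] * n_parts
    for rec in sorted(records, key=lambda r: len(r[1]), reverse=True):
        target = loads.index(min(loads))
        parts[target].append(rec)
        loads[target] += len(rec[1])
    return parts


if __name__ == "__main__":
    demo = ">a\nACGT\nAC\n>b\nGG\n>c\nTTTT\n"
    for i, part in enumerate(partition_records(list(parse_fasta(demo)), 2)):
        print(i, [header for header, _ in part])
```

Each part could then be written to its own FASTA file and submitted as a separate alignment job against the reference database. Balancing by sequence length rather than by record count is one plausible choice, since alignment time tends to grow with query length.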
{"title":"HPC-T-Annotator: an HPC tool for de novo transcriptome assembly annotation.","authors":"Lorenzo Arcioni, Manuel Arcieri, Jessica Di Martino, Franco Liberati, Paolo Bottoni, Tiziana Castrignanò","doi":"10.1186/s12859-024-05887-3","DOIUrl":"10.1186/s12859-024-05887-3","url":null,"abstract":"<p><strong>Background: </strong>The availability of transcriptomic data for species without a reference genome enables the construction of de novo transcriptome assemblies as alternative reference resources from RNA-Seq data. A transcriptome provides direct information about a species' protein-coding genes under specific experimental conditions. The de novo assembly process produces a unigenes file in FASTA format, subsequently targeted for the annotation. Homology-based annotation, a method to infer the function of sequences by estimating similarity with other sequences in a reference database, is a computationally demanding procedure.</p><p><strong>Results: </strong>To mitigate the computational burden, we introduce HPC-T-Annotator, a tool for de novo transcriptome homology annotation on high performance computing (HPC) infrastructures, designed for straightforward configuration via a Web interface. Once the configuration data are given, the entire parallel computing software for annotation is automatically generated and can be launched on a supercomputer using a simple command line. The output data can then be easily viewed using post-processing utilities in the form of Python notebooks integrated in the proposed software.</p><p><strong>Conclusions: </strong>HPC-T-Annotator expedites homology-based annotation in de novo transcriptome assemblies. 
Its efficient parallelization strategy on HPC infrastructures significantly reduces computational load and execution times, enabling large-scale transcriptome analysis and comparison projects, while its intuitive graphical interface extends accessibility to users without IT skills.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":null,"pages":null},"PeriodicalIF":2.9,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11340092/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142016292","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-08-21 DOI: 10.1186/s12859-024-05903-6
Seonho Kim, Juntae Yoon
Background: There have been considerable advances in AI technologies, such as large language models (LLMs) and machine learning, that support biomedical knowledge discovery.
Main body: We propose a novel biomedical neural search service called 'VAIV Bio-Discovery', which supports enhanced knowledge discovery and document search on unstructured text such as PubMed. It mainly handles information related to chemical compounds/drugs, genes/proteins, diseases, and their interactions (chemical compound/drug-protein/gene, including drug-target, drug-drug, and drug-disease). To provide comprehensive knowledge, the system offers four search options: basic search, entity search, interaction search, and natural language search. We employ T5slim_dec, which adapts the autoregressive generation task of T5 (text-to-text transfer transformer) to the interaction extraction task by removing the self-attention layer in the decoder block. The system also assists in interpreting research findings by summarizing the search results retrieved for a given natural language query with Retrieval Augmented Generation (RAG). The search engine is built with a hybrid method that combines neural search with BM25 probabilistic search.
Conclusion: As a result, our system can better understand the context, semantics and relationships between terms within the document, enhancing search accuracy. This research contributes to the rapidly evolving biomedical field by introducing a new service to access and discover relevant knowledge.
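To make the idea of hybrid retrieval concrete, the Python sketch below scores documents with Okapi BM25 and blends that with a dense (embedding) similarity. The abstract does not say how VAIV Bio-Discovery fuses the two signals, so the min-max normalization, the weight `alpha`, and all function names here are assumptions for illustration, not the authors' method.

```python
import math
from collections import Counter
from typing import List, Sequence


def bm25_scores(query: Sequence[str], docs: Sequence[Sequence[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Okapi BM25 score of each tokenized document for a tokenized query."""
    if not docs:
        return []
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Non-negative IDF variant (adds 1 inside the logarithm).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def min_max(xs: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; constant input maps to all zeros."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0] * len(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def hybrid_scores(lexical: Sequence[float], dense: Sequence[float],
                  alpha: float = 0.5) -> List[float]:
    """Weighted sum of normalized lexical and dense scores (assumed fusion)."""
    lex_n, den_n = min_max(lexical), min_max(dense)
    return [alpha * l + (1.0 - alpha) * d for l, d in zip(lex_n, den_n)]
```

In use, `dense` would hold `cosine(query_embedding, doc_embedding)` for each document, with embeddings supplied by whatever neural encoder is available. Setting `alpha` to 1.0 reduces the ranking to pure BM25, and setting it to 0.0 reduces it to pure neural search.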
{"title":"VAIV bio-discovery service using transformer model and retrieval augmented generation.","authors":"Seonho Kim, Juntae Yoon","doi":"10.1186/s12859-024-05903-6","DOIUrl":"10.1186/s12859-024-05903-6","url":null,"abstract":"<p><strong>Background: </strong>There has been a considerable advancement in AI technologies like LLM and machine learning to support biomedical knowledge discovery.</p><p><strong>Main body: </strong>We propose a novel biomedical neural search service called 'VAIV Bio-Discovery', which supports enhanced knowledge discovery and document search on unstructured text such as PubMed. It mainly handles with information related to chemical compound/drugs, gene/proteins, diseases, and their interactions (chemical compounds/drugs-proteins/gene including drugs-targets, drug-drug, and drug-disease). To provide comprehensive knowledge, the system offers four search options: basic search, entity and interaction search, and natural language search. We employ T5slim_dec, which adapts the autoregressive generation task of the T5 (text-to-text transfer transformer) to the interaction extraction task by removing the self-attention layer in the decoder block. It also assists in interpreting research findings by summarizing the retrieved search results for a given natural language query with Retrieval Augmented Generation (RAG). The search engine is built with a hybrid method that combines neural search with the probabilistic search, BM25.</p><p><strong>Conclusion: </strong>As a result, our system can better understand the context, semantics and relationships between terms within the document, enhancing search accuracy. 
This research contributes to the rapidly evolving biomedical field by introducing a new service to access and discover relevant knowledge.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":null,"pages":null},"PeriodicalIF":2.9,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11340140/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142016293","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}