BMC Bioinformatics: Latest Articles

GenRCA: a user-friendly rare codon analysis tool for comprehensive evaluation of codon usage preferences based on coding sequences in genomes.
IF 2.9 | Zone 3 (Biology) | Q2 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-09-27 | DOI: 10.1186/s12859-024-05934-z
Kunjie Fan, Yuanyuan Li, Zhiwei Chen, Long Fan

Background: The study of codon usage bias is important for understanding gene expression, evolution and gene design, providing critical insights into the molecular processes that govern the function and regulation of genes. Codon Usage Bias (CUB) indices are valuable metrics for understanding codon usage patterns across different organisms without extensive experiments. Because no one-size-fits-all index exists for all species, a comprehensive platform supporting the calculation and analysis of multiple CUB indices for codon optimization is greatly needed.

Results: Here, we release GenRCA, an updated version of our previous Rare Codon Analysis Tool, as a free and user-friendly website for all-inclusive evaluation of codon usage preferences of coding sequences. In this study, we manually reviewed and implemented up to 31 codon preference indices, with 65 expression host organisms covered and batch processing of multiple gene sequences supported, aiming to improve the user experience and provide more comprehensive and efficient analysis.

Conclusions: Our website fills a gap in the availability of comprehensive tools for species-specific CUB calculations, enabling researchers to thoroughly assess the protein expression level based on a comprehensive list of 31 indices and further guide the codon optimization.
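
As a rough illustration of what a codon usage index measures, the sketch below computes relative synonymous codon usage (RSCU), one classical CUB index. The abstract does not say which 31 indices GenRCA implements or how, so this is only a minimal example under assumptions: the function name `rscu` is hypothetical, the standard genetic code is assumed, and incomplete trailing codons are ignored.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds):
    """Return RSCU per codon for one coding sequence (hypothetical helper).

    RSCU = observed codon count / expected count if all synonymous codons
    of that amino acid were used equally. Stop codons are skipped.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3          # drop an incomplete last codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    result = {}
    for aa in set(CODON_TABLE.values()) - {"*"}:
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        total = sum(counts[c] for c in family)
        if total == 0:
            continue                          # amino acid absent from this CDS
        for c in family:
            result[c] = counts[c] * len(family) / total
    return result


if __name__ == "__main__":
    values = rscu("CTGCTGCTGTTA")             # four leucine codons
    print(values["CTG"], values["TTA"], values["CTT"])
```

A value above 1 means the codon is used more often than expected under uniform synonymous usage, and a value near 0 marks a rare codon in that sequence.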

Citations: 0
Utilization of a natural language processing-based approach to determine the composition of artifact residues.
IF 2.9 | Zone 3 (Biology) | Q2 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-09-27 | DOI: 10.1186/s12859-024-05888-2
Tung Tho Nguyen, Korey J Brownstein

Background: Determining the composition of artifact residues is a central problem in ancient residue metabolomics. This is done by comparing the mass spectral features that an experimental artifact and an ancient artifact have in common (standard method). While this method is simple and straightforward, we sought to increase the accuracy of predicting which plant species had been used in which artifacts.

Results: Here, we introduce an algorithm (new method) based on ideas from the field of natural language processing (NLP) to solve this problem. We tested our strategy on a set of modern clay pipes. To limit biases, we were not provided information on which plant species had been smoked in which clay pipes. The results indicate that our new method performed 12.5% better than the standard method in predicting the plant species smoked in each artifact.

Conclusions: Utilizing an NLP-based approach, we developed a robust algorithm for characterizing the composition of artifact residues. This work also discusses other general applications in which our algorithm could be used in the field of metabolomics, such as datasets where there are a limited number of replicates.

Citations: 0
DiscovEpi: automated whole proteome MHC-I-epitope prediction and visualization.
IF 2.9 | Zone 3 (Biology) | Q2 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-09-27 | DOI: 10.1186/s12859-024-05931-2
C Mahncke, F Schmiedeke, S Simm, L Kaderali, B M Bröker, U Seifert, C Cammann

Background: Antigen presentation is a central step in initiating and shaping the adaptive immune response. To activate CD8+ T cells, pathogen-derived peptides are presented on the cell surface of antigen-presenting cells bound to major histocompatibility complex (MHC) class I molecules. CD8+ T cells that recognize these complexes with their T cell receptor are activated and ideally eliminate infected cells. Prediction of putative peptides binding to MHC class I (MHC-I) is crucial for understanding pathogen recognition in specific immune responses and for supporting drug and vaccine design. Reliable databases for epitope prediction algorithms are available; however, they primarily focus on the prediction of epitopes in single immunogenic proteins.

Results: We have developed the tool DiscovEpi to establish an interface between whole proteomes and epitope prediction. The tool allows the automated identification of all potential MHC-I-binding peptides within a proteome and calculates the epitope density and average binding score for every protein, a protein-centric approach. DiscovEpi provides a convenient interface between automated multiple sequence extraction by organism and cell compartment from the database UniProt for subsequent epitope prediction via NetMHCpan. Furthermore, it allows ranking of proteins by their predicted immunogenicity on the one hand and comparison of different proteomes on the other. By applying the tool, we predict a higher immunogenic potential of membrane-associated proteins of SARS-CoV-2 compared to those of influenza A based on the presented metrics epitope density and binding score. This could be confirmed visually by comparing the epitope maps of the influenza A strain and SARS-CoV-2.

Conclusion: Automated prediction of whole proteomes and the subsequent visualization of the location of putative epitopes at the sequence level facilitate the search for putative immunogenic proteins or protein regions and support the study of adaptive immune responses and vaccine design.
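
The protein-centric summary described in the Results (an epitope density and an average binding score for every protein, followed by ranking) can be sketched as follows. This is not DiscovEpi's code: the input layout, the function names, the score cutoff, the assumption that a higher score means stronger binding, and the definition of density as predicted binders per residue are all assumptions made for illustration. No UniProt or NetMHCpan call is shown; the predictions are taken as already computed.

```python
from collections import defaultdict


def summarize_proteins(predictions, lengths, score_cutoff=0.5):
    """Aggregate peptide-level predictions into per-protein metrics.

    predictions: iterable of (protein_id, peptide, score) tuples
                 (hypothetical layout; higher score = stronger binder).
    lengths:     dict mapping protein_id to protein length in residues.
    """
    binder_scores = defaultdict(list)
    for protein_id, _peptide, score in predictions:
        if score >= score_cutoff:            # keep predicted binders only
            binder_scores[protein_id].append(score)

    summary = {}
    for protein_id, length in lengths.items():
        scores = binder_scores.get(protein_id, [])
        summary[protein_id] = {
            # assumed definition: predicted binders per residue
            "epitope_density": len(scores) / length,
            "mean_binding_score": sum(scores) / len(scores) if scores else None,
        }
    return summary


def rank_by_density(summary):
    """Return protein ids ordered from highest to lowest epitope density."""
    return sorted(summary, key=lambda p: summary[p]["epitope_density"], reverse=True)


if __name__ == "__main__":
    toy_predictions = [
        ("protA", "PEPTIDEAA", 0.9),
        ("protA", "PEPTIDEBB", 0.7),
        ("protA", "PEPTIDECC", 0.1),         # below cutoff, not counted
        ("protB", "PEPTIDEDD", 0.6),
    ]
    toy_lengths = {"protA": 100, "protB": 200}
    result = summarize_proteins(toy_predictions, toy_lengths)
    print(rank_by_density(result), result)
```

Normalizing by protein length keeps long proteins from ranking first simply because they contain more candidate peptides.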

Citations: 0
SpeciateIT and vSpeciateDB: novel, fast, and accurate per sequence 16S rRNA gene taxonomic classification of vaginal microbiota.
IF 2.9 | Zone 3 (Biology) | Q2 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-09-27 | DOI: 10.1186/s12859-024-05930-3
Johanna B Holm, Pawel Gajer, Jacques Ravel

Background: Clustering of sequences into operational taxonomic units (OTUs) and denoising methods are a mainstream stopgap to taxonomically classifying large numbers of 16S rRNA gene sequences. Environment-specific reference databases generally yield optimal taxonomic assignment.

Results: We developed SpeciateIT, a novel taxonomic classification tool which rapidly and accurately classifies individual amplicon sequences ( https://github.com/Ravel-Laboratory/speciateIT ). We also present vSpeciateDB, a custom reference database for the taxonomic classification of 16S rRNA gene amplicon sequences from vaginal microbiota. We show that SpeciateIT requires minimal computational resources relative to other algorithms and, when combined with vSpeciateDB, affords accurate species level classification in an environment-specific manner.

Conclusions: Herein, two resources with new and practical importance are described. The novel classification algorithm, SpeciateIT, is based on 7th order Markov chain models and allows for fast and accurate per-sequence taxonomic assignments (as little as 10 min for 10^7 sequences). vSpeciateDB, a meticulously tailored reference database, stands as a vital and pragmatic contribution. Its significance lies in the finer species-level resolution that this environment-specific database provides compared with its universal counterparts.
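
To make the idea of classifying with k-th order Markov chain models concrete, here is a minimal sketch: one model per taxon estimates the probability of each base given the preceding k bases, and a query sequence is assigned to the taxon whose model gives it the highest log-likelihood. This is a generic textbook construction, not the SpeciateIT implementation. The class name, the add-one pseudocount, the four-letter alphabet and the omission of the probability of the first k bases are assumptions; the toy example uses order 2 only to keep the training data tiny.

```python
import math
from collections import defaultdict


class MarkovChainModel:
    """k-th order Markov chain over DNA with pseudocount smoothing (sketch)."""

    def __init__(self, order=7, pseudocount=1.0):
        self.order = order
        self.pseudocount = pseudocount
        # counts[context][next_base] = number of times next_base followed context
        self.counts = defaultdict(lambda: defaultdict(float))

    def fit(self, sequences):
        for seq in sequences:
            seq = seq.upper()
            for i in range(self.order, len(seq)):
                self.counts[seq[i - self.order:i]][seq[i]] += 1.0
        return self

    def log_likelihood(self, seq):
        """Sum of log P(base | previous `order` bases); unseen contexts fall
        back to the uniform distribution through the pseudocount."""
        seq = seq.upper()
        total = 0.0
        for i in range(self.order, len(seq)):
            context_counts = self.counts.get(seq[i - self.order:i], {})
            numerator = context_counts.get(seq[i], 0.0) + self.pseudocount
            denominator = sum(context_counts.values()) + 4 * self.pseudocount
            total += math.log(numerator / denominator)
        return total


def classify(seq, models):
    """Return the name of the model that gives seq the highest log-likelihood."""
    return max(models, key=lambda name: models[name].log_likelihood(seq))


if __name__ == "__main__":
    # Hypothetical taxa with toy training sequences.
    models = {
        "taxon_A": MarkovChainModel(order=2).fit(["ACACACACACACAC"]),
        "taxon_B": MarkovChainModel(order=2).fit(["GTGTGTGTGTGTGT"]),
    }
    print(classify("ACACACAC", models), classify("GTGTGTGT", models))
```

Each sequence is scored independently in time linear in its length, which is consistent with the per-sequence assignment the abstract describes, although the actual speed reported there depends on the authors' implementation.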

Citations: 0
Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation.
IF 2.9 | Zone 3 (Biology) | Q2 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-09-20 | DOI: 10.1186/s12859-024-05923-2
Daeseok Lee, Wonjun Hwang, Jeunghyun Byun, Bonggun Shin

Background: Locating small molecule binding sites in target proteins, in the resolution of either pocket or residue, is critical in many drug-discovery scenarios. Since it is not always easy to find such binding sites using conventional methods, different deep learning methods to predict binding sites out of protein structures have been developed in recent years. The existing deep learning based methods have several limitations, including (1) the inefficiency of the CNN-only architecture, (2) loss of information due to excessive post-processing, and (3) the under-utilization of available data sources.

Methods: We present a new model architecture and training method that resolves the aforementioned problems. First, by layering geometric self-attention units on top of residue-level 3D CNN outputs, our model overcomes the problems of CNN-only architectures. Second, by configuring the fundamental units of computation as residues and pockets instead of voxels, our method reduces the information loss from post-processing. Lastly, by employing inter-resolution transfer learning and homology-based augmentation, our method makes fuller use of the available data sources.

Results: The proposed method significantly outperformed all state-of-the-art baselines at both resolutions, pocket and residue. An ablation study demonstrated the indispensability of our proposed architecture, as well as transfer learning and homology-based augmentation, for achieving optimal performance. We further scrutinized our model's performance through a case study involving human serum albumin, which demonstrated our model's superior capability in identifying multiple binding sites of the protein, outperforming the existing methods.

Conclusions: We believe that our contribution to the literature is twofold. Firstly, we introduce a novel computational method for binding site prediction with practical applications, substantiated by its strong performance across diverse benchmarks and case studies. Secondly, the innovative aspects of our method (specifically, the design of the model architecture, inter-resolution transfer learning, and homology-based augmentation) would serve as useful components for future work.

Citations: 0
Leveraging gene correlations in single cell transcriptomic data
IF 3 | Zone 3 (Biology) | Q2 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-09-18 | DOI: 10.1186/s12859-024-05926-z
Kai Silkwood, Emmanuel Dollinger, Joshua Gervin, Scott Atwood, Qing Nie, Arthur D. Lander
Many approaches have been developed to overcome technical noise in single cell RNA-sequencing (scRNAseq). As researchers dig deeper into data—looking for rare cell types, subtleties of cell states, and details of gene regulatory networks—there is a growing need for algorithms with controllable accuracy and fewer ad hoc parameters and thresholds. Impeding this goal is the fact that an appropriate null distribution for scRNAseq cannot simply be extracted from data in which ground truth about biological variation is unknown (i.e., usually). We approach this problem analytically, assuming that scRNAseq data reflect only cell heterogeneity (what we seek to characterize), transcriptional noise (temporal fluctuations randomly distributed across cells), and sampling error (i.e., Poisson noise). We analyze scRNAseq data without normalization—a step that skews distributions, particularly for sparse data—and calculate p values associated with key statistics. We develop an improved method for selecting features for cell clustering and identifying gene–gene correlations, both positive and negative. Using simulated data, we show that this method, which we call BigSur (Basic Informatics and Gene Statistics from Unnormalized Reads), captures even weak yet significant correlation structures in scRNAseq data. Applying BigSur to data from a clonal human melanoma cell line, we identify thousands of correlations that, when clustered without supervision into gene communities, align with known cellular components and biological processes, and highlight potentially novel cell biological relationships. New insights into functionally relevant gene regulatory networks can be obtained using a statistically grounded approach to the identification of gene–gene correlations.
Citations: 0
Taxanorm: a novel taxa-specific normalization approach for microbiome data
IF 3 | Zone 3 (Biology) | Q2 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-09-16 | DOI: 10.1186/s12859-024-05918-z
Ziyue Wang, Dillon Lloyd, Shanshan Zhao, Alison Motsinger-Reif
In high-throughput sequencing studies, sequencing depth, which quantifies the total number of reads, varies across samples. Unequal sequencing depth can obscure true biological signals of interest and prevent direct comparisons between samples. To remove variability due to differential sequencing depth, taxa counts are usually normalized before downstream analysis. However, most existing normalization methods scale counts using size factors that are sample specific but not taxa specific, which can result in over- or under-correction for some taxa. We developed TaxaNorm, a novel normalization method based on a zero-inflated negative binomial model. This method assumes that the effects of sequencing depth on mean and dispersion vary across taxa. Incorporating the zero-inflation part can better capture the nature of microbiome data. We also propose two corresponding diagnosis tests on the varying sequencing depth effect for validation. We find that TaxaNorm achieves comparable performance to existing methods in most simulation scenarios in downstream analysis and achieves higher power in some cases. Specifically, it balances power and false discovery control well. When applied to a real dataset, TaxaNorm shows improved performance in correcting technical bias. TaxaNorm corrects both sample- and taxon-specific bias by introducing an appropriate regression framework in the microbiome data, which aids in data interpretation and visualization. The ‘TaxaNorm’ R package is freely available through the CRAN repository https://CRAN.R-project.org/package=TaxaNorm and the source code can be downloaded at https://github.com/wangziyue57/TaxaNorm .
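
For readers unfamiliar with the zero-inflated negative binomial (ZINB) distribution named above, the sketch below writes out its probability mass function. It is shown in Python for illustration only and is not taken from the TaxaNorm R package. The symbols are assumptions: `mu` is the negative binomial mean, `theta` its size (inverse dispersion) parameter, and `pi` the extra probability of a structural zero. How TaxaNorm links these parameters to sequencing depth for each taxon is not specified in the abstract and is not modeled here.

```python
import math


def nb_pmf(y, mu, theta):
    """Negative binomial pmf with mean mu > 0 and size theta > 0.

    Variance is mu + mu**2 / theta, so a smaller theta means more overdispersion.
    """
    log_p = (
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (theta + mu))
        + y * math.log(mu / (theta + mu))
    )
    return math.exp(log_p)


def zinb_pmf(y, mu, theta, pi):
    """Zero-inflated NB: with probability pi the count is a structural zero,
    otherwise it is drawn from the negative binomial."""
    base = (1.0 - pi) * nb_pmf(y, mu, theta)
    return pi + base if y == 0 else base


if __name__ == "__main__":
    # Zero inflation raises P(count = 0) above the plain NB value.
    print(nb_pmf(0, mu=5.0, theta=2.0), zinb_pmf(0, mu=5.0, theta=2.0, pi=0.3))
```

The extra mass at zero is what lets the model represent taxa that are absent from many samples, a common feature of microbiome count tables.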
Citations: 0
Mining impactful discoveries from the biomedical literature
IF 3 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-09-16 DOI: 10.1186/s12859-024-05881-9
Erwan Moreau, Orla Hardiman, Mark Heverin, Declan O’Sullivan
Literature-based discovery (LBD) aims to help researchers identify relations between concepts that are worthy of further investigation by text-mining the biomedical literature. While the LBD literature is rich and the field is considered mature, standard practice in the evaluation of LBD methods is methodologically poor and has not progressed on par with the domain. The lack of a properly designed, decent-sized benchmark dataset hinders the progress of the field and its development into applications usable by biomedical experts. This work presents a method for mining past discoveries from the biomedical literature. It leverages the impact made by a discovery, using descriptive statistics to detect surges in the prevalence of a relation across time. The validity of the method is tested against a baseline representing the state-of-the-art “time-sliced” method. This method allows the collection of a large number of time-stamped discoveries. These can be used for LBD evaluation, alleviating the long-standing issue of inadequate evaluation. It might also pave the way for more fine-grained LBD methods, which could exploit the diversity of these past discoveries to train supervised models. Finally, the dataset (or some future version of it inspired by our method) could be used as a methodological tool for systematic reviews. We provide an online exploration tool for this purpose, available at https://brainmend.adaptcentre.ie/ .
Citations: 0
scBubbletree: computational approach for visualization of single cell RNA-seq data
IF 3 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-09-13 DOI: 10.1186/s12859-024-05927-y
Simo Kitanovski, Yingying Cao, Dimitris Ttoouli, Farnoush Farahpour, Jun Wang, Daniel Hoffmann
Visualization approaches transform high-dimensional data from single cell RNA sequencing (scRNA-seq) experiments into two-dimensional plots that are used for analysis of cell relationships and as a means of reporting biological insights. Yet many standard approaches generate visuals that suffer from overplotting, lack quantitative information, and distort global and local properties of biological patterns relative to the original high-dimensional space. We present scBubbletree, a new, scalable method for visualization of scRNA-seq data. The method identifies clusters of cells with similar transcriptomes and visualizes such clusters as “bubbles” at the tips of dendrograms (bubble trees), corresponding to quantitative summaries of cluster properties and relationships. scBubbletree stacks bubble trees with further cluster-associated information in a visually accessible way, thus facilitating quantitative assessment and biological interpretation of scRNA-seq data. We demonstrate this with large scRNA-seq data sets, including one with over 1.2 million cells. To facilitate coherent quantification and visualization of scRNA-seq data, we developed the R package scBubbletree, which is freely available as part of the Bioconductor repository at: https://bioconductor.org/packages/scBubbletree/
Citations: 0
Distinguishing word identity and sequence context in DNA language models
IF 3 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-09-13 DOI: 10.1186/s12859-024-05869-5
Melissa Sanabria, Jonas Hirsch, Anna R. Poetsch
Transformer-based large language models (LLMs) are well suited to biological sequence data because of analogies to natural language. Complex relationships can be learned because a concept of "words" can be generated through tokenization. When trained with masked token prediction, the models learn both token sequence identity and larger sequence context. We developed methodology to interrogate model learning, which is relevant both for the interpretability of the model and for evaluating its potential for specific tasks. We used DNABERT, a DNA language model trained on the human genome with overlapping k-mers as tokens. To gain insight into the model's learning, we interrogated how the model performs predictions, extracted token embeddings, and defined a fine-tuning benchmarking task to predict the next tokens of different sizes without overlaps. This task evaluates foundation models without interrogating specific genome biology; it does not depend on tokenization strategies, vocabulary size, the dictionary, or the number of training parameters. Lastly, there is no leakage of information from token identity into the prediction task, which makes it particularly useful for evaluating the learning of sequence context. We discovered that the model with overlapping k-mers struggles to learn larger sequence context. Instead, the learned embeddings largely represent token sequence. Still, good performance is achieved for genome-biology-inspired fine-tuning tasks. Models with overlapping tokens may be used for tasks where a larger sequence context is of less relevance, but the token sequence directly represents the desired learning features. This emphasizes the need to interrogate knowledge representation in biological LLMs.
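The contrast between overlapping and non-overlapping k-mer tokens mentioned above can be made concrete with a minimal sketch. The function names and the example sequence are hypothetical and chosen only for illustration; this is not DNABERT's tokenizer, just the general sliding-window idea.

```python
def overlapping_kmers(seq: str, k: int) -> list[str]:
    """Slide a window of width k one base at a time (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def non_overlapping_kmers(seq: str, k: int) -> list[str]:
    """Cut the sequence into consecutive, disjoint k-mers (stride k).

    Any trailing remainder shorter than k is dropped.
    """
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


# Hypothetical example sequence, for illustration only.
seq = "ATGCGTAC"
print(overlapping_kmers(seq, 3))      # adjacent tokens share k-1 bases
print(non_overlapping_kmers(seq, 3))  # tokens share no bases
```

With stride 1, each token shares k-1 bases with its neighbour, so a masked token can largely be reconstructed from the tokens beside it. This is one plausible reading of why the benchmark described above predicts tokens without overlaps.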
Citations: 0