
arXiv - QuanBio - Genomics: Latest Papers

QuST-LLM: Integrating Large Language Models for Comprehensive Spatial Transcriptomics Analysis
Pub Date : 2024-06-20 DOI: arxiv-2406.14307
Chao Hui Huang
In this paper, we introduce QuST-LLM, an innovative extension of QuPath that utilizes the capabilities of large language models (LLMs) to analyze and interpret spatial transcriptomics (ST) data. This tool effectively simplifies the intricate and high-dimensional nature of ST data by offering a comprehensive workflow that includes data loading, region selection, gene expression analysis, and functional annotation. QuST-LLM employs LLMs to transform complex ST data into understandable and detailed biological narratives based on gene ontology annotations, thereby significantly improving the interpretability of ST data. Consequently, users can interact with their own ST data using natural language. Hence, QuST-LLM provides researchers with a potent functionality to unravel the spatial and functional complexities of tissues, fostering novel insights and advancements in biomedical research.
Citations: 0
A mapping-free NLP-based technique for sequence search in Nanopore long-reads
Pub Date : 2024-06-20 DOI: arxiv-2406.14187
Tomasz Strzoda, Lourdes Cruz-Garcia, Mustafa Najim, Christophe Badie, Joanna Polanska
In unforeseen situations, such as nuclear power plant's or civilian radiation accidents, there is a need for effective and computationally inexpensive methods to determine the expression level of a selected gene panel, allowing for rough dose estimates in thousands of donors. The new generation in-situ mapper, fast and of low energy consumption, working at the level of single nanopore output, is in demand. We aim to create a sequence identification tool that utilizes Natural Language Processing (NLP) techniques and ensures a high level of negative predictive value (NPV) compared to the classical approach. The training dataset consisted of RNASeq data from 6 samples. Having tested multiple NLP models, the best configuration analyses the entire sequence and uses a word length of 3 base pairs with one-word neighbor on each side. For the considered FDXR gene, the achieved mean balanced accuracy (BACC) was 98.29% and NPV 99.25%, compared to minimap2's performance in a cross-validation scenario. Reducing the dictionary from 1024 to 145 changed BACC to 96.49% and the NPV to 98.15%. Obtained NLP model, validated on an external independent genome sequencing dataset, gave NPV of 99.64% for complete and 95.87% for reduced dictionary. The salmon-estimated read counts differed from the classical approach on average by 3.48% for the complete dictionary and by 5.82% for the reduced one. We conclude that for long Oxford Nanopore reads, an NLP-based approach can successfully replace classical mapping in case of emergency. The developed NLP model can be easily retrained to identify selected transcripts and/or work with various long-read sequencing techniques. Our results of the study clearly demonstrate the potential of applying techniques known from classical text processing to nucleotide sequences and represent a significant advancement in this field of science.
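The abstract above reports balanced accuracy (BACC) and negative predictive value (NPV) for a read classifier judged against minimap2. As a minimal sketch of how these two metrics follow from a confusion matrix, the Python below uses their standard definitions. The function name and the counts are hypothetical and are not taken from the paper.

```python
def balanced_accuracy_and_npv(tp, fp, tn, fn):
    """Return (BACC, NPV) from confusion-matrix counts.

    tp/fn: reads the reference mapper assigns to the target gene that the
           classifier accepts/rejects.
    tn/fp: reads the reference mapper does not assign to the target gene
           that the classifier rejects/accepts.
    """
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    bacc = (sensitivity + specificity) / 2
    npv = tn / (tn + fn)                  # share of rejected reads that are true negatives
    return bacc, npv


# Hypothetical counts, for illustration only (not results from the paper).
bacc, npv = balanced_accuracy_and_npv(tp=90, fp=5, tn=95, fn=10)
print(f"BACC = {bacc:.4f}, NPV = {npv:.4f}")
```

A high NPV matters in the screening setting described because a read rejected by the classifier is discarded without further mapping, so false negatives translate directly into lost expression signal.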
Citations: 0
RNA-FrameFlow: Flow Matching for de novo 3D RNA Backbone Design
Pub Date : 2024-06-19 DOI: arxiv-2406.13839
Rishabh Anand, Chaitanya K. Joshi, Alex Morehead, Arian R. Jamasb, Charles Harris, Simon V. Mathis, Kieran Didi, Bryan Hooi, Pietro Liò
We introduce RNA-FrameFlow, the first generative model for 3D RNA backbone design. We build upon SE(3) flow matching for protein backbone generation and establish protocols for data preparation and evaluation to address unique challenges posed by RNA modeling. We formulate RNA structures as a set of rigid-body frames and associated loss functions which account for larger, more conformationally flexible RNA backbones (13 atoms per nucleotide) vs. proteins (4 atoms per residue). Toward tackling the lack of diversity in 3D RNA datasets, we explore training with structural clustering and cropping augmentations. Additionally, we define a suite of evaluation metrics to measure whether the generated RNA structures are globally self-consistent (via inverse folding followed by forward folding) and locally recover RNA-specific structural descriptors. The most performant version of RNA-FrameFlow generates locally realistic RNA backbones of 40-150 nucleotides, over 40% of which pass our validity criteria as measured by a self-consistency TM-score >= 0.45, at which two RNAs have the same global fold. Open-source code: https://github.com/rish-16/rna-backbone-design
Citations: 0
PathoLM: Identifying pathogenicity from the DNA sequence through the Genome Foundation Model
Pub Date : 2024-06-19 DOI: arxiv-2406.13133
Sajib Acharjee Dip, Uddip Acharjee Shuvo, Tran Chau, Haoqiu Song, Petra Choi, Xuan Wang, Liqing Zhang
Pathogen identification is pivotal in diagnosing, treating, and preventing diseases, crucial for controlling infections and safeguarding public health. Traditional alignment-based methods, though widely used, are computationally intense and reliant on extensive reference databases, often failing to detect novel pathogens due to their low sensitivity and specificity. Similarly, conventional machine learning techniques, while promising, require large annotated datasets and extensive feature engineering and are prone to overfitting. Addressing these challenges, we introduce PathoLM, a cutting-edge pathogen language model optimized for the identification of pathogenicity in bacterial and viral sequences. Leveraging the strengths of pre-trained DNA models such as the Nucleotide Transformer, PathoLM requires minimal data for fine-tuning, thereby enhancing pathogen detection capabilities. It effectively captures a broader genomic context, significantly improving the identification of novel and divergent pathogens. We developed a comprehensive data set comprising approximately 30 species of viruses and bacteria, including ESKAPEE pathogens, seven notably virulent bacterial strains resistant to antibiotics. Additionally, we curated a species classification dataset centered specifically on the ESKAPEE group. In comparative assessments, PathoLM dramatically outperforms existing models like DciPatho, demonstrating robust zero-shot and few-shot capabilities. Furthermore, we expanded PathoLM-Sp for ESKAPEE species classification, where it showed superior performance compared to other advanced deep learning methods, despite the complexities of the task.
Citations: 0
skandiver: a divergence-based analysis tool for identifying intercellular mobile genetic elements
Pub Date : 2024-06-17 DOI: arxiv-2406.12064
Xiaolei Brian Zhang, Grace Oualline, Jim Shaw, Yun William Yu
Mobile genetic elements (MGEs) are as ubiquitous in nature as they are varied in type, ranging from viral insertions to transposons to incorporated plasmids. Horizontal transfer of MGEs across bacterial species may also pose a significant threat to global health due to their capability to harbour antibiotic resistance genes. However, despite cheap and rapid whole genome sequencing, the varied nature of MGEs makes it difficult to fully characterize them, and existing methods for detecting MGEs often don't agree on what should count. In this manuscript, we first define and argue in favor of a divergence-based characterization of mobile-genetic elements. Using that paradigm, we present skandiver, a tool designed to efficiently detect MGEs from whole genome assemblies without the need for gene annotation or markers. skandiver determines mobile elements via genome fragmentation, average nucleotide identity (ANI), and divergence time. By building on the scalable skani software for ANI computation, skandiver can query hundreds of complete assemblies against >65,000 representative genomes in a few minutes and 19 GB memory, providing scalable and efficient method for elucidating mobile element profiles in incomplete, uncharacterized genomic sequences. For isolated and integrated large plasmids (>10kbp), skandiver's recall was 48% and 47%, MobileElementFinder was 59% and 17%, and geNomad was 86% and 32%, respectively. For isolated large plasmids, skandiver's recall (48%) is lower than state-of-the-art reference-based methods geNomad (86%) and MobileElementFinder (59%). However, skandiver achieves higher recall on integrated plasmids and, unlike other methods, without comparing against a curated database, making skandiver suitable for discovery of novel MGEs. Availability: https://github.com/YoukaiFromAccounting/skandiver
Citations: 0
pVACview: an interactive visualization tool for efficient neoantigen prioritization and selection
Pub Date : 2024-06-11 DOI: arxiv-2406.06985
Huiming Xia, My Hoang, Evelyn Schmidt, Susanna Kiwala, Joshua McMichael, Zachary L. Skidmore, Bryan Fisk, Jonathan J. Song, Jasreet Hundal, Thomas Mooney, Jason R. Walker, S. Peter Goedegebuure, Christopher A. Miller, William E. Gillanders, Obi L. Griffith, Malachi Griffith
Neoantigen targeting therapies including personalized vaccines have shown promise in the treatment of cancers. Accurate identification/prioritization of neoantigens is highly relevant to designing clinical trials, predicting treatment response, and understanding mechanisms of resistance. With the advent of massively parallel sequencing technologies, it is now possible to predict neoantigens based on patient-specific variant information. However, numerous factors must be considered when prioritizing neoantigens for use in personalized therapies. Complexities such as alternative transcript annotations, various binding, presentation and immunogenicity prediction algorithms, and variable peptide lengths/registers all potentially impact the neoantigen selection process. While computational tools generate numerous algorithmic predictions for neoantigen characterization, results from these pipelines are difficult to navigate and require extensive knowledge of the underlying tools for accurate interpretation. Due to the intricate nature and number of salient neoantigen features, presenting all relevant information to facilitate candidate selection for downstream applications is a difficult challenge that current tools fail to address. We have created pVACview, the first interactive tool designed to aid in the prioritization and selection of neoantigen candidates for personalized neoantigen therapies. pVACview has a user-friendly and intuitive interface where users can upload, explore, select and export their neoantigen candidates. The tool allows users to visualize candidates using variant, transcript and peptide information. pVACview will allow researchers to analyze and prioritize neoantigen candidates with greater efficiency and accuracy in basic and translational settings. The application is available as part of the pVACtools pipeline at pvactools.org and as an online server at pvacview.org.
Citations: 0
Enhanced Gene Selection in Single-Cell Genomics: Pre-Filtering Synergy and Reinforced Optimization
Pub Date : 2024-06-11 DOI: arxiv-2406.07418
Weiliang Zhang, Zhen Meng, Dongjie Wang, Min Wu, Kunpeng Liu, Yuanchun Zhou, Meng Xiao
Recent advancements in single-cell genomics necessitate precision in gene panel selection to interpret complex biological data effectively. Those methods aim to streamline the analysis of scRNA-seq data by focusing on the most informative genes that contribute significantly to the specific analysis task. Traditional selection methods, which often rely on expert domain knowledge, embedded machine learning models, or heuristic-based iterative optimization, are prone to biases and inefficiencies that may obscure critical genomic signals. Recognizing the limitations of traditional methods, we aim to transcend these constraints with a refined strategy. In this study, we introduce an iterative gene panel selection strategy that is applicable to clustering tasks in single-cell genomics. Our method uniquely integrates results from other gene selection algorithms, providing valuable preliminary boundaries or prior knowledge as initial guides in the search space to enhance the efficiency of our framework. Furthermore, we incorporate the stochastic nature of the exploration process in reinforcement learning (RL) and its capability for continuous optimization through reward-based feedback. This combination mitigates the biases inherent in the initial boundaries and harnesses RL's adaptability to refine and target gene panel selection dynamically. To illustrate the effectiveness of our method, we conducted detailed comparative experiments, case studies, and visualization analysis.
引用次数: 0
Data mining method of single-cell omics data to evaluate a pure tissue environmental effect on gene expression level
Pub Date : 2024-06-11 DOI: arxiv-2406.06969
Daigo Okada, Jianshen Zhu, Kan Shota, Yuuki Nishimura, Kazuya Haraguchi
While single-cell RNA-seq enables investigation of the cell-type effect on the transcriptome, the pure tissue environmental effect has not been well investigated. The biased combination of tissues and cell types in the body has made it difficult to evaluate the effect of the pure tissue environment by omics data mining. When evaluating the effects of discrete variables such as cell type, tissue, and other categorical variables, it is important to prevent statistical confounding among them. We propose a novel method that enumerates suitable analysis units of variables for estimating the effects of tissue environment by extending the maximal biclique enumeration problem for bipartite graphs to $k$-partite hypergraphs. We applied the proposed method to Tabula Muris Senis, a large mouse single-cell transcriptome dataset, to evaluate pure tissue environmental effects on gene expression. Data mining with the proposed method revealed pure tissue environment effects on gene expression and their age-related changes among adipose sub-tissues. The method proposed in this study helps evaluate the effects of discrete variables in exploratory data mining of large-scale genomics datasets.
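As a rough illustration of the maximal biclique enumeration problem that the abstract builds on, the following Python sketch handles only the ordinary bipartite case (tissue on one side, cell type on the other). It is not the authors' method, which extends the problem to $k$-partite hypergraphs. The function name `maximal_bicliques` and the toy tissue/cell-type edges are assumptions made up for this example. A maximal biclique here corresponds to a set of tissues that all contain the same set of cell types, so comparisons within it are not confounded by missing tissue/cell-type combinations.

```python
from itertools import combinations


def maximal_bicliques(edges):
    """Enumerate maximal bicliques of a bipartite graph by brute force.

    `edges` is an iterable of (left, right) pairs. Returns a set of
    (frozenset_of_left, frozenset_of_right) pairs with both sides non-empty.
    Runtime is exponential in the number of left vertices, so this is only
    suitable for small illustrative inputs.
    """
    neighbors = {}
    for left, right in edges:
        neighbors.setdefault(left, set()).add(right)
    lefts = sorted(neighbors)

    found = set()
    for size in range(1, len(lefts) + 1):
        for subset in combinations(lefts, size):
            # Right vertices adjacent to every left vertex in the subset.
            common = set.intersection(*(neighbors[v] for v in subset))
            if not common:
                continue
            # Close the left side: all left vertices adjacent to all of `common`.
            closed = frozenset(v for v in lefts if common <= neighbors[v])
            found.add((closed, frozenset(common)))
    return found


# Hypothetical toy data: which cell types were observed in which tissue.
edges = [
    ("fat", "T cell"), ("fat", "B cell"),
    ("lung", "T cell"), ("lung", "B cell"), ("lung", "macrophage"),
    ("liver", "macrophage"),
]

for tissues, cell_types in sorted(maximal_bicliques(edges), key=lambda p: sorted(p[0])):
    print(sorted(tissues), sorted(cell_types))
```

Each printed pair is a block of tissues and cell types in which every combination is observed, so it cannot be enlarged on either side without losing that property.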
STimage-1K4M: A histopathology image-gene expression dataset for spatial transcriptomics
Pub Date : 2024-06-10 DOI: arxiv-2406.06393
Jiawen Chen, Muqing Zhou, Wenrong Wu, Jinwei Zhang, Yun Li, Didong Li
Recent advances in multi-modal algorithms have driven and been driven by the increasing availability of large image-text datasets, leading to significant strides in various fields, including computational pathology. However, in most existing medical image-text datasets, the text typically provides high-level summaries that may not sufficiently describe sub-tile regions within a large pathology image. For example, an image might cover an extensive tissue area containing cancerous and healthy regions, but the accompanying text might only specify that this image is a cancer slide, lacking the nuanced details needed for in-depth analysis. In this study, we introduce STimage-1K4M, a novel dataset designed to bridge this gap by providing genomic features for sub-tile images. STimage-1K4M contains 1,149 images derived from spatial transcriptomics data, which captures gene expression information at the level of individual spatial spots within a pathology image. Specifically, each image in the dataset is broken down into smaller sub-image tiles, with each tile paired with 15,000-30,000 dimensional gene expressions. With 4,293,195 pairs of sub-tile images and gene expressions, STimage-1K4M offers unprecedented granularity, paving the way for a wide range of advanced research in multi-modal data analysis and innovative applications in computational pathology and beyond.
GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models
Pub Date : 2024-06-01 DOI: arxiv-2406.01627
Zicheng Liu, Jiahui Li, Siyuan Li, Zelin Zang, Cheng Tan, Yufei Huang, Yajing Bai, Stan Z. Li
The Genomic Foundation Model (GFM) paradigm is expected to facilitate the extraction of generalizable representations from massive genomic data, thereby enabling their use across a spectrum of downstream applications. Despite advancements, the lack of an evaluation framework makes it difficult to ensure equitable assessment, owing to differences in experimental settings, model intricacy, benchmark datasets, and reproducibility challenges. In the absence of standardization, comparative analyses risk becoming biased and unreliable. To surmount this impasse, we introduce GenBench, a comprehensive benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models. GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies. We systematically evaluate datasets spanning diverse biological domains, with a particular emphasis on both short-range and long-range genomic tasks, first including the three most important DNA tasks covering Coding Region, Non-Coding Region, Genome Structure, etc. Moreover, we provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance. Our findings reveal an interesting observation: independent of the number of parameters, the discernible difference in preference between the attention-based and convolution-based models on short- and long-range tasks may provide insights into the future design of GFM.