A Benchmark Dataset for Multimodal Prediction of Enzymatic Function Coupling DNA Sequences and Natural Language

arXiv - QuanBio - Genomics | Pub Date: 2024-07-21 | DOI: arxiv-2407.15888

Yuchen Zhang, Ratish Kumar Chandrakant Jha, Soumya Bharadwaj, Vatsal Sanjaykumar Thakkar, Adrienne Hoarfrost, Jin Sun


Abstract

Predicting a gene's function from its DNA sequence is a fundamental challenge in biology. Many deep learning models have been proposed to embed DNA sequences and predict their enzymatic function, leveraging public databases that link DNA sequences to enzymatic function labels. However, much of the scientific community's knowledge of biological function is not represented in these categorical labels; it is instead captured in unstructured text descriptions of mechanisms, reactions, and enzyme behavior. These descriptions are often stored alongside DNA sequences in biological databases, albeit in an unstructured manner. Deep learning models that predict enzymatic function are likely to benefit from incorporating this multi-modal data, which encodes scientific knowledge of biological function. There is, however, no dataset designed for machine learning algorithms to leverage this multi-modal information. Here we propose a novel dataset and benchmark suite that enables the exploration and development of large multi-modal neural network models on gene DNA sequences and natural language descriptions of gene function. We present baseline performance on benchmarks for both unsupervised and supervised tasks; the results demonstrate the difficulty of this modeling objective and the potential benefit of incorporating multi-modal data types in function prediction compared to DNA sequences alone. Our dataset is at: https://hoarfrost-lab.github.io/BioTalk/.
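To make the pairing of modalities concrete, the following is a minimal Python sketch of what one multi-modal example might look like and how a DNA-only input differs from a DNA-plus-text input. It is not the dataset's actual schema or the authors' pipeline: the `GeneRecord` class, its field names, the `kmer_tokens` and `model_inputs` helpers, the overlapping k-mer tokenization, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneRecord:
    """One hypothetical multi-modal example; field names are illustrative only."""
    dna: str          # gene DNA sequence
    description: str  # unstructured text describing the gene's function
    label: str        # categorical enzymatic function label


def kmer_tokens(dna: str, k: int = 3) -> list[str]:
    """Split a DNA sequence into overlapping k-mers (an assumed tokenization)."""
    dna = dna.upper()
    return [dna[i:i + k] for i in range(len(dna) - k + 1)]


def model_inputs(record: GeneRecord, use_text: bool) -> dict[str, list[str]]:
    """Build inputs for a DNA-only model or for a DNA-plus-text model."""
    inputs = {"dna_tokens": kmer_tokens(record.dna)}
    if use_text:
        # Whitespace splitting stands in for a real text tokenizer.
        inputs["text_tokens"] = record.description.lower().split()
    return inputs


# Made-up sample values, not taken from the dataset.
example = GeneRecord(
    dna="ATGGCC",
    description="Catalyzes hydrolysis of ester bonds",
    label="EC 3.1",
)
dna_only = model_inputs(example, use_text=False)
multimodal = model_inputs(example, use_text=True)
```

In this sketch the categorical label is the prediction target in both settings; the only difference between the two inputs is whether the free-text description is available to the model, which is the comparison the benchmarks are described as making.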


