Data-Efficient Molecular Generation with Hierarchical Textual Inversion

arXiv - QuanBio - Molecular Networks | Pub Date: 2024-05-05 | DOI: arxiv-2405.02845

Seojin Kim, Jaehyun Nam, Sihyun Yu, Younghoon Shin, Jinwoo Shin


Citations: 0

Abstract

Developing an effective molecular generation framework that works even with a limited number of molecules is often important for practical deployment, e.g., in drug discovery, since acquiring task-related molecular data requires expensive and time-consuming experiments. To tackle this issue, we introduce Hierarchical textual Inversion for Molecular generation (HI-Mol), a novel data-efficient molecular generation method. HI-Mol is inspired by the importance of hierarchical information, e.g., both coarse- and fine-grained features, in understanding the molecule distribution. We propose multi-level embeddings that reflect such hierarchical features, adapting the recent textual inversion technique from the visual domain, where it achieves data-efficient image generation. Whereas conventional textual inversion in the image domain uses a single-level token embedding, our multi-level token embeddings allow the model to effectively learn the underlying low-shot molecule distribution. We then generate molecules by interpolating the multi-level token embeddings. Extensive experiments demonstrate the superiority of HI-Mol with notable data efficiency. For instance, on QM9, HI-Mol outperforms the prior state-of-the-art method with 50x less training data. We also show the effectiveness of molecules generated by HI-Mol in low-shot molecular property prediction.
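The abstract says molecules are generated from an interpolation of multi-level token embeddings but does not give the exact scheme. The snippet below is only a minimal sketch of that general idea, assuming each training molecule has one learned embedding per level and that two molecules are mixed linearly with a single coefficient. The level names (`coarse`, `fine`), the function names, and the toy 2-D vectors are all hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: level names, helper names, and the linear mixing
# rule are assumptions, not the HI-Mol implementation.

def lerp(a, b, lam):
    """Linearly interpolate between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return [(1.0 - lam) * x + lam * y for x, y in zip(a, b)]


def interpolate_levels(tokens_i, tokens_j, lam):
    """Mix two molecules' token embeddings level by level.

    Each argument maps a (hypothetical) level name to that molecule's
    embedding vector at that level. lam = 0 returns tokens_i and
    lam = 1 returns tokens_j.
    """
    if tokens_i.keys() != tokens_j.keys():
        raise ValueError("both molecules need the same set of levels")
    return {
        level: lerp(tokens_i[level], tokens_j[level], lam)
        for level in tokens_i
    }


# Toy 2-D embeddings: the two molecules share the coarse-level vector
# and differ only at the fine level.
mol_a = {"coarse": [1.0, 0.0], "fine": [0.0, 2.0]}
mol_b = {"coarse": [1.0, 0.0], "fine": [4.0, 0.0]}

mixed = interpolate_levels(mol_a, mol_b, 0.25)
print(mixed)
```

In a full pipeline the mixed embeddings would condition a pretrained text-to-molecule model; that step is omitted here because the abstract does not describe it.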


