Data-Efficient Molecular Generation with Hierarchical Textual Inversion

Seojin Kim, Jaehyun Nam, Sihyun Yu, Younghoon Shin, Jinwoo Shin
arXiv: 2405.02845
Source: arXiv - QuanBio - Molecular Networks
Published: 2024-05-05

Abstract

Developing a molecular generation framework that remains effective with only a limited number of molecules is often important for practical deployment, e.g., in drug discovery, since acquiring task-related molecular data requires expensive and time-consuming experiments. To tackle this issue, we introduce Hierarchical textual Inversion for Molecular generation (HI-Mol), a novel data-efficient molecular generation method. HI-Mol is inspired by the importance of hierarchical information, e.g., both coarse- and fine-grained features, in understanding the molecule distribution. We propose multi-level embeddings that reflect such hierarchical features, building on textual inversion, a recent technique from the visual domain that achieves data-efficient image generation. Whereas conventional textual inversion in the image domain uses a single-level token embedding, our multi-level token embeddings allow the model to effectively learn the underlying low-shot molecule distribution. We then generate molecules by interpolating the multi-level token embeddings. Extensive experiments demonstrate the superiority of HI-Mol with notable data efficiency. For instance, on QM9, HI-Mol outperforms the prior state-of-the-art method with 50x less training data. We also show the effectiveness of molecules generated by HI-Mol in low-shot molecular property prediction.
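
The abstract states that molecules are generated from an interpolation of multi-level token embeddings but does not give the formula. The following is a minimal sketch of one way such an interpolation could look, assuming plain linear interpolation applied independently at each level. The level names ("coarse", "fine"), the function names, and the uniform sampling of the mixing weight are all hypothetical and are not taken from the paper; the step that decodes an interpolated embedding into a molecule is omitted.

```python
import random
from typing import Dict, List

Vector = List[float]
# Hypothetical layout: one learned embedding vector per hierarchy level.
MultiLevelEmbedding = Dict[str, Vector]


def lerp(a: Vector, b: Vector, lam: float) -> Vector:
    """Element-wise linear interpolation: lam * a + (1 - lam) * b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def interpolate_levels(
    e1: MultiLevelEmbedding, e2: MultiLevelEmbedding, lam: float
) -> MultiLevelEmbedding:
    """Interpolate two multi-level embeddings level by level (assumed scheme)."""
    if e1.keys() != e2.keys():
        raise ValueError("both embeddings must define the same levels")
    return {level: lerp(e1[level], e2[level], lam) for level in e1}


def sample_interpolated(
    bank: List[MultiLevelEmbedding], rng: random.Random
) -> MultiLevelEmbedding:
    """Pick two distinct training embeddings and mix them with a random weight."""
    e1, e2 = rng.sample(bank, 2)  # two distinct entries from the bank
    lam = rng.random()            # assumed: mixing weight uniform in [0, 1)
    return interpolate_levels(e1, e2, lam)


if __name__ == "__main__":
    # Toy 2-dimensional embeddings standing in for learned token embeddings.
    bank = [
        {"coarse": [0.0, 2.0], "fine": [1.0, 1.0]},
        {"coarse": [4.0, 0.0], "fine": [3.0, 5.0]},
        {"coarse": [1.0, 1.0], "fine": [0.0, 2.0]},
    ]
    print(sample_interpolated(bank, random.Random(0)))
```

In this sketch each sampled embedding lies on the line segment between two training embeddings at every level, which is one simple way to produce new conditioning vectors from a small set of molecules.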