Seojin Kim, Jaehyun Nam, Sihyun Yu, Younghoon Shin, Jinwoo Shin
Data-Efficient Molecular Generation with Hierarchical Textual Inversion
arXiv - QuanBio - Molecular Networks, published 2024-05-05
https://doi.org/arxiv-2405.02845
Abstract
A molecular generation framework that remains effective with only a limited
number of molecules is often important for practical deployment, e.g., in drug
discovery, since acquiring task-related molecular data requires expensive and
time-consuming experiments. To tackle this issue, we introduce
Hierarchical textual Inversion for Molecular generation (HI-Mol), a novel
data-efficient molecular generation method. HI-Mol is inspired by the
importance of hierarchical information, e.g., both coarse- and fine-grained
features, in understanding the molecule distribution. We propose
multi-level embeddings that reflect such hierarchical features, adapting
the recent textual inversion technique from the visual domain, which
achieves data-efficient image generation. Compared to the conventional textual
inversion method in the image domain using a single-level token embedding, our
multi-level token embeddings allow the model to effectively learn the
underlying low-shot molecule distribution. We then generate molecules based on
the interpolation of the multi-level token embeddings. Extensive experiments
demonstrate the superiority of HI-Mol with notable data efficiency. For
instance, on QM9, HI-Mol outperforms the prior state-of-the-art method with 50x
less training data. We also show the effectiveness of molecules generated by
HI-Mol in low-shot molecular property prediction.
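
The generation step described above, interpolating multi-level token embeddings, can be illustrated with a minimal sketch. The abstract does not specify the interpolation scheme, so everything here is an assumption made for illustration: each molecule is represented by a list of embedding vectors ordered from coarse to fine, two molecules are mixed by plain linear interpolation with a single coefficient `lam`, and the names `interpolate_tokens` and `sample_interpolated` are hypothetical. Decoding the interpolated embeddings into a molecule is out of scope.

```python
import random
from typing import List, Optional

Vector = List[float]
# Assumption: one embedding vector per hierarchy level, ordered coarse to fine.
MultiLevel = List[Vector]


def lerp(a: Vector, b: Vector, lam: float) -> Vector:
    """Element-wise linear interpolation: lam * a + (1 - lam) * b."""
    if len(a) != len(b):
        raise ValueError("embedding dimensions must match")
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def interpolate_tokens(mol_i: MultiLevel, mol_j: MultiLevel, lam: float) -> MultiLevel:
    """Interpolate two molecules' token embeddings level by level."""
    if len(mol_i) != len(mol_j):
        raise ValueError("both molecules need the same number of levels")
    return [lerp(u, v, lam) for u, v in zip(mol_i, mol_j)]


def sample_interpolated(bank: List[MultiLevel],
                        rng: Optional[random.Random] = None) -> MultiLevel:
    """Pick two distinct learned molecules and mix them with a random lam in [0, 1)."""
    rng = rng or random.Random()
    mol_i, mol_j = rng.sample(bank, 2)
    lam = rng.random()
    return interpolate_tokens(mol_i, mol_j, lam)


if __name__ == "__main__":
    # Toy bank: two molecules, two levels each, 2-dimensional embeddings.
    mol_a = [[0.0, 0.0], [1.0, 1.0]]
    mol_b = [[2.0, 4.0], [3.0, 5.0]]
    print(interpolate_tokens(mol_a, mol_b, 0.5))
```

In a real pipeline the interpolated embeddings would condition a pretrained generator in place of the learned tokens. Using one shared `lam` for all levels is only the simplest choice; a per-level coefficient would be an equally plausible variant.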