Utilizing Low-Dimensional Molecular Embeddings for Rapid Chemical Similarity Search

Kathryn E Kirchoff, James Wellnitz, Joshua E Hochuli, Travis Maxfield, Konstantin I Popov, Shawn Gomez, Alexander Tropsha
DOI: 10.1007/978-3-031-56060-6_3 (https://doi.org/10.1007/978-3-031-56060-6_3)
Venue: Advances in information retrieval : ... European Conference on IR Research, ECIR ... proceedings. European Conference on IR Research
Publication date: 2024-03-01 (Epub 2024-03-16)
Publication type: Journal Article
Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10998712/pdf/
Indexed via: Semanticscholar
Citation count: 0

Abstract

Nearest neighbor-based similarity searching is a common task in chemistry, with notable use cases in drug discovery. Yet some of the most commonly used methods for this task still rely on brute-force search. In practice this can be computationally costly and overly time-consuming, due in part to the sheer size of modern chemical databases. Previous computational advances for this task have generally relied on hardware improvements or dataset-specific tricks that lack generalizability. Approaches that leverage lower-complexity search algorithms remain relatively underexplored. However, many of these algorithms are approximate solutions and/or struggle with typical high-dimensional chemical embeddings. Here we evaluate whether a combination of low-dimensional chemical embeddings and a k-d tree data structure can achieve fast nearest neighbor queries while maintaining performance on standard chemical similarity search benchmarks. We examine different dimensionality reductions of standard chemical embeddings as well as a learned, structurally aware embedding, SmallSA, for this task. With this framework, searches on over one billion chemicals execute in less than a second on a single CPU core, five orders of magnitude faster than the brute-force approach. We also demonstrate that SmallSA achieves competitive performance on chemical similarity benchmarks.
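
The core idea of the abstract, an exact nearest neighbor query over low-dimensional vectors using a k-d tree, can be illustrated with a minimal sketch. This is not the authors' implementation: the random vectors below merely stand in for low-dimensional molecular embeddings, the dimension of 8 is an arbitrary assumption, and all names (`Node`, `build_kdtree`, `nearest`, `brute_force`) are hypothetical. The sketch uses only the Python standard library and checks the tree result against a brute-force scan.

```python
import random
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


class Node:
    """One k-d tree node: a point, its splitting axis, and two subtrees."""

    def __init__(self, point: Point, axis: int,
                 left: "Optional[Node]", right: "Optional[Node]") -> None:
        self.point = point
        self.axis = axis
        self.left = left
        self.right = right


def build_kdtree(points: List[Point], depth: int = 0) -> Optional[Node]:
    """Recursively split on the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(ordered[mid], axis,
                build_kdtree(ordered[:mid], depth + 1),
                build_kdtree(ordered[mid + 1:], depth + 1))


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node: Optional[Node], query: Point,
            best: Optional[Tuple[float, Point]] = None
            ) -> Optional[Tuple[float, Point]]:
    """Exact nearest neighbor; returns (squared distance, point)."""
    if node is None:
        return best
    d = sq_dist(node.point, query)
    if best is None or d < best[0]:
        best = (d, node.point)
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best hit.
    if diff * diff < best[0]:
        best = nearest(far, query, best)
    return best


def brute_force(points: List[Point], query: Point) -> Tuple[float, Point]:
    """Reference answer: scan every point."""
    return min((sq_dist(p, query), p) for p in points)


# Stand-in data: random vectors in place of real molecular embeddings.
random.seed(0)
DIM = 8  # assumed low embedding dimension, chosen only for illustration
library = [tuple(random.random() for _ in range(DIM)) for _ in range(2000)]
tree = build_kdtree(library)
query = tuple(random.random() for _ in range(DIM))

tree_dist, tree_hit = nearest(tree, query)
print("k-d tree and brute force agree:",
      tree_dist == brute_force(library, query)[0])
```

The pruning test in `nearest` is what allows most of the tree to be skipped. It is known to become less effective as the dimension grows, which is consistent with the abstract's emphasis on low-dimensional embeddings.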
