Active Learning of Molecular Data for Task-Specific Objectives

Kunal Ghosh, Milica Todorović, Aki Vehtari, Patrick Rinke
arXiv - PHYS - Data Analysis, Statistics and Probability, published 2024-08-20. DOI: arxiv-2408.11191 (https://doi.org/arxiv-2408.11191)
Citations: 0

Abstract

Active learning (AL) has shown promise as a particularly data-efficient machine learning approach. Yet its performance depends on the application, and it is not clear when AL practitioners can expect computational savings. Here, we carry out a systematic AL performance assessment for three diverse molecular datasets and two common scientific tasks: compiling compact, informative datasets and targeted molecular searches. We implemented AL with Gaussian processes (GP) and used the many-body tensor as the molecular representation. For the first task, we tested different data acquisition strategies, batch sizes and GP noise settings. AL was insensitive to the acquisition batch size, and we observed the best AL performance for the acquisition strategy that combines uncertainty reduction with clustering to promote diversity. However, for optimal GP noise settings, AL did not outperform randomized selection of data points. Conversely, for targeted searches, AL outperformed random sampling and achieved data savings of up to 64%. Our analysis provides insight into this task-specific performance difference in terms of target distributions and data collection strategies. We established that the performance of AL depends on the relative distribution of the target molecules in comparison to the total dataset distribution, with the largest computational savings achieved when their overlap is minimal.
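The acquisition strategy the abstract singles out (uncertainty reduction combined with clustering to promote diversity) can be illustrated with a minimal sketch. This is not the authors' implementation: the use of scikit-learn, an RBF kernel, k-means as the clustering step, the `alpha` argument as the GP noise setting, and the function names `select_batch` and `active_learning` are all assumptions made for illustration, and generic feature vectors stand in for the many-body tensor representation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF


def select_batch(gp, X_pool, batch_size, n_candidates, seed=0):
    """Return pool indices: most uncertain candidates, at most one per cluster."""
    _, std = gp.predict(X_pool, return_std=True)
    # Keep the n_candidates pool points with the largest predictive std.
    candidates = np.argsort(std)[-n_candidates:]
    # Cluster the candidates so the batch is spread over feature space.
    km = KMeans(n_clusters=batch_size, n_init=10, random_state=seed)
    labels = km.fit_predict(X_pool[candidates])
    batch = []
    for c in np.unique(labels):
        members = candidates[labels == c]
        # Within each cluster, take the most uncertain member.
        batch.append(members[np.argmax(std[members])])
    return np.array(batch)


def active_learning(X, y, n_init=10, batch_size=5, n_rounds=5, noise=1e-2, seed=0):
    """Grow a labeled set by repeated GP fitting and batch acquisition."""
    rng = np.random.default_rng(seed)
    labeled = rng.choice(len(X), size=n_init, replace=False).tolist()
    for _ in range(n_rounds):
        pool = np.setdiff1d(np.arange(len(X)), labeled)
        # alpha is added to the kernel diagonal; here it plays the role
        # of the GP noise setting (an assumption of this sketch).
        gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                                      alpha=noise, normalize_y=True)
        gp.fit(X[labeled], y[labeled])
        picked = select_batch(gp, X[pool], batch_size,
                              n_candidates=4 * batch_size, seed=seed)
        labeled.extend(pool[picked].tolist())
    return np.array(labeled)


if __name__ == "__main__":
    # Toy data standing in for molecular descriptors and a target property.
    rng = np.random.default_rng(1)
    X = rng.uniform(-2.0, 2.0, size=(200, 3))
    y = np.sin(X).sum(axis=1)
    chosen = active_learning(X, y)
    print(len(chosen), "points selected")
```

Replacing `select_batch` with a uniform random draw from the pool gives the randomized baseline that the abstract compares against, so the two selection rules can be evaluated on the same train/test split.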