Active Learning of Molecular Data for Task-Specific Objectives
Kunal Ghosh, Milica Todorović, Aki Vehtari, Patrick Rinke
arXiv:2408.11191, arXiv - PHYS - Data Analysis, Statistics and Probability, published 2024-08-20
Abstract
Active learning (AL) has shown promise as a particularly
data-efficient machine learning approach. Yet its performance depends on the
application and it is not clear when AL practitioners can expect computational
savings. Here, we carry out a systematic AL performance assessment for three
diverse molecular datasets and two common scientific tasks: compiling compact,
informative datasets and targeted molecular searches. We implemented AL with
Gaussian processes (GP) and used the many-body tensor as the molecular
representation. For the first task, we tested different data acquisition
strategies, batch sizes and GP noise settings. AL was insensitive to the
acquisition batch size, and we observed the best AL performance for the
acquisition strategy that combines uncertainty reduction with clustering to
promote diversity. However, for optimal GP noise settings, AL did not
outperform randomized selection of data points. Conversely, for targeted
searches, AL outperformed random sampling and achieved data savings of up to 64%.
Our analysis provides insight into this task-specific performance difference in
terms of target distributions and data collection strategies. We established
that the performance of AL depends on the relative distribution of the target
molecules in comparison to the total dataset distribution, with the largest
computational savings achieved when their overlap is minimal.
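
The acquisition strategy that the abstract reports as performing best, uncertainty reduction combined with clustering for diversity, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a unit-variance RBF kernel, a plain Lloyd's k-means, and random vectors standing in for the many-body tensor descriptors. The names `gp_posterior_std`, `kmeans_labels`, `select_batch` and the `candidate_factor` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between the rows of a and b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)


def gp_posterior_std(x_train, x_pool, noise=1e-3, length_scale=1.0):
    """GP predictive std at x_pool; it depends on the inputs only, not the labels."""
    k_tt = rbf_kernel(x_train, x_train, length_scale) + noise * np.eye(len(x_train))
    k_tp = rbf_kernel(x_train, x_pool, length_scale)
    solved = np.linalg.solve(k_tt, k_tp)
    # Prior variance is 1.0 for this kernel; subtract the explained part.
    var = 1.0 - np.einsum("ij,ij->j", k_tp, solved)
    return np.sqrt(np.clip(var, 0.0, None))


def kmeans_labels(x, k, rng, n_iter=20):
    """Plain Lloyd's k-means; returns one cluster label per row of x."""
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels


def select_batch(x_train, x_pool, batch_size, rng, candidate_factor=4):
    """Shortlist the most uncertain pool points, then take one per cluster.

    May return fewer than batch_size indices if a cluster ends up empty.
    """
    std = gp_posterior_std(x_train, x_pool)
    n_cand = min(len(x_pool), candidate_factor * batch_size)
    cand = np.argsort(std)[-n_cand:]  # indices of the most uncertain points
    labels = kmeans_labels(x_pool[cand], batch_size, rng)
    chosen = []
    for j in range(batch_size):
        members = cand[labels == j]
        if len(members):
            chosen.append(members[np.argmax(std[members])])
    return np.array(chosen)


rng = np.random.default_rng(0)
x_all = rng.normal(size=(200, 3))  # synthetic stand-in for molecular descriptors
x_train, pool = x_all[:10], x_all[10:]
batch = select_batch(x_train, pool, batch_size=5, rng=rng)
print(batch)  # indices into `pool` to label next
```

In a full loop, the selected points would be labeled, moved from the pool to the training set, and the GP refit before the next round. The clustering step is what prevents a batch from consisting of near-duplicate high-uncertainty points.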