Reducing training data needs with minimal multilevel machine learning (M3L)

IF 6.3 | CAS Tier 2 (Physics and Astronomy) | JCR Q1 (Computer Science, Artificial Intelligence) | Machine Learning: Science and Technology | Pub Date: 2024-06-06 | DOI: 10.1088/2632-2153/ad4ae5
Stefan Heinen, Danish Khan, Guido Falk von Rudorff, Konstantin Karandashev, Daniel Jose Arismendi Arrieta, Alastair J A Price, Surajit Nandi, Arghya Bhowmik, Kersti Hermansson, O Anatole von Lilienfeld
{"title":"用最小多级机器学习(M3L)减少训练数据需求","authors":"Stefan Heinen, Danish Khan, Guido Falk von Rudorff, Konstantin Karandashev, Daniel Jose Arismendi Arrieta, Alastair J A Price, Surajit Nandi, Arghya Bhowmik, Kersti Hermansson, O Anatole von Lilienfeld","doi":"10.1088/2632-2153/ad4ae5","DOIUrl":null,"url":null,"abstract":"For many machine learning applications in science, data acquisition, not training, is the bottleneck even when avoiding experiments and relying on computation and simulation. Correspondingly, and in order to reduce cost and carbon footprint, training data efficiency is key. We introduce minimal multilevel machine learning (M3L) which optimizes training data set sizes using a loss function at multiple levels of reference data in order to minimize a combination of prediction error with overall training data acquisition costs (as measured by computational wall-times). Numerical evidence has been obtained for calculated atomization energies and electron affinities of thousands of organic molecules at various levels of theory including HF, MP2, DLPNO-CCSD(T), DFHFCABS, PNOMP2F12, and PNOCCSD(T)F12, and treating them with basis sets TZ, cc-pVTZ, and AVTZ-F12. Our M3L benchmarks for reaching chemical accuracy in distinct chemical compound sub-spaces indicate substantial computational cost reductions by factors of ∼1.01, 1.1, 3.8, 13.8, and 25.8 when compared to heuristic sub-optimal multilevel machine learning (M2L) for the data sets QM7b, QM9<inline-formula>\n<tex-math><?CDATA $^\\mathrm{LCCSD(T)}$?></tex-math>\n<mml:math overflow=\"scroll\"><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mrow><mml:mi>LCCSD</mml:mi><mml:mo stretchy=\"false\">(</mml:mo><mml:mi mathvariant=\"normal\">T</mml:mi><mml:mo stretchy=\"false\">)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:math>\n<inline-graphic xlink:href=\"mlstad4ae5ieqn1.gif\" xlink:type=\"simple\"></inline-graphic>\n</inline-formula>, Electrolyte Genome Project, QM9<inline-formula>\n<tex-math><?CDATA $^\\mathrm{CCSD(T)}_\\mathrm{AE}$?></tex-math>\n<mml:math overflow=\"scroll\"><mml:mrow><mml:msubsup><mml:mi></mml:mi><mml:mrow><mml:mi>AE</mml:mi></mml:mrow><mml:mrow><mml:mi>CCSD</mml:mi><mml:mo stretchy=\"false\">(</mml:mo><mml:mi mathvariant=\"normal\">T</mml:mi><mml:mo stretchy=\"false\">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math>\n<inline-graphic xlink:href=\"mlstad4ae5ieqn2.gif\" xlink:type=\"simple\"></inline-graphic>\n</inline-formula>, and QM9<inline-formula>\n<tex-math><?CDATA $^\\mathrm{CCSD(T)}_\\mathrm{EA}$?></tex-math>\n<mml:math overflow=\"scroll\"><mml:mrow><mml:msubsup><mml:mi></mml:mi><mml:mrow><mml:mi>EA</mml:mi></mml:mrow><mml:mrow><mml:mi>CCSD</mml:mi><mml:mo stretchy=\"false\">(</mml:mo><mml:mi mathvariant=\"normal\">T</mml:mi><mml:mo stretchy=\"false\">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math>\n<inline-graphic xlink:href=\"mlstad4ae5ieqn3.gif\" xlink:type=\"simple\"></inline-graphic>\n</inline-formula>, respectively. Furthermore, we use M2L to investigate the performance for 76 density functionals when used within multilevel learning and building on the following levels drawn from the hierarchy of Jacobs Ladder: LDA, GGA, mGGA, and hybrid functionals. Within M2L and the molecules considered, mGGAs do not provide any noticeable advantage over GGAs. 
Among the functionals considered and in combination with LDA, the three on average top performing GGA and Hybrid levels for atomization energies on QM9 using M3L correspond respectively to PW91, KT2, B97D, and <italic toggle=\"yes\">τ</italic>-HCTH, B3LYP<inline-formula>\n<tex-math><?CDATA $\\ast$?></tex-math>\n<mml:math overflow=\"scroll\"><mml:mrow><mml:mo>∗</mml:mo></mml:mrow></mml:math>\n<inline-graphic xlink:href=\"mlstad4ae5ieqn4.gif\" xlink:type=\"simple\"></inline-graphic>\n</inline-formula>(VWN5), and TPSSH.","PeriodicalId":33757,"journal":{"name":"Machine Learning Science and Technology","volume":"19 1","pages":""},"PeriodicalIF":6.3000,"publicationDate":"2024-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Reducing training data needs with minimal multilevel machine learning (M3L)\",\"authors\":\"Stefan Heinen, Danish Khan, Guido Falk von Rudorff, Konstantin Karandashev, Daniel Jose Arismendi Arrieta, Alastair J A Price, Surajit Nandi, Arghya Bhowmik, Kersti Hermansson, O Anatole von Lilienfeld\",\"doi\":\"10.1088/2632-2153/ad4ae5\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"For many machine learning applications in science, data acquisition, not training, is the bottleneck even when avoiding experiments and relying on computation and simulation. Correspondingly, and in order to reduce cost and carbon footprint, training data efficiency is key. We introduce minimal multilevel machine learning (M3L) which optimizes training data set sizes using a loss function at multiple levels of reference data in order to minimize a combination of prediction error with overall training data acquisition costs (as measured by computational wall-times). Numerical evidence has been obtained for calculated atomization energies and electron affinities of thousands of organic molecules at various levels of theory including HF, MP2, DLPNO-CCSD(T), DFHFCABS, PNOMP2F12, and PNOCCSD(T)F12, and treating them with basis sets TZ, cc-pVTZ, and AVTZ-F12. 
Our M3L benchmarks for reaching chemical accuracy in distinct chemical compound sub-spaces indicate substantial computational cost reductions by factors of ∼1.01, 1.1, 3.8, 13.8, and 25.8 when compared to heuristic sub-optimal multilevel machine learning (M2L) for the data sets QM7b, QM9<inline-formula>\\n<tex-math><?CDATA $^\\\\mathrm{LCCSD(T)}$?></tex-math>\\n<mml:math overflow=\\\"scroll\\\"><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mrow><mml:mi>LCCSD</mml:mi><mml:mo stretchy=\\\"false\\\">(</mml:mo><mml:mi mathvariant=\\\"normal\\\">T</mml:mi><mml:mo stretchy=\\\"false\\\">)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:math>\\n<inline-graphic xlink:href=\\\"mlstad4ae5ieqn1.gif\\\" xlink:type=\\\"simple\\\"></inline-graphic>\\n</inline-formula>, Electrolyte Genome Project, QM9<inline-formula>\\n<tex-math><?CDATA $^\\\\mathrm{CCSD(T)}_\\\\mathrm{AE}$?></tex-math>\\n<mml:math overflow=\\\"scroll\\\"><mml:mrow><mml:msubsup><mml:mi></mml:mi><mml:mrow><mml:mi>AE</mml:mi></mml:mrow><mml:mrow><mml:mi>CCSD</mml:mi><mml:mo stretchy=\\\"false\\\">(</mml:mo><mml:mi mathvariant=\\\"normal\\\">T</mml:mi><mml:mo stretchy=\\\"false\\\">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math>\\n<inline-graphic xlink:href=\\\"mlstad4ae5ieqn2.gif\\\" xlink:type=\\\"simple\\\"></inline-graphic>\\n</inline-formula>, and QM9<inline-formula>\\n<tex-math><?CDATA $^\\\\mathrm{CCSD(T)}_\\\\mathrm{EA}$?></tex-math>\\n<mml:math overflow=\\\"scroll\\\"><mml:mrow><mml:msubsup><mml:mi></mml:mi><mml:mrow><mml:mi>EA</mml:mi></mml:mrow><mml:mrow><mml:mi>CCSD</mml:mi><mml:mo stretchy=\\\"false\\\">(</mml:mo><mml:mi mathvariant=\\\"normal\\\">T</mml:mi><mml:mo stretchy=\\\"false\\\">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math>\\n<inline-graphic xlink:href=\\\"mlstad4ae5ieqn3.gif\\\" xlink:type=\\\"simple\\\"></inline-graphic>\\n</inline-formula>, respectively. Furthermore, we use M2L to investigate the performance for 76 density functionals when used within multilevel learning and building on the following levels drawn from the hierarchy of Jacobs Ladder: LDA, GGA, mGGA, and hybrid functionals. Within M2L and the molecules considered, mGGAs do not provide any noticeable advantage over GGAs. 
Among the functionals considered and in combination with LDA, the three on average top performing GGA and Hybrid levels for atomization energies on QM9 using M3L correspond respectively to PW91, KT2, B97D, and <italic toggle=\\\"yes\\\">τ</italic>-HCTH, B3LYP<inline-formula>\\n<tex-math><?CDATA $\\\\ast$?></tex-math>\\n<mml:math overflow=\\\"scroll\\\"><mml:mrow><mml:mo>∗</mml:mo></mml:mrow></mml:math>\\n<inline-graphic xlink:href=\\\"mlstad4ae5ieqn4.gif\\\" xlink:type=\\\"simple\\\"></inline-graphic>\\n</inline-formula>(VWN5), and TPSSH.\",\"PeriodicalId\":33757,\"journal\":{\"name\":\"Machine Learning Science and Technology\",\"volume\":\"19 1\",\"pages\":\"\"},\"PeriodicalIF\":6.3000,\"publicationDate\":\"2024-06-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Machine Learning Science and Technology\",\"FirstCategoryId\":\"101\",\"ListUrlMain\":\"https://doi.org/10.1088/2632-2153/ad4ae5\",\"RegionNum\":2,\"RegionCategory\":\"物理与天体物理\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Machine Learning Science and Technology","FirstCategoryId":"101","ListUrlMain":"https://doi.org/10.1088/2632-2153/ad4ae5","RegionNum":2,"RegionCategory":"物理与天体物理","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

For many machine learning applications in science, data acquisition, not training, is the bottleneck, even when experiments are avoided in favor of computation and simulation. Correspondingly, and in order to reduce cost and carbon footprint, training data efficiency is key. We introduce minimal multilevel machine learning (M3L), which optimizes training data set sizes using a loss function spanning multiple levels of reference data, in order to minimize a combination of prediction error and overall training data acquisition cost (as measured by computational wall-times). Numerical evidence has been obtained for calculated atomization energies and electron affinities of thousands of organic molecules at various levels of theory, including HF, MP2, DLPNO-CCSD(T), DF-HF-CABS, PNO-MP2-F12, and PNO-CCSD(T)-F12, treated with the basis sets TZ, cc-pVTZ, and AVTZ-F12. Our M3L benchmarks for reaching chemical accuracy in distinct chemical compound sub-spaces indicate substantial computational cost reductions by factors of ∼1.01, 1.1, 3.8, 13.8, and 25.8 compared to heuristic sub-optimal multilevel machine learning (M2L) for the data sets QM7b, QM9^LCCSD(T), Electrolyte Genome Project, QM9^CCSD(T)_AE, and QM9^CCSD(T)_EA, respectively. Furthermore, we use M2L to investigate the performance of 76 density functionals within multilevel learning, building on the following levels drawn from the hierarchy of Jacob's Ladder: LDA, GGA, mGGA, and hybrid functionals. Within M2L and for the molecules considered, mGGAs do not provide any noticeable advantage over GGAs. Among the functionals considered and in combination with LDA, the three on-average top-performing GGA and hybrid levels for atomization energies on QM9 using M3L correspond respectively to PW91, KT2, B97D, and τ-HCTH, B3LYP*(VWN5), and TPSSH.
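The abstract's core idea, choosing per-level training set sizes by minimizing a loss that combines predicted model error with the wall-time cost of generating reference data, can be illustrated with a toy sketch. The power-law learning-curve model, its parameters, the per-level wall-times, and the trade-off weight below are hypothetical placeholders, not values or code from the paper; the sketch only shows the general shape of such an optimization.

```python
# Toy M3L-style optimization: choose per-level training set sizes that
# trade off predicted prediction error against data acquisition cost.
# All numbers and the additive power-law error model are hypothetical
# assumptions for illustration, not taken from the paper.
import numpy as np
from scipy.optimize import minimize

# Assumed learning curves per level: error_l(N_l) ~ a_l * N_l**(-b_l),
# with the errors of the multilevel corrections treated as additive.
a = np.array([10.0, 4.0, 2.0])      # error prefactors (hypothetical)
b = np.array([0.40, 0.35, 0.30])    # learning-curve decay exponents (hypothetical)
t = np.array([0.1, 10.0, 1000.0])   # wall-time per training point, cheap -> expensive

LAMBDA = 1e-4  # trade-off weight converting wall-time into error units

def combined_loss(log_n):
    """Predicted multilevel error plus weighted data-acquisition cost."""
    n = np.exp(log_n)              # per-level training set sizes
    error = np.sum(a * n ** (-b))  # additive multilevel error estimate
    cost = np.sum(t * n)           # total wall-time to generate the data
    return error + LAMBDA * cost

# Optimize over log-sizes so the sizes stay positive during the search.
result = minimize(combined_loss, x0=np.log([1000.0, 100.0, 10.0]),
                  method="Nelder-Mead")
n_opt = np.exp(result.x)
print("suggested training set sizes per level:", np.round(n_opt))
```

With these assumed numbers, the optimizer assigns far fewer training points to the expensive level than to the cheap one, which is the qualitative behavior an optimized multilevel scheme exploits relative to heuristic level-size choices.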
Source Journal
Machine Learning: Science and Technology (Computer Science / Artificial Intelligence)
CiteScore: 9.10 · Self-citation rate: 4.40% · Annual publications: 86 · Review time: 5 weeks
Journal Description: Machine Learning Science and Technology is a multidisciplinary open access journal that bridges the application of machine learning across the sciences with advances in machine learning methods and theory as motivated by physical insights. Specifically, articles must fall into one of the following categories: advance the state of machine learning-driven applications in the sciences, or make conceptual, methodological or theoretical advances in machine learning with applications to, inspiration from, or motivated by scientific problems.
Latest Articles in This Journal
Quality assurance for online adaptive radiotherapy: a secondary dose verification model with geometry-encoded U-Net
Optimizing ZX-diagrams with deep reinforcement learning
DiffLense: a conditional diffusion model for super-resolution of gravitational lensing data
Equivariant tensor network potentials
Masked particle modeling on sets: towards self-supervised high energy physics foundation models