QeMFi: A Multifidelity Dataset of Quantum Chemical Properties of Diverse Molecules.

Scientific Data (IF 7.2; Zone 2, comprehensive journals; JCR Q1, Multidisciplinary Sciences). Pub Date: 2025-02-03. DOI: 10.1038/s41597-024-04247-3

Vivin Vinod, Peter Zaspel

Scientific Data, vol. 12, no. 1, p. 202. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11791055/pdf/


Abstract

Progress in both Machine Learning (ML) and Quantum Chemistry (QC) methods has resulted in high-accuracy ML models for QC properties. Datasets such as MD17 and WS22 have been used to benchmark these models at a given level of QC method, or fidelity, which refers to the accuracy of the chosen QC method. Multifidelity ML (MFML) methods, where models are trained on data from more than one fidelity, have been shown to be more effective than single-fidelity methods. Much research is progressing in this direction for diverse applications ranging from energy band gaps to excitation energies. One hurdle for effective research here is the lack of a diverse multifidelity dataset for benchmarking. We provide the Quantum chemistry MultiFidelity (QeMFi) dataset, consisting of five fidelities calculated with the TD-DFT formalism. The fidelities differ in their basis set choice: STO-3G, 3-21G, 6-31G, def2-SVP, and def2-TZVP. QeMFi offers the community a variety of QC properties, such as vertical excitation properties and molecular dipole moments. Further, QeMFi offers QC computation times, allowing for a time-benefit benchmark of multifidelity models for ML-QC.
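The time-benefit idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract gives no procedure for such a benchmark, so this is only one plausible way to account for data-generation cost. The function `data_generation_cost`, the per-calculation timings, and the sample counts are all hypothetical placeholders, not values from QeMFi. The basis sets are listed in the order the abstract gives them, which is assumed here to run from cheapest to most expensive.

```python
# Hypothetical illustration: every timing and sample count below is a
# made-up placeholder, NOT a value taken from the QeMFi dataset.

# Basis sets in the order given in the abstract (assumed cheapest -> costliest).
FIDELITIES = ["STO-3G", "3-21G", "6-31G", "def2-SVP", "def2-TZVP"]


def data_generation_cost(n_samples, seconds_per_calc):
    """Total QC compute time (seconds) needed to generate a training set.

    n_samples:        dict mapping fidelity name -> number of training samples
    seconds_per_calc: dict mapping fidelity name -> time of one QC calculation
    """
    return sum(n_samples[f] * seconds_per_calc[f] for f in n_samples)


# Placeholder cost of a single calculation at each fidelity, in seconds.
seconds_per_calc = {
    "STO-3G": 1.0,
    "3-21G": 2.0,
    "6-31G": 4.0,
    "def2-SVP": 10.0,
    "def2-TZVP": 50.0,
}

# Single-fidelity model: all training samples at the target fidelity.
single = {"def2-TZVP": 512}

# Multifidelity model: few target-fidelity samples, more at cheaper fidelities.
multi = {"STO-3G": 512, "3-21G": 256, "6-31G": 128, "def2-SVP": 64, "def2-TZVP": 32}

single_cost = data_generation_cost(single, seconds_per_calc)
multi_cost = data_generation_cost(multi, seconds_per_calc)

print(f"single-fidelity cost: {single_cost:.0f} s")
print(f"multifidelity cost:   {multi_cost:.0f} s")
```

A cost comparison like this is only meaningful when the two models reach comparable prediction error, so in practice the cost would be reported alongside an error metric for each model.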


Source journal

Scientific Data Social Sciences-Education

CiteScore: 11.20

Self-citation rate: 4.10%

Articles published: 689

Review time: 16 weeks

Journal description: Scientific Data is an open-access journal focused on data, publishing descriptions of research datasets and articles on data sharing across natural sciences, medicine, engineering, and social sciences. Its goal is to enhance the sharing and reuse of scientific data, encourage broader data sharing, and acknowledge those who share their data. The journal primarily publishes Data Descriptors, which offer detailed descriptions of research datasets, including data collection methods and technical analyses validating data quality. These descriptors aim to facilitate data reuse rather than testing hypotheses or presenting new interpretations, methods, or in-depth analyses.