Regression with Large Language Models for Materials and Molecular Property Prediction

Ryan Jacobs, Maciej P. Polak, Lane E. Schultz, Hamed Mahdavi, Vasant Honavar, Dane Morgan
arXiv:2409.06080 · arXiv - PHYS - Materials Science · Published 2024-09-09

Abstract

We demonstrate the ability of large language models (LLMs) to perform material and molecular property regression tasks, a significant deviation from the conventional LLM use case. We benchmark the Large Language Model Meta AI (LLaMA) 3 on several molecular properties in the QM9 dataset and on 24 materials properties. Only composition-based input strings are used as the model input, and we fine-tune using only the generative loss. We broadly find that LLaMA 3, when fine-tuned using the SMILES representation of molecules, provides useful regression results that can rival standard materials property prediction models like random forests or fully connected neural networks on the QM9 dataset. Not surprisingly, LLaMA 3 errors are 5-10x higher than those of state-of-the-art models trained on far more granular representations of molecules (e.g., atom types and their coordinates) for the same task. Interestingly, LLaMA 3 provides improved predictions compared to GPT-3.5 and GPT-4o. This work highlights the versatility of LLMs, suggesting that LLM-like generative models can potentially transcend their traditional applications to tackle complex physical phenomena, thus paving the way for future research and applications in chemistry, materials science, and other scientific domains.
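The core recipe described above is to cast numeric property regression as text generation: a molecule's SMILES string is paired with its target value rendered as text, the model is fine-tuned on the ordinary generative (next-token) loss, and at inference time a number is parsed back out of the generated text and scored with standard regression metrics. The sketch below illustrates this framing only; the prompt template, property name, and helper functions are assumptions for illustration, not the authors' exact pipeline.

```python
import re
from typing import Optional

def format_example(smiles: str, prop_name: str, value: float) -> dict:
    """Build a prompt/completion pair for generative fine-tuning.

    The template here is a hypothetical example; any fixed question
    format pairing the SMILES string with the numeric answer works.
    """
    return {
        "prompt": f"What is the {prop_name} of the molecule {smiles}?",
        "completion": f" {value:.4f}",
    }

def parse_prediction(generated: str) -> Optional[float]:
    """Extract the first numeric token from the model's generated text."""
    m = re.search(r"-?\d+(?:\.\d+)?", generated)
    return float(m.group()) if m else None

def mae(preds, targets) -> float:
    """Mean absolute error, skipping generations that failed to parse."""
    pairs = [(p, t) for p, t in zip(preds, targets) if p is not None]
    return sum(abs(p - t) for p, t in pairs) / len(pairs)

# Toy usage with a hypothetical QM9-style record (ethanol, "CCO"):
ex = format_example("CCO", "HOMO-LUMO gap (eV)", 6.8912)
pred = parse_prediction("The gap is approximately 6.95 eV")
```

Because the model is trained only on the generative loss, regression quality depends entirely on the model learning to emit well-formed numeric strings; the parsing step is where a generated answer is converted back into a quantity that can be compared against baselines such as random forests.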