Regression with Large Language Models for Materials and Molecular Property Prediction

Ryan Jacobs, Maciej P. Polak, Lane E. Schultz, Hamed Mahdavi, Vasant Honavar, Dane Morgan
arXiv:2409.06080 · arXiv - PHYS - Materials Science · Published 2024-09-09

Abstract

We demonstrate the ability of large language models (LLMs) to perform material and molecular property regression tasks, a significant deviation from the conventional LLM use case. We benchmark the Large Language Model Meta AI (LLaMA) 3 on several molecular properties in the QM9 dataset and on 24 materials properties. Only composition-based input strings are used as the model input, and we fine-tune using only the generative loss. We broadly find that LLaMA 3, when fine-tuned using the SMILES representation of molecules, provides useful regression results that can rival standard materials property prediction models like random forests or fully connected neural networks on the QM9 dataset. Not surprisingly, LLaMA 3 errors are 5-10x higher than those of state-of-the-art models trained on far more granular representations of molecules (e.g., atom types and their coordinates) for the same task. Interestingly, LLaMA 3 provides improved predictions compared to GPT-3.5 and GPT-4o. This work highlights the versatility of LLMs, suggesting that LLM-like generative models can potentially transcend their traditional applications to tackle complex physical phenomena, thus paving the way for future research and applications in chemistry, materials science, and other scientific domains.
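The core recipe described above is to cast numeric property regression as text generation: a molecule's SMILES string is paired with its target value rendered as text, the model is fine-tuned on the ordinary generative (next-token) loss, and at inference time a number is parsed back out of the generated text and scored with standard regression metrics. The sketch below illustrates this framing only; the prompt template, property name, and helper functions are assumptions for illustration, not the authors' exact pipeline.

```python
import re
from typing import Optional

def format_example(smiles: str, prop_name: str, value: float) -> dict:
    """Build a prompt/completion pair for generative fine-tuning.

    The template here is a hypothetical example; any fixed question
    format pairing the SMILES string with the numeric answer works.
    """
    return {
        "prompt": f"What is the {prop_name} of the molecule {smiles}?",
        "completion": f" {value:.4f}",
    }

def parse_prediction(generated: str) -> Optional[float]:
    """Extract the first numeric token from the model's generated text."""
    m = re.search(r"-?\d+(?:\.\d+)?", generated)
    return float(m.group()) if m else None

def mae(preds, targets) -> float:
    """Mean absolute error, skipping generations that failed to parse."""
    pairs = [(p, t) for p, t in zip(preds, targets) if p is not None]
    return sum(abs(p - t) for p, t in pairs) / len(pairs)

# Toy usage with a hypothetical QM9-style record (ethanol, "CCO"):
ex = format_example("CCO", "HOMO-LUMO gap (eV)", 6.8912)
pred = parse_prediction("The gap is approximately 6.95 eV")
```

Because the model is trained only on the generative loss, regression quality depends entirely on the model learning to emit well-formed numeric strings; the parsing step is where a generated answer is converted back into a quantity that can be compared against baselines such as random forests.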