Specialising and Analysing Instruction-Tuned and Byte-Level Language Models for Organic Reaction Prediction

Pub Date: 2024-08-19 · DOI: 10.1039/d4fd00104d
Jiayun Pang, Ivan Vulić
{"title":"Specialising and Analysing Instruction-Tuned and Byte-Level Language Models for Organic Reaction Prediction","authors":"Jiayun Pang, Ivan Vulić","doi":"10.1039/d4fd00104d","DOIUrl":null,"url":null,"abstract":"Transformer-based encoder-decoder models have demonstrated impressive results in chemical reaction prediction tasks. However, these models typically rely on pretraining using tens of millions of unlabelled molecules, which can be time-consuming and GPU-intensive. One of the central questions we aim to answer in this work is: Can FlanT5 and ByT5, the encode-decoder models pretrained solely on language data, be effectively specialised for organic reaction prediction through task-specific fine-tuning? We conduct a systematic empirical study on several key issues of the process, including tokenisation, the impact of (SMILES-oriented) pretraining, fine-tuning sample efficiency, and decoding algorithms at inference. Our key findings indicate that although being pretrained only on language tasks, FlanT5 and ByT5 provide a solid foundation to fine-tune for reaction prediction, and thus become 'chemistry domain compatible' in the process. This suggests that GPU-intensive and expensive pretraining on a large dataset of unlabelled molecules may be useful yet not essential to leverage the power of language models for chemistry. All our models achieve comparable Top-1 and Top-5 accuracy although some variation across different models does exist. Notably, tokenisation and vocabulary trimming slightly affect final performance but can speed up training and inference; The most efficient greedy decoding strategy is very competitive while only marginal gains can be achieved from more sophisticated decoding algorithms. In summary, we evaluate FlanT5 and ByT5 across several dimensions and benchmark their impact on organic reaction prediction, which may guide more effective use of these state-of-the-art language models for chemistry-related tasks in the future.","PeriodicalId":3,"journal":{"name":"ACS Applied Electronic Materials","volume":null,"pages":null},"PeriodicalIF":4.3000,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACS Applied Electronic Materials","FirstCategoryId":"92","ListUrlMain":"https://doi.org/10.1039/d4fd00104d","RegionNum":3,"RegionCategory":"材料科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0

Abstract

Transformer-based encoder-decoder models have demonstrated impressive results in chemical reaction prediction tasks. However, these models typically rely on pretraining with tens of millions of unlabelled molecules, which can be time-consuming and GPU-intensive. One of the central questions we aim to answer in this work is: can FlanT5 and ByT5, encoder-decoder models pretrained solely on language data, be effectively specialised for organic reaction prediction through task-specific fine-tuning? We conduct a systematic empirical study of several key aspects of this process, including tokenisation, the impact of (SMILES-oriented) pretraining, fine-tuning sample efficiency, and decoding algorithms at inference. Our key findings indicate that, although pretrained only on language tasks, FlanT5 and ByT5 provide a solid foundation for fine-tuning on reaction prediction and thus become 'chemistry domain compatible' in the process. This suggests that GPU-intensive and expensive pretraining on a large dataset of unlabelled molecules may be useful, yet is not essential, for leveraging the power of language models for chemistry. All our models achieve comparable Top-1 and Top-5 accuracy, although some variation across models does exist. Notably, tokenisation and vocabulary trimming slightly affect final performance but can speed up training and inference; the most efficient greedy decoding strategy is very competitive, while only marginal gains can be achieved from more sophisticated decoding algorithms. In summary, we evaluate FlanT5 and ByT5 across several dimensions and benchmark their impact on organic reaction prediction, which may guide more effective use of these state-of-the-art language models for chemistry-related tasks in the future.
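The abstract highlights two practical levers the study examines: how SMILES strings are tokenised by a subword model (FlanT5) versus a byte-level model (ByT5), and how the decoding algorithm at inference trades compute against Top-k accuracy. The sketch below is not the authors' code; it simply loads the public Hugging Face checkpoints to contrast the two tokenisers and greedy versus beam-search decoding. The reactant SMILES and generation settings are illustrative assumptions, and without the task-specific fine-tuning performed in the paper, the generated "products" are not chemically meaningful.

```python
# Minimal sketch: tokenisation and decoding interfaces for FlanT5 vs ByT5.
# Assumes the `transformers` and `torch` packages; checkpoints are the public
# Hugging Face models, not the fine-tuned models from the paper.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

reactant_smiles = "CCO.CC(=O)O"  # hypothetical reactant pair (ethanol + acetic acid)

for checkpoint in ("google/flan-t5-base", "google/byt5-small"):
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    # Tokenisation differs: FlanT5 uses a SentencePiece subword vocabulary,
    # while ByT5 operates directly on UTF-8 bytes, so SMILES are never split
    # into out-of-vocabulary pieces.
    inputs = tokenizer(reactant_smiles, return_tensors="pt")
    print(checkpoint, "input length in tokens:", inputs["input_ids"].shape[-1])

    with torch.no_grad():
        # Greedy decoding (num_beams=1): a single highest-probability product.
        greedy = model.generate(**inputs, max_new_tokens=128, num_beams=1)
        # Beam search: several candidate products, e.g. for Top-5 accuracy.
        beams = model.generate(
            **inputs, max_new_tokens=128, num_beams=5, num_return_sequences=5
        )

    print("greedy:", tokenizer.batch_decode(greedy, skip_special_tokens=True))
    print("top-5 :", tokenizer.batch_decode(beams, skip_special_tokens=True))
```

In this setup, the greedy strategy discussed in the abstract corresponds to `num_beams=1`, while Top-5 candidates would come from beam search with `num_return_sequences=5`; only after fine-tuning on reaction data would the decoded strings be interpretable as predicted products.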