Composition and structure analyzer/featurizer for explainable machine-learning models to predict solid state structures†

Digital Discovery (IF 6.2, Q1, Chemistry, Multidisciplinary) · Pub Date: 2025-01-17 · DOI: 10.1039/D4DD00332B
Emil I. Jaffal, Sangjoon Lee, Danila Shiryaev, Alex Vtorov, Nikhil Kumar Barua, Holger Kleinke and Anton O. Oliynyk
Citation: Digital Discovery, 2025, 2, 548-560
Article: https://pubs.rsc.org/en/content/articlelanding/2025/dd/d4dd00332b
PDF: https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00332b?page=search
Citations: 0

Abstract

Traditional and non-classical machine learning models for solid-state structure prediction have predominantly relied on compositional features (derived from properties of constituent elements) to predict the existence of a structure and its properties. However, the lack of structural information can lead to suboptimal property mapping and increased predictive uncertainty. To address this challenge, we have introduced a strategy that generates and combines both compositional and structural features while requiring minimal programming expertise. Our approach uses two open-source, interactive Python programs: Composition Analyzer Featurizer (CAF) and Structure Analyzer Featurizer (SAF). CAF generates numerical compositional features from a list of formulae provided in an Excel file, while SAF extracts numerical structural features from a .cif file by generating a supercell. A total of 133 features from CAF and 94 features from SAF are used, either individually or in combination, to cluster nine structure types in equiatomic AB intermetallics. The performance is comparable to that obtained with features from the JARVIS, MAGPIE, mat2vec, and OLED datasets in PLS-DA, SVM, and XGBoost models. Our SAF + CAF features provide a cost-efficient and reliable solution, even with the PLS-DA method, where a significant fraction of the most contributing features are the same as those identified in the more computationally intensive XGBoost models.
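
To make the idea of a compositional feature concrete, the following is a minimal sketch of how a formula can be turned into a few numerical descriptors. It is not CAF's code or API: the function names `parse_formula` and `compositional_features`, the feature names, and the small `ATOMIC_NUMBER` table are all hypothetical and chosen only for illustration, and atomic number stands in for the many elemental properties a real featurizer would use.

```python
import re

# Illustrative subset of elemental data (atomic numbers only).
# Assumption: this table is a stand-in, not the database used by CAF.
ATOMIC_NUMBER = {"Al": 13, "Si": 14, "Ti": 22, "Fe": 26, "Co": 27, "Ni": 28}


def parse_formula(formula):
    """Parse a simple formula such as 'FeSi' or 'Ni3Al' into {element: count}."""
    counts = {}
    # Each match is an element symbol followed by an optional numeric amount.
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?", formula):
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    return counts


def compositional_features(formula):
    """Return a few hypothetical features derived from atomic numbers."""
    counts = parse_formula(formula)
    total = sum(counts.values())
    fractions = {el: n / total for el, n in counts.items()}
    z_values = [ATOMIC_NUMBER[el] for el in counts]
    return {
        "n_elements": len(counts),
        # Mean atomic number weighted by atomic fraction.
        "Z_weighted_mean": sum(fractions[el] * ATOMIC_NUMBER[el] for el in counts),
        "Z_min": min(z_values),
        "Z_max": max(z_values),
        "Z_range": max(z_values) - min(z_values),
    }


if __name__ == "__main__":
    # An equiatomic AB compound, the class of compounds studied in the paper.
    print(compositional_features("FeSi"))
```

Applying the same function to every formula in a list gives one feature row per compound, which is the kind of table a classifier such as PLS-DA, SVM, or XGBoost takes as input.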

Source journal metrics: CiteScore 2.80; self-citation rate 0.00%; articles published 0