Critical assessment of machine learning prediction of biomass pyrolysis

IF 7.5 | Zone 1 (Engineering & Technology) | JCR Q2 (Energy & Fuels) | Fuel, vol. 394, Article 135000 | Pub Date: 2025-03-18 | DOI: 10.1016/j.fuel.2025.135000

Antonio Elia Pascarella, Antonio Coppola, Stefano Marrone, Roberto Chirone, Carlo Sansone, Piero Salatino


Citations: 0

Abstract

Biomass pyrolysis is a complex process that is challenging to model physically, and modern AI methods could improve its prediction and characterization. However, building AI models requires high-quality datasets. Existing datasets in the literature, which usually contain only a few hundred records, are inadequate for robust AI applications.

The first goal of the study was to make the best use of the currently available experimental data on fixed-bed, non-catalytic biomass pyrolysis by compiling data from nearly 160 sources into a new dataset of 1137 records. Each record was carefully standardized to overcome inconsistent terminology and the lack of uniformity among sources. This extended dataset, which covers biomass properties, pyrolysis operating conditions, and bioliquid yield and integrates earlier datasets, is intended to promote community-based data sharing. The compiled dataset is remarkably sparse because the original data are often incomplete.

The second goal was to benchmark different regression and data imputation models in order to assess the predictive ability of ML on the collected dataset. The most accurate estimates were obtained from a subset of about 500 instances without missing values, giving a Mean Absolute Error (MAE) of 2.28. Applying ML to the entire dataset with imputed missing data yielded a less accurate estimate (MAE = 3.45), which underlines how critical missing-data imputation and dataset sparsity are.
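The two ingredients of such a benchmark, the error metric and the handling of missing entries, can be illustrated with a minimal sketch. It is not the authors' code: the abstract does not say which imputation models were compared, so column-mean imputation is used here purely as an assumed example, and the records and helper names are hypothetical.

```python
from statistics import mean


def mean_absolute_error(y_true, y_pred):
    """Average of |y_true - y_pred| over all records."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def complete_cases(rows):
    """Keep only the records in which no feature is missing (None)."""
    return [r for r in rows if all(v is not None for v in r)]


def mean_impute(rows):
    """Replace each missing value with the mean of the observed values in its column.

    Assumption: this is one simple imputation strategy, chosen for illustration only.
    """
    n_cols = len(rows[0])
    col_means = [mean(r[j] for r in rows if r[j] is not None) for j in range(n_cols)]
    return [
        tuple(col_means[j] if r[j] is None else r[j] for j in range(n_cols))
        for r in rows
    ]


# Hypothetical records: (temperature in °C, heating rate in °C/min); None marks a gap.
records = [(450.0, 10.0), (500.0, None), (550.0, 30.0), (None, 20.0)]

complete = complete_cases(records)  # smaller, but every value was actually measured
imputed = mean_impute(records)      # full size, but some values are estimates

# MAE of three hypothetical yield predictions against three hypothetical measurements.
example_mae = mean_absolute_error([30.0, 40.0, 50.0], [32.0, 37.0, 50.0])
```

A regressor trained on `complete` sees fewer records but no estimated values, whereas one trained on `imputed` sees every record at the cost of filled-in entries; the abstract reports that the first option gave the lower MAE on this dataset.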

The third and most relevant goal was a critical assessment of Explainable Artificial Intelligence (XAI) techniques, which come into play when ML is used to evaluate the importance and directional trends of selected features. Two XAI tools, Partial Dependence Plots (PDP) and SHAP, were applied to the dataset to assess whether they can be trusted to support mechanistic inference about how key biomass properties and process operating parameters affect pyrolysis yields. The result of this analysis is far from satisfactory. Significant discrepancies across studies, inconsistencies among methods, and somewhat erratic trends in the PDPs reflect the difficulty of obtaining consistent mechanistic insights from purely data-driven approaches, and suggest adopting physics-informed machine learning that embodies physico-chemical relationships to improve explainable AI.
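For readers unfamiliar with partial dependence, the quantity behind a PDP can be computed by hand, as in the sketch below. This is an illustration under stated assumptions, not the study's implementation: `toy_model` is a hypothetical stand-in for a fitted regressor, the feature names and values are invented, and SHAP is not shown.

```python
def partial_dependence(model, rows, feature_index, grid):
    """One-feature partial dependence.

    For each grid value, overwrite the chosen feature in every record,
    predict, and average the predictions over the dataset.
    """
    curve = []
    for value in grid:
        preds = []
        for row in rows:
            modified = list(row)
            modified[feature_index] = value
            preds.append(model(modified))
        curve.append(sum(preds) / len(preds))
    return curve


def toy_model(x):
    """Hypothetical fitted model: yield peaks at 500 °C and falls with ash content."""
    temperature, ash = x
    return 50.0 - 0.001 * (temperature - 500.0) ** 2 - 0.5 * ash


# Hypothetical records: (temperature in °C, ash content in wt%).
rows = [(450.0, 2.0), (500.0, 4.0), (550.0, 6.0)]
grid = [400.0, 500.0, 600.0]

pdp_temperature = partial_dependence(toy_model, rows, 0, grid)
```

Because the curve is an average of model predictions, it describes the fitted model rather than the underlying chemistry. Two models with similar MAE can therefore produce different curves for the same feature, which is one way the inconsistencies noted in the abstract can arise.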




Source journal

Fuel (Engineering & Technology; Engineering: Chemical)

CiteScore: 12.80

Self-citation rate: 20.30%

Articles published: 3506

Review time: 64 days

Journal description: The exploration of energy sources remains a critical area of study. For the past nine decades, Fuel has consistently been at the forefront of primary research in energy science. Its scope encompasses a wide range of subjects, with particular emphasis on emerging concerns such as environmental factors and pollution.