Journal of Chemometrics: Latest Articles

Transforming Hyperspectral Images Into Chemical Maps: A Novel End-to-End Deep Learning Approach
IF 2.1 Zone 4 (Chemistry) Q1 SOCIAL WORK Pub Date: 2025-07-16 DOI: 10.1002/cem.70041
Ole-Christian Galbo Engstrøm, Michela Albano-Gaglio, Erik Schou Dreier, Yamine Bouzembrak, Maria Font-i-Furnols, Puneet Mishra, Kim Steenstrup Pedersen

Current approaches to generating chemical maps from hyperspectral images rely on models such as partial least squares (PLS) regression, which produce pixel-wise predictions that ignore spatial context and suffer from a high degree of noise. This study proposes an end-to-end deep learning approach that uses a modified U-Net and a custom loss function to obtain chemical maps directly from hyperspectral images, skipping all intermediate steps required for traditional pixel-wise analysis. The U-Net is compared with traditional PLS regression on a real dataset of pork belly samples with associated mean fat reference values. On the task of mean fat prediction, the U-Net obtains a test-set root mean squared error 9% to 13% lower than that of PLS regression. At the same time, the U-Net generates finely detailed chemical maps in which 99.91% of the variance is spatially correlated. By contrast, only 2.53% of the variance in the PLS-generated chemical maps is spatially correlated, indicating that each pixel-wise prediction is largely independent of its neighboring pixels. Additionally, while the PLS-generated chemical maps contain predictions far outside the physically possible range of 0%–100%, the U-Net learns to stay inside this range. The findings of this study thus indicate that U-Net is superior to PLS for chemical map generation.
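
To make the pixel-wise baseline concrete, the sketch below applies one linear model independently to every pixel spectrum, then averages the map to obtain a sample-level prediction and counts pixels that fall outside 0–100. It is only an illustration: the tiny image cube, the coefficients, and the function names `pixelwise_map` and `summarize` are made up here and are not taken from the paper.

```python
def pixelwise_map(cube, weights, bias):
    """Apply one linear model independently to every pixel spectrum."""
    return [[sum(w * x for w, x in zip(weights, spectrum)) + bias
             for spectrum in row]
            for row in cube]


def summarize(chem_map, low=0.0, high=100.0):
    """Return the mean prediction and the fraction of pixels outside [low, high]."""
    pixels = [value for row in chem_map for value in row]
    mean_value = sum(pixels) / len(pixels)
    outside = sum(1 for value in pixels if value < low or value > high)
    return mean_value, outside / len(pixels)


# Hypothetical 2x2 image with 3 spectral bands per pixel.
cube = [
    [[0.2, 0.4, 0.1], [0.9, 0.8, 0.7]],
    [[0.0, 0.1, 0.0], [0.5, 0.5, 0.5]],
]
weights = [100.0, 50.0, -20.0]  # made-up regression coefficients
bias = -15.0

chem_map = pixelwise_map(cube, weights, bias)
mean_fat, frac_outside = summarize(chem_map)
print(chem_map, mean_fat, frac_outside)
```

Because each pixel is predicted in isolation, nothing in this kind of model ties a pixel to its neighbors or keeps its value inside the physical range, which is the limitation the abstract describes.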

Citations: 0
Spectral Wavelength Selection Method Based on Improved Particle Swarm Optimization Idea and Simulated Annealing Strategy
IF 2.1 Zone 4 (Chemistry) Q1 SOCIAL WORK Pub Date: 2025-07-15 DOI: 10.1002/cem.70050
Ying Dong, Weida Wang, Nanfeng Zhang, Jinming Liu

Wavelength selection (WS) is an effective way to deal with the many uncorrelated and collinear variables in high-dimensional spectral data, which seriously degrade modeling accuracy and efficiency. To address the excessive number of wavelength variables selected by the particle swarm optimization (PSO) algorithm and its premature convergence, this paper proposes a novel spectral WS approach, iPSOSA, based on an improved PSO idea and a simulated annealing (SA) strategy. iPSOSA applies the velocity and position update ideas of PSO to the guided shift evolution of the binary bits with the value “1” in a particle and integrates the perturbation strategy of the SA Metropolis acceptance criterion. It effectively resolves the premature convergence of PSO and overcomes the low efficiency of SA evolution, yielding highly efficient WS. An evaluation of the modeling performance of different intelligent WS methods on two public spectral datasets, from soil and maize, showed that iPSOSA outperforms full-spectrum modeling and three other comparative algorithms. The best iPSOSA partial least squares regression models for soil organic matter and maize moisture content have excellent regression performance, with a validation-set coefficient of determination above 0.98, a relative root mean squared error below 1.50%, and a residual predictive deviation above 8.00. iPSOSA offers better overall WS performance than traditional intelligent algorithms in terms of modeling performance, variable dimensionality, and search efficiency, providing a new solution for obtaining highly correlated wavelength variables in spectral modeling.
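
The Metropolis acceptance criterion named in the abstract can be sketched in a few lines. The code below is a generic simulated-annealing step for a binary wavelength mask, not the iPSOSA algorithm itself: the random `shift_one_bit` move is a simple stand-in for the PSO-guided shift, and both function names are hypothetical.

```python
import math
import random


def metropolis_accept(current_cost, candidate_cost, temperature, rng):
    """Metropolis criterion for minimisation: always accept an improvement,
    accept a worse candidate with probability exp(-delta / temperature)."""
    if candidate_cost <= current_cost:
        return True
    delta = candidate_cost - current_cost
    return rng.random() < math.exp(-delta / temperature)


def shift_one_bit(mask, rng):
    """Move one selected wavelength (a 1 bit) to a currently unselected
    position, so the number of selected wavelengths stays constant."""
    ones = [i for i, bit in enumerate(mask) if bit == 1]
    zeros = [i for i, bit in enumerate(mask) if bit == 0]
    if not ones or not zeros:
        return list(mask)
    new_mask = list(mask)
    new_mask[rng.choice(ones)] = 0
    new_mask[rng.choice(zeros)] = 1
    return new_mask


rng = random.Random(0)
mask = [1, 0, 1, 0, 0]            # hypothetical mask over 5 wavelengths
candidate = shift_one_bit(mask, rng)
print(candidate, metropolis_accept(1.0, 1.2, temperature=0.5, rng=rng))
```

Accepting an occasional worse mask at high temperature is what lets an annealing search escape a local optimum, which is the remedy for premature convergence that the abstract attributes to the SA component.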

Citations: 0
A Method for Measuring Similarity or Distance of Molecular and Arbitrary Graphs Based on a Collection of Topological Indices
IF 2.1 Zone 4 (Chemistry) Q1 SOCIAL WORK Pub Date: 2025-07-15 DOI: 10.1002/cem.70047
Mert Sinan Oz

Comparing graphs by means of quantitative structural similarity or distance measures has an important place in many scientific disciplines. Two of these are cheminformatics and chemical graph theory, in which the structural similarity or distance between molecular graphs is analyzed by calculating the Jaccard/Tanimoto index based on molecular fingerprints. A novel method is proposed to measure the structural similarity or distance for molecular and arbitrary graphs. This method calculates the Jaccard/Tanimoto index based on a collection of topological indices embedded in the entries of a vector. We statistically compare the proposed method with the calculation of Jaccard/Tanimoto indices based on five different molecular fingerprints on alkane and cycloalkane isomers. Furthermore, to explore how the method works on non-molecular graphs, we statistically analyze it on the set of all connected graphs with seven vertices. The Jaccard/Tanimoto index values produced by the proposed method cover the value domain. In addition, the method provides a discrete similarity distribution with clustering, which makes differences clear and comparison convenient. Two notable features of the proposed method are its applicability to arbitrary graphs and the polynomial computational complexity of its algorithm in the number of graphs and in the numbers of vertices and edges of the graphs.
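
As a rough illustration of the idea, the sketch below builds a small vector of topological indices for each graph and compares two graphs with the continuous Tanimoto coefficient. The choice of indices (Wiener and first Zagreb) and the real-valued Tanimoto formula are assumptions made for this example; the abstract does not state which indices or which variant of the index the paper uses.

```python
from collections import deque


def wiener_index(adj):
    """Sum of shortest-path distances over all unordered vertex pairs (BFS)."""
    total = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total // 2  # every pair was counted from both endpoints


def first_zagreb_index(adj):
    """Sum of squared vertex degrees."""
    return sum(len(neighbours) ** 2 for neighbours in adj.values())


def index_vector(adj):
    return [wiener_index(adj), first_zagreb_index(adj)]


def tanimoto(a, b):
    """Continuous Tanimoto coefficient: a.b / (|a|^2 + |b|^2 - a.b)."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 1.0


# Carbon skeletons of two C4 alkane isomers as adjacency lists.
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

similarity = tanimoto(index_vector(n_butane), index_vector(isobutane))
print(index_vector(n_butane), index_vector(isobutane), similarity)
```

Both indices here are computable in polynomial time in the number of vertices and edges, which is consistent with the complexity claim in the abstract, and nothing in the code requires the graph to be a molecule.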

Citations: 0
MultANOVA Followed by Post Hoc Analyses for Designed High-Dimensional Data: A Comprehensive Framework That Outperforms ASCA, rMANOVA, and VASCA
IF 2.1 Zone 4 (Chemistry) Q1 SOCIAL WORK Pub Date: 2025-07-14 DOI: 10.1002/cem.70039
Benjamin Mahieu, Véronique Cariou

Analytical platforms generate high-dimensional data, where the number of variables usually exceeds the number of observations. Such data are frequently derived from an experimental design, where samples have been collected to identify potential variation in the factors or interactions of interest. To circumvent issues related to large data sizes when evaluating factor and interaction effects, ANOVA simultaneous component analysis (ASCA), regularized multivariate analysis of variance (rMANOVA), and variable selection ASCA (VASCA) have been proposed previously. However, they require computationally intensive methods to test the effects of factors and interactions. In the present paper, multiple ANOVAs (MultANOVA) is proposed as a simple yet effective alternative to the above methods. MultANOVA has the advantage of being direct and fast, as it does not rely on intensive calculation methods, while incorporating a variable selection strategy. This method entails the execution of multiple ANOVAs, one per variable, with multiple test corrections. Subsequent post hoc analyses are also introduced. These encompass multiple least-squares difference tests (MultLSD) for the pairwise comparison of multivariate least-squares means and diagonal canonical discriminant analysis (DCDA) with approximate confidence ellipses to visualize significant effects. MultANOVA is compared to the aforementioned methods based on simulations, which demonstrate that it holds the nominal alpha risk as opposed to rMANOVA and VASCA, while being more powerful than ASCA and VASCA. Even though MultANOVA is proven less powerful than VASCA for variable selection, it has been demonstrated to hold the nominal risk, whereas VASCA does not. Finally, the MultANOVA framework is illustrated based on metagenomics, metabolomics, and spectroscopic data.
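
The "one ANOVA per variable, then correct for multiple testing" recipe can be sketched as follows. This is a generic illustration rather than the authors' implementation: the abstract does not say which correction MultANOVA applies, so Benjamini-Hochberg is assumed here, and the p-values in the example are hard-coded placeholders (in practice they would come from the F distribution, for instance via `scipy.stats.f.sf`).

```python
def one_way_f(groups):
    """F statistic of a one-way ANOVA for a single variable.
    groups: list of lists, one list of observations per factor level."""
    values = [v for g in groups for v in g]
    grand_mean = sum(values) / len(values)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per variable: True where it is declared significant
    while controlling the false discovery rate at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0  # largest rank k with p_(k) <= k / m * alpha
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant


# One variable measured at three factor levels (made-up numbers).
f_stat = one_way_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])

# Placeholder p-values for five variables, one ANOVA each.
selected = benjamini_hochberg([0.001, 0.045, 0.025, 0.2, 0.008])
print(f_stat, selected)
```

The variables flagged `True` form the selected subset, which is how a per-variable testing scheme doubles as a variable selection strategy without any resampling.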

Citations: 0
The Classification Limit of Detection: Estimating Sample-Level Classification Uncertainty in Spectroscopy Using Monte Carlo Error Propagation of Spectral Noise
IF 2.1 Zone 4 (Chemistry) Q1 SOCIAL WORK Pub Date: 2025-07-12 DOI: 10.1002/cem.70048
Helder V. Carneiro, Caelin P. Celani, Karl S. Booksh

This study presents a novel Monte Carlo–based methodology for estimating classification uncertainty in chemometric models by propagating spectral measurement noise. Unlike traditional approaches that treat classification as deterministic, this method simulates realistic noise structures, both independent and correlated, captured from multiple spectrum measurements to quantify sample-specific uncertainty. The technique is applicable to both linear and non-linear models, including partial least squares discriminant analysis (PLS-DA) and various support vector machine (SVM) kernels. The methodology was validated using three datasets: synthetic 2D simulations for controlled model geometry, X-ray fluorescence (XRF) spectra from colored glass rods, and laser-induced breakdown spectroscopy (LIBS) data from Dalbergia wood species. Results revealed that uncertainty increases with spectral similarity and perpendicular alignment between noise structures and decision boundaries. In real-world applications, classification metrics alone proved insufficient to assess model reliability. The inclusion of uncertainty intervals enabled identification of ambiguous predictions even in cases of perfect classification accuracy. This work advances chemometric analysis by linking measurement uncertainty to classification outcomes, offering a robust framework for decision-making in high-stakes analytical contexts.
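
A minimal sketch of Monte Carlo noise propagation through a classifier is given below. It assumes independent Gaussian noise per channel and a hypothetical two-class linear rule; the paper also treats correlated noise and uses PLS-DA and SVM models, none of which is reproduced here. All names and numbers are illustrative.

```python
import random


def classify(spectrum, weights, bias):
    """Hypothetical linear two-class rule: class 1 if the score is positive."""
    score = sum(w * x for w, x in zip(weights, spectrum)) + bias
    return 1 if score > 0 else 0


def class_probability(spectrum, noise_sd, weights, bias, n_draws=2000, seed=0):
    """Fraction of noisy replicates of one spectrum that are assigned to class 1."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        # Independent Gaussian noise per channel (an assumption of this sketch).
        noisy = [x + rng.gauss(0.0, sd) for x, sd in zip(spectrum, noise_sd)]
        hits += classify(noisy, weights, bias)
    return hits / n_draws


weights, bias = [1.0, 1.0], -4.0   # made-up decision boundary
noise_sd = [0.1, 0.1]              # made-up per-channel noise level

p_far = class_probability([5.0, 5.0], noise_sd, weights, bias)
p_edge = class_probability([2.0, 2.0], noise_sd, weights, bias)
print(p_far, p_edge)
```

Both samples can be "correctly" classified by the noise-free model, yet the second sits on the decision boundary and flips class in about half of the replicates. This is the kind of sample-level ambiguity that accuracy figures alone do not reveal.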

Citations: 0
A Dynamic Iterative Data Cleaning Strategy Based on Model Feedback to Enhance the Prediction Accuracy of Nanocellulose Emulsions
IF 2.1 Zone 4 (Chemistry) Q1 SOCIAL WORK Pub Date: 2025-07-12 DOI: 10.1002/cem.70046
Long Wang, Zi'ang Xia, Yao Zhang, Xiaoyu Liu, Chaojie Li, Xue Li, Jiahao Dai, Mingshun Bi, Jingxue Yang, Heng Zhang

The effectiveness of artificial neural networks, a key technology in artificial intelligence, depends greatly on the quality of the input data. Data cleaning, a crucial component of data preprocessing, plays a vital role in enhancing the accuracy, robustness, and generalization capabilities of neural network models. In this study, a Feedback-Driven Iterative Cleaning (FDIC) framework, guided by model performance, was developed and applied to droplet size prediction models for nanocellulose-stabilized Pickering emulsion systems. After randomly removing between 1% and 40% of the data, an artificial neural network model was established using CNC particle size (X1), CNC concentration (X2), and the oil–water volume ratio of CNC to oil-phase monomer (X3) as input variables, with emulsion droplet size (Y) as the quantitative index. The model's accuracy after data removal was evaluated using the coefficient of determination (R²), mean squared error (MSE), and mean absolute scaled error (MASE). The main finding was that targeted removal of a small portion of the data significantly improved the predictive power of the model. Specifically, removing 5% of the dataset gave optimal performance, with R² improving from 0.5307 without cleaning to 0.7258, an MSE of 183.4917, and a MASE of 0.4060. This result indicates a significant and quantifiable improvement in model accuracy through the iterative cleaning process. The study also revealed a nonlinear relationship between the number of iterations and the model's generalization ability. This finding offers a novel methodological tool for data governance in the smart era and demonstrates significant value in dynamic environments.
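
For reference, two of the evaluation metrics named in the abstract can be computed as in the sketch below. The droplet-size values are invented for illustration. MASE is left out on purpose, because the abstract does not state which baseline the error is scaled by.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted droplet sizes.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

print(mse(y_true, y_pred), r_squared(y_true, y_pred))
```

In a feedback-driven cleaning loop, metrics like these would be recomputed on held-out data after each removal step, so that a removal is kept only when the scores actually improve.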

Citations: 0
Nondestructive Identification of Paper Based on Relative Formation Time Using Three-Dimensional Fluorescence Spectroscopy Combined With Supervised Learning
IF 2.1 Zone 4 (Chemistry) Q1 SOCIAL WORK Pub Date: 2025-07-11 DOI: 10.1002/cem.70043
Xiaohong Chen, Yuhuan He, Lan Cui, Hongda Li, Xiaojing Wu

To achieve nondestructive analysis and identification of the relative formation time of paper evidence, and to address the difficulty of document authenticity identification in forensic science, this study selected three-dimensional fluorescence spectroscopy data of paper evidence of the same brand and model collected in the same storage environment within the last decade (2012–2023). After preprocessing steps such as scatter removal, noise smoothing, and principal component analysis (PCA), machine learning algorithms such as K-nearest neighbor (KNN) and linear discriminant analysis (LDA) were employed to classify and predict specific feature bands. The accuracy of KNN and LDA was 94.5% and 98.9%, respectively. Furthermore, LDA was used to predict the relative formation time of paper samples in the sample library, achieving an accuracy of 98.0%. Finally, the established model was successfully applied to an actual case involving suspected “forged official documents.” It accurately determined the relative formation time of the forged paper, and the analysis results were consistent with the suspect's confession.
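
The KNN classification step can be illustrated with a few lines of code. The sketch assumes each paper sample has already been reduced to a short feature vector (for example, PCA scores); the coordinates and year labels below are invented, and the function `knn_predict` is not taken from the paper.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training samples closest to the query
    (Euclidean distance)."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2-D scores for paper samples from two production years.
train_x = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
           [1.0, 1.1], [0.9, 1.0], [1.1, 0.9]]
train_y = [2012, 2012, 2012, 2023, 2023, 2023]

print(knn_predict(train_x, train_y, [0.15, 0.1]))
print(knn_predict(train_x, train_y, [0.95, 1.0]))
```

A questioned document would be measured, preprocessed in the same way as the reference library, and then assigned the year of its nearest reference samples.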

Citations: 0
XAI-2DCOS: Enhancing Interpretability in Spectral Deep Learning Models Through 2D Correlation Spectroscopy
IF 2.1 Zone 4, Chemistry Q1 SOCIAL WORK Pub Date: 2025-07-11 DOI: 10.1002/cem.70045
Jhonatan Contreras, Thomas Bocklitz

Deep learning (DL) has significantly advanced Raman spectra analysis, achieving high accuracy and efficiency. However, the complexity and opacity of DL models limit their application in areas where understanding and transparency are essential. To address this, we present XAI-2DCOS, an innovative eXplainable Artificial Intelligence (XAI) framework that employs 2D correlation spectroscopy (2DCOS). Traditionally, 2DCOS reveals the sequence of molecular changes under varying conditions. We repurpose it to enhance the interpretability of DL models by linking changes in spectral features to model outputs, identifying critical wavenumbers, and showing how their variations affect model accuracy. We applied XAI-2DCOS to a DL model trained on a dataset of oil Raman spectra, demonstrating its ability to identify critical spectral features that align with domain knowledge. To improve robustness, we integrated a conditional generative adversarial network (CGAN) for data augmentation. CGAN generates synthetic data, ensuring the presence of spectra across the entire probability range. A normalized relevance score quantifies the contribution of each wavenumber to the model's prediction. A predictive probability map delineates decision boundaries within the PCA space. Synchronous 2DCOS maps are used to guide spectral adjustments that improve prediction confidence for specific class predictions. These adjustments can affect multiple output classes with differential scaling of output activations, suggesting that crossing a threshold can shift the model decision. Our results show that XAI-2DCOS improves the interpretability and reliability of DL models applied to Raman spectra. Furthermore, CGAN data augmentation extends the applicability of XAI-2DCOS to smaller datasets.
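For readers unfamiliar with the synchronous 2DCOS maps mentioned in this abstract, the sketch below computes the standard synchronous correlation spectrum, Φ(ν1, ν2) = (1/(m − 1)) Σ_j ỹ_j(ν1) ỹ_j(ν2), where ỹ_j is the j-th spectrum after subtracting the mean spectrum. The toy spectra and the function name `synchronous_2dcos` are assumptions made for illustration; how XAI-2DCOS couples these maps to model outputs is not specified in the abstract and is not reproduced here.

```python
def synchronous_2dcos(spectra):
    """Return the n x n synchronous 2D correlation matrix for m spectra of n points each."""
    m = len(spectra)
    if m < 2:
        raise ValueError("at least two spectra are needed")
    n = len(spectra[0])
    # Reference spectrum: the mean over the perturbation series.
    mean = [sum(s[i] for s in spectra) / m for i in range(n)]
    # Dynamic spectra: deviation of each spectrum from the reference.
    dynamic = [[s[i] - mean[i] for i in range(n)] for s in spectra]
    # Synchronous map: covariance between every pair of spectral channels.
    return [
        [sum(d[i] * d[j] for d in dynamic) / (m - 1) for j in range(n)]
        for i in range(n)
    ]


# Toy series: channels 0 and 1 rise together, channel 2 stays constant.
toy_spectra = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 3.0],
    [3.0, 6.0, 3.0],
]
sync_map = synchronous_2dcos(toy_spectra)
print(sync_map)  # [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [0.0, 0.0, 0.0]]
```

The positive off-diagonal entry between channels 0 and 1 indicates that their intensities change in the same direction, while the constant channel contributes nothing to the map.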

Citations: 0
Editorial: Honoring Prof. Age K. Smilde
IF 2.1 Zone 4, Chemistry Q1 SOCIAL WORK Pub Date: 2025-07-10 DOI: 10.1002/cem.70052
Rasmus Bro
It is both a privilege and an emotional moment for me to write this editorial for the special issue of the Journal of Chemometrics honoring Prof. Age K. Smilde, who recently retired. For me, and for countless others in our field, Prof. Smilde (also more informally known as Age) has been more than a scholar; he has been a mentor, a collaborator, and an inspiration whose contributions have left a huge mark on the world of chemometrics.

Looking back, it feels almost surreal to think of my early days in academia, 30 years ago, when I was navigating the complex world of multi-way tensor analysis. At the time, Age seemed to me to be the quintessential ‘all-knowing’ professor. His mastery of the field, combined with a willingness to mentor and nurture young scientists, made a profound difference in my career. I remember a conference where he explained the complexity of tensor rank. I quickly grasped the problem and slightly arrogantly said: I will fix it. I tried. I was very fast and 100% wrong. I never managed to make even the slightest progress!

He played a pivotal role in helping me craft some of my earliest papers, including one of the first approaches to tensor regression. Our discussions on the properties of multi-way arrays and their applications remain etched in my memory, not just as lessons in science but as moments of shared curiosity.

Age's career is nothing short of extraordinary. From his foundational work at the University of Groningen to his tenure at the University of Amsterdam, where he led the group later known as Biosystems Data Analysis, Age has consistently been at the forefront of methodological advancements, and not just in chemometrics. His work on multi-way analysis, data integration, and systems biology has truly shaped the respective fields. It is no surprise that he has been honored with numerous awards, such as the prestigious Herman Wold Gold Medal and the Kowalski Award, reflecting his pioneering contributions and global recognition.

What sets Age apart is his ability to foster collaboration and build bridges within the scientific community. He introduced me to some of the most significant researchers not only in chemometrics but also in psychometrics, widening my horizons and opening doors that would otherwise have remained closed. His efforts to create platforms for collaboration, such as co-founding TRICAP and contributing to international chemometric meetings, have enriched our discipline.

Reflecting on the arc of our careers, I cannot help but smile at the realization that the ‘old’ professor who once seemed so far ahead of me is, in fact, only a few years my senior. Time has a way of leveling us, and today I count Age as not only a colleague but also a dear friend and peer. His wisdom, humility, and warmth continue to inspire, and his legacy will undoubtedly endure through the countless students, collaborators, and researchers he has influenced.

This special issue is a testament to Prof. Smilde's impact on our field. It brings together contributions from researchers whose work has been shaped by his ideas, guidance, and collaboration. It is a most fitting tribute to a scientist like Age. On behalf of all who have had the privilege of working with Prof. Smilde, thank you, Age, for your tireless contributions, your mentorship, and your friendship. We celebrate not only your extraordinary career but also the person behind it: a true giant of chemometrics.
Citations: 0
Accurate and Rational Collision Cross Section Prediction Using Voxel-Projected Area and Deep Learning
IF 2.1 Zone 4, Chemistry Q1 SOCIAL WORK Pub Date: 2025-07-08 DOI: 10.1002/cem.70040
Jiongyu Wang, Yuxuan Liao, Ting Xie, Ruixi Chen, Jiahui Lai, Zhimin Zhang, Hongmei Lu

Ion mobility spectrometry–mass spectrometry (IMS-MS) enables rapid acquisition of collision cross section (CCS), a critical physicochemical property for analyte characterization. Despite CCS being theoretically defined as the rotationally averaged projected area of 3D atomic spheres, existing models have underutilized this geometric insight. Here, we present a projected area–based CCS prediction method (PACCS). It integrates voxel-projected area approximation, graph neural network (GNN)–extracted features, and m/z to achieve accurate and rational CCS prediction. A voxel-based algorithm efficiently calculates molecular projected areas by leveraging Fibonacci grid sampling and discretizing 3D conformers into voxel grids. PACCS demonstrates exceptional performance, achieving a median relative error (MedRE) of 1.03% and a coefficient of determination (R²) of 0.994 on the test set. An external test-set comparison against AllCCS2, GraphCCS, SigmaCCS, CCSbase, and DeepCCS highlights the superiority of PACCS, with 80.1% of predictions exhibiting < 3% error. Notably, PACCS exhibits broad applicability across diverse molecular types, including environmental contaminants (R² = 0.954–0.979) and structurally complex phycotoxins (R² = 0.961), highlighting the superiority of PACCS in robustness and versatility. Computational efficiency is enhanced via parallelization, enabling large-scale CCS database generation (e.g., 5.9 million entries for ChEMBL within 10 h). Ablation studies confirm the pivotal role of voxel-projected areas (Pearson correlation coefficients > 0.988), while stability analyses reveal minimal sensitivity to conformational variability (standard deviation of R² is 0.00003). PACCS provides an open-source, scalable solution for expanding CCS databases, advancing compound identification in metabolomics and environmental analysis.
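The geometric idea in this abstract, a rotationally averaged projected area estimated over Fibonacci-sampled view directions, can be sketched as follows. This is an illustrative simplification and not the PACCS implementation: it rasterizes each 2D shadow onto a pixel grid rather than voxelizing the 3D conformer, and the atom coordinates, radii, grid spacing, and all function names are hypothetical.

```python
import math


def fibonacci_sphere(n):
    """Return n roughly evenly spaced unit vectors on the sphere (Fibonacci lattice)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    directions = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n          # z strictly inside (-1, 1)
        ring = math.sqrt(1.0 - z * z)           # radius of the circle at height z
        phi = golden_angle * i
        directions.append((ring * math.cos(phi), ring * math.sin(phi), z))
    return directions


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])


def plane_basis(d):
    """Two orthonormal vectors spanning the plane perpendicular to unit vector d."""
    helper = (0.0, 0.0, 1.0) if abs(d[2]) < 0.9 else (1.0, 0.0, 0.0)
    u = cross(d, helper)
    norm = math.sqrt(sum(c * c for c in u))
    u = tuple(c / norm for c in u)
    return u, cross(d, u)


def projected_area(atoms, direction, cell=0.05):
    """Area of the union of atom discs projected along `direction`, counted on a grid.

    atoms: iterable of (x, y, z, radius); cell: grid spacing in the same length unit.
    """
    u, v = plane_basis(direction)
    occupied = set()
    for x, y, z, radius in atoms:
        cu = x * u[0] + y * u[1] + z * u[2]     # disc centre in plane coordinates
        cv = x * v[0] + y * v[1] + z * v[2]
        for i in range(math.floor((cu - radius) / cell), math.floor((cu + radius) / cell) + 1):
            for j in range(math.floor((cv - radius) / cell), math.floor((cv + radius) / cell) + 1):
                du = (i + 0.5) * cell - cu      # offset of the cell centre from the disc centre
                dv = (j + 0.5) * cell - cv
                if du * du + dv * dv <= radius * radius:
                    occupied.add((i, j))        # the set merges overlapping discs
    return len(occupied) * cell * cell


def mean_projected_area(atoms, n_directions=32, cell=0.05):
    """Average the projected area over Fibonacci-sampled view directions."""
    directions = fibonacci_sphere(n_directions)
    return sum(projected_area(atoms, d, cell) for d in directions) / n_directions


# Hypothetical two-atom "molecule": coordinates and radii are made up for illustration.
toy_atoms = [(0.0, 0.0, 0.0, 1.0), (5.0, 0.0, 0.0, 1.0)]
print(mean_projected_area(toy_atoms))
```

For a single sphere the estimate should approach π·r² as the grid spacing shrinks, which is a convenient sanity check. According to the abstract, PACCS feeds such an area into a learned model together with GNN features and m/z rather than using it as the CCS value directly.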

Citations: 0