Journal of Chemometrics: Latest Articles (Page 7)

Editorial: Honoring Prof. Age K. Smilde

IF 2.1 | Zone 4 (Chemistry) | Q1 SOCIAL WORK

Journal of Chemometrics

Pub Date : 2025-07-10 DOI: 10.1002/cem.70052

Rasmus Bro

It is both a privilege and an emotional moment for me to write this editorial for the special issue of the Journal of Chemometrics honoring Prof. Age K. Smilde, who recently retired. For me, and for countless others in our field, Prof. Smilde (also more informally known as Age) has been more than a scholar; he has been a mentor, a collaborator, and an inspiration whose contributions have left a huge mark on the world of chemometrics.

Looking back, it feels almost surreal to think of my early days in academia, 30 years ago, when I was navigating the complex world of multi-way tensor analysis. At the time, Age seemed to me to be the quintessential ‘all-knowing’ professor. His mastery of the field, combined with a willingness to mentor and nurture young scientists, made a profound difference in my career. I remember a conference where he explained the complexity of tensor rank. I quickly grasped the problem and slightly arrogantly said: I will fix it. I tried. I was very fast and 100% wrong. I never managed to make even the slightest progress!

He played a pivotal role in helping me craft some of my earliest papers, including one of the first approaches to tensor regression. Our discussions on the properties of multi-way arrays and their applications remain etched in my memory, not just as lessons in science but as moments of shared curiosity.

Age's career is nothing short of extraordinary. From his foundational work at the University of Groningen to his tenure at the University of Amsterdam, where he led the group later known as Biosystems Data Analysis, Age has consistently been at the forefront of methodological advancements, and not just in chemometrics. His work on multi-way analysis, data integration, and systems biology has truly shaped the respective fields. It is no surprise that he has been honored with numerous awards, such as the prestigious Herman Wold Gold Medal and the Kowalski Award, reflecting his pioneering contributions and global recognition.

What sets Age apart is his ability to foster collaboration and build bridges within the scientific community. He introduced me to some of the most significant researchers not only in chemometrics but also in psychometrics, widening my horizons and opening doors that would otherwise have remained closed. His efforts to create platforms for collaboration, such as co-founding TRICAP and contributing to international chemometric meetings, have enriched our discipline.

Reflecting on the arc of our careers, I cannot help but smile at the realization that the ‘old’ professor who once seemed so far ahead of me is, in fact, only a few years my senior. Time has a way of leveling us, and today I count Age as not only a colleague but also a dear friend and peer. His wisdom, humility, and warmth continue to inspire, and his legacy will undoubtedly endure through the countless students, collaborators, and researchers he has influenced.

This special issue is a testament to Prof. Smilde's impact on our field. It brings together contributions from researchers whose work has been shaped by his ideas, guidance, and collaboration. It is the most fitting tribute to a scientist like Age. On behalf of all who have had the privilege of working with Prof. Smilde, thank you, Age, for your tireless contributions, your mentorship, and your friendship. We celebrate not only your remarkable career but also the person behind it, a true giant of chemometrics.


Accurate and Rational Collision Cross Section Prediction Using Voxel-Projected Area and Deep Learning

Journal of Chemometrics

Pub Date : 2025-07-08 DOI: 10.1002/cem.70040

Jiongyu Wang, Yuxuan Liao, Ting Xie, Ruixi Chen, Jiahui Lai, Zhimin Zhang, Hongmei Lu

Ion mobility spectrometry–mass spectrometry (IMS-MS) enables rapid acquisition of collision cross section (CCS), a critical physicochemical property for analyte characterization. Despite CCS being theoretically defined as the rotationally averaged projected area of 3D atomic spheres, existing models have underutilized this geometric insight. Here, we present a projected area–based CCS prediction method (PACCS). It integrates voxel-projected area approximation, graph neural network (GNN)–extracted features, and m/z to achieve accurate and rational CCS prediction. A voxel-based algorithm efficiently calculates molecular projected areas by leveraging Fibonacci grid sampling and discretizing 3D conformers into voxel grids. PACCS demonstrates exceptional performance, achieving a median relative error (MedRE) of 1.03% and a coefficient of determination (R²) of 0.994 on the test set. Comparison on an external test set against AllCCS2, GraphCCS, SigmaCCS, CCSbase, and DeepCCS highlights the superiority of PACCS, with 80.1% of predictions exhibiting < 3% error. Notably, PACCS exhibits broad applicability across diverse molecular types, including environmental contaminants (R² = 0.954–0.979) and structurally complex phycotoxins (R² = 0.961), underscoring its robustness and versatility. Computational efficiency is enhanced via parallelization, enabling large-scale CCS database generation (e.g., 5.9 million entries for ChEMBL within 10 h). Ablation studies confirm the pivotal role of voxel-projected areas (Pearson correlation coefficients > 0.988), while stability analyses reveal minimal sensitivity to conformational variability (standard deviation of R² is 0.00003). PACCS provides an open-source, scalable solution for expanding CCS databases, advancing compound identification in metabolomics and environmental analysis.


Frequency-Domain Alignment of Heterogeneous, Multidimensional Separations Data Through Complex Orthogonal Procrustes Analysis

Journal of Chemometrics

Pub Date : 2025-07-07 DOI: 10.1002/cem.70042

Michael Sorochan Armstrong

Multidimensional separations data have the capacity to reveal detailed information about complex biological samples. However, data analysis has been an ongoing challenge in the area because the peaks that represent chemical factors may drift along the first- and second-dimension retention times over the course of several analytical runs. This makes higher-level analyses of the data difficult, because a one-to-one comparison of samples is seldom possible without sophisticated preprocessing routines. This work offers a very simple solution to the alignment problem through an orthogonal Procrustes analysis of the frequency-domain representation of the data, in which the relative drift and amplitude of each coefficient are represented as a complex number. Its performance is evaluated on synthetically generated data presenting nonlinear retention distortions, and its applicability to quantitative problems is assessed using experimental calibration and untargeted metabolomics data. This analysis is extremely simple and can be recreated using just a few lines of code, relying only on fast algorithms for matrix multiplication and Fourier transforms.
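
Why a complex number can carry both drift and amplitude follows from the Fourier shift theorem: delaying a signal by s samples multiplies its k-th Fourier coefficient by exp(-2πi·k·s/N), and scaling the signal scales the modulus. The sketch below illustrates only that building block on a synthetic one-dimensional peak. It is not the paper's Procrustes algorithm, and `dft_coefficient`, `gaussian_peak`, and `drift_and_amplitude` are hypothetical helper names introduced here.

```python
import cmath
import math

def dft_coefficient(signal, k):
    """k-th discrete Fourier coefficient of a real signal (naive summation)."""
    n = len(signal)
    return sum(value * cmath.exp(-2j * math.pi * k * t / n)
               for t, value in enumerate(signal))

def gaussian_peak(n, center, width, height=1.0):
    """Synthetic chromatographic peak sampled at n points."""
    return [height * math.exp(-0.5 * ((t - center) / width) ** 2) for t in range(n)]

def drift_and_amplitude(reference, sample, k=1):
    """Read relative drift (in samples) and amplitude from one coefficient ratio.

    The ratio of the two coefficients is a single complex number: its angle
    encodes the shift and its modulus the relative peak height. With k = 1 the
    shift is unambiguous modulo the signal length.
    """
    n = len(reference)
    ratio = dft_coefficient(sample, k) / dft_coefficient(reference, k)
    drift = (-cmath.phase(ratio) * n / (2.0 * math.pi * k)) % n
    return drift, abs(ratio)

reference = gaussian_peak(64, center=20, width=3)
sample = gaussian_peak(64, center=25, width=3, height=0.5)  # later and half as tall
drift, amplitude = drift_and_amplitude(reference, sample)
print(round(drift, 3), round(amplitude, 3))  # approximately 5.0 and 0.5
```

A full alignment would work on all coefficients of many samples at once, which is where a matrix-valued Procrustes rotation replaces the single ratio used here.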


Self-Optimizing Radial Basis Function Support Vector Classifier (SO-RBFSVC)

Journal of Chemometrics

Pub Date : 2025-05-26 DOI: 10.1002/cem.70038

Qudus Ayodeji Thanni, Peter de Boves Harrington

Support vector classifiers (SVCs) typically use radial basis function (RBF) kernels to map data into higher dimensional spaces that may improve the linear separation of otherwise nonseparable classes. We present a novel self-optimizing radial basis function support vector classifier (SO-RBFSVC) that integrates response surface methodology (RSM), two-dimensional cubic spline interpolation, and bootstrapped Latin partitions (BLPs) for automated hyperparameter tuning. The SO-RBFSVC simultaneously optimizes the RBF kernel width (σ) and cost parameter (C) using an interpolated response surface obtained from generalized prediction accuracies. The SO-RBFSVC was compared to other self-optimizing classifiers (super SVC [sSVC] and super partial least squares discriminant analysis [sPLS-DA]). Four datasets were evaluated: (i) hemp and marijuana discrimination using proton nuclear magnetic resonance spectra, (ii) barley growth location using near-infrared spectra, (iii) glass-type identification based on elemental composition, and (iv) wine cultivar classification from physicochemical properties. External validation results showed that SO-RBFSVC performed comparably to the other models, achieving error rates of 0.4 ± 0.5% for hemp/marijuana, 7 ± 1% for glass, and 6 ± 1% for wine, while outperforming the linear models with 10 ± 1% error for the barley NIR data. For the first time, generalized sensitivity analysis (GSA) was applied to quantify model linearity. GSA revealed high nonlinearity in the barley dataset, justifying a nonlinear model. The SO-RBFSVC provides robust, automated classifier tuning for low- and high-dimensional datasets, offering ease of use.
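
The two tuned quantities are easier to picture with the kernel written out. The sketch below assumes the common convention K(a, b) = exp(-‖a − b‖² / (2σ²)) for the RBF kernel; the abstract does not state which parameterization the authors use. The `log_grid` helper and the grid ranges are hypothetical and only show the kind of (σ, C) design over which prediction accuracies could be measured before interpolating a response surface.

```python
import math

def rbf_kernel(a, b, sigma):
    """RBF similarity between two feature vectors for kernel width sigma."""
    sq_dist = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def kernel_matrix(samples, sigma):
    """Pairwise RBF similarities; this is what an SVC solver would consume."""
    return [[rbf_kernel(a, b, sigma) for b in samples] for a in samples]

def log_grid(low_exp, high_exp, steps):
    """Values spaced evenly on a log10 scale, e.g. 0.1 ... 10 (assumed design)."""
    step = (high_exp - low_exp) / (steps - 1)
    return [10.0 ** (low_exp + i * step) for i in range(steps)]

# Candidate (sigma, C) pairs at which accuracy would be evaluated.
candidates = [(sigma, cost) for sigma in log_grid(-1, 1, 5) for cost in log_grid(0, 3, 4)]

samples = [(0.0, 0.0), (3.0, 4.0)]
narrow = kernel_matrix(samples, sigma=1.0)   # distant points look unrelated
wide = kernel_matrix(samples, sigma=5.0)     # same points look fairly similar
```

A small σ makes every sample similar only to itself, which invites overfitting, while a very large σ makes the kernel nearly constant; the cost C then trades margin width against training errors, so the two parameters are usually tuned jointly.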


How Are Chemometric Models Validated? A Systematic Review of Linear Regression Models for NIRS Data in Food Analysis

Journal of Chemometrics

Pub Date : 2025-05-19 DOI: 10.1002/cem.70036

Jokin Ezenarro, Daniel Schorn-García

Chemometric models play a critical role in the spectroscopic analysis of food, particularly with near-infrared spectroscopy (NIRS), enabling the accurate prediction and monitoring of physicochemical properties. Although chemometric methods have proven to be useful tools in NIRS analysis, their reliability depends on rigorous validation to ensure the rigour of their predictions and their applicability. This systematic review examines validation strategies applied to regression models in NIRS-based food analysis, emphasising the use of cross-validation, external validation and figures of merit (FoM) as key evaluation tools. This comprehensive literature search identified trends in validation methodologies, highlighting frequent reliance on partial least squares (PLS) regression and common flaws in validation methodologies and their reporting. While external validation is considered the best approach, many studies lack it and employ cross-validation methods solely, which may lead to overoptimistic model performance estimates. Furthermore, inconsistencies in the selection and definition of FoM hinder direct comparison across studies. This review underscores the need for increased methodological transparency and rigour in the validation of chemometric models to enhance their reliability.
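
The gap between cross-validation and external validation can be shown with a toy calculation. The sketch below is an assumption-laden illustration, not a procedure from the review: it uses a one-predictor least-squares line in place of PLS, interleaved folds for cross-validation, and a made-up external set with a constant offset of 0.5 standing in for a new batch or instrument. The figures of merit computed are the cross-validated and prediction root mean squared errors, labeled RMSECV and RMSEP here.

```python
import math

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rmse(y_true, y_pred):
    """Root mean squared error between reference and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def rmsecv(x, y, k):
    """k-fold cross-validated RMSE on the calibration set (interleaved folds)."""
    n = len(x)
    squared_error = 0.0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        held_out = [i for i in range(n) if i % k == fold]
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        squared_error += sum((y[i] - (slope * x[i] + intercept)) ** 2 for i in held_out)
    return math.sqrt(squared_error / n)

def rmsep(x_cal, y_cal, x_ext, y_ext):
    """RMSE of prediction on an external set never used for calibration."""
    slope, intercept = fit_line(x_cal, y_cal)
    return rmse(y_ext, [slope * v + intercept for v in x_ext])

# Calibration samples follow y = 2x + 1 exactly; the external samples carry
# an extra offset of 0.5 that the calibration data never saw.
x_cal = [float(i) for i in range(10)]
y_cal = [2.0 * v + 1.0 for v in x_cal]
x_ext = [10.0, 11.0, 12.0]
y_ext = [2.0 * v + 1.5 for v in x_ext]

print(rmsecv(x_cal, y_cal, k=5))            # essentially zero
print(rmsep(x_cal, y_cal, x_ext, y_ext))    # about 0.5
```

Cross-validation reports a near-perfect model because every fold comes from the same population, while the external set exposes the offset; this is the overoptimism the review warns about.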


De Novo Design of HIV-1 Integrase-LEDGF/p75 Inhibitors Through Deep Reinforcement Learning and Virtual Screening

Journal of Chemometrics

Pub Date : 2025-05-12 DOI: 10.1002/cem.70037

Hai-Bo Sun, Hai-Long Wu, Tong Wang, An-Qi Chen, Ru-Qin Yu

Human immunodeficiency virus (HIV) has far-reaching impacts on global public health. Acquired immunodeficiency syndrome (AIDS) has caused millions of deaths globally, with thousands still getting infected. Therefore, developing HIV-1 integrase inhibitors is crucial for controlling AIDS by slowing virus replication and transmission. This study is grounded in the framework of deep reinforcement learning, aiming to design de novo inhibitors of the HIV-1 integrase-Lens Epithelial-Derived Growth Factor/p75 interaction and subsequently employing molecular docking to screen potential therapeutic compounds. Initially, a molecular generation model was established based on the long short-term memory algorithm and refined through transfer learning to obtain a preliminary generative model. Subsequently, a deep reinforcement learning strategy was employed, using inhibition activity as the reward value, making the model more likely to generate molecules with desirable properties. The results indicate that the reinforced generation model not only generates novel and valid SMILES structures with medicinal potential but also yields molecules with strong binding affinity for the target protein, as indicated by molecular docking experiments. Ultimately, through virtual screening, we identified six lead compounds having the potential to become inhibitors of the interaction between Lens Epithelial-Derived Growth Factor/p75 and HIV-1 integrase, providing an effective and practical strategy for de novo drug design of HIV-1 integrase inhibitors.


A Novel Two-Parameter Estimation Technique for Handling Multicollinearity in Inverse Gaussian Regression Model

Journal of Chemometrics

Pub Date : 2025-05-08 DOI: 10.1002/cem.70032

Ishrat Riaz, Aamir Sanaullah, Mustafa M. Hasaballah, Oluwafemi Samson Balogun, Mahmoud E. Bakr

This study focuses on the prevalent issue of multicollinearity in the inverse Gaussian regression model (IGRM), which arises when predictor variables are highly correlated. The usual maximum likelihood estimator (MLE) proves to be highly unstable when dealing with linearly related regressors, and the accuracy of the model may suffer because of inflated variances and inaccurate coefficient estimates. To improve parameter estimation accuracy and combat multicollinearity, this paper proposes an alternative biased estimator for the IGRM that integrates a two-parameter framework. This novel two-parameter estimator is a general estimator that includes the maximum likelihood, ridge, and Stein estimators as special cases. The theoretical characteristics of the estimator, including its bias and mean squared error (MSE), are developed and then compared theoretically with those of the previous estimators under the mean squared error matrix (MMSE) criterion. Moreover, the optimal values of the biasing parameters for the proposed estimator are obtained. An extensive simulation study and a real-world dataset are examined to assess the practical relevance of the proposed estimator. The empirical results show that, in comparison with conventional estimators, including the MLE, ridge, and Stein estimators, the proposed estimator considerably lowers the MSE and improves the parameter estimation accuracy. These results illustrate the novel approach's potential for dealing with multicollinearity in the IGRM. These findings contribute to the continuing development of reliable estimation methods for generalized linear models (GLMs).

Citations: 0

Feasibility Study on Identifying Seed Variety of Soybean With Hyperspectral Imaging and Deep Learning

IF 2.1 Zone 4 Chemistry Q1 SOCIAL WORK

Journal of Chemometrics

Pub Date : 2025-05-01 DOI: 10.1002/cem.70035

Lei Pang, Zhen Wang, Siyan Mi, Hui Li

Seed variety purity is an important indicator of seed quality, and mixing soybean seeds at different maturity stages can affect crop growth and food quality. This study investigated the feasibility of recognizing five soybean varieties at different maturity stages using hyperspectral imaging. Hyperspectral data from 3600 soybean seeds were collected in the range of 395.5–1003.7 nm. First, the potential to qualitatively distinguish the five soybean varieties was assessed using visual cluster analyses based on principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP). Next, the performance of four classification models was compared: random forest (RF), extreme learning machine (ELM), partial least squares discriminant analysis (PLS-DA), and one-dimensional convolutional neural network (1DCNN). Multiplicative scatter correction (MSC) preprocessing significantly improved the recognition performance of all four models, with the 1DCNN model showing the highest accuracy and the most stable performance. The effects of feature bands extracted using competitive adaptive reweighted sampling (CARS), variable importance in projection (VIP), and locally linear embedding (LLE) on the four models were also compared. The accuracy of all four feature band sets, when combined with the MSC+1DCNN model, exceeded 96% in identifying soybean varieties. These results indicate that the 1DCNN discriminant analysis model is suitable for spectral data analysis in soybean seed variety classification and can significantly enhance classification accuracy.
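
The MSC preprocessing step mentioned above can be sketched as follows. This is the textbook form of multiplicative scatter correction (regress each spectrum on a reference spectrum, then remove the fitted offset and slope), not the authors' code; the function name `msc` and the synthetic spectra are assumptions for illustration.

```python
import numpy as np


def msc(spectra, reference=None):
    """Multiplicative scatter correction for a (n_samples, n_bands) array."""
    spectra = np.asarray(spectra, dtype=float)
    if reference is None:
        reference = spectra.mean(axis=0)   # mean spectrum as the reference
    corrected = np.empty_like(spectra)
    for i, spectrum in enumerate(spectra):
        # Least-squares fit: spectrum ~ offset + slope * reference.
        slope, offset = np.polyfit(reference, spectrum, deg=1)
        corrected[i] = (spectrum - offset) / slope
    return corrected


# Synthetic spectra: one underlying band shape with per-sample offset and gain.
wavelengths = np.linspace(395.5, 1003.7, 50)
base = np.exp(-((wavelengths - 700.0) / 80.0) ** 2)
offsets = np.array([0.00, 0.10, -0.05])
gains = np.array([1.0, 1.3, 0.8])
raw = offsets[:, None] + gains[:, None] * base

corrected = msc(raw)   # all rows collapse onto the mean spectrum
```

Because the synthetic spectra differ only by an additive offset and a multiplicative gain, the correction maps every row onto the reference. Real spectra also carry chemical differences, which remain after the scatter terms are removed.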

Citations: 0

Multiview Ensemble Learning Framework for Real-Time UV Spectroscopic Detection of Nitrate in Water With Chemometric Modelling

IF 2.1 Zone 4 Chemistry Q1 SOCIAL WORK

Journal of Chemometrics

Pub Date : 2025-05-01 DOI: 10.1002/cem.70033

Sagar Rana, Sudeshna Bagchi

Accurate detection of nitrate in water for quality monitoring is an important yet challenging task. To address this, the present work proposes an ensemble machine learning–based chemometric framework for the optical detection of nitrate in water. It uses absorbance-based, reagent-less detection of nitrate in water to support the robustness of the model. The absorption spectra were recorded using a portable set-up in the presence and absence of interfering ions. Different interfering ions, namely, nitrite (NO₂⁻), calcium (Ca²⁺), magnesium (Mg²⁺), carbonate (CO₃²⁻), bromide (Br⁻), chloride (Cl⁻) and phosphate (PO₄³⁻), are added to the target analyte in all possible combinations (binary, ternary, quaternary, quinary, senary and septenary mixtures) to validate the real-time application of the proposed algorithm. Under the multiview framework, two multiview nitrate prediction models, MVNPM-I and MVNPM-II, are proposed. MVNPM-I is based on an ensemble of regressors' results, and MVNPM-II uses multiple views of the dataset followed by an ensemble of their results. The performance of the models is assessed using a hold-out validation scheme with 10 repetitions and measured using the R² score and mean squared error (MSE). The best results, an R² score of 0.9978 with a standard deviation of 0.0014 and an MSE of 1.1799 with a standard deviation of 0.8639, are obtained using the MVNPM-II model. Further, the performance measures of the proposed models show that they can handle the presence of interfering ions. The algorithm was also tested on real-world samples, giving an R² score and MSE of 0.9998 and 0.696, respectively. The promising results strengthen the applicability of the proposed method in real-world scenarios.
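
The evaluation protocol described above (hold-out validation repeated 10 times, reported as the mean and standard deviation of R² and MSE) can be sketched as below. The base regressors, the 30% test fraction, the function names, and the synthetic data are all assumptions for illustration; the abstract does not say which regressors MVNPM-I or MVNPM-II combine.

```python
import numpy as np


def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def ridge_fit_predict(X_train, y_train, X_test, alpha):
    """Ridge regression with an unpenalized intercept (alpha=0 gives OLS)."""
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    Xc = X_train - x_mean
    A = Xc.T @ Xc + alpha * np.eye(X_train.shape[1])
    coef = np.linalg.solve(A, Xc.T @ (y_train - y_mean))
    return (X_test - x_mean) @ coef + y_mean


def ensemble_predict(X_train, y_train, X_test, alphas=(0.0, 0.1, 1.0)):
    """Average the predictions of several base regressors (stand-ins here)."""
    preds = [ridge_fit_predict(X_train, y_train, X_test, a) for a in alphas]
    return np.mean(preds, axis=0)


def repeated_holdout(X, y, n_repeats=10, test_fraction=0.3, seed=0):
    """Repeat a random train/test split and summarize R2 and MSE."""
    rng = np.random.default_rng(seed)
    n_test = int(round(test_fraction * len(y)))
    r2s, mses = [], []
    for _ in range(n_repeats):
        idx = rng.permutation(len(y))
        test, train = idx[:n_test], idx[n_test:]
        pred = ensemble_predict(X[train], y[train], X[test])
        r2s.append(r2_score(y[test], pred))
        mses.append(np.mean((y[test] - pred) ** 2))
    return {"r2_mean": np.mean(r2s), "r2_std": np.std(r2s),
            "mse_mean": np.mean(mses), "mse_std": np.std(mses)}


# Synthetic "absorbance vs. concentration" data (made-up numbers).
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(120, 5))
y = X @ np.array([10.0, 0.5, -0.3, 0.2, 0.1]) + rng.normal(scale=0.05, size=120)
summary = repeated_holdout(X, y)
```

Reporting the standard deviation across repetitions, as the abstract does, shows how sensitive the scores are to the particular split rather than relying on a single favorable partition.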

Citations: 0

Quantitative Structure–Activity Relationship Modeling Based on Improving Kernel Ridge Regression

IF 2.1 Zone 4 Chemistry Q1 SOCIAL WORK

Journal of Chemometrics

Pub Date : 2025-05-01 DOI: 10.1002/cem.70027

Shaimaa Waleed Mahmood, Ghalya Tawfeeq Basheer, Zakariya Yahya Algamal

The quantitative structure–activity relationship (QSAR), an effective and promising model for better understanding the relationship between chemical activity and chemical compounds, is commonly used in modeling chemical datasets. Kernel ridge regression (KRR) has recently attracted scholarly interest because of its non-iterative approach to problem solving. KRR is a highly regarded and practical machine learning approach that has successfully tackled classification and regression problems. It is a regression method that uses a nonlinear kernel function to define an inner product in a higher-dimensional transformed space, which yields good generalization performance through a regularized least squares solution. However, the performance of KRR depends on the values chosen for the hyperparameters that define the kernel. Earlier methods for determining these hyperparameter values incur a high processing cost, consume memory, and deliver poor accuracy. The main contribution of this paper is therefore an enhancement of the coati optimization algorithm that applies elite opposition-based learning to increase the population density around the optima in the search space, so that the best hyperparameters are selected properly. To verify the proposed improvement of KRR and compare it with baseline methods, seven public chemical datasets were used. Based on several assessment criteria, the results show that the proposed improvement is superior to all the baseline methods in classification performance.
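
To make the role of the hyperparameters concrete, here is a minimal sketch of KRR with a Gaussian (RBF) kernel together with a basic opposition-based step. It assumes an RBF kernel and a log-scale search box, and it uses plain opposition-based learning (reflecting a candidate within fixed bounds), not the elite variant or the coati optimizer from the paper. All names, bounds, and data are hypothetical.

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    sq_dists = (np.sum(A ** 2, axis=1)[:, None]
                + np.sum(B ** 2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def krr_fit(X, y, gamma, lam):
    """Solve (K + lam I) alpha = y; this closed form is the non-iterative step."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)


def krr_predict(X_train, alpha, X_new, gamma):
    return rbf_kernel(X_new, X_train, gamma) @ alpha


def validation_mse(params, X_tr, y_tr, X_val, y_val):
    """Score a candidate given as (log10 gamma, log10 lam)."""
    gamma, lam = 10.0 ** params
    alpha = krr_fit(X_tr, y_tr, gamma, lam)
    return np.mean((krr_predict(X_tr, alpha, X_val, gamma) - y_val) ** 2)


def opposite(params, lower, upper):
    """Basic opposition-based learning: reflect a candidate within its bounds."""
    return lower + upper - params


# Toy 1-D regression problem (synthetic, not a QSAR dataset).
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(80, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=80)
X_tr, y_tr, X_val, y_val = X[:60], y[:60], X[60:], y[60:]

lower, upper = np.array([-3.0, -6.0]), np.array([1.0, 0.0])
candidate = np.array([-2.5, -0.5])            # one point in the search box
mirror = opposite(candidate, lower, upper)    # log10 values [0.5, -5.5]
# Keep whichever of the pair scores better on the validation split.
best = min((candidate, mirror),
           key=lambda p: validation_mse(p, X_tr, y_tr, X_val, y_val))
```

Evaluating a candidate together with its opposite is a cheap way to sample both sides of the search box. A population-based optimizer would apply this step to many candidates per iteration, at the cost of one linear solve per evaluation.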

Citations: 0