Journal of Chemometrics: Latest Articles

De Novo Design of HIV-1 Integrase-LEDGF/p75 Inhibitors Through Deep Reinforcement Learning and Virtual Screening
IF 2.1 | Zone 4 (Chemistry) | Pub Date: 2025-05-12 | DOI: 10.1002/cem.70037
Hai-Bo Sun, Hai-Long Wu, Tong Wang, An-Qi Chen, Ru-Qin Yu

Human immunodeficiency virus (HIV) has far-reaching impacts on global public health. Acquired immunodeficiency syndrome (AIDS) has caused millions of deaths globally, and thousands of people are still being infected. Developing HIV-1 integrase inhibitors is therefore crucial for controlling AIDS by slowing virus replication and transmission. This study uses a deep reinforcement learning framework to design, de novo, inhibitors of the interaction between HIV-1 integrase and Lens Epithelial-Derived Growth Factor/p75 (LEDGF/p75), and then employs molecular docking to screen potential therapeutic compounds. First, a molecular generation model was built on the long short-term memory (LSTM) algorithm and refined through transfer learning to obtain a preliminary generative model. Next, a deep reinforcement learning strategy was applied, using inhibition activity as the reward value, which makes the model more likely to generate molecules with desirable properties. The results indicate that the reinforced generation model generates novel and effective SMILES structures with medicinal potential, and molecular docking experiments indicate strong binding affinity between the generated molecules and the target protein. Finally, through virtual screening, we identified six lead compounds with the potential to inhibit the interaction between LEDGF/p75 and HIV-1 integrase, providing an effective and practical strategy for the de novo design of HIV-1 integrase inhibitors.
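The reward-driven step in this abstract can be illustrated with a toy policy-gradient update. The sketch below is not the authors' LSTM generator: it assumes a hypothetical four-token vocabulary with made-up rewards standing in for predicted inhibition activity, and it uses the exact gradient of the expected reward (REINFORCE estimates the same gradient from sampled sequences).

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def expected_reward(logits, rewards):
    """J = sum_i p_i * r_i under the softmax policy."""
    return float(softmax(logits) @ rewards)

def policy_gradient_step(logits, rewards, lr=0.5):
    """One gradient-ascent step on J; dJ/dlogit_i = p_i * (r_i - J)."""
    p = softmax(logits)
    baseline = p @ rewards
    return logits + lr * p * (rewards - baseline)

# Hypothetical rewards: token 2 plays the role of the "most active" choice.
rewards = np.array([0.1, 0.2, 0.9, 0.3])
logits = np.zeros(4)                      # start from a uniform policy
j_start = expected_reward(logits, rewards)
for _ in range(200):
    logits = policy_gradient_step(logits, rewards)
j_end = expected_reward(logits, rewards)
best_token = int(np.argmax(softmax(logits)))
```

After the updates the policy puts most of its probability on the highest-reward token, which is the same mechanism that biases a sequence generator toward high-scoring molecules.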

Citations: 0

A Novel Two-Parameter Estimation Technique for Handling Multicollinearity in Inverse Gaussian Regression Model
IF 2.1 | Zone 4 (Chemistry) | Pub Date: 2025-05-08 | DOI: 10.1002/cem.70032
Ishrat Riaz, Aamir Sanaullah, Mustafa M. Hasaballah, Oluwafemi Samson Balogun, Mahmoud E. Bakr

This study focuses on the prevalent issue of multicollinearity in the inverse Gaussian regression model (IGRM), which arises when predictor variables are highly correlated. The usual maximum likelihood estimator (MLE) is highly unstable when the regressors are linearly related, and the accuracy of the model may suffer because of inflated variances and inaccurate coefficient estimates. To improve parameter estimation accuracy and combat multicollinearity, this paper proposes an alternative biased estimator for the IGRM that integrates a two-parameter framework. The new two-parameter estimator is a general estimator that includes the maximum likelihood, ridge, and Stein estimators as special cases. The theoretical characteristics of the estimator, including its bias and mean squared error (MSE), are developed and then compared theoretically with those of the previous estimators under the mean square error matrix (MMSE) criterion. The optimal values of the biasing parameters for the proposed estimator are also obtained. An extensive simulation study and a real-world dataset are examined to assess the practical relevance of the proposed estimator. The empirical results show that, compared with conventional estimators, including the MLE, ridge, and Stein estimators, the proposed estimator considerably lowers the MSE and improves parameter estimation accuracy. These results illustrate the new approach's potential for dealing with multicollinearity in the IGRM and contribute to the continuing development of reliable estimation methods for generalized linear models (GLMs).
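To make the multicollinearity problem concrete, here is a minimal sketch of ridge shrinkage in an ordinary linear model. It is a simplified stand-in, not the paper's two-parameter IGRM estimator: the simulated data, the function name `ridge`, and the biasing value `k = 1.0` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)    # nearly collinear with x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(size=n)

def ridge(X, y, k):
    """Solve (X'X + k I) beta = X'y; k = 0 gives ordinary least squares."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

beta_ols = ridge(X, y, 0.0)             # unstable under collinearity
beta_ridge = ridge(X, y, 1.0)           # biased but shrunken estimate
cond_number = np.linalg.cond(X.T @ X)   # large value signals collinearity
```

The ridge solution never has a larger Euclidean norm than the least squares solution, which is the variance-for-bias trade that biased estimators of this family exploit.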

Citations: 0

Feasibility Study on Identifying Seed Variety of Soybean With Hyperspectral Imaging and Deep Learning
IF 2.1 | Zone 4 (Chemistry) | Pub Date: 2025-05-01 | DOI: 10.1002/cem.70035
Lei Pang, Zhen Wang, Siyan Mi, Hui Li

Seed variety purity is an important indicator of seed quality, and mixing soybean seeds at different maturity stages can affect crop growth and food quality. This study investigated the feasibility of recognizing five soybean varieties at different maturity stages using hyperspectral imaging. Hyperspectral data from 3600 soybean seeds were collected in the range of 395.5–1003.7 nm. First, the potential to qualitatively distinguish the five soybean varieties was assessed using visual cluster analyses based on principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP). Next, the performance of four classification models—random forest (RF), extreme learning machine (ELM), partial least squares discriminant analysis (PLS-DA), and one-dimensional convolutional neural network (1DCNN)—was compared. Multiplicative scatter correction (MSC) preprocessing significantly improved the recognition effect of all four models, with the 1DCNN model demonstrating the highest accuracy and most stable recognition performance. The effects of feature bands extracted using competitive adaptive reweighted sampling (CARS), variable importance in projection (VIP), and local linear embedding (LLE) on the four models were also compared. The accuracy of all four feature band sets, when combined with the MSC+1DCNN model, exceeded 96% in identifying soybean varieties. Therefore, these results indicate that the 1DCNN discriminant analysis model is suitable for spectral data analysis in soybean seed variety classification and can significantly enhance classification accuracy.
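Multiplicative scatter correction, the preprocessing step credited above, can be sketched in a few lines. This is a generic textbook form of MSC under the assumption that the mean spectrum serves as the reference; the function name `msc` and the synthetic spectra are illustrative and are not taken from the study.

```python
import numpy as np

def msc(spectra, reference=None):
    """Multiplicative scatter correction.

    Each spectrum s is regressed on the reference (s ~ intercept + slope * ref)
    and then corrected as (s - intercept) / slope.
    """
    spectra = np.asarray(spectra, dtype=float)
    ref = spectra.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(spectra)
    for i, s in enumerate(spectra):
        slope, intercept = np.polyfit(ref, s, 1)   # degree-1 fit: [slope, intercept]
        corrected[i] = (s - intercept) / slope
    return corrected

# Synthetic example: one underlying spectrum with different scatter effects.
base = np.sin(np.linspace(0.0, 3.0, 50)) + 2.0
raw = np.vstack([1.0 * base + 0.0, 1.5 * base + 0.3, 0.7 * base - 0.2])
corrected = msc(raw)
```

Because the three synthetic spectra differ only by a scale and an offset, MSC maps all of them onto the same curve, which is why it helps classifiers focus on chemical rather than scatter-related variation.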

Citations: 0

Multiview Ensemble Learning Framework for Real-Time UV Spectroscopic Detection of Nitrate in Water With Chemometric Modelling
IF 2.1 | Zone 4 (Chemistry) | Pub Date: 2025-05-01 | DOI: 10.1002/cem.70033
Sagar Rana, Sudeshna Bagchi

Accurate detection of nitrate in water for quality monitoring is a significant yet challenging task. To address this, the present work proposes an ensemble machine learning–based chemometric framework for the optical detection of nitrate in water. It incorporates absorbance-based, reagent-less detection of nitrate in water to support the robustness of the model. The absorption spectra were recorded using a portable set-up in the presence and absence of interfering ions. Different interfering ions, namely nitrite (NO₂⁻), calcium (Ca²⁺), magnesium (Mg²⁺), carbonate (CO₃²⁻), bromide (Br⁻), chloride (Cl⁻) and phosphate (PO₄³⁻), are added to the target analyte in all possible combinations (binary, ternary, quaternary, quinary, senary and septenary mixtures) to validate the real-time application of the proposed algorithm. Under the multiview framework, two multiview nitrate prediction models, MVNPM-I and MVNPM-II, are proposed. MVNPM-I is based on an ensemble of regressors' results, and MVNPM-II uses multiple views of the dataset followed by an ensemble of their results. The performance of the models is assessed using a hold-out validation scheme with 10 repetitions and measured using the R² score and mean squared error (MSE). The best results, an R² score of 0.9978 (standard deviation 0.0014) and an MSE of 1.1799 (standard deviation 0.8639), are obtained using the MVNPM-II model. Further, the performance measures of the proposed models show that they can handle the presence of interfering ions. The algorithm was also tested using real-world samples, giving an R² score of 0.9998 and an MSE of 0.696. The promising results strengthen the applicability of the proposed method in real-world scenarios.
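The evaluation protocol described here (repeated hold-out, R² and MSE, averaging over views) can be sketched as follows. This is an illustrative outline, not the MVNPM models: the linear regressors, the split of features into two halves as "views", the 30% test fraction, and the simulated data are all assumptions.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((y_true - y_pred) ** 2))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def fit_linear(X, y):
    """Least squares fit with intercept; returns a prediction function."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn]) @ coef

def repeated_holdout(X, y, n_repeats=10, test_fraction=0.3, seed=0):
    """Mean and standard deviation of (R2, MSE) over random hold-out splits."""
    rng = np.random.default_rng(seed)
    half = X.shape[1] // 2
    views = [slice(0, half), slice(half, X.shape[1])]   # hypothetical views
    n_test = int(round(test_fraction * len(y)))
    scores = []
    for _ in range(n_repeats):
        idx = rng.permutation(len(y))
        test, train = idx[:n_test], idx[n_test:]
        preds = [fit_linear(X[train][:, v], y[train])(X[test][:, v]) for v in views]
        y_hat = np.mean(preds, axis=0)                  # ensemble by averaging
        scores.append((r2_score(y[test], y_hat), mse(y[test], y_hat)))
    scores = np.array(scores)
    return scores.mean(axis=0), scores.std(axis=0)

# Simulated "spectra": six channels that scale with concentration c.
rng = np.random.default_rng(1)
c = rng.uniform(0.0, 10.0, size=120)
X = np.outer(c, np.linspace(0.5, 1.0, 6)) + 0.05 * rng.normal(size=(120, 6))
y = c
mean_scores, std_scores = repeated_holdout(X, y)
```

Reporting the standard deviation across repetitions alongside the mean, as the abstract does, shows how sensitive the scores are to the particular train/test split.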

Citations: 0

Quantitative Structure–Activity Relationship Modeling Based on Improving Kernel Ridge Regression
IF 2.1 | Zone 4 (Chemistry) | Pub Date: 2025-05-01 | DOI: 10.1002/cem.70027
Shaimaa Waleed Mahmood, Ghalya Tawfeeq Basheer, Zakariya Yahya Algamal

The quantitative structure–activity relationship (QSAR), an effective and promising model for understanding the relationship between chemical activity and chemical compounds, is commonly used in modeling chemical datasets. Kernel ridge regression (KRR) has recently attracted scholarly interest because of its non-iterative approach to problem solving. KRR is a highly regarded and practical machine learning approach that has successfully tackled classification and regression problems. It is a regression method that uses a nonlinear kernel function to define an inner product in a higher-dimensional transformed space, which gives generalization performance based on a regularized least squares solution. However, the performance of KRR depends on the values chosen for the hyperparameters that define the kernel. Prior methods of determining these hyperparameter values have a high processing cost, consume memory, and deliver poor accuracy. The main contribution of this paper is therefore an enhancement of the coati optimization algorithm that applies elite opposition-based learning to increase the population density around the optima in the search space, for proper selection of the best hyperparameters. To verify the proposed improvement of KRR and compare its performance, seven public chemical datasets were used. Based on several assessment criteria, the results show that the proposed improvement is superior to all the baseline methods in classification performance.
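The non-iterative, regularized least squares solution mentioned above can be written directly. The sketch below shows standard kernel ridge regression with an RBF kernel; the hyperparameter values (`lam`, `gamma`) are fixed by hand here as assumptions, whereas the paper tunes them with a metaheuristic that is not reproduced.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """k(a, b) = exp(-gamma * ||a - b||^2) for all row pairs of A and B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def krr_fit(X, y, lam, gamma):
    """Closed-form dual coefficients: alpha = (K + lam * I)^-1 y."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, gamma):
    """Prediction f(x) = sum_i alpha_i * k(x, x_i)."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# Toy nonlinear target; lam and gamma are illustrative, untuned values.
X = np.linspace(0.0, 2.0 * np.pi, 30)[:, None]
y = np.sin(X).ravel()
alpha = krr_fit(X, y, lam=1e-3, gamma=1.0)
y_fit = krr_predict(X, alpha, X, gamma=1.0)
max_abs_error = float(np.max(np.abs(y_fit - y)))
```

Both `lam` (regularization strength) and `gamma` (kernel width) change the fit substantially, which is why the choice of these hyperparameters is the focus of the optimization step described in the abstract.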

Citations: 0

Correction to “Fast Partition-Based Cross-Validation With Centering and Scaling for $\mathbf{X}^{\mathbf{T}}\mathbf{X}$ and $\mathbf{X}^{\mathbf{T}}\mathbf{Y}$”
IF 2.1 | Zone 4 (Chemistry) | Pub Date: 2025-04-28 | DOI: 10.1002/cem.70034

Galbo Engstrøm, O.-C. and Holm Jensen, M. (2025), Fast Partition-Based Cross-Validation With Centering and Scaling for $\mathbf{X}^{\mathbf{T}}\mathbf{X}$ and $\mathbf{X}^{\mathbf{T}}\mathbf{Y}$. Journal of Chemometrics, 39: e70008, https://doi.org/10.1002/cem.70008.

On line 27 in Algorithm 7 on page 10, the text to the right reads “Obtain $\mathbf{X}^{\mathbf{csT}}\mathbf{Y}^{\mathbf{csT}}$” but should read “Obtain $\mathbf{X}^{\mathbf{csT}}\mathbf{Y}^{\mathbf{cs}}$”.

In Proposition 15 on page 11, the last equality contains a double hat over $\mathbf{x}_{\mathbf{s}}^{\mathbf{T}}$. It should have been a single hat.

On pages 3 and 4, $\mathcal{P}$ has been written multiple times when $\mathcal{P}\left[n\right]$ was intended. Likewise, $\mathcal{V}$ has been written multiple times when $\mathcal{V}\left[p\right]$ was intended.

We apologize for the confusion.

Citations: 0

HiBBKA: A Hybrid Method With Resampling and Heuristic Feature Selection for Class-Imbalanced Data in Chemometrics
IF 2.1 | Zone 4 (Chemistry) | Pub Date: 2025-04-20 | DOI: 10.1002/cem.70029
Ying Guo, Ying Kou, Lun-Zhao Yi, Guang-Hui Fu

In critical domains including medicinal chemistry, biomedicine, metabolomics, and computational toxicology, class imbalance in datasets and poor recognition accuracy for minority classes remain persistent challenges. While previous studies have employed resampling and feature selection techniques to address data imbalance and enhance classification performance, most approaches have focused on single-algorithm solutions rather than hybrid methodologies. Hybrid algorithms offer distinct advantages by integrating the strengths of multiple techniques, thereby providing more comprehensive and efficient solutions for handling imbalanced data. This study proposes HiBBKA, a novel hybrid algorithm combining radial-based under-sampling with SMOTE (RBU-SMOTE) and an improved binary black-winged kite algorithm (iBBKA) for feature selection. The proposed framework operates through two key phases: First, the RBU-SMOTE resampling method synergistically integrates radial-based under-sampling (RBU) with the synthetic minority oversampling technique (SMOTE), effectively addressing class-imbalance distribution while enhancing the quality of synthesized samples. Second, the enhanced iBBKA feature selection algorithm systematically identifies the most discriminative features critical for classification tasks. We comprehensively evaluate RBU-SMOTE and HiBBKA using multiple classifiers across 16 imbalanced datasets, including real-world medical datasets, with particular emphasis on the minority class performance. Experimental results demonstrate that RBU-SMOTE achieves competitive performance compared to existing resampling methods, while the complete HiBBKA framework significantly outperforms state-of-the-art algorithms in overall classification metrics, particularly in the minority class recognition.
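The oversampling half of the resampling step can be illustrated with the basic SMOTE interpolation rule. This is a simplified sketch of plain SMOTE only; it does not implement the radial-based under-sampling or the iBBKA feature selection, and the function name `smote`, the neighbor count `k`, and the toy points are assumptions.

```python
import numpy as np

def smote(minority, n_synthetic, k=3, seed=0):
    """Create synthetic minority samples by interpolating toward near neighbors."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    # Pairwise Euclidean distances between minority samples.
    d = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                  # a point is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]     # indices of the k nearest neighbors
    synthetic = np.empty((n_synthetic, minority.shape[1]))
    for j in range(n_synthetic):
        i = rng.integers(len(minority))          # random minority sample
        nb = neighbors[i, rng.integers(k)]       # one of its k neighbors
        gap = rng.random()                       # interpolation weight in [0, 1)
        synthetic[j] = minority[i] + gap * (minority[nb] - minority[i])
    return synthetic

# Toy minority class with six 2-D points (k must be smaller than this count).
minority = np.array([[0.0, 0.0], [1.0, 0.2], [0.8, 1.0],
                     [0.1, 0.9], [0.5, 0.5], [0.3, 0.2]])
new_points = smote(minority, n_synthetic=10)
```

Every synthetic point lies on a segment between two real minority samples, so the new samples stay inside the region the minority class already occupies.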

Citations: 0
Geographical Influence on Metabolite Profiles of Cupressus torulosa: UPLC-QTOF-MS (Positive Mode) and Chemometric Insights
IF 2.1 Zone 4 (Chemistry) Q1 SOCIAL WORK Pub Date: 2025-04-14 DOI: 10.1002/cem.70031
Radhika Khanna, Khushaboo Bhadoriya, Gaurav Pandey, V. K. Varshney

Cupressus torulosa, known as the Himalayan or Bhutan cypress, is a significant evergreen conifer that typically reaches heights between 20 and 45 m. This species is primarily found in the Himalayan regions of Bhutan, northern India, Nepal, and Tibet. In this study, we used ultra-performance liquid chromatography coupled with quadrupole time-of-flight mass spectrometry (UPLC-QTOF-MS) in positive ion mode, along with chemometric analysis, to investigate the metabolomic profiles of C. torulosa needles collected from 14 geographically distinct areas in Uttarakhand and Himachal Pradesh. Various statistical techniques, including ANOVA, Principal Component Analysis (PCA), Hierarchical Cluster Analysis (HCA), violin plots, scatter plots, box-and-whisker plots, and heatmaps, were employed to illustrate the relative quantitative differences among compounds based on their peak intensities across these regions. Our investigation revealed 34 marker compounds consistently detected across all samples (locations). These compounds were screened using rigorous filtering criteria, incorporating a moderated t-test and multiple-testing adjustment using the Benjamini–Hochberg false discovery rate (FDR) approach. Furthermore, we pioneered the identification of the phenylpropanoid and flavonoid biosynthesis pathways in C. torulosa, providing new insights into its metabolic profile. This work establishes a foundational reference for future research into the species' metabolome, helping to guide studies in areas such as genetic diversity, ecological adaptation, and climate resilience in C. torulosa. Mapping these pathways deepens scientific knowledge of C. torulosa's metabolic processes, contributing to a clearer understanding of its unique biochemical makeup.
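The Benjamini–Hochberg adjustment mentioned in the abstract can be sketched as follows. This is a generic illustration of the standard procedure, not the authors' pipeline; the function name `benjamini_hochberg` is hypothetical and NumPy is assumed.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                        # ascending p-values
    # Scale the i-th smallest p-value by m / i (i is the 1-based rank).
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity: an adjusted value may not exceed any
    # adjusted value belonging to a larger raw p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)  # restore input order
    return adjusted
```

A compound would then be retained as a candidate marker when its adjusted p-value falls below the chosen FDR level; the abstract does not state which level was used.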

Citations: 0
Comprehensive Anomaly Score Rank Based Unsupervised Sample Selection Method
IF 2.1 Zone 4 (Chemistry) Q1 SOCIAL WORK Pub Date: 2025-04-08 DOI: 10.1002/cem.70028
Zhongjiang He, Zhonghai He, Xiaofang Zhang

Selecting representative samples is crucial for establishing an accurate calibration model. To enhance the representativeness of the samples, a sample selection method that uses the degree of anomaly as the evaluation criterion is proposed. Initially, anomaly scores from several detection methods are obtained to ensure a comprehensive evaluation. These scores are then normalized by the confidence lower limit to establish a consistent scoring criterion. Subsequently, the weights of the different detection methods are determined through eigenvector centrality analysis of a graph in which the methods serve as nodes and their similarities act as weighted edges. Finally, the comprehensive anomaly scores are computed as the weighted sum of the individual scores and then sorted. Representative samples are selected using a uniformly spaced sampling approach, with the spacing determined by a predefined number of samples. The efficacy of the method is validated across different sample sets.
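The steps in this abstract can be sketched roughly as below. Several details are assumptions, not the authors' choices: the scores are taken as already normalised (the confidence-lower-limit step is not reproduced), the similarity between methods is the absolute Pearson correlation of their scores, and the name `select_samples` is hypothetical. NumPy is assumed.

```python
import numpy as np

def select_samples(scores, n_select):
    """scores: (n_samples, n_methods) array of normalised anomaly scores,
    with at least two non-constant columns.
    Returns the indices of the selected samples and the method weights."""
    scores = np.asarray(scores, dtype=float)

    # Graph over methods: edge weight = |Pearson correlation| (assumed).
    sim = np.abs(np.corrcoef(scores, rowvar=False))
    np.fill_diagonal(sim, 0.0)

    # Eigenvector centrality: leading eigenvector of the symmetric
    # similarity matrix (eigh returns eigenvalues in ascending order).
    _, eigvecs = np.linalg.eigh(sim)
    centrality = np.abs(eigvecs[:, -1])
    weights = centrality / centrality.sum()

    # Comprehensive score = weighted sum of the per-method scores.
    combined = scores @ weights
    order = np.argsort(combined)

    # Pick uniformly spaced positions along the sorted ranking.
    positions = np.linspace(0, len(order) - 1, n_select).round().astype(int)
    return order[positions], weights
```

Sampling at even intervals along the ranking covers the whole range from the most typical to the most anomalous samples, rather than only one end of it.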

Citations: 0
Data Quality: Importance of the ‘before analysis’ domain (Theory of Sampling, TOS)
IF 2.1 Zone 4 (Chemistry) Q1 SOCIAL WORK Pub Date: 2025-04-06 DOI: 10.1002/cem.70025

Data analysts/chemometricians are part of a scientific collegium covering three distinct domains: i) sampling, ii) analysis, and iii) data modelling, which collectively influence ‘data quality’. There is much more to data quality than analytical uncertainty. In many situations, heterogeneous materials/batches/lots/flowing streams must be sampled appropriately before analysis, following an often long and complex pathway ‘from-lot-to-aliquot’. In most cases, sampling and sub-sampling will dominate the total Measurement Uncertainty budget (MUtotal). Left-out MUsampling contributions may easily overwhelm the Total Analytical Error (TAE) uncertainty by factors of 5, 10, 25 or higher, depending on the specific heterogeneity characteristics of the materials and systems targeted and on the sampling procedure used (grab vs. composite sampling). The focus here is on the consequences of unwittingly ignoring the uncertainties originating in these domains, which will, for example, adversely affect bilinear component directions (reducing model accuracy) as well as RMSE estimates reflecting precision (analyte concentration prediction, classification, time series prediction). Along the way, the article also clears up an evergreen mistake: contrary to many beliefs, ‘more data’ will not automatically reduce the magnitude of an unsatisfactory performance RMSE. It is shown how the Theory of Sampling (TOS) is the only guarantor of representative sampling in the critical ‘before analysis’ domain. This article introduces the essential minimum TOS competence which must be mastered by stakeholders from all three domains. The conceptual elements in the TOS system can be visualised as a graphic overview:

Kim H. Esbensen has been professor at three universities (National Geological Survey of Denmark and Greenland (2010–2015); Aalborg University, Denmark (2001–2010); Telemark Institute of Technology, Norway (1990–2000)) and professeur associé at Université du Québec à Chicoutimi, before becoming an independent consultant in 2015. He is a member of several scientific societies and has published widely across several scientific fields. He is the author of a widely used textbook on Multivariate Data Analysis (chemometrics), and in 2020 published “Introduction to the Theory and Practice of Sampling”. He was chairman of the taskforce responsible for the world's first horizontal (matrix-independent) sampling standard, DS 3077:2024. Esbensen is the founding editor of “Sampling Science and Technology (SST)” (https://www.sst-magazine.info/issues/). He can be reached at his homepage, https://kheconsult.com/
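The abstract's claim that sampling contributions can dominate the uncertainty budget can be made concrete with a short worked calculation. It assumes, as is conventional but not stated in the abstract, that the sampling and analytical contributions are independent and therefore add in quadrature; the factor $k$ is introduced here only for illustration.

```latex
\[
\mathrm{MU}_{\mathrm{total}}^{2}
  = \mathrm{MU}_{\mathrm{sampling}}^{2} + \mathrm{TAE}^{2},
\qquad
\mathrm{MU}_{\mathrm{sampling}} = k\,\mathrm{TAE}
\;\Rightarrow\;
\mathrm{MU}_{\mathrm{total}} = \mathrm{TAE}\,\sqrt{1 + k^{2}} .
\]
\[
k = 5:\ \sqrt{26} \approx 5.10, \qquad
k = 10:\ \sqrt{101} \approx 10.05, \qquad
k = 25:\ \sqrt{626} \approx 25.02 .
\]
```

Under this assumption, at $k = 5$ the analytical term raises the total by only about 2% over the sampling term alone, so improving analytical precision barely changes the total uncertainty.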

Citations: 0