Helene Fog Froriep Halberg, Marta Bevilacqua, Åsmund Rinnan
Fluorescence spectroscopy has been applied for analysis of complex samples, such as food and beverages. Parallel factor analysis (PARAFAC) is a well‐known decomposition method for fluorescence excitation–emission matrices (EEMs). When the complexity of the system increases, it becomes considerably more difficult to determine the optimal number of PARAFAC components, especially when the fluorophores of the system are unknown. The two commonly applied diagnostics, core consistency and split‐half analysis, appear to underestimate the model complexity due to covarying components and local minima, respectively. As a more robust alternative, we propose a resampling approach with multiple initializations and submodel comparisons for estimating the optimal number of PARAFAC components in complex data.
{"title":"Resampling as a Robust Measure of Model Complexity in PARAFAC Models","authors":"Helene Fog Froriep Halberg, Marta Bevilacqua, Åsmund Rinnan","doi":"10.1002/cem.3601","DOIUrl":"https://doi.org/10.1002/cem.3601","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics"},"PeriodicalIF":2.4,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"142209893","RegionNum":4,"RegionCategory":"Chemistry","Total":0}
Irene Mariñas‐Collado, Juan M. Rodríguez‐Díaz, M. Teresa Santos‐Martín
This study addresses the complex dynamics of alcohol elimination in the human body, a topic of great importance in forensic science and healthcare. Existing models often oversimplify by assuming linear elimination kinetics, which limits their practical application. This study presents a novel non‐linear model for estimating blood alcohol concentration after multiple intakes. Initially developed for two separate alcohol intakes, it can be straightforwardly extended to more intakes. Emphasising the significance of accurate parameter estimation, the research underscores the importance of precise experimental design, utilising optimal experimental design (OED) methodologies. Sensitivity analysis of the model coefficients and the determination of D‐optimal designs, considering correlation structures among observations, reveal a strong linear relationship between support points. This relationship can be used to obtain nearly optimal designs that are highly efficient and much easier to compute.
{"title":"A Non‐Linear Model for Multiple Alcohol Intakes and Optimal Designs Strategies","authors":"Irene Mariñas‐Collado, Juan M. Rodríguez‐Díaz, M. Teresa Santos‐Martín","doi":"10.1002/cem.3599","DOIUrl":"https://doi.org/10.1002/cem.3599","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics"},"PeriodicalIF":2.4,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"142209898","RegionNum":4,"RegionCategory":"Chemistry","Total":0}
In this paper, we revisit the power curves in ANOVA simultaneous component analysis (ASCA) based on permutation testing and introduce population curves derived from population parameters describing the relative effect among factors and interactions. The relative effect has important practical implications: the statistical power of a given factor depends on the design of the other factors in the experiment and not only on the sample size. Thus, understanding the relative power in a specific experimental design can be extremely useful for maximizing the chance of success when planning the experiment. In the paper, we derive relative and absolute population curves, where the former represent statistical power in terms of the normalized effect size between structure and noise, and the latter in terms of the sample size. Both types of population curves allow us to make decisions regarding the number and nature (fixed/random) of factors, their relationships (crossed/nested), and the number of levels and replicates, among others, in a multivariate experimental design (e.g., an omics study) during the planning phase of the experiment. We illustrate both types of curves through simulation.
{"title":"Population Power Curves in ASCA With Permutation Testing","authors":"José Camacho, Michael Sorochan Armstrong","doi":"10.1002/cem.3596","DOIUrl":"https://doi.org/10.1002/cem.3596","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics"},"PeriodicalIF":2.4,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"142209894","RegionNum":4,"RegionCategory":"Chemistry","Total":0}
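The permutation test underlying such power curves can be illustrated for the simplest possible case. The sketch below is a toy under stated assumptions, not the authors' procedure: it handles a single factor, uses the sum of squares of the factor-level means as the test statistic, and the names `effect_ss` and `permutation_p_value` are hypothetical.

```python
import random


def effect_ss(X, labels):
    """Sum of squares of the factor effect: squared distance between each
    level mean and the grand mean, weighted by level size, over all variables."""
    n_vars = len(X[0])
    grand = [sum(row[j] for row in X) / len(X) for j in range(n_vars)]
    ss = 0.0
    for level in sorted(set(labels)):
        rows = [row for row, lab in zip(X, labels) if lab == level]
        for j in range(n_vars):
            level_mean = sum(r[j] for r in rows) / len(rows)
            ss += len(rows) * (level_mean - grand[j]) ** 2
    return ss


def permutation_p_value(X, labels, n_perm=999, seed=0):
    """Share of label permutations whose effect is at least as large as the
    observed one (the observed labelling counts as one permutation)."""
    rng = random.Random(seed)
    observed = effect_ss(X, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the link between rows and factor levels
        if effect_ss(X, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Simulated data (assumption): two levels, three variables, level "b" shifted by 3 SD.
data_rng = random.Random(1)
labels = ["a"] * 10 + ["b"] * 10
X = [[data_rng.gauss(3.0 if lab == "b" else 0.0, 1.0) for _ in range(3)]
     for lab in labels]
p_value = permutation_p_value(X, labels)
```

A power estimate for a given effect size would then be the fraction of repeatedly simulated data sets for which `p_value` falls at or below the chosen significance level.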
Here, we demonstrate mid‐field ¹H NMR spectroscopy combined with chemometrics to be powerful in the classification and authentication of motor oils (MOs). The ¹H NMR data were processed with a new algorithm for simultaneous phase and baseline correction, which, for crowded spectra such as those of the refinery products, allowed for more accurate estimation of phase parameters than other literature approaches tested. A principal component analysis (PCA) model based on the unbinned CH₃ fingerprint region (0.6–1.0 ppm) enabled the differentiation of hydrocracked and poly‐α‐olefin‐based MOs and was effective in resolving mixtures of these base stocks with conventional base oils. PCA of the 1.0‐ to 1.14‐ppm region enabled the detection of poly(isobutylene) additive and was useful for differentiating between single‐grade and multigrade MOs. Non‐equidistantly binned ¹H NMR data were used to detect the addition of esters and to establish discriminant models for classifying MOs by viscosity grade and by major categories of synthetic, semisynthetic, and mineral oils. The performances of four classifiers (linear discriminant analysis [LDA], quadratic discriminant analysis [QDA], naïve Bayes classifier [NBC], and support vector machine [SVM]) with and without PCA dimensionality reduction were compared. In both tasks, SVM showed the best efficiency, with average error rates of ~2.3% and 8.15% for predicting major MO categories and viscosity grades, respectively. The potential to merge spectra collected from different NMR instruments is discussed for models based on spectral binning. It is also shown that small errors in phase parameters are not detrimental to binning‐based PCA models.
{"title":"Chemometric Classification of Motor Oils Using 1H NMR Spectroscopy With Simultaneous Phase and Baseline Optimization","authors":"A. Olejniczak, J. P. Łukaszewicz","doi":"10.1002/cem.3598","DOIUrl":"https://doi.org/10.1002/cem.3598","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics"},"PeriodicalIF":2.4,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"142209895","RegionNum":4,"RegionCategory":"Chemistry","Total":0}
Many data analysis methods actually combine optimization of several criteria. In this paper, a framework is offered for categorizing such multi‐criteria methods. In particular, it categorizes multiset and three‐way analysis methods as well as penalized methods and combinations thereof. The framework aims to stimulate critical evaluation of methods and reflection on the purpose of methods and, by signaling gaps, to help the development of new data analysis methods.
{"title":"Some Views on Multi‐criteria Methods for Data Analysis","authors":"Henk A. L. Kiers, Marieke E. Timmerman","doi":"10.1002/cem.3597","DOIUrl":"https://doi.org/10.1002/cem.3597","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics"},"PeriodicalIF":2.4,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"142209923","RegionNum":4,"RegionCategory":"Chemistry","Total":0}
Klaus Neymeyr, Martina Beese, Hamid Abdollahi, Mathias Sawall
In multivariate curve resolution (MCR) analyses, the similarity of pairs of spectra or concentration profiles can be measured in terms of the acute angle enclosed by the representing vectors. Acute angles between vectors can be generalized to pairs of subspaces. So‐called canonical angles, also called principal angles, measure the mutual orientation of a pair of subspaces. This work discusses how angles and canonical angles can support MCR analyses. A canonical angle analysis (CAA) can help to detect changes in the chemical composition during a chemical reaction in a way comparable to, but different from, evolving factor analysis (EFA).
{"title":"Can Angle Measures Be Useful in MCR Analyses?","authors":"Klaus Neymeyr, Martina Beese, Hamid Abdollahi, Mathias Sawall","doi":"10.1002/cem.3582","DOIUrl":"https://doi.org/10.1002/cem.3582","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics"},"PeriodicalIF":2.4,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"142209896","RegionNum":4,"RegionCategory":"Chemistry","Total":0}
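Canonical angles between two subspaces can be computed with a standard textbook recipe: orthonormalize a basis of each subspace and take the arccosine of the singular values of the product of the two bases. The sketch below assumes NumPy and full-column-rank inputs; the function name `canonical_angles` is hypothetical, and the code is not taken from the paper.

```python
import numpy as np


def canonical_angles(A, B):
    """Canonical (principal) angles in radians, in ascending order, between
    the column spaces of A and B. Columns are assumed linearly independent."""
    Qa, _ = np.linalg.qr(A)  # orthonormal basis of span(A)
    Qb, _ = np.linalg.qr(B)  # orthonormal basis of span(B)
    # Singular values of Qa^T Qb are the cosines of the canonical angles.
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))


# Two single vectors: the one canonical angle is the ordinary acute angle.
vector_angle = canonical_angles(np.array([[1.0], [0.0]]),
                                np.array([[1.0], [1.0]]))

# Two planes in 3D that share one direction and are otherwise orthogonal.
plane_a = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
plane_b = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])
plane_angles = canonical_angles(plane_a, plane_b)
```

The arccosine is numerically ill-conditioned for very small angles. SciPy provides `scipy.linalg.subspace_angles` for the same quantity, which may be the better choice when nearly coincident subspaces are expected.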
In this work, two alternative ways of analyzing three‐way data with multivariate curve resolution alternating least squares (MCR‐ALS) using the trilinearity constraint are described and compared. Synthetic and experimental three‐way datasets covering different scenarios are analyzed, and the results obtained are compared. The two new ways of applying the trilinearity constraint are named flexible trilinearity alignment (FTA) and shift invariant transformation (SIT). The effects of noise on the application of both types of constraints are investigated in detail. Results show that both approaches are particularly suitable for cases, as in gas chromatography and especially liquid chromatography, where the elution profiles of the same chemical component in different chromatographic runs are not totally reproducible because they are time shifted, although they preserve their shape. When strong time shifts and co‐elution occur, the “standard” trilinear model does not work, and alternative approaches should be used, such as the MCR extended bilinear model for multiset (multirun) data, or the proposed relaxation of the trilinearity constraint in the FTA and SIT methods, which captures the time drift in the elution profiles of the resolved components.
{"title":"Flexible Trilinearity Alignment (FTA) and Shift Invariant Transformation (SIT) Constraints in Three‐Way Multivariate Curve Resolution Data Analysis","authors":"Xin Zhang, R. Tauler","doi":"10.1002/cem.3581","DOIUrl":"https://doi.org/10.1002/cem.3581","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics"},"PeriodicalIF":2.3,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"141928166","RegionNum":4,"RegionCategory":"Chemistry","Total":0}
A robust method for multiplicative scatter correction (MSC) in infrared spectroscopy is presented. Using quantile regression, the outlier wavelengths (concentration‐dependent wavelengths) that are irrelevant to the regression are identified and therefore excluded from the regression model. This new MSC method, which can be implemented in its simple or extended form, is much simpler than recently proposed methods and has only one hyperparameter (the quantile value) to be adjusted. To this end, a scoring function based on residual analysis can automatically determine the correct quantile value. The method is first explained using simulated data sets and then validated by analysing several experimental data sets. It was found that the new method performs well in the presence of strong outlying variables. On the other hand, when the data sets are not associated with outlying wavelengths, the method behaves similarly to the conventional MSC method.
{"title":"Robust Multiplicative Scatter Correction Using Quantile Regression","authors":"Bahram Hemmateenejad, Nabiollah Mobaraki, Knut Baumann","doi":"10.1002/cem.3589","DOIUrl":"https://doi.org/10.1002/cem.3589","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics"},"PeriodicalIF":2.4,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"141933433","RegionNum":4,"RegionCategory":"Chemistry","Total":0}
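For orientation, the conventional MSC baseline that the robust method is compared against can be sketched in a few lines. This is only the ordinary least-squares variant in its simple form: each spectrum is regressed on a reference spectrum and the fitted offset and slope are removed. The quantile-regression step of the paper is not implemented here, and the function name `msc` is hypothetical.

```python
from statistics import fmean


def msc(spectra, reference=None):
    """Conventional multiplicative scatter correction (simple form).

    Each spectrum x is fitted as x = intercept + slope * reference by
    ordinary least squares and returned as (x - intercept) / slope.
    The reference defaults to the mean spectrum.
    """
    n_wavelengths = len(spectra[0])
    if reference is None:
        reference = [fmean(s[j] for s in spectra) for j in range(n_wavelengths)]
    ref_mean = fmean(reference)
    ref_ss = sum((r - ref_mean) ** 2 for r in reference)
    corrected = []
    for spectrum in spectra:
        s_mean = fmean(spectrum)
        slope = sum((r - ref_mean) * (x - s_mean)
                    for r, x in zip(reference, spectrum)) / ref_ss
        intercept = s_mean - slope * ref_mean
        corrected.append([(x - intercept) / slope for x in spectrum])
    return corrected


# Toy spectra (assumption): pure offset/scale distortions of one base spectrum.
base = [1.0, 2.0, 4.0, 3.0]
distorted = [[0.5 + 2.0 * v for v in base], [-0.2 + 0.5 * v for v in base]]
restored = msc(distorted, reference=base)
```

Because least squares weighs every wavelength equally, concentration-dependent wavelengths pull the fitted slope and intercept away from the pure scatter effect, which is the weakness the quantile-regression variant described above is meant to address.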
In this work, a novel one‐class classification algorithm, the one‐class convolutional autoencoder (OC‐CAE), is proposed for detecting abnormal samples in excitation–emission matrix (EEM) fluorescence spectra datasets. OC‐CAE uses a boxplot to analyze the reconstruction errors and the local outlier factor (LOF) algorithm to handle the features extracted by the hidden layer of the convolutional autoencoder (CAE). The fused information provides the basis for more accurate pattern recognition, ensures flexibility in model training, and yields higher model specificity, which is important in food quality control. To demonstrate the reliability and advantages of OC‐CAE, two EEM cases related to food authentication were studied: Zhenjiang aromatic vinegar (ZAV) and camellia oil (CAO). The results showed that OC‐CAE identified all abnormal samples in both cases, reflecting excellent detection performance, and that, coupled with EEM, it would be an effective tool for the authenticity identification of food.
{"title":"A Novel One‐Class Convolutional Autoencoder Combined With Excitation–Emission Matrix Fluorescence Spectroscopy for Authenticity Identification of Food","authors":"Xiaoqin Yan, Baoshuo Jia, Wanjun Long, Kun Huang, Tong Wang, Hailong Wu, Ruqin Yu","doi":"10.1002/cem.3592","DOIUrl":"https://doi.org/10.1002/cem.3592","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics"},"PeriodicalIF":2.4,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"141933341","RegionNum":4,"RegionCategory":"Chemistry","Total":0}
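The boxplot analysis of reconstruction errors can be illustrated with the usual upper-whisker rule. The abstract does not state which rule is used, so the Tukey convention (third quartile plus 1.5 times the interquartile range) is an assumption here, as are the function name `upper_fence` and the example error values. The autoencoder itself and the LOF step are not shown.

```python
from statistics import quantiles


def upper_fence(errors, k=1.5):
    """Upper boxplot whisker limit: Q3 + k * IQR (k = 1.5 is Tukey's convention)."""
    q1, _, q3 = quantiles(errors, n=4, method="inclusive")
    return q3 + k * (q3 - q1)


# Hypothetical reconstruction errors of authentic training samples.
train_errors = [0.10, 0.12, 0.11, 0.13, 0.09, 0.14, 0.10, 0.12]
fence = upper_fence(train_errors)

# New samples whose reconstruction error exceeds the fence are flagged as abnormal.
new_errors = [0.11, 0.30, 0.13]
flags = [error > fence for error in new_errors]
```

Only the upper fence is used because an unusually small reconstruction error is not a sign of an abnormal sample in a one-class setting.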
The performance of multivariate calibration models ŷ = f(x) for the prediction of a numerical property y from a set of x‐variables depends on the type of scaling of the x‐variables. Common scaling methods are autoscaling (dividing the centered x by its standard deviation s) and Pareto scaling (dividing the centered x by s^P with P = 0.5). The adjusted Pareto scaling presented here varies the exponent P between 0 (no scaling) and 1 (autoscaling) with the aim of obtaining an optimum prediction performance for ŷ. Related scaling methods based on the variable spread are range scaling and vast scaling, while level scaling is based on the location (central value) of the variable. These scaling methods and robust versions are compared for models created by partial least‐squares (PLS) regression. The applied strategy, repeated double cross validation (rdCV), evaluates the model performance for test set objects and considers its variability. Results with three data sets from chemistry show: (a) the efficacy of the different scaling methods depends on the data structure; (b) optimization of the Pareto exponent P is recommended; (c) range scaling or vast scaling may be better than adjusted Pareto scaling; (d) in general, a heuristic search for the best scaling method is advisable. Overall, the consideration of different variants of scaling allows for a flexible adjustment of the variable contributions to the calibration model.
{"title":"Adjusted Pareto Scaling for Multivariate Calibration Models","authors":"Kurt Varmuza, Peter Filzmoser","doi":"10.1002/cem.3588","DOIUrl":"https://doi.org/10.1002/cem.3588","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics"},"PeriodicalIF":2.4,"publicationDate":"2024-08-03","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"141933342","RegionNum":4,"RegionCategory":"Chemistry","Total":0}
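The scaling family described in this abstract reduces to one formula per variable: subtract the mean and divide by the standard deviation raised to the exponent P. A minimal sketch follows, assuming the sample standard deviation and a row-per-object data layout; the function name `pareto_scale` is hypothetical, and the selection of P by rdCV is not shown.

```python
from statistics import fmean, stdev


def pareto_scale(X, P=0.5):
    """Center each column of X and divide it by s**P, where s is the column's
    sample standard deviation. P = 0 only centers, P = 0.5 is classical
    Pareto scaling, and P = 1 is autoscaling."""
    scaled_columns = []
    for column in zip(*X):
        mean = fmean(column)
        s = stdev(column)
        scaled_columns.append([(value - mean) / s ** P for value in column])
    # Transpose back to the original row-per-object layout.
    return [list(row) for row in zip(*scaled_columns)]


# Toy data (assumption): two variables with standard deviations 1 and 20.
X = [[1.0, 10.0], [2.0, 30.0], [3.0, 50.0]]
centered_only = pareto_scale(X, P=0.0)
pareto = pareto_scale(X, P=0.5)
autoscaled = pareto_scale(X, P=1.0)
```

Intermediate exponents shrink the influence of high-variance variables without equalizing all variances, which is what makes P a tunable parameter. A constant column (s = 0) would raise a division error for P > 0 and would need to be removed beforehand.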