Many real-world data mining applications build predictive models from imbalanced datasets. Imbalanced data can degrade the performance of learning algorithms on rare cases. Although many well-researched solutions exist for classification tasks, most of them cannot be applied directly to regression tasks. One of the challenges in imbalanced regression is to find a suitable evaluation and optimization criterion that improves the predictive ability of the model without introducing severe model bias. Based on the importance of rare cases, this study proposes a new evaluation metric called adapted squared error relevance (ASER), built on new relevance and weighting functions. The metric weights data points according to the importance of rare cases and assigns different weights to losses of the same size at different rare cases, so that the model selected by this metric predicts rare cases better. ASER is compared with squared error relevance (SER) on 32 real datasets and 9 simulated datasets to verify the predictive performance of the selected model on rare cases. The experimental results show that ASER achieves high prediction performance on rare cases without losing much prediction accuracy on common cases.
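To make the idea of relevance-based evaluation concrete, here is a minimal sketch. It assumes the usual description of SER from the imbalanced regression literature (squared errors summed over cases whose relevance reaches a threshold t). The abstract does not give the ASER relevance or weighting functions, so `toy_relevance`, `weighted_squared_error`, and the identity weight below are hypothetical stand-ins, not the authors' definitions.

```python
def ser_t(y_true, y_pred, relevance, t):
    """SER_t: squared errors summed over cases with relevance(y) >= t."""
    return sum(
        (p - y) ** 2 for y, p in zip(y_true, y_pred) if relevance(y) >= t
    )


def weighted_squared_error(y_true, y_pred, relevance, weight):
    """Hypothetical weighted loss: sum of weight(relevance(y)) * error^2."""
    return sum(
        weight(relevance(y)) * (p - y) ** 2 for y, p in zip(y_true, y_pred)
    )


def toy_relevance(y, typical=10.0, scale=10.0):
    """Illustrative relevance in [0, 1]: grows with distance from a typical value."""
    return min(1.0, abs(y - typical) / scale)


y_true = [9.0, 10.0, 11.0, 20.0]    # the last target is the rare case
model_a = [9.0, 10.0, 11.0, 17.0]   # exact on common cases, off by 3 on the rare one
model_b = [11.0, 12.0, 13.0, 19.0]  # off by 2 on common cases, off by 1 on the rare one

# With t = 0 every case is kept, so SER_0 is the plain sum of squared errors.
sse_a = ser_t(y_true, model_a, toy_relevance, t=0.0)
sse_b = ser_t(y_true, model_b, toy_relevance, t=0.0)
# With t = 0.5 only the rare case (relevance 1.0) contributes.
ser_a = ser_t(y_true, model_a, toy_relevance, t=0.5)
ser_b = ser_t(y_true, model_b, toy_relevance, t=0.5)
# Assumed weighting: each loss is scaled by the relevance of its target.
wse_a = weighted_squared_error(y_true, model_a, toy_relevance, weight=lambda r: r)
wse_b = weighted_squared_error(y_true, model_b, toy_relevance, weight=lambda r: r)
```

In this toy setting the plain sum of squared errors prefers `model_a`, while both relevance-based scores prefer `model_b`, which is the kind of shift in model selection the abstract describes.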
{"title":"ASER: Adapted squared error relevance for rare cases prediction in imbalanced regression","authors":"Ying Kou, Guang-Hui Fu","doi":"10.1002/cem.3515","DOIUrl":"https://doi.org/10.1002/cem.3515","url":null,"abstract":"<p>Many real-world data mining applications involve using imbalanced datasets to obtain predictive models. Imbalanced data can hinder the model performance of learning algorithms in rare cases. Although there are many well-researched classification task solutions, most of them cannot be directly applied to regression task. One of the challenges in imbalanced regression is to find a suitable evaluation and optimization standard that can improve the predictive ability of the model without severe model bias. Based on the importance of rare cases, this study proposes a new evaluation metric called adapted squared error relevance (ASER) by defining new relevance function and weighting functions. This metric weights data points by defining the importance of rare cases and assigns different weights to losses of the same size at different rare cases, thus enabling the model selected by this evaluation metric to better predict rare cases. ASER is compared with SER on 32 real datasets and 9 simulated datasets to verify the predictive performance of the selected model at rare cases. 
The experimental results show that the new evaluation metric ASER can obtain a high prediction performance at rare cases, while also not losing too much prediction accuracy in common cases.</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":null,"pages":null},"PeriodicalIF":2.4,"publicationDate":"2023-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134803873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Farnoosh Koleini, Siewert Hugelier, Mahsa Akbari Lakeh, Hamid Abdollahi, José Camacho, Paul J. Gemperline
The complementary nature of analysis of variance (ANOVA) Simultaneous Component Analysis (ASCA+) and Tucker3 tensor decompositions is demonstrated on designed datasets. We show how ASCA+ can be used to (a) identify statistically sufficient Tucker3 models; (b) identify statistically important triads making their interpretation easier; and (c) eliminate non-significant triads making visualization and interpretation simpler. For multivariate datasets with an experimental design of at least two factors, the data matrix can be folded into a multi-way tensor. ASCA+ can be used on the unfolded matrix, and Tucker3 modeling can be used on the folded matrix (tensor). Two novel strategies are reported to determine the statistical significance of Tucker3 models using a previously published dataset. A statistically sufficient model was created by adding factors to the Tucker3 model in a stepwise manner until no ASCA+ detectable structure was observed in the residuals. Bootstrap analysis of the Tucker3 model residuals was used to determine confidence intervals for the loadings and the individual elements of the core matrix and showed that 21 out of 63 core values of the 3 × 7 × 3 model were not significant at the 95% confidence level. Exploiting the mutual orthogonality of the 63 triads of the Tucker3 model, these 21 factors (triads) were removed from the model. An ASCA+ backward elimination strategy is reported to further simplify the Tucker3 3 × 7 × 3 model to 36 core values and associated triads. ASCA+ was also used to identify individual factors (triads) with selective responses on experimental factors A, B, or interactions, A × B, for improved model visualization and interpretation.
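The triad bookkeeping in this abstract can be illustrated with a small sketch of the Tucker3 model written element-wise, x_ijk = Σ g_pqr a_ip b_jq c_kr, where each (p, q, r) combination is one triad. The loadings and core values below are arbitrary illustration data (not orthonormal and not from the paper), and the `keep` argument is a hypothetical way to express a pruned model.

```python
from itertools import product


def tucker3_element(A, B, C, G, i, j, k, keep=None):
    """Return x_ijk as a sum of triads; `keep` optionally restricts the triads used."""
    P, Q, R = len(G), len(G[0]), len(G[0][0])
    total = 0.0
    for p, q, r in product(range(P), range(Q), range(R)):
        if keep is not None and (p, q, r) not in keep:
            continue  # this triad has been removed from the model
        total += G[p][q][r] * A[i][p] * B[j][q] * C[k][r]
    return total


# A 3 x 7 x 3 core has one triad per core element.
n_triads = sum(1 for _ in product(range(3), range(7), range(3)))
n_after_bootstrap = n_triads - 21  # triads left after dropping the 21 non-significant ones

# Tiny 2 x 2 x 2 example with one dominant and one weak triad.
A = [[1.0, 0.5], [0.0, 1.0]]
B = [[1.0, 0.0], [0.5, 1.0]]
C = [[1.0, 1.0], [0.0, 1.0]]
G = [[[2.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.1]]]

full_000 = tucker3_element(A, B, C, G, 0, 0, 0)
full_111 = tucker3_element(A, B, C, G, 1, 1, 1)
pruned_111 = tucker3_element(A, B, C, G, 1, 1, 1, keep={(0, 0, 0)})
```

Removing a triad is equivalent to setting its core element to zero. With orthonormal loadings, as exploited in the abstract, the remaining triads do not need to be refitted after such a removal; the arbitrary loadings used here do not have that property.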
{"title":"On the complementary nature of ANOVA simultaneous component analysis (ASCA+) and Tucker3 tensor decompositions on designed multi-way datasets","authors":"Farnoosh Koleini, Siewert Hugelier, Mahsa Akbari Lakeh, Hamid Abdollahi, José Camacho, Paul J. Gemperline","doi":"10.1002/cem.3514","DOIUrl":"10.1002/cem.3514","url":null,"abstract":"<p>The complementary nature of analysis of variance (ANOVA) Simultaneous Component Analysis (ASCA+) and Tucker3 tensor decompositions is demonstrated on designed datasets. We show how ASCA+ can be used to (a) identify statistically sufficient Tucker3 models; (b) identify statistically important triads making their interpretation easier; and (c) eliminate non-significant triads making visualization and interpretation simpler. For multivariate datasets with an experimental design of at least two factors, the data matrix can be folded into a multi-way tensor. ASCA+ can be used on the unfolded matrix, and Tucker3 modeling can be used on the folded matrix (tensor). Two novel strategies are reported to determine the statistical significance of Tucker3 models using a previously published dataset. A statistically sufficient model was created by adding factors to the Tucker3 model in a stepwise manner until no ASCA+ detectable structure was observed in the residuals. Bootstrap analysis of the Tucker3 model residuals was used to determine confidence intervals for the loadings and the individual elements of the core matrix and showed that 21 out of 63 core values of the 3 × 7 × 3 model were not significant at the 95% confidence level. Exploiting the mutual orthogonality of the 63 triads of the Tucker3 model, these 21 factors (triads) were removed from the model. An ASCA+ backward elimination strategy is reported to further simplify the Tucker3 3 × 7 × 3 model to 36 core values and associated triads. 
ASCA+ was also used to identify individual factors (triads) with selective responses on experimental factors A, B, or interactions, A × B, for improved model visualization and interpretation.</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":null,"pages":null},"PeriodicalIF":2.4,"publicationDate":"2023-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/epdf/10.1002/cem.3514","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44469467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chudong Tong, Xinyan Zhou, Kai Qian, Xin Xu, Jiongting Jiang
The increasing scale of modern chemical plants continues to drive both the investigation and the application of distributed process monitoring approaches. With the goal of directly quantifying the normal relations between the different blocks into which the whole process is divided, a novel multi-block modeling strategy called the block-wise residual generator is proposed. It trains a residual generator for each block using the partial least squares algorithm with a single output, so that the relation between the corresponding block and the others is quantified as a regression model in a block-wise manner. The deviations that abnormal samples cause in the normal relations quantified for the different blocks can thus be efficiently captured by the residuals generated from the block regression models, which then provide sensitive information for fault detection and contribution-based fault diagnosis. Moreover, the proposed method is applicable to both disjoint and overlapping block divisions, and directly quantifying the relations between individual blocks supports its strong monitoring performance, as validated through comparisons with classical distributed process monitoring methods.
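A rough sketch of the block-wise idea follows, under stated assumptions: each variable of a block is taken in turn as the single output of a PLS1 (NIPALS) regression on the variables outside that block, and the prediction error on a new sample is the residual. The function names, the data layout (a list of sample rows), and the per-variable reading of "single output" are assumptions made for illustration, not the authors' implementation.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fit_pls1(X, y, n_components, tol=1e-12):
    """NIPALS PLS1 on mean-centered data; returns means and (w, p, q) per component."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]
    f = [yi - y_mean for yi in y]
    components = []
    for _ in range(min(n_components, m)):  # never more components than inputs
        w = [dot([row[j] for row in E], f) for j in range(m)]
        norm = dot(w, w) ** 0.5
        if norm < tol:  # nothing left to explain
            break
        w = [wj / norm for wj in w]
        t = [dot(row, w) for row in E]
        tt = dot(t, t)
        p = [dot([row[j] for row in E], t) / tt for j in range(m)]
        q = dot(f, t) / tt
        E = [[row[j] - ti * p[j] for j in range(m)] for row, ti in zip(E, t)]
        f = [fi - q * ti for fi, ti in zip(f, t)]
        components.append((w, p, q))
    return x_mean, y_mean, components


def predict_pls1(model, x):
    """Predict one sample by repeating the training deflation steps."""
    x_mean, y_mean, components = model
    e = [xj - mj for xj, mj in zip(x, x_mean)]
    y_hat = y_mean
    for w, p, q in components:
        t = dot(e, w)
        y_hat += q * t
        e = [ej - t * pj for ej, pj in zip(e, p)]
    return y_hat


def fit_block_generators(X, blocks, n_components=2):
    """One single-output model per (block, variable), using variables outside the block."""
    generators = {}
    for b, block in enumerate(blocks):
        others = sorted({j for c, blk in enumerate(blocks) if c != b
                         for j in blk if j not in block})
        inputs = [[row[j] for j in others] for row in X]
        for target in block:
            output = [row[target] for row in X]
            generators[(b, target)] = (others, fit_pls1(inputs, output, n_components))
    return generators


def block_residuals(generators, x):
    return {key: x[key[1]] - predict_pls1(model, [x[j] for j in others])
            for key, (others, model) in generators.items()}


# Toy "normal" data: the third variable is the sum of the first two.
X = [[float(i), float(i * 7 % 5), float(i + i * 7 % 5)] for i in range(10)]
generators = fit_block_generators(X, blocks=[[0, 1], [2]])
normal_res = block_residuals(generators, [2.0, 3.0, 5.0])
faulty_res = block_residuals(generators, [2.0, 3.0, 8.0])  # offset of 3 on variable 2
```

For the block containing variable 2, the residual stays near zero on the consistent sample and picks up the offset on the faulty one. A monitoring scheme would still need a control limit on these residuals, which this sketch leaves out.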
{"title":"Distributed statistical process monitoring based on block-wise residual generator","authors":"Chudong Tong, Xinyan Zhou, Kai Qian, Xin Xu, Jiongting Jiang","doi":"10.1002/cem.3513","DOIUrl":"10.1002/cem.3513","url":null,"abstract":"<p>The increasing scale of modern chemical plants keeps popularizing investigation as well as application of distributed process monitoring approaches. With a goal of directly quantifying the normal relations between different blocks divided from the whole process, a novel multi-block modeling strategy called block-wise residual generator is proposed, which trains a residual generator for each block through using the partial least squares algorithm with single one output, so that the relation between the corresponding block and the others is quantified as a regression model in a block-wise manner. The deviations caused by the abnormal samples to the normal relations quantified for different blocks could thus be efficiently captured by the residuals generated from the block regression models, which then provide sensitive information for fault detection and contribution-based fault diagnosis. 
Moreover, the proposed method is applicable for both disjoint and overlapped block divisions, and the direct consideration of individually quantifying relations between different blocks can always guarantee its salient monitoring performance, as validated through comparisons with classical distributed process monitoring methods.</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":null,"pages":null},"PeriodicalIF":2.4,"publicationDate":"2023-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43826121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"In honour of Edmund R. Malinowski","authors":"Marcel Maeder","doi":"10.1002/cem.3499","DOIUrl":"10.1002/cem.3499","url":null,"abstract":"","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":null,"pages":null},"PeriodicalIF":2.4,"publicationDate":"2023-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49447150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mewa S. Dhanoa, Secundino López, Ruth Sanderson, Sue J. Lister, Ralph J. Barnes, Jennifer L. Ellis, James France
Scatter corrections are commonly applied to refine near-infrared (NIR) spectra. The aim of this study is to assess the impact of measurement errors when using ordinary least squares (OLS) for multiplicative scatter correction (MSC). Any measurement errors attached to the set-mean spectrum may attenuate the OLS slope, which in turn affects the estimate of the intercept and the adjustment of the spectra when MSC methods are used to mitigate scattering. A corrected least squares slope may be used instead to prevent this problem, although the impact of this approach on the final outcome will depend on the relative size of the measurement errors in the individual spectra and the set-mean spectrum. The errors-in-variables or type II regression model (also known as Deming regression) and its special cases, major axis (MA) and reduced major axis (RMA), are discussed and illustrated. The extent of OLS slope bias or attenuation is demonstrated, as is the resulting MSC spectral distortion. A further modification to the MSC transformation method is also suggested. The influence of scatter correction (by MSC, standard normal variate (SNV) and detrending), and of using the maximum likelihood estimate of the slope for MSC, on the prediction of the chemical composition of lucerne herbage from NIR spectra was assessed. Predictive performance was slightly improved by the scatter corrections, with fairly minor differences among methods. Nonetheless, it seems well worth considering type II regression models when applying MSC to improve the goodness of prediction from NIR spectra.
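The slope attenuation discussed above can be sketched with textbook formulas for the OLS, reduced major axis, and Deming slopes, followed by the usual MSC transformation (spectrum minus intercept, divided by slope). The data, the `delta` error-variance ratio (y-error variance over x-error variance, with 1.0 giving the major axis), and the function names are illustrative assumptions, not taken from the paper.

```python
from math import sqrt


def fit_line(x, y, method="ols", delta=1.0):
    """Return (intercept, slope) of y on x for the chosen slope estimator."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if method == "ols":
        slope = sxy / sxx
    elif method == "rma":  # reduced major axis
        slope = (1.0 if sxy >= 0 else -1.0) * sqrt(syy / sxx)
    elif method == "deming":  # delta = 1.0 corresponds to the major axis
        d = syy - delta * sxx
        slope = (d + sqrt(d * d + 4.0 * delta * sxy * sxy)) / (2.0 * sxy)
    else:
        raise ValueError(f"unknown method: {method}")
    return my - slope * mx, slope


def msc(spectrum, reference, method="ols", delta=1.0):
    """Multiplicative scatter correction against a reference (set-mean) spectrum."""
    intercept, slope = fit_line(reference, spectrum, method, delta)
    return [(v - intercept) / slope for v in spectrum]


true_ref = [1.0, 2.0, 3.0, 4.0, 5.0]
spectrum = [0.5 + 2.0 * r for r in true_ref]  # additive 0.5, multiplicative 2.0
corrected = msc(spectrum, true_ref)           # recovers the reference exactly

# Measurement error on the reference only (chosen to be uncorrelated with it).
noisy_ref = [1.3, 1.4, 3.6, 3.4, 5.3]
slopes = {m: fit_line(noisy_ref, spectrum, m)[1] for m in ("ols", "rma", "deming")}
```

In this constructed case the OLS slope is pulled furthest below the true value of 2, with the RMA and major-axis slopes closer to it. Which estimator is appropriate in practice depends on the relative error sizes in the individual spectra and the set-mean spectrum, as the abstract notes.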
{"title":"Methodology adjusting for least squares regression slope in the application of multiplicative scatter correction to near-infrared spectra of forage feed samples","authors":"Mewa S. Dhanoa, Secundino López, Ruth Sanderson, Sue J. Lister, Ralph J. Barnes, Jennifer L. Ellis, James France","doi":"10.1002/cem.3511","DOIUrl":"10.1002/cem.3511","url":null,"abstract":"<p>Scatter corrections are commonly applied to refine near-infrared (NIR) spectra. The aim of this study is to assess the impact of measurement errors when using ordinary least squares (OLS) for multiplicative scatter correction (MSC). Any measurement errors attached to the set-mean spectrum may attenuate the OLS slope and that in turn will affect the estimate of the intercept and the adjustment of the spectra when using MSC methods to mitigate scattering. A corrected least squares slope may be used instead to prevent this problem, although the impact of this approach on the final outcome will depend on the relative size of the measurement errors in the individual spectra and the set-mean spectrum. The errors-in-variables or type II regression model (also known as Deming regression) and its special cases, major axis (MA) and reduced major axis (RMA), are discussed and illustrated. The extent of OLS slope bias or attenuation is demonstrated as is the resulting MSC spectral distortion. Further modification to the MSC transformation method is also suggested. The influence of scattering correction (by MSC, standard normal variate (SNV) and detrending) and of using the maximum likelihood estimate of the slope for MSC on the prediction of chemical composition of Lucerne herbage from NIR spectra was assessed. The predictive performance was slightly improved by the use of scattering corrections with fairly minor differences among methods. 
Nonetheless, it seems well worth considering the use of type II regression models for assessing MSC application aiming at improving the goodness of prediction from NIR spectra.</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":null,"pages":null},"PeriodicalIF":2.4,"publicationDate":"2023-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/epdf/10.1002/cem.3511","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47508072","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Joakim Skogholt, Kristian H. Liland, Tormod Næs, Age K. Smilde, Ulf G. Indahl
In various situations requiring empirical model building from highly multivariate measurements, modelling based on partial least squares regression (PLSR) may often provide efficient low-dimensional model solutions. In unsupervised situations, the same may be true for principal component analysis (PCA). In both cases, however, it is also of interest to identify subsets of the measured variables useful for obtaining sparser but still comparable models without significant loss of information and performance. In the present paper, we propose a voting approach for sparse overall maximisation of variance analogous to PCA and a similar alternative for deriving sparse regression models closely related to the PLSR method. Both cases yield pivoting strategies for a modified Gram–Schmidt process and its corresponding (partial) QR-factorisation of the underlying data matrix to manage the variable selection process. The proposed methods include score and loading plot possibilities that are acknowledged for providing efficient interpretations of the related PCA and PLS models in chemometric applications.
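A minimal sketch of unsupervised principal-variable selection with a pivoted, modified Gram–Schmidt process is given below. The pivot rule used here (pick the variable whose direction explains the most variance summed over all variables, then deflate every column against it) is an assumption chosen to illustrate the "voting" idea; the abstract does not spell out the exact criterion, and the data are made up.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def select_principal_variables(columns, n_select, tol=1e-12):
    """Greedy pivoting: choose a column, orthogonalise the rest against it, repeat."""
    n = len(columns[0])
    cols = []
    for c in columns:  # mean-center each variable
        mean = sum(c) / n
        cols.append([v - mean for v in c])
    selected = []
    for _ in range(n_select):
        best_j, best_score = None, 0.0
        for j, cj in enumerate(cols):
            njj = dot(cj, cj)
            if j in selected or njj <= tol:
                continue
            # Variance of all (deflated) variables explained by the direction of cj.
            score = sum(dot(cj, ck) ** 2 for ck in cols) / njj
            if score > best_score:
                best_j, best_score = j, score
        if best_j is None:
            break
        norm = dot(cols[best_j], cols[best_j]) ** 0.5
        q = [v / norm for v in cols[best_j]]  # next orthonormal Q column
        deflated = []
        for c in cols:
            proj = dot(q, c)
            deflated.append([v - proj * qi for v, qi in zip(c, q)])
        cols = deflated
        selected.append(best_j)
    return selected


# Variables 0, 1 and 2 share a common trend; variable 3 is unrelated to it.
columns = [
    [6.0, 8.0, 10.0, 12.0, 14.0],
    [1.5, 2.0, 6.0, 6.0, 9.5],
    [-4.5, -1.0, -1.0, 3.0, 3.5],
    [1.0, -2.0, 0.0, 2.0, -1.0],
]
selected = select_principal_variables(columns, n_select=2)
```

The normalised pivot columns `q` are the columns of Q in a partial QR factorisation with column pivoting. A supervised variant would score each candidate against a response instead of against all variables.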
{"title":"Selection of principal variables through a modified Gram–Schmidt process with and without supervision","authors":"Joakim Skogholt, Kristian H. Liland, Tormod Næs, Age K. Smilde, Ulf G. Indahl","doi":"10.1002/cem.3510","DOIUrl":"10.1002/cem.3510","url":null,"abstract":"<p>In various situations requiring empirical model building from highly multivariate measurements, modelling based on partial least squares regression (PLSR) may often provide efficient low-dimensional model solutions. In unsupervised situations, the same may be true for principal component analysis (PCA). In both cases, however, it is also of interest to identify subsets of the measured variables useful for obtaining sparser but still comparable models without significant loss of information and performance. In the present paper, we propose a voting approach for sparse overall maximisation of variance analogous to PCA and a similar alternative for deriving sparse regression models influenced closely related to the PLSR method. Both cases yield pivoting strategies for a modified Gram–Schmidt process and its corresponding (partial) QR-factorisation of the underlying data matrix to manage the variable selection process. 
The proposed methods include score and loading plot possibilities that are acknowledged for providing efficient interpretations of the related PCA and PLS models in chemometric applications.</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":null,"pages":null},"PeriodicalIF":2.4,"publicationDate":"2023-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cem.3510","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46454642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Haoran Zhong, Elizabeth Donkor, Lisa Whitworth, Collin G. White, Kaushalya Sharma Dahal, Ayuba Fasasi, Thomas M. Hancewicz, Franklin Uba, Barry K. Lavine
In several previously published studies, Lavine and coworkers have demonstrated that infrared (IR) spectra from all layers of an intact multilayered automotive paint chip can be collected in a single analysis by scanning across each layer of a cross-sectioned paint chip using a Fourier transform IR imaging microscope. By applying alternating least squares to the spectral data, the IR spectrum of each layer of an original equipment manufacturer paint chip can be extracted from a line map of the spectral image. To further develop this imaging technique for automotive paint analysis, the capability to cross-section “small” paint chips (1 mm or less) using an ultramicrotome has been incorporated into our current imaging methodology. An ultramicrotome does not require epoxy or other embedding media for the paint chip, which simplifies the analysis. However, extracting the IR spectra for each layer of an original equipment manufacturer paint chip by alternating least squares can be problematic for thin peels (less than one micron thick), necessitating the use of target testing factor analysis to determine whether a specific layer is present in the line map and of modified alternating least squares to recover the IR spectrum of the layer. Using a new sample preparation technique and the appropriate multivariate curve resolution methods, high-quality IR spectra of the layers of a modern automotive paint system can be obtained from paint fragments that are smaller than what is practical to analyze by conventional Fourier transform IR spectroscopy.
{"title":"Application of ultramicrotomy and infrared imaging to the forensic examination of automotive paint","authors":"Haoran Zhong, Elizabeth Donkor, Lisa Whitworth, Collin G. White, Kaushalya Sharma Dahal, Ayuba Fasasi, Thomas M. Hancewicz, Franklin Uba, Barry K. Lavine","doi":"10.1002/cem.3509","DOIUrl":"10.1002/cem.3509","url":null,"abstract":"<p>In several previously published studies, Lavine and coworkers have demonstrated that infrared (IR) spectra from all layers of an intact multilayered automotive paint chip can be collected in a single analysis by scanning across each layer of a cross sectioned paint chip using a Fourier transform IR imaging microscope. Applying alternating least squares to the spectral data, the IR spectrum of each layer of an original equipment manufacturer paint chip can be extracted from a line map of the spectral image. To further develop this imaging technique for automotive paint analysis, the capability to cross section “small” paint chips (1 mm or less) using an ultramicrotome has been incorporated into our current imaging methodology. An ultramicrotome does not require epoxy or other embedding media for the paint chip and will simplify the analysis. However, extracting the IR spectra for each layer of an original equipment manufacturer paint chip by alternating least squares can be problematic for thin peels (less than one micron thickness), necessitating the use of target testing factor analysis to determine whether a specific layer is present in the line map and modified alternating least squares to recover the IR spectrum of the layer. 
Using a new sample preparation technique and the appropriate multivariate curve resolution methods, high quality IR spectra of the layers of a modern automotive paint system can be obtained from paint fragments that are smaller than what is practical to analyze by conventional Fourier transform IR spectroscopy.</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":null,"pages":null},"PeriodicalIF":2.4,"publicationDate":"2023-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48507824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nelson H. T. Lemes, Taináh M. R. Santos, Camila A. Tavares, Luciano S. Virtuoso, Kelly A. S. Souza, Teodorico C. Ramalho
All signals obtained as the instrumental response of analytical apparatus are affected by noise, and Raman spectroscopy is no exception. Because Raman scattering is an inherently weak process, the noise background may lead to misinterpretations. Although surface amplification of the Raman signal using metallic nanoparticles has been employed to partially solve the signal-to-noise problem, preprocessing Raman spectral data with mathematical filters has become an integral part of Raman spectroscopy analysis. In this paper, a modified Tikhonov method for removing random noise from experimental data is presented. To refine and improve the Tikhonov method as a filter, the proposed method includes the Euclidean norm of the fractional-order derivative of the solution as an additional criterion in the Tikhonov functional. In the strategy used here, the solution depends on the regularization parameter,