Pub Date: 2024-05-04 DOI: 10.1016/j.chemolab.2024.105136
Omar Nibouche, Fayas Asharindavida, Hui Wang, Jordan Vincent, Jun Liu, Saskia van Ruth, Paul Maguire, Enayet Rahman
The performance of the well-known and extensively studied Linear Discriminant Analysis (LDA) can degrade when the data are not homoscedastic or not Gaussian. In such cases the classical assumptions under which LDA models are built do not hold, and LDA projections may fail to extract the features needed to explain the intrinsic structure of the data and to separate the classes. As with many real-world data sets, data obtained using miniature spectrometers can suffer from these drawbacks, which would limit the deployment of this technology for food analysis. The solution presented in the paper is to divide classes into subclasses and to use the means of the subclasses, the classes, and the data in the suggested between-class scatter metric. Further, samples belonging to the same subclass are used to build a measure of within-subclass scatter. This addresses the shortcoming of classical LDA. Results obtained with the proposed solution on food data and on general machine learning datasets show that it is competitive with similar subclass LDA algorithms in the literature. An extension to a Hilbert space is also presented, and the kernel version of the presented solution can be fused with its linear counterpart to yield improved classification rates.
{"title":"A new sub-class linear discriminant for miniature spectrometer based food analysis","authors":"Omar Nibouche, Fayas Asharindavida, Hui Wang, Jordan Vincent, Jun Liu, Saskia van Ruth, Paul Maguire, Enayet Rahman","doi":"10.1016/j.chemolab.2024.105136","journal":{"name":"Chemometrics and Intelligent Laboratory Systems"},"PeriodicalIF":3.9,"publicationDate":"2024-05-04","publicationTypes":"Journal Article","openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169743924000765/pdfft?md5=79caa0e3ce066c5537d9c639d217ec83&pid=1-s2.0-S0169743924000765-main.pdf","platform":"Semanticscholar","paperid":"141055788"}
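The within-subclass scatter mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the common unnormalized definition (the sum of outer products of each sample's deviation from its own subclass mean); the paper's exact definition and its between-class metric may differ. The function names and the toy data are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def within_subclass_scatter(subclasses):
    """Sum of (x - mu)(x - mu)^T over all samples, with mu the sample's subclass mean.

    `subclasses` is a list of subclasses; each subclass is a list of feature vectors.
    This is an assumed, unnormalized textbook form, not the paper's own formula.
    """
    dim = len(subclasses[0][0])
    scatter = [[0.0] * dim for _ in range(dim)]
    for rows in subclasses:
        mu = mean_vector(rows)
        for x in rows:
            d = [xi - mi for xi, mi in zip(x, mu)]
            for a in range(dim):
                for b in range(dim):
                    scatter[a][b] += d[a] * d[b]
    return scatter


# Hypothetical toy data: two subclasses of two 2-D samples each.
toy_subclasses = [
    [[0.0, 0.0], [2.0, 0.0]],  # spreads only along the first feature
    [[5.0, 1.0], [5.0, 3.0]],  # spreads only along the second feature
]
S_w = within_subclass_scatter(toy_subclasses)
print(S_w)
```

Because each sample is compared with its own subclass mean rather than the class mean, a class made of several well-separated clusters does not inflate the within scatter, which is the motivation the abstract gives for splitting classes into subclasses.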
Pub Date: 2024-05-01 DOI: 10.1016/j.chemolab.2024.105135
Mohammed Benaafi, Sani I. Abba, Mojeed Opeyemi Oyedeji, Auwalu Saleh Mubarak, Jamilu Usman, Isam H. Aljundi
Groundwater (GW) salinization of coastal aquifers has become a serious obstacle to sustainable water resource management in Saudi Arabia and other parts of the world. It is therefore crucial to assess the extent of this salinization in order to protect and manage water resources effectively. This research used GW samples collected in the field at several locations, supported by experimental analysis based on ion chromatography (IC) and inductively coupled plasma mass spectrometry (ICP-MS), to analyze several GW physical, chemical, and hydro-geochemical elements. In this study, we model GW salinization with machine learning algorithms such as support vector regression, Gaussian process regression (GPR), artificial neural networks, and a least squares ensemble boosting regression tree. The performance of the standalone models was optimized with metaheuristic optimization-based algorithms such as the fuzzy hybridized genetic algorithm (ANFIS-GA) and particle swarm optimization (ANFIS-PSO). The outcomes based on three variable input combinations were validated using several performance indicators and graphical methods. The quantitative analysis indicated that GPR-Combo1 (MAE = 0.006 mg/L), Ensm-Combo2 (MAE = 0.025 mg/L), and GPR-Combo3 (MAE = 0.078 mg/L) performed best among the standalone combinations, where Combo1, Combo2, and Combo3 denote the model combinations derived from feature selection. The cumulative probability function (CPF) demonstrated that the heuristic optimizations ANFIS-GA (MAE = 0.0025 mg/L, MAPE = 0.19183) and ANFIS-PSO (MAE = 0.0018 mg/L, MAPE = 0.0723) outperformed the standalone models in error accuracy and served as a reliable approach. Both the standalone models and the heuristic algorithms used for GW salinization modeling demonstrated promising results in accurately predicting salinity. This approach could aid in effectively managing GW resources for sustainable development.
{"title":"Experimental-based groundwater salinization from the carbonate aquifer of eastern Saudi Arabia: Insight into machine learning coupled with meta-heuristic algorithms","authors":"Mohammed Benaafi, Sani I. Abba, Mojeed Opeyemi Oyedeji, Auwalu Saleh Mubarak, Jamilu Usman, Isam H. Aljundi","doi":"10.1016/j.chemolab.2024.105135","journal":{"name":"Chemometrics and Intelligent Laboratory Systems"},"PeriodicalIF":3.9,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"140822260"}
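The two error measures quoted in the abstract above, MAE and MAPE, can be sketched as follows. The sketch uses the standard textbook definitions. The abstract does not say whether MAPE is reported as a fraction or a percentage, so the fraction form here is an assumption, and the salinity values are hypothetical.

```python
def mae(actual, predicted):
    """Mean absolute error, in the same units as the data (e.g. mg/L)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, returned as a fraction (assumed convention).

    Undefined when any actual value is zero.
    """
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical measured and predicted values, for illustration only.
measured = [2.0, 4.0]
modelled = [2.5, 3.0]
print(mae(measured, modelled))   # 0.75
print(mape(measured, modelled))  # 0.25
```

MAE keeps the units of the response, while MAPE is scale-free, which is why the abstract can quote MAE in mg/L and MAPE without units.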
Pub Date: 2024-04-27 DOI: 10.1016/j.chemolab.2024.105134
Martina Beese, Tomass Andersons, Mathias Sawall, Cyril Ruckebusch, Adrián Gómez-Sánchez, Robert Francke, Adrian Prudlik, Robert Franke, Klaus Neymeyr
Multivariate curve resolution (MCR) methods are sometimes faced with missing or erroneous data, e.g., due to sensor saturation. In some cases, an estimation of the missing data is possible, but often MCR works with the largest submatrix without missing entries. This ignores all rows and columns of the data matrix that contain missing values. A successful approach to deal with incomplete data multisets has been proposed by Alier and Tauler (2013), but it does not include a factor ambiguity analysis. Here, the missing data problem is addressed in combination with a factor ambiguity analysis. An approach is presented that minimizes the factor ambiguity by extracting a maximum of spectral information even from incomplete rows and columns of the spectral data matrix. The method requires a high signal-to-noise ratio. Applications are presented for UV/Vis and HSI data.
{"title":"On the factor ambiguity of MCR problems for blockwise incomplete data sets","authors":"Martina Beese, Tomass Andersons, Mathias Sawall, Cyril Ruckebusch, Adrián Gómez-Sánchez, Robert Francke, Adrian Prudlik, Robert Franke, Klaus Neymeyr","doi":"10.1016/j.chemolab.2024.105134","journal":{"name":"Chemometrics and Intelligent Laboratory Systems"},"PeriodicalIF":3.9,"publicationDate":"2024-04-27","publicationTypes":"Journal Article","openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169743924000741/pdfft?md5=bb7d17fc695f88d0275f3839df0eb621&pid=1-s2.0-S0169743924000741-main.pdf","platform":"Semanticscholar","paperid":"140815811"}
Pub Date: 2024-04-23 DOI: 10.1016/j.chemolab.2024.105133
Veronica Ferrari, Rosalba Calvini, Camilla Menozzi, Alessandro Ulrici, Marco Bragolusi, Roberto Piro, Alessandra Tata, Michele Suman, Giorgia Foca
Dried oregano leaves are particularly prone to adulteration because of their widespread distribution and their easy mixing with leaves of other plants of lower commercial value, such as olive, myrtle, strawberry tree, or sumac. To reveal the presence of adulteration, in this study we considered an untargeted analytical approach which, instead of involving the a priori selection of specific compounds of interest, focuses on defining the characteristic spectral signature of authentic oregano with respect to its most frequent adulterants. NIR HyperSpectral Imaging (NIR-HSI) is a state-of-the-art, rapid and non-destructive technique that collects both spectral and spatial information from the sample, making it particularly suitable for characterizing visually heterogeneous samples.
Authentication issues are typically assessed through class modelling techniques, and Soft Independent Modelling of Class Analogy (SIMCA) is one of the most used algorithms in this scenario. However, the high variability and heterogeneity within the authentic oregano class resulted in poor outcomes when SIMCA was applied. As an alternative, the Soft Partial Least Squares Discriminant Analysis (Soft PLS-DA) algorithm was applied to differentiate authentic oregano samples from pure adulterants. Soft PLS-DA is a hybrid approach that combines the advantages of both discriminant and class modelling techniques. The resulting classification model led to promising results, achieving a prediction efficiency of 92.9 %. Finally, based on the percentage of pixels predicted as oregano in the Soft PLS-DA prediction images, a threshold value of 10 % was established, serving as a detection limit of NIR-HSI to distinguish authentic oregano samples from adulterated ones.
{"title":"Addressing adulteration challenges of dried oregano leaves by NIR HyperSpectral Imaging","authors":"Veronica Ferrari, Rosalba Calvini, Camilla Menozzi, Alessandro Ulrici, Marco Bragolusi, Roberto Piro, Alessandra Tata, Michele Suman, Giorgia Foca","doi":"10.1016/j.chemolab.2024.105133","journal":{"name":"Chemometrics and Intelligent Laboratory Systems"},"PeriodicalIF":3.9,"publicationDate":"2024-04-23","publicationTypes":"Journal Article","openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S016974392400073X/pdfft?md5=9ca1205b6902ee41304da3031bdead5a&pid=1-s2.0-S016974392400073X-main.pdf","platform":"Semanticscholar","paperid":"140640785"}
Pub Date: 2024-04-23 DOI: 10.1016/j.chemolab.2024.105132
Thamraa Alshahrani, Ganesh Jethave, Anil Nemade, Yogesh Khairnar, Umesh Fegade, Monali Khachane, Amir Al-Ahmed, Firoz Khan
Statistics can be used in a variety of ways to present, compute, and critically analyze experimental data. To determine the significance and validity of the experimental data, a variety of statistical tests are used. Using a synthesized CoO/NiO/MnO₂ nanocomposite, the present study used adsorption to remove the dye Bromophenol Blue (BPB) from a contaminated aqueous solution. In order to (a) determine the optimal pH of the solution, (b) confirm the experiment's success, and (c) investigate the effect of adsorbent dose on BPB dye removal from aqueous solutions, the experimental data were statistically analyzed through hypothesis testing using the t-test, paired t-test, and Chi-square test. The null hypothesis that the optimal pH value is 7 is accepted since t_observed (−1.979) < t_tabulated (−2.262). Since χ²_observed (1.052) < χ²_tabulated (3.841), the null hypothesis that a higher adsorbent dose leads to a higher % removal of dye is accepted. The R² values of the obtained Freundlich and Langmuir adsorption isotherms, both close to 1, indicate that the isotherms are favorable. Karl Pearson's correlation coefficients for the Langmuir and Freundlich adsorption isotherms were found to be 0.9693 and 0.9994 respectively, which shows a strong correlation between the factors. The ANN model predicted the adsorption percentage with a regression value R of 0.996. The ANN model predicted 99.60 % BPB dye adsorption under optimized parametric conditions. The ANN model produced values that were more precise, reliable, and reproducible, demonstrating its superiority.
{"title":"Application of ANN, hypothesis testing and statistics to the adsorptive removal of toxic dye by nanocomposite","authors":"Thamraa Alshahrani, Ganesh Jethave, Anil Nemade, Yogesh Khairnar, Umesh Fegade, Monali Khachane, Amir Al-Ahmed, Firoz Khan","doi":"10.1016/j.chemolab.2024.105132","journal":{"name":"Chemometrics and Intelligent Laboratory Systems"},"PeriodicalIF":3.9,"publicationDate":"2024-04-23","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"140767102"}
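The Chi-square comparison reported in the abstract above can be reproduced approximately with the standard library alone. This is a sketch under assumptions: the abstract does not state the degrees of freedom or the significance level, and one degree of freedom with α = 0.05 is assumed here because that combination matches the tabulated value of 3.841. The function name is hypothetical.

```python
from statistics import NormalDist


def chi2_critical_df1(alpha):
    """Upper-tail critical value of the chi-squared distribution with 1 degree of freedom.

    A chi-squared variable with 1 degree of freedom is the square of a standard
    normal variable, so the critical value is the squared two-sided normal quantile.
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z * z


critical = chi2_critical_df1(0.05)  # about 3.841 (assumed df = 1, alpha = 0.05)
observed = 1.052                    # observed statistic quoted in the abstract
reject_null = observed > critical
print(round(critical, 3), reject_null)
```

Under these assumptions the observed statistic falls below the critical value, so the null hypothesis is not rejected, which agrees with the conclusion stated in the abstract.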
Pub Date: 2024-04-18 DOI: 10.1016/j.chemolab.2024.105131
Yi Liu, Mingwei Jia, Danya Xu, Tao Yang, Yuan Yao
The surge in data-driven soft sensors for industrial processes is evident. However, most of them suffer from the limitation of being black-box models and this will hamper their widespread use. In response to this challenge, this study proposes a physics-guided graph-learning soft sensor that integrates a physical understanding of industrial processes by incorporating graph-based concepts with process physics. The soft sensor first constructs physical information based on causal relationships between variables using the conditional Granger causality test. Subsequently, it autonomously learns the unique sample information of each observation while employing a regularization loss to ensure the sparsity of the learned information. The model employs a two-stream structure for spatiotemporal encoding of both the physical and sample information. The modeling and prediction results on a penicillin fermentation process indicate that, using the proposed method, the knowledge gained from the data aligns with existing prior knowledge. This approach shows promise in filling the gap between data-driven and physics-based modeling in chemical processes.
{"title":"Physics-guided graph learning soft sensor for chemical processes","authors":"Yi Liu, Mingwei Jia, Danya Xu, Tao Yang, Yuan Yao","doi":"10.1016/j.chemolab.2024.105131","journal":{"name":"Chemometrics and Intelligent Laboratory Systems"},"PeriodicalIF":3.9,"publicationDate":"2024-04-18","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"140635693"}
Pub Date: 2024-04-16 DOI: 10.1016/j.chemolab.2024.105122
Andrew T. Karl
We introduce a heuristic to test the significance of fit of Self-Validated Ensemble Models (SVEM) against the null hypothesis of a constant response. An SVEM model averages predictions from nBoot fits of a model, applied to fractionally weighted bootstraps of the target dataset. It tunes each fit on a validation copy of the training data, utilizing anti-correlated weights for training and validation. The proposed test computes SVEM predictions centered by the response column mean and normalized by the ensemble variability at each of nPoint points spaced throughout the factor space. A reference distribution is constructed by refitting the SVEM model to nPerm randomized permutations of the response column and recording the corresponding standardized predictions at the nPoint points. A reduced-rank singular value decomposition applied to the centered and scaled nPerm × nPoint reference matrix is used to calculate the Mahalanobis distance for each of the nPerm permutation results as well as the jackknife (holdout) Mahalanobis distance of the original response column. The process is repeated independently for each response in the experiment, producing a joint graphical summary. We present a simulation-driven power analysis and discuss limitations of the test relating to model flexibility and design adequacy. The test maintains the nominal Type I error rate even when the base SVEM model contains more parameters than observations.
{"title":"A randomized permutation whole-model test heuristic for Self-Validated Ensemble Models (SVEM)","authors":"Andrew T. Karl","doi":"10.1016/j.chemolab.2024.105122","journal":{"name":"Chemometrics and Intelligent Laboratory Systems"},"PeriodicalIF":3.9,"publicationDate":"2024-04-16","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"140794829"}
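The anti-correlated training and validation weights mentioned in the abstract above can be illustrated with a small sketch. The abstract does not give the weighting formula. The scheme below (training weight −ln(U), validation weight −ln(1 − U) for a shared uniform draw U) is the one usually described for fractionally weighted bootstraps in SVEM and is an assumption here, as are all function names.

```python
import math
import random


def svem_weight_pairs(n, rng):
    """Draw n (training, validation) weight pairs from shared uniform draws.

    Assumed scheme: train = -ln(U), valid = -ln(1 - U). A row that gets a large
    training weight gets a small validation weight, and vice versa.
    """
    pairs = []
    while len(pairs) < n:
        u = rng.random()
        if u == 0.0:  # avoid log(0); random() can return exactly 0.0
            continue
        pairs.append((-math.log(u), -math.log(1.0 - u)))
    return pairs


def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)
pairs = svem_weight_pairs(1000, rng)
train_w = [p[0] for p in pairs]
valid_w = [p[1] for p in pairs]
corr = pearson(train_w, valid_w)
print(corr < 0)  # the two weight columns are negatively correlated
```

In a full SVEM fit, one such pair of weight columns would be drawn for each of the nBoot fits. The permutation test described in the abstract then repeats the whole ensemble fit for each of the nPerm shuffles of the response column.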
Pub Date: 2024-04-15 DOI: 10.1016/j.chemolab.2024.105120
Xudong Huang, Guangzao Huang, Xiaojing Chen, Zhonghao Xie, Shujat Ali, Xi Chen, Leiming Yuan, Wen Shi
Partial least squares (PLS) regression is a linear regression technique that performs well with high-dimensional regressors. Like many other supervised learning techniques, PLS suffers when the prediction and training data are drawn from different distributions, which degrades its performance. To address this problem, an adaptive strategy based on the minimum covariance determinant (MCD) estimator is proposed, which aims to find an appropriate training set for the adaptive construction of an accurate PLS model that fits the prediction data. In this study, an h-subset of the merged set of prediction and training data with the smallest covariance determinant is found via the MCD estimator. The prediction and training data whose Mahalanobis distances to the h-subset are less than or equal to a cutoff, taken as the square root of a quantile of the chi-squared distribution, are assumed to have the same distribution, and a PLS model is then built on these training data. The proposed method is applied to three real-world datasets and compared with classic PLS. The most significant improvement is obtained for the m5 prediction data in the corn dataset, where the root mean square error of prediction (RMSEP) is reduced from 0.149 to 0.023. On the other datasets, the method also performs better than PLS. The experimental results show the effectiveness of the method.
{"title":"An adaptive strategy to improve the partial least squares model via minimum covariance determinant","authors":"Xudong Huang, Guangzao Huang, Xiaojing Chen, Zhonghao Xie, Shujat Ali, Xi Chen, Leiming Yuan, Wen Shi","doi":"10.1016/j.chemolab.2024.105120","DOIUrl":"https://doi.org/10.1016/j.chemolab.2024.105120","journal":{"name":"Chemometrics and Intelligent Laboratory Systems"},"PeriodicalIF":3.9,"publicationDate":"2024-04-15","publicationTypes":"Journal Article","isOpenAccess":false}
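The sample-selection step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `crude_mcd`, `chi2_quantile_approx` and `select_training_rows` are hypothetical names, the MCD search is a simplified C-step procedure from random starts without the consistency correction or reweighting used by full MCD implementations, and the chi-squared quantile is replaced by the Wilson-Hilferty approximation to avoid extra dependencies. It also assumes the data have already been reduced to a few dimensions (for example, scores), since the covariance of the h-subset must be invertible.

```python
import numpy as np
from statistics import NormalDist


def mahalanobis_sq(X, mu, cov):
    """Squared Mahalanobis distance of each row of X to the location mu and scatter cov."""
    diff = X - mu
    return np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)


def crude_mcd(X, h, n_starts=20, n_iter=50, seed=0):
    """Approximate MCD: run C-steps from random h-subsets, keep the smallest log-determinant."""
    rng = np.random.default_rng(seed)
    best_logdet, best_mu, best_cov = np.inf, None, None
    for _ in range(n_starts):
        idx = np.sort(rng.choice(len(X), size=h, replace=False))
        for _ in range(n_iter):
            mu = X[idx].mean(axis=0)
            cov = np.atleast_2d(np.cov(X[idx], rowvar=False))
            # C-step: keep the h samples closest to the current estimate
            new_idx = np.sort(np.argsort(mahalanobis_sq(X, mu, cov))[:h])
            if np.array_equal(new_idx, idx):
                break
            idx = new_idx
        mu = X[idx].mean(axis=0)
        cov = np.atleast_2d(np.cov(X[idx], rowvar=False))
        logdet = np.linalg.slogdet(cov)[1]
        if logdet < best_logdet:
            best_logdet, best_mu, best_cov = logdet, mu, cov
    return best_mu, best_cov


def chi2_quantile_approx(q, dof):
    """Wilson-Hilferty approximation to the chi-squared quantile (an assumption of this sketch)."""
    z = NormalDist().inv_cdf(q)
    a = 2.0 / (9.0 * dof)
    return dof * (1.0 - a + z * np.sqrt(a)) ** 3


def select_training_rows(X_train, X_pred, h_frac=0.75, q=0.975, seed=0):
    """Boolean mask over training rows whose distance to the h-subset is within the cutoff."""
    merged = np.vstack([X_train, X_pred])
    h = int(h_frac * len(merged))
    mu, cov = crude_mcd(merged, h, seed=seed)
    cutoff = np.sqrt(chi2_quantile_approx(q, merged.shape[1]))
    dist = np.sqrt(mahalanobis_sq(X_train, mu, cov))
    return dist <= cutoff
```

A PLS model would then be fitted on `X_train[mask]` and the matching responses only; the values of `h_frac` and `q` here are placeholders, not the settings used in the paper.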
Pub Date : 2024-04-04 DOI: 10.1016/j.chemolab.2024.105121
Sujin Lee, Sungkyu Jung
An important problem in compositional data analysis is variable selection in linear regression models with compositional covariates. In the context of microbiome data analysis, there is a demand for considering grouping information such as structures among taxa and multiple sampling sites, resulting in multiple compositional covariates. We develop and compare two different methods of variable selection and inference strategies, based on the debiased lasso and a resampling-based approach. Confidence intervals for individual regression coefficients, obtained from each of the two methods, are shown to be asymptotically valid even in a high-dimension, low-sample-size regime. However, microbiome data often have extremely small sample sizes, rendering asymptotic results unreliable. Through extensive numerical comparisons of the finite-sample performances of the two methods, we find that resampling-based approaches outperform the debiased compositional lasso in cases of extremely small sample sizes, showing higher positive predictive values. Conversely, for larger sample sizes, debiasing yields better results. We apply the proposed multiple compositional regression to steer microbiome data, identifying key bacterial taxa associated with important cattle quality measures.
{"title":"Variable selection and inference strategies for multiple compositional regression","authors":"Sujin Lee, Sungkyu Jung","doi":"10.1016/j.chemolab.2024.105121","DOIUrl":"https://doi.org/10.1016/j.chemolab.2024.105121","journal":{"name":"Chemometrics and Intelligent Laboratory Systems"},"PeriodicalIF":3.9,"publicationDate":"2024-04-04","publicationTypes":"Journal Article","isOpenAccess":false}
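For orientation, regression with compositional covariates is commonly written as a linear log-contrast model with a zero-sum constraint on the coefficients of each composition. The formulation below is that standard form extended to G compositional groups. It is an assumption made here for illustration, since the abstract does not state the model, and the symbols (G, p_g, beta_gj, lambda) are introduced only for this sketch.

```latex
% Assumed linear log-contrast model with G compositional covariate groups
y_i = \sum_{g=1}^{G} \sum_{j=1}^{p_g} \beta_{gj} \log x_{igj} + \varepsilon_i,
\qquad \text{subject to } \sum_{j=1}^{p_g} \beta_{gj} = 0 \quad (g = 1, \dots, G).

% Assumed constrained lasso used for variable selection
\hat{\beta} = \arg\min_{\beta}\;
\frac{1}{2n} \sum_{i=1}^{n} \Bigl( y_i - \sum_{g=1}^{G} \sum_{j=1}^{p_g} \beta_{gj} \log x_{igj} \Bigr)^{2}
+ \lambda \sum_{g=1}^{G} \sum_{j=1}^{p_g} \lvert \beta_{gj} \rvert
\qquad \text{subject to } \sum_{j=1}^{p_g} \beta_{gj} = 0 \quad (g = 1, \dots, G).
```

Under this reading, the debiased approach would correct the shrinkage bias of the lasso estimate before forming confidence intervals, while the resampling approach would refit the selection on resampled data.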
Pub Date : 2024-03-28 DOI: 10.1016/j.chemolab.2024.105118
Rosalba Calvini, José Manuel Amigo
Sparse-based models are powerful tools for data compression, variable reduction, and model complexity reduction. Nevertheless, their major issue is the high computational time needed for large matrices. This manuscript proposes, for the first time, coupling randomised decomposition as a first step before the sparsity calculations, followed by a projection of the full data onto a reduced, sparse set of loadings. This will drastically reduce the time needed for calculations and build models that are as reliable as their sparse-based counterparts. While this new approach might be valid for several scenarios (exploration, regression and classification), we will focus on exploration methods (like Principal Component Analysis, PCA) applied to large datasets of hyperspectral images. Two datasets of different complexity have been tested, and the benefits of the coupled randomisation and sparse PCA (rsPCA) are extensively studied.
{"title":"Coupling randomisation and sparse modelling for the exploratory analysis of large hyperspectral datasets","authors":"Rosalba Calvini, José Manuel Amigo","doi":"10.1016/j.chemolab.2024.105118","DOIUrl":"https://doi.org/10.1016/j.chemolab.2024.105118","journal":{"name":"Chemometrics and Intelligent Laboratory Systems"},"PeriodicalIF":3.9,"publicationDate":"2024-03-28","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169743924000583/pdfft?md5=9376189f17f4e06ebcb4a11a7944f64d&pid=1-s2.0-S0169743924000583-main.pdf"}
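The three stages named in this abstract (randomised decomposition, sparsification of the loadings, projection of the full data) can be illustrated with a short sketch. It is not the authors' rsPCA algorithm: the randomised step is a standard random-projection range finder, and the sparsity step is replaced by simple soft-thresholding of the loadings, a stand-in chosen here for brevity. The function names and the `frac` and `oversample` parameters are hypothetical.

```python
import numpy as np


def randomized_loadings(X, k, oversample=10, seed=0):
    """Approximate the top-k right singular vectors of X with a random range finder."""
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((X.shape[1], k + oversample))
    Q, _ = np.linalg.qr(X @ omega)          # orthonormal basis for the sampled column space
    B = Q.T @ X                             # small (k + oversample) x p matrix
    _, _, Vt = np.linalg.svd(B, full_matrices=False)
    return Vt[:k].T                         # p x k loadings


def soft_threshold_loadings(V, frac=0.2):
    """Zero out small loadings (soft-thresholding) and rescale each column to unit norm."""
    thresh = frac * np.abs(V).max(axis=0, keepdims=True)
    S = np.sign(V) * np.maximum(np.abs(V) - thresh, 0.0)
    return S / np.linalg.norm(S, axis=0, keepdims=True)


def rs_pca(X, k, frac=0.2, seed=0):
    """Return scores of the centred full data on sparse loadings, plus the loadings."""
    Xc = X - X.mean(axis=0)
    P = soft_threshold_loadings(randomized_loadings(Xc, k, seed=seed), frac)
    return Xc @ P, P
```

For a hyperspectral image, `X` would be the unfolded pixels-by-wavelengths matrix. Note that the thresholded loadings are no longer exactly orthogonal, which is a known trade-off of sparse loadings in general.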