Zhen-Zhen Li, Niu Huang, Lun-Zhao Yi, Guang-Hui Fu
Prediction in imbalanced domains is currently an active research topic. Many real-world data mining analyses must build predictive models from imbalanced data. Within this setting, classification has been studied extensively, whereas regression has received comparatively little attention. In imbalanced regression, rare target values occur infrequently, yet the goal is to predict the continuous target variable of exactly these rare instances accurately. One of the challenges is finding a suitable strategy to rebalance the original dataset so that the model predicts rare instances better. In this study, two algorithms are proposed: sigma nearest over-sampling based on convex combination for regression (SNOCCR) and affine combination-based over-sampling (ACOS). ACOS rebalances the original dataset by generating new instances as affine combinations of the original examples. The region in which new instances are generated can be adjusted to the distribution of the data, so that the generated cases better mimic the distribution of the original examples. ACOS, SNOCCR, and other preprocessing methods were compared on 15 datasets to evaluate how well models trained on the rebalanced datasets predict rare instances. The experimental results indicate that ACOS outperforms the other existing methods.
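The difference between a convex and an affine combination can be illustrated with a minimal sketch. This is not the authors' ACOS or SNOCCR implementation: the function names, the weight range, and the toy instances below are assumptions chosen only to show that affine weights still sum to one but may leave the interval [0, 1], so synthetic points can fall outside the segment between two rare instances.

```python
import random

def affine_combine(x_i, x_j, lam):
    """Return (1 - lam) * x_i + lam * x_j; the two weights always sum to 1."""
    return [a + lam * (b - a) for a, b in zip(x_i, x_j)]

def oversample_pair(x_i, x_j, n_new, low=-0.25, high=1.25, seed=0):
    """Draw n_new synthetic instances on the line through x_i and x_j.

    low/high are hypothetical bounds on the weight; [0, 1] would restrict
    the result to convex combinations (the segment between the two points).
    """
    rng = random.Random(seed)
    return [affine_combine(x_i, x_j, rng.uniform(low, high)) for _ in range(n_new)]

# Toy rare instances: two features followed by the continuous target.
rare_a = [1.0, 2.0, 10.0]
rare_b = [3.0, 6.0, 14.0]

midpoint = affine_combine(rare_a, rare_b, 0.5)       # convex: [2.0, 4.0, 12.0]
extrapolated = affine_combine(rare_a, rare_b, 1.5)   # affine only: [4.0, 8.0, 16.0]
synthetic = oversample_pair(rare_a, rare_b, n_new=5)
```

Widening or narrowing the weight interval is one simple way to control the region in which new instances are generated; how ACOS adapts that region to the data distribution is not specified in the abstract.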
{"title":"Affine combination-based over-sampling for imbalanced regression","authors":"Zhen-Zhen Li, Niu Huang, Lun-Zhao Yi, Guang-Hui Fu","doi":"10.1002/cem.3537","DOIUrl":"10.1002/cem.3537","url":null,"PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":null,"pages":null},"PeriodicalIF":2.4,"publicationDate":"2024-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139771515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Two new models have recently been introduced for studying the rotational ambiguity that remains in the bilinear decomposition of matrix data. The first is N-BANDS, which yields two extreme profiles per sample component, corresponding to the maximum and minimum of the signal contribution function or of the relative component area under its concentration profile. It is highly useful for computing the relative root mean square error due to rotational ambiguity in estimated analyte concentrations (RMSE_RA), which numerically quantifies the impact of the phenomenon in terms of prediction uncertainty. Because N-BANDS accounts for the presence of instrumental noise in the data, it is well suited to the analysis of real data sets. The second model is SW-N-BANDS, which is similar to N-BANDS but is applied in a sensor-wise manner, that is, it computes the maximum and minimum intensity value at each sensor. It provides the boundaries of the full set of feasible profiles and helps to better understand the behavior of a given component when several constraints are applied. Both models are described using simulations and experimental data, illustrating the characteristics most relevant to analytical chemistry studies.
{"title":"Two new methods for the estimation and interpretation of the range of feasible profiles in multivariate curve resolution and their implications to analytical chemistry","authors":"Alejandro C. Olivieri","doi":"10.1002/cem.3535","DOIUrl":"10.1002/cem.3535","url":null,"PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":null,"pages":null},"PeriodicalIF":2.4,"publicationDate":"2024-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139648433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bin Li, Jiping Zou, Chengtao Su, Feng Zhang, Yande Liu, Jian Wu, Nan Chen, Yihua Xiao
Impact damage is one of the key factors affecting the quality of honey peaches, and quantifying it is of great significance for postharvest quality sorting. To predict impact damage quantitatively, this study fused color characteristics with reflectance spectra (R), absorbance spectra (A), and Kubelka-Munk spectra (K-M). The mechanical parameters of honey peaches during collision were obtained using a single-pendulum collision device. Reflectance spectra and color characteristics of the damaged peaches were acquired with a hyperspectral imaging system. The R spectra were converted into A and K-M spectra, and partial least squares regression (PLSR) models were built on each of the three spectra, alone and combined with color characteristics, to predict the mechanical parameters. The results show that combining color features with spectral information improves the prediction performance of the PLSR model. To eliminate redundant information in the spectral data, the competitive adaptive reweighted sampling (CARS) algorithm was used to select characteristic wavelengths from the three spectra, and the selected wavelengths were fused with the color features to establish PLSR models. The model built on the characteristic wavelengths of the A spectrum combined with the color features gave the best prediction performance for the mechanical parameters: the R_P value is 0.862 for maximum force and 0.894 for damage depth. The results not only provide theoretical support for the quality sorting, packaging, and transportation of honey peaches but also offer a reference for studying the biomechanical properties of other agricultural products.
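The spectral conversions mentioned above can be sketched as follows. The abstract does not state the exact formulas the authors used, so this example assumes the standard definitions A = log10(1/R) and K-M = (1 - R)^2 / (2R) for reflectance values R in (0, 1]; the function names and sample values are illustrative only.

```python
import math

def reflectance_to_absorbance(reflectance):
    """Apparent absorbance A = log10(1 / R) for each reflectance value R."""
    return [math.log10(1.0 / r) for r in reflectance]

def reflectance_to_kubelka_munk(reflectance):
    """Kubelka-Munk function F(R) = (1 - R)^2 / (2 R) for each value R."""
    return [(1.0 - r) ** 2 / (2.0 * r) for r in reflectance]

# Hypothetical reflectance values at three wavelengths (fractions, not percent).
reflectance = [0.10, 0.50, 1.00]

absorbance = reflectance_to_absorbance(reflectance)
kubelka_munk = reflectance_to_kubelka_munk(reflectance)
```

Both transforms are undefined for R = 0, so real spectra would need to be checked or clipped before conversion.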
{"title":"Quantitative study on impact damage of honey peaches based on reflection, absorbance, and Kubelka-Munk spectrum combined with color characteristics","authors":"Bin Li, Jiping Zou, Chengtao Su, Feng Zhang, Yande Liu, Jian Wu, Nan Chen, Yihua Xiao","doi":"10.1002/cem.3532","DOIUrl":"10.1002/cem.3532","url":null,"PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":null,"pages":null},"PeriodicalIF":2.4,"publicationDate":"2024-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139516305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nina Tomčić, Milica Jankov, Petar Ristivojević, Jelena Trifković, Filip Andrić
According to a study carried out at the University of Bristol, 60% of oregano spices on the European Union (EU) market are adulterated with olive, myrtle, cistus, and hazelnut leaves. According to the same authors, sage products are adulterated with similar bulking agents. The aim of this study was to assess the possibilities for detecting adulteration of sage with olive leaves using high-performance thin-layer chromatography (HPTLC) coupled with digital image analysis and multivariate linear regression/classification (partial least squares and partial least squares discriminant analysis). Twenty-four samples (4 pure sage leaves, 4 pure olive leaves, and 16 mixtures of olive and sage leaves with added olive leaf contents of 5%, 10%, 20%, and 50%) were prepared, extracted, and analyzed under normal-phase conditions. Several derivatization methods were tested, and the derivatized HPTLC plates were inspected under visible or ultraviolet light. Digital images of the chromatograms were recorded. To minimize the effects of intraplate and interplate peak shifts, background changes, and baseline drifts, correlation-optimized warping, standard normal variate, and mean centering were applied to the acquired signals. Partial least squares and partial least squares discriminant analysis models of moderate complexity (two to four latent variables), based on chromatographic signals obtained after derivatization with FeCl3, anisaldehyde–sulfuric acid, and 2,2-diphenyl-1-picrylhydrazyl, showed good statistical performance, with R^2 ranging from 0.894 to 0.998 and a relative prediction error of 4–12%. A misclassification error below 4% was obtained with 2,2-diphenyl-1-picrylhydrazyl and anisaldehyde–sulfuric acid derivatization. HPTLC combined with multivariate image analysis, signal processing, and linear modeling therefore proved to be a promising, cost-effective chromatographic tool for assessing adulteration of sage with olive leaves.
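Two of the preprocessing steps named above, standard normal variate (SNV) and mean centering, can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the authors' pipeline: the signals are invented, SNV here uses the sample standard deviation, and correlation-optimized warping is not shown.

```python
from statistics import mean, stdev

def snv(signal):
    """Standard normal variate: centre and scale one signal by its own mean and std."""
    m = mean(signal)
    s = stdev(signal)  # sample standard deviation; assumes a non-constant signal
    return [(v - m) / s for v in signal]

def mean_center(signals):
    """Subtract the column-wise mean (across samples) from every signal."""
    n = len(signals)
    col_means = [sum(col) / n for col in zip(*signals)]
    return [[v - m for v, m in zip(row, col_means)] for row in signals]

# Two hypothetical chromatographic signals that differ only by scale.
raw = [
    [1.0, 2.0, 3.0],
    [10.0, 20.0, 30.0],
]

normalized = [snv(row) for row in raw]   # both rows become [-1.0, 0.0, 1.0]
centered = mean_center(raw)              # each column now averages to zero
```

SNV works on each signal separately and removes per-sample offset and scale differences, while mean centering works across samples so that the subsequent latent-variable model describes variation around the average signal.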
{"title":"Assessment of adulteration of sage (Salvia sp.) with olive leaves using high-performance thin-layer chromatography, image analysis, and multivariate linear modeling","authors":"Nina Tomčić, Milica Jankov, Petar Ristivojević, Jelena Trifković, Filip Andrić","doi":"10.1002/cem.3533","DOIUrl":"10.1002/cem.3533","url":null,"PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":null,"pages":null},"PeriodicalIF":2.4,"publicationDate":"2024-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139516211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep neural networks have become an important tool for soft sensor modeling. However, common deep autoencoder networks only mine the effective information of each layer's input during hierarchical training. They ignore the loss of effective information from the original input data, which accumulates layer by layer and results in an incomplete feature representation of the original input. At the same time, they neither mine the temporal correlation between process samples nor provide an adaptive mechanism to strengthen temporally related features, so process information is insufficiently exploited. In addition, deep neural networks are generally prone to overfitting. To address these issues, an adaptive deep fusion neural network (ADFNN) method is proposed. The method reconstructs the original input data at each layer of the feature extraction network. By including the reconstruction error of the original input in the pre-training loss, it reduces the loss of effective information from the original input. Sliding windows and a self-attention mechanism are used to select historical samples and calculate their contribution to the current sample, integrating temporally related information, and dependence on high-dimensional local features is overcome by minimizing a Kullback-Leibler (KL) divergence penalty term. Finally, the temporal features are adaptively weighted and fed to a fully connected network to achieve quality prediction. Simulation experiments on a debutanizer case and an industrial polyethylene production case were conducted to verify the effectiveness of the proposed method. The experimental results show that, compared with the stacked autoencoder (SAE), target-dependent stacked autoencoder (TSAE), and stacked isomorphic autoencoder (SIAE) models, ADFNN improves prediction accuracy by 2.4%, 1.7%, and 0.5%, respectively, in the debutanizer case and by 3.6%, 3.3%, and 1.8%, respectively, in the industrial polyethylene production case.
{"title":"Adaptive deep fusion neural network based soft sensor for industrial process","authors":"Xiaoping Guo, Jialin Chong, Yuan Li","doi":"10.1002/cem.3529","DOIUrl":"10.1002/cem.3529","url":null,"PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":null,"pages":null},"PeriodicalIF":2.4,"publicationDate":"2024-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139516580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Olav M. Kvalheim, Warren S. Vidar, Tim U. H. Baumeister, Roger G. Linington, Nadja B. Cech
When the explanatory variables are linearly dependent, partial least squares (PLS) regression is commonly used for regression analysis. If the response variable correlates strongly with the explanatory variables, a model with excellent predictive ability can usually be obtained. Ranking of variable importance is commonly used to interpret the model, and sometimes this interpretation guides further experimentation. For instance, when analyzing natural product extracts for bioactivity, an underlying assumption is that the highest ranked compounds represent the best candidates for isolation and further testing. A problem with this approach is that, in most cases, the number of compounds is larger (usually much larger) than the number of samples and the concentrations of the compounds correlate. Furthermore, compounds may interact as synergists or as antagonists. If the modeling process does not account for this possibility, the interpretation can be thoroughly wrong, because unmodeled variables that strongly influence the response confound a first-order PLS model and send the experimenter down the wrong track. We show the consequences of this with a practical example from natural product research. Furthermore, we show that when possible interactions between explanatory variables are included, visualization using a selectivity ratio plot may provide a model interpretation that can be used to make inferences.
{"title":"Implications of confounding from unmodeled interactions between explanatory variables when using latent variable regression models to make inferences","authors":"Olav M. Kvalheim, Warren S. Vidar, Tim U. H. Baumeister, Roger G. Linington, Nadja B. Cech","doi":"10.1002/cem.3531","DOIUrl":"10.1002/cem.3531","url":null,"PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":null,"pages":null},"PeriodicalIF":2.4,"publicationDate":"2024-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cem.3531","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139459018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The validation principles for quantitative structure-activity relationship (QSAR) models issued by the Organisation for Economic Co-operation and Development describe three criteria of model assessment: goodness of fit, robustness, and prediction. Robustness can be assessed by two internal validation approaches: bootstrap and cross-validation. We compared the validation metrics of the two approaches by checking their sample size dependence, their rank correlations to other metrics, and their uncertainty. We used modeling methods ranging from multivariate linear regression to artificial neural networks on 14 open access datasets. We found that the metrics show similar sample size dependence and similar correlation to other validation parameters. For both approaches, the uncertainty that originates from the calculation recipe of the metric itself is much smaller than the part caused by the selection of the training set or the training/test split. We concluded that the metrics of the two techniques are interchangeable, but cross-validation parameters are easier to interpret because their range is similar to that of goodness-of-fit and prediction metrics. Furthermore, at equal computational load, the variance originating from the random elements of the calculation is slightly smaller for cross-validation metrics than for bootstrap metrics.
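The two internal validation approaches can be compared on a toy problem. The sketch below is an assumption-laden illustration, not the metric definitions used in the paper: it fits a one-variable least-squares line to invented data, computes a leave-one-out cross-validated Q², and computes a bootstrap Q² from pooled out-of-bag predictions with the in-bag mean as reference.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx

def r_squared(y, y_hat, ref_mean):
    """1 - SS_res / SS_tot, with SS_tot taken around ref_mean."""
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - ref_mean) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

def q2_loo(x, y):
    """Leave-one-out cross-validation: predict each sample from all the others."""
    preds = []
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        preds.append(slope * x[i] + intercept)
    return r_squared(y, preds, sum(y) / len(y))

def q2_bootstrap(x, y, n_rounds=200, seed=0):
    """Bootstrap: fit on a resample, score on the out-of-bag samples, pool all rounds."""
    rng = random.Random(seed)
    n = len(x)
    ss_res = ss_tot = 0.0
    for _ in range(n_rounds):
        drawn = [rng.randrange(n) for _ in range(n)]
        in_bag = set(drawn)
        # Skip degenerate resamples: nothing left out, or no spread in x.
        if len(in_bag) == n or len({x[i] for i in in_bag}) < 2:
            continue
        xt, yt = [x[i] for i in drawn], [y[i] for i in drawn]
        slope, intercept = fit_line(xt, yt)
        train_mean = sum(yt) / n
        for i in range(n):
            if i not in in_bag:
                ss_res += (y[i] - (slope * x[i] + intercept)) ** 2
                ss_tot += (y[i] - train_mean) ** 2
    return 1.0 - ss_res / ss_tot

# Invented data: roughly y = 2x + 1 with small deviations.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
y = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0, 12.8, 15.2, 16.9, 19.1]

slope, intercept = fit_line(x, y)
r2_fit = r_squared(y, [slope * a + intercept for a in x], sum(y) / len(y))
q2_cv = q2_loo(x, y)
q2_boot = q2_bootstrap(x, y)
```

Leave-one-out is deterministic for a given dataset, whereas the bootstrap estimate depends on the random resamples (fixed here through `seed`), which is one source of the calculation-related variance the abstract refers to.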
{"title":"The difference of model robustness assessment using cross-validation and bootstrap methods","authors":"Rita Lasfar, Gergely Tóth","doi":"10.1002/cem.3530","DOIUrl":"10.1002/cem.3530","url":null,"PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":null,"pages":null},"PeriodicalIF":2.4,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
When near-infrared (NIR) techniques are used for analysis, model construction and maintenance updates are essential. When a model is constructed with machine learning, the sample set is usually divided into a calibration set and a validation set. The representativeness of the calibration set and the reasonable distribution of the validation set affect the accuracy of the established model. In addition, when maintaining and updating models, selecting the most informative update samples not only improves the prediction accuracy of the model but also reduces sample preparation. In this paper, spectral information entropy (SIE) is proposed as a similarity criterion for dividing the sample set, and the same criterion is used to select update samples. The Kennard–Stone (KS) method and the sample set partitioning based on joint x–y distance (SPXY) method were used for comparison to verify the superiority of the proposed method. The results showed that the model built after dividing the sample set with the SIE method predicts well compared with the KS and SPXY methods. When predicting soluble solid content (SSC) and hardness, the prediction determination coefficient (