Pub Date: 2024-09-02 DOI: 10.1016/j.chemolab.2024.105225
Xiaoyu Qian, Jinru Wu, Ligong Wei, Youwu Lin
In classification problems, many high-performing models fail to provide confidence estimates or intervals for individual predictions. This lack of reliability poses risks in real-world applications and makes these models difficult to trust. Conformal prediction methods, which are distribution-free and model-free and offer finite-sample coverage guarantees, have recently been widely used to construct prediction sets for classification models. However, traditional conformal prediction methods only produce set-valued results without specifying a definitive predicted class. In complex settings in particular, these methods do not help models cope with challenges such as high dimensionality, which results in ambiguous prediction sets with low statistical efficiency, i.e., prediction sets that contain many false classes. In this study, a novel Ensemble Conformal Prediction algorithm based on Random Projection and a designed voting strategy, RPECP, is developed to tackle these challenges. First, a procedure for selecting approximately oracle random projections and classifiers is executed to best leverage the internal information and structure of the data. Next, based on these approximately oracle random projections and the underlying classifiers, conformal prediction is performed on new test samples in a lower-dimensional space, yielding multiple independent prediction sets. Finally, an accurate predicted class and a precise prediction set with high coverage and statistical efficiency are produced through the designed voting strategy. Compared with several base classifiers, RPECP obtains higher classification accuracy; compared with other conformal prediction algorithms, it achieves less ambiguous prediction sets with fewer false classes while guaranteeing high coverage. For illustration, this paper demonstrates RPECP's superiority over other methods in four cases: two high-dimensional settings and two real-world datasets.
Random projection ensemble conformal prediction for high-dimensional classification. Xiaoyu Qian, Jinru Wu, Ligong Wei, Youwu Lin. Chemometrics and Intelligent Laboratory Systems, vol. 253, Article 105225 (2024-09-02). DOI: 10.1016/j.chemolab.2024.105225
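As a rough illustration of the prediction sets this abstract refers to, the sketch below implements plain split conformal prediction for classification. It is not RPECP: there is no random projection, ensemble, or voting step. The function names, the nonconformity score (one minus the predicted probability of the true class), and the toy probabilities are all assumptions made for illustration.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Return the score threshold from a held-out calibration set.

    cal_probs  : list of per-class probability lists (one per sample)
    cal_labels : list of true class indices
    alpha      : target miscoverage level, e.g. 0.1 for 90% coverage
    """
    # Nonconformity score (an assumed, commonly used choice):
    # 1 - probability assigned to the true class.
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        # Too few calibration points: every class must be included.
        return float("inf")
    return scores[k - 1]


def prediction_set(probs, qhat):
    """Keep every class whose score does not exceed the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= qhat]


# Toy calibration data (hypothetical classifier outputs, two classes).
cal_probs = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8], [0.7, 0.3]]
cal_labels = [0, 1, 1, 1]
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.5)

confident = prediction_set([0.75, 0.25], qhat)  # one clear class
ambiguous = prediction_set([0.5, 0.5], qhat)    # both classes retained
```

An ambiguous test sample yields a set with more than one class, which is the kind of inefficiency the abstract says RPECP aims to reduce.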
Dimensionality reduction is an essential step in the processing of analytical chemistry data. When this reduction is carried out by variable selection, it can enable the identification of biochemical pathways. CovSel has been developed to meet this requirement, through a parsimonious selection of non-redundant variables. This article presents the g-CovSel method, which modifies the CovSel algorithm to produce highly complementary groups containing highly correlated variables. This modification requires the theoretical definition of the groups' construction and of the deflation of the data with respect to the selected groups. Two applications, on two extreme case studies, are presented. The first, based on near-infrared spectra related to four chemicals, demonstrates the relevance of the selected groups and the method's ability to handle highly correlated variables. The second, based on genomic data, demonstrates the method's ability to handle very highly multivariate data. Most of the groups formed can be interpreted from a functional point of view, making g-CovSel a tool of choice for biomarker identification in omics. Further work will be carried out to generalize g-CovSel to multi-block and multi-way data.
G-CovSel: Covariance oriented variable clustering. Jean-Michel Roger, Alessandra Biancolillo, Bénédicte Favreau, Federico Marini. Chemometrics and Intelligent Laboratory Systems, vol. 254, Article 105223 (2024-08-29). DOI: 10.1016/j.chemolab.2024.105223. Open-access PDF: https://www.sciencedirect.com/science/article/pii/S0169743924001631/pdfft?md5=52fb71b18968f61fe29df549f8fc05f7&pid=1-s2.0-S0169743924001631-main.pdf
Pub Date: 2024-08-28 DOI: 10.1016/j.chemolab.2024.105221
Peng Shan, Hongming Xiao, Xiang Li, Ruige Yang, Lin Zhang, Yuliang Zhao
Honey is a nourishing, natural food product favored by a diverse group of consumers. Proton nuclear magnetic resonance (1H NMR) is a powerful tool for the quantitative analysis of honey and plays a crucial role in ensuring its quality. The 1H NMR technique requires multivariate calibration models for the quantitative analysis of key compounds in honey. However, maintaining consistent measurement conditions across different years is rarely possible, which can significantly shift the distributions of training and test spectra and ultimately reduce the performance of predictive models. Unsupervised domain adaptation (UDA) methods have gained considerable attention for their ability to align the distributions of labeled source spectra and unlabeled target spectra without costly annotation. To enhance the generalizability of quantitative models to honey from different years, we propose a UDA method called partial least squares subspace and optimal transport-based UDA (PLSS-OT-UDA). This approach eliminates distribution differences between the source subspace and the target subspace via partial least squares (PLS) dimensionality reduction and optimal transport (OT). First, the optimal latent variable weight matrix is extracted from the source domain (i.e., labeled 1H NMR data from 2017) with PLS. Next, the dimensions of both the source and target domains (the target being unlabeled 1H NMR data from 2018) are reduced, and their corresponding subspaces are obtained with the weight matrix of the source domain. Finally, OT is employed to align the distributions of the source and target domains within the subspace. Experimental results on the honey dataset demonstrate that PLSS-OT-UDA outperforms traditional methods, including transfer component analysis (TCA), optimal transport for domain adaptation (OTDA), domain adaptation based on principal component analysis and optimal transport (PCA-OTDA), and subspace alignment (SA), in generalization performance on three components: Baumé degree, sugar content, and water content.
Enhancing quantitative 1H NMR model generalizability on honey from different years through partial least squares subspace and optimal transport based unsupervised domain adaptation. Peng Shan, Hongming Xiao, Xiang Li, Ruige Yang, Lin Zhang, Yuliang Zhao. Chemometrics and Intelligent Laboratory Systems, vol. 254, Article 105221 (2024-08-28). DOI: 10.1016/j.chemolab.2024.105221
Guar gum is a non-ionic polysaccharide that is abundant in nature. It can be used as a thickening agent, stabilizer, or emulsifier in pharmaceutical formulations, food products, and cosmetics. Its ability to form viscous solutions makes it useful in drug delivery systems and controlled-release formulations, and as a matrix for oral drug delivery. The investigation of chemical structures through graph invariants is of great interest. Topological descriptors are numerical values associated with a molecular structure that can predict certain physical and chemical properties of that structure. In this paper, we calculate the harmonic index, the inverse sum indeg index, the third Zagreb index, the hyper-Zagreb index, the sigma index, the reformulated first Zagreb index, the reformulated multiplicative first Zagreb index, the harmonic–arithmetic index, and the atom-bond sum-connectivity indices of guar gum and its chemical derivatives. Finally, the chemical applicability of these topological descriptors is assessed for different carbohydrates (monosaccharides, disaccharides, and polysaccharides) using straight-line, parabolic, and logarithmic regression models. These topological descriptors are found to be useful for predicting two physical properties, namely density and molecular weight.
Analyzing topological descriptors of guar gum and its derivatives for predicting physical properties in carbohydrates. Xiujun Zhang, Shamaila Yousaf, Anisa Naeem, Ferdous M. Tawfiq, Adnan Aslam. Chemometrics and Intelligent Laboratory Systems, vol. 253, Article 105203 (2024-08-24). DOI: 10.1016/j.chemolab.2024.105203
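To make the idea of a degree-based topological descriptor concrete, the sketch below computes four of the indices named in the abstract from an edge list, using their standard textbook definitions. The function name, the dictionary keys, and the three-vertex path used as input are illustrative assumptions; the graph is not a model of guar gum, and a simple graph without repeated edges is assumed.

```python
from collections import Counter


def degree_based_indices(edges):
    """Compute degree-based indices of a simple undirected graph.

    edges : list of (u, v) vertex pairs, each edge listed once.
    """
    # Vertex degrees from the edge list.
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    # Degree pair (d_u, d_v) for every edge uv.
    pairs = [(deg[u], deg[v]) for u, v in edges]
    return {
        # Harmonic index: sum of 2 / (d_u + d_v)
        "harmonic": sum(2.0 / (a + b) for a, b in pairs),
        # Inverse sum indeg index: sum of d_u * d_v / (d_u + d_v)
        "inverse_sum_indeg": sum(a * b / (a + b) for a, b in pairs),
        # Hyper-Zagreb index: sum of (d_u + d_v)^2
        "hyper_zagreb": sum((a + b) ** 2 for a, b in pairs),
        # Sigma index: sum of (d_u - d_v)^2
        "sigma": sum((a - b) ** 2 for a, b in pairs),
    }


# Hypothetical toy graph: a path on three vertices (degrees 1, 2, 1).
path3 = [(0, 1), (1, 2)]
idx = degree_based_indices(path3)
```

Each index is a single number per structure, which is what allows it to serve as a predictor in the regression models mentioned in the abstract.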
Pub Date: 2024-08-24 DOI: 10.1016/j.chemolab.2024.105218
Knut Dyrstad, Frank Westad
Definitive screening design (DSD) has become a widely used type of Design of Experiments (DOE) for chemical, pharmaceutical, and biopharmaceutical process and product development, because it supports optimization by estimating main, interaction, and squared variable effects with a minimum number of experiments. These high-dimensional DOEs, with more variables than samples and with partly correlated variables, frequently make statistical interpretation challenging. The purpose of this study was to test bootstrap PLSR with a heredity procedure for selecting the variable subset that is finally evaluated by MLR. The heredity selection was applied to bootstrap T values, defined as the original PLSR coefficients (B) divided by the bootstrap-estimated standard deviation. The fractional weighted and non-parametric bootstrap PLSR variants investigated gave the same variable selection and the same final models in this study.
A simulation study with 7 main variables and 12 real-data DSDs from the literature with 4, 5, 7, and 8 main variables showed improved model performance for small and particularly for large DSDs with the bootstrap PLSR MLR methods compared with two common DSD reference methods: DSD fit definitive screening and AICc forward stepwise regression (AICc FSR). Variable selection accuracy and predictive ability were significantly improved by the investigated method in 6 of 13 DSDs compared with the best model from either of the two reference methods. The remaining 7 DSDs gave the same model as the best reference model. Strong heredity was found to provide the best models for all real data in this study. Applying the heredity procedure to the percent non-zero SVEM FSR variable effects, followed by MLR, showed promising results. AICc Lasso regression was among the other methods that were partially tested; it set almost all variable effects to zero when tested on three large minimum DSDs. While the DSD fit definitive screening method may often be the first choice for DSD, heredity bootstrap PLSR MLR and heredity SVEM FSR MLR may be alternative methods for improving variable selection and model precision.
Interpretation of high dimensional definitive screening designs assisted by bootstrapped partial least squares regression. Knut Dyrstad, Frank Westad. Chemometrics and Intelligent Laboratory Systems, vol. 253, Article 105218 (2024-08-24). DOI: 10.1016/j.chemolab.2024.105218
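The two ingredients named in this abstract, a bootstrap T value (coefficient divided by its bootstrap standard deviation) and strong heredity, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the PLSR fitting and resampling are omitted, the significance cutoff of 2.0 is an arbitrary placeholder, and the term encoding (tuples of parent variable names) and all numbers are hypothetical.

```python
import statistics


def bootstrap_t(coef, boot_coefs):
    """Original coefficient divided by the bootstrap standard deviation."""
    return coef / statistics.stdev(boot_coefs)


def strong_heredity_filter(t_values, threshold=2.0):
    """Select terms whose |T| passes the cutoff under strong heredity.

    t_values : dict mapping a term to its T value, where a term is a
               tuple of parent variables: ("A",) is a main effect,
               ("A", "B") an interaction, ("A", "A") a squared effect.
    Strong heredity: a second-order term is kept only if every one of
    its parent main effects is itself selected.
    """
    active_main = {
        term[0]
        for term, t in t_values.items()
        if len(term) == 1 and abs(t) >= threshold
    }
    selected = []
    for term, t in t_values.items():
        if abs(t) < threshold:
            continue
        if all(parent in active_main for parent in term):
            selected.append(term)
    return selected


# Hypothetical T values for a three-factor design.
t_values = {
    ("A",): 5.0,
    ("B",): 0.5,
    ("C",): -3.0,
    ("A", "B"): 4.0,   # dropped: parent B is not active
    ("A", "C"): 2.5,
    ("B", "B"): 3.0,   # dropped: parent B is not active
    ("C", "C"): -2.2,
}
selected = strong_heredity_filter(t_values)
t_example = bootstrap_t(3.0, [2.0, 3.0, 4.0])
```

In the workflow described in the abstract, the terms retained this way would then be passed to MLR for the final model.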
Pub Date: 2024-08-23 DOI: 10.1016/j.chemolab.2024.105222
Honghong Wang, Qiong Wu, Wuye Yang, Jie Yu, Ting Wu, Zhixin Xiong, Yiping Du
The determination of total nicotine, total sugar, reducing sugar, and total nitrogen contents in tobacco is of great significance for tobacco quality evaluation and formulation design. To rapidly determine these four components, two NIR-MIR spectral fusion techniques are studied using near-infrared (NIR) and mid-infrared (MIR) spectral data from 129 solid tobacco powder samples provided by Shanghai Tobacco Group Co., Ltd. Fusion technology 1 selects variables from each spectrum separately and then fuses the selected feature variables to build the model. Fusion technology 2 first fuses the NIR and MIR spectral data and then selects variables to build the model. Both fusion technologies use the successive projections algorithm (SPA), competitive adaptive reweighted sampling (CARS), backward interval PLS (biPLS), forward interval PLS (fiPLS), synergy interval PLS (siPLS), and interval interaction moving window partial least squares (iMWPLS) algorithms to select wavelength variables. The results show that, for total nicotine and total sugar, the PLSR model built with fusion technology 2 combined with the iMWPLS algorithm is the best: compared with the full-spectrum fusion method, its RMSEP decreases from 0.2314 and 1.3225 to 0.0821 and 0.8079, respectively, and it is superior to the single NIR and MIR models and to NIR-MIR fusion technology 1. For reducing sugar, the simple full-spectrum fusion model has the best analytical ability and the lowest RMSEP; it is superior to the single NIR and MIR models and to all models built with the two spectral fusion techniques combined with the six wavelength selection algorithms. For total nitrogen, the model built with fusion technology 1 combined with the iMWPLS algorithm predicts significantly better than the single NIR and MIR models and NIR-MIR fusion technology 2, with an RMSEP of 0.0634. These results show that the two NIR-MIR spectral fusion techniques make full use of the complementary information provided by NIR and MIR spectroscopy and can be successfully applied to the rapid determination of total nicotine, total sugar, reducing sugar, and total nitrogen content in tobacco, providing a new approach to the rapid detection of tobacco components.
NIR and MIR spectral feature information fusion strategy for multivariate quantitative analysis of tobacco components. Honghong Wang, Qiong Wu, Wuye Yang, Jie Yu, Ting Wu, Zhixin Xiong, Yiping Du. Chemometrics and Intelligent Laboratory Systems, vol. 253, Article 105222 (2024-08-23). DOI: 10.1016/j.chemolab.2024.105222
Pub Date: 2024-08-23 DOI: 10.1016/j.chemolab.2024.105220
Hang Ci, Chengxi Zhang, Shunyi Zhao
This paper proposes a discrete-time method for jointly estimating the state and unknown inputs (UIs) of industrial processes represented by a state-space model. To cope with outliers in process data, the measurement noise is modeled by the Student’s t-distribution. The UIs are identified through the recursive expectation–maximization (REM) approach. Specifically, in the E-step, a recursively calculated Q-function is formulated using the maximum likelihood criterion, and the states and the variance scale factor are estimated iteratively. In the M-step, the UIs are updated analytically, while the degrees of freedom are updated approximately. The effectiveness of the proposed algorithm is validated using a quadruple water tank process and a continuous stirred tank reactor. The results show that the proposed method significantly enhances the robustness and accuracy of state and UI estimation in industrial processes, effectively handling outliers and reducing computational demands for real-time applications.
Joint state and process inputs estimation for state-space models with Student’s t-distribution. Hang Ci, Chengxi Zhang, Shunyi Zhao. Chemometrics and Intelligent Laboratory Systems, vol. 253, Article 105220 (2024-08-23). DOI: 10.1016/j.chemolab.2024.105220
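The role of the variance scale factor mentioned in this abstract can be illustrated with the standard Gaussian scale-mixture view of the Student's t-distribution. The sketch below is a simplified scalar case under stated assumptions, not the paper's recursive algorithm: the residual is treated as known (state uncertainty is ignored), and the function and parameter names are hypothetical. For a scalar residual r, nominal noise variance R, and degrees of freedom nu, the posterior mean of the scale factor is (nu + 1) / (nu + r^2 / R).

```python
def expected_scale(residual, noise_var, dof):
    """Posterior mean of the scale factor for a scalar measurement.

    Assumes the mixture form: noise ~ N(0, noise_var / lam),
    with lam ~ Gamma(dof / 2, dof / 2).
    """
    return (dof + 1.0) / (dof + residual ** 2 / noise_var)


def effective_variance(residual, noise_var, dof):
    """Measurement variance after down-weighting by the scale factor."""
    return noise_var / expected_scale(residual, noise_var, dof)


# Hypothetical values: nominal variance 1.0 and 4 degrees of freedom.
w_inlier = expected_scale(0.0, 1.0, 4.0)        # small residual, weight > 1
w_outlier = expected_scale(4.0, 1.0, 4.0)       # large residual, weight < 1
var_outlier = effective_variance(4.0, 1.0, 4.0)
```

A large residual shrinks the weight and so inflates the effective measurement variance, which is how a Student's t noise model limits the influence of outliers on the state estimate.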
This work shows the potential of advanced linear and nonlinear learning algorithms for predicting the texture sensory attributes of apples, namely “hardness”, “crunchiness”, “flouriness”, “fibrousness”, and “graininess”. Starting from the information contained in the entire mechanical and acoustic curves acquired during sample compression tests, the prediction performance of five different statistical tools, including Partial Least Squares regression (PLS), Multilayer Perceptron (MLP), Support Vector Regression (SVR) and Gaussian Process Regression (GPR), is shown and discussed.
Validation of all predictive models shows the best accuracies for the texture sensory attributes “hardness” and “crunchiness” and, in general, for the GPR learning algorithm. By combining mechanical and acoustic profiles, 5-fold cross-validation yields coefficients of determination R² of up to 0.885 (GPR) and 0.840 (GPR) for “hardness” and “crunchiness”, respectively. These results, comparable to those obtained by using a large number of mechanical and acoustic parameters extracted from the acquired profiles as predictive factors, point to a new and reliable way of predicting the texture sensory attributes of apples. The proposed approach removes the need to define in advance the number and type of features to be calculated from instrumental texture profiles, and it can be easily implemented in an automatic process.
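The 5-fold cross-validated R² quoted above can be illustrated with a minimal sketch. It is not the authors' pipeline: a one-variable least-squares line stands in for GPR, the folds are formed by taking every fifth sample, and out-of-fold predictions are pooled before a single R² is computed (the abstract does not say whether R² was pooled or averaged per fold). All function names are hypothetical.

```python
from statistics import mean

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x (stand-in model, NOT GPR)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b

def kfold_r2(x, y, k=5):
    """Pool out-of-fold predictions over k folds, then compute one R^2."""
    n = len(x)
    preds = [0.0] * n
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # assumption: every k-th sample forms a fold
        train = [i for i in range(n) if i not in test_idx]
        a, b = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in test_idx:
            preds[i] = a + b * x[i]  # prediction from a model that never saw sample i
    return r2_score(y, preds)

# Toy data: an exactly linear relation, so every held-out point is predicted exactly.
x = [float(i) for i in range(20)]
y = [2.0 * xi + 1.0 for xi in x]
cv_r2 = kfold_r2(x, y)
```

With real profiles, the stand-in line would be replaced by the regressor under study, and the folds would normally be shuffled rather than taken in order.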
{"title":"Combining algorithm techniques with mechanical and acoustic profiles for the prediction of apples sensory attributes","authors":"Riccardo Ricci , Annachiara Berardinelli , Flavia Gasperi , Isabella Endrizzi , Farid Melgani , Eugenio Aprea","doi":"10.1016/j.chemolab.2024.105217","DOIUrl":"10.1016/j.chemolab.2024.105217","url":null,"abstract":"<div><p>The research work shows the potentiality of advanced linear and nonlinear learning algorithm techniques in the prediction of apples texture sensory attributes as “hardness”, “crunchiness”, “flouriness”, “fibrousness”, and “graininess”. Starting from the information contained in the entire mechanical and acoustic curves acquired during samples compression test, the prediction performances of five different statistical tools as Partial Least Squares regression (PLS), Multilayer Perceptron (MLP), Support Vector Regression (SVR) and Gaussian Process Regression (GPR) are shown and discussed.</p><p>All Predictive models validations evidence best accuracies for texture sensory attributes “hardness” and “crunchiness” and in general for GPR learning algorithm. By combining mechanical and acoustic profiles, 5-fold cross validations produce values of coefficient of determination R<sup>2</sup> up to 0.885 (GPR) and 0.840 (GPR), respectively for “hardness” and “crunchiness”. These results, comparable to those obtained by considering a large number of mechanical and acoustic parameters extracted from acquired profiles as predictive factors, evidence a new and reliable way for the prediction of texture sensory attributes of apples. 
The proposed approach can overcome the necessity to define, in advance, number and type of features to be calculated from instrumental texture profiles and can be easily implemented in an automatic process.</p></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"253 ","pages":"Article 105217"},"PeriodicalIF":3.7,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142049080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-08-22 DOI: 10.1016/j.chemolab.2024.105219
Wael A. Mahdi, Ahmad J. Obaidullah
In this study, we develop predictive models for three target variables, δ_d, δ_p, and δ_h, using a dataset with 86 features and 181 samples. The response parameters, which are Hansen solubility parameters, were correlated with the input parameters via several machine learning techniques. The input features are molecular descriptors of coformers, calculated from the COSMO-RS thermodynamic model and a group contribution approach. The analysis includes outlier detection via Cook's distance, normalization with a min-max scaler, and feature selection through L1-based methods. Three regression models, Gaussian Process Regression (GPR), Passive Aggressive Regression (PAR), and Polynomial Regression (PR), are employed, with hyperparameter optimization performed using Transient Search Optimization (TSO). For δ_d, the PAR model outperforms the others with an R² score of 0.885, RMSE of 0.607, MAE of 0.524, and a maximum error of 1.294. The GPR model shows slightly lower performance on δ_d, with an R² of 0.872, RMSE of 0.816, MAE of 0.579, and a maximum error of 2.755. The PR model reaches an R² of 0.814, RMSE of 0.923, MAE of 0.597, and a maximum error of 2.814 on δ_d. For δ_p, the GPR model provides the best performance, achieving an R² score of 0.821, RMSE of 1.693, MAE of 1.391, and a maximum error of 3.457. The PAR model reaches an R² of 0.740, RMSE of 2.025, MAE of 1.980, and a maximum error of 6.609 on δ_p, while the PR model predicts δ_p with an R² of 0.7, RMSE of 2.329, MAE of 2.02, and a maximum error of 6.366. Similarly, for δ_h, the GPR model again shows superior performance with an R² score of 0.983, RMSE of 1.243, MAE of 1.005, and a maximum error of 2.577. The PAR model also accurately predicts δ_h with an R² of 0.924, RMSE of 2.713, MAE of 2.416, and a maximum error of 6.307. Additionally, the PR model predicts δ_h with an R² of 0.927, RMSE of 2.757, MAE of 2.334, and a maximum error of 8.064. These results highlight the efficacy of the chosen models and optimization techniques in accurately p
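For readers unfamiliar with the four scores reported in this abstract (R², RMSE, MAE, maximum error) and with min-max normalization, the following is a minimal sketch using their standard definitions. The function names and toy numbers are hypothetical and are not taken from the paper.

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE, MAE and maximum absolute error for paired values."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    n = len(errors)
    y_bar = sum(y_true) / n
    ss_res = sum(e * e for e in errors)             # residual sum of squares
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)  # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "max_error": max(abs(e) for e in errors),
    }

def min_max_scale(values):
    """Rescale values linearly to [0, 1]; assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Toy example (made-up numbers): one prediction is off by 2, the rest are exact.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]
metrics = regression_metrics(y_true, y_pred)
scaled = min_max_scale([10.0, 15.0, 20.0])
```

RMSE and maximum error weight large individual misses more heavily than MAE does, so a model can rank differently depending on which score is used.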
{"title":"Combination of machine learning and COSMO-RS thermodynamic model in predicting solubility parameters of coformers in production of cocrystals for enhanced drug solubility","authors":"Wael A. Mahdi , Ahmad J. Obaidullah","doi":"10.1016/j.chemolab.2024.105219","DOIUrl":"10.1016/j.chemolab.2024.105219","url":null,"abstract":"<div><p>In this study, we develop predictive models for three target variables, denoted as <span><math><mrow><msub><mi>δ</mi><mi>d</mi></msub></mrow></math></span>, <span><math><mrow><msub><mi>δ</mi><mi>p</mi></msub></mrow></math></span>, and <span><math><mrow><msub><mi>δ</mi><mi>h</mi></msub></mrow></math></span> using a dataset with 86 features and 181 samples. The response parameters, which are Hansen solubility parameters, were correlated to input parameters via several machine learning techniques. The input features are molecular descriptors of coformers which are calculated based on COMSO-RS thermodynamic model and group contribution approach. The analysis includes outlier detection via Cook's distance, normalization with a min-max scaler, and feature selection through L1-based methods. Three regression models—Gaussian Process Regression (GPR), Passive Aggressive Regression (PAR), and Polynomial Regression (PR)—are employed, with hyperparameter optimization achieved using Transient Search Optimization (TSO). The results indicate that for <span><math><mrow><msub><mi>δ</mi><mi>d</mi></msub></mrow></math></span>, the PAR model outperforms others with an R<sup>2</sup> score of 0.885, RMSE of 0.607, MAE of 0.524, and a maximum error of 1.294. The GPR model shows slightly lower performance with an R<sup>2</sup> of 0.872, RMSE of 0.816, MAE of 0.579, and a maximum error of 2.755 for <span><math><mrow><msub><mi>δ</mi><mi>d</mi></msub></mrow></math></span>. 
The PR model performs on <span><math><mrow><msub><mi>δ</mi><mi>d</mi></msub></mrow></math></span> with an R<sup>2</sup> of 0.814, RMSE of 0.923, MAE of 0.597, and a maximum error of 2.814. For <span><math><mrow><msub><mi>δ</mi><mi>p</mi></msub></mrow></math></span>, the GPR model provides the best performance, achieving an R<sup>2</sup> score of 0.821, RMSE of 1.693, MAE of 1.391, and a maximum error of 3.457. The PAR model performs on <span><math><mrow><msub><mi>δ</mi><mi>p</mi></msub></mrow></math></span> with an R<sup>2</sup> of 0.740, RMSE of 2.025, MAE of 1.980, and a maximum error of 6.609. Also, The PR model predicts <span><math><mrow><msub><mi>δ</mi><mi>p</mi></msub></mrow></math></span> with a R<sup>2</sup> of 0.7, RMSE of 2.329, MAE of 2.02, and maximum error of 6.366. Similarly, for <span><math><mrow><msub><mi>δ</mi><mi>h</mi></msub></mrow></math></span>, the GPR model again shows superior performance with an R<sup>2</sup> score of 0.983, RMSE of 1.243, MAE of 1.005, and a maximum error of 2.577. The PAR model also accurately predicts <span><math><mrow><msub><mi>δ</mi><mi>h</mi></msub></mrow></math></span> with a R<sup>2</sup> of 0.924, RMSE of 2.713, MAE of 2.416, and maximum error of 6.307. Additionally, the PR model predicts <span><math><mrow><msub><mi>δ</mi><mi>h</mi></msub></mrow></math></span> with a R<sup>2</sup> of 0.927, RMSE of 2.757, MAE of 2.334, and maximum error of 8.064. 
These results highlight the efficacy of the chosen models and optimization techniques in accurately p","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"253 ","pages":"Article 105219"},"PeriodicalIF":3.7,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142087063","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-08-22 DOI: 10.1016/j.chemolab.2024.105216
Mohammed Alqarni, Shaimaa Mohammed Al Harthi, Mohammed Abdullah Alzubaidi, Ali Abdullah Alqarni, Bandar Saud Shukr, Hassan Talat Shawli
In this study, a comprehensive multi-scale computational strategy based on mass transfer and machine learning was developed to simulate the drug concentration distribution in a biomaterial matrix. The controlled release was modeled and validated via the hybrid model. Mass transfer equations, along with kinetics models, were solved numerically, and the results were then used to train machine learning models. We investigated the performance of three regression models, namely Decision Tree (DT), Random Forest (RF), and Extra Tree (ET), in predicting drug concentration (C) from r and z data. Hyperparameter optimization was conducted using Glowworm Swarm Optimization (GSO). The results revealed high predictive accuracy across all models, with ET performing best, achieving a coefficient of determination (R²) of 0.99854, an RMSE of 1.1446E-05, and a maximum error of 6.49087E-05. DT and RF also performed well, with coefficients of determination of 0.99571 and 0.99655, respectively. These results highlight the effectiveness of tree-based methods in accurately predicting chemical concentrations, with Extra Tree (ET) regression emerging as the most promising model for this specific dataset.
{"title":"Model development using hybrid method for prediction of drug release from biomaterial matrix","authors":"Mohammed Alqarni , Shaimaa Mohammed Al Harthi , Mohammed Abdullah Alzubaidi , Ali Abdullah Alqarni , Bandar Saud Shukr , Hassan Talat Shawli","doi":"10.1016/j.chemolab.2024.105216","DOIUrl":"10.1016/j.chemolab.2024.105216","url":null,"abstract":"<div><p>A comprehensive multi-scale computational strategy was developed in this study based on mass transfer and machine learning for simulation of drug concentration distribution in a biomaterial matrix. The controlled release was modeled and validated via the hybrid model. Mass transfer equations along with kinetics models were solved numerically and the results were then used for machine learning models. We investigated the performance of three regression models, namely Decision Tree (DT), Random Forest (RF), and Extra Tree (ET) in predicting medicine concentration (C) based on r and z data. Hyper-parameter optimization is conducted using Glowworm Swarm Optimization (GSO). Results revealed high predictive accuracy across all models, with ET demonstrating superior performance, achieving a coefficient of determination value (R<sup>2</sup>) of 0.99854, an RMSE of 1.1446E-05, and a maximum error of 6.49087E-05. DT and RF also exhibit notable performance, with coefficients of determination equal to 0.99571 and 0.99655, respectively. 
These results highlight the effectiveness of ensemble tree-based methods in accurately predicting chemical concentrations, with Extra Tree (ET) Regression emerging as the most promising model for this specific dataset.</p></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"253 ","pages":"Article 105216"},"PeriodicalIF":3.7,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142077358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}