Pub Date : 2024-08-22 DOI: 10.1016/j.chemolab.2024.105216
Mohammed Alqarni , Shaimaa Mohammed Al Harthi , Mohammed Abdullah Alzubaidi , Ali Abdullah Alqarni , Bandar Saud Shukr , Hassan Talat Shawli
A comprehensive multi-scale computational strategy based on mass transfer and machine learning was developed in this study to simulate the drug concentration distribution in a biomaterial matrix. The controlled release was modeled and validated with the hybrid model. Mass transfer equations, together with kinetics models, were solved numerically, and the results were then used to train machine learning models. We investigated the performance of three regression models, namely Decision Tree (DT), Random Forest (RF), and Extra Tree (ET), in predicting drug concentration (C) from r and z data. Hyper-parameter optimization was conducted using Glowworm Swarm Optimization (GSO). The results revealed high predictive accuracy across all models, with ET performing best, achieving a coefficient of determination (R²) of 0.99854, an RMSE of 1.1446E-05, and a maximum error of 6.49087E-05. DT and RF also performed well, with coefficients of determination of 0.99571 and 0.99655, respectively. These results highlight the effectiveness of tree-based methods in accurately predicting chemical concentrations, with Extra Tree (ET) regression emerging as the most promising model for this specific dataset.
Title : Model development using hybrid method for prediction of drug release from biomaterial matrix
Journal : Chemometrics and Intelligent Laboratory Systems, Volume 253, Article 105216
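The three figures of merit quoted in this abstract (R², RMSE, and maximum error) can be computed from paired reference and predicted concentrations as in the minimal sketch below. The function name `regression_metrics` and the sample values are illustrative assumptions and are not taken from the paper.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (R^2, RMSE, maximum absolute error) for paired values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    mean_true = sum(y_true) / n
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in residuals)              # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot      # undefined when y_true is constant
    rmse = math.sqrt(ss_res / n)
    max_err = max(abs(e) for e in residuals)
    return r2, rmse, max_err

# Made-up concentrations, purely for illustration.
c_true = [1.0, 2.0, 3.0, 4.0]
c_pred = [1.1, 1.9, 3.2, 3.9]
r2, rmse, max_err = regression_metrics(c_true, c_pred)
print(f"R2={r2:.4f}  RMSE={rmse:.4f}  max error={max_err:.4f}")
```

Reporting the maximum error alongside RMSE is useful because RMSE averages over all points and can hide a single large local deviation.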
Pub Date : 2024-08-22 DOI: 10.1016/j.chemolab.2024.105205
Sungwon Park, Hongjoong Kim
Accurate baseline correction is a fundamental requirement for extracting meaningful spectral information and enabling precise quantitative analysis using Raman spectroscopy. Although numerous baseline correction techniques have been developed, they often require meticulous parameter adjustments and yield inconsistent results. To address these challenges, we have introduced a novel approach, namely constrained Gaussian radial basis function fitting (CGF). Our method involves solving a curve-fitting problem using Gaussian radial basis functions under specific constraints. To ensure stability and efficiency, we developed a linear programming algorithm for the proposed approach. We evaluated the performance of CGF using simulated Raman spectra and demonstrated its robustness across various scenarios, including changes in data length and noise levels. In contrast to standard methods, which frequently require complicated parameter adjustments and may exhibit varying errors, our approach provides a simple parameter search and consistently achieves low errors. We further assessed CGF using real Raman spectra, leading to enhanced accuracy in the quantitative analysis of the Raman spectra of chemical warfare agents. Our results emphasize the potential of CGF as a valuable tool for Raman spectroscopy data analysis, significantly advancing sophisticated analytical techniques.
Title : Robust baseline correction for Raman spectra by constrained Gaussian radial basis function fitting
Journal : Chemometrics and Intelligent Laboratory Systems, Volume 253, Article 105205
Pub Date : 2024-08-20 DOI: 10.1016/j.chemolab.2024.105200
Erik Andries , Ramin Nikzad-Langerodi
Spectroscopic measurements can show distorted spectral shapes arising from a mixture of absorbing and scattering contributions. These distortions (or baselines) often manifest themselves as non-constant offsets or low-frequency oscillations, and they can adversely affect analytical and quantitative results. Baseline correction is an umbrella term for pre-processing methods that estimate the baseline spectra (the unwanted distortions) and then remove them by differencing. However, current state-of-the-art baseline correction methods do not utilize analyte concentrations, even when these are available or contribute significantly to the observed spectral variability. We modify a class of state-of-the-art methods (penalized baseline correction) that readily admits the incorporation of a priori analyte concentrations, so that predictions can be enhanced. This modified approach is termed supervised and penalized baseline correction (SPBC). Performance is assessed on two near-infrared data sets for both classical penalized baseline correction methods (without analyte information) and modified penalized baseline correction methods (leveraging analyte information). In some cases, SPBC provides baseline-corrected signals that outperform state-of-the-art penalized baseline correction algorithms such as AIRPLS. In particular, we observe that performance depends on the correlation between the analyte used for baseline correction and the analyte used for prediction: the greater this correlation, the better the prediction performance.
Title : Supervised and penalized baseline correction
Journal : Chemometrics and Intelligent Laboratory Systems, Volume 253, Article 105200
Pub Date : 2024-08-17 DOI: 10.1016/j.chemolab.2024.105206
Saad M Alshahrani
This study illustrates the effective control of COVID-19 infection through the adsorption of safranal (SAF) on B16N16 and Al16N16 fullerene-like cages. The adsorption of SAF onto the B16N16 and Al16N16 surfaces in gas, water (H2O), and chloroform (CHCl3) environments was assessed using density functional theory (DFT) and time-dependent density functional theory (TD-DFT) methods, analyzing the substrates and their complexes. At the PBE0-D3 level, the Al16N16/SAF complex exhibited a more negative binding energy and greater structural stability in the water phase than the B16N16/SAF complex. The thermodynamic parameters indicated that the adsorption of SAF onto the fullerene-like cages is exothermic, particularly for the Al16N16/SAF complex. Additionally, the interaction of SAF with the fullerene-like cages is more pronounced in the water phase than in the gas and chloroform environments. The energy gap (Eg) of the complexes decreases in all three environments compared to the perfect systems, with a reduction of over 21 % in all phases. This substantial decrease in the energy gap suggests that the complexes have increased reactivity and sensitivity to SAF, likely due to a significant change in electronic conductivity. The molecular docking results indicate that the Al16N16/SAF complex in the water phase exhibited a stronger binding affinity than the other compounds studied. These findings suggest that the Al16N16/SAF complex holds promise as a potential inhibitor for COVID-19 and as a valuable material for biomedical applications and drug delivery systems.
Title : Novel investigation on adsorption analysis of safranal interacting with boron nitride and aluminum nitride fullerene-like cages: Drug delivery system
Journal : Chemometrics and Intelligent Laboratory Systems, Volume 254, Article 105206
Pub Date : 2024-08-14 DOI: 10.1016/j.chemolab.2024.105204
Mariano M. Perdomo , Luis A. Clementi , Jorge R. Vega
The first stage in the industrial production of Styrene-Butadiene Rubber (SBR) typically consists of obtaining a latex from a train of continuous stirred tank reactors. Accurate real-time estimation of some key process variables is of paramount importance to ensure the production of high-quality rubber. Monitoring the mass conversion of monomers in the last reactor of the train is particularly important. To this end, various soft sensors (SS) have been proposed; however, they have not addressed the complex dynamic relationships among the process variables. In this work, an SS based on recurrent neural networks (RNN) is developed to estimate the mass conversion in the last reactor of the train. The main challenge is to obtain an adequate estimate of the conversion both in the usual steady-state operation and during the frequent transient operating phases. Three RNN architectures, Elman, GRU (Gated Recurrent Unit), and LSTM (Long Short-Term Memory), are compared to critically evaluate their performance. Moreover, a comprehensive analysis is conducted to assess the ability of these models to represent different operational modes of the train. The results reveal that the GRU network exhibits the best performance for estimating the mass conversion of monomers. The performance of the proposed model is then compared with that of a previously developed SS, which was based on a linear estimation model with a Bayesian bias adaptation mechanism and the use of control charts for decision-making. The model proposed here proved to be more efficient for estimating the mass conversion of monomers, particularly during transient operating phases. Finally, to evaluate the methodology used for designing the SS, the same RNN architectures were trained to estimate online another quality variable: the mass fraction of styrene bound to the copolymer. The results obtained were also acceptable.
Title : Estimation of quality variables in a continuous train of reactors using recurrent neural networks-based soft sensors
Journal : Chemometrics and Intelligent Laboratory Systems, Volume 253, Article 105204
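For readers unfamiliar with the GRU architecture named in this abstract, the sketch below shows one common formulation of a single GRU update for a scalar input and a scalar hidden state. The parameter names, weights, and input sequence are illustrative assumptions; the paper's actual network sizes, inputs, and training procedure are not given in the abstract, and some references swap the roles of `z` and `1 - z`.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, p):
    """One update of a scalar GRU cell (one common convention)."""
    z = sigmoid(p["wz"] * x + p["uz"] * h + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h + p["br"])  # reset gate
    # Candidate state: the reset gate scales how much of the old state is used.
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h) + p["bh"])
    # Blend the old state and the candidate state with the update gate.
    return (1.0 - z) * h + z * h_cand

# Hypothetical, untrained weights and a short made-up input sequence.
params = {"wz": 0.5, "uz": -0.3, "bz": 0.0,
          "wr": 0.8, "ur": 0.2, "br": 0.1,
          "wh": 1.2, "uh": 0.7, "bh": -0.1}
inputs = [0.2, 0.5, 0.9, 0.9, 0.4]

h = 0.0
states = []
for x in inputs:
    h = gru_step(x, h, params)
    states.append(h)
print(states)
```

The gates let the cell decide at each time step how much past information to keep, which is one plausible reason a gated network copes better with transient phases than a static linear estimator.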
Pub Date : 2024-08-14 DOI: 10.1016/j.chemolab.2024.105202
Biyun Yang , Zhiling Yang , Yong Xu , Wei Cheng , Fenglin Zhong , Dapeng Ye , Haiyong Weng
Among the most frequently diagnosed diseases in citrus, citrus Huanglongbing (HLB) disease has caused severe economic losses to the citrus industry worldwide because there is no cure and it spreads quickly. As callose accumulation in phloem is one of the early response events to infection by the Asian species Candidatus Liberibacter asiaticus (CLas), the dynamic perception of the sieve plate region can be used as an indicator for the early diagnosis of citrus HLB disease. In this study, one-dimensional convolutional neural network (1D-CNN) models were established to achieve early detection of HLB disease based on spectral information from the sieve plate region acquired with a Fourier transform infrared microscopy (micro-FTIR) spectrometer. Partial least squares regression (PLSR) and least squares support vector machine regression (LS-SVR) models were used to predict callose from the micro-FTIR information in the sieve plate region of the citrus midrib. Furthermore, an improved data augmentation method that superimposes Gaussian noise was proposed to expand the spectral amplitude. The proposed method achieved 98.65 % classification accuracy, which was higher than that of traditional algorithms such as the logistic model tree (LMT), linear discriminant analysis (LDA), Bayes (BS), support vector machine (SVM), and k-nearest neighbors (kNN), and also higher than that of the molecular detection method qPCR (quantitative real-time polymerase chain reaction). Finally, the early detection model established with laboratory samples can also be used to detect citrus HLB in complex field samples through model updating, reaching an overall detection accuracy of 91.21 %. Our approach has potential for the early diagnosis of citrus HLB disease at the microscopic scale, which would provide useful and precise guidelines to prevent and control the disease.
Title : A 1D-CNN model for the early detection of citrus Huanglongbing disease in the sieve plate of phloem tissue using micro-FTIR
Journal : Chemometrics and Intelligent Laboratory Systems, Volume 252, Article 105202
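The abstract mentions data augmentation by superimposing Gaussian noise on spectra but does not describe the specific scheme. The sketch below is therefore only a generic version of the idea, in which the noise standard deviation is assumed to be a fixed fraction of the spectrum's amplitude range. The function name, the `rel_sigma` parameter, and the sample spectrum are hypothetical.

```python
import random

def augment_with_gaussian_noise(spectrum, n_copies, rel_sigma=0.01, seed=0):
    """Return n_copies noisy variants of one spectrum.

    rel_sigma is the noise standard deviation expressed as a fraction of
    the spectrum's amplitude range (an assumption of this sketch).
    """
    rng = random.Random(seed)  # seeded for reproducible augmentation
    amplitude = max(spectrum) - min(spectrum)
    sigma = rel_sigma * amplitude
    return [[v + rng.gauss(0.0, sigma) for v in spectrum]
            for _ in range(n_copies)]

# Made-up absorbance values standing in for one micro-FTIR spectrum.
spectrum = [0.10, 0.12, 0.35, 0.80, 0.42, 0.15, 0.11]
augmented = augment_with_gaussian_noise(spectrum, n_copies=3)
for row in augmented:
    print([round(v, 4) for v in row])
```

Scaling the noise to each spectrum's own range keeps the perturbation comparable across samples with different overall intensity; whether the paper does this is not stated.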
Pub Date : 2024-08-10 DOI: 10.1016/j.chemolab.2024.105201
Yaonan Guan , Shaoying He , Shuangshuang Ren , Shuren Liu , Dewei Li
In the era of chemical big data, the high complexity and strong interdependencies present in the datasets pose considerable challenges when constructing accurate parametric models. The Gaussian process model, owing to its non-parametric nature, demonstrates better adaptability when confronted with complex and interdependent data. However, the standard Gaussian process has two significant limitations. Firstly, the time complexity of inverting its kernel matrix during the inference process is $O(n^3)$. Secondly, all data share a common kernel function parameter, which mixes different data types and reduces the model accuracy in mixed-category data identification problems. In light of this, this paper proposes a mixture Gaussian process model that addresses these limitations. This model reduces time complexity and distinguishes data based on different data features. It incorporates a Gaussian mixture distribution for the inducing variables to approximate the original data distribution. Stochastic Variational Inference is utilized to reduce the computational time required for parameter inference. The inducing variables have distinct kernel function parameters for each data category, leading to improved analytical accuracy and reduced time complexity of the Gaussian process model. Numerical experiments are conducted to analyze and compare the performance of the proposed model on datasets of different sizes and in various data category cases.
Title : Mixture Gaussian process model with Gaussian mixture distribution for big data
Journal : Chemometrics and Intelligent Laboratory Systems, Volume 253, Article 105201
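As background for the cubic cost mentioned in this abstract, the standard Gaussian process regression equations are reproduced below. These are textbook results rather than formulas taken from the paper, and the symbols ($K$, $k_*$, $k_{**}$, $\sigma^2$, $m$, $b$) are the usual conventions, assumed here for illustration.

```latex
% GP regression with n training targets y, kernel matrix K (n x n),
% cross-covariance vector k_* for a test input, and noise variance \sigma^2:
\mu_* = k_*^{\top} \left( K + \sigma^2 I \right)^{-1} y ,
\qquad
\sigma_*^2 = k_{**} - k_*^{\top} \left( K + \sigma^2 I \right)^{-1} k_* .

% Factorising the n x n matrix K + \sigma^2 I costs O(n^3) time and O(n^2) memory.
% With m \ll n inducing variables, sparse variational approximations typically
% reduce the cost to O(n m^2) per pass over the data, or to O(b m^2 + m^3)
% per step when stochastic minibatches of size b are used.
```

This is why inducing variables combined with stochastic variational inference are a common route to scaling Gaussian processes to large datasets.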
Pub Date : 2024-08-10 DOI: 10.1016/j.chemolab.2024.105198
Kuanhsuan Chiu , Junghui Chen , Zhengjiang Zhang
In chemical plants, data-driven process monitoring serves as a vital tool to ensure product quality and maintain production line safety. However, the accuracy of monitoring hinges directly upon the quality of process data. Given the inherently slow and complex nature of chemical processes, and the potential for gross errors in process data to cause inaccurate model predictions, this paper proposes a method called Conditional Dynamic Variational Autoencoder combined with a Particle Filter (CDVAE-PF) for data reconciliation and subsequent process monitoring. CDVAE-PF leverages the capabilities of the Conditional Dynamic Variational Autoencoder (CDVAE) to effectively model chemical process data in the presence of noise. This probabilistic model serves as the foundation for the Particle Filter (PF), which is employed for data reconciliation. Moreover, CDVAE-PF incorporates mechanisms to detect and rectify gross errors in process data, further enhancing its efficacy in data reconciliation. Subsequently, monitoring indices based on CDVAE are established to facilitate process monitoring. In numerical simulations of a two-to-one variables Continuous Stirred Tank Reactor (CSTR) example and a fifteen-to-one variables dichloroethane distillation process from an actual chemical plant, CDVAE-PF demonstrates its effectiveness by reducing mean absolute error to 7.8 % and 12.8 %, respectively, in gross error data reconciliation. Moreover, in terms of monitoring performance, CDVAE-PF successfully mitigates misjudgments caused by gross errors, thereby significantly enhancing the reliability of process monitoring in chemical plants.
Title: Online nonlinear data reconciliation to enhance nonlinear dynamic process monitoring using conditional dynamic variational autoencoder networks with particle filters. Journal: Chemometrics and Intelligent Laboratory Systems, Volume 253, Article 105198.
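The particle-filter reconciliation step described in this abstract can be illustrated with a minimal sketch. It is not the authors' CDVAE-PF: the learned CDVAE model is replaced by a hand-written scalar process (`0.9 * x + 0.1`), and the gross-error gate (`gross_gate`) is an assumed stand-in for the paper's detection and rectification mechanism. All function and parameter names are hypothetical.

```python
import math
import random


def reconcile(measurements, n_particles=500, process_std=0.05,
              meas_std=0.2, gross_gate=4.0, seed=0):
    """Bootstrap particle filter on a scalar first-order process.

    Assumed stand-in model: x_t = 0.9 * x_{t-1} + 0.1 + noise, y_t = x_t + noise.
    The paper learns its probabilistic model with a CDVAE instead.
    """
    rng = random.Random(seed)
    particles = [rng.gauss(measurements[0], meas_std) for _ in range(n_particles)]
    estimates = []
    for y in measurements:
        # Propagate every particle through the assumed process model.
        particles = [0.9 * p + 0.1 + rng.gauss(0.0, process_std) for p in particles]
        predicted = sum(particles) / n_particles
        # Weight particles by the Gaussian measurement likelihood.
        weights = [math.exp(-0.5 * ((y - p) / meas_std) ** 2) for p in particles]
        total = sum(weights)
        # Assumed gross-error gate: if the measurement is implausibly far from
        # the prediction, skip the update and keep the model prediction.
        if abs(y - predicted) > gross_gate * meas_std or total == 0.0:
            estimates.append(predicted)
            continue
        estimates.append(sum(w * p for w, p in zip(weights, particles)) / total)
        # Multinomial resampling to limit weight degeneracy.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates
```

Calling `reconcile` on a noisy series that contains one large outlier returns one reconciled value per measurement; at the outlier the sketch falls back to the model prediction rather than following the corrupted reading.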
Pub Date : 2024-08-08 DOI: 10.1016/j.chemolab.2024.105197
Vijay H. Masand , Sami Al-Hussain , Abdullah Y. Alzahrani , Aamal A. Al-Mutairi , Arwa sultan Alqahtani , Abdul Samad , Gaurav S. Masand , Magdi E.A. Zaki
This work combines extreme gradient boosting with Shapley values, a thriving combination in the field of explainable artificial intelligence, and a genetic algorithm to analyze the thrombin inhibitory activity of a diverse pool of 2803 molecules. The methodology uses the genetic algorithm for feature selection, followed by extreme gradient boosting analysis. The eight-parameter genetic algorithm - extreme gradient boosting model has high statistical acceptance, with R²_tr = 0.895, R²_L10%O = 0.900, and Q²_F3 = 0.873. Shapley additive explanations, which assign an importance value to each variable in a model, served as the foundation for the interpretation. A ceteris paribus approach, which compares counterfactual examples, was then used to understand the influence of a structural feature on the activity profile. The analysis indicates that aromatic carbon and ring/non-ring nitrogen, in combination with other structural features, govern the inhibitory profile. The simplicity and predictions of the genetic algorithm - extreme gradient boosting model suggest that “Explainable AI” will be useful for identifying and exploiting structural features in drug discovery.
Title: GA-XGBoost, an explainable AI technique, for analysis of thrombin inhibitory activity of diverse pool of molecules and supported by X-ray. Journal: Chemometrics and Intelligent Laboratory Systems, Volume 253, Article 105197.
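To make the idea of Shapley values as per-variable importance concrete, here is a minimal sketch that computes exact Shapley values by enumerating feature coalitions. It does not use XGBoost or the SHAP library, and it is not the authors' model: `toy_activity`, its coefficients, and the feature labels are hypothetical, and "removing" a feature is assumed to mean replacing it with a baseline value.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features outside a coalition take their baseline value
    (an assumed, simple way to "remove" a feature).
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i to this coalition.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_activity(features):
    """Hypothetical activity model with one interaction term."""
    aromatic_c, ring_n, other = features
    return 2.0 * aromatic_c + 1.0 * ring_n + 0.5 * aromatic_c * ring_n + 0.0 * other


phi = shapley_values(toy_activity, [1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
```

In this toy case the interaction term is split equally between the two interacting features, the unused feature receives zero, and the values sum to the difference between the prediction and the baseline prediction. Exhaustive enumeration scales exponentially with the number of features, which is why practical tools rely on approximations or tree-specific algorithms.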
Pub Date : 2024-08-06 DOI: 10.1016/j.chemolab.2024.105193
Cong Guo, Wei Yang, Zheng Li, Chun Liu
Feature selection on incomplete datasets is a challenging task. Existing methods first complete the dataset with an imputation method and then perform feature selection on the imputed data. Because missing value imputation and feature selection are entirely independent in this setup, feature importance cannot be taken into account during imputation. In real-world datasets, however, features differ in importance. To address this, we propose a novel feature selection framework for incomplete data that considers feature importance. The framework consists mainly of two alternating iterative stages: the M-stage and the W-stage. In the M-stage, missing values are imputed based on a given feature importance vector and multiple initial imputation results. In the W-stage, an improved ReliefF algorithm learns the feature importance vector from the imputed data. The feature importance output by the W-stage in the current iteration is used as the input of the M-stage in the next iteration. Experimental results on artificial and real missing datasets demonstrate that the proposed method significantly outperforms other approaches.
Title: A novel feature selection framework for incomplete data. Journal: Chemometrics and Intelligent Laboratory Systems, Volume 252, Article 105193.
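The alternating M-stage / W-stage loop described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: the W-stage uses plain Relief for two classes instead of the paper's improved ReliefF, the M-stage uses importance-weighted nearest-neighbour imputation, and a single column-mean initialization replaces the paper's multiple initial imputation results. All function names are hypothetical, and each class needs at least two rows.

```python
def relief_weights(X, y):
    """W-stage stand-in: basic Relief (not the paper's improved ReliefF)."""
    n, d = len(X), len(X[0])
    span = [(max(r[j] for r in X) - min(r[j] for r in X)) or 1.0 for j in range(d)]

    def diff(a, b, j):
        return abs(X[a][j] - X[b][j]) / span[j]

    def dist(a, b):
        return sum(diff(a, b, j) for j in range(d))

    w = [0.0] * d
    for i in range(n):
        hit = min((r for r in range(n) if r != i and y[r] == y[i]),
                  key=lambda r: dist(i, r))
        miss = min((r for r in range(n) if y[r] != y[i]),
                   key=lambda r: dist(i, r))
        for j in range(d):
            # Reward features that differ on the nearest miss, not on the hit.
            w[j] += (diff(i, miss, j) - diff(i, hit, j)) / n
    return [max(v, 0.0) for v in w]


def impute(X_obs, X_cur, weights, k):
    """M-stage stand-in: importance-weighted k-nearest-neighbour imputation."""
    n, d = len(X_obs), len(X_obs[0])
    X_new = [row[:] for row in X_cur]
    for i in range(n):
        observed = [j for j in range(d) if X_obs[i][j] is not None]
        missing = [j for j in range(d) if X_obs[i][j] is None]
        if not missing:
            continue
        # Distance uses only the features actually observed in row i.
        nearest = sorted(
            (r for r in range(n) if r != i),
            key=lambda r: sum(weights[j] * (X_cur[i][j] - X_cur[r][j]) ** 2
                              for j in observed),
        )[:k]
        for j in missing:
            X_new[i][j] = sum(X_cur[r][j] for r in nearest) / len(nearest)
    return X_new


def alternate(X_obs, y, n_iter=5, k=3):
    """Alternate imputation (M) and importance learning (W); None marks a gap."""
    d = len(X_obs[0])
    means = []
    for j in range(d):
        seen = [row[j] for row in X_obs if row[j] is not None]
        means.append(sum(seen) / len(seen))
    # Assumed initialization: a single column-mean imputation.
    X_cur = [[means[j] if v is None else v for j, v in enumerate(row)]
             for row in X_obs]
    weights = [1.0] * d
    for _ in range(n_iter):
        X_cur = impute(X_obs, X_cur, weights, k)   # M-stage
        weights = relief_weights(X_cur, y)         # W-stage feeds the next M-stage
    return weights, X_cur
```

The point of the design is the feedback loop: once an uninformative feature receives a low weight, it stops influencing which neighbours are used to fill the gaps in the next pass.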