Robust baseline correction for Raman spectra by constrained Gaussian radial basis function fitting
Sungwon Park, Hongjoong Kim
Pub Date: 2024-08-22 | DOI: 10.1016/j.chemolab.2024.105205 | Chemometrics and Intelligent Laboratory Systems, Volume 253, Article 105205
Accurate baseline correction is a fundamental requirement for extracting meaningful spectral information and enabling precise quantitative analysis using Raman spectroscopy. Although numerous baseline correction techniques have been developed, they often require meticulous parameter adjustments and yield inconsistent results. To address these challenges, we have introduced a novel approach, namely constrained Gaussian radial basis function fitting (CGF). Our method involves solving a curve-fitting problem using Gaussian radial basis functions under specific constraints. To ensure stability and efficiency, we developed a linear programming algorithm for the proposed approach. We evaluated the performance of CGF using simulated Raman spectra and demonstrated its robustness across various scenarios, including changes in data length and noise levels. In contrast to standard methods, which frequently require complicated parameter adjustments and may exhibit varying errors, our approach provides a simple parameter search and consistently achieves low errors. We further assessed CGF using real Raman spectra, leading to enhanced accuracy in the quantitative analysis of the Raman spectra of chemical warfare agents. Our results emphasize the potential of CGF as a valuable tool for Raman spectroscopy data analysis, significantly advancing sophisticated analytical techniques.
{"title":"Robust baseline correction for Raman spectra by constrained Gaussian radial basis function fitting","authors":"Sungwon Park, Hongjoong Kim","doi":"10.1016/j.chemolab.2024.105205","DOIUrl":"10.1016/j.chemolab.2024.105205","url":null,"abstract":"<div><p>Accurate baseline correction is a fundamental requirement for extracting meaningful spectral information and enabling precise quantitative analysis using Raman spectroscopy. Although numerous baseline correction techniques have been developed, they often require meticulous parameter adjustments and yield inconsistent results. To address these challenges, we have introduced a novel approach, namely constrained Gaussian radial basis function fitting (CGF). Our method involves solving a curve-fitting problem using Gaussian radial basis functions under specific constraints. To ensure stability and efficiency, we developed a linear programming algorithm for the proposed approach. We evaluated the performance of CGF using simulated Raman spectra and demonstrated its robustness across various scenarios, including changes in data length and noise levels. In contrast to standard methods, which frequently require complicated parameter adjustments and may exhibit varying errors, our approach provides a simple parameter search and consistently achieves low errors. We further assessed CGF using real Raman spectra, leading to enhanced accuracy in the quantitative analysis of the Raman spectra of chemical warfare agents. Our results emphasize the potential of CGF as a valuable tool for Raman spectroscopy data analysis, significantly advancing sophisticated analytical techniques.</p></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"253 ","pages":"Article 105205"},"PeriodicalIF":3.7,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142049079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Supervised and penalized baseline correction
Erik Andries, Ramin Nikzad-Langerodi
Pub Date: 2024-08-20 | DOI: 10.1016/j.chemolab.2024.105200 | Chemometrics and Intelligent Laboratory Systems, Volume 253, Article 105200
Spectroscopic measurements can show distorted spectral shapes arising from a mixture of absorbing and scattering contributions. These distortions (or baselines) often manifest themselves as non-constant offsets or low-frequency oscillations. As a result, these baselines can adversely affect analytical and quantitative results. Baseline correction is an umbrella term for pre-processing methods that estimate the baseline spectra (the unwanted distortions) and then remove them by differencing. However, current state-of-the-art baseline correction methods do not use analyte concentrations even when they are available, or even when they contribute significantly to the observed spectral variability. We modify a class of state-of-the-art methods (penalized baseline correction) so that a priori analyte concentrations can easily be incorporated to enhance predictions. This modified approach is termed supervised and penalized baseline correction (SPBC). Performance is assessed on two near-infrared data sets for both classical penalized baseline correction methods (without analyte information) and the modified penalized baseline correction methods (leveraging analyte information). There are cases in which SPBC provides useful baseline-corrected signals that outperform state-of-the-art penalized baseline correction algorithms such as AIRPLS. In particular, we observe that performance depends on the correlation between the analyte used for baseline correction and the analyte used for prediction: the greater this correlation, the better the prediction performance.
{"title":"Supervised and penalized baseline correction","authors":"Erik Andries , Ramin Nikzad-Langerodi","doi":"10.1016/j.chemolab.2024.105200","DOIUrl":"10.1016/j.chemolab.2024.105200","url":null,"abstract":"<div><p>Spectroscopic measurements can show distorted spectral shapes arising from a mixture of absorbing and scattering contributions. These distortions (or baselines) often manifest themselves as non-constant offsets or low-frequency oscillations. As a result, these baselines can adversely affect analytical and quantitative results. Baseline correction is an umbrella term where one applies pre-processing methods to obtain baseline spectra (the unwanted distortions) and then remove the distortions by differencing. However, current state-of-the art baseline correction methods do not utilize analyte concentrations even if they are available, or even if they contribute significantly to the observed spectral variability. We modify a class of state-of-the-art methods (<em>penalized baseline correction</em>) that easily admit the incorporation of a priori analyte concentrations such that predictions can be enhanced. This modified approach will be deemed <em>supervised and penalized baseline correction</em> (SPBC). Performance will be assessed on two near infrared data sets across both classical penalized baseline correction methods (without analyte information) and modified penalized baseline correction methods (leveraging analyte information). There are cases of SPBC that provide useful baseline-corrected signals such that they outperform state-of-the-art penalized baseline correction algorithms such as AIRPLS. In particular, we observe that performance is conditional on the correlation between separate analytes: the analyte used for baseline correlation and the analyte used for prediction—the greater the correlation between the analyte used for baseline correlation and the analyte used for prediction, the better the prediction performance.</p></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"253 ","pages":"Article 105200"},"PeriodicalIF":3.7,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142087043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Novel investigation on adsorption analysis of safranal interacting with boron nitride and aluminum nitride fullerene-like cages: Drug delivery system
Saad M Alshahrani
Pub Date: 2024-08-17 | DOI: 10.1016/j.chemolab.2024.105206 | Chemometrics and Intelligent Laboratory Systems, Volume 254, Article 105206
This study illustrates the effective control of COVID-19 infection through the adsorption of safranal (SAF) on B16N16 and Al16N16 fullerene-like cages. The adsorption of SAF onto the B16N16 and Al16N16 surfaces in gas, water (H2O), and chloroform (CHCl3) environments was assessed using density functional theory (DFT) and time-dependent (TD) DFT methods, analyzing the substrates and their complexes. The Al16N16/SAF complex exhibited the most negative binding energy and the highest structural stability in the water phase compared to the B16N16/SAF complex at the PBE0-D3 level. The thermodynamic parameters indicated that the adsorption of SAF onto the fullerene-like cages is exothermic, particularly for the Al16N16/SAF complex. Additionally, the interaction of SAF with the fullerene-like cages in the water phase is more pronounced than in the gas and chloroform environments. The energy gap (Eg) of the complexes decreases in all three environments compared to the pristine systems, with a significant reduction of over 21 % in all phases. This substantial decrease in the energy gap suggests that the complexes have increased reactivity and sensitivity to SAF, likely due to a significant change in electronic conductivity. The molecular docking results indicate that the Al16N16/SAF complex in the water phase exhibited a strong binding affinity compared to the other compounds studied. These findings suggest that the Al16N16/SAF complex holds promise as a potential inhibitor for COVID-19 and as a valuable material for biomedical applications and drug delivery systems.
{"title":"Novel investigation on adsorption analysis of safranal interacting with boron nitride and aluminum nitride fullerene-like cages: Drug delivery system","authors":"Saad M Alshahrani","doi":"10.1016/j.chemolab.2024.105206","DOIUrl":"10.1016/j.chemolab.2024.105206","url":null,"abstract":"<div><p>This study illustrates the effective control of COVID-19 infection through the adsorption of safranal (SAF) on B<sub>16</sub>N<sub>16</sub> and Al<sub>16</sub>N<sub>16</sub> fullerene-like cages. The SAF adsorption onto the B<sub>16</sub>N<sub>16</sub> and Al<sub>16</sub>N<sub>16</sub> surfaces in gas, water (H<sub>2</sub>O), and chloroform (CHCl<sub>3</sub>) environments were assessed using density functional theory (DFT) and time-dependent (TD) density functional theory methods, analyzing the substrates and their complexes. The Al<sub>16</sub>N<sub>16</sub>/SAF complex exhibited the most negative binding energy and structural stability in the water phase compared to the B<sub>16</sub>N<sub>16</sub>/SAF complex at the PBE0-D3 level. The thermodynamic parameters indicated that the adsorption of SAF onto the fullerene-like cages is exothermic, particularly for the Al<sub>16</sub>N<sub>16</sub>/SAF complex. Additionally, the interaction of SAF with the fullerene-like cages in the water phase is more pronounced than in gas and chloroform environments. The complexes' energy gap (Eg) decreases in all three environments compared to the perfect systems, with a significant reduction of over 21 % in all phases. This substantial decrease in the energy gap suggests that the complexes have increased reactivity and sensitivity to SAF, likely due to a significant change in electronic conductivity. The results of molecular docking indicate that the Al<sub>16</sub>N<sub>16</sub>/SAF complex in the water phase exhibited a strong binding affinity compared to the other compounds studied. These findings suggest that the Al<sub>16</sub>N<sub>16</sub>/SAF complex holds promise as a potential inhibitor for COVID-19 and as a valuable material for biomedical applications and drug delivery systems.</p></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"254 ","pages":"Article 105206"},"PeriodicalIF":3.7,"publicationDate":"2024-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142151153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Estimation of quality variables in a continuous train of reactors using recurrent neural networks-based soft sensors
Mariano M. Perdomo, Luis A. Clementi, Jorge R. Vega
Pub Date: 2024-08-14 | DOI: 10.1016/j.chemolab.2024.105204 | Chemometrics and Intelligent Laboratory Systems, Volume 253, Article 105204
The first stage in the industrial production of Styrene-Butadiene Rubber (SBR) typically consists of obtaining a latex from a train of continuous stirred tank reactors. Accurate real-time estimation of some key process variables is of paramount importance to ensure the production of high-quality rubber. Monitoring the mass conversion of monomers in the last reactor of the train is particularly important. To this end, various soft sensors (SS) have been proposed; however, they have not addressed the complex dynamic relationships existing among the process variables. In this work, an SS based on recurrent neural networks (RNN) is developed to estimate the mass conversion in the last reactor of the train. The main challenge is to obtain an adequate estimate of the conversion both during the usual steady-state operation and during the frequent transient operating phases. Three RNN architectures, Elman, GRU (Gated Recurrent Unit), and LSTM (Long Short-Term Memory), are compared to critically evaluate their performance. Moreover, a comprehensive analysis is conducted to assess the ability of these models to represent different operational modes of the train. The results reveal that the GRU network exhibits the best performance for estimating the mass conversion of monomers. The performance of the proposed model is then compared with a previously developed SS, which was based on a linear estimation model with a Bayesian bias adaptation mechanism and the use of control charts for decision-making. The model proposed here proved to be more efficient for estimating the mass conversion of monomers, particularly during transient operating phases. Finally, to evaluate the methodology used for designing the SS, the same RNN architectures were trained to estimate another quality variable online: the mass fraction of styrene bound to the copolymer. The results obtained were also acceptable.
{"title":"Estimation of quality variables in a continuous train of reactors using recurrent neural networks-based soft sensors","authors":"Mariano M. Perdomo , Luis A. Clementi , Jorge R. Vega","doi":"10.1016/j.chemolab.2024.105204","DOIUrl":"10.1016/j.chemolab.2024.105204","url":null,"abstract":"<div><p>The first stage in the industrial production of Styrene-Butadiene Rubber (SBR) typically consists in obtaining a latex from a train of continuous stirred tank reactors. Accurate real-time estimation of some key process variables is of paramount importance to ensure the production of high-quality rubber. Monitoring the mass conversion of monomers in the last reactor of the train is particularly important. To this effect, various soft sensors (SS) have been proposed, however they have not addressed the underlying complex dynamic relationships existing among the process variables. In this work, a SS based on recurrent neural networks (RNN) is developed to estimate the mass conversion in the last reactor of the train. The main challenge is to obtain an adequate estimate of the conversion both in its usual steady-state operation and during its frequent transient operating phases. Three architectures of RNN: Elman, GRU (Gated Recurrent Unit), and LSTM (Long Short-Term Memory) are compared to critically evaluate their performances. Moreover, a comprehensive analysis is conducted to assess the ability of these models to represent different operational modes of the train. The results reveal that the GRU network exhibits the best performance for estimating the mass conversion of monomers. Then, the performance of the proposed model is compared with a previously-developed SS, which was based on a linear estimation model with a Bayesian bias adaptation mechanism and the use of Control Charts for decision-making. The model proposed here proved to be more efficient for estimating the mass conversion of monomers, particularly during transient operating phases. Finally, to evaluate the methodology utilized for designing the SS, the same RNN architectures were trained to online estimate another quality variable: the mass fraction of Styrene bound to the copolymer. The obtained results were also acceptable.</p></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"253 ","pages":"Article 105204"},"PeriodicalIF":3.7,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142039656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

A 1D-CNN model for the early detection of citrus Huanglongbing disease in the sieve plate of phloem tissue using micro-FTIR
Biyun Yang, Zhiling Yang, Yong Xu, Wei Cheng, Fenglin Zhong, Dapeng Ye, Haiyong Weng
Pub Date: 2024-08-14 | DOI: 10.1016/j.chemolab.2024.105202 | Chemometrics and Intelligent Laboratory Systems, Volume 252, Article 105202
Among the most frequently diagnosed diseases in citrus, Huanglongbing (HLB) has caused severe economic losses to the citrus industry worldwide because there is no cure and the disease spreads quickly. As callose accumulation in the phloem is one of the early responses to infection by the Asian species Candidatus Liberibacter asiaticus (CLas), the dynamic perception of the sieve plate region can be used as an indicator for the early diagnosis of citrus HLB disease. In this study, one-dimensional convolutional neural network (1D-CNN) models were established to achieve early detection of HLB disease based on spectral information from the sieve plate region acquired with a Fourier transform infrared microscopy (micro-FTIR) spectrometer. Partial least squares regression (PLSR) and least squares support vector machine regression (LS-SVR) models are used for the prediction of callose based on the micro-FTIR information in the sieve plate region of the citrus midrib. Furthermore, an improved data augmentation method that superimposes Gaussian noise was proposed to expand the spectral data set. The proposed method achieved 98.65 % classification accuracy, higher than that of other traditional algorithms such as the logistic model tree (LMT), linear discriminant analysis (LDA), Bayes (BS), support vector machine (SVM) and k-nearest neighbors (kNN), and also higher than that of the qPCR (quantitative real-time polymerase chain reaction) molecular detection method. Finally, the early detection model established with laboratory samples can also be used to detect citrus HLB in complex field samples by means of model updating, and the overall detection accuracy of the model reached 91.21 %. Our approach has potential for the early diagnosis of citrus HLB disease at the microscopic scale, which would provide useful and precise guidelines to prevent and control the disease.
{"title":"A 1D-CNN model for the early detection of citrus Huanglongbing disease in the sieve plate of phloem tissue using micro-FTIR","authors":"Biyun Yang , Zhiling Yang , Yong Xu , Wei Cheng , Fenglin Zhong , Dapeng Ye , Haiyong Weng","doi":"10.1016/j.chemolab.2024.105202","DOIUrl":"10.1016/j.chemolab.2024.105202","url":null,"abstract":"<div><p>Among the most frequently diagnosed diseases in citrus, citrus Huanglongbing disease has caused severe economic losses to the citrus industry worldwide since there is no curable method and it spreads quickly. As callose accumulation in phloem is one of the early response events to Asian species <em>Candidatus</em> Liberibacter asiaticus (<em>C</em>Las) infection, the dynamic perception of the sieve plate region can be used as an indicator for the early diagnosis of citrus HLB disease. In this study, one-dimensional convolutional neural network (1D-CNN) models were established to achieve early detection of HLB disease based on spectral information in the sieve plate region using Fourier transform infrared microscopy (micro-FTIR) spectrometer. Partial least squares regression (PLSR) and the least squares support vector machine regression (LS-SVR) models are used for the prediction of callose based on the micro-FTIR information in the sieve plate region of the citrus midrib. Furthermore, an improved data augmentation method by superimposing Gaussian noise was proposed to expand the spectral amplitude. The proposed method has achieved 98.65 % classification accuracy, which was higher than that of other traditional algorithms such as the logistic model tree (LMT), linear discriminant analysis (LDA), Bayes (BS), support vector machine (SVM) and k-nearest neighbors (kNN), and also than that of the molecular detection qPCR (Quantitative real-time polymerase chain reaction) method. Finally, based on the established early detection model with laboratory samples, it can also be used to detect the citrus HLB in complex field samples by using model updating methods, and the overall detection accuracy of the model reached 91.21 %. Our approach has potential for the early diagnosis of citrus HLB disease from the microscopic scale, which would provide useful and precise guidelines to prevent and control citrus HLB disease.</p></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"252 ","pages":"Article 105202"},"PeriodicalIF":3.7,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141992929","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Mixture Gaussian process model with Gaussian mixture distribution for big data
Yaonan Guan, Shaoying He, Shuangshuang Ren, Shuren Liu, Dewei Li
Pub Date: 2024-08-10 | DOI: 10.1016/j.chemolab.2024.105201 | Chemometrics and Intelligent Laboratory Systems, Volume 253, Article 105201
In the era of chemical big data, the high complexity and strong interdependencies present in the datasets pose considerable challenges when constructing accurate parametric models. The Gaussian process model, owing to its non-parametric nature, demonstrates better adaptability when confronted with complex and interdependent data. However, the standard Gaussian process has two significant limitations. Firstly, the time complexity of inverting its kernel matrix during the inference process is O(n^3). Secondly, all data share a common kernel function parameter, which mixes different data types and reduces the model accuracy in mixing-category data identification problems. In light of this, this paper proposes a mixture Gaussian process model that addresses these limitations. This model reduces time complexity and distinguishes data based on different data features. It incorporates a Gaussian mixture distribution for the inducing variables to approximate the original data distribution. Stochastic Variational Inference is utilized to reduce the computational time required for parameter inference. The inducing variables have distinct parameters for the kernel function based on the data category, leading to improved analytical accuracy and reduced time complexity of the Gaussian process model. Numerical experiments are conducted to analyze and compare the performance of the proposed model on different-sized datasets and various data category cases.
{"title":"Mixture Gaussian process model with Gaussian mixture distribution for big data","authors":"Yaonan Guan , Shaoying He , Shuangshuang Ren , Shuren Liu , Dewei Li","doi":"10.1016/j.chemolab.2024.105201","DOIUrl":"10.1016/j.chemolab.2024.105201","url":null,"abstract":"<div><p>In the era of chemical big data, the high complexity and strong interdependencies present in the datasets pose considerable challenges when constructing accurate parametric models. The Gaussian process model, owing to its non-parametric nature, demonstrates better adaptability when confronted with complex and interdependent data. However, the standard Gaussian process has two significant limitations. Firstly, the time complexity of inverting its kernel matrix during the inference process is <span><math><mrow><mi>O</mi><msup><mrow><mrow><mo>(</mo><mi>n</mi><mo>)</mo></mrow></mrow><mrow><mn>3</mn></mrow></msup></mrow></math></span>. Secondly, all data share a common kernel function parameter, which mixes different data types and reduces the model accuracy in mixing-category data identification problems. In light of this, this paper proposes a mixture Gaussian process model that addresses these limitations. This model reduces time complexity and distinguishes data based on different data features. It incorporates a Gaussian mixture distribution for the inducing variables to approximate the original data distribution. Stochastic Variational Inference is utilized to reduce the computational time required for parameter inference. The inducing variables have distinct parameters for the kernel function based on the data category, leading to improved analytical accuracy and reduced time complexity of the Gaussian process model. Numerical experiments are conducted to analyze and compare the performance of the proposed model on different-sized datasets and various data category cases.</p></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"253 ","pages":"Article 105201"},"PeriodicalIF":3.7,"publicationDate":"2024-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142002255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Online nonlinear data reconciliation to enhance nonlinear dynamic process monitoring using conditional dynamic variational autoencoder networks with particle filters
Kuanhsuan Chiu, Junghui Chen, Zhengjiang Zhang
Pub Date: 2024-08-10 | DOI: 10.1016/j.chemolab.2024.105198 | Chemometrics and Intelligent Laboratory Systems, Volume 253, Article 105198
In chemical plants, data-driven process monitoring serves as a vital tool to ensure product quality and maintain production line safety. However, the accuracy of monitoring hinges directly upon the quality of the process data. Given the inherently slow and complex nature of chemical processes, and the potential for gross errors in process data to produce inaccurate model predictions, this paper proposes a method called Conditional Dynamic Variational Autoencoder combined with a Particle Filter (CDVAE-PF) for data reconciliation and subsequent process monitoring. CDVAE-PF leverages the capabilities of the Conditional Dynamic Variational Autoencoder (CDVAE) to effectively model chemical process data in the presence of noise. This probabilistic model serves as the foundation for the Particle Filter (PF), which is employed for data reconciliation. Moreover, CDVAE-PF incorporates mechanisms to detect and rectify gross errors in process data, further enhancing its efficacy in data reconciliation. Monitoring indices based on the CDVAE are then established to facilitate process monitoring. Through numerical simulations of a two-to-one-variable Continuous Stirred Tank Reactor (CSTR) example and a fifteen-to-one-variable dichloroethane distillation process from an actual chemical plant, CDVAE-PF demonstrates its effectiveness by reducing the mean absolute error of gross-error data reconciliation to 7.8 % and 12.8 %, respectively. Moreover, in terms of monitoring performance, CDVAE-PF successfully mitigates misjudgments caused by gross errors, thereby significantly enhancing the reliability of process monitoring in chemical plants.
{"title":"Online nonlinear data reconciliation to enhance nonlinear dynamic process monitoring using conditional dynamic variational autoencoder networks with particle filters","authors":"Kuanhsuan Chiu , Junghui Chen , Zhengjiang Zhang","doi":"10.1016/j.chemolab.2024.105198","DOIUrl":"10.1016/j.chemolab.2024.105198","url":null,"abstract":"<div><p>In the chemical plants, data-driven process monitoring serves as a vital tool to ensure product quality and maintain production line safety. However, the accuracy of monitoring hinges directly upon the quality of process data. Given the inherently slow and complex nature of chemical processes, coupled with the potential for gross errors in process data leading to inaccuracies in model predictions, this paper proposes a method called Conditional Dynamic Variational Autoencoder combined with a Particle Filter (CDVAE-PF) for data reconciliation and subsequent process monitoring. CDVAE-PF leverages the capabilities of Conditional Dynamic Variational Autoencoder (CDVAE) to effectively model chemical process data in the presence of noise. This probabilistic model serves as the foundation for the Particle Filter (PF), which is employed for data reconciliation. Moreover, CDVAE-PF incorporates mechanisms to detect and rectify gross errors in process data, further enhancing its efficacy in data reconciliation. Subsequently, monitoring indices based on CDVAE are established to facilitate process monitoring. Through numerical simulations of a two-to-one variables Continuous Stirred Tank Reactor (CSTR) example and a fifteen-to-one variables dichloroethane distillation process from an actual chemical plant, CDVAE-PF demonstrates its effectiveness by reducing mean absolute error to 7.8 % and 12.8 % respectively in gross error data reconciliation. Moreover, in terms of monitoring performance, CDVAE-PF successfully mitigates misjudgments caused by gross errors, thereby significantly enhancing the reliability of process monitoring in chemical plants.</p></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"253 ","pages":"Article 105198"},"PeriodicalIF":3.7,"publicationDate":"2024-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142039660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

GA-XGBoost, an explainable AI technique, for analysis of thrombin inhibitory activity of diverse pool of molecules and supported by X-ray
Vijay H. Masand, Sami Al-Hussain, Abdullah Y. Alzahrani, Aamal A. Al-Mutairi, Arwa sultan Alqahtani, Abdul Samad, Gaurav S. Masand, Magdi E.A. Zaki
Pub Date: 2024-08-08 | DOI: 10.1016/j.chemolab.2024.105197 | Chemometrics and Intelligent Laboratory Systems, Volume 253, Article 105197
The present work applies extreme gradient boosting in combination with Shapley values, a thriving amalgamation within the field of explainable artificial intelligence, together with a genetic algorithm, to analyze the thrombin inhibitory activity of a diverse pool of 2803 molecules. The methodology uses a genetic algorithm for feature selection, followed by extreme gradient boosting analysis. The eight-parameter genetic algorithm-extreme gradient boosting model has high statistical acceptance, with R2tr = 0.895, R2L10%O = 0.900, and Q2F3 = 0.873. Shapley additive explanations, which assign each variable in a model an importance value, served as the foundation for the interpretation. A ceteris paribus approach involving the comparison of counterfactual examples was then used to understand the influence of a structural feature on the activity profile. The analysis indicates that aromatic carbon and ring/non-ring nitrogen, in combination with other structural features, govern the inhibitory profile. The simplicity and predictive ability of the genetic algorithm-extreme gradient boosting model suggest that "Explainable AI" will be useful for identifying and exploiting structural features in drug discovery.
{"title":"GA-XGBoost, an explainable AI technique, for analysis of thrombin inhibitory activity of diverse pool of molecules and supported by X-ray","authors":"Vijay H. Masand , Sami Al-Hussain , Abdullah Y. Alzahrani , Aamal A. Al-Mutairi , Arwa sultan Alqahtani , Abdul Samad , Gaurav S. Masand , Magdi E.A. Zaki","doi":"10.1016/j.chemolab.2024.105197","DOIUrl":"10.1016/j.chemolab.2024.105197","url":null,"abstract":"<div><p>The present work involves extreme gradient boosting in combination with shapley values, a thriving amalgamation under the terrain of Explainable artificial intelligence, along with genetic algorithm for the analysis of thrombin inhibitory activity of diverse pool of 2803 molecules. The methodology involves genetic algorithm for feature selection, followed by extreme gradient boosting analysis. The eight parametric genetic algorithm - extreme gradient boosting analysis has high statistical acceptance with R<sup>2</sup><sub>tr</sub> = 0.895, R<sup>2</sup><sub>L10%O</sub> = 0.900, and Q2F3 = 0.873. Shapley additive explanations, which provide each variable in a model an importance value, served as the foundation for the interpretation. Then, <em>ceteris paribus</em> approach involving comparison of counterfactual examples has been used to understand the influence of a structural feature on activity profile. The analysis indicates that aromatic carbon, ring/non-ring nitrogen in combination with other structural features govern the inhibitory profile. The genetic algorithm - extreme gradient boosting model's simplicity and predictions suggest that “Explainable AI” is useful in the future for identifying and using structural features in drug discovery.</p></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"253 ","pages":"Article 105197"},"PeriodicalIF":3.7,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141992615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

A novel feature selection framework for incomplete data
Cong Guo, Wei Yang, Zheng Li, Chun Liu
Pub Date: 2024-08-06 | DOI: 10.1016/j.chemolab.2024.105193 | Chemometrics and Intelligent Laboratory Systems, Volume 252, Article 105193
Feature selection on incomplete datasets is a challenging task. To address this challenge, existing methods first employ imputation to complete the dataset and then perform feature selection on the imputed data. Because missing-value imputation and feature selection are treated as entirely independent steps, the importance of features cannot be taken into account during imputation. However, in real-world scenarios and datasets, different features have varying degrees of importance. To this end, we propose a novel incomplete-data feature selection framework that considers feature importance. The framework consists of two alternating iterative stages: the M-stage and the W-stage. In the M-stage, missing values are imputed based on a given feature importance vector and multiple initial imputation results. In the W-stage, an improved reliefF algorithm is employed to learn the feature importance vector from the imputed data. In particular, the feature importance output by the W-stage in the current iteration is used as the input to the M-stage in the next iteration. Experimental results on artificial and real incomplete datasets demonstrate that the proposed method significantly outperforms other approaches.
{"title":"A novel feature selection framework for incomplete data","authors":"Cong Guo, Wei Yang, Zheng Li, Chun Liu","doi":"10.1016/j.chemolab.2024.105193","DOIUrl":"10.1016/j.chemolab.2024.105193","url":null,"abstract":"<div><p>Feature selection on incomplete datasets is a challenging task. To address this challenge, existing methods first employ imputation methods to complete the dataset and then perform feature selection based on the imputed dataset. Since missing value imputation and feature selection are entirely independent, the importance of features cannot be considered during imputation. However, in real-world scenarios or datasets, different features have varying degrees of importance. To this end, we proposed a novel incomplete data feature selection framework that considers feature importance. The framework mainly consists of two alternating iterative stages: M-stage and W-stage. In the M-stage, missing values are imputed based on a given feature importance vector and multiple initial imputation results. In the W-stage, an improved reliefF algorithm is employed to learn the feature importance vector based on the imputed data. In particular, the feature importance output by the W-stage in the current iteration will be used as the input of the M-stage in the next iteration. Experimental results on artificial and real missing datasets demonstrate that the proposed method outperforms other approaches significantly.</p></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"252 ","pages":"Article 105193"},"PeriodicalIF":3.7,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141930608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Structural attributes driving λmax towards NIR region: A QSPR approach
Payal Rani, Sandhya Chahal, Priyanka, Parvin Kumar, Devender Singh, Jayant Sindhu
Pub Date: 2024-08-06 | DOI: 10.1016/j.chemolab.2024.105199 | Chemometrics and Intelligent Laboratory Systems, Volume 252, Article 105199
Near-infrared materials find extensive applications in bio-sensing, photodynamic treatment, anti-counterfeiting and opto-electronics. Their progress has notably expanded possibilities in optical communication systems, non-invasive imaging and targeted therapy, benefiting fields such as material science, medicine, telecommunication and biology. In light of these advancements, the development of near-infrared region (NIR) based probes is highly desirable. Moreover, predicting the optical properties of a compound prior to its synthesis can diminish the need for expensive experimental testing. Considering the importance of such prior prediction, we herein present QSPR models for the prediction of absorption maxima using a dataset of 384 compounds. The aim of the present study is to identify molecular features that could shift the λmax of these compounds into the near-infrared region. The Monte Carlo optimization approach, along with the index of ideality of correlation (TF2), has been utilized in the CORAL 2019 software to develop models over ten splits. The predictability of the resulting ten models was assessed using various validation metrics. The model derived from the tenth split proved to be efficient, exhibiting R2validation = 0.8561, IIC = 0.7849 and Q2 = 0.8512. Good and bad fragments responsible for the change in absorption maxima (λmax) were also identified. The identified fragments were used to design ten new molecules to evaluate their reliability. It was observed that molecules designed using positive attributes shifted the absorption maxima towards the near-infrared region, specifically between 711 and 893 nm. This study opens up new possibilities for the advancement of NIR-based chromophores and will contribute significantly by reducing the overall cost of chromophore development.
{"title":"Structural attributes driving λmax towards NIR region: A QSPR approach","authors":"Payal Rani , Sandhya Chahal , Priyanka , Parvin Kumar , Devender Singh , Jayant Sindhu","doi":"10.1016/j.chemolab.2024.105199","DOIUrl":"10.1016/j.chemolab.2024.105199","url":null,"abstract":"<div><p>Near-infrared materials find extensive applications in <em>bio</em>-sensing, photodynamic treatment, anti-counterfeiting and <em>opto</em>-electronics. Their progress has notably expanded possibilities in optical communication systems, non-invasive imaging and targeted therapy, benefiting fields such as material science, medicine, tele-communication and biology. In light of these advancements, developments of near-infrared region (NIR) based probes are highly desirable. Moreover, the prediction of the optical properties of a compound prior to its synthesis can diminish the need for expensive experimental testing. Considering the importance of prior prediction, we herein present QSPR models for the prediction of absorption maxima using a dataset of 384 compounds. The aim of the present study is to identify molecular features that could shift their <span><math><mrow><msub><mi>λ</mi><mi>max</mi></msub></mrow></math></span> in the near-infrared region. The Monte Carlo Optimization approach along with the index of ideality of correlation (TF<sub>2</sub>) has been utilized using CORAL 2019 software for the development of ten splits. The predictability of the resulting ten models was assessed using various validation metrics. The model derived from the tenth split proved to be efficient, exhibiting <span><math><mrow><msubsup><mi>R</mi><mrow><mi>V</mi><mi>a</mi><mi>l</mi><mi>i</mi><mi>d</mi><mi>a</mi><mi>t</mi><mi>i</mi><mi>o</mi><mi>n</mi></mrow><mn>2</mn></msubsup><mo>=</mo><mn>0.8561</mn></mrow></math></span>, <span><math><mrow><mi>I</mi><mi>I</mi><mi>C</mi><mo>=</mo><mn>0.7849</mn><mspace></mspace><mi>a</mi><mi>n</mi><mi>d</mi><mspace></mspace><msup><mi>Q</mi><mn>2</mn></msup><mo>=</mo><mn>0.8512</mn></mrow></math></span>. Good and bad fragments were also identified that are responsible for the change in absorption maxima (<span><math><mrow><msub><mi>λ</mi><mi>max</mi></msub></mrow></math></span>). Identified fragments were utilized for designing ten new molecules to evaluate their reliability. It was observed that molecules designed using positive attributes shifted the absorption maxima towards the near-infrared region, specifically between 711 and 893 nm. This study opens up new possibilities for the advancement of NIR-based chromophores and will contribute significantly by reducing the overall cost of chromophore development.</p></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"252 ","pages":"Article 105199"},"PeriodicalIF":3.7,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141985556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}