LTFM: Long-tail few-shot module with loose coupling strategy for mineral spectral identification
Pub Date: 2024-10-15 | DOI: 10.1016/j.chemolab.2024.105247 | Chemometrics and Intelligent Laboratory Systems
In recent years, deep learning methods have exhibited superior performance in mineral identification, especially compared with conventional machine learning methods such as Support Vector Machine (SVM) and Partial Least Squares (PLS). Nevertheless, almost all of these deep learning methods focus on designing and improving network structures while neglecting the long-tail distribution of spectral data, which arises from the uneven distribution of ores and the scarcity of some natural minerals. To alleviate the interference of majority categories with minority categories, we propose the Long-Tail Few-shot Module (LTFM), inspired by rethinking the popular decoupling strategy, which first learns representations and then retrains the classifier on mineral spectral data. In particular, LTFM adopts a multi-expert design, in which the experts specialize respectively in improving feature representation learning, mitigating the long-tail effect, and alleviating the interference of few-shot categories. Additionally, a loose-coupling learning strategy is introduced so that primary representation learning proceeds first and the subsequent experts inherit this knowledge. Experiments on two publicly available spectral datasets show that the proposed LTFM significantly outperforms existing methods. Finally, extensive ablation studies are conducted to investigate the effectiveness, correctness, and robustness of our proposal.
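The decoupling strategy the abstract rethinks (stage 1: representation learning on the raw long-tailed data; stage 2: classifier retraining with class-balanced sampling) can be sketched in numpy. This is a minimal illustration of the class-balancing idea only, not the authors' LTFM; the toy label counts are invented.

```python
import numpy as np

def class_balanced_weights(labels):
    """Inverse-frequency sampling probabilities for the classifier-retraining
    stage of a decoupled long-tail pipeline: tail classes are sampled far
    more often than head classes."""
    classes, counts = np.unique(labels, return_counts=True)
    inv = 1.0 / counts
    probs = inv / inv.sum()  # per-class sampling probability, sums to 1
    return dict(zip(classes.tolist(), probs.tolist()))

# Toy long-tailed label set: 90 head samples, 9 medium, 1 tail
labels = np.array([0] * 90 + [1] * 9 + [2] * 1)
w = class_balanced_weights(labels)
```

With these probabilities, the rarest class (2) receives the largest sampling weight, which is what mitigates the interference of majority categories during classifier retraining.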
Recent applications of analytical quality-by-design methodology for chromatographic analysis: A review
Pub Date: 2024-10-10 | DOI: 10.1016/j.chemolab.2024.105243
Analytical Quality-by-Design (AQbD) is a systematic methodology for method development. The pharmaceutical and biopharmaceutical industries have increasingly recognized and applied AQbD concepts, guided by the overall framework provided by the ICH. AQbD is intended to ensure that an analytical procedure is fit for its intended purpose throughout its entire lifecycle, leading to a well-understood and purpose-driven method. It guides each stage of the analytical process lifecycle by establishing the Analytical Target Profile (ATP), identifying critical method parameters (CMPs), and selecting critical method attributes (CMAs). By employing screening and response-surface experimental designs, significant factors are pinpointed and optimized through statistical analysis. This methodology aids in defining the design space, or Method Operable Design Region (MODR), to ensure consistent method performance. This review covers the foundational principles of AQbD for method development and presents its latest applications in the period 2019–2024, with reference to chromatographic analysis of both non-synthetic and synthetic compounds in different sample matrices. The implementation of AQbD has been shown to produce more robust chromatographic methods and to make their development more efficient. Nevertheless, its adoption can be hindered by the need for a comprehensive grasp of statistical analysis and experimental design, coupled with the absence of standardized guidelines or regulatory requirements.
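The screening designs mentioned above can be generated mechanically. A minimal sketch of a two-level full-factorial design for three hypothetical CMPs (the factor names and levels are invented for illustration, not taken from the review):

```python
from itertools import product

def full_factorial(levels_per_factor):
    """Full-factorial design matrix: one row per experimental run,
    one column per factor, covering every combination of levels."""
    return [list(run) for run in product(*levels_per_factor)]

# Hypothetical chromatographic CMPs: mobile-phase pH, flow rate (mL/min),
# and % organic modifier, each at two levels
design = full_factorial([[2.5, 3.5], [0.8, 1.2], [20, 40]])
```

Three factors at two levels give 2^3 = 8 runs; the responses measured at these runs are what the statistical analysis then fits to pinpoint significant factors.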
Layer-wise-residual-driven approach for soft sensing in composite dynamic system based on slow and fast time-varying latent variables
Pub Date: 2024-10-09 | DOI: 10.1016/j.chemolab.2024.105245
Driven by the need for a comprehensive understanding of composite dynamic systems in industrial processes, this paper investigates a new soft sensor for quality prediction based on the extraction of slow and fast time-varying latent variables using layer-wise residuals. First, slow-feature partial least squares is extended to model long-term dependencies by introducing explicit expressions of the latent process state into the objective function. Then, a multilayer regression model for exploring composite dynamics, driven by layer-wise residuals, is developed using a serial structure that extracts both slow and fast time-varying latent variables that are completely orthogonal. Finally, exponentially weighted partial least squares is proposed for extracting fast time-varying dynamic latent variables by learning the exponential decay of the time-series correlation. Case studies on an industrial debutanizer and a sulfur recovery unit show that the prediction accuracy of the proposed approach outperforms traditional methods.
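The distinction between slow and fast time-varying latent variables can be quantified with the standard slowness measure used in slow-feature-style objectives: the mean squared first difference of the standardized score series (smaller means slower). A minimal sketch, with invented toy signals standing in for extracted latent-variable scores:

```python
import numpy as np

def slowness(t_scores):
    """Slowness of a latent-variable time series: mean squared first
    difference of the standardized signal. Slow latent variables score
    low, fast time-varying ones score high."""
    z = (t_scores - t_scores.mean()) / t_scores.std()
    return float(np.mean(np.diff(z) ** 2))

t = np.linspace(0, 2 * np.pi, 200)
slow_lv = np.sin(t)        # slowly varying latent variable
fast_lv = np.sin(25 * t)   # fast time-varying latent variable
```

A serial extraction scheme like the one described would separate components with low slowness values from those with high ones.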
Applicability domain of a calibration model based on neural networks and infrared spectroscopy
Pub Date: 2024-10-05 | DOI: 10.1016/j.chemolab.2024.105242
Artificial neural networks are used as calibration models in routine analytical determinations involving spectroscopic data. To ensure that a model generates reliable predictions for new samples, its applicability domain must be well defined. This article describes a strategy for establishing the limits of the applicability domain when the calibration model is a feed-forward neural network. The applicability domain is defined by two limits: 1) the 0.99 quantile of the squared Mahalanobis distance calculated from the network activations of the training set, and 2) the 0.99 quantile of the reconstruction error of the training spectra using either an autoencoder network or a decoder network. A new sample whose squared Mahalanobis distance and/or spectral residuals exceed these limits is considered outside the applicability domain, and its prediction is questionable. The approach is illustrated by predicting the density of diesel fuel samples from mid-infrared spectra and the fat content of meat from near-infrared spectra. The methodology correctly detected anomalous spectra in prediction using either the autoencoder or the decoder.
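The two-limit rule described above is easy to sketch: take the empirical 0.99 quantile of the training-set squared Mahalanobis distances and of the reconstruction errors, and flag any new sample exceeding either. The random arrays below merely stand in for network activations and autoencoder residuals; they are not the paper's data.

```python
import numpy as np

def ad_limits(activations, residuals, q=0.99):
    """Applicability-domain limits: q-quantile of the squared Mahalanobis
    distance of training activations and of the reconstruction error."""
    mu = activations.mean(axis=0)
    inv = np.linalg.inv(np.cov(activations, rowvar=False))
    diff = activations - mu
    d2 = np.einsum('ij,jk,ik->i', diff, inv, diff)  # per-sample distance
    return np.quantile(d2, q), np.quantile(residuals, q), mu, inv

def outside_domain(x_act, x_resid, d2_lim, r_lim, mu, inv):
    """A sample is outside the domain if either limit is exceeded."""
    d2 = float((x_act - mu) @ inv @ (x_act - mu))
    return d2 > d2_lim or x_resid > r_lim

rng = np.random.default_rng(0)
acts = rng.normal(size=(500, 3))     # stand-in for hidden-layer activations
resid = rng.chisquare(3, size=500)   # stand-in for reconstruction errors
d2_lim, r_lim, mu, inv = ad_limits(acts, resid)
```

A sample far from the training cloud is flagged, while a sample at the training mean with negligible residual is accepted.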
Machine learning based modeling for estimation of drug solubility in supercritical fluid by adjusting important parameters
Pub Date: 2024-10-03 | DOI: 10.1016/j.chemolab.2024.105241
We employed machine learning models to predict how well the drug capecitabine dissolves in supercritical carbon dioxide, a green solvent. The aim is to assess the drug's suitability for processing into nanodrugs with enhanced bioavailability. In the data set employed, pressure (P) and temperature (T) serve as inputs, and the solubility Y is the only output for building the models. This study uses Decision Tree (DT) and Multilayer Perceptron (MLP) as the core models. However, conventional algorithms in their raw, individual form may not provide accurate and generalizable results; ensemble methods such as boosting improve model performance. Moreover, both the single models and the ensembles built on them have hyper-parameters whose optimization affects the final models, and meta-heuristic algorithms are popular for tuning them. In this research, we used a new hybrid framework that couples the basic models with the AdaBoost algorithm (as an ensemble method) and the PO and CS algorithms (as optimizers) to obtain four different models. The MLP model boosted with AdaBoost and tuned with the PO algorithm showed the best fitting accuracy of all models, achieving an RMSE of 1.71, an MSE of 2.92, and an MAE of 1.42.
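The three error metrics reported above are related: RMSE is the square root of MSE, and the reported values are mutually consistent (1.71^2 ≈ 2.92). A minimal sketch of how they are computed, with invented toy values rather than the study's predictions:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MSE, RMSE and MAE for a set of predictions; RMSE = sqrt(MSE)."""
    err = np.asarray(y_true) - np.asarray(y_pred)
    mse = float(np.mean(err ** 2))
    return {'mse': mse, 'rmse': mse ** 0.5, 'mae': float(np.mean(np.abs(err)))}

m = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 1.5, 3.5, 3.5])
```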
Benchmarking multiblock methods with canonical factorization
Pub Date: 2024-10-02 | DOI: 10.1016/j.chemolab.2024.105240
Data measured on the same observations and organized in blocks of variables, whether from different measurement sources or deduced from topics specified by the user, are common in practice. Multiblock exploratory methods are useful tools for extracting information from such data in a reduced and interpretable common space. However, many methods have been proposed independently, and users are often at a loss to select the appropriate one, especially as the methods do not always lead to the same results and their outputs do not have the same form. For this purpose, data decomposition by canonical factorization is introduced and applied to several widely used methods: CPCA, MCOA, MFA, STATIS, and CCSWA. The methods were compared on simulated (resp. real) data whose structure is controlled (resp. known). Theoretical and practical results pinpoint that the block structure must be carefully explored beforehand. The number of block variables and the distribution of block variance along dimensions affect the choice of block scaling, while the observation structure within and between blocks affects the choice of method. CPCA and MCOA mix common and specific information, STATIS highlights common structure only, whereas CCSWA focuses on specific information. To enable these diagnoses, the methods and the proposed comparison tools are available in R, Matlab, and Galaxy.
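Block scaling, whose choice the study shows to matter, can be sketched in its simplest form: divide each block by its Frobenius norm before concatenation so that a block with many variables or a large variance cannot dominate the common space. (MFA instead divides by the block's first singular value; the variant below is the simpler norm-based one, shown with invented random blocks.)

```python
import numpy as np

def scale_blocks(blocks):
    """Frobenius-norm block scaling: each block contributes equal total
    variance to the concatenated multiblock data matrix."""
    return [b / np.linalg.norm(b) for b in blocks]

rng = np.random.default_rng(1)
# Two invented blocks on the same 10 observations: one wide, one
# narrow-but-large-valued, so unscaled concatenation would be dominated
blocks = [rng.normal(size=(10, 50)), 100.0 * rng.normal(size=(10, 3))]
scaled = scale_blocks(blocks)
concat = np.hstack(scaled)  # common data matrix for e.g. a CPCA-style analysis
```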
KF-PLS: Optimizing Kernel Partial Least-Squares (K-PLS) with Kernel Flows
Pub Date: 2024-10-01 | DOI: 10.1016/j.chemolab.2024.105238
Partial Least-Squares (PLS) regression is a widely used tool in chemometrics for performing multivariate regression. As PLS has a limited capacity for modelling non-linear relations between the predictor variables and the response, Kernel PLS (K-PLS) has been introduced to model non-linear predictor-response relations. Most available studies use fixed kernel parameters, limiting the performance potential of the method, and only a few have addressed optimizing the kernel parameters for K-PLS. In this article, we propose a methodology for kernel function optimization based on Kernel Flows (KF), a technique developed for Gaussian Process Regression (GPR). The results are illustrated with four case studies, comprising both numerical examples and real data used in classification and regression tasks. K-PLS optimized with KF, called KF-PLS in this study, yields good results in all illustrated scenarios, outperforming literature results and other non-linear regression methodologies. KF-PLS was compared with convolutional neural networks (CNN), random trees, ensemble methods, support vector machines (SVM), and GPR, and proved to perform very well.
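The kernel parameters being optimized enter K-PLS through the Gram matrix. A minimal sketch of the standard ingredients, an RBF Gram matrix (whose `gamma` is the kind of parameter a Kernel-Flows-style procedure would tune) and its double-centering, which precedes the K-PLS decomposition; this is generic kernel machinery, not the authors' KF-PLS algorithm:

```python
import numpy as np

def rbf_kernel(X, gamma):
    """RBF Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.clip(d2, 0.0, None))

def center_kernel(K):
    """Double-center the Gram matrix: Kc = J K J with J = I - 11'/n."""
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    return J @ K @ J

X = np.array([[0.0], [1.0], [2.0]])
Kc = center_kernel(rbf_kernel(X, gamma=0.5))
```

Centering makes every row and column of the Gram matrix sum to zero, the kernel-space analogue of mean-centering the predictors.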
AIPs-DeepEnC-GA: Predicting anti-inflammatory peptides using embedded evolutionary and sequential feature integration with genetic algorithm based deep ensemble model
Pub Date: 2024-09-29 | DOI: 10.1016/j.chemolab.2024.105239
Inflammation is a biological response to harmful stimuli, including infections, damaged cells, tissue injuries, and toxic chemicals. It plays an essential role in facilitating tissue repair by eliminating pathogenic microorganisms. Numerous therapies are currently applied to treat autoimmune and inflammatory diseases; however, these conventional anti-inflammatory medications are often labor-intensive, costly, and associated with adverse side effects. Recently, researchers have identified anti-inflammatory peptides (AIPs) as a cost-effective alternative for treating several inflammatory diseases, owing to their high selectivity for target cells and minimal side effects. In this study, we introduce a novel computational predictor, AIPs-DeepEnC-GA, developed to accurately predict AIP samples. The training sequences are encoded using a novel n-spaced dipeptide-based position-specific scoring matrix (NsDP-PSSM) and pseudo position-specific scoring matrix (PsePSSM)-based embedded evolutionary features. Additionally, the reduced amino-acid alphabet (RAAA-11) and composite physicochemical properties (CPP) are employed to capture cluster physicochemical properties based on structural information. A hybrid feature strategy is then applied, integrating the embedded evolutionary features with the CPP and RAAA-11 descriptors to overcome the limitations of individual encoding methods. Minimum redundancy maximum relevance (mRMR) is used to select the optimal features, which are then used to train four different deep-learning models. The predictive labels generated by these models are passed to a genetic algorithm to form a deep-ensemble training model. The proposed AIPs-DeepEnC-GA model achieved a ∼15 % increase in predictive accuracy, reaching 94.39 %, and a 19 % improvement in the area under the curve (AUC), achieving a value of 0.98 on the training sequences. For independent datasets, our method obtained improved accuracies of 91.87 % and 89.21 %, with AUC values of 0.94 and 0.92 for Ind-I and Ind-II, respectively. The proposed model demonstrates an 11 % improvement in predictive accuracy over existing AIP computational models on training samples. Its efficacy and reliability make it a promising tool for both drug development and academic research.
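The mRMR step used above for feature selection can be sketched greedily: at each step pick the feature with the highest relevance to the target minus its mean redundancy with the already-selected features. The sketch below uses absolute Pearson correlation as a stand-in for the mutual information of the original criterion, and invented synthetic features rather than the peptide descriptors:

```python
import numpy as np

def mrmr(X, y, k):
    """Greedy minimum-redundancy maximum-relevance selection using
    |correlation| as a proxy for mutual information."""
    n_feat = X.shape[1]
    rel = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_feat)])
    selected = [int(np.argmax(rel))]          # most relevant feature first
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(n_feat):
            if j in selected:
                continue
            red = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1])
                           for s in selected])
            score = rel[j] - red              # relevance minus redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected

rng = np.random.default_rng(2)
y = rng.normal(size=100)
X = np.column_stack([y + 0.1 * rng.normal(size=100),   # relevant
                     y + 0.1 * rng.normal(size=100),   # redundant copy
                     rng.normal(size=100)])            # irrelevant
picked = mrmr(X, y, k=2)
```

The first pick is always one of the two relevant features; the redundancy penalty then discourages selecting its near-copy.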
An automated Peak Group Analysis for vibrational spectra analysis
Pub Date: 2024-09-23 | DOI: 10.1016/j.chemolab.2024.105234
Peak Group Analysis (PGA) is a multivariate curve resolution technique that attempts to extract single pure-component spectra from time series of spectral mixture data. It requires that the mixture spectra consist of relatively sharp peaks, as is typical in IR and Raman spectroscopy. Starting from individual peaks, PGA constructs the associated pure-component spectra as nonnegative linear combinations of the right singular vectors of the spectral data matrix.
This work presents an automated PGA (autoPGA) that starts with upstream peak detection applied to the time series of spectra, combining different window-based peak detection techniques with balanced peak acceptance criteria and peak grouping to handle repeated detections. The next step is a single-spectrum-oriented PGA analysis, followed by a downstream correlation analysis to identify pure-component spectra that occur multiple times. AutoPGA provides a complete pure-component factorization of the matrix of measured data. The algorithm is applied to FT-IR data sets on various rhodium carbonyl complexes and on an equilibrium of iridium complexes.
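The window-based peak detection that autoPGA starts from can be sketched minimally: a point counts as a peak if it is the maximum of its local window and clears a height threshold (a simple acceptance criterion). The Lorentzian-like test spectrum below is invented, standing in for sharp IR/Raman bands; this is not the authors' detector.

```python
import numpy as np

def detect_peaks(spectrum, half_window=3, min_height=0.1):
    """Window-based peak detection: index i is a peak if spectrum[i] is
    the maximum over [i - half_window, i + half_window] and exceeds
    min_height."""
    peaks = []
    for i in range(half_window, len(spectrum) - half_window):
        win = spectrum[i - half_window:i + half_window + 1]
        if spectrum[i] == win.max() and spectrum[i] >= min_height:
            peaks.append(i)
    return peaks

x = np.linspace(0, 10, 500)
# Two sharp Lorentzian-like bands, as typical of IR/Raman spectra
spectrum = (1.0 / (1 + ((x - 3) / 0.1) ** 2)
            + 0.6 / (1 + ((x - 7) / 0.1) ** 2))
peaks = detect_peaks(spectrum)
```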
Pub Date : 2024-09-21, DOI: 10.1016/j.chemolab.2024.105237
The responses of paper-based colorimetric sensor arrays are typically recorded with an imaging device. The color values of the images are then subjected to chemometric data analysis to extract the relevant information. As with data from other analytical instruments, these data must be pre-processed before further analysis. This study is the first comprehensive and systematic investigation of the impact of data pre-processing techniques on the quality of subsequent data analysis methods applied to imaging data collected from paper-based colorimetric sensor arrays. The use of color difference data (calculated by subtracting the images of the sensors before exposure from those after exposure) revealed that pre-treatment of the data was not a critical factor, although it could reduce the complexity of the model. For example, the number of principal components in the principal component-linear discriminant analysis model was reduced from eight (for data that had not been pre-processed) to three (for pre-processed data) while achieving the same accuracy (92 %). Nevertheless, the pivotal role of data pre-processing became clear when examining data sets collected immediately after exposure to the samples' vapor. Using an appropriate pre-processing method eliminated, or significantly reduced, between-sensor variations, obviating the need to include data from images taken before exposure. For the classification objective, the object pre-processing methods that showed particular promise were mean (or median) centering, Pareto scaling, and standard normal variate.
To illustrate, in the analysis of volatile organic compounds by an array of metallic nanoparticles, the cross-validation classification accuracy increased from 70 % for the unprocessed data to 95 % when unit variance scaling and range scaling were applied to objects and variables, respectively. In the calibration phase, the majority of pre-processing methods improved the quality of the regression models. Using suitable pre-processing methods for both objects and variables eliminated the need for the image of the colorimetric sensor arrays (CSAs) taken before exposure.
"Data pre-processing for paper-based colorimetric sensor arrays," Chemometrics and Intelligent Laboratory Systems, DOI: 10.1016/j.chemolab.2024.105237 (published 2024-09-21).
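The pre-processing methods named in the abstract have standard chemometric definitions. A minimal NumPy sketch (not the paper's implementation): mean centering and Pareto scaling operate column-wise on variables, while standard normal variate (SNV) operates row-wise on objects, which matches the object/variable distinction drawn above:

```python
import numpy as np

def mean_center(X):
    """Center each variable (column) to zero mean."""
    return X - X.mean(axis=0)

def pareto_scale(X):
    """Mean-center, then divide each column by the square root of its standard deviation."""
    return (X - X.mean(axis=0)) / np.sqrt(X.std(axis=0, ddof=1))

def snv(X):
    """Standard normal variate: scale each object (row) to zero mean, unit standard deviation."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)
```

Because SNV normalizes each row independently, it can suppress between-sensor intensity offsets, which is consistent with the abstract's observation that suitable object pre-processing reduces between-sensor variation.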