Pub Date: 2024-05-19 DOI: 10.1016/j.chemolab.2024.105150
Riccardo Voccio, Cristina Malegori, Paolo Oliveri, Federica Branduani, Marco Arimondi, Andrea Bernardi, Giorgio Luciano, Mattia Cettolin
Tyre materials are complex products, as they are prepared using a number of raw materials, each of them with its specific chemical composition and functionality in the final product. It is, therefore, of crucial importance to avoid mislabeling errors and even to verify the compliance of raw materials entering the factory.
The present study proposes a strategy that makes use of near infrared (NIR) spectroscopy combined with chemometrics for raw material identification (RMID) and compliance verification of the most common raw materials used in the tyre industry. In particular, the chemometric model developed consists of a global hierarchical classification model, which combines nested PLS-DA nodes for RMID and SIMCA nodes for compliance verification, in a two-step approach.
The global model showed satisfactory results: for most of the classes of interest, 100 % of total correct predictions and a sensitivity higher than 90 % were obtained in the test set.
The final goal is to apply the strategy directly to raw materials at their receiving stage in the factory, with the double advantage of minimizing the risk of mislabeling and, at the same time, decreasing the number of suspicious samples that need to be analyzed in the laboratory by traditional methods to verify their compliance.
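The two-step flow described in this abstract (identify the material first, then verify its compliance) can be illustrated with a deliberately simplified sketch. It does not implement PLS-DA or SIMCA: a nearest-centroid rule stands in for the PLS-DA identification node and a distance limit stands in for the SIMCA compliance node. The class names, the two-value "spectra", and the `limit_factor` parameter are all hypothetical.

```python
import math


def centroid(rows):
    """Column-wise mean of a list of equal-length spectra."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def distance(a, b):
    """Euclidean distance between two spectra."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class TwoStepClassifier:
    """Step 1: identify the material. Step 2: check compliance for that material."""

    def __init__(self, training, limit_factor=2.0):
        # training: dict mapping class name -> list of spectra (lists of floats)
        self.centroids = {name: centroid(rows) for name, rows in training.items()}
        # Assumed compliance limit: a multiple of the largest training distance.
        self.limits = {
            name: limit_factor * max(distance(r, self.centroids[name]) for r in rows)
            for name, rows in training.items()
        }

    def predict(self, spectrum):
        # Step 1 (stand-in for the PLS-DA node): nearest class centroid.
        name = min(self.centroids, key=lambda k: distance(spectrum, self.centroids[k]))
        # Step 2 (stand-in for the SIMCA node): accept only within the class limit.
        compliant = distance(spectrum, self.centroids[name]) <= self.limits[name]
        return name, compliant


training = {
    "silica": [[1.0, 0.0], [1.2, 0.1], [0.8, -0.1]],
    "carbon_black": [[5.0, 5.0], [5.2, 4.9], [4.8, 5.1]],
}
model = TwoStepClassifier(training)
near_sample = model.predict([1.1, 0.05])   # close to the silica training spectra
far_sample = model.predict([2.5, 1.0])     # nearest to silica, but outside its limit
```

The point of the structure is that a sample can be assigned to the closest class and still be flagged as non-compliant, which is what separates identification from compliance verification.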
Title: Combining PLS-DA and SIMCA on NIR data for classifying raw materials for tyre industry: A hierarchical classification model
Journal: Chemometrics and Intelligent Laboratory Systems, volume 250, Article 105150
Introduction
Assessing repeatability and reproducibility in analytical chemistry is commonly based on parametric dispersion indicators, such as standard deviation and relative standard deviation, calculated for each detected variable using repeated measurements of Quality Control (QC) samples collected throughout the data acquisition sequence. However, their reliability strongly relies on the assumption of a normal distribution. Because analytical variability depends on many sources, such parametric estimators are not always suitable. There is therefore a need for robust indicators of data quality that are independent of central values and of any parametric assumption.
Methods
Three specific indicators were developed: (i) intra-group dispersion, based on the median area of the convex hull of QC samples within an analytical batch; (ii) inter-group dispersion, defined as the gradient of the deviation between analytical batches; and (iii) dispersion index. Mathematical properties of these indicators, including positivity, stability, and translation invariance, were then evaluated using synthetic data under normal and non-normal distributions. Finally, the relevance of these indicators and the associated visualization methods were highlighted based on a metabolomics case study involving liquid chromatography coupled to mass spectrometry measurements of the NIST SRM1950 reference material analyzed over more than one year within different projects.
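As a rough illustration of the first indicator, the sketch below computes the convex hull area of QC points in a two-dimensional plane and takes the median over batches. How the two coordinates are chosen and how the median is aggregated are assumptions made here for illustration; the abstract does not specify them. The hull uses Andrew's monotone chain algorithm and the area uses the shoelace formula.

```python
from statistics import median


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def hull_area(points):
    """Area enclosed by the convex hull (shoelace formula); 0.0 for degenerate sets."""
    hull = convex_hull(points)
    if len(hull) < 3:
        return 0.0
    total = 0.0
    for i in range(len(hull)):
        x1, y1 = hull[i]
        x2, y2 = hull[(i + 1) % len(hull)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def intra_group_dispersion(batches):
    """Assumed aggregation: median of the per-batch hull areas."""
    return median(hull_area(batch) for batch in batches)


# Hypothetical QC coordinates for three batches (tuples of two values).
batch_a = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
batch_b = [(0, 0), (1, 0), (0, 1)]
batch_c = [(0, 0), (3, 0), (3, 1), (0, 1)]
dispersion = intra_group_dispersion([batch_a, batch_b, batch_c])

# Shifting every point by the same offset leaves the hull area unchanged.
shifted_a = [(x + 10, y - 5) for x, y in batch_a]
```

Because the area depends only on differences between points, it is non-negative and unchanged by translation, which matches two of the properties named in the abstract.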
Results
The proposed indicators were shown to be translation invariant and always positive, while first investigations performed on synthetic data revealed a high stability under multiplication. Their application to experimental data revealed specific behaviors depending on the characteristics of the signal associated with the different detected analytes, showing their ability to capture the variability observed in both parametric and non-parametric conditions. This investigation also showed different structures of sensitivity to analytical variability throughout the data processing steps. Finally, the proposed indicators allowed a visualization of the analytical drift in two dimensions, to facilitate result interpretation.
Conclusion
These indicators open the way to a better and more robust assessment of repeatability and reproducibility but also to improvements of long-term data comparability involving suitability testing.
Title: An alternative for the robust assessment of the repeatability and reproducibility of analytical measurements using bivariate dispersion
Authors: Elfried Salanon, Blandine Comte, Delphine Centeno, Stéphanie Durand, Estelle Pujos-Guillot, Julien Boccard
Pub Date: 2024-05-18 DOI: 10.1016/j.chemolab.2024.105148
Journal: Chemometrics and Intelligent Laboratory Systems, volume 250, Article 105148
High dimensionality is one of the main issues affecting the effectiveness of quantitative structure-activity relationship (QSAR) classification techniques in chemometrics. Feature selection is a critical procedure that identifies the most relevant descriptors in a dataset. By lowering the number of features, it increases classification accuracy, reduces the computational burden, and improves overall performance. The golden jackal optimization (GJO) algorithm was recently introduced and has been used successfully to solve various continuous optimization problems. This study proposes an improvement of the GJO algorithm that employs chaotic maps, abbreviated as CGJO, to enhance its exploration and exploitation capability when selecting the essential descriptors in QSAR classification models, with high classification accuracy and less computation time. Experimental findings based on four different high-dimensional chemical datasets show that the proposed CGJO algorithm can maximize classification accuracy while simultaneously decreasing the number of chosen descriptors and lowering the computation time. Thus, the proposed algorithm can be useful for chemical data classification in other QSAR modeling tasks.
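To make the idea of a chaotic map concrete, the sketch below generates a deterministic sequence with the logistic map and turns it into a descriptor-selection mask. The abstract does not say which chaotic map CGJO uses or how it enters the update equations, so the logistic map, the `threshold` parameter, and the masking step are assumptions chosen only for illustration.

```python
def logistic_map(x0, n, r=4.0):
    """Return n successive values of the logistic map x <- r * x * (1 - x)."""
    values = []
    x = x0
    for _ in range(n):
        x = r * x * (1.0 - x)
        values.append(x)
    return values


def chaotic_mask(x0, n_descriptors, threshold=0.5):
    """Hypothetical use: select a descriptor when its chaotic value exceeds threshold."""
    return [v > threshold for v in logistic_map(x0, n_descriptors)]


sequence = logistic_map(0.7, 4)
mask = chaotic_mask(0.7, 4)
```

In chaotic variants of metaheuristics, a sequence of this kind typically replaces uniform pseudo-random numbers, so that the search is still spread over the unit interval but is fully reproducible from the seed value `x0`.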
Title: Improving golden jackel optimization algorithm: An application of chemical data classification
Authors: Aiedh Mrisi Alharthi, Dler Hussein Kadir, Abdo Mohammed Al-Fakih, Zakariya Yahya Algamal, Niam Abdulmunim Al-Thanoon, Maimoonah Khalid Qasim
Pub Date: 2024-05-17 DOI: 10.1016/j.chemolab.2024.105149
Journal: Chemometrics and Intelligent Laboratory Systems, volume 250, Article 105149
Pub Date: 2024-05-15 DOI: 10.1016/j.chemolab.2024.105145
Alexandre de Fátima Cobre, Anderson Ara, Alexessander Couto Alves, Moisés Maia Neto, Mariana Millan Fachi, Laize Sílvia dos Anjos Botas Beca, Fernanda Stumpf Tonin, Roberto Pontarolo
Recent data from the World Health Organization reveal that in 2023, 38.8 million people were living with HIV. Within this population, there were 1.5 million new cases and 650 thousand deaths attributed to the disease. This study employs an integrated approach involving QSAR-based machine learning models, molecular docking, and molecular dynamics simulations to identify potential compounds for inhibiting the bioactivity of the CC chemokine receptor type 5 (CCR5) protein, a key entry point for HIV. Using non-redundant experimental data from the CHEMBL database, 40 different machine learning algorithms were trained, and the top four models (XGBoost, Histogram-based Gradient Boosting, Light Gradient Boosted Machine, and Extra Trees Regression) were used to predict anti-HIV bioactivity for 37 billion compounds in the ZINC-22 database. The screening resulted in the identification of 124 new anti-HIV drug candidates, confirmed through molecular docking and dynamics simulations. The study underscores the therapeutic potential of these compounds, paving the way for further in vitro and in vivo investigations. The convergence of machine learning and experimental findings presents a promising avenue for significant advancements in pharmaceutical research, particularly in the treatment of viral diseases such as HIV. To guarantee the reproducibility of the study, the Python code (Google Colab) and the associated database are available on GitHub: https://github.com/AlexandreCOBRE/code.
Title: Identifying 124 new anti-HIV drug candidates in a 37 billion-compound database: An integrated approach of machine learning (QSAR), molecular docking, and molecular dynamics simulation
Journal: Chemometrics and Intelligent Laboratory Systems, volume 250, Article 105145
Pub Date: 2024-05-10 DOI: 10.1016/j.chemolab.2024.105138
Theofanis Gerodimos, Ioannis Georvasilis, Anastasios Asvestas, Georgios P. Mastrotheodoros, Aristidis Likas, Dimitrios F. Anagnostopoulos
Macroscopic X-ray fluorescence (MA-XRF) datasets are analyzed using Artificial Neural Networks. Specifically, Convolutional Neural Networks (CNNs) are trained by coupling the spectra acquired during the MA-XRF scan of two religious panel paintings (“icons”) with the associated Ground-Truth (GT) counts per characteristic transition line, as extracted by X-ray fluorescence fundamental parameters analysis. In total, twenty thousand XRF spectra were used for the CNN training. The trained neural networks were applied to analyze millions of MA-XRF spectra acquired during the scan of religious panel paintings by computing the counts per pixel of X-ray characteristic transition lines and creating the elemental transition maps. Comparison of the CNN-extracted results with the GT shows remarkable agreement. The successful analysis of MA-XRF datasets with the CNN method opens an analytical path toward the automatic identification of spectral lines, giving the inexperienced XRF analyst the means to provide a state-of-the-art analysis and helping the experienced user avoid overlooking poorly resolved transition lines.
Title: MA-XRF datasets analysis based on convolutional neural network: A case study on religious panel paintings
Journal: Chemometrics and Intelligent Laboratory Systems, volume 250, Article 105138
Pub Date: 2024-05-09 DOI: 10.1016/j.chemolab.2024.105137
Meysam Latifi Amoghin, Yousef Abbaspour-Gilandeh, Mohammad Tahmasebi, Juan Ignacio Arribas
Browning of food products often occurs upon cutting and damage during processing, transport, and storage, among other causes. Enzymatic browning is mainly due to polyphenol oxidase (PPO) and peroxidase (POD) enzymes. Visible/near-infrared (Vis/NIR) imaging spectroscopy in the range of 350–1150 nm was used in this study for automatic and non-destructive evaluation of PPO and POD activity levels in three bell pepper varieties (red, yellow, orange), with N = 30 input samples per variety. The spectral data were first modeled by partial least squares regression (PLSR) over the whole spectral range, without using any subset of the most effective wavelength (EW) values. Over the validation set, the coefficient of determination (R2) for the estimation (prediction) of POD enzyme activity levels was 0.794, 0.772, and 0.726 for red, yellow, and orange bell peppers, respectively. For PPO enzyme activity levels, the R2 values over the validation set were 0.901, 0.810, and 0.859 for red, yellow, and orange bell peppers, respectively. In addition, hybrid machine learning (ML) techniques combining a support vector machine (SVM) with genetic algorithms (GA), particle swarm optimization (PSO), ant colony optimization (ACO), or imperialistic competitive algorithms (ICA) were used to select the optimal (discriminant) spectral EW values, and regression performance consistently improved, judging from the higher R2 values. Either 14 or 15 EWs were computed and selected in order of their discriminative power using these ML techniques. The hybrid SVM-PSO method proved to be the best at selecting the most effective wavelength values (nm).
Three regression methods, namely PLSR, multiple linear regression (MLR), and neural network (NN), were then employed to model the SVM-PSO selected EWs. Over the test set, the non-linear NN regression method gave better ratio of performance to deviation (RPD), R2, and root mean square error (RMSE) values than the other two regression methods, closely followed by PLSR. The NN regression method was therefore selected as the best approach for modeling the most effective spectral wavelength values in this study.
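The three figures of merit named above can be computed as in the following sketch. The formulas are the commonly used definitions (RPD taken as the sample standard deviation of the reference values divided by the RMSE); whether the study used exactly these conventions is an assumption, and the numbers are made-up values, not data from the paper.

```python
import math
from statistics import mean, stdev


def rmse(y_true, y_pred):
    """Root mean square error of prediction."""
    return math.sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rpd(y_true, y_pred):
    """Ratio of performance to deviation: SD of reference values over RMSE."""
    return stdev(y_true) / rmse(y_true, y_pred)


# Hypothetical reference and predicted enzyme activity values.
reference = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
scores = {
    "R2": r_squared(reference, predicted),
    "RMSE": rmse(reference, predicted),
    "RPD": rpd(reference, predicted),
}
```

A higher R2 and RPD together with a lower RMSE indicate a better regression model, which is the comparison the abstract draws between NN, PLSR, and MLR.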
Title: Automatic non-destructive estimation of polyphenol oxidase and peroxidase enzyme activity levels in three bell pepper varieties by Vis/NIR spectroscopy imaging data based on machine learning methods
Journal: Chemometrics and Intelligent Laboratory Systems, volume 250, Article 105137
Pub Date: 2024-05-04 DOI: 10.1016/j.chemolab.2024.105136
Omar Nibouche, Fayas Asharindavida, Hui Wang, Jordan Vincent, Jun Liu, Saskia van Ruth, Paul Maguire, Enayet Rahman
The performance of the well-known and extensively studied Linear Discriminant Analysis (LDA) can degrade when data are not homoscedastic or not Gaussian. In such cases, the classical assumptions under which LDA models are built do not hold, and LDA projections cannot extract the features needed to explain the intrinsic structure of the data and to separate the classes. As with many real-world data sets, data obtained using miniature spectrometers can suffer from such drawbacks, which would limit the deployment of this technology for food analysis. The solution presented in the paper is to divide classes into subclasses and to use the means of the subclasses, the classes, and the whole data set in the proposed between-class scatter metric. Further, samples belonging to the same subclass are used to build a measure of within-subclass scatter. This addresses the shortcoming of classical LDA. Results obtained with the proposed solution on food data and on general machine learning datasets show that it is very competitive with similar sub-class LDA algorithms in the literature. An extension to a Hilbert space is also presented, and the kernel version of the presented solution can be fused with its linear counterpart to yield improved classification rates.
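The within-subclass scatter mentioned above can be sketched as the sum, over subclasses, of the outer products of each sample's deviation from its own subclass mean. This is the generic textbook form; the paper's exact normalization and its between-class metric are not given in the abstract, so the unnormalized sum, the function name, and the toy data below are assumptions.

```python
def within_subclass_scatter(samples, subclass_ids):
    """Sum over subclasses of (x - subclass mean)(x - subclass mean)^T."""
    dim = len(samples[0])
    # Group samples by their subclass label.
    groups = {}
    for x, label in zip(samples, subclass_ids):
        groups.setdefault(label, []).append(x)

    scatter = [[0.0] * dim for _ in range(dim)]
    for rows in groups.values():
        mu = [sum(col) / len(rows) for col in zip(*rows)]
        for x in rows:
            d = [xi - mi for xi, mi in zip(x, mu)]
            for i in range(dim):
                for j in range(dim):
                    scatter[i][j] += d[i] * d[j]
    return scatter


# Hypothetical 2-D data: one class split into two subclasses "a" and "b".
samples = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0), (0.0, 3.0)]
subclass_ids = ["a", "a", "b", "b"]
s_within = within_subclass_scatter(samples, subclass_ids)
```

Measuring scatter around subclass means rather than the class mean keeps a multi-modal class from inflating the within-class term, which is the motivation for splitting classes when the data are not Gaussian.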
Title: A new sub-class linear discriminant for miniature spectrometer based food analysis
Journal: Chemometrics and Intelligent Laboratory Systems, volume 250, Article 105136
Pub Date : 2024-05-01 DOI: 10.1016/j.chemolab.2024.105135
Mohammed Benaafi , Sani I. Abba , Mojeed Opeyemi Oyedeji , Auwalu Saleh Mubarak , Jamilu Usman , Isam H. Aljundi
Groundwater (GW) salinization of coastal aquifers has become a serious obstacle to sustainable water resource management in Saudi Arabia and other parts of the world. It is therefore crucial to assess the extent of this salinization in order to protect and manage water resources effectively. This research used GW samples collected in the field at several locations, supported by experimental analysis based on ion chromatography (IC) and inductively coupled plasma mass spectrometry (ICP-MS), to analyze several physical, chemical, and hydro-geochemical elements of the GW. In this study, GW salinization is modeled with machine learning algorithms such as support vector regression, Gaussian process regression (GPR), artificial neural networks, and a least-squares boosting regression tree ensemble. The performance of the standalone models was then improved with metaheuristic optimization-based algorithms, such as a fuzzy model hybridized with a genetic algorithm (ANFIS-GA) and with particle swarm optimization (ANFIS-PSO). The outcomes for three input-variable combinations were validated using several performance indicators and graphical methods. The quantitative analysis indicated that GPR-Combo1 (MAE = 0.006 mg/L), Ensm-Combo2 (MAE = 0.025 mg/L), and GPR-Combo3 (MAE = 0.078 mg/L) performed best among the standalone models, where Combo 1, 2, and 3 denote the combinations derived from feature selection. The cumulative probability function (CPF) showed that the heuristically optimized ANFIS-GA (MAE = 0.0025 mg/L, MAPE = 0.19183) and ANFIS-PSO (MAE = 0.0018 mg/L, MAPE = 0.0723) achieved lower errors than the standalone models and offer a reliable approach. Both the standalone models and the heuristic algorithms used for GW salinization modeling gave promising results in accurately predicting salinity. This approach could aid in effectively managing GW resources for sustainable development.
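For readers unfamiliar with the two error measures quoted above, the following sketch shows their standard definitions. The function names are illustrative, and MAPE is returned here as a fraction rather than a percentage; the abstract does not state which convention its MAPE values follow, so that choice is an assumption.

```python
def mean_absolute_error(y_true, y_pred):
    """MAE: average absolute difference, in the units of the target (e.g. mg/L)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_percentage_error(y_true, y_pred):
    """MAPE as a fraction (assumed convention); undefined if a true value is zero."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a true value is zero")
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)
```

MAE keeps the units of the measured quantity, while MAPE is scale-free, which is why the two can rank models differently when the measured values span a wide range.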
Experimental-based groundwater salinization from the carbonate aquifer of eastern Saudi Arabia: Insight into machine learning coupled with meta-heuristic algorithms. Chemometrics and Intelligent Laboratory Systems, vol. 249, Article 105135.
Pub Date : 2024-04-27 DOI: 10.1016/j.chemolab.2024.105134
Martina Beese , Tomass Andersons , Mathias Sawall , Cyril Ruckebusch , Adrián Gómez-Sánchez , Robert Francke , Adrian Prudlik , Robert Franke , Klaus Neymeyr
Multivariate curve resolution (MCR) methods are sometimes faced with missing or erroneous data, e.g., due to sensor saturation. In some cases, an estimation of the missing data is possible, but often MCR works with the largest submatrix without missing entries. This ignores all rows and columns of the data matrix that contain missing values. A successful approach to deal with incomplete data multisets has been proposed by Alier and Tauler (2013), but it does not include a factor ambiguity analysis. Here, the missing data problem is addressed in combination with a factor ambiguity analysis. An approach is presented that minimizes the factor ambiguity by extracting a maximum of spectral information even from incomplete rows and columns of the spectral data matrix. The method requires a high signal-to-noise ratio. Applications are presented for UV/Vis and HSI data.
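The baseline that the abstract contrasts itself with, discarding every row and column that contains a missing value, can be sketched as follows. This is a literal reading of that baseline, not the authors' method; the function names are hypothetical, and missing entries are assumed to be encoded as `None` or NaN.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing entries (an assumed encoding)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def complete_submatrix(D):
    """Drop every row and every column of D that contains a missing entry.

    Returns the remaining submatrix together with the kept row and column
    indices, so that discarded spectra or channels can be traced back.
    """
    n_cols = len(D[0])
    keep_rows = [i for i, row in enumerate(D)
                 if not any(is_missing(v) for v in row)]
    keep_cols = [j for j in range(n_cols)
                 if not any(is_missing(row[j]) for row in D)]
    sub = [[D[i][j] for j in keep_cols] for i in keep_rows]
    return sub, keep_rows, keep_cols
```

Even a single missing entry removes a whole row and a whole column here, which illustrates how much spectral information this baseline can throw away and why recovering information from incomplete rows and columns is attractive.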
On the factor ambiguity of MCR problems for blockwise incomplete data sets. Chemometrics and Intelligent Laboratory Systems, vol. 249, Article 105134. Open-access PDF: https://www.sciencedirect.com/science/article/pii/S0169743924000741/pdfft?md5=bb7d17fc695f88d0275f3839df0eb621&pid=1-s2.0-S0169743924000741-main.pdf
Dried oregano leaves are particularly prone to adulteration because of their widespread distribution and because they are easily mixed with leaves of other plants of lower commercial value, such as olive, myrtle, strawberry tree, or sumac. To reveal the presence of adulteration, this study adopted an untargeted analytical approach which, instead of involving the a priori selection of specific compounds of interest, focuses on defining the characteristic spectral signature of authentic oregano with respect to its most frequent adulterants. NIR HyperSpectral Imaging (NIR-HSI) is a state-of-the-art, rapid and non-destructive technique that collects both spectral and spatial information from the sample, making it particularly suitable for characterizing visually heterogeneous samples.
Authentication issues are typically assessed through class modelling techniques, and Soft Independent Modelling of Class Analogy (SIMCA) is one of the most used algorithms in this scenario. However, the high variability and heterogeneity within the authentic oregano class resulted in poor outcomes when SIMCA was applied. As an alternative, the Soft Partial Least Squares Discriminant Analysis (Soft PLS-DA) algorithm was applied to differentiate authentic oregano samples from pure adulterants. Soft PLS-DA is a hybrid approach that combines the advantages of both discriminant and class modelling techniques. The resulting classification model led to promising results, achieving a prediction efficiency of 92.9 %. Finally, based on the percentage of pixels predicted as oregano in the Soft PLS-DA prediction images, a threshold value of 10 % was established, serving as a detection limit of NIR-HSI to distinguish authentic oregano samples from adulterated ones.
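The pixel-percentage criterion can be sketched as below. The sketch assumes a prediction image stored as a nested list of per-pixel class labels; the function names, the label `"oregano"`, and the use of `>=` for the comparison are illustrative assumptions. The abstract does not say on which side of the 10 % threshold a sample counts as authentic, so the code only reports whether the threshold is reached and leaves that interpretation to the caller.

```python
def percent_predicted(mask, target="oregano"):
    """Percentage of pixels in a 2-D label mask predicted as `target`."""
    pixels = [label for row in mask for label in row]
    if not pixels:
        raise ValueError("mask contains no pixels")
    hits = sum(1 for label in pixels if label == target)
    return 100.0 * hits / len(pixels)


def reaches_threshold(mask, threshold_percent=10.0, target="oregano"):
    """True if the share of `target` pixels is at or above the threshold.

    The 10 % default mirrors the value quoted in the abstract; mapping the
    result to 'authentic' or 'adulterated' is deliberately not done here.
    """
    return percent_predicted(mask, target) >= threshold_percent
```

In practice background pixels would usually be masked out before counting, a step omitted from this sketch.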
Addressing adulteration challenges of dried oregano leaves by NIR HyperSpectral Imaging. Veronica Ferrari, Rosalba Calvini, Camilla Menozzi, Alessandro Ulrici, Marco Bragolusi, Roberto Piro, Alessandra Tata, Michele Suman, Giorgia Foca. Chemometrics and Intelligent Laboratory Systems, vol. 249, Article 105133. Pub Date: 2024-04-23. DOI: 10.1016/j.chemolab.2024.105133. Open-access PDF: https://www.sciencedirect.com/science/article/pii/S016974392400073X/pdfft?md5=9ca1205b6902ee41304da3031bdead5a&pid=1-s2.0-S016974392400073X-main.pdf