
Chemometrics and Intelligent Laboratory Systems: Latest Articles

A generation of synthetic samples and artificial outliers via principal component analysis and evaluation of predictive capability in binary classification models
IF 3.9 | Zone 2 (Chemistry) | Q1 Chemistry | Pub Date: 2024-06-01 | DOI: 10.1016/j.chemolab.2024.105154
Gabriely S. Folli , Márcia H.C. Nascimento , Betina P.O. Lovatti , Wanderson Romão , Paulo R. Filgueiras

Unbalanced sample groups tend to yield models biased toward the predominant classes. A sample group with balanced classes contributes to more robust models that classify all classes equally well. The literature describes two approaches to sample balancing: elimination (undersampling) and synthetic sample generation (oversampling). Undersampling discards real samples, while oversampling may add non-real signals to the original spectra. To overcome these challenges, this paper uses Principal Component Analysis (PCA) to generate virtual samples (synthetic samples and artificial outliers) that balance the data in multivariate classification models. The proposed methodology was applied to mid-infrared spectroscopy (MIR) and high-resolution mass spectrometry (HRMS) data with Partial Least Squares Discriminant Analysis (PLS-DA) and Support Vector Machine (SVM) models. The constructed models show that adding virtual samples improves performance parameters (e.g., false negative rate, false positive rate, accuracy, sensitivity, and specificity) compared to unbalanced models, while also mitigating overfitting (a problem found in unbalanced models). The percentage improvement in performance parameters was larger for the non-linear model (SVM) than for the linear model (PLS-DA). Furthermore, the created virtual spectra do not introduce new signals: original and virtual spectra exhibit a similar spectral profile, differing only in intensity levels. Finally, all models demonstrated good predictive capability according to permutation testing for the binary models developed in this work, in which the rate of class permutation retention was limited (between 40 % and 60 % of the y-vector remained in the original class). All created models exhibited accuracy values higher than the accuracy distribution of models with permuted classes for the test group.
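
The constrained permutation test mentioned in the abstract can be illustrated with a minimal sketch. The abstract only states that between 40 % and 60 % of the y-vector stays in its original class; the rejection-sampling strategy, the function names `constrained_permutation` and `permutation_p_value`, and the p-value formula below are assumptions made for illustration, not the authors' procedure.

```python
import random

def constrained_permutation(y, low=0.4, high=0.6, rng=None, max_tries=10_000):
    """Shuffle class labels until the share of unchanged positions is in [low, high].

    Hypothetical helper: rejection sampling is an assumed way to enforce the
    40-60 % retention window described in the abstract.
    """
    rng = rng or random.Random()
    y = list(y)
    for _ in range(max_tries):
        permuted = y[:]
        rng.shuffle(permuted)
        retained = sum(a == b for a, b in zip(y, permuted)) / len(y)
        if low <= retained <= high:
            return permuted, retained
    raise RuntimeError("no permutation met the retention bounds")

def permutation_p_value(observed_accuracy, permuted_accuracies):
    """Share of permuted-label models that match or beat the real model."""
    hits = sum(acc >= observed_accuracy for acc in permuted_accuracies)
    return (hits + 1) / (len(permuted_accuracies) + 1)
```

In use, each permuted label vector would be fed to the same modelling pipeline, and the test-set accuracy of the real model would be compared against the resulting accuracy distribution.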

Citations: 0
Solving the missing value problem in PCA by Orthogonalized-Alternating Least Squares (O-ALS)
IF 3.9 | Zone 2 (Chemistry) | Q1 Chemistry | Pub Date: 2024-05-31 | DOI: 10.1016/j.chemolab.2024.105153
Adrián Gómez-Sánchez , Raffaele Vitale , Cyril Ruckebusch , Anna de Juan

Missing data pose a challenge in Principal Component Analysis (PCA), since the most common algorithms are not designed to handle them. Several approaches have been proposed to solve the missing value problem in PCA. One is Imputation based on SVD (I-SVD), where missing entries are filled by imputation and updated in every iteration until the PCA model converges. Another is an adaptation of the Nonlinear Iterative Partial Least Squares (NIPALS) algorithm that skips the missing entries during the least-squares estimation of scores and loadings. However, limitations have been reported for both approaches. On the one hand, convergence of the I-SVD algorithm can be very slow for data sets with a high percentage of missing data. On the other hand, the orthogonality among scores and loadings might be lost when using NIPALS.

To solve these issues and perform PCA of data sets with missing values without imputation steps, a novel algorithm called Orthogonalized-Alternating Least Squares (O-ALS) is proposed. O-ALS is an alternating least-squares algorithm that estimates the scores and loadings subject to a Gram-Schmidt orthogonalization constraint. The estimation of scores and loadings is adapted to use only the available (non-missing) entries.
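
The two ingredients named above, least-squares estimation restricted to available entries and Gram-Schmidt orthogonalization, can be sketched in isolation. This is not the O-ALS algorithm itself: the helper names `score_from_available` and `gram_schmidt`, the use of `None` to mark missing values, and the single-component score formula are assumptions chosen to keep the example small.

```python
import math

def score_from_available(row, loading):
    """Least-squares score of one sample on one loading, skipping None entries.

    Minimizes sum_j (x_j - t * p_j)^2 over the observed j only, which gives
    t = sum(x_j * p_j) / sum(p_j^2). Missing values are encoded as None
    (an assumption for this sketch).
    """
    pairs = [(x, p) for x, p in zip(row, loading) if x is not None]
    denom = sum(p * p for _, p in pairs)
    if denom == 0.0:
        raise ValueError("no usable entries for this sample")
    return sum(x * p for x, p in pairs) / denom

def gram_schmidt(vectors):
    """Orthonormalize vectors in the given order (modified Gram-Schmidt)."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            proj = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - proj * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm < 1e-12:
            raise ValueError("vectors are linearly dependent")
        basis.append([wi / norm for wi in w])
    return basis
```

For example, a sample `[2.0, None, 4.0]` projected on the loading `[1.0, 5.0, 2.0]` uses only the first and third variables, so the large loading on the missing variable does not distort the score.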

In this study, the performance of O-ALS is tested and compared with NIPALS and I-SVD on simulated data sets and in a real case study. The results show that O-ALS is an accurate and fast algorithm for analyzing data with any percentage and distribution pattern of missing entries, and that it provides correct scores and loadings in cases where I-SVD and NIPALS do not perform satisfactorily.

Citations: 0
Virtual biopsies for breast cancer using MCR-ALS perfusion-based biomarkers and double cross-validation PLS-DA
IF 3.9 | Zone 2 (Chemistry) | Q1 Chemistry | Pub Date: 2024-05-28 | DOI: 10.1016/j.chemolab.2024.105152
E. Aguado-Sarrió , J.M. Prats-Montalbán , J. Camps-Herrero , A. Ferrer

Functional MRI is currently the most sensitive technique for detecting early tumors in breast cancer, and perfusion (DCE-MRI) has become the most important sequence to depict and characterize angiogenesis and neovascularization. In this work, we propose new biomarkers related to clear physiological phenomena, obtained from MCR-ALS, as an alternative to curve-based pseudo-biomarkers and pharmacokinetic models. To provide a discrimination and prediction model between healthy tissue and cancer, we propose using PLS-DA with double cross-validation (2CV) and variable selection. Repeating the procedure several times gave excellent average results for the performance indexes (F-score: 0.9149, MCC: 0.8538, AUROC: 0.8794). After selecting the optimal prediction model, a single probabilistic map called “virtual biopsy” is obtained, which shows in different colors the probability that each pixel of the image exhibits tumor behavior. This helps the specialist identify and characterize breast tumors with only one easy-to-interpret biomarker map.
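
Two of the performance indexes quoted above, the F-score and the Matthews correlation coefficient (MCC), follow directly from confusion-matrix counts. The sketch below uses the standard definitions; the function names and the averaging helper are illustrative assumptions, and the counts used in any call are made up rather than taken from the paper.

```python
import math

def f_score(tp, fp, fn):
    """F1 score: harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient, ranging from -1 to 1."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def mean_over_repeats(values):
    """Average one index over repeated cross-validation runs."""
    values = list(values)
    return sum(values) / len(values)
```

Averaging each index over the repeated double cross-validation runs, as `mean_over_repeats` does, is one plausible reading of "average results" in the abstract.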

Citations: 0
A long sequence NOx emission prediction model for rotary kilns based on transformer
IF 3.9 | Zone 2 (Chemistry) | Q1 Chemistry | Pub Date: 2024-05-21 | DOI: 10.1016/j.chemolab.2024.105151
Youlin Guo, Zhizhong Mao

Time-series prediction is of great practical value in industrial scenarios such as rotary kilns, especially long sequence time-series prediction. Accurate long sequence NOx emission predictions make it possible to monitor rotary kiln operations in advance and to plan and control NOx emissions according to emission policies and production requirements. However, in actual industrial scenarios, the NOx emission pattern is dominated by long-term trends rather than simply repetitive patterns, and existing NOx prediction models do not capture long-term dependencies effectively. This paper therefore proposes a novel Transformer-based model to solve this problem. First, we propose a novel series decomposition architecture based on LSTM and self-attention, which is embedded inside the Transformer. The architecture allows self-attention at the sub-series level and provides short-term trend and position information. In addition, the model uses a one-step inference structure to reduce both the error accumulation that affects traditional inference methods in long sequence prediction and the inference time. Extensive experiments on two real-world datasets with different sampling intervals validated the model’s effectiveness. It achieves relative improvements of 53.2% and 43.4% in prediction accuracy compared to popular NOx emission prediction methods.
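
The difference between traditional recursive inference and one-step inference can be shown with a toy example. The sketch below is not the proposed Transformer: `recursive_forecast`, `single_pass_forecast`, and the deliberately biased toy models are hypothetical stand-ins that only illustrate why feeding predictions back into the input lets errors accumulate over the horizon.

```python
def recursive_forecast(one_step_model, history, horizon):
    """Predict one value at a time and feed each prediction back as input."""
    window = list(history)
    predictions = []
    for _ in range(horizon):
        next_value = one_step_model(window)
        predictions.append(next_value)
        window = window[1:] + [next_value]  # the model now sees its own output
    return predictions

def single_pass_forecast(multi_output_model, history):
    """Produce the whole horizon in a single model call."""
    return list(multi_output_model(list(history)))

# Toy setting: the true series grows by 1 per step, and each model
# carries a constant bias of 0.1 per call.
def biased_step(window):
    return window[-1] + 1.1

def biased_multi(window):
    return [window[-1] + k + 0.1 for k in (1, 2, 3)]

history = [1.0, 2.0, 3.0]
truth = [4.0, 5.0, 6.0]
recursive_errors = [abs(p - t) for p, t in zip(recursive_forecast(biased_step, history, 3), truth)]
single_errors = [abs(p - t) for p, t in zip(single_pass_forecast(biased_multi, history), truth)]
```

In this toy case the recursive errors grow with each step because every prediction inherits the bias of the previous one, while the single-pass errors stay constant across the horizon.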

Citations: 0
Combining PLS-DA and SIMCA on NIR data for classifying raw materials for tyre industry: A hierarchical classification model
IF 3.9 | Zone 2 (Chemistry) | Q1 Chemistry | Pub Date: 2024-05-19 | DOI: 10.1016/j.chemolab.2024.105150
Riccardo Voccio , Cristina Malegori , Paolo Oliveri , Federica Branduani , Marco Arimondi , Andrea Bernardi , Giorgio Luciano , Mattia Cettolin

Tyre materials are complex products, as they are prepared using a number of raw materials, each of them with its specific chemical composition and functionality in the final product. It is, therefore, of crucial importance to avoid mislabeling errors and even to verify the compliance of raw materials entering the factory.

The present study proposes a strategy that makes use of near infrared (NIR) spectroscopy combined with chemometrics for raw material identification (RMID) and compliance verification of the most common raw materials used in the tyre industry. In particular, the chemometric model developed consists of a global hierarchical classification model, which combines nested PLS-DA nodes for RMID and SIMCA nodes for compliance verification, in a two-step approach.
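
The two-step structure described above, nested discriminant nodes for identification followed by a class-specific compliance check, can be sketched with plain callables. Everything concrete here is hypothetical: the tree encoding, the names `identify` and `classify_hierarchical`, and the placeholder labels such as `material_A` are assumptions, and the lambdas merely stand in for fitted PLS-DA and SIMCA models.

```python
def identify(spectrum, node):
    """Walk nested discriminant nodes until a leaf (a class name) is reached.

    A node is either a string (leaf) or a pair (classifier, children), where
    classifier(spectrum) returns the key of the child to follow.
    """
    while not isinstance(node, str):
        classifier, children = node
        node = children[classifier(spectrum)]
    return node

def classify_hierarchical(spectrum, tree, compliance_checks):
    """Step 1: identify the material. Step 2: verify compliance for that class."""
    material = identify(spectrum, tree)
    check = compliance_checks.get(material)
    compliant = bool(check(spectrum)) if check is not None else None
    return material, compliant

# Stand-in models operating on a two-value "spectrum" (illustration only).
tree = (
    lambda s: "group_1" if s[0] > 0.5 else "group_2",
    {
        "group_1": (
            lambda s: "material_A" if s[1] > 0.5 else "material_B",
            {"material_A": "material_A", "material_B": "material_B"},
        ),
        "group_2": "material_C",
    },
)
checks = {"material_A": lambda s: s[1] < 0.9, "material_C": lambda s: s[0] > 0.1}
```

A class without a registered compliance model returns `None` for the second value, so a missing check is never silently reported as a pass.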

The global model showed satisfactory results: 100 % of total correct predictions and a sensitivity higher than 90 % in the test set were obtained for most of the classes of interest.

The final goal is to apply the strategy directly to raw materials at their receiving stage in the factory. This has the double advantage of minimizing the risk of mislabeling and, at the same time, decreasing the number of suspicious samples that need to be analyzed in the laboratory by traditional methods to verify their compliance.

Citations: 0
An alternative for the robust assessment of the repeatability and reproducibility of analytical measurements using bivariate dispersion
IF 3.9 | Zone 2 (Chemistry) | Q1 Chemistry | Pub Date: 2024-05-18 | DOI: 10.1016/j.chemolab.2024.105148
Elfried Salanon , Blandine Comte , Delphine Centeno , Stéphanie Durand , Estelle Pujos-Guillot , Julien Boccard

Introduction

Assessing repeatability and reproducibility in analytical chemistry is commonly based on parametric dispersion indicators, such as relative standard deviation and standard deviation, calculated for each detected variable using repeated measurements of Quality Control (QC) samples collected throughout the data acquisition sequence. However, their reliability depends strongly on the assumption of a normal distribution. Because analytical variability depends on many sources, such parametric estimators are not always suitable. There is therefore a need for robust indicators of data quality that are independent of central values and of any parametric assumption.

Methods

Three specific indicators were developed: (i) intra-group dispersion, based on the median area of the convex hull of QC samples within an analytical batch; (ii) inter-group dispersion, defined as the gradient of the deviation between analytical batches; and (iii) dispersion index. Mathematical properties of these indicators, including positivity, stability, and translation invariance, were then evaluated using synthetic data under normal and non-normal distributions. Finally, the relevance of these indicators and the associated visualization methods were highlighted based on a metabolomics case study involving liquid chromatography coupled to mass spectrometry measurements of the NIST SRM1950 reference material analyzed over more than one year within different projects.
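
The first indicator relies on the area of a convex hull in two dimensions, which can be computed with the standard library alone. The sketch below uses Andrew's monotone chain and the shoelace formula; taking the median over several groups of QC points in `median_hull_area` is an assumed reading of "median area", and all function names are illustrative.

```python
from statistics import median

def convex_hull(points):
    """Convex hull of (x, y) tuples via Andrew's monotone chain, counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_area(vertices):
    """Shoelace formula; returns 0.0 for fewer than three vertices."""
    n = len(vertices)
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0

def hull_area(points):
    return polygon_area(convex_hull(points))

def median_hull_area(groups):
    """Median hull area over several groups of QC points (assumed aggregation)."""
    return median(hull_area(group) for group in groups)
```

Because the hull area depends only on differences between coordinates, shifting every point by the same offset leaves it unchanged, which matches the translation invariance evaluated in the study.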

Results

The proposed indicators were shown to be translation invariant and always positive, while initial investigations on synthetic data revealed high stability under multiplication. Their application to experimental data revealed specific behaviors depending on the characteristics of the signal associated with the different detected analytes, showing their ability to capture the variability observed in either parametric or non-parametric conditions. This investigation also showed different structures of sensitivity to analytical variability throughout the data processing steps. In addition, the proposed indicators allowed the analytical drift to be visualized in two dimensions, which facilitates the interpretation of results.

Conclusion

These indicators open the way not only to a better and more robust assessment of repeatability and reproducibility, but also to improved long-term data comparability involving suitability testing.

Citations: 0
Improving golden jackel optimization algorithm: An application of chemical data classification
IF 3.9 | Zone 2 (Chemistry) | Q1 Chemistry | Pub Date: 2024-05-17 | DOI: 10.1016/j.chemolab.2024.105149
Aiedh Mrisi Alharthi , Dler Hussein Kadir , Abdo Mohammed Al-Fakih , Zakariya Yahya Algamal , Niam Abdulmunim Al-Thanoon , Maimoonah Khalid Qasim

High dimensionality is one of the main issues affecting the effectiveness of quantitative structure-activity relationship (QSAR) classification techniques in chemometrics. Feature selection is a critical procedure that identifies the most relevant and important features of a dataset. By reducing the number of features, it increases classification accuracy, reduces the computational burden, and improves overall performance. The recently introduced golden jackal optimization (GJO) algorithm has been used successfully to solve various continuous optimization problems. This study therefore proposes an improved GJO algorithm employing chaotic maps, abbreviated as CGJO, to enhance the exploration and exploitation capability of GJO in selecting the essential descriptors in QSAR classification models with high classification accuracy and less computation time. Experimental findings on four different high-dimensional chemical datasets show that the proposed CGJO algorithm can maximize classification accuracy while simultaneously decreasing the number of chosen descriptors and the computation time. Thus, the proposed algorithm can be useful for chemical data classification in other QSAR modeling.
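
A chaotic map is a deterministic recurrence whose output can replace uniform random numbers inside a metaheuristic. The abstract does not say which maps CGJO employs, so the logistic map below is an assumed example, and `chaotic_mask`, which thresholds the sequence into a descriptor-selection mask, is a hypothetical illustration rather than part of the published algorithm.

```python
def logistic_map(x0, length, r=4.0):
    """Iterate x <- r * x * (1 - x); with r = 4 the values stay within [0, 1]."""
    values = []
    x = x0
    for _ in range(length):
        x = r * x * (1.0 - x)
        values.append(x)
    return values

def chaotic_mask(n_descriptors, x0=0.7, threshold=0.5):
    """Turn a chaotic sequence into a True/False selection mask over descriptors.

    The seed x0 should avoid 0, 0.25, 0.5, 0.75 and 1, which collapse the
    logistic map onto a fixed point.
    """
    return [value > threshold for value in logistic_map(x0, n_descriptors)]
```

Unlike a pseudo-random generator, the same seed always reproduces the same sequence, and nearby seeds diverge quickly, which is the property such variants typically exploit to diversify the search.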

Citations: 0
Identifying 124 new anti-HIV drug candidates in a 37 billion-compound database: An integrated approach of machine learning (QSAR), molecular docking, and molecular dynamics simulation
IF 3.9 Zone 2 (Chemistry) Q1 Chemistry Pub Date: 2024-05-15 DOI: 10.1016/j.chemolab.2024.105145
Alexandre de Fátima Cobre , Anderson Ara , Alexessander Couto Alves , Moisés Maia Neto , Mariana Millan Fachi , Laize Sílvia dos Anjos Botas Beca , Fernanda Stumpf Tonin , Roberto Pontarolo

Recent data from the World Health Organization reveal that in 2023, 38.8 million people were living with HIV. Within this population, there were 1.5 million new cases and 650 thousand deaths attributed to the disease. This study employs an integrated approach involving QSAR-based machine learning models, molecular docking, and molecular dynamics simulations to identify potential compounds for inhibiting the bioactivity of the CC chemokine receptor type 5 (CCR5) protein, a key entry point for HIV. Using non-redundant experimental data from the ChEMBL database, 40 different machine learning algorithms were trained, and the top four models (XGBoost, Histogram-based Gradient Boosting, Light Gradient Boosted Machine, and Extra Trees Regression) were used to predict anti-HIV bioactivity for 37 billion compounds in the ZINC-22 database. The screening identified 124 new anti-HIV drug candidates, confirmed through molecular docking and dynamics simulations. The study underscores the therapeutic potential of these compounds, paving the way for further in vitro and in vivo investigations. The convergence of machine learning and experimental findings presents a promising avenue for significant advancements in pharmaceutical research, particularly in the treatment of viral diseases such as HIV. To support the reproducibility of the study, the authors have made the Python code (Google Colab) and the associated database available on GitHub: https://github.com/AlexandreCOBRE/code.

Citations: 0
MA-XRF datasets analysis based on convolutional neural network: A case study on religious panel paintings
IF 3.9 Zone 2 (Chemistry) Q1 Chemistry Pub Date: 2024-05-10 DOI: 10.1016/j.chemolab.2024.105138
Theofanis Gerodimos , Ioannis Georvasilis , Anastasios Asvestas , Georgios P. Mastrotheodoros , Aristidis Likas , Dimitrios F. Anagnostopoulos

Macroscopic X-ray fluorescence (MA-XRF) datasets are analyzed using Artificial Neural Networks. Specifically, Convolutional Neural Networks (CNNs) are trained by coupling the spectra acquired during the MA-XRF scan of two religious panel paintings (“icons”) with the associated Ground-Truth (GT) counts per characteristic transition line, as extracted by X-ray fluorescence fundamental parameters analysis. In total, twenty thousand XRF spectra were used for the CNN training. The trained neural networks were applied to analyze millions of MA-XRF spectra acquired during the scan of religious panel paintings, computing the counts per pixel of X-ray characteristic transition lines and creating the elemental transition maps. Comparison of the CNN-extracted results with the GT shows remarkable agreement. The successful analysis of MA-XRF datasets with the CNN method paves the way toward automatic identification of spectral lines, giving the inexperienced XRF analyst the means to provide a state-of-the-art analysis and helping the experienced user avoid overlooking poorly resolved transition lines.

Citations: 0
Automatic non-destructive estimation of polyphenol oxidase and peroxidase enzyme activity levels in three bell pepper varieties by Vis/NIR spectroscopy imaging data based on machine learning methods
IF 3.9 Zone 2 (Chemistry) Q1 Chemistry Pub Date: 2024-05-09 DOI: 10.1016/j.chemolab.2024.105137
Meysam Latifi Amoghin , Yousef Abbaspour-Gilandeh , Mohammad Tahmasebi , Juan Ignacio Arribas

Browning of food products often occurs upon cutting and damage during processing, transport, and storage, among other potential causes. Enzymatic browning is mainly due to the polyphenol oxidase (PPO) and peroxidase (POD) enzymes. In this study, visible/near-infrared (Vis/NIR) imaging spectroscopy in the range of 350–1150 nm was used for automatic, non-destructive evaluation of PPO and POD activity levels in three bell pepper varieties (red, yellow, orange), with 30 input samples per variety (N = 30). The spectral data were first modeled by partial least squares regression (PLSR) over the whole spectral range, without selecting any subset of the most effective wavelength (EW) values. On the validation set, the coefficient of determination (R²) for the estimation of POD activity levels was 0.794, 0.772, and 0.726 for red, yellow, and orange bell peppers, respectively; for PPO activity levels, R² was 0.901, 0.810, and 0.859, respectively. In addition, hybrid machine learning (ML) techniques combining a support vector machine (SVM) with genetic algorithms (GA), particle swarm optimization (PSO), ant colony optimization (ACO), or imperialistic competitive algorithms (ICA) were used to select the optimal (discriminant) EW values, and regression performance improved consistently, judging by the higher R² values. Either 14 or 15 EWs were computed and selected in order of their discriminative power using these ML techniques. The hybrid SVM-PSO method proved the best at selecting the most effective wavelength values (nm). Three regression methods, PLSR, multiple linear regression (MLR), and neural network (NN), were then employed to model the SVM-PSO selected EWs. Over the test set, the non-linear NN regression method gave better ratio of performance to deviation (RPD), R², and root mean square error (RMSE) values than the other two methods, closely followed by PLSR, and NN regression was therefore selected as the best approach for modeling the most effective spectral wavelength values in this study.
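The three figures of merit named in the abstract can be computed from reference and predicted activity values as in the sketch below. It assumes the usual definitions: RMSE as the root of the mean squared residual, R² as one minus the ratio of residual to total sum of squares, and RPD as the sample standard deviation of the reference values divided by the RMSE. The paper may use a slightly different RPD convention, and the function name `regression_metrics` is illustrative.

```python
import math
import statistics
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float],
                       y_pred: Sequence[float]) -> Dict[str, float]:
    """Compute RMSE, R^2 and RPD for paired reference and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    n = len(y_true)
    residual_ss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mean_true = statistics.fmean(y_true)
    total_ss = sum((t - mean_true) ** 2 for t in y_true)
    rmse = math.sqrt(residual_ss / n)
    r2 = 1.0 - residual_ss / total_ss
    # RPD: spread of the reference values relative to the prediction error
    # (assumed definition; it is undefined when the predictions are perfect).
    rpd = statistics.stdev(y_true) / rmse
    return {"rmse": rmse, "r2": r2, "rpd": rpd}


if __name__ == "__main__":
    # Made-up numbers for illustration, not data from the study.
    print(regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 1.5, 3.5, 3.5]))
```

A higher RPD means the prediction error is small relative to the natural spread of the measured values, which is why it is reported alongside R² and RMSE when comparing the PLSR, MLR, and NN models.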

Citations: 0