A generation of synthetic samples and artificial outliers via principal component analysis and evaluation of predictive capability in binary classification models

IF 3.7 2区化学 Q2 AUTOMATION & CONTROL SYSTEMS Chemometrics and Intelligent Laboratory Systems Pub Date : 2024-06-01 DOI:10.1016/j.chemolab.2024.105154

Gabriely S. Folli , Márcia H.C. Nascimento , Betina P.O. Lovatti , Wanderson Romão , Paulo R. Filgueiras

{"title":"A generation of synthetic samples and artificial outliers via principal component analysis and evaluation of predictive capability in binary classification models","authors":"Gabriely S. Folli , Márcia H.C. Nascimento , Betina P.O. Lovatti , Wanderson Romão , Paulo R. Filgueiras","doi":"10.1016/j.chemolab.2024.105154","DOIUrl":null,"url":null,"abstract":"<div><p>Unbalanced sample groups tend to yield models with a higher prevalence of predominant classes. A sample group with balanced classes contributes to the development of more robust models with improved predictive capability to classify classes equally. In the literature, two methodologies for sample balancing can be found: elimination (undersampling) and synthetic sample generation (oversampling). Undersampling methodologies result in the loss of real samples, while oversampling methods may introduce issues related to adding non-real signals to the original spectra. To overcome these challenges, this paper aimed to utilize Principal Component Analysis (PCA) for the generation of virtual samples (synthetic samples and artificial outliers) to balance data in multivariate classification models. The proposed methodology was applied to data from mid-infrared spectroscopy (MIR) and high-resolution mass spectrometry (HRMS) with Partial Least Squares Discriminant Analysis (PLS-DA) and Support Vector Machine (SVM) models. The constructed models demonstrate that the addition of virtual samples enhances performance parameters (e.g., false negative rate, false positive rate, accuracy, sensitivity, specificity, among others) compared to unbalanced models, while also mitigating overfitting (a problem found in unbalanced models). Performance parameters exhibited a more significant improvement percentage using the non-linear model (SVM) compared to the linear model (PLS-DA). Furthermore, the created virtual spectra do not introduce new signals, i.e., original, and virtual spectra exhibit a similar spectral profile, differing only in the intensity levels. Finally, all models demonstrated good predictive capability according to permutation testing for the binary model developed in this work, limiting the rate of class permutation retention (between 40 % and 60 % of the y-vector remained in the original class). All created models exhibited accuracy values higher than the accuracy distribution of models with permuted classes for the test group.</p></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"251 ","pages":"Article 105154"},"PeriodicalIF":3.7000,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Chemometrics and Intelligent Laboratory Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0169743924000947","RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}

引用次数: 0

Abstract

Unbalanced sample groups tend to yield models with a higher prevalence of predominant classes. A sample group with balanced classes contributes to the development of more robust models with improved predictive capability to classify classes equally. In the literature, two methodologies for sample balancing can be found: elimination (undersampling) and synthetic sample generation (oversampling). Undersampling methodologies result in the loss of real samples, while oversampling methods may introduce issues related to adding non-real signals to the original spectra. To overcome these challenges, this paper aimed to utilize Principal Component Analysis (PCA) for the generation of virtual samples (synthetic samples and artificial outliers) to balance data in multivariate classification models. The proposed methodology was applied to data from mid-infrared spectroscopy (MIR) and high-resolution mass spectrometry (HRMS) with Partial Least Squares Discriminant Analysis (PLS-DA) and Support Vector Machine (SVM) models. The constructed models demonstrate that the addition of virtual samples enhances performance parameters (e.g., false negative rate, false positive rate, accuracy, sensitivity, specificity, among others) compared to unbalanced models, while also mitigating overfitting (a problem found in unbalanced models). Performance parameters exhibited a more significant improvement percentage using the non-linear model (SVM) compared to the linear model (PLS-DA). Furthermore, the created virtual spectra do not introduce new signals, i.e., original, and virtual spectra exhibit a similar spectral profile, differing only in the intensity levels. Finally, all models demonstrated good predictive capability according to permutation testing for the binary model developed in this work, limiting the rate of class permutation retention (between 40 % and 60 % of the y-vector remained in the original class). All created models exhibited accuracy values higher than the accuracy distribution of models with permuted classes for the test group.

Abstract Image

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

通过主成分分析生成合成样本和人工离群值并评估二元分类模型的预测能力

不平衡的样本组往往会产生主要类别较多的模型。具有均衡类别的样本组有助于开发更稳健的模型，提高预测能力，对类别进行平等分类。在文献中，可以找到两种平衡样本的方法：剔除（欠采样）和合成样本生成（过采样）。欠采样方法会导致真实样本的丢失，而超采样方法可能会带来在原始光谱中添加非真实信号的相关问题。为了克服这些挑战，本文旨在利用主成分分析（PCA）生成虚拟样本（合成样本和人工离群值），以平衡多元分类模型中的数据。本文将所提出的方法应用于中红外光谱（MIR）和高分辨质谱（HRMS）数据，并使用了偏最小二乘法判别分析（PLS-DA）和支持向量机（SVM）模型。所构建的模型表明，与不平衡模型相比，添加虚拟样本可提高性能参数（如假阴性率、假阳性率、准确性、灵敏度、特异性等），同时还可减轻过度拟合（不平衡模型中存在的问题）。与线性模型（PLS-DA）相比，使用非线性模型（SVM）的性能参数提高幅度更大。此外，创建的虚拟光谱不会引入新的信号，即原始光谱和虚拟光谱显示出相似的光谱轮廓，仅在强度级别上有所不同。最后，根据对本研究中开发的二元模型的置换测试，所有模型都表现出良好的预测能力，限制了类别置换保留率（40% 至 60% 的 y 向量保留在原始类别中）。所有创建的模型的准确度值都高于测试组中带有置换类别的模型的准确度分布。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Chemometrics and Intelligent Laboratory Systems 工程技术-分析化学

CiteScore

7.50

自引率

7.70%

发文量

169

审稿时长

3.4 months

期刊介绍： Chemometrics and Intelligent Laboratory Systems publishes original research papers, short communications, reviews, tutorials and Original Software Publications reporting on development of novel statistical, mathematical, or computer techniques in Chemistry and related disciplines. Chemometrics is the chemical discipline that uses mathematical and statistical methods to design or select optimal procedures and experiments, and to provide maximum chemical information by analysing chemical data. The journal deals with the following topics: 1) Development of new statistical, mathematical and chemometrical methods for Chemistry and related fields (Environmental Chemistry, Biochemistry, Toxicology, System Biology, -Omics, etc.) 2) Novel applications of chemometrics to all branches of Chemistry and related fields (typical domains of interest are: process data analysis, experimental design, data mining, signal processing, supervised modelling, decision making, robust statistics, mixture analysis, multivariate calibration etc.) Routine applications of established chemometrical techniques will not be considered. 3) Development of new software that provides novel tools or truly advances the use of chemometrical methods. 4) Well characterized data sets to test performance for the new methods and software. The journal complies with International Committee of Medical Journal Editors'' Uniform requirements for manuscripts.