A generation of synthetic samples and artificial outliers via principal component analysis and evaluation of predictive capability in binary classification models

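IF 3.7 | Zone 2 (Chemistry) | JCR Q2 (Automation & Control Systems) | Chemometrics and Intelligent Laboratory Systems | Pub Date: 2024-06-01 | DOI: 10.1016/j.chemolab.2024.105154
Full text: https://www.sciencedirect.com/science/article/pii/S0169743924000947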
Gabriely S. Folli, Márcia H.C. Nascimento, Betina P.O. Lovatti, Wanderson Romão, Paulo R. Filgueiras

Abstract

Unbalanced sample groups tend to yield models biased toward the predominant classes. Balanced classes contribute to more robust models that classify every class with comparable predictive capability. The literature describes two approaches to sample balancing: elimination (undersampling) and synthetic sample generation (oversampling). Undersampling discards real samples, while oversampling may add non-real signals to the original spectra. To overcome these challenges, this paper uses Principal Component Analysis (PCA) to generate virtual samples (synthetic samples and artificial outliers) that balance the data in multivariate classification models. The proposed methodology was applied to data from mid-infrared spectroscopy (MIR) and high-resolution mass spectrometry (HRMS) with Partial Least Squares Discriminant Analysis (PLS-DA) and Support Vector Machine (SVM) models. The constructed models demonstrate that adding virtual samples improves performance parameters (e.g., false negative rate, false positive rate, accuracy, sensitivity, specificity) compared to unbalanced models, while also mitigating overfitting (a problem found in unbalanced models). The percentage improvement in performance parameters was larger for the non-linear model (SVM) than for the linear model (PLS-DA). Furthermore, the created virtual spectra do not introduce new signals, i.e., original and virtual spectra exhibit a similar spectral profile, differing only in intensity levels. Finally, all models demonstrated good predictive capability under permutation testing for the binary models developed in this work, in which the class-retention rate was limited so that between 40 % and 60 % of the y-vector remained in the original class. For the test group, every model achieved accuracy above the accuracy distribution of models built with permuted classes.
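
The abstract does not spell out the generation algorithm, so the following is only a minimal sketch of one plausible reading, not the authors' procedure. It assumes that virtual samples are obtained by perturbing the PCA scores of real minority-class spectra and mapping them back through the loadings. The function name `pca_virtual_samples` and the parameters `n_components`, `scale`, and `seed` are hypothetical. Because every virtual spectrum is the class mean plus a combination of the retained loadings, it stays within the spectral profile of the real data and differs mainly in intensity. A small `scale` gives synthetic samples close to the real ones; a larger `scale` is one assumed way to push samples toward the edge of the score space as artificial outliers.

```python
import numpy as np


def pca_virtual_samples(X, n_new, n_components=2, scale=0.3, seed=0):
    """Generate n_new virtual samples from the rows of X (samples x variables).

    Hypothetical helper (an assumption, not the paper's algorithm): the PCA
    scores of randomly chosen real samples are jittered with Gaussian noise
    and projected back to the original variable space.
    """
    rng = np.random.default_rng(seed)
    mean = X.mean(axis=0)
    Xc = X - mean  # mean-center before PCA

    # PCA via SVD: rows of Vt are loadings, U * S are the scores.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components]                      # (k, variables)
    scores = U[:, :n_components] * S[:n_components]   # (samples, k)
    score_std = scores.std(axis=0, ddof=1)            # spread per component

    # Pick real samples at random and perturb their scores.
    idx = rng.integers(0, X.shape[0], size=n_new)
    noise = rng.normal(size=(n_new, n_components)) * score_std * scale
    new_scores = scores[idx] + noise

    # Reconstruct spectra from the perturbed scores.
    return new_scores @ loadings + mean


if __name__ == "__main__":
    # Toy stand-in for a minority class: 10 "spectra" with 50 variables.
    X_minority = np.random.default_rng(1).normal(size=(10, 50))
    X_virtual = pca_virtual_samples(X_minority, n_new=30, n_components=3)
    X_balanced = np.vstack([X_minority, X_virtual])
    print(X_balanced.shape)
```

In this sketch the balanced matrix would then be passed to a classifier such as PLS-DA or SVM. The number of components and the noise scale are tuning choices that the abstract does not specify.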

Source journal: Chemometrics and Intelligent Laboratory Systems
CiteScore: 7.50
Self-citation rate: 7.70%
Articles published: 169
Review time: 3.4 months
Journal description: Chemometrics and Intelligent Laboratory Systems publishes original research papers, short communications, reviews, tutorials and Original Software Publications reporting on development of novel statistical, mathematical, or computer techniques in Chemistry and related disciplines. Chemometrics is the chemical discipline that uses mathematical and statistical methods to design or select optimal procedures and experiments, and to provide maximum chemical information by analysing chemical data. The journal deals with the following topics:
1) Development of new statistical, mathematical and chemometrical methods for Chemistry and related fields (Environmental Chemistry, Biochemistry, Toxicology, System Biology, -Omics, etc.)
2) Novel applications of chemometrics to all branches of Chemistry and related fields (typical domains of interest are: process data analysis, experimental design, data mining, signal processing, supervised modelling, decision making, robust statistics, mixture analysis, multivariate calibration etc.) Routine applications of established chemometrical techniques will not be considered.
3) Development of new software that provides novel tools or truly advances the use of chemometrical methods.
4) Well characterized data sets to test performance for the new methods and software.
The journal complies with the International Committee of Medical Journal Editors' Uniform Requirements for Manuscripts.