Collinear datasets augmentation using Procrustes validation sets

Analytica Chimica Acta (IF 6.0; JCR Q1, Chemistry, Analytical; Region 2, Chemistry) Pub Date: 2025-05-15 Epub Date: 2025-03-15 DOI: 10.1016/j.aca.2025.343913
Sergey Kucheryavskiy, Sergei Zhilin
Analytica Chimica Acta, vol. 1351, Article 343913. Journal article. Source: https://www.sciencedirect.com/science/article/pii/S0003267025003071

Abstract

Background:

High-complexity models, such as artificial neural networks (ANN), require large training datasets to avoid overfitting and reproducibility issues. However, experimental datasets, especially those involving spectroscopic or other highly collinear data, are often limited in size by practical constraints. Currently available data augmentation methods either do not handle collinearity well or require resource-intensive training. Thus, there is a pressing need for an efficient, scalable method for augmenting collinear datasets to enhance model performance in both regression and classification tasks.

Results:

We propose a novel, efficient data augmentation method tailored for datasets with moderate to high collinearity, particularly spectroscopic data. The method combines latent variable modeling with cross-validation resampling to generate new data points. The approach has been validated using various datasets; here we report detailed results for two case studies: fat content prediction in minced meat and discrimination of olives based on near-infrared spectra. In both cases, artificial neural networks were employed, and augmentation significantly improved model performance in both prediction and classification. Specifically, for fat content prediction, the method reduced the root mean squared error by up to 3-fold on the independent test set.
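The general idea of combining latent variable modeling with cross-validation resampling can be illustrated with a minimal sketch. This is not the authors' published algorithm, only a simplified illustration under stated assumptions: the latent variable model is a PCA computed via SVD, the rows are split into segments systematically, and each local model is aligned to the global one with an orthogonal Procrustes rotation. The function name `pseudo_validation_set` and its parameters are hypothetical.

```python
import numpy as np

def pseudo_validation_set(X, n_components=2, n_segments=4):
    """Return one new row per original row (simplified illustration only)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Global latent variable model: PCA loadings from SVD of the centered data.
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    P = Vt[:n_components].T                      # shape (variables, components)

    X_new = np.empty_like(X)
    for k in range(n_segments):
        held = np.arange(k, n, n_segments)       # rows left out in this segment
        kept = np.setdiff1d(np.arange(n), held)  # rows used for the local model

        # Local model built without the held-out rows.
        local_mean = X[kept].mean(axis=0)
        _, _, Vt_k = np.linalg.svd(X[kept] - local_mean, full_matrices=False)
        P_k = Vt_k[:n_components].T

        # Orthogonal Procrustes: rotation R minimising ||P_k R - P||_F.
        U, _, Wt = np.linalg.svd(P_k.T @ P)
        R = U @ Wt

        # Project held-out rows on the local model, then express them
        # in the global latent space and add back the local residuals.
        Xh = X[held] - local_mean
        T_k = Xh @ P_k                           # local scores
        E_k = Xh - T_k @ P_k.T                   # residuals w.r.t. the local model
        X_new[held] = mean + T_k @ R @ P.T + E_k
    return X_new

# Usage on synthetic collinear data (2 latent factors, 50 correlated variables).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 50)) + 0.01 * rng.normal(size=(20, 50))
X_aug = np.vstack([X, pseudo_validation_set(X)])  # original plus generated rows
print(X_aug.shape)
```

Because each generated row is rebuilt from a local model that never saw it, the new rows differ slightly from the originals while keeping the same low-rank covariance structure. Repeating the procedure with different splits or numbers of components would yield further sets.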

Significance:

The proposed method provides a fast and simple solution for augmenting collinear datasets, significantly improving model performance without requiring extensive parameter tuning. It is versatile and can be applied to a range of datasets, offering a practical alternative to more complex augmentation techniques.

Source journal
Analytica Chimica Acta (Chemistry: Analytical Chemistry)
CiteScore: 10.40
Self-citation rate: 6.50%
Articles published: 1081
Review time: 38 days
About the journal: Analytica Chimica Acta has an open access mirror journal, Analytica Chimica Acta: X, sharing the same aims and scope, editorial team, submission system and rigorous peer review. Analytica Chimica Acta provides a forum for the rapid publication of original research, and critical, comprehensive reviews dealing with all aspects of fundamental and applied modern analytical chemistry. The journal welcomes the submission of research papers which report studies concerning the development of new and significant analytical methodologies. In determining the suitability of submitted articles for publication, particular scrutiny will be placed on the degree of novelty and impact of the research and the extent to which it adds to the existing body of knowledge in analytical chemistry.