Enhancing quantitative 1H NMR model generalizability on honey from different years through partial least squares subspace and optimal transport based unsupervised domain adaptation
Peng Shan , Hongming Xiao , Xiang Li , Ruige Yang , Lin Zhang , Yuliang Zhao
Chemometrics and Intelligent Laboratory Systems, vol. 254, Article 105221. Published 2024-08-28. Impact factor 3.7; JCR Q2 (Automation & Control Systems).
DOI: 10.1016/j.chemolab.2024.105221
https://www.sciencedirect.com/science/article/pii/S0169743924001618
Abstract
Honey is a nourishing, natural food product favored by a wide range of consumers. Proton nuclear magnetic resonance (1H NMR) is a powerful tool for the quantitative analysis of honey and plays a crucial role in ensuring its quality. Quantifying key compounds in honey from 1H NMR spectra requires multivariate calibration models. However, keeping measurement conditions consistent across different years is rarely possible, so the distributions of the training and test spectra can differ substantially, which degrades the performance of predictive models. Unsupervised domain adaptation (UDA) methods have gained considerable attention for their ability to reduce distribution differences between labeled source spectra and unlabeled target spectra without costly annotation. To improve the generalizability of quantitative models to honey from different years, we propose a UDA method called partial least squares subspace and optimal transport-based UDA (PLSS-OT-UDA). The approach removes distribution differences between the source and target subspaces by combining partial least squares (PLS) dimensionality reduction with optimal transport (OT). First, PLS extracts the optimal latent variable weight matrix from the source domain (i.e., labeled 1H NMR data from 2017). Next, both the source domain and the target domain (i.e., unlabeled 1H NMR data from 2018) are projected with the source weight matrix to obtain their lower-dimensional subspaces. Finally, OT aligns the source and target distributions within the subspace. Experimental results on the honey dataset show that PLSS-OT-UDA outperforms traditional methods, including transfer component analysis (TCA), optimal transport for domain adaptation (OTDA), domain adaptation based on principal component analysis and optimal transport (PCA-OTDA), and subspace alignment (SA), in generalization performance on three components: Baumé degree, sugar content, and water content.
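The three steps described in the abstract (PLS weights from the source, projection of both domains, OT alignment in the subspace) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the abstract does not specify the OT solver, the transport direction, or the final regressor, so the entropic Sinkhorn solver, the barycentric mapping of source scores onto the target, the least-squares regressor, the function names `pls_rotations` and `sinkhorn_plan`, and the synthetic data standing in for the 2017 and 2018 spectra are all assumptions.

```python
import numpy as np

def pls_rotations(X, y, n_components):
    """NIPALS PLS1 on centered data; returns R such that scores T = Xc @ R."""
    Xk = X - X.mean(axis=0)
    yk = y - y.mean()
    W, P = [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)          # weight vector
        t = Xk @ w                         # score vector
        tt = t @ t
        p = Xk.T @ t / tt                  # X loading
        q = yk @ t / tt                    # y loading
        Xk = Xk - np.outer(t, p)           # deflate X
        yk = yk - q * t                    # deflate y
        W.append(w)
        P.append(p)
    W, P = np.array(W).T, np.array(P).T
    return W @ np.linalg.inv(P.T @ W)

def sinkhorn_plan(A, B, reg=0.05, n_iter=500):
    """Entropic OT plan between uniform empirical distributions on A and B."""
    a = np.full(len(A), 1.0 / len(A))
    b = np.full(len(B), 1.0 / len(B))
    C = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-(C / C.max()) / reg)       # cost scaled to [0, 1] for stability
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

# Synthetic stand-in data (assumption): the target has a gain and offset shift.
rng = np.random.default_rng(0)
n_s, n_t, n_feat, k = 60, 50, 40, 3
loadings = rng.normal(size=(k, n_feat))
coef = np.array([1.0, -0.5, 0.3])
latent_s = rng.normal(size=(n_s, k))
latent_t = rng.normal(size=(n_t, k))
X_s = latent_s @ loadings + 0.01 * rng.normal(size=(n_s, n_feat))
y_s = latent_s @ coef                      # source labels
X_t = 1.1 * (latent_t @ loadings) + 0.5 + 0.01 * rng.normal(size=(n_t, n_feat))
y_t = latent_t @ coef                      # held out, used only for evaluation

# Step 1: PLS weight matrix from the labeled source domain.
mu = X_s.mean(axis=0)
R = pls_rotations(X_s, y_s, k)

# Step 2: project both domains with the source weights (source centering assumed).
T_s = (X_s - mu) @ R
T_t = (X_t - mu) @ R

# Step 3: align source scores to the target via the OT barycentric mapping.
G = sinkhorn_plan(T_s, T_t)
T_s_mapped = (G @ T_t) / G.sum(axis=1, keepdims=True)

# Fit a linear model on the transported source scores, predict the target.
beta, *_ = np.linalg.lstsq(np.column_stack([T_s_mapped, np.ones(n_s)]), y_s, rcond=None)
y_pred = np.column_stack([T_t, np.ones(n_t)]) @ beta
rmse = float(np.sqrt(np.mean((y_pred - y_t) ** 2)))
print(f"target RMSE: {rmse:.3f}")
```

The regularization strength `reg` and the number of latent variables `k` are free parameters in this sketch; in practice both would need to be tuned, for example by cross-validation on the labeled source data.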
About the journal:
Chemometrics and Intelligent Laboratory Systems publishes original research papers, short communications, reviews, tutorials and Original Software Publications reporting on development of novel statistical, mathematical, or computer techniques in Chemistry and related disciplines.
Chemometrics is the chemical discipline that uses mathematical and statistical methods to design or select optimal procedures and experiments, and to provide maximum chemical information by analysing chemical data.
The journal deals with the following topics:
1) Development of new statistical, mathematical and chemometrical methods for Chemistry and related fields (Environmental Chemistry, Biochemistry, Toxicology, Systems Biology, -Omics, etc.)
2) Novel applications of chemometrics to all branches of Chemistry and related fields (typical domains of interest are: process data analysis, experimental design, data mining, signal processing, supervised modelling, decision making, robust statistics, mixture analysis, multivariate calibration, etc.). Routine applications of established chemometrical techniques will not be considered.
3) Development of new software that provides novel tools or truly advances the use of chemometrical methods.
4) Well-characterized data sets for testing the performance of new methods and software.
The journal complies with the International Committee of Medical Journal Editors' Uniform Requirements for Manuscripts.