A selective genetic algorithm - PLS-DA approach based on untargeted LC-HRMS: Application to complex biomass samples
Ian Ramtanon, Marion Lacoue-Nègre, Alexandra Berlioz-Barbier, Agnès Le Masle, Jean-Hugues Renault
Chemometrics and Intelligent Laboratory Systems, volume 261, Article 105381, published 2025-03-13
DOI: 10.1016/j.chemolab.2025.105381
URL: https://www.sciencedirect.com/science/article/pii/S0169743925000668
Abstract
Supervised multivariate analyses (MVA) are commonly used to extract valuable information, classify samples and highlight data trends from complex LC-HRMS datasets. However, traditional MVA methods, such as partial least-squares discriminant analysis (PLS-DA), may not provide sufficient class discrimination in the case of complex mixtures due to the presence of confounders, high-dimensionality and multicollinearity. In this regard, variable selection methods such as genetic algorithms can reduce datasets by retaining explanatory information while discarding non-informative features. The hyphenation of these selective heuristic methods and highly interpretable PLS-DA has shown significant potential for sample discrimination in spectroscopic techniques but has never been applied to complex LC-HRMS data.
Therefore, this work investigated the potential of combining variable selection by GA and classification by PLS-DA for identifying key chemical descriptors in these intricate LC-HRMS datasets. The determination of low reactivity discriminants (i.e., potential enzymatic inhibitors) in complex biomass samples was chosen as a case study and is presented in this work. Initially, PLS-DA was applied to an LC-ESI(±)-HRMS dataset containing 2167 features. However, the resulting model suffered from overfitting and yielded sub-optimal classification for low reactivity samples. Subsequently, GA was employed in triplicate, resulting in 768 final models. The most predictive model retained only 418 variables (i.e., an 81 % decrease in the dataset) and was subjected to PLS-DA. The new supervised model avoided overfitting and demonstrated significantly improved classification for low reactivity samples, achieving sensitivity, precision, and accuracy increases of 45 %, 25 %, and 38 %, respectively. Ultimately, an original strategy based on combining the variables’ importance in projection (VIP), GA-selection frequency (GA-SF) and cosine similarity was used to pinpoint species associated with low enzymatic reactivity. This study marks the first application of GA-PLS-DA for LC-HRMS data analysis. This approach effectively streamlined feature selection and enhanced class discrimination, successfully identifying 10 potential inhibitory features within the biomass samples. This methodology offers significant potential for studies on complex mixtures that encounter limited classification and/or overfitting issues with traditional supervised MVA methods.
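The variable-selection step can be illustrated with a minimal sketch, under stated assumptions. The abstract does not give the GA settings, the fitness function, or the exact definition of GA-SF, so everything below is hypothetical: the population size, mutation rate, parsimony penalty, and toy data are invented for illustration, and a leave-one-out nearest-centroid classifier stands in for PLS-DA as the fitness so that the example runs with the Python standard library alone. GA-SF is taken here to be the fraction of final models, pooled over three GA runs, that retain each variable.

```python
import random


def centroid_loo_accuracy(X, y, mask):
    """Stand-in fitness (NOT PLS-DA): leave-one-out accuracy of a
    nearest-centroid classifier using only the columns kept by `mask`."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0  # an empty variable subset cannot classify anything
    hits = 0
    for i, held_out in enumerate(X):
        sums, counts = {}, {}
        for k, row in enumerate(X):
            if k == i:
                continue  # leave sample i out of the class centroids
            acc = sums.setdefault(y[k], [0.0] * len(cols))
            for a, j in enumerate(cols):
                acc[a] += row[j]
            counts[y[k]] = counts.get(y[k], 0) + 1

        def sq_dist(label):
            # Squared Euclidean distance from the held-out sample to a centroid.
            return sum((held_out[j] - sums[label][a] / counts[label]) ** 2
                       for a, j in enumerate(cols))

        hits += min(sums, key=sq_dist) == y[i]
    return hits / len(X)


def ga_select(X, y, fitness, pop_size=20, generations=15, p_mut=0.05,
              penalty=0.001, seed=0):
    """Evolve boolean variable masks; return the final population, best first."""
    rng = random.Random(seed)
    n_vars = len(X[0])

    def score(mask):
        # Fitness minus a small (assumed) penalty per retained variable.
        return fitness(X, y, mask) - penalty * sum(mask)

    pop = [[rng.random() < 0.5 for _ in range(n_vars)] for _ in range(pop_size)]
    for _ in range(generations):
        # Truncation selection: the better half survives unchanged (elitism).
        parents = sorted(pop, key=score, reverse=True)[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_vars)  # single-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation with probability p_mut per variable.
            children.append([(not g) if rng.random() < p_mut else g for g in child])
        pop = parents + children
    return sorted(pop, key=score, reverse=True)


def selection_frequency(masks):
    """Fraction of models that retain each variable (assumed GA-SF definition)."""
    return [sum(m[j] for m in masks) / len(masks) for j in range(len(masks[0]))]


# Toy data: variable 0 separates the two classes, variables 1-5 are noise.
data_rng = random.Random(1)
y = ["low"] * 6 + ["high"] * 6
X = [[(0.0 if label == "low" else 10.0) + data_rng.random()]
     + [data_rng.random() for _ in range(5)] for label in y]

# Three independent GA runs, echoing the triplicate design in the abstract.
runs = [ga_select(X, y, centroid_loo_accuracy, seed=s) for s in (0, 1, 2)]
masks = [m for run in runs for m in run]
best = runs[0][0]
freq = selection_frequency(masks)
print("best mask of run 0:", best)
print("selection frequency:", [round(f, 2) for f in freq])
```

In a real workflow the `fitness` argument would be replaced by a cross-validated PLS-DA score, and the selection frequencies would be read alongside VIP values and cosine similarity, as the abstract describes, rather than on their own.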
About the journal:
Chemometrics and Intelligent Laboratory Systems publishes original research papers, short communications, reviews, tutorials and Original Software Publications reporting on development of novel statistical, mathematical, or computer techniques in Chemistry and related disciplines.
Chemometrics is the chemical discipline that uses mathematical and statistical methods to design or select optimal procedures and experiments, and to provide maximum chemical information by analysing chemical data.
The journal deals with the following topics:
1) Development of new statistical, mathematical and chemometrical methods for Chemistry and related fields (Environmental Chemistry, Biochemistry, Toxicology, System Biology, -Omics, etc.)
2) Novel applications of chemometrics to all branches of Chemistry and related fields (typical domains of interest are: process data analysis, experimental design, data mining, signal processing, supervised modelling, decision making, robust statistics, mixture analysis, multivariate calibration etc.) Routine applications of established chemometrical techniques will not be considered.
3) Development of new software that provides novel tools or truly advances the use of chemometrical methods.
4) Well characterized data sets to test performance for the new methods and software.
The journal complies with the International Committee of Medical Journal Editors' Uniform requirements for manuscripts.