A selective genetic algorithm - PLS-DA approach based on untargeted LC-HRMS: Application to complex biomass samples

IF 3.7 · Region 2 (Chemistry) · JCR Q2 (AUTOMATION & CONTROL SYSTEMS) · Chemometrics and Intelligent Laboratory Systems · Pub Date: 2025-03-13 · DOI: 10.1016/j.chemolab.2025.105381

Ian Ramtanon, Marion Lacoue-Nègre, Alexandra Berlioz-Barbier, Agnès Le Masle, Jean-Hugues Renault

Chemometrics and Intelligent Laboratory Systems, volume 261, Article 105381. Journal Article. https://www.sciencedirect.com/science/article/pii/S0169743925000668

Citations: 0

Abstract

Supervised multivariate analyses (MVA) are commonly used to extract valuable information, classify samples and highlight data trends in complex LC-HRMS datasets. However, traditional MVA methods, such as partial least-squares discriminant analysis (PLS-DA), may not discriminate sufficiently between classes in complex mixtures because of confounders, high dimensionality and multicollinearity. Variable selection methods such as genetic algorithms (GA) can reduce datasets by retaining explanatory information while discarding non-informative features. Coupling these selective heuristic methods with highly interpretable PLS-DA has shown significant potential for sample discrimination in spectroscopic techniques but has never been applied to complex LC-HRMS data.

Therefore, this work investigated the potential of combining variable selection by GA with classification by PLS-DA to identify key chemical descriptors in these intricate LC-HRMS datasets. The determination of low-reactivity discriminants (i.e., potential enzymatic inhibitors) in complex biomass samples was chosen as the case study. Initially, PLS-DA was applied to an LC-ESI(±)-HRMS dataset containing 2167 features. However, the resulting model suffered from overfitting and yielded sub-optimal classification of low-reactivity samples. Subsequently, GA was run in triplicate, resulting in 768 final models. The most predictive model retained only 418 variables (i.e., an 81 % reduction in the number of variables) and was subjected to PLS-DA. The new supervised model avoided overfitting and classified low-reactivity samples significantly better, with sensitivity, precision, and accuracy increasing by 45 %, 25 %, and 38 %, respectively. Ultimately, an original strategy combining variable importance in projection (VIP), GA-selection frequency (GA-SF) and cosine similarity was used to pinpoint species associated with low enzymatic reactivity. This study marks the first application of GA-PLS-DA to LC-HRMS data analysis. The approach effectively streamlined feature selection and enhanced class discrimination, identifying 10 potential inhibitory features within the biomass samples. The methodology offers significant potential for studies on complex mixtures in which traditional supervised MVA methods yield limited classification and/or overfitting.
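The quantities named in the abstract can be illustrated with a short sketch. This is not the authors' implementation: the function names, the confusion counts, the GA masks and the intensity profiles below are all hypothetical, and the abstract does not state which vectors the cosine similarity was computed between (here it is assumed to compare two feature intensity profiles across samples). Only the figures 2167 and 418 come from the text.

```python
import numpy as np


def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, precision and accuracy for one class from confusion counts."""
    sensitivity = tp / (tp + fn)            # true positives among actual positives
    precision = tp / (tp + fp)              # true positives among predicted positives
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, precision, accuracy


def selection_frequency(masks):
    """Fraction of GA models in which each variable was retained (a GA-SF-like score)."""
    return np.asarray(masks, dtype=float).mean(axis=0)


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Figures from the abstract: 418 of 2167 variables retained.
reduction = 1 - 418 / 2167                  # about 0.807, reported as 81 %

# Hypothetical confusion counts for the low-reactivity class.
sens, prec, acc = classification_metrics(tp=8, fp=2, tn=9, fn=1)

# Hypothetical binary masks from four GA models over three variables.
masks = [[1, 0, 1], [1, 1, 0], [1, 0, 0], [0, 0, 1]]
ga_sf = selection_frequency(masks)

# Hypothetical intensity profiles of two features across three samples.
sim = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

In a ranking of this kind, a variable would plausibly be flagged when it scores highly on several criteria at once (for example high VIP and high selection frequency); the thresholds actually used are not given in the abstract.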



Source journal

Chemometrics and Intelligent Laboratory Systems (Engineering & Technology - Analytical Chemistry)

CiteScore: 7.50

Self-citation rate: 7.70%

Articles published: 169

Review time: 3.4 months

About the journal: Chemometrics and Intelligent Laboratory Systems publishes original research papers, short communications, reviews, tutorials and Original Software Publications reporting on the development of novel statistical, mathematical, or computer techniques in Chemistry and related disciplines. Chemometrics is the chemical discipline that uses mathematical and statistical methods to design or select optimal procedures and experiments, and to provide maximum chemical information by analysing chemical data. The journal deals with the following topics: 1) Development of new statistical, mathematical and chemometrical methods for Chemistry and related fields (Environmental Chemistry, Biochemistry, Toxicology, Systems Biology, -Omics, etc.). 2) Novel applications of chemometrics to all branches of Chemistry and related fields (typical domains of interest are: process data analysis, experimental design, data mining, signal processing, supervised modelling, decision making, robust statistics, mixture analysis, multivariate calibration, etc.). Routine applications of established chemometrical techniques will not be considered. 3) Development of new software that provides novel tools or truly advances the use of chemometrical methods. 4) Well-characterized data sets to test the performance of new methods and software. The journal complies with the International Committee of Medical Journal Editors' Uniform Requirements for Manuscripts.