Title: Wavelength selection method for near-infrared spectroscopy based on the combination of mutual information and genetic algorithm
Authors: Xiao-Hui Ma, Zheng-Guang Chen, Shuo Liu, Jin-Ming Liu, Xue-Song Tian
Journal: Talanta, vol. 286, article 127573 (Journal Article)
Published: 2025-05-01 (Epub 2025-01-10)
DOI: 10.1016/j.talanta.2025.127573
URL: https://doi.org/10.1016/j.talanta.2025.127573
Impact factor: 5.6; JCR: Q1 (Chemistry, Analytical)
Open access: no
Citations: 0
Abstract
Near-infrared (NIR) spectroscopy has become a widely used analytical tool in many fields because it is convenient and efficient. However, as instrument precision has improved, spectra can now contain hundreds of dimensions. This growth makes modeling time-consuming and degrades model performance, so it is crucial to choose representative features carefully before constructing models. This paper addresses a limitation of filter algorithms: they can only rank features and cannot directly determine the best feature subset. The Max-Relevance Min-Redundancy (mRMR) algorithm, a filter method, is combined with the Genetic Algorithm (GA), a wrapper method, to select appropriate features automatically. During each iteration of the GA, the hybrid algorithm retains the features in each individual that mRMR considers strongly relevant and weakly redundant, and deletes those it regards as weakly relevant or highly redundant. Through iteration, the feature subset is continuously optimized. We apply the proposed hybrid method to two datasets and build several models to verify it. The experimental results indicate that the approach combines the advantages of both feature selection methods and selects features with good predictive performance. Compared with other common feature selection methods, such as Uninformative Variable Elimination (UVE), Competitive Adaptive Reweighted Sampling (CARS), the Successive Projections Algorithm (SPA), Iteratively Retains Informative Variables (IRIV), and GA alone, the hybrid algorithm selects a larger number of feature variables that are both representative and informative, and it significantly enhances the predictive performance of the model.
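The pruning step described in the abstract, in which each GA individual keeps its most relevant, least redundant features, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names (`discretize`, `mutual_info`, `mrmr_scores`, `prune_individual`), the parameters `bins` and `n_keep`, the equal-width binning, and the "relevance minus mean redundancy" score are all assumptions made for illustration. The GA's selection, crossover, mutation, and wrapper fitness evaluation are omitted.

```python
import math
from collections import Counter


def discretize(values, bins=4):
    """Equal-width binning of floats into integer labels (assumed scheme)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant columns
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_info(a, b):
    """Mutual information (in nats) between two discrete label sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log(c * n / (pa[x] * pb[y]))
               for (x, y), c in pab.items())


def mrmr_scores(columns, target, subset):
    """Score each feature in `subset` as relevance minus mean redundancy."""
    scores = {}
    for j in subset:
        relevance = mutual_info(columns[j], target)
        others = [k for k in subset if k != j]
        redundancy = (sum(mutual_info(columns[j], columns[k]) for k in others)
                      / len(others)) if others else 0.0
        scores[j] = relevance - redundancy
    return scores


def prune_individual(columns, target, subset, n_keep):
    """Keep the `n_keep` best-scoring features of one GA individual."""
    scores = mrmr_scores(columns, target, subset)
    ranked = sorted(subset, key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:n_keep])


# Toy data (hypothetical): feature 0 matches the target exactly, while
# features 1 and 2 are independent of the target and of each other.
target = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
    [0, 0, 1, 1, 0, 0, 1, 1],
]
print(prune_individual(columns, target, [0, 1, 2], n_keep=1))  # [0]
```

In a full hybrid, a step like `prune_individual` would run on every individual in each generation, with continuous spectra first passed through something like `discretize`; how many features to keep per individual is a design choice the abstract does not specify.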
About the journal:
Talanta provides a forum for the publication of original research papers, short communications, and critical reviews in all branches of pure and applied analytical chemistry. Papers are evaluated based on established guidelines, including the fundamental nature of the study, scientific novelty, substantial improvement or advantage over existing technology or methods, and demonstrated analytical applicability. Original research papers on fundamental studies, and on novel sensor and instrumentation developments, are encouraged. Novel or improved applications in areas such as clinical and biological chemistry, environmental analysis, geochemistry, materials science and engineering, and analytical platforms for omics development are welcome.
Analytical performance of methods should be determined, including interference and matrix effects, and methods should be validated by comparison with a standard method, or analysis of a certified reference material. Simple spiking recoveries may not be sufficient. The developed method should especially comprise information on selectivity, sensitivity, detection limits, accuracy, and reliability. However, applying official validation or robustness studies to a routine method or technique does not necessarily constitute novelty. Proper statistical treatment of the data should be provided. Relevant literature should be cited, including related publications by the authors, and authors should discuss how their proposed methodology compares with previously reported methods.