Active learning sample selection based on multicriteria

IF 1.6 | Zone 4, Chemistry | Q3 CHEMISTRY, APPLIED | Journal of Near Infrared Spectroscopy | Pub Date: 2023-11-08 | DOI: 10.1177/09670335231211618
Zhonghai He, Kun Shen, Xiaofang Zhang
Citations: 0

Abstract

In multivariate calibration problems, model performance depends significantly on the calibration samples used during model building. In recent years, active learning has become one of the best approaches to sample selection. However, most active learning methods select instances using only prediction uncertainty or distance in sample space, and these single-criterion methods tend to select undesired samples. In addition, sample density characterizes the spatial information carried by a sample, but few studies in quantitative analysis use sample density alone to select calibration samples. To address these issues, this paper proposes an active learning sample selection method (DIDAL), based on the k-means clustering algorithm, that combines three criteria: diversity, informativeness and sample density. The most representative sample is iteratively selected for addition to the calibration set, which is then used for modeling and for estimating the chemical concentration of analytes. Soybean meal and soy sauce samples were analyzed by DIDAL, and the results were compared with those of existing sample selection methods. The prediction results show that the DIDAL algorithm significantly outperforms several existing algorithms and comes close to the performance of full-sample modeling. A model with high prediction accuracy can be constructed by selecting only a few samples using the DIDAL method.
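The abstract names the three criteria but does not give their formulas or the rule for combining them, so the following is only a minimal sketch of the general idea and not the authors' DIDAL implementation. It assumes diversity is the distance to the nearest already-selected sample, density is the inverse mean distance to the k nearest neighbours in the pool, informativeness is an uncertainty value supplied by the caller, and the three min-max-scaled scores are multiplied. All function names are hypothetical, and the k-means step mentioned in the abstract is omitted.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all ones."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def diversity(pool, selected):
    """Assumed criterion: distance to the nearest already-selected sample."""
    return [min(euclidean(x, s) for s in selected) for x in pool]

def density(pool, k=2):
    """Assumed criterion: inverse mean distance to the k nearest pool neighbours."""
    scores = []
    for i, x in enumerate(pool):
        dists = sorted(euclidean(x, y) for j, y in enumerate(pool) if j != i)
        nearest = dists[:k]
        scores.append(1.0 / (1e-12 + sum(nearest) / len(nearest)))
    return scores

def select_next(pool, selected, uncertainty, k=2):
    """Return the index of the pool sample with the highest combined score.

    The product of the three scaled criteria is an assumption made for
    illustration; the abstract does not state how DIDAL combines them.
    """
    d = min_max(diversity(pool, selected))
    u = min_max(uncertainty)
    r = min_max(density(pool, k))
    combined = [a * b * c for a, b, c in zip(d, u, r)]
    return max(range(len(pool)), key=combined.__getitem__)

# Toy data: two tight clusters plus one isolated outlier at (10, 0).
pool = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
        (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
        (10.0, 0.0)]
selected = [(0.05, 0.05)]                            # current calibration set
uncertainty = [0.2, 0.2, 0.2, 0.6, 0.5, 0.5, 0.9]    # made-up values

print(select_next(pool, selected, uncertainty))      # 3
```

In this toy data, uncertainty alone would pick the isolated outlier (index 6), whereas the combined score picks a sample from the dense, not yet represented cluster (index 3). This is the kind of undesired selection that the abstract attributes to single-criterion methods.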
Source journal
CiteScore: 3.30
Self-citation rate: 5.60%
Articles published: 35
Review time: 6 months
About the journal: JNIRS (Journal of Near Infrared Spectroscopy) is a peer-reviewed journal publishing original research papers, short communications, review articles and letters concerned with near infrared spectroscopy and technology, its application, new instrumentation, and the use of chemometric and data handling techniques within NIR.