Optimised sampling of SDSS-IV MaStar spectra for stellar classification using supervised models

Astronomy & Astrophysics (IF 5.8; Zone 2, Physics and Astrophysics; JCR Q1, Astronomy & Astrophysics) Pub Date: 2025-01-27 DOI: 10.1051/0004-6361/202451309
R. I. El-Kholy, Z. M. Hayman

Abstract

Context. Supervised machine learning models are increasingly being used to classify stars from spectroscopic data. However, training these models calls for a large number of labelled instances, and collecting these is usually costly in both time and expertise.

Aims. Active learning (AL) algorithms minimise training dataset sizes by keeping only the most informative instances. This paper explores the application of AL to sampling stellar spectra using data from a highly class-imbalanced dataset.

Methods. We utilised the MaStar Stellar Library from SDSS DR17, along with its associated stellar parameter catalogue. A preprocessing pipeline that includes feature selection, scaling, and dimensionality reduction was applied to the data. Using different AL algorithms, we iteratively queried the instances on which the model exhibits the highest uncertainty or on which the committee of models exhibits the highest disagreement. We assessed the effectiveness of the sampling techniques by comparing several performance metrics of supervised-learning models trained on the queried samples with those of models trained on randomly sampled counterparts. Evaluation metrics included specificity, sensitivity, and the area under the curve. In addition, we used the Matthews correlation coefficient, which accounts for class imbalance. We applied this procedure to the effective temperature, surface gravity, and iron metallicity separately.

Results. Our results demonstrate the effectiveness of AL algorithms in selecting samples that yield performance metrics superior to those obtained from random and even stratified samples, with fewer training instances.

Conclusions. We recommend AL for prioritising the labelling of instances from astronomical-survey data by experts or through crowdsourcing, in order to mitigate the high time cost. Its effectiveness can be further exploited in selecting targets for follow-up observations in automated astronomical surveys.
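The querying and scoring steps named in the Methods can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' pipeline: the abstract does not say which uncertainty or disagreement measures were used, so least-confidence and vote entropy are assumed here as common choices, and every function name and toy value below is hypothetical.

```python
import math
from collections import Counter


def least_confidence(proba):
    """Uncertainty of one instance: 1 minus its highest class probability."""
    return 1.0 - max(proba)


def vote_entropy(votes):
    """Committee disagreement: entropy of the members' hard label votes."""
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in Counter(votes).values())


def query(scores, n_queries):
    """Indices of the n_queries pool instances with the highest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n_queries]


def binary_metrics(y_true, y_pred, positive=1):
    """Sensitivity, specificity and MCC for one class against the rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


# Toy pool: predicted class probabilities from a single model (made-up values).
pool_proba = [[0.90, 0.10], [0.55, 0.45], [0.20, 0.80], [0.50, 0.50]]
uncertain = query([least_confidence(p) for p in pool_proba], n_queries=2)

# Toy committee: each row holds three members' votes for one instance.
pool_votes = [[0, 0, 0], [0, 1, 1], [0, 1, 2]]
disputed = query([vote_entropy(v) for v in pool_votes], n_queries=1)

# Imbalanced toy labels: accuracy is 7/8, yet half the rare class is missed.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]
sensitivity, specificity, mcc = binary_metrics(y_true, y_pred)
print(uncertain, disputed, sensitivity, specificity, round(mcc, 3))
```

In the toy labels the classifier is right on seven of eight instances, but the Matthews correlation coefficient is only about 0.65 because one of the two rare-class instances is missed. This is the sense in which the coefficient accounts for class imbalance where plain accuracy does not.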