Seán Enis Cody, Sebastian Scher, Iain McDonald, Albert Zijlstra, Emma Alexander, Nick Cox
{"title":"基于机器学习的恒星分类与高度稀疏的测光数据。","authors":"Seán Enis Cody, Sebastian Scher, Iain McDonald, Albert Zijlstra, Emma Alexander, Nick Cox","doi":"10.12688/openreseurope.17023.2","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Identifying stars belonging to different classes is vital in order to build up statistical samples of different phases and pathways of stellar evolution. In the era of surveys covering billions of stars, an automated method of identifying these classes becomes necessary.</p><p><strong>Methods: </strong>Many classes of stars are identified based on their emitted spectra. In this paper, we use a combination of the multi-class multi-label Machine Learning (ML) method XGBoost and the PySSED spectral-energy-distribution fitting algorithm to classify stars into nine different classes, based on their photometric data. The classifier is trained on subsets of the SIMBAD database. Particular challenges are the very high sparsity (large fraction of missing values) of the underlying data as well as the high class imbalance. We discuss the different variables available, such as photometric measurements on the one hand, and indirect predictors such as Galactic position on the other hand.</p><p><strong>Results: </strong>We show the difference in performance when excluding certain variables, and discuss in which contexts which of the variables should be used. Finally, we show that increasing the number of samples of a particular type of star significantly increases the performance of the model for that particular type, while having little to no impact on other types. The accuracy of the main classifier is ∼0.7 with a macro F1 score of 0.61.</p><p><strong>Conclusions: </strong>While the current accuracy of the classifier is not high enough to be reliably used in stellar classification, this work is an initial proof of feasibility for using ML to classify stars based on photometry.</p>","PeriodicalId":74359,"journal":{"name":"Open research Europe","volume":"4 ","pages":"29"},"PeriodicalIF":0.0000,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11362725/pdf/","citationCount":"0","resultStr":"{\"title\":\"Machine learning based stellar classification with highly sparse photometry data.\",\"authors\":\"Seán Enis Cody, Sebastian Scher, Iain McDonald, Albert Zijlstra, Emma Alexander, Nick Cox\",\"doi\":\"10.12688/openreseurope.17023.2\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Identifying stars belonging to different classes is vital in order to build up statistical samples of different phases and pathways of stellar evolution. In the era of surveys covering billions of stars, an automated method of identifying these classes becomes necessary.</p><p><strong>Methods: </strong>Many classes of stars are identified based on their emitted spectra. In this paper, we use a combination of the multi-class multi-label Machine Learning (ML) method XGBoost and the PySSED spectral-energy-distribution fitting algorithm to classify stars into nine different classes, based on their photometric data. The classifier is trained on subsets of the SIMBAD database. Particular challenges are the very high sparsity (large fraction of missing values) of the underlying data as well as the high class imbalance. We discuss the different variables available, such as photometric measurements on the one hand, and indirect predictors such as Galactic position on the other hand.</p><p><strong>Results: </strong>We show the difference in performance when excluding certain variables, and discuss in which contexts which of the variables should be used. Finally, we show that increasing the number of samples of a particular type of star significantly increases the performance of the model for that particular type, while having little to no impact on other types. The accuracy of the main classifier is ∼0.7 with a macro F1 score of 0.61.</p><p><strong>Conclusions: </strong>While the current accuracy of the classifier is not high enough to be reliably used in stellar classification, this work is an initial proof of feasibility for using ML to classify stars based on photometry.</p>\",\"PeriodicalId\":74359,\"journal\":{\"name\":\"Open research Europe\",\"volume\":\"4 \",\"pages\":\"29\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11362725/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Open research Europe\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.12688/openreseurope.17023.2\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2024/1/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Open research Europe","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.12688/openreseurope.17023.2","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/1/1 0:00:00","PubModel":"eCollection","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
摘要
背景:要建立恒星演化不同阶段和途径的统计样本,识别属于不同类别的恒星至关重要。在对数十亿恒星进行巡天观测的时代,有必要采用一种自动方法来识别这些类别:方法:许多恒星类别都是根据它们的发射光谱来识别的。在本文中,我们结合使用多类多标签机器学习(ML)方法 XGBoost 和 PySSED 光谱能量分布拟合算法,根据测光数据将恒星分为九个不同的类别。分类器是在 SIMBAD 数据库的子集上进行训练的。所面临的特殊挑战是基础数据的高度稀疏性(大量缺失值)以及高度的类别不平衡。我们讨论了可用的不同变量,一方面是光度测量数据,另一方面是银河系位置等间接预测指标:结果:我们展示了排除某些变量后的性能差异,并讨论了在哪些情况下应使用哪些变量。最后,我们展示了增加特定类型恒星的样本数量会显著提高模型对该特定类型恒星的性能,而对其他类型恒星几乎没有影响。主分类器的准确率为 0.7,宏观 F1 得分为 0.61.结论:虽然目前分类器的准确率还不够高,不能可靠地用于恒星分类,但这项工作初步证明了使用 ML 根据光度测量对恒星进行分类的可行性。
Machine learning based stellar classification with highly sparse photometry data.
Background: Identifying stars belonging to different classes is vital in order to build up statistical samples of different phases and pathways of stellar evolution. In the era of surveys covering billions of stars, an automated method of identifying these classes becomes necessary.
Methods: Many classes of stars are identified based on their emitted spectra. In this paper, we use a combination of the multi-class multi-label Machine Learning (ML) method XGBoost and the PySSED spectral-energy-distribution fitting algorithm to classify stars into nine different classes, based on their photometric data. The classifier is trained on subsets of the SIMBAD database. Particular challenges are the very high sparsity (large fraction of missing values) of the underlying data as well as the high class imbalance. We discuss the different variables available, such as photometric measurements on the one hand, and indirect predictors such as Galactic position on the other hand.
Results: We show the difference in performance when excluding certain variables, and discuss in which contexts which of the variables should be used. Finally, we show that increasing the number of samples of a particular type of star significantly increases the performance of the model for that particular type, while having little to no impact on other types. The accuracy of the main classifier is ∼0.7 with a macro F1 score of 0.61.
Conclusions: While the current accuracy of the classifier is not high enough to be reliably used in stellar classification, this work is an initial proof of feasibility for using ML to classify stars based on photometry.