Real-bogus scores for active anomaly detection
T. A. Semenikhin, M. V. Kornilov, M. V. Pruzhinskaya, A. D. Lavrukhina, E. Russeil, E. Gangler, E. E. O. Ishida, V. S. Korolev, K. L. Malanchev, A. A. Volnova, S. Sreejith (The SNAD team)
arXiv - PHYS - Instrumentation and Methods for Astrophysics, 2024-09-16. https://doi.org/arxiv-2409.10256
Abstract
In the task of anomaly detection in modern time-domain photometric surveys,
the primary goal is to identify astrophysically interesting, rare, and unusual
objects among a large volume of data. Unfortunately, artifacts -- such as plane
or satellite tracks, bad columns on CCDs, and ghosts -- often constitute
significant contaminants in results from anomaly detection analysis. In such
contexts, the Active Anomaly Discovery (AAD) algorithm allows tailoring the
output of anomaly detection pipelines according to what the expert judges to be
scientifically interesting. We demonstrate how the introduction of real-bogus
scores, obtained from a machine learning classifier, improves the results from
AAD. Using labeled data from the SNAD ZTF knowledge database, we train four
real-bogus classifiers: XGBoost, CatBoost, Random Forest, and Extremely
Randomized Trees. All the models perform real-bogus classification with similar
effectiveness, achieving ROC-AUC scores ranging from 0.93 to 0.95.
Consequently, we select the Random Forest model as the main model due to its
simplicity and interpretability. The Random Forest classifier is applied to 67
million light curves from ZTF DR17. The output real-bogus score is used as an
additional feature for two anomaly detection algorithms: static Isolation
Forest and AAD. While results from Isolation Forest remain unchanged, the
number of artifacts detected by the active approach decreases significantly
with the inclusion of the real-bogus score, from 27 to 3 out of 100. We
conclude that incorporating the real-bogus classifier result as an additional
feature in the active anomaly detection pipeline significantly reduces the
number of artifacts in the outputs, thereby increasing the incidence of
astrophysically interesting objects presented to human experts.
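
The pipeline described above (train a real-bogus classifier on labeled data, check its ROC-AUC, then pass its score to an anomaly detector as one more feature) can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the SNAD implementation: the data are synthetic stand-ins for light-curve features, the feature count, sample sizes, and hyperparameters are assumptions, and only the static Isolation Forest step is shown, since AAD is not part of scikit-learn.

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for labeled light-curve features (NOT SNAD/ZTF data).
# Label 1 = real object, label 0 = bogus (artifact).
n_labeled, n_features = 600, 5
y = rng.integers(0, 2, size=n_labeled)
X = rng.normal(size=(n_labeled, n_features)) + 1.5 * y[:, None]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Real-bogus classifier (Random Forest; hyperparameters are illustrative).
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# ROC-AUC on held-out labeled data, using the predicted probability of "real".
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

# Synthetic "survey" sample without labels.
n_survey = 1000
hidden = rng.integers(0, 2, size=(n_survey, 1))
X_survey = rng.normal(size=(n_survey, n_features)) + 1.5 * hidden

# The real-bogus score becomes one additional feature column.
rb_score = clf.predict_proba(X_survey)[:, 1]
X_aug = np.column_stack([X_survey, rb_score])

# Static Isolation Forest on the augmented feature set.
iso = IsolationForest(n_estimators=100, random_state=0)
iso.fit(X_aug)

# score_samples returns lower values for more abnormal points,
# so negate it to get "higher = more anomalous".
anomaly_score = -iso.score_samples(X_aug)
top100 = np.argsort(anomaly_score)[::-1][:100]

print(f"ROC-AUC on held-out labels: {auc:.3f}")
print(f"Augmented feature matrix shape: {X_aug.shape}")
```

In an active setting, an expert would inspect candidates such as those in `top100` and feed their judgments back to the detector; that feedback loop is what the abstract credits with making use of the extra real-bogus feature, and it is omitted here.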