Machine Learning Techniques for Anomaly Detection in High-Frequency Time Series of Wind Speed and Greenhouse Gas Concentration Measurements
A. J. Kasatkin, M. A. Krinitskiy
Moscow University Physics Bulletin, vol. 78, suppl. 1, pp. S138-S148. Published 2024-01-17.
DOI: 10.3103/S0027134923070135
https://link.springer.com/article/10.3103/S0027134923070135
Abstract
Fluxes of greenhouse gases (GHG) may be assessed in situ using the eddy covariance method, which processes high-frequency measurements of gas concentration and wind speed acquired at certain sites, e.g., carbon measurement test areas of the pilot project of the Ministry of Education and Science of Russia. The measurements commonly come with noise, anomalies, and gaps of various kinds. These anomalies result in biased GHG flux estimates. There are a number of empirical and heuristic approaches for filtering noise and anomalies, as well as for gap-filling. These approaches have many tuning parameters that are commonly adjusted by an expert, which is a limiting factor for large-scale deployment of GHG monitoring stations. In this study, we propose an alternative approach for anomaly detection in high-frequency measurements of GHG concentration and wind speed. Our approach is based on machine learning techniques and has fewer tuning parameters. The goal of our study is to develop a fully automated data preprocessing routine based on machine learning algorithms. We collected a dataset of high-frequency GHG concentration and wind speed measurements from one of the carbon measurement test areas. In order to compare anomaly detection algorithms, we labeled anomalies in a subset of this dataset. We present two approaches for anomaly detection, namely: (a) identification of outliers based on the error magnitude in time series statistical forecasts performed by a machine learning (ML) algorithm; and (b) classification of anomalies using an ML model trained on the labeled subset mentioned above. We compared the approaches and algorithms using the F1-score metric, assessed with respect to the expert-labeled subset of anomalies in GHG concentration and wind speed time series.
Within the forecast-error-based approach, we trained several ML models: the ARIMA autoregression method, the CatBoost model for autoregression, the CatBoost model for forecasting employing additional features, and the LSTM artificial neural network. Within the supervised classification approach, we tested the CatBoost classification model. We demonstrate that ML models for forecasting deliver a high quality of time series prediction within the autoregression approach. We also show that the anomaly identification method based on the autoregression approach delivers the best quality, with the F1-score reaching \(0.812\).
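The forecast-error idea in approach (a) and the F1-score comparison can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain autoregressive model fitted by ordinary least squares in place of the ARIMA, CatBoost, and LSTM forecasters named in the abstract, and it assumes a threshold rule (median plus `k` scaled median absolute deviations of the errors) that the abstract does not specify. The function names, the model order, and the value of `k` are all hypothetical.

```python
import numpy as np


def lag_matrix(series: np.ndarray, order: int) -> np.ndarray:
    """Column k holds the value observed k + 1 steps before each target."""
    n = len(series)
    return np.column_stack(
        [series[order - k - 1 : n - k - 1] for k in range(order)]
    )


def forecast_errors(series: np.ndarray, order: int = 5) -> np.ndarray:
    """Absolute one-step-ahead errors of an AR(order) model fitted by OLS."""
    design = np.column_stack(
        [np.ones(len(series) - order), lag_matrix(series, order)]
    )
    target = series[order:]
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    errors = np.zeros(len(series))  # the first `order` points have no forecast
    errors[order:] = np.abs(target - design @ coef)
    return errors


def flag_anomalies(series: np.ndarray, order: int = 5, k: float = 5.0) -> np.ndarray:
    """Flag points whose forecast error exceeds median + k * scaled MAD.

    The threshold rule is an assumption made for this sketch.
    """
    errors = forecast_errors(series, order)
    scored = errors[order:]
    median = np.median(scored)
    mad = np.median(np.abs(scored - median))
    threshold = median + k * 1.4826 * mad
    return errors > threshold


def f1_score(true_flags: np.ndarray, pred_flags: np.ndarray) -> float:
    """F1 = 2 * TP / (2 * TP + FP + FN), computed on boolean arrays."""
    tp = np.sum(true_flags & pred_flags)
    fp = np.sum(~true_flags & pred_flags)
    fn = np.sum(true_flags & ~pred_flags)
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


if __name__ == "__main__":
    # Synthetic stand-in for a measured series: a smooth signal plus noise,
    # with a few injected spikes acting as "expert-labeled" anomalies.
    rng = np.random.default_rng(0)
    t = np.arange(2000)
    series = np.sin(2 * np.pi * t / 200) + rng.normal(0.0, 0.05, t.size)
    labels = np.zeros(t.size, dtype=bool)
    labels[[300, 700, 1100, 1500, 1900]] = True
    series[labels] += 3.0
    print("F1:", f1_score(labels, flag_anomalies(series)))
```

One limitation of this sketch is that a spike also enters the lag window of the following forecasts, so a few points right after each anomaly may be flagged as well, which lowers precision. A practical pipeline would need to handle this, for example by masking flagged points before forecasting the next ones.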
About the journal:
Moscow University Physics Bulletin publishes original papers (reviews, articles, and brief communications) in the following fields of experimental and theoretical physics: theoretical and mathematical physics; physics of nuclei and elementary particles; radiophysics, electronics, acoustics; optics and spectroscopy; laser physics; condensed matter physics; chemical physics, physical kinetics, and plasma physics; biophysics and medical physics; astronomy, astrophysics, and cosmology; physics of the Earth, atmosphere, and hydrosphere.