{"title":"Towards Benchmarking for Evaluating Machine Learning Methods in Detecting Outliers in Process Datasets","authors":"T. Schindler, Simon Schlicht, K. Thoben","doi":"10.3390/computers12120253","DOIUrl":null,"url":null,"abstract":"Within the integration and development of data-driven process models, the underlying process is digitally mapped in a model through sensory data acquisition and subsequent modelling. In this process, challenges of different types and degrees of severity arise in each modelling step, according to the Cross-Industry Standard Process for Data Mining (CRISP-DM). Particularly in the context of data acquisition and integration into the process model, it can be assumed with a sufficiently high degree of probability that the acquired data contain anomalies of various kinds. The outliers must be detected in the data preparation and processing phase and dealt with accordingly. If this is sufficiently implemented, it will positively impact the subsequent modelling in terms of accuracy and precision. Therefore, this paper shows how outliers can be identified using the unsupervised machine learning methods autoencoder, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Isolation Forest (iForest), and One-Class Support Vector Machine (OCSVM). Following implementing these methods, we compared them by applying the Numenta Anomaly Benchmark (NAB) and sufficiently presented the individual strengths and disadvantages. Evaluating the correctness, distinctiveness and robustness criteria described in the paper showed that the One-Class Support Vector Machine was outstanding among the methods considered. 
This is because the OCSVM achieved acceptable anomaly detections on the available process datasets with comparatively little effort.","PeriodicalId":46292,"journal":{"name":"Computers","volume":"8 24","pages":""},"PeriodicalIF":2.6000,"publicationDate":"2023-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computers","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3390/computers12120253","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
Citation count: 0
Abstract
In the integration and development of data-driven process models, the underlying process is digitally mapped into a model through sensor-based data acquisition and subsequent modelling. According to the Cross-Industry Standard Process for Data Mining (CRISP-DM), challenges of different types and degrees of severity arise in each modelling step. Particularly during data acquisition and integration into the process model, it can be assumed with high probability that the acquired data contain anomalies of various kinds. These outliers must be detected and handled in the data preparation and processing phase. If this is done adequately, it improves the accuracy and precision of the subsequent modelling. This paper therefore shows how outliers can be identified using four unsupervised machine learning methods: an autoencoder, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Isolation Forest (iForest), and One-Class Support Vector Machine (OCSVM). After implementing these methods, we compared them using the Numenta Anomaly Benchmark (NAB) and present their individual strengths and weaknesses. An evaluation against the correctness, distinctiveness and robustness criteria described in the paper showed that the OCSVM performed best among the methods considered, because it achieved acceptable anomaly detection on the available process datasets with comparatively little effort.
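To make the idea of unsupervised outlier detection concrete, the following is a minimal sketch of one of the four methods named above, DBSCAN, written in plain Python. It is not the authors' implementation: the function names, the synthetic (temperature, pressure) readings, and the values chosen for `eps` and `min_samples` are all assumptions made for illustration. Points that DBSCAN cannot assign to any dense cluster are labelled as noise, and those noise points are treated here as outliers.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i], including i."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Return one label per point: a cluster id >= 0, or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already visited
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_samples:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        # points[i] is a core point: start a new cluster and grow it
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_samples:
                queue.extend(j_neighbours)  # j is a core point as well
    return labels


# Hypothetical process data: 30 "normal" readings plus two injected anomalies.
readings = [(20.0 + 0.1 * (i % 5), 1.0 + 0.01 * (i % 3)) for i in range(30)]
readings += [(35.0, 1.0), (20.2, 3.0)]

# eps and min_samples are illustrative values, not tuned recommendations.
labels = dbscan(readings, eps=0.5, min_samples=4)
outlier_indices = [i for i, label in enumerate(labels) if label == NOISE]
print(outlier_indices)
```

On real process data the features would typically need to be scaled before distances are computed, and `eps` and `min_samples` would have to be chosen for the dataset at hand. The same fit-then-flag pattern applies to the other methods in the comparison, although each defines "outlier" differently.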