Context:
There is broad consensus that data quality strongly affects the construction of software and AI systems. Practitioners must therefore detect anomalies in their data and repair the underlying problems. For big data in industry, statistic-based unsupervised anomaly detectors are attractive because they require no labels and are highly scalable. However, we observed that these tools, although unsupervised, always require data-dependent parameters, which can strongly affect detection performance and take considerable effort to configure.
Objectives:
In this work, we propose a fully unsupervised, statistic-based cell-level data anomaly detector, LUCARIO (Learning Unsupervised, Cell-level Anomaly-detector for Regex Incompatibilities and Outliers). Our approach aims to detect common cell-level data anomalies (pattern violations and outliers) without manual effort in data annotation or parameter configuration, while providing robust performance on data from diverse domains.
Methods:
Following previous studies, we divided cell anomalies into two categories: pattern violations and outliers (categorical and numerical). We proposed three detection approaches based on heuristics and statistical theories to identify these anomalies. To evaluate LUCARIO’s effectiveness and usability, we conducted experiments on six open-source datasets and a real-life industrial dataset from our industrial partner CompanyX.
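To make the two anomaly categories concrete, the following is a minimal sketch of what a pattern violation and a numerical outlier look like at cell level. It is not LUCARIO's algorithm, which the abstract does not specify. The helper names (`shape`, `pattern_violations`, `numeric_outliers`) and the thresholds `min_share` and `k` are hypothetical, and these hand-set thresholds are exactly the kind of data-dependent parameter that LUCARIO aims to avoid.

```python
import re
from collections import Counter
from statistics import quantiles


def shape(cell: str) -> str:
    """Generalize a cell to a coarse pattern: digit runs -> '9', letter runs -> 'A'."""
    generalized = re.sub(r"[0-9]+", "9", cell)
    return re.sub(r"[A-Za-z]+", "A", generalized)


def pattern_violations(column, min_share=0.1):
    """Return indices of cells whose pattern is rare in the column.

    min_share is an illustrative, hand-picked threshold (an assumption).
    """
    shapes = [shape(cell) for cell in column]
    counts = Counter(shapes)
    total = len(column)
    return [i for i, s in enumerate(shapes) if counts[s] / total < min_share]


def numeric_outliers(values, k=1.5):
    """Return indices of values outside the classic IQR fences.

    k = 1.5 is the conventional Tukey multiplier, again an assumption here.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


dates = [
    "2021-01-05", "2022-11-30", "2020-07-14", "2019-03-02", "2023-06-21",
    "2018-12-25", "2017-09-09", "2016-02-29", "2015-10-10", "2014-04-04",
    "05/01/2021",  # different separator: a pattern violation
]
prices = [10, 12, 11, 13, 12, 11, 10, 12, 500]  # 500 is a numerical outlier

print(pattern_violations(dates))
print(numeric_outliers(prices))
```

In this sketch, the date written with slashes is flagged because its generalized pattern is rare in the column, and the value 500 is flagged because it lies outside the interquartile fences.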
Results:
In our experiments on six open-source datasets from various domains, LUCARIO consistently detected cell-level data issues (pattern violations and outliers) regardless of dataset size and anomaly rate. LUCARIO reached an average F1 score of 0.54, higher than all baseline unsupervised anomaly detectors, including GPT-5 with few-shot prompting. Practitioners from CompanyX generally agreed that LUCARIO can benefit their data quality by detecting critical data issues and providing reliable suggestions.
Conclusion:
The experimental results show that LUCARIO has the potential to improve the data used for both software and AI system construction in real-life applications, suggesting its practicality in data management.