Data Petri nets (DPNs) have gained traction as a model for data-aware processes, thanks to their ability to balance simplicity with expressiveness, and because they can be automatically discovered from event logs. While model checking techniques for DPNs have been studied, more complex analysis tasks that are highly relevant for BPM are beyond the reach of methods known in the literature. We focus here on equivalence and inclusion of process behaviour with respect to language and configuration spaces, optionally taking data into account. Such comparisons are important in the context of key process mining tasks, namely process repair and discovery, and are related to conformance checking. To solve these tasks, we propose approaches for bounded DPNs based on constraint graphs, which are faithful abstractions of the reachable state space. Though the considered verification tasks are undecidable in general, we show that our method is a decision procedure for DPNs that admit a finite history set. This property guarantees that constraint graphs are finite and computable, and it was shown to hold for large classes of automatically mined DPNs as well as for DPNs presented in the literature. The new techniques are implemented in the tool ada, and an evaluation demonstrating their feasibility is provided.
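Since constraint graphs are the technical core of this approach, a minimal sketch may help fix intuitions. The following is a hypothetical, heavily simplified exploration loop, not the implementation in ada: markings and constraints are encoded as frozensets, and the transition interface, the guard atom "x>0", and the toy termination argument are all illustrative assumptions.

```python
# Hypothetical sketch (not the ada tool): breadth-first construction of a
# constraint-graph abstraction of a bounded DPN. Abstract states pair a
# marking with a set of data constraints; under a finite history set, the
# set of reachable abstract states is finite and the loop terminates.
from collections import deque

def build_constraint_graph(initial_marking, initial_constraint, transitions):
    """transitions: iterable of (name, fire), where fire(marking, constraint)
    yields successor (marking, constraint) pairs (nothing if disabled)."""
    start = (initial_marking, initial_constraint)
    nodes, edges = {start}, []
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for name, fire in transitions:
            for succ in fire(*state):
                edges.append((state, name, succ))
                if succ not in nodes:  # finite history set => finitely many nodes
                    nodes.add(succ)
                    queue.append(succ)
    return nodes, edges

# Toy DPN: transition t is enabled while place "p" is marked, and it
# strengthens the constraint with the illustrative guard atom "x>0".
def t_fire(marking, constraint):
    if "p" in marking and "x>0" not in constraint:
        yield (marking, constraint | frozenset({"x>0"}))

nodes, edges = build_constraint_graph(frozenset({"p"}), frozenset(), [("t", t_fire)])
print(len(nodes), len(edges))  # -> 2 1
```

On such a finite graph, language or configuration-space inclusion between two DPNs can then be phrased as a question about the two constraint graphs rather than the infinite underlying state spaces.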
Event abstraction is a crucial step in applying process mining in real-world scenarios. However, practitioners often face challenges in selecting relevant research for their specific needs. To address this, we present a comprehensive framework for understanding event abstraction, comprising four key components: event abstraction sub-problems, consideration of process properties, data types for event abstraction, and various approaches to event abstraction. By systematically examining these components, practitioners can efficiently identify research that aligns with their requirements. Additionally, we analyze existing studies using this framework to provide practitioners with a clearer view of current research and suggest expanded applications of existing methods.
The public sector, private firms, and civil society constantly create data of high volume, velocity, and veracity from diverse sources. This kind of data is known as big data. As in other industries, public administrations consider big data the “new oil” and employ data-centric policies to transform data into knowledge, stimulate good governance, foster innovative digital services and transparency, and engage citizens in public policy. More and more public organizations understand the value created by exploiting internal and external data sources, delivering new capabilities, and fostering collaboration inside and outside of public administrations. Despite the broad interest in this ecosystem, we still lack a detailed and systematic view of it. In this paper, we attempt to describe the emerging Government Big Data Ecosystem as a socio-technical network of people, organizations, processes, technology, infrastructure, standards & policies, procedures, and resources. This ecosystem supports data functions such as data collection, integration, analysis, storage, sharing, use, protection, and archiving. Through these functions, value is created by promoting evidence-based policymaking, modern public service delivery, data-driven administration, and open government, and by boosting the data economy. Following a Design Science Research methodology, we propose a conceptual framework, which we call ‘datagov.eco’. We believe our ‘datagov.eco’ framework will provide insights and support to a range of stakeholder profiles, including administrators, consultants, data engineers, and data scientists.
For a range of reasons, organisations collect data on the work behaviour of their employees. However, each data collection technique displays its own unique mix of intrusiveness, information richness, and risks. To understand the differences between data collection techniques, we conducted a multiple-case study in a multinational professional services organisation, tracking six participants throughout a workday using non-participant observation, screen recording, and timesheet techniques. This yielded 136 hours of data. Our findings show that no single data collection technique can provide a comprehensive and accurate account of activities that are screen-based, conducted offline, or performed as overtime. The collected data also provided an opportunity to investigate the use of process mining for analysing employee behaviour, specifically with respect to the completeness of the collected data. Our study underlines the importance of judiciously selecting data collection techniques, as well as using a sufficiently broad data set, to generate reliable insights into employee behaviour.
Data streams, continuous sequences of timestamped data points, necessitate real-time monitoring due to their time-sensitive nature. In various data stream applications, such as network security and credit card transaction monitoring, real-time detection of outliers is crucial, as these outliers often signify potential threats. Equally important is the real-time explanation of outliers, which enables users to glean insights and thereby shortens their investigation time. The investigation time for an outlier is closely tied to its number of attributes, making it essential to provide explanations that detail which attributes are responsible for the abnormality of a data point, referred to as outlying attributes. However, the unbounded volume of data and the concept drift of data streams pose challenges for discovering the outlying attributes of outliers in real time. In response, we propose EXOS, an algorithm for discovering the outlying attributes of multi-dimensional outliers in data streams. EXOS leverages cross-correlations among data streams, accommodates varying data stream schemas and arrival rates, and effectively addresses the challenges of unbounded data volume and concept drift. The algorithm is model-agnostic with respect to point outlier detection and provides real-time explanations based on the local context of the outlier, derived from time-based tumbling windows. The paper provides a complexity analysis of EXOS and an experimental comparison with existing algorithms, assessing performance on both real-world and synthetic datasets in terms of average precision, recall, F1-score, and explanation time. The results show that, on average, EXOS achieves a 45.6% higher F1-score and a 7.3-times shorter explanation time than existing outlying-attribute algorithms.
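To make the notion of outlying attributes concrete, the sketch below shows a naive per-window baseline, not EXOS itself: it scores each attribute of a flagged outlier by its z-score against the inliers of the current tumbling window. The window shape, threshold, and function names are assumptions, and EXOS's use of cross-correlations among streams is deliberately not reproduced.

```python
# Naive baseline for outlying-attribute discovery (illustrative only, not
# EXOS): an attribute is deemed outlying if the flagged point deviates from
# the current tumbling window's inliers by more than z_threshold std devs.
import numpy as np

def outlying_attributes(window, outlier, z_threshold=3.0):
    """window: (n_points, n_attrs) array of inliers from one tumbling window;
    outlier: (n_attrs,) point flagged by any upstream detector (model-agnostic).
    Returns the indices of attributes responsible for the abnormality."""
    mu = window.mean(axis=0)
    sigma = window.std(axis=0) + 1e-9          # guard against zero variance
    z = np.abs((outlier - mu) / sigma)
    return [i for i, score in enumerate(z) if score > z_threshold]

# Example: attribute 1 drives the abnormality of the flagged point.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 1.0, size=(100, 3))
print(outlying_attributes(w, np.array([0.1, 9.0, -0.2])))  # -> [1]
```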
Clustering is a fundamental task in data mining, aiming to group similar objects together based on their features or attributes. With the rapid growth in data volumes and the increasing complexity of high-dimensional data distributions, clustering has become ever more important in numerous applications, including image analysis, text mining, and anomaly detection. DBSCAN is a powerful and widely used density-based clustering algorithm. However, DBSCAN and its variants struggle with datasets that exhibit clusters of varying densities in intricate high-dimensional spaces affected by significant disturbance factors. A typical example is multi-density clusters connected by a few data points with strong internal correlations, a scenario commonly encountered in the analysis of crowd mobility. To address these challenges, we propose a Self-adaptive Density-Based Clustering Algorithm for Varying Densities Datasets with Strong Disturbance Factor (SADBSCAN). The algorithm comprises a data block splitter, a local clustering module, a global clustering module, and a data block merger, which together yield adaptive clustering results. We conduct extensive experiments on both artificial and real-world datasets to evaluate the effectiveness of SADBSCAN. The experimental results indicate that SADBSCAN significantly outperforms several strong baselines across different metrics, demonstrating the high adaptability and scalability of our algorithm.
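As a rough illustration of the described split/local-cluster/merge pipeline, the sketch below partitions the data into blocks, runs plain DBSCAN per block, and then fuses clusters whose centroids lie close together. It is not SADBSCAN: the self-adaptive parameter selection and disturbance handling are omitted, and the splitter and merge heuristic are assumptions made for the example.

```python
# Illustrative split/local-cluster/merge skeleton, not the authors' SADBSCAN:
# the block splitter, merge rule, and all parameters are assumptions.
import numpy as np
from sklearn.cluster import DBSCAN

def block_split(X, n_blocks=4):
    """Naive data block splitter: partition point indices along the first axis."""
    return np.array_split(np.argsort(X[:, 0]), n_blocks)

def sadbscan_like(X, eps=0.5, min_samples=5, merge_eps=1.0):
    labels = -np.ones(len(X), dtype=int)          # -1 marks noise
    next_id = 0
    for idx in block_split(X):                    # local clustering per block
        local = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X[idx])
        for lid in set(local) - {-1}:             # relabel local clusters globally
            labels[idx[local == lid]] = next_id
            next_id += 1
    # Data block merger: union clusters whose centroids lie within merge_eps.
    ids = sorted(i for i in set(labels) if i != -1)
    parent = {i: i for i in ids}
    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i
    centroids = {i: X[labels == i].mean(axis=0) for i in ids}
    for a in ids:
        for b in ids:
            if a < b and np.linalg.norm(centroids[a] - centroids[b]) < merge_eps:
                parent[find(b)] = find(a)
    return np.array([find(l) if l != -1 else -1 for l in labels])
```

Merging by centroid distance is the simplest possible global clustering step; the point of the sketch is only the four-stage structure (split, cluster locally, cluster globally, merge), which the abstract attributes to SADBSCAN.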