Through the application of process mining, business processes can be improved on the basis of process execution data captured in event logs. Naturally, the quality of this data determines the quality of the improvement recommendations. Improving data quality is non-trivial, and there is great potential to exploit unstructured text, e.g. from notes, reviews, and comments, for this purpose and to enrich event logs. To this end, this paper introduces Text2EL+, a three-phase approach to enriching event logs using unstructured text. In its first phase, events and (case and event) attributes are derived from unstructured text linked to organisational processes. In its second phase, these events and attributes undergo semantic and contextual validation before their incorporation in the event log. In its third and final phase, recognising the importance of human domain expertise, expert guidance is used to further improve data quality by removing redundant and irrelevant events. Expert input is used to train a Named Entity Recognition (NER) model with customised tags to detect event log elements. The approach applies natural language processing techniques, sentence embeddings, training pipelines and models, as well as contextual and expression validation. Various unstructured clinical notes associated with a healthcare case study were analysed, and the completeness, concordance, and correctness of the derived event log elements were evaluated through experiments. The results show that the proposed method is feasible and applicable.
{"title":"Text2EL+: Expert Guided Event Log Enrichment using Unstructured Text","authors":"D. T. K. Geeganage, M. Wynn, A. Hofstede","doi":"10.1145/3640018","DOIUrl":"https://doi.org/10.1145/3640018","url":null,"abstract":"Through the application of process mining, business processes can be improved on the basis of process execution data captured in event logs. Naturally, the quality of this data determines the quality of the improvement recommendations. Improving data quality is non-trivial and there is great potential to exploit unstructured text, e.g. from notes, reviews, and comments, for this purpose and to enrich event logs. To this end, this paper introduces Text2EL+ a three-phase approach to enrich event logs using unstructured text. In its first phase, events and (case and event) attributes are derived from unstructured text linked to organisational processes. In its second phase, these events and attributes undergo a semantic and contextual validation before their incorporation in the event log. In its third and final phase, recognising the importance of human domain expertise, expert guidance is used to further improve data quality by removing redundant and irrelevant events. Expert input is used to train a Named Entity Recognition (NER) model with customised tags to detect event log elements. The approach applies natural language processing techniques, sentence embeddings, training pipelines and models, as well as contextual and expression validation. Various unstructured clinical notes associated with a healthcare case study were analysed and completeness, concordance, and correctness of the derived event log elements were evaluated through experiments. 
The results show that the proposed method is feasible and applicable.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"5 8","pages":""},"PeriodicalIF":2.1,"publicationDate":"2024-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139440108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
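As an illustration of the first phase, the sketch below extracts candidate event-log elements from a clinical-note fragment. It is a minimal rule-based stand-in: the paper trains an NER model with customised tags, whereas here hand-written regular expressions play that role, and the tag names (TIMESTAMP, ACTIVITY, RESOURCE) and patterns are assumptions for demonstration only.

```python
import re

# Illustrative custom tags standing in for the trained NER model's labels.
# Both the tag set and the patterns are assumptions for demonstration.
PATTERNS = {
    "TIMESTAMP": re.compile(r"\b\d{4}-\d{2}-\d{2}(?: \d{2}:\d{2})?\b"),
    "ACTIVITY": re.compile(r"\b(?:admitted|discharged|transferred|examined)\b", re.I),
    "RESOURCE": re.compile(r"\bDr\.\s+[A-Z][a-z]+"),
}

def extract_event_elements(note):
    """Return (tag, text) pairs found in an unstructured note, in text order."""
    hits = []
    for tag, pattern in PATTERNS.items():
        for match in pattern.finditer(note):
            hits.append((match.start(), tag, match.group(0)))
    return [(tag, text) for _, tag, text in sorted(hits)]

note = "Patient admitted on 2023-05-01 09:30 and examined by Dr. Smith."
print(extract_event_elements(note))
```

A learned model would generalise beyond these fixed patterns, but the output shape (tagged spans, later validated and mapped to events and attributes) is the same.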
Valentina Golendukhina, Harald Foidl, Daniel Hörl, Michael Felderer
The Internet of Things (IoT) is rapidly growing and spreading across different markets, including the customer market and consumer IoT (CIoT). The large variety of gadgets and their availability make CIoT increasingly influential, especially in the wearable and smart home domains. However, this variety of devices and their inconsistent quality, due to varying hardware costs, affect the data such devices produce. In this article, a catalog of CIoT properties is introduced which enables the prediction of data quality. The catalog contains six categories and 21 properties with descriptions and trust score calculation methods. A diagramming tool is implemented to support and facilitate the evaluation process. The tool was assessed in an experimental setting with 14 users and received positive feedback. Additionally, we provide an exemplary application for smartwatch devices and compare the results obtained with the approach against user evaluations based on feedback from 158 smartwatch owners. The method-based ranking does not match the rankings of regular users; however, it yields outcomes comparable to the assessments of experienced users.
{"title":"A Catalog of Consumer IoT Device Characteristics for Data Quality Estimation","authors":"Valentina Golendukhina, Harald Foidl, Daniel Hörl, Michael Felderer","doi":"10.1145/3639708","DOIUrl":"https://doi.org/10.1145/3639708","url":null,"abstract":"The Internet of Things (IoT) is rapidly growing and spreading across different markets, including the customer market and consumer IoT (CIoT). The large variety of gadgets and their availability makes CIoT more and more influential, especially in the wearable and smart home domains. However, the large variety of devices and their inconsistent quality due to varying hardware costs have an influence on the data produced by such devices. In this article, a catalog of CIoT properties is introduced, which enables the prediction of data quality. The data quality catalog contains six categories and 21 properties with descriptions and trust score calculation methods. A diagramming tool is implemented to support and facilitate the process of evaluation. The tool was assessed in an experimental setting with 14 users and received positive feedback. Additionally, we provide an exemplary application for smartwatch devices and compare the results obtained with the approach with the users’ evaluation based on the feedback from 158 smartwatch owners. As a result, the method-based ranking does not provide similar results to the regular users. 
However, it yields comparable outcomes to the assessment conducted by experienced users.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"28 12","pages":""},"PeriodicalIF":2.1,"publicationDate":"2024-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139444060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
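For a single device, the catalog's trust score calculation methods could take a form like the following weighted average over per-property scores. The property names, scores, and weights below are invented for illustration and are not taken from the catalog itself.

```python
# Hypothetical per-property scores in [0, 1] for one smartwatch; the catalog's
# six categories and 21 properties are real, but these names, values, and
# weights are assumptions for demonstration.
device_scores = {
    "sensor_accuracy": 0.9,
    "firmware_updates": 0.6,
    "connectivity": 0.8,
    "vendor_transparency": 0.4,
}
weights = {
    "sensor_accuracy": 0.4,
    "firmware_updates": 0.2,
    "connectivity": 0.2,
    "vendor_transparency": 0.2,
}

def trust_score(scores, weights):
    """Weighted average of per-property scores: one plausible
    trust-score calculation method, normalised by the weights used."""
    total_weight = sum(weights[p] for p in scores)
    return sum(scores[p] * weights[p] for p in scores) / total_weight

print(round(trust_score(device_scores, weights), 3))  # -> 0.72
```

Ranking devices by such a score is what the article compares against the rankings produced by regular and experienced users.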
G. J. Richard (Thales DMS), J. Habonneau and D. Gueriot (IMT Atlantique, France)
In critical operational contexts such as Mine Warfare, Automatic Target Recognition (ATR) algorithms are still hardly accepted. The complexity of their decision-making hampers understanding of predictions, despite performance approaching that of human experts. Much research has been done in the field of Explainable Artificial Intelligence (XAI) to avoid this "black box" effect. This field of research attempts to provide explanations for the decision-making of complex networks in order to promote their acceptability. Most explanation methods applied to image classifier networks provide heat maps, which highlight pixels according to their importance in decision-making. In this work, we first implement different XAI methods for the automatic classification of Synthetic Aperture Sonar (SAS) images by convolutional neural networks (CNNs). These methods are based on a post-hoc approach. We study and compare the different heat maps obtained. Secondly, we evaluate the benefits and usefulness of explainability in an operational framework for collaboration. To do this, user tests are carried out with different levels of assistance, ranging from classification by an unaided operator to classification with explained ATR. These tests allow us to study whether heat maps are useful in this context. The results show that heat-map explanations have disputed utility according to the operators. The presence of heat maps does not increase the quality of the classifications; on the contrary, it even increases response time. Nevertheless, half of the operators see a certain usefulness in heat-map explanations.
{"title":"AI explainibility and acceptance; a case study for underwater mine hunting","authors":"Gj. Richard, Thales Dms, Imt Atlantique, France J. Habonneau, France D. Gueriot, France","doi":"10.1145/3635113","DOIUrl":"https://doi.org/10.1145/3635113","url":null,"abstract":"In critical operational context such as Mine Warfare, Automatic Target Recognition (ATR) algorithms are still hardly accepted. The complexity of their decision-making hampers understanding of predictions despite performances approaching human expert ones. Much research has been done in Explainability Artificial Intelligence (XAI) field to avoid this ”black box” effect. This field of research attempts to provide explanations for the decision-making of complex networks to promote their acceptability. Most of the explanation methods applied on image classifier networks provide heat maps. These maps highlight pixels according to their importance in decision-making. In this work, we first implement different XAI methods for the automatic classification of Synthetic Aperture Sonar (SAS) images by convolutional neural networks (CNN). These different methods are based on a Post-Hoc approach. We study and compare the different heat maps obtained. Secondly, we evaluate the benefits and the usefulness of explainability in an operational framework for collaboration. To do this, different user tests are carried out with different levels of assistance ranging from classification for an unaided operator, to classification with explained ATR. These tests allow us to study whether heat maps are useful in this context. The results obtained show that the heat maps explanation have a disputed utility according to the operators. Heat map presence does not increase the quality of the classifications. On the contrary, it even increases the response time. 
Nevertheless, half of operators see a certain usefulness in heat maps explanation.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"8 3","pages":""},"PeriodicalIF":2.1,"publicationDate":"2023-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138952091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
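One model-agnostic way to produce such post-hoc heat maps is occlusion sensitivity: mask part of the input and record how much the classifier's score drops. The sketch below applies it pixel-by-pixel to a toy scorer standing in for the SAS image CNN; the scorer and the tiny image are illustrative assumptions, not the paper's setup.

```python
# Occlusion sensitivity: importance of a pixel = score drop when it is zeroed.
# The "classifier" here is a toy stand-in for a CNN class score.
def toy_score(image):
    # Pretend the target class depends on the bright region (values > 0).
    return sum(v for row in image for v in row if v > 0)

def occlusion_heatmap(image, score_fn):
    """Build a heat map by occluding each pixel in turn."""
    base = score_fn(image)
    heat = []
    for i, row in enumerate(image):
        heat_row = []
        for j, _ in enumerate(row):
            occluded = [r[:] for r in image]  # copy, then zero one pixel
            occluded[i][j] = 0
            heat_row.append(base - score_fn(occluded))
        heat.append(heat_row)
    return heat

image = [[0, 5], [3, 0]]
print(occlusion_heatmap(image, toy_score))  # bright pixels get high importance
```

Real XAI libraries compute comparable maps far more efficiently (e.g. gradient-based methods), but the interpretation shown to operators, namely pixels ranked by their influence on the decision, is the same.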
Julian Le Deunf, Arwa Khannoussi, Laurent Lecornu, Patrick Meyer, John Puentes
Evaluating the quality of data is an inherently multi-dimensional problem that frequently depends on the perspective of an expected use or final purpose of the data. Numerous works have explored the well-known specification of data quality dimensions in various application domains, without addressing the inter-dependencies and aggregation of quality attributes for decision support. In this work we therefore propose a context-dependent formal process to evaluate the quality of data, which integrates a preference model from the field of Multi-Criteria Decision Aiding. The parameters of this preference model are determined through interviews with work-domain experts. We demonstrate the value of the proposal on a case study related to the evaluation of the quality of hydrographical survey data.
{"title":"Data quality assessment through a preference model","authors":"Julian Le Deunf, Arwa Khannoussi, Laurent Lecornu, Patrick Meyer, John Puentes","doi":"10.1145/3632407","DOIUrl":"https://doi.org/10.1145/3632407","url":null,"abstract":"Evaluating the quality of data is a problem of a multi-dimensional nature and quite frequently depends on the perspective of an expected use or final purpose of the data. Numerous works have explored the well-known specification of data quality dimensions in various application domains, without addressing the inter-dependencies and aggregation of quality attributes for decision support. In this work we therefore propose a context-dependent formal process to evaluate the quality of data which integrates a preference model from the field of Multi-Criteria Decision Aiding. The parameters of this preference model are determined through interviews with work-domain experts. We show the interest of the proposal on a case study related to the evaluation of the quality of hydrographical survey data.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"20 1","pages":""},"PeriodicalIF":2.1,"publicationDate":"2023-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139214628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This Special Issue of the Journal of Data and Information Quality (JDIQ) contains novel theoretical and methodological contributions as well as state-of-the-art reviews and research perspectives on quality aspects of data preparation. In this editorial, we summarize the scope of the issue and briefly describe its content.
{"title":"Editorial: Special Issue on Quality Aspects of Data Preparation","authors":"Marco Console, Maurizio Lenzerini","doi":"10.1145/3626461","DOIUrl":"https://doi.org/10.1145/3626461","url":null,"abstract":"This Special Issue of the Journal of Data and Information Quality (JDIQ) contains novel theoretical and methodological contributions as well as state-of-the-art reviews and research perspectives on quality aspects of data preparation. In this editorial, we summarize the scope of the issue and briefly describe its content.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"1180 1","pages":"1 - 2"},"PeriodicalIF":2.1,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139294854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The success of artificial intelligence (AI) applications is heavily dependent on the quality of the data they rely on. Thus, data curation, which deals with cleaning, organising and managing data, has become a significant research area. Increasingly, semantic data structures such as ontologies and knowledge graphs empower the new generation of AI systems. In this paper, we focus on ontologies as a special type of data. Ontologies are conceptual data structures representing a domain of interest and are often used as a backbone for knowledge-based intelligent systems or as an additional input for machine learning algorithms. Low-quality ontologies, containing incorrectly represented information or controversial concepts modelled from a single viewpoint, can lead to invalid application outputs and biased systems. Thus, we focus on the curation of ontologies as a crucial factor for ensuring trust in the AI systems they enable. While some ontology quality aspects can be evaluated automatically, others require a human-in-the-loop evaluation. Yet, despite the importance of the field, several ontology quality aspects have not yet been addressed, and there is a lack of guidelines for the optimal design of human computation tasks to perform such evaluations. In this paper, we advance the state of the art by making two novel contributions. First, we propose a human computation (HC)-based approach for the verification of ontology restrictions, an ontology evaluation aspect that has not yet been addressed with HC techniques. Second, by performing two controlled experiments with a junior expert crowd, we empirically derive task design guidelines for achieving high-quality evaluation results related to (i) the formalism for representing ontology axioms and (ii) crowd qualification testing. We find that the representation format of the ontology does not significantly influence the campaign results; nevertheless, contributors expressed a preference for working with a graphical ontology representation. Additionally, we show that an objective qualification test is better suited to assessing contributors' prior knowledge than a subjective self-assessment, and that contributors' prior modelling knowledge had a positive effect on their judgements. We make all artefacts designed and used in the experimental campaign publicly available.
{"title":"Enhancing Human-in-the-Loop Ontology Curation Results through Task Design","authors":"Stefani Tsaneva, Marta Sabou","doi":"10.1145/3626960","DOIUrl":"https://doi.org/10.1145/3626960","url":null,"abstract":"The success of artificial intelligence (AI) applications is heavily dependant on the quality of data they rely on. Thus, data curation, dealing with cleaning, organising and managing data, has become a significant research area to be addressed. Increasingly, semantic data structures such as ontologies and knowledge graphs empower the new generation of AI systems. In this paper, we focus on ontologies, as a special type of data. Ontologies are conceptual data structures representing a domain of interest and are often used as a backbone to knowledge-based intelligent systems or as an additional input for machine learning algorithms. Low-quality ontologies, containing incorrectly represented information or controversial concepts modelled from a single viewpoint can lead to invalid application outputs and biased systems. Thus, we focus on the curation of ontologies as a crucial factor for ensuring trust in the enabled AI systems. While some ontology quality aspects can be automatically evaluated, others require a human-in-the-loop evaluation. Yet, despite the importance of the field several ontology quality aspects have not yet been addressed and there is a lack of guidelines for optimal design of human computation tasks to perform such evaluations. In this paper, we advance the state-of-the-art by making two novel contributions: First, we propose a human-computation (HC)-based approach for the verification of ontology restrictions - an ontology evaluation aspect that has not yet been addressed with HC techniques. 
Second, by performing two controlled experiments with a junior expert crowd, we empirically derive task design guidelines for achieving high-quality evaluation results related to i) the formalism for representing ontology axioms and ii) crowd qualification testing . We find that the representation format of the ontology does not significantly influence the campaign results, nevertheless, contributors expressed a preference in working with a graphical ontology representation. Additionally we show that an objective qualification test is better fitted at assessing contributors’ prior knowledge rather than a subjective self-assessment and that prior modelling knowledge of the contributors had a positive effect on their judgements. We make all artefacts designed and used in the experimental campaign publicly available.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135347474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chinmay Chakraborty, Mohammad Hossein Khosravi, Muhammad Khurram Khan, Houbing Song
This editorial summarizes the content of the collection on Multimodality, Multidimensional Representation, and Multimedia Quality Assessment Toward Information Quality in Social Web of Things for the Journal of Data and Information Quality.
{"title":"Editorial: Multimodality, Multidimensional Representation, and Multimedia Quality Assessment Toward Information Quality in Social Web of Things","authors":"Chinmay Chakraborty, Mohammad Hossein Khosravi, Muhammad Khurram Khan, Houbing Song","doi":"10.1145/3625102","DOIUrl":"https://doi.org/10.1145/3625102","url":null,"abstract":"This editorial summarizes the content of the collection on Multimodality, Multidimensional Representation, and Multimedia Quality Assessment Toward Information Quality in Social Web of Things for the Journal of Data and Information Quality.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"23 1","pages":"1 - 3"},"PeriodicalIF":2.1,"publicationDate":"2023-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139332411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating retrieval performance without editorial relevance judgments is challenging; instead, user interactions can be used as relevance signals. Living labs offer a way for small-scale platforms to validate information retrieval systems with real users. If enough user interaction data are available, click models can be parameterized from historical sessions to evaluate systems before exposing users to experimental rankings. However, interaction data are sparse in living labs, and little is known about how click models can be validated for reliable user simulations when click data are available only in moderate amounts. This work introduces an evaluation approach for validating synthetic usage data generated by click models in data-sparse human-in-the-loop environments like living labs. We ground our methodology on the click model's estimates of a system ranking compared to a reference ranking for which the relative performance is known. Our experiments compare different click models and their reliability and robustness as more session log data becomes available. In our setup, simple click models can reliably determine the relative system performance with as few as 20 logged sessions for 50 queries. In contrast, more complex click models require more session data for reliable estimates, but they are a better choice in simulated interleaving experiments when enough session data are available. While it is easier for click models to distinguish between more diverse systems, it is harder to reproduce the system ranking based on the same retrieval algorithm with different interpolation weights. Our setup is entirely open, and we share the code to reproduce the experiments.
{"title":"Validating Synthetic Usage Data in Living Lab Environments","authors":"Timo Breuer, Norbert Fuhr, Philipp Schaer","doi":"10.1145/3623640","DOIUrl":"https://doi.org/10.1145/3623640","url":null,"abstract":"Evaluating retrieval performance without editorial relevance judgments is challenging, but instead, user interactions can be used as relevance signals. Living labs offer a way for small-scale platforms to validate information retrieval systems with real users. If enough user interaction data are available, click models can be parameterized from historical sessions to evaluate systems before exposing users to experimental rankings. However, interaction data are sparse in living labs, and little is studied about how click models can be validated for reliable user simulations when click data are available in moderate amounts. This work introduces an evaluation approach for validating synthetic usage data generated by click models in data-sparse human-in-the-loop environments like living labs. We ground our methodology on the click model's estimates about a system ranking compared to a reference ranking for which the relative performance is known. Our experiments compare different click models and their reliability and robustness as more session log data becomes available. In our setup, simple click models can reliably determine the relative system performance with already 20 logged sessions for 50 queries. In contrast, more complex click models require more session data for reliable estimates, but they are a better choice in simulated interleaving experiments when enough session data are available. While it is easier for click models to distinguish between more diverse systems, it is harder to reproduce the system ranking based on the same retrieval algorithm with different interpolation weights. 
Our setup is entirely open, and we share the code to reproduce the experiments.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135925519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
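A position-based model (PBM) is one of the simpler click models referred to above: the probability of a click factorises into the probability that a rank is examined and the attractiveness of the document shown there. The sketch below estimates per-document attractiveness from logged sessions under fixed, assumed examination probabilities; a real parameterization would jointly estimate both quantities from the sessions, e.g. via expectation-maximization.

```python
# Position-based click model (PBM): P(click) = P(rank examined) * attractiveness.
# The per-rank examination probabilities are fixed assumptions for this sketch.
EXAM = {1: 1.0, 2: 0.5, 3: 0.25}

def estimate_attractiveness(sessions):
    """sessions: one dict per logged session, mapping doc -> (rank, clicked)."""
    clicks, exams = {}, {}
    for session in sessions:
        for doc, (rank, clicked) in session.items():
            clicks[doc] = clicks.get(doc, 0) + clicked
            exams[doc] = exams.get(doc, 0.0) + EXAM[rank]
    # Attractiveness = observed clicks / expected examinations.
    return {doc: clicks[doc] / exams[doc] for doc in clicks}

sessions = [
    {"d1": (1, 1), "d2": (2, 0)},
    {"d1": (1, 1), "d2": (2, 1)},
    {"d1": (2, 0), "d2": (1, 1)},
]
print(estimate_attractiveness(sessions))
```

Once fitted, such a model can generate synthetic clicks for experimental rankings, which is exactly the kind of simulated usage data whose validity the paper investigates.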
A data-driven public sector recognizes data as a key element for implementing policies based on evidence. The open data movement has been a major catalyst for elevating data to a privileged position in many governments around the globe. In Panama, open data has enabled the improvement of data management in each institution. However, it is necessary to go further and create an integrated, data-driven government with a common objective. Public institutions collect a huge amount of data that may never be used, and some of it lacks the quality needed to produce trustworthy results. The state of emergency caused by COVID-19 showed the necessity of establishing a common digital government vision for planning, delivering, and monitoring public services, as well as strengthening the technical foundation in the public sector to improve the data value cycle: acquisition, storage, and exploitation.
{"title":"Experience: Data Management for delivering COVID-19 relief in Panama","authors":"Luis Del Vasto-Terrientes","doi":"10.1145/3623511","DOIUrl":"https://doi.org/10.1145/3623511","url":null,"abstract":"A data-driven public sector recognizes data as a key element for implementing policies based on evidence. The open data movement has been a major catalyst for elevating data to a privileged position in many governments around the globe. In Panama, open data has enabled the improvement of data management in each institution. However, it is required to go further to create an integrated data-driven government with a common objective. Public institutions collect a huge amount of data that may never be used, and some others do not contain enough quality to provide trustworthy results. The state of emergency caused by the COVID-19 showed the necessity of establishing a common digital government vision for planning, delivering, and monitoring public services, as well as strengthening the technical foundation in the public sector to improve data value cycle: acquisition, storage, and exploitation. 
This paper reports from a data custodian perspective how the state of emergency worked as a catalyst to boost government data management, specifically for the Vale Digital program, a social relief linked to the identity card implemented by the Panamanian government during the COVID-19 pandemic, which may possibly be the greatest government data integration to date in terms of impact, data volume, rapid implementation, and institutions involved.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136060963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. T. Ter Hofstede, A. Koschmider, Andrea Marrella, R. Andrews, D. Fischer, Sareh Sadeghianasl, M. Wynn, M. Comuzzi, Jochen De Weerdt, Kanika Goel, Niels Martin, P. Soffer
Since its emergence over two decades ago, process mining has flourished as a discipline, with numerous contributions to its theory, widespread practical applications, and mature support by commercial tooling environments. However, its potential for significant organisational impact is hampered by poor quality event data. Process mining starts with the acquisition and preparation of event data coming from different data sources. These are then transformed into event logs, consisting of process execution traces including multiple events. In real-life scenarios, event logs suffer from significant data quality problems, which must be recognised and effectively resolved for obtaining meaningful insights from process mining analysis. Despite its importance, the topic of data quality in process mining has received limited attention. In this paper, we discuss the emerging challenges related to process-data quality from both a research and practical point of view. Additionally, we present a corresponding research agenda with key research directions.
{"title":"Process-Data Quality: The True Frontier of Process Mining","authors":"A. T. Ter Hofstede, A. Koschmider, Andrea Marrella, R. Andrews, D. Fischer, Sareh Sadeghianasl, M. Wynn, M. Comuzzi, Jochen De Weerdt, Kanika Goel, Niels Martin, P. Soffer","doi":"10.1145/3613247","DOIUrl":"https://doi.org/10.1145/3613247","url":null,"abstract":"Since its emergence over two decades ago, process mining has flourished as a discipline, with numerous contributions to its theory, widespread practical applications, and mature support by commercial tooling environments. However, its potential for significant organisational impact is hampered by poor quality event data. Process mining starts with the acquisition and preparation of event data coming from different data sources. These are then transformed into event logs, consisting of process execution traces including multiple events. In real-life scenarios, event logs suffer from significant data quality problems, which must be recognised and effectively resolved for obtaining meaningful insights from process mining analysis. Despite its importance, the topic of data quality in process mining has received limited attention. In this paper, we discuss the emerging challenges related to process-data quality from both a research and practical point of view. 
Additionally, we present a corresponding research agenda with key research directions.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"19 1","pages":"1 - 21"},"PeriodicalIF":2.1,"publicationDate":"2023-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89016744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}