BIGQA: Declarative Big Data Quality Assessment
Hadi Fadlallah, R. Kilany, Houssein Dhayne, Rami El Haddad, R. Haque, Y. Taher, Ali Jaber
In the big data domain, data quality assessment operations are often complex and must be implementable in a distributed and timely manner. This article generalizes quality assessment operations by providing a new ISO-based declarative data quality assessment framework (BIGQA). BIGQA is a flexible solution that supports data quality assessment in different domains and contexts. It facilitates the planning and execution of big data quality assessment operations for data domain experts and data management specialists at any phase of the data life cycle. This work implements BIGQA to demonstrate its ability to produce customized data quality reports while running efficiently on parallel or distributed computing frameworks. BIGQA generates data quality assessment plans using straightforward operators designed to handle big data and guarantee a high degree of parallelism when executed. Moreover, it allows incremental data quality assessment, avoiding a full scan of the dataset each time a quality assessment operation is required. The results were validated using wireless radiation sensor data and Stack Overflow users’ data to show that the framework can be applied in different contexts. The experiments show a 71% performance improvement on a 1 GB flat file on a single processing machine compared with a non-parallel application, and a 75% performance improvement on a 25 GB flat file in a distributed environment compared with a non-distributed application.
{"title":"BIGQA: Declarative Big Data Quality Assessment","authors":"Hadi Fadlallah, R. Kilany, Houssein Dhayne, Rami El Haddad, R. Haque, Y. Taher, Ali Jaber","doi":"10.1145/3603706","DOIUrl":"https://doi.org/10.1145/3603706","url":null,"abstract":"In the big data domain, data quality assessment operations are often complex and must be implementable in a distributed and timely manner. This article tries to generalize the quality assessment operations by providing a new ISO-based declarative data quality assessment framework (BIGQA). BIGQA is a flexible solution that supports data quality assessment in different domains and contexts. It facilitates the planning and execution of big data quality assessment operations for data domain experts and data management specialists at any phase in the data life cycle. This work implements BIGQA to demonstrate its ability to produce customized data quality reports while running efficiently on parallel or distributed computing frameworks. BIGQA generates data quality assessment plans using straightforward operators designed to handle big data and guarantee a high degree of parallelism when executed. Moreover, it allows incremental data quality assessment to avoid reading the whole dataset each time the quality assessment operation is required. The result was validated using radiation wireless sensor data and Stack Overflow users’ data to show that it can be implemented within different contexts. The experiments show a 71% performance improvement over a 1 GB flat file on a single processing machine compared with a non-parallel application and a 75% performance improvement over a 25 GB flat file within a distributed environment compared to a non-distributed application.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"56 1","pages":"1 - 30"},"PeriodicalIF":2.1,"publicationDate":"2023-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78304350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Synthetic Generation of Multidimensional Data to Improve Classification Model Validity
Ahmad Al-qerem, A. Ali, Hani Attar, S. Nashwan, Lianyong Qi, Mohammad Kazem Moghimi, A. Solyman
This article compares Generative Adversarial Network (GAN) models and feature selection methods for generating synthetic data to improve the validity of a classification model. Synthetic data generation produces new data samples from existing data to increase the diversity of the data and help the model generalize better. The multidimensional aspect of the data refers to the fact that it can have multiple features or variables describing it. GAN models have proven effective at preserving the statistical properties of the original data. However, the order in which data augmentation and feature selection are applied is crucial for building robust and accurate predictive models. By comparing different GAN models combined with feature selection methods on multidimensional datasets, this article aims to determine the best combination to support the validity of a classification model on multidimensional data.
{"title":"Synthetic Generation of Multidimensional Data to Improve Classification Model Validity","authors":"Ahmad Al-qerem, A. Ali, Hani Attar, S. Nashwan, Lianyong Qi, Mohammad Kazem Moghimi, A. Solyman","doi":"10.1145/3603715","DOIUrl":"https://doi.org/10.1145/3603715","url":null,"abstract":"This article aims to compare Generative Adversarial Network (GAN) models and feature selection methods for generating synthetic data in order to improve the validity of a classification model. The synthetic data generation technique involves generating new data samples from existing data to increase the diversity of the data and help the model generalize better. The multidimensional aspect of the data refers to the fact that it can have multiple features or variables that describe it. The GAN models have proven to be effective in preserving the statistical properties of the original data. However, the order of data augmentation and feature selection is crucial to build robust and accurate predictive models. By comparing the different GAN models with feature selection methods on multidimensional datasets, this article aims to determine the best combination to support the validity of a classification model in multidimensional data.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"16 1","pages":"1 - 20"},"PeriodicalIF":2.1,"publicationDate":"2023-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74930745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Novel Feature Selection Method for Risk Management in High-Dimensional Time Series of Cryptocurrency Market
Erfan Varedi, R. Boostani
In this study, we present a novel feature selection approach to address the challenge of classifying positive and negative risk predictions in the highly volatile cryptocurrency market. The approach maximizes information gain while simultaneously minimizing the similarity of the selected features, yielding a feature set that improves classification accuracy. The proposed method was compared with other feature selection techniques, such as sequential and bidirectional feature selection, univariate feature selection, and the least absolute shrinkage and selection operator. To evaluate the feature selection techniques, several classifiers were employed: XGBoost, k-nearest neighbor, support vector machine, random forest, logistic regression, long short-term memory, and deep neural networks. The features were derived from the time series of the Bitcoin, Binance, and Ethereum cryptocurrencies. Applying the selected features to the different classifiers showed that XGBoost and random forest performed best on the time series datasets; furthermore, the proposed feature selection method achieved the best results on two of the three cryptocurrencies. The best-case accuracy varied between 55% and 68% across the different time series. Note that preprocessed features were used in this research: raw candle data were used to derive efficient features that explain the problem and help the classifiers predict the labels.
{"title":"A Novel Feature Selection Method for Risk Management in High-Dimensional Time Series of Cryptocurrency Market","authors":"Erfan Varedi, R. Boostani","doi":"10.1145/3597309","DOIUrl":"https://doi.org/10.1145/3597309","url":null,"abstract":"In this study, a novel approach for feature selection has been presented in order to overcome the challenge of classifying positive and negative risk prediction in the cryptocurrency market, which contains high fluctuation. This approach is based on maximizing information gain with simultaneously minimizing the similarity of selected features to achieve a proper feature set for improving classification accuracy. The proposed method was compared with other feature selection techniques, such as sequential and bidirectional feature selection, univariate feature selection, and least absolute shrinkage and selection operator. To evaluate the feature selection techniques, several classifiers were employed: XGBoost, k-nearest neighbor, support vector machine, random forest, logistic regression, long short-term memory, and deep neural networks. The features were elicited from the time series of Bitcoin, Binance, and Ethereum cryptocurrencies. The results of applying the selected features to different classifiers indicated that XGBoost and random forest provided better results on the time series datasets. Furthermore, the proposed feature selection method achieved the best results on two (out of three) cryptocurrencies. The accuracy in the best state varied between 55% to 68% for different time series. It is worth mentioning that preprocessed features were used in this research, meaning that raw data (candle data) were used to derive efficient features that can explain the problem and help the classifiers in predicting the labels.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"27 1","pages":"1 - 14"},"PeriodicalIF":2.1,"publicationDate":"2023-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81231253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pipeline Design for Data Preparation for Social Media Analysis
Carlo A. Bono, C. Cappiello, B. Pernici, Edoardo Ramalli, Monica Vitali
In a data-driven culture, in which analytics applications are the main resources for supporting decision-making, the use of high-quality datasets is mandatory to minimize errors and risks. For this reason, data analysis tasks need to be preceded by a data preparation pipeline. The design of such a pipeline is not trivial: the data analyst must carefully choose the appropriate operations, considering several aspects. This is often done through a trial-and-error approach that does not always lead to the most effective solution. In addition, extracting information from social media poses specific problems: only posts relevant to the analysis should be considered, relevance depends on the context under study, the content is multimedia, and automatic filters risk discarding informative posts. In this paper, we propose a systematic approach to support the design of pipelines that can effectively extract a dataset relevant to the goal of a social media analysis. We provide a conceptual model for designing and annotating the data preparation pipeline with quality and performance information, thus giving the data analyst preliminary, context-aware information on the expected quality of the resulting dataset. The generation of metadata describing the processing tasks has been recognized as essential for enabling data sharing and reusability. To this end, the dataset resulting from the pipeline application is automatically annotated with provenance metadata that give a detailed description of all the activities performed by the pipeline. As a case study, we consider the design of a pipeline for creating datasets of images extracted from social media in order to analyze behavioural aspects during the COVID-19 pandemic.
{"title":"Pipeline Design for Data Preparation for Social Media Analysis","authors":"Carlo A. Bono, C. Cappiello, B. Pernici, Edoardo Ramalli, Monica Vitali","doi":"10.1145/3597305","DOIUrl":"https://doi.org/10.1145/3597305","url":null,"abstract":"In a data-driven culture, in which analytics applications are the main resources for supporting decision-making, the use of high-quality datasets is mandatory to minimize errors and risks. For this reason, data analysis tasks need to be preceded by a data preparation pipeline. The design of such a pipeline is not trivial: the data analyst must carefully choose the appropriate operations considering several aspects. This is often performed by adopting a trial-and-error approach that does not always lead to the most effective solution. In addition, extracting information from social media poses specific problems due to the need to consider only posts relevant for the analysis, for its dependence from the context being considered, for its multimedia contents, and for the risk of filtering out informative posts with automatic filters. In this paper, we propose a systematic approach to support the design of pipelines that are able to effectively extract a relevant dataset for the goal of the analysis of data from social media. We provide a conceptual model for designing and annotating the data preparation pipeline with quality and performance information, thus providing the data analyst preliminary information on the expected quality of the resulting dataset in a context-aware manner. The generation of metadata related to the processing tasks has been recognized as essential for enabling data sharing and reusability. To this aim, the dataset resulting from the pipeline application is automatically annotated with provenance metadata to get a detailed description of all the activities performed by the pipeline on them. As a case study, we consider the design of a pipeline for creating datasets of images extracted from social media in order to analyze behavioural aspects during COVID-19.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"1 1","pages":""},"PeriodicalIF":2.1,"publicationDate":"2023-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88878734","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Novel Curated Scholarly Graph Connecting Textual and Data Publications
Ornella Irrera, A. Mannocci, P. Manghi, G. Silvello
In the last decade, scholarly graphs have become fundamental to storing and managing scholarly knowledge in a structured and machine-readable way. Methods and tools for the discovery and impact assessment of science rely on such graphs and their quality to serve scientists, policymakers, and publishers. As research data have become central to scholarly communication, scholarly graphs have started to include dataset metadata and the relationships of datasets to publications. Such graphs are the foundation for Open Science investigations, data-article publishing workflows, discovery, and assessment indicators. However, due to the heterogeneity of practices (FAIRness is indeed in the making), they often lack the complete and reliable metadata necessary for accurate data analysis; e.g., dataset metadata is inaccurate, author names are not uniform, and the semantics of the relationships are unknown, ambiguous, or incomplete. This work describes an open and curated scholarly graph that we built and published as a training and test set for data discovery, data connection, author disambiguation, and link prediction tasks. Overall, the graph contains 4,047 publications, 5,488 datasets, 22 software products, and 21,561 authors; 9,692 edges interconnect publications with datasets and software and are labeled with semantics that indicate whether a publication cites, references, documents, or supplements another product. To ensure high-quality metadata and semantics, we relied on information extracted from the PDFs of the publications and from the dataset and software webpages to curate and enrich node metadata and edge semantics. To the best of our knowledge, this is the first published resource that includes publications and datasets with manually validated and curated metadata.
{"title":"A Novel Curated Scholarly Graph Connecting Textual and Data Publications","authors":"Ornella Irrera, A. Mannocci, P. Manghi, G. Silvello","doi":"10.1145/3597310","DOIUrl":"https://doi.org/10.1145/3597310","url":null,"abstract":"In the last decade, scholarly graphs became fundamental to storing and managing scholarly knowledge in a structured and machine-readable way. Methods and tools for discovery and impact assessment of science rely on such graphs and their quality to serve scientists, policymakers, and publishers. Since research data became very important in scholarly communication, scholarly graphs started including dataset metadata and their relationships to publications. Such graphs are the foundations for Open Science investigations, data-article publishing workflows, discovery, and assessment indicators. However, due to the heterogeneity of practices (FAIRness is indeed in the making), they often lack the complete and reliable metadata necessary to perform accurate data analysis; e.g., dataset metadata is inaccurate, author names are not uniform, and the semantics of the relationships is unknown, ambiguous or incomplete. This work describes an open and curated scholarly graph we built and published as a training and test set for data discovery, data connection, author disambiguation, and link prediction tasks. Overall the graph contains 4,047 publications, 5,488 datasets, 22 software, 21,561 authors; 9,692 edges interconnect publications to datasets and software and are labeled with semantics that outline whether a publication is citing, referencing, documenting, supplementing another product. To ensure high-quality metadata and semantics, we relied on the information extracted from PDFs of the publications and the datasets and software webpages to curate and enrich nodes metadata and edges semantics. To the best of our knowledge, this is the first ever published resource, including publications and datasets with manually validated and curated metadata.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"42 1","pages":"1 - 24"},"PeriodicalIF":2.1,"publicationDate":"2023-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78953755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biases in Large Language Models: Origins, Inventory, and Discussion
Roberto Navigli, Simone Conia, Björn Ross
In this article, we introduce and discuss the pervasive issue of bias in the large language models that are currently at the core of mainstream approaches to Natural Language Processing (NLP). We first introduce data selection bias, that is, the bias caused by the choice of texts that make up a training corpus. Then, we survey the different types of social bias evidenced in the text generated by language models trained on such corpora, ranging from gender to age, from sexual orientation to ethnicity, and from religion to culture. We conclude with directions focused on measuring, reducing, and tackling the aforementioned types of bias.
{"title":"Biases in Large Language Models: Origins, Inventory, and Discussion","authors":"Roberto Navigli, Simone Conia, Björn Ross","doi":"10.1145/3597307","DOIUrl":"https://doi.org/10.1145/3597307","url":null,"abstract":"In this article, we introduce and discuss the pervasive issue of bias in the large language models that are currently at the core of mainstream approaches to Natural Language Processing (NLP). We first introduce data selection bias, that is, the bias caused by the choice of texts that make up a training corpus. Then, we survey the different types of social bias evidenced in the text generated by language models trained on such corpora, ranging from gender to age, from sexual orientation to ethnicity, and from religion to culture. We conclude with directions focused on measuring, reducing, and tackling the aforementioned types of bias.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"44 1","pages":"1 - 21"},"PeriodicalIF":2.1,"publicationDate":"2023-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88626153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multimodal Deep Learning with Discriminant Descriptors for Offensive Memes Detection
A. Alzu’bi, Lojin Bani Younis, A. Abuarqoub, M. Hammoudeh
A meme is a visual representation that illustrates a thought or concept. Memes are spreading steadily in this era of rapidly expanding social media platforms and are becoming increasingly popular forms of expression. In the domain of meme and emotion analysis, the detection of offensive content is a crucial task. However, identifying and comprehending the underlying emotion of a meme can be difficult because its content is multimodal. Additionally, there is a lack of meme datasets that address how offensive a meme is, and the existing ones are biased towards the dominant labels or categories, leading to imbalanced training sets. In this article, we present a descriptive, balanced dataset to help detect the offensive nature of meme content using a proposed multimodal deep learning model. Two deep semantic models, baseline BERT and hateXplain-BERT, are systematically combined with several deep ResNet architectures to estimate the severity of offensive memes. This process is based on the Meme-Merge collection, which we construct from two publicly available datasets. The experimental results demonstrate the model’s effectiveness in classifying offensive memes, achieving F1 scores of 0.7315 and 0.7140 on the baseline datasets and Meme-Merge, respectively. The proposed multimodal deep learning approach also outperformed the baseline model in three meme tasks: metaphor understanding, sentiment understanding, and intention detection.
{"title":"Multimodal Deep Learning with Discriminant Descriptors for Offensive Memes Detection","authors":"A. Alzu’bi, Lojin Bani Younis, A. Abuarqoub, M. Hammoudeh","doi":"10.1145/3597308","DOIUrl":"https://doi.org/10.1145/3597308","url":null,"abstract":"A meme is a visual representation that illustrates a thought or concept. Memes are spreading steadily among people in this era of rapidly expanding social media platforms, and they are becoming increasingly popular forms of expression. In the domain of meme and emotion analysis, the detection of offensives is a crucial task. However, it can be difficult to identify and comprehend the underlying emotion of a meme because its content is multimodal. Additionally, there is a lack of memes datasets that address how offensive a meme is, and the existing ones in this context have a bias towards the dominant labels or categories, leading to an imbalanced training set. In this article, we present a descriptive balanced dataset to help detect the offensive nature of the meme’s content using a proposed multimodal deep learning model. Two deep semantic models, baseline BERT and hateXplain-BERT, are systematically combined with several deep ResNet architectures to estimate the severity of the offensive memes. This process is based on the Meme-Merge collection that we construct from two publicly available datasets. The experimental results demonstrate the model’s effectiveness in classifying offensive memes, achieving F1 scores of 0.7315 and 0.7140 for the baseline datasets and Meme-Merge, respectively. The proposed multimodal deep learning approach also outperformed the baseline model in three meme tasks: metaphor understanding, sentiment understanding, and intention detection.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"63 2 1","pages":"1 - 16"},"PeriodicalIF":2.1,"publicationDate":"2023-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78623211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multimodal Social Data Analytics on the Design and Implementation of an EEG-Mechatronic System Interface
Cameron Aume, S. Pal, Alireza Jolfaei, S. Mukhopadhyay
Devices that read electroencephalography (EEG) signals are widely used for brain-computer interfaces (BCIs). The popularity of BCIs has increased in recent years with the development of several consumer-grade EEG devices that can detect human cognitive states in real time and deliver feedback to enhance human performance. Several previous studies have investigated the fundamentals and essential aspects of EEG in BCIs. However, the significant issue of how consumer-grade EEG devices can be used to effectively control mechatronic systems has received less attention. In this article, we design and implement an EEG BCI system using the OpenBCI Cyton headset and a user interface running a game, exploring how a BCI EEG-mechatronic system interface can streamline the interaction between humans and mechatronic systems. Big Multimodal Social Data (BMSD) analytics can be applied to the high-frequency, high-volume EEG data, allowing us to explore aspects of data acquisition, data processing, and data validation, and to evaluate the Quality of Experience (QoE) of our system. Real-world participants played a game to gather training data that was later used to train multiple machine learning models, including linear discriminant analysis (LDA), k-nearest neighbours (KNN), and a convolutional neural network (CNN). After training the machine learning models, a validation phase took place in which participants tried to play the same game without direct control, with the outputs of the machine learning models determining how the game moved. We find that a CNN trained for the specific user was able to control the game and achieved the highest activation accuracy of the machine learning models tested, along with the highest user-rated QoE, providing significant insight for a future implementation with a mechatronic system.
{"title":"Multimodal Social Data Analytics on the Design and Implementation of an EEG-Mechatronic System Interface","authors":"Cameron Aume, S. Pal, Alireza Jolfaei, S. Mukhopadhyay","doi":"10.1145/3597306","DOIUrl":"https://doi.org/10.1145/3597306","url":null,"abstract":"The devices that can read Electroencephalography (EEG) signals have been widely used for Brain-Computer Interfaces (BCIs). Popularity in the field of BCIs has increased in recent years with the development of several consumer-grade EEG devices that can detect human cognitive states in real-time and deliver feedback to enhance human performance. Several previous studies have been conducted to understand the fundamentals and essential aspects of EEG in BCIs. However, the significant issue of how consumer-grade EEG devices can be used to control mechatronic systems effectively has been given less attention. In this article, we have designed and implemented an EEG BCI system using the OpenBCI Cyton headset and a user interface running a game to explore the concept of streamlining the interaction between humans and mechatronic systems with a BCI EEG-mechatronic system interface. Big Multimodal Social Data (BMSD) analytics can be applied to the high-frequency and high-volume EEG data, allowing us to explore aspects of data acquisition, data processing, and data validation and evaluate the Quality of Experience (QoE) of our system. We employ real-world participants to play a game to gather training data that was later put into multiple machine learning models, including a linear discriminant analysis (LDA), k-nearest neighbours (KNN), and a convolutional neural network (CNN). After training the machine learning models, a validation phase of the experiment took place where participants tried to play the same game but without direct control, utilising the outputs of the machine learning models to determine how the game moved. We find that a CNN trained to the specific user was able to control the game and performed with the highest activation accuracy from the machine learning models tested, along with the highest user-rated QoE, which gives us significant insight for future implementation with a mechatronic system.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"52 1","pages":"1 - 25"},"PeriodicalIF":2.1,"publicationDate":"2023-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84697711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Method to Classify Data Quality for Decision Making Under Uncertainty
Vanessa Simard, M. Rönnqvist, L. Lebel, N. Lehoux
Every decision-making process is subject to a certain degree of uncertainty. In sectors where the outcomes of planned operations are uncertain and difficult to control, such as forestry, the data describing the available resources can have a large impact on productivity. When planning activities, such data are often assumed to be accurate, which leads to additional replanning effort. Data verification is kept to a minimum even though using erroneous information increases the level of uncertainty. In this context, it is relevant to develop a process for evaluating whether the data used for planning decisions are appropriate, so as to ensure decision validity and provide information for better understanding and action. However, the level of data quality alone can be difficult to interpret and needs to be put into perspective. This article proposes an extension to most data quality assessment techniques that compares current data quality to past quality levels. A classification method is proposed to evaluate the level of data quality in order to support decision making; the classification provides insight into the level of uncertainty associated with the data. The method is then demonstrated using a theoretical case based on the literature and a practical case from the forest sector. Finally, an example shows how classified data quality can improve decisions in a transportation problem.
{"title":"A Method to Classify Data Quality for Decision Making Under Uncertainty","authors":"Vanessa Simard, M. Rönnqvist, L. Lebel, N. Lehoux","doi":"10.1145/3592534","DOIUrl":"https://doi.org/10.1145/3592534","url":null,"abstract":"Every decision-making process is subject to a certain degree of uncertainty. In sectors where the outcomes of the operations planned are uncertain and difficult to control such as in forestry, data describing the available resources can have a large impact on productivity. When planning activities, it is often assumed that such data are accurate, which causes a need for more replanning efforts. Data verification is kept to a minimum even though using erroneous information increases the level of uncertainty. In this context, it is relevant to develop a process to evaluate whether the data used for planning decisions are appropriate, so as to ensure the decision validity and provide information for better understanding and actions. However, the level of data quality alone can sometimes be difficult to interpret and needs to be put into perspective. This article proposes an extension to most data quality assessment techniques by comparing data to past quality levels. A classification method is proposed to evaluate the level of data quality in order to support decision making. Such classification provides insights into the level of uncertainty associated with the data. The method developed is then exploited using a theoretical case based on the literature and a practical case based on the forest sector. An example of how classified data quality can improve decisions in a transportation problem is finally shown.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"26 1","pages":"1 - 27"},"PeriodicalIF":2.1,"publicationDate":"2023-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73809409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Survey of Data Quality Requirements That Matter in ML Development Pipelines
Margaret A. Priestley, Fionntán O'Donnell, E. Simperl
The fitness of the systems in which Machine Learning (ML) is used depends greatly on good-quality data. Specifications of what makes a good-quality dataset have traditionally been defined by the needs of the data users—typically analysts and engineers. Our article critically examines the extent to which established data quality frameworks are applicable to contemporary use cases in ML. Using a review of recent literature at the intersection of ML, data management, and human-computer interaction, we find that the classical “fitness-for-use” view of data quality can benefit from a more stage-specific approach that is sensitive to where in the ML lifecycle the data are encountered. This helps practitioners to plan their data quality tasks in a manner that meets the needs of the stakeholders who will encounter the dataset, whether they be data subjects, software developers, or organisations. We therefore propose a new treatment of traditional data quality criteria, structuring them along two dimensions: (1) the stage of the ML lifecycle where the use case occurs, and (2) the main categories of data quality that can be pursued (intrinsic, contextual, representational, and accessibility). To illustrate how this works in practice, we contribute a temporal mapping of the various data quality requirements that are important at different stages of the ML data pipeline. We also share some implications for data practitioners and organisations that wish to enhance their data management routines in preparation for ML.
{"title":"A Survey of Data Quality Requirements That Matter in ML Development Pipelines","authors":"Margaret A. Priestley, Fionntán O'Donnell, E. Simperl","doi":"10.1145/3592616","DOIUrl":"https://doi.org/10.1145/3592616","url":null,"abstract":"The fitness of the systems in which Machine Learning (ML) is used depends greatly on good-quality data. Specifications on what makes a good-quality dataset have traditionally been defined by the needs of the data users—typically analysts and engineers. Our article critically examines the extent to which established data quality frameworks are applicable to contemporary use cases in ML. Using a review of recent literature at the intersection of ML, data management, and human-computer interaction, we find that the classical “fitness-for-use” view of data quality can benefit from a more stage-specific approach that is sensitive to where in the ML lifecycle the data are encountered. This helps practitioners to plan their data quality tasks in a manner that meets the needs of the stakeholders who will encounter the dataset, whether it be data subjects, software developers or organisations. We therefore propose a new treatment of traditional data quality criteria by structuring them according to two dimensions: (1) the stage of the ML lifecycle where the use case occurs vs. (2) the main categories of data quality that can be pursued (intrinsic, contextual, representational and accessibility). To illustrate how this works in practice, we contribute a temporal mapping of the various data quality requirements that are important at different stages of the ML data pipeline. We also share some implications for data practitioners and organisations that wish to enhance their data management routines in preparation for ML.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"33 1","pages":"1 - 39"},"PeriodicalIF":2.1,"publicationDate":"2023-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81012919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}