A Coverage-based Approach to Nondiscrimination-aware Data Transformation
Chiara Accinelli, B. Catania, G. Guerrini, Simone Minisi

The development of technological solutions satisfying nondiscriminatory requirements is one of the main current challenges for data processing. Back-end operators for preparing, i.e., extracting and transforming, data play a relevant role with respect to nondiscrimination, since they can introduce bias with an impact on the entire data life-cycle. In this article, we focus on back-end transformations, defined in terms of Select-Project-Join queries, and on coverage. Coverage aims at guaranteeing that the input, or training, dataset includes enough examples for each (protected) category of interest, thus increasing diversity with the aim of limiting the introduction of bias during the subsequent analytical steps. The article proposes an approach to automatically rewrite a transformation whose result violates coverage constraints into the “closest” query satisfying those constraints. The approach is approximate and relies on sample-based cardinality estimation, and thus introduces a trade-off between accuracy and efficiency. The efficiency and effectiveness of the approach are experimentally validated on synthetic and real data.

ACM Journal of Data and Information Quality, pp. 1–26, 2022-07-08. https://doi.org/10.1145/3546913
Unsupervised Identification of Abnormal Nodes and Edges in Graphs
A. Senaratne, P. Christen, Graham J. Williams, Pouya Ghiasnezhad Omran

Much of today’s data are represented as graphs, ranging from social networks to bibliographic citations. Nodes in such graphs correspond to records that generally represent entities, while edges represent relationships between these entities. Both nodes and edges in a graph can have attributes that characterize the entities and their relationships. Relationships are either explicitly known (like friends in a social network) or inferred using link prediction (such as inferring that two babies are siblings because they have the same mother). Any graph representing real-world data likely contains nodes and edges that are abnormal, and identifying these can be important for outlier detection in applications ranging from crime and fraud detection to viral marketing. We propose a novel approach to the unsupervised detection of abnormal nodes and edges in graphs. We first characterize nodes and edges using a set of features, and then employ a one-class classifier to identify abnormal nodes and edges. We extract patterns of features from these abnormal nodes and edges, and apply clustering to identify groups of patterns with similar characteristics. We finally visualize these abnormal patterns to show co-occurrences of features and the relationships between those features that most influence the abnormality of nodes and edges. We evaluate our approach on datasets from diverse domains, including historical birth certificates, COVID patient records, e-mails, books, and movies. This evaluation demonstrates that our approach is well suited to identifying both abnormal nodes and edges in graphs in an unsupervised way, and that it can outperform several baseline anomaly detection techniques.

ACM Journal of Data and Information Quality, pp. 1–37, 2022-07-06. https://doi.org/10.1145/3546912
An Improved Encryption–Compression-based Algorithm for Securing Digital Images
K. Singh, Ashutosh Kumar Singh

Nowadays, there is an increasing tendency to upload images to online platforms acting as information carriers for various applications. Unfortunately, the unauthorized utilization of such images is a serious concern that has significantly impacted security and privacy. Moreover, although digital images are widely available, storing them requires substantial space. This study addresses these issues by developing an improved encryption–compression-based algorithm for securing digital images that reduces unnecessary hardware storage space, transmission time, and bandwidth demand. First, the image is encrypted using chaotic encryption. The encrypted image is then compressed using wavelet-based compression, which makes efficient use of resources without requiring any information about the encryption key. At the receiving end, the image is decompressed and then decrypted. The security of the proposed algorithm is assessed in several ways, including differential and statistical analyses, key sensitivity tests, and execution time measurements. The experimental analysis confirms the security of the method against various possible attacks. Furthermore, extensive evaluations on a real dataset demonstrate that the proposed solution is secure and has a low encryption overhead compared to similar methods.

ACM Journal of Data and Information Quality, 2022-07-06. https://doi.org/10.1145/3532783
Fairness-aware Data Integration
Lacramioara Mazilu, N. Paton, Nikolaos Konstantinou, A. Fernandes

Machine learning can be applied in applications that take decisions that impact people’s lives. Such techniques have the potential to make decision making more objective, but there is also a risk that the decisions can discriminate against certain groups as a result of bias in the underlying data. Reducing bias, or promoting fairness, has been a focus of significant investigation in machine learning, for example, based on pre-processing the training data, changing the learning algorithm, or post-processing the results of the learning. However, prior to these activities, data integration discovers and integrates the data that is used for training, and data integration processes have the potential to produce data that leads to biased conclusions. In this article, we propose an approach that generates schema mappings in ways that take into account: (i) properties that are intrinsic to mapping results that may give rise to bias in analyses; and (ii) bias observed in classifiers trained on the results of different sets of mappings. The approach explores a space of different ways of integrating the data, using a Tabu search algorithm, guided by bias-aware objective functions that represent different types of bias. The resulting approach is evaluated using the Adult Census and German Credit datasets to explore the extent to which, and the circumstances in which, the approach can increase the fairness of the results of the data integration process.

ACM Journal of Data and Information Quality, pp. 1–26, 2022-07-05. https://doi.org/10.1145/3519419
A Survey on Classifying Big Data with Label Noise
Justin M. Johnson, T. Khoshgoftaar

Class label noise is a critical component of data quality that directly inhibits the predictive performance of machine learning algorithms. While many data-level and algorithm-level methods exist for treating label noise, the challenges associated with big data call for new and improved methods. This survey addresses these concerns by providing an extensive literature review on treating label noise within big data. We begin with an introduction to the class label noise problem and traditional methods for treating label noise. Next, we present 30 methods for treating class label noise in a range of big data contexts, i.e., high-volume, high-variety, and high-velocity problems. The surveyed works include distributed solutions capable of operating on datasets of arbitrary sizes, deep learning techniques for large-scale datasets with limited clean labels, and streaming techniques for detecting class noise in the presence of concept drift. Common trends and best practices are identified in each of these areas, implementation details are reviewed, empirical results are compared across studies when applicable, and references to 17 open-source projects and programming packages are provided. An emphasis on label noise challenges, solutions, and empirical results as they relate to big data distinguishes this work as a unique contribution that will inspire future research and guide machine learning practitioners.

ACM Journal of Data and Information Quality, pp. 1–43, 2022-04-10. https://doi.org/10.1145/3492546
The Many Facets of Data Equity
H. Jagadish, Julia Stoyanovich, B. Howe

Data-driven systems can induce, operationalize, and amplify systemic discrimination in a variety of ways. As data scientists, we tend to prefer to isolate and formalize equity problems to make them amenable to narrow technical solutions. However, this reductionist approach is inadequate in practice. In this article, we attempt to address data equity broadly, identify different ways in which it is manifest in data-driven systems, and propose a research agenda.

ACM Journal of Data and Information Quality, pp. 1–21, 2022-02-07. https://doi.org/10.1145/3533425
Controlling the Correctness of Aggregation Operations During Sessions of Interactive Analytic Queries
E. Simon, B. Amann, Rutian Liu, Stéphane Gançarski

We present a comprehensive set of conditions and rules to control the correctness of aggregation queries within an interactive data analysis session. The goal is to extend self-service data preparation and Business Intelligence (BI) tools to automatically detect semantically incorrect aggregate queries on analytic tables and views built using the common analytic operations, including filter, project, join, aggregate, union, difference, and pivot. We introduce aggregable properties that describe, for any attribute of an analytic table, which aggregation functions correctly aggregate the attribute along which sets of dimension attributes. These properties can also be used to formally identify attributes that are summarizable with respect to some aggregation function along a given set of dimension attributes. This is particularly helpful for detecting incorrect aggregations of measures obtained through the use of non-distributive aggregation functions like average and count. We extend the notion of summarizability by introducing a new generalized summarizability condition to control the aggregation of attributes after any analytic operation. Finally, we define propagation rules that transform aggregable properties of the query input tables into new aggregable properties for the result tables, preserving summarizability and generalized summarizability.

ACM Journal of Data and Information Quality, pp. 1–41, 2021-11-27. https://doi.org/10.1145/3575812
Revisiting Contextual Toxicity Detection in Conversations
Julia Ive, Atijit Anuchitanukul, Lucia Specia

Understanding toxicity in user conversations is undoubtedly an important problem. Addressing “covert” or implicit cases of toxicity is particularly hard and requires context. Very few previous studies have analysed the influence of conversational context in human perception or in automated detection models. We dive deeper into both these directions. We start by analysing existing contextual datasets and find that toxicity labelling by humans is in general influenced by the conversational structure, polarity, and topic of the context. We then propose to bring these findings into computational detection models by introducing and evaluating (a) neural architectures for contextual toxicity detection that are aware of the conversational structure, and (b) data augmentation strategies that can help model contextual toxicity detection. Our results show the encouraging potential of neural architectures that are aware of the conversation structure. We also demonstrate that such models can benefit from synthetic data, especially in the social media domain.

ACM Journal of Data and Information Quality, pp. 1–22, 2021-11-24. https://doi.org/10.1145/3561390
Experience: Automated Prediction of Experimental Metadata from Scientific Publications
Stuti Nayak, Amrapali Zaveri, Pedro Hernandez Serrano, Michel Dumontier

While there exists an abundance of open biomedical data, the lack of high-quality metadata makes it challenging for others to find relevant datasets and to reuse them for another purpose. In partic...

ACM Journal of Data and Information Quality, pp. 1–11, 2021-08-12. https://doi.org/10.1145/3451219
ExpanDrogram: Dynamic Visualization of Big Data Segmentation over Time
A. Khalemsky, R. Gelbard

In dynamic and big data environments, the visualization of a segmentation process over time often does not enable the user to track entire segments simultaneously. The key points are sometimes incomparable, and the user is limited to a static visual presentation of a certain point. The proposed visualization concept, called ExpanDrogram, is designed to support dynamic classifiers that run in a big data environment subject to changes in data characteristics. It offers a wide range of features that seek to maximize the customization of a segmentation problem. The main goal of the ExpanDrogram visualization is to improve comprehensiveness by combining both the individual and segment levels, illustrating the dynamics of the segmentation process over time, providing “version control” that enables the user to observe the history of changes, and more. The method is illustrated using different datasets, with which we demonstrate multiple segmentation parameters as well as multiple display layers, highlighting points such as new trend detection, outlier detection, tracking of changes in original segments, and zooming in/out for more/less detail. The datasets range in size from small to more than 12 million records.

ACM Journal of Data and Information Quality, pp. 1–27, 2021-06-02. https://doi.org/10.1145/3434778