
Latest publications from the ACM Journal of Data and Information Quality

A Coverage-based Approach to Nondiscrimination-aware Data Transformation
IF 2.1 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2022-07-08 DOI: 10.1145/3546913
Chiara Accinelli, B. Catania, G. Guerrini, Simone Minisi
The development of technological solutions satisfying nondiscriminatory requirements is one of the main current challenges for data processing. Back-end operators for preparing, i.e., extracting and transforming, data play a relevant role with respect to nondiscrimination, since they can introduce bias with an impact on the entire data life-cycle. In this article, we focus on back-end transformations, defined in terms of Select-Project-Join queries, and on coverage. Coverage aims at guaranteeing that the input, or training, dataset includes enough examples for each (protected) category of interest, thus increasing diversity with the aim of limiting the introduction of bias during the next analytical steps. The article proposes an approach that automatically rewrites a transformation whose result violates coverage constraints into the “closest” query satisfying the constraints. The approach is approximate and relies on sample-based cardinality estimation, and thus introduces a trade-off between accuracy and efficiency. The efficiency and effectiveness of the approach are experimentally validated on synthetic and real data.
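As a rough illustration of the sample-based checking step, the hedged Python sketch below estimates, per protected group, how many tuples a selection predicate would retain and reports which coverage constraints are violated. The function names, the pandas representation, and the simple scaling factor are assumptions made for illustration; they do not reproduce the authors' rewriting algorithm.

```python
import pandas as pd

def estimate_coverage(sample: pd.DataFrame, predicate, group_col: str,
                      population_size: int) -> pd.Series:
    """Estimate, per protected group, how many input tuples a selection
    predicate would retain, by scaling counts observed on a sample."""
    selected = sample[sample.apply(predicate, axis=1)]
    scale = population_size / len(sample)
    return (selected[group_col].value_counts() * scale).round()

def violated_constraints(estimates: pd.Series, constraints: dict) -> dict:
    """Return the coverage constraints (group -> minimum count) that the
    estimated query result fails to satisfy."""
    return {g: k for g, k in constraints.items() if estimates.get(g, 0) < k}

# Toy usage: require at least 50 tuples for each gender in the query result.
sample = pd.DataFrame({"gender": ["F", "M", "F", "M", "M"],
                       "income": [30, 80, 45, 60, 90]})
est = estimate_coverage(sample, lambda r: r["income"] > 50, "gender",
                        population_size=1_000)
print(violated_constraints(est, {"F": 50, "M": 50}))
```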
Citations: 4
Unsupervised Identification of Abnormal Nodes and Edges in Graphs
IF 2.1 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2022-07-06 DOI: 10.1145/3546912
A. Senaratne, P. Christen, Graham J. Williams, Pouya Ghiasnezhad Omran
Much of today’s data are represented as graphs, ranging from social networks to bibliographic citations. Nodes in such graphs correspond to records that generally represent entities, while edges represent relationships between these entities. Both nodes and edges in a graph can have attributes that characterize the entities and their relationships. Relationships are either explicitly known (like friends in a social network), or they are inferred using link prediction (such as inferring that two babies are siblings because they share the same mother). Any graph representing real-world data likely contains nodes and edges that are abnormal, and identifying these can be important for outlier detection in applications ranging from crime and fraud detection to viral marketing. We propose a novel approach to the unsupervised detection of abnormal nodes and edges in graphs. We first characterize nodes and edges using a set of features, and then employ a one-class classifier to identify abnormal nodes and edges. We extract patterns of features from these abnormal nodes and edges, and apply clustering to identify groups of patterns with similar characteristics. We finally visualize these abnormal patterns to show co-occurrences of features and relationships between those features that mostly influence the abnormality of nodes and edges. We evaluate our approach on datasets from diverse domains, including historical birth certificates, COVID patient records, e-mails, books, and movies. This evaluation demonstrates that our approach is well suited to identify both abnormal nodes and edges in graphs in an unsupervised way, and it can outperform several baseline anomaly detection techniques.
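The feature-then-detector pipeline can be gestured at with the hedged Python sketch below. The node features, the use of scikit-learn's IsolationForest as a stand-in for the paper's one-class classifier, and the contamination setting are all assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Assumed toy node features: degree, clustering coefficient, attribute count.
rng = np.random.default_rng(0)
normal_nodes = rng.normal(loc=[5.0, 0.3, 4.0], scale=[1.0, 0.05, 0.5], size=(200, 3))
odd_nodes = rng.normal(loc=[40.0, 0.9, 1.0], scale=[5.0, 0.05, 0.5], size=(5, 3))
features = np.vstack([normal_nodes, odd_nodes])

# One-class-style detector: fit on all nodes, flag the most isolated ones.
detector = IsolationForest(contamination=0.05, random_state=0).fit(features)
labels = detector.predict(features)          # +1 = normal, -1 = abnormal
abnormal_idx = np.where(labels == -1)[0]
print("abnormal node indices:", abnormal_idx)
```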
Citations: 4
An Improved Encryption–Compression-based Algorithm for Securing Digital Images
IF 2.1 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2022-07-06 DOI: 10.1145/3532783
K. Singh, Ashutosh Kumar Singh
Nowadays, there is an increasing tendency to upload images to online platforms acting as information carriers for various applications. Unfortunately, the unauthorized utilization of such images is a serious concern that has significantly impacted security and privacy. Although digital images are widely available, the storage of these images requires a large amount of data. This study aims to address these issues by developing an improved encryption–compression-based algorithm for securing digital images that reduces unnecessary hardware storage space, transmission time, and bandwidth demand. First, the image is encrypted using chaotic encryption. Then the encrypted image is compressed using wavelet-based compression, making efficient use of resources without requiring any information about the encryption key. On the receiving side, the image is decompressed and decrypted. The security of the proposed algorithm is assessed in several ways, including differential and statistical analysis, key sensitivity, and execution time. The experimental analysis proves the security of the method against various possible attacks. Furthermore, the extensive evaluations on a real dataset demonstrate that the proposed solution is secure and has a low encryption overhead compared to similar methods.
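To make the chaotic-encryption stage concrete, here is a minimal Python sketch of logistic-map keystream encryption of a grayscale image. The map parameters and the XOR construction are illustrative assumptions rather than the published algorithm, and the wavelet-based compression stage that follows in the paper is omitted.

```python
import numpy as np

def logistic_keystream(length: int, x0: float = 0.4567, r: float = 3.99) -> np.ndarray:
    """Generate a byte keystream from the logistic map x_{n+1} = r * x_n * (1 - x_n)."""
    x, stream = x0, np.empty(length, dtype=np.uint8)
    for i in range(length):
        x = r * x * (1.0 - x)
        stream[i] = int(x * 256) % 256
    return stream

def chaotic_xor(image: np.ndarray, x0: float = 0.4567) -> np.ndarray:
    """Encrypt (or decrypt) a grayscale image by XOR-ing it with the keystream."""
    flat = image.astype(np.uint8).ravel()
    return (flat ^ logistic_keystream(flat.size, x0)).reshape(image.shape)

# Round trip on a toy 4x4 "image": XOR encryption is its own inverse.
img = np.arange(16, dtype=np.uint8).reshape(4, 4)
assert np.array_equal(chaotic_xor(chaotic_xor(img)), img)
```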
Citations: 1
Fairness-aware Data Integration
IF 2.1 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2022-07-05 DOI: 10.1145/3519419
Lacramioara Mazilu, N. Paton, Nikolaos Konstantinou, A. Fernandes
Machine learning can be applied in applications that make decisions that impact people’s lives. Such techniques have the potential to make decision making more objective, but there is also a risk that the decisions can discriminate against certain groups as a result of bias in the underlying data. Reducing bias, or promoting fairness, has been a focus of significant investigation in machine learning, for example, based on pre-processing the training data, changing the learning algorithm, or post-processing the results of the learning. However, prior to these activities, data integration discovers and integrates the data that is used for training, and data integration processes have the potential to produce data that leads to biased conclusions. In this article, we propose an approach that generates schema mappings in ways that take into account: (i) properties that are intrinsic to mapping results that may give rise to bias in analyses; and (ii) bias observed in classifiers trained on the results of different sets of mappings. The approach explores a space of different ways of integrating the data, using a Tabu search algorithm, guided by bias-aware objective functions that represent different types of bias. The resulting approach is evaluated using Adult Census and German Credit datasets to explore the extent to which and the circumstances in which the approach can increase the fairness of the results of the data integration process.
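The search component can be illustrated with a generic Tabu search loop in Python. The toy neighbourhood and the stand-in "disparity" objective below are assumptions for illustration only; the paper searches over schema mappings and measures bias via classifiers trained on the mapping results.

```python
import random

def tabu_search(initial, neighbours, objective, iterations=100, tabu_size=10):
    """Generic Tabu search: minimise a bias-aware objective over candidates,
    forbidding recently visited solutions to escape local minima."""
    current = best = initial
    tabu = [initial]
    for _ in range(iterations):
        candidates = [n for n in neighbours(current) if n not in tabu]
        if not candidates:
            break
        current = min(candidates, key=objective)   # best admissible neighbour
        tabu.append(current)
        tabu = tabu[-tabu_size:]                   # bounded tabu list
        if objective(current) < objective(best):
            best = current
    return best

# Toy stand-in: a "mapping choice" is an integer, and the objective penalises
# a hypothetical disparity measure derived from it.
disparity = lambda m: abs((m * 37) % 101 - 50) / 50
neighbours = lambda m: [m - 1, m + 1, m + random.randint(-5, 5)]
print(tabu_search(0, neighbours, disparity))
```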
Citations: 1
A Survey on Classifying Big Data with Label Noise
IF 2.1 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2022-04-10 DOI: 10.1145/3492546
Justin M. Johnson, T. Khoshgoftaar
Class label noise is a critical component of data quality that directly inhibits the predictive performance of machine learning algorithms. While many data-level and algorithm-level methods exist for treating label noise, the challenges associated with big data call for new and improved methods. This survey addresses these concerns by providing an extensive literature review on treating label noise within big data. We begin with an introduction to the class label noise problem and traditional methods for treating label noise. Next, we present 30 methods for treating class label noise in a range of big data contexts, i.e., high-volume, high-variety, and high-velocity problems. The surveyed works include distributed solutions capable of operating on datasets of arbitrary sizes, deep learning techniques for large-scale datasets with limited clean labels, and streaming techniques for detecting class noise in the presence of concept drift. Common trends and best practices are identified in each of these areas, implementation details are reviewed, empirical results are compared across studies when applicable, and references to 17 open-source projects and programming packages are provided. An emphasis on label noise challenges, solutions, and empirical results as they relate to big data distinguishes this work as a unique contribution that will inspire future research and guide machine learning practitioners.
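As one example of a traditional data-level treatment discussed in surveys of this kind, the hedged Python sketch below applies a cross-validation classification filter: instances misclassified under cross-validation are treated as likely mislabelled and removed before final training. The dataset, model, and filtering rule are illustrative assumptions, not code from the survey.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Build a toy dataset and flip 10% of the labels to simulate class label noise.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
noisy = y.copy()
flip = np.random.default_rng(0).choice(len(y), size=50, replace=False)
noisy[flip] = 1 - noisy[flip]

# Classification filter: keep only instances whose noisy label agrees with the
# cross-validated prediction; the rest are flagged as probable label noise.
pred = cross_val_predict(LogisticRegression(max_iter=1000), X, noisy, cv=5)
keep = pred == noisy
print(f"kept {keep.sum()} of {len(y)} instances; "
      f"{(~keep & (noisy != y)).sum()} truly noisy labels were filtered out")
```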
Citations: 10
The Many Facets of Data Equity
IF 2.1 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2022-02-07 DOI: 10.1145/3533425
H. Jagadish, Julia Stoyanovich, B. Howe
Data-driven systems can induce, operationalize, and amplify systemic discrimination in a variety of ways. As data scientists, we tend to prefer to isolate and formalize equity problems to make them amenable to narrow technical solutions. However, this reductionist approach is inadequate in practice. In this article, we attempt to address data equity broadly, identify different ways in which it is manifest in data-driven systems, and propose a research agenda.
Citations: 5
Controlling the Correctness of Aggregation Operations During Sessions of Interactive Analytic Queries
IF 2.1 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2021-11-27 DOI: 10.1145/3575812
E. Simon, B. Amann, Rutian Liu, Stéphane Gançarski
We present a comprehensive set of conditions and rules to control the correctness of aggregation queries within an interactive data analysis session. The goal is to extend self-service data preparation and Business Intelligence (BI) tools to automatically detect semantically incorrect aggregate queries on analytic tables and views built by using the common analytic operations including filter, project, join, aggregate, union, difference, and pivot. We introduce aggregable properties that describe, for any attribute of an analytic table, which aggregation functions correctly aggregate the attribute along which sets of dimension attributes. These properties can also be used to formally identify attributes that are summarizable with respect to some aggregation function along a given set of dimension attributes. This is particularly helpful to detect incorrect aggregations of measures obtained through the use of non-distributive aggregation functions like average and count. We extend the notion of summarizability by introducing a new generalized summarizability condition to control the aggregation of attributes after any analytic operation. Finally, we define propagation rules that transform aggregable properties of the query input tables into new aggregable properties for the result tables, preserving summarizability and generalized summarizability.
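A minimal sketch of the underlying idea, assuming a much-simplified model in which each attribute carries the set of aggregation functions still applicable to it, is shown below. The attribute names and the propagation rule are illustrative and do not reproduce the paper's formal conditions.

```python
# Each attribute carries the aggregation functions that remain correct for it.
aggregable = {
    "sales_amount": {"SUM", "AVG", "MIN", "MAX", "COUNT"},   # raw additive measure
    "avg_unit_price": {"MIN", "MAX"},  # already an average: re-averaging is unsafe
}

def check_aggregation(attribute: str, func: str) -> None:
    """Reject an aggregation that is not summarizable for this attribute."""
    allowed = aggregable.get(attribute, set())
    if func not in allowed:
        raise ValueError(
            f"{func}({attribute}) is not summarizable; allowed: {sorted(allowed)}")

def propagate_after_avg(attribute: str) -> set:
    """Propagation-rule sketch: after AVG, the result is no longer additive,
    so SUM, COUNT, and a further AVG are dropped from its aggregable set."""
    return aggregable.get(attribute, set()) - {"SUM", "AVG", "COUNT"}

check_aggregation("sales_amount", "SUM")           # fine: additive measure
aggregable["avg_sales"] = propagate_after_avg("sales_amount")
try:
    check_aggregation("avg_sales", "AVG")          # incorrect: average of an average
except ValueError as err:
    print(err)
```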
Citations: 1
Revisiting Contextual Toxicity Detection in Conversations
IF 2.1 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2021-11-24 DOI: 10.1145/3561390
Julia Ive, Atijit Anuchitanukul, Lucia Specia
Understanding toxicity in user conversations is undoubtedly an important problem. Addressing “covert” or implicit cases of toxicity is particularly hard and requires context. Very few previous studies have analysed the influence of conversational context in human perception or in automated detection models. We dive deeper into both these directions. We start by analysing existing contextual datasets and find that toxicity labelling by humans is in general influenced by the conversational structure, polarity, and topic of the context. We then propose to bring these findings into computational detection models by introducing and evaluating (a) neural architectures for contextual toxicity detection that are aware of the conversational structure, and (b) data augmentation strategies that can help model contextual toxicity detection. Our results show the encouraging potential of neural architectures that are aware of the conversation structure. We also demonstrate that such models can benefit from synthetic data, especially in the social media domain.
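A context-aware setup can be gestured at with a generic Hugging Face Transformers sketch that encodes the preceding turn and the target comment as a sentence pair, so the model can attend to both. The choice of bert-base-uncased and the untrained classification head are assumptions for illustration and are unrelated to the authors' architectures.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # binary head: non-toxic vs. toxic

context = "Parent comment: honestly these people never learn"
target = "they should all just disappear"

# Sentence-pair encoding: the conversational context goes in the first segment
# and the comment to classify in the second.
inputs = tokenizer(context, target, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(torch.softmax(logits, dim=-1))   # untrained head: scores are illustrative only
```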
Citations: 5
Experience: Automated Prediction of Experimental Metadata from Scientific Publications
IF 2.1 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2021-08-12 DOI: 10.1145/3451219
Stuti Nayak, Amrapali Zaveri, Pedro Hernandez Serrano, Michel Dumontier
While there exists an abundance of open biomedical data, the lack of high-quality metadata makes it challenging for others to find relevant datasets and to reuse them for another purpose. In partic...
Citations: 1
ExpanDrogram: Dynamic Visualization of Big Data Segmentation over Time
IF 2.1 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2021-06-02 DOI: 10.1145/3434778
A. Khalemsky, R. Gelbard
In dynamic and big data environments, the visualization of a segmentation process over time often does not enable the user to simultaneously track entire pieces. The key points are sometimes incomparable, and the user is limited to a static visual presentation of a certain point. The proposed visualization concept, called ExpanDrogram, is designed to support dynamic classifiers that run in a big data environment subject to changes in data characteristics. It offers a wide range of features that seek to maximize the customization of a segmentation problem. The main goal of the ExpanDrogram visualization is to improve comprehensiveness by combining both the individual and segment levels, illustrating the dynamics of the segmentation process over time, providing “version control” that enables the user to observe the history of changes, and more. The method is illustrated using different datasets, with which we demonstrate multiple segmentation parameters, as well as multiple display layers, to highlight points such as new trend detection, outlier detection, tracking changes in original segments, and zoom in/out for more/less detail. The datasets vary in size from small ones to one with more than 12 million records.
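A hedged sketch of the "segments over time" idea, not of the ExpanDrogram itself: the matplotlib stackplot below shows toy segment sizes evolving across time points, with a new segment emerging mid-stream, which is the kind of change a dynamic segmentation view needs to surface.

```python
import matplotlib.pyplot as plt

# Toy segment sizes at successive time points; segment "C" emerges at t = 2.
times = [0, 1, 2, 3, 4]
sizes = {"A": [50, 55, 52, 40, 38],
         "B": [30, 32, 35, 45, 50],
         "C": [0, 0, 8, 15, 22]}

fig, ax = plt.subplots()
ax.stackplot(times, *sizes.values(), labels=sizes.keys())
ax.set_xlabel("time point")
ax.set_ylabel("segment size")
ax.legend(loc="upper left")
plt.savefig("segments_over_time.png")
```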
Citations: 0