
ACM Journal of Data and Information Quality: Latest Publications

Experience: Differentiating Between Isolated and Sequence Missing Data
IF 2.1 | Q3 | COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-01-19 | DOI: 10.1145/3575809
Amal Tawakuli, Daniel Kaiser, T. Engel
Missing data is one of the most persistent problems found in data that hinders information and value extraction. Handling missing data is a preprocessing task that has been extensively studied by the research community and remains an active research topic due to its impact and pervasiveness. Many surveys have been conducted to evaluate traditional and state-of-the-art techniques; however, the accuracy of missing data imputation techniques is evaluated without differentiating between isolated and sequence missing instances. In this article, we highlight the presence of both of these types of missing data at different percentages in real-world time-series datasets. We demonstrate that existing imputation techniques have different estimation accuracies for isolated and sequence missing instances. We then propose using a hybrid approach that differentiates between the two types of missing data to yield improved overall imputation accuracy.
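As a minimal illustration of the distinction drawn above (a sketch under assumed conventions, not the authors' implementation), one can classify each missing value by the length of the gap it belongs to and then impute the two types separately:

```python
import numpy as np
import pandas as pd

def split_missing_runs(series: pd.Series):
    """Classify each missing position as 'isolated' (run length 1)
    or 'sequence' (run length > 1); return two lists of positions."""
    isolated, sequence = [], []
    mask = series.isna().to_numpy()
    i, n = 0, len(mask)
    while i < n:
        if mask[i]:
            j = i
            while j < n and mask[j]:
                j += 1
            (isolated if j - i == 1 else sequence).extend(range(i, j))
            i = j
        else:
            i += 1
    return isolated, sequence

def hybrid_impute(series: pd.Series) -> pd.Series:
    """Hybrid imputation sketch: linear interpolation for isolated gaps,
    a global mean as a stand-in for a sequence-aware model on longer gaps."""
    isolated, sequence = split_missing_runs(series)
    out = series.copy()
    interp = series.interpolate(limit_direction="both")
    out.iloc[isolated] = interp.iloc[isolated]
    out.iloc[sequence] = series.mean()  # placeholder for a smarter estimator
    return out
```

In practice the sequence branch would use a model suited to long gaps (e.g., a seasonal or learned estimator); the point of the sketch is only the routing by gap length.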
Citations: 3
Deception Detection Within and Across Domains: Identifying and Understanding the Performance Gap
IF 2.1 | Q3 | COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2022-11-22 | DOI: 10.1145/3561413
Subhadarshi Panda, Sarah Ita Levitan
NLP approaches to automatic deception detection have gained popularity over the past few years, especially with the proliferation of fake reviews and fake news online. However, most previous studies of deception detection have focused on single domains. We currently lack information about how these single-domain models of deception may or may not generalize to new domains. In this work, we conduct empirical studies of cross-domain deception detection in five domains to understand how current models perform when evaluated on new deception domains. Our experimental results reveal a large gap between within-domain and cross-domain classification performance. Motivated by these findings, we propose methods to understand the differences in performance across domains. We formulate five distance metrics that quantify the distance between pairs of deception domains. We experimentally demonstrate that the distance between a pair of domains negatively correlates with the cross-domain accuracies of the domains. We thoroughly analyze the differences between the domains and the impact of fine-tuning BERT-based models by visualizing the sentence embeddings. Finally, we utilize the distance metrics to recommend the optimal source domain for any given target domain. This work highlights the need to develop robust learning algorithms for cross-domain deception detection that generalize and adapt to new domains and contributes toward that goal.
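The abstract does not list its five distance metrics. One common choice for quantifying the distance between two text domains, shown purely as an illustration, is the Jensen-Shannon distance between their unigram distributions:

```python
import math
from collections import Counter

def js_distance(docs_a, docs_b):
    """Jensen-Shannon distance between the unigram distributions of two
    domains, each given as a list of documents. Ranges from 0 (identical)
    to 1 (disjoint vocabularies), using log base 2."""
    ca = Counter(w for d in docs_a for w in d.lower().split())
    cb = Counter(w for d in docs_b for w in d.lower().split())
    vocab = sorted(set(ca) | set(cb))
    ta, tb = sum(ca.values()), sum(cb.values())
    p = [ca[w] / ta for w in vocab]
    q = [cb[w] / tb for w in vocab]
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x_i == 0 contribute 0.
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)

    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))
```

A metric like this could then be correlated against cross-domain accuracy in the way the abstract describes.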
Citations: 0
Editorial: Special Issue on Data Quality and Ethics
IF 2.1 | Q3 | COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2022-09-06 | DOI: 10.1145/3561202
D. Firmani, L. Tanca, Riccardo Torlone
This editorial summarizes the content of the Special Issue on Data Quality and Ethics of the Journal of Data and Information Quality (JDIQ). The issue accepted submissions from June 1 to July 30, 2021.
Citations: 0
Seeing Should Probably Not Be Believing: The Role of Deceptive Support in COVID-19 Misinformation on Twitter
IF 2.1 | Q3 | COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2022-08-18 | DOI: 10.1145/3546914
Chaoyuan Zuo, Ritwik Banerjee, H. Shirazi, Fateme Hashemi Chaleshtori, I. Ray
With the spread of SARS-CoV-2, enormous amounts of information about the pandemic are disseminated through social media platforms such as Twitter. Social media posts often leverage the trust readers have in prestigious news agencies and cite news articles as a way of gaining credibility. Nevertheless, it is not always the case that the cited article supports the claim made in the social media post. We present a cross-genre ad hoc pipeline to identify whether the information in a Twitter post (i.e., a “Tweet”) is indeed supported by the cited news article. Our approach is empirically based on a corpus of over 46.86 million Tweets and is divided into two tasks: (i) detecting Tweets that contain claims worth fact-checking and (ii) verifying whether the claims made in a Tweet are supported by the newswire article it cites. Unlike previous studies that detect unsubstantiated information by post hoc analysis of the patterns of propagation, we seek to identify reliable support (or the lack of it) before the misinformation begins to spread. We discover that nearly half of the Tweets (43.4%) are not factual and hence not worth checking—a significant filter, given the sheer volume of social media posts on a platform such as Twitter. Moreover, we find that among the Tweets that contain a seemingly factual claim while citing a news article as supporting evidence, at least 1% are not actually supported by the cited news and are hence misleading.
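A toy skeleton of such a two-stage pipeline, with crude heuristics standing in for the paper's trained models (all thresholds and rules here are illustrative assumptions):

```python
import re

def is_check_worthy(tweet: str) -> bool:
    """Stage 1 stand-in: factual claims tend to contain numbers or
    reporting verbs. The real pipeline uses learned models."""
    if re.search(r"\d", tweet):
        return True
    return any(v in tweet.lower() for v in ("says", "reports", "confirms", "claims"))

def is_supported(tweet: str, article: str, threshold: float = 0.5) -> bool:
    """Stage 2 stand-in: fraction of the tweet's content words that appear
    in the cited article, compared against an arbitrary threshold."""
    content = {w for w in tweet.lower().split() if len(w) > 3}
    article_words = set(article.lower().split())
    return bool(content) and len(content & article_words) / len(content) >= threshold

def check_tweet(tweet: str, cited_article: str) -> str:
    """Route a tweet through both stages, mirroring the two tasks above."""
    if not is_check_worthy(tweet):
        return "not check-worthy"
    return "supported" if is_supported(tweet, cited_article) else "unsupported"
```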
Citations: 2
Data Completeness and Complex Semantics in Conceptual Modeling: The Need for a Disaggregation Construct
IF 2.1 | Q3 | COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2022-08-08 | DOI: 10.1145/3532784
Y. Li, Faiz Currim, S. Ram
Conceptual modeling is important for developing databases that maintain the integrity and quality of stored information. However, classical conceptual models have often been assumed to work on well-maintained and high-quality data. With the advancement and expansion of data science, this is no longer the case. Settings with lower data quality now also need to be modeled and stored, which requires updating and augmenting conceptual models to represent lower-quality data. In this paper, we focus on the intersection between data completeness (an important aspect of data quality) and complex class semantics (where a complex class entity represents information that spans more than one simple class entity). We propose a new disaggregation construct to allow the modeling of incomplete information. We demonstrate the use of our disaggregation construct for diverse modeling problems and discuss the anomalies that could occur without this construct. We provide formal definitions and thorough comparisons between various types of complex constructs to guide future application and prove the unique interpretation of our newly proposed disaggregation construct.
Citations: 1
Detecting Risk of Biased Output with Balance Measures
IF 2.1 | Q3 | COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2022-08-05 | DOI: 10.1145/3530787
Mariachiara Mecati, A. Vetrò, Marco Torchiano
Data have become a fundamental element of the management and productive infrastructures of our society, fuelling digitization of organizational and decision-making processes at an impressive speed. This transition shows lights and shadows, and the “bias in-bias out” problem is one of the most relevant issues, which encompasses technical, ethical, and social perspectives. We address this field of research by investigating how the balance of protected attributes in training data can be used to assess the risk of algorithmic unfairness. We identify four balance measures and test their ability to detect the risk of discriminatory classification by applying them to the training set. The results of this proof of concept show that the indexes can properly detect unfairness of software output. However, we found that the choice of balance measure has a substantial impact on the threshold at which data should be considered risky; further work is necessary to deepen knowledge of this aspect.
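The paper evaluates four balance measures; a sketch of three commonly used balance indexes for a protected attribute column follows (illustrative choices, not necessarily the authors' four):

```python
import math
from collections import Counter

def balance_measures(values):
    """Compute three balance indexes for a categorical protected attribute:
    normalized Shannon entropy (1.0 = perfectly balanced), Gini impurity,
    and the imbalance ratio min_count / max_count."""
    counts = Counter(values)
    n, k = len(values), len(counts)
    props = [c / n for c in counts.values()]
    shannon = (-sum(p * math.log(p) for p in props) / math.log(k)) if k > 1 else 1.0
    gini = 1 - sum(p * p for p in props)
    imbalance_ratio = min(counts.values()) / max(counts.values())
    return {"shannon_balance": shannon, "gini": gini, "imbalance_ratio": imbalance_ratio}
```

Applied to a training set's protected columns (e.g., gender or ethnicity), low values of any index would flag a risk of biased output, with the caveat from the abstract that the risky threshold depends on the measure chosen.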
Citations: 2
E-FAIR-DB: Functional Dependencies to Discover Data Bias and Enhance Data Equity
IF 2.1 | Q3 | COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2022-08-04 | DOI: 10.1145/3552433
Fabio Azzalini, Chiara Criscuolo, L. Tanca
Decisions based on algorithms and systems generated from data have become essential tools that pervade all aspects of our daily lives; for these advances to be reliable, the results should be accurate but should also respect all the facets of data equity [11]. In this context, the concepts of Fairness and Diversity have become relevant topics of discussion within the field of Data Science Ethics and, in general, in Data Science. Although data equity is desirable, reconciling this property with accurate decision-making is a critical tradeoff, because applying a repair procedure to restore equity might modify the original data in such a way that the final decision is inaccurate w.r.t. the ultimate objective of the analysis. In this work, we propose E-FAIR-DB, a novel solution that, exploiting the notion of Functional Dependency—a type of data constraint—aims at restoring data equity by discovering and solving discrimination in datasets. The proposed solution is implemented as a pipeline that, first, mines functional dependencies to detect and evaluate fairness and diversity in the input dataset, and then, based on these understandings and on the objective of the data analysis, mitigates data bias, minimizing the number of modifications. Our tool can identify, through the mined dependencies, the attributes of the database that encompass discrimination (e.g., gender, ethnicity, or religion); then, based on these dependencies, it determines the smallest amount of data that must be added and/or removed to mitigate such bias. We evaluate our proposal both through theoretical considerations and experiments on two real-world datasets.
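A functional dependency X → Y holds when rows that agree on the X attributes always agree on the Y attributes; a minimal checker (illustrative only, not the E-FAIR-DB mining pipeline) can verify this in a single scan:

```python
def fd_holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds over rows,
    where rows is a list of dicts and lhs/rhs are lists of attribute names."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if key in seen and seen[key] != val:
            return False  # same lhs values map to two different rhs values
        seen[key] = val
    return True
```

In the spirit of the abstract, a dependency such as `["gender"] -> ["outcome"]` holding (or nearly holding) in a dataset would signal that a protected attribute determines the decision, i.e., a candidate for bias mitigation.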
Citations: 2
Combining Human and Machine Confidence in Truthfulness Assessment
IF 2.1 | Q3 | COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2022-07-11 | DOI: 10.1145/3546916
Yunke Qu, Kevin Roitero, David La Barbera, Damiano Spina, Stefano Mizzaro, Gianluca Demartini
Automatically detecting online misinformation at scale is a challenging and interdisciplinary problem. Deciding what is to be considered truthful information is sometimes controversial and also difficult for educated experts. As the scale of the problem increases, human-in-the-loop approaches to truthfulness that combine both the scalability of machine learning (ML) and the accuracy of human contributions have been considered. In this work, we look at the potential to automatically combine machine-based systems with human-based systems. The former exploit supervised ML approaches; the latter involve either crowd workers (i.e., human non-experts) or human experts. Since both ML and crowdsourcing approaches can produce a score indicating the level of confidence in their truthfulness judgments (either algorithmic or self-reported, respectively), we address the question of whether it is feasible to make use of such confidence scores to effectively and efficiently combine three approaches: (i) machine-based methods, (ii) crowd workers, and (iii) human experts. The three approaches differ significantly, as they range from available, cheap, fast, scalable, but less accurate to scarce, expensive, slow, not scalable, but highly accurate.
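One simple way to combine the three assessors by confidence is an escalation scheme: accept the cheapest judgment whose confidence clears its threshold, otherwise escalate. This is an illustrative sketch, not necessarily one of the combination strategies the paper evaluates:

```python
def assess(claim, ml_judge, crowd_judge, expert_judge,
           ml_threshold=0.9, crowd_threshold=0.7):
    """Escalation-based combination: each judge returns (label, confidence).
    Cheap-but-noisy judges are tried first; the expert is the fallback and
    is always trusted. Thresholds here are arbitrary illustrative values."""
    label, conf = ml_judge(claim)
    if conf >= ml_threshold:
        return label, "machine"
    label, conf = crowd_judge(claim)
    if conf >= crowd_threshold:
        return label, "crowd"
    label, _ = expert_judge(claim)
    return label, "expert"
```

The design matches the cost ordering in the abstract: the scalable machine handles high-confidence cases, the crowd handles the middle ground, and scarce experts only see the hardest claims.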
Citations: 5
A Neural Model to Jointly Predict and Explain Truthfulness of Statements
IF 2.1 | Q3 | COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2022-07-09 | DOI: 10.1145/3546917
Erik Brand, Kevin Roitero, Michael Soprano, A. Rahimi, Gianluca Demartini
Automated fact-checking (AFC) systems exist to combat disinformation; however, their complexity usually makes them opaque to the end-user, making it difficult to foster trust in the system. In this article, we introduce the E-BART model with the hope of making progress on this front. E-BART is able to provide a veracity prediction for a claim and jointly generate a human-readable explanation for this decision. We show that E-BART is competitive with the state-of-the-art on the e-FEVER and e-SNLI tasks. In addition, we validate the joint-prediction architecture by showing (1) that generating explanations does not significantly impede the model from performing well in its main task of veracity prediction, and (2) that predicted veracity and explanations are more internally coherent when generated jointly than separately. We also calibrate the E-BART model, allowing the output of the final model to be correctly interpreted as the confidence of correctness. Finally, we also conduct an extensive human evaluation on the impact of generated explanations and observe that: Explanations increase human ability to spot misinformation and make people more skeptical about claims, and explanations generated by E-BART are competitive with ground truth explanations.
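The abstract mentions calibrating the model so its output can be read as confidence of correctness, without specifying the procedure. Temperature scaling is a standard calibration technique, sketched here purely as an illustration:

```python
import math

def softmax(logits, t=1.0):
    """Softmax with temperature t; larger t flattens the distribution."""
    exps = [math.exp(l / t) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def fit_temperature(logit_sets, labels, grid=None):
    """Temperature scaling: choose the temperature that minimises the
    negative log-likelihood of the true labels on a held-out set.
    A coarse grid search stands in for gradient-based fitting."""
    grid = grid or [0.5 + 0.25 * i for i in range(19)]  # 0.5 .. 5.0

    def nll(t):
        return -sum(math.log(softmax(lo, t)[y])
                    for lo, y in zip(logit_sets, labels))

    return min(grid, key=nll)
```

For an overconfident model (large logit gaps but imperfect accuracy), the fitted temperature exceeds 1, softening the probabilities so they track the true correctness rate.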
Cited by: 4
Using Agent-Based Modelling to Evaluate the Impact of Algorithmic Curation on Social Media
IF 2.1 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2022-07-08 DOI: 10.1145/3546915
A. Gausen, Wayne Luk, Ce Guo
Social media networks have drastically changed how people communicate and seek information. Due to the scale of information on these platforms, newsfeed curation algorithms have been developed to sort through this information and curate what users see. However, these algorithms are opaque and it is difficult to understand their impact on human communication flows. Some papers have criticised newsfeed curation algorithms that, while promoting user engagement, heighten online polarisation, misinformation, and the formation of echo chambers. Agent-based modelling offers the opportunity to simulate the complex interactions between these algorithms, what users see, and the propagation of information on social media. This article uses agent-based modelling to compare the impact of four different newsfeed curation algorithms on the spread of misinformation and polarisation. This research has the following contributions: (1) implementing newsfeed curation algorithm logic on an agent-based model; (2) comparing the impact of different curation algorithm objectives on misinformation and polarisation; and (3) calibration and empirical validation using real Twitter data. This research provides useful insights into the impact of curation algorithms on how information propagates and on content diversity on social media. Moreover, we show how agent-based modelling can reveal specific properties of curation algorithms, which can be used in improving such algorithms.
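As a hedged illustration of the kind of simulation the abstract describes (this is not the authors' model: the agents, posts, credulity values, and the two curation objectives below are invented toy stand-ins), a minimal agent-based sketch can show how an engagement-maximising feed keeps a high-engagement misinformation post visible far longer than a chronological feed:

```python
import random

def curate(posts, mode, feed_size):
    """Rank candidate posts for an agent's feed under a given curation objective."""
    if mode == "engagement":
        ranked = sorted(posts, key=lambda p: p["engagement"], reverse=True)
    else:  # "chronological": newest first
        ranked = sorted(posts, key=lambda p: p["time"], reverse=True)
    return ranked[:feed_size]

def simulate(mode, n_agents=100, steps=20, feed_size=3, seed=0):
    """Toy ABM: one misinformation post competes with a stream of factual posts."""
    rng = random.Random(seed)
    credulity = [rng.uniform(0.2, 0.6) for _ in range(n_agents)]  # reshare propensity
    exposed = [False] * n_agents
    # Seed one misinformation post with an engagement head start at t = 0.
    posts = [{"misinfo": True, "engagement": 5, "time": 0}]
    for t in range(1, steps + 1):
        # A fresh factual post arrives each step with modest engagement.
        posts.append({"misinfo": False, "engagement": rng.randint(0, 3), "time": t})
        for i in range(n_agents):
            for post in curate(posts, mode, feed_size):
                if post["misinfo"] and not exposed[i] and rng.random() < credulity[i]:
                    exposed[i] = True
                    post["engagement"] += 1  # reshares feed back into the ranking
    return sum(exposed) / n_agents  # fraction of agents exposed to misinformation

frac_engagement = simulate("engagement")
frac_chronological = simulate("chronological")
```

With these toy parameters, the misinformation post never leaves the top of the engagement-ranked feed (its reshares keep boosting it), while the chronological feed pushes it out after a few steps, so the exposed fraction is markedly higher under the engagement objective.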
Using Agent-Based Modelling to Evaluate the Impact of Algorithmic Curation on Social Media. A. Gausen, Wayne Luk, Ce Guo. ACM Journal of Data and Information Quality, pp. 1-24. Pub Date: 2022-07-08. DOI: 10.1145/3546915
Cited by: 4