Missing data is one of the most persistent problems in real-world datasets and hinders information and value extraction. Handling missing data is a preprocessing task that has been extensively studied by the research community and remains an active research topic due to its impact and pervasiveness. Many surveys have evaluated traditional and state-of-the-art techniques; however, the accuracy of missing data imputation techniques is typically evaluated without differentiating between isolated and sequence missing instances. In this article, we highlight the presence of both types of missing data, at different percentages, in real-world time-series datasets. We demonstrate that existing imputation techniques have different estimation accuracies for isolated and sequence missing instances. We then propose a hybrid approach that differentiates between the two types of missing data to yield improved overall imputation accuracy.
{"title":"Experience: Differentiating Between Isolated and Sequence Missing Data","authors":"Amal Tawakuli, Daniel Kaiser, T. Engel","doi":"10.1145/3575809","DOIUrl":"https://doi.org/10.1145/3575809","url":null,"abstract":"Missing data is one of the most persistent problems found in data that hinders information and value extraction. Handling missing data is a preprocessing task that has been extensively studied by the research community and remains an active research topic due to its impact and pervasiveness. Many surveys have been conducted to evaluate traditional and state-of-the-art techniques, however, the accuracy of missing data imputation techniques is evaluated without differentiating between isolated and sequence missing instances. In this article, we highlight the presence of both of these types of missing data at different percentages in real-world time-series datasets. We demonstrate that existing imputation techniques have different estimation accuracies for isolated and sequence missing instances. We then propose using a hybrid approach that differentiate between the two types of missing data to yield improved overall imputation accuracy.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"22 1","pages":"1 - 15"},"PeriodicalIF":2.1,"publicationDate":"2023-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77139665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NLP approaches to automatic deception detection have gained popularity over the past few years, especially with the proliferation of fake reviews and fake news online. However, most previous studies of deception detection have focused on a single domain. We currently lack information about how these single-domain models of deception may or may not generalize to new domains. In this work, we conduct empirical studies of cross-domain deception detection in five domains to understand how current models perform when evaluated on new deception domains. Our experimental results reveal a large gap between within-domain and cross-domain classification performance. Motivated by these findings, we propose methods to understand the differences in performance across domains. We formulate five distance metrics that quantify the distance between pairs of deception domains. We experimentally demonstrate that the distance between a pair of domains negatively correlates with their cross-domain accuracies. We thoroughly analyze the differences between the domains and the impact of fine-tuning BERT-based models by visualizing the sentence embeddings. Finally, we utilize the distance metrics to recommend the optimal source domain for any given target domain. This work highlights the need to develop robust learning algorithms for cross-domain deception detection that generalize and adapt to new domains, and contributes toward that goal.
{"title":"Deception Detection Within and Across Domains: Identifying and Understanding the Performance Gap","authors":"Subhadarshi Panda, Sarah Ita Levitan","doi":"10.1145/3561413","DOIUrl":"https://doi.org/10.1145/3561413","url":null,"abstract":"NLP approaches to automatic deception detection have gained popularity over the past few years, especially with the proliferation of fake reviews and fake news online. However, most previous studies of deception detection have focused on single domains. We currently lack information about how these single-domain models of deception may or may not generalize to new domains. In this work, we conduct empirical studies of cross-domain deception detection in five domains to understand how current models perform when evaluated on new deception domains. Our experimental results reveal a large gap between within and across domain classification performance. Motivated by these findings, we propose methods to understand the differences in performances across domains. We formulate five distance metrics that quantify the distance between pairs of deception domains. We experimentally demonstrate that the distance between a pair of domains negatively correlates with the cross-domain accuracies of the domains. We thoroughly analyze the differences in the domains and the impact of fine-tuning BERT based models by visualization of the sentence embeddings. Finally, we utilize the distance metrics to recommend the optimal source domain for any given target domain. This work highlights the need to develop robust learning algorithms for cross-domain deception detection that generalize and adapt to new domains and contributes toward that goal.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"342 5","pages":"1 - 27"},"PeriodicalIF":2.1,"publicationDate":"2022-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72391624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This editorial summarizes the content of the Special Issue on Data Quality and Ethics of the Journal of Data and Information Quality (JDIQ). The issue accepted submissions from June 1 to July 30, 2021.
{"title":"Editorial: Special Issue on Data Quality and Ethics","authors":"D. Firmani, L. Tanca, Riccardo Torlone","doi":"10.1145/3561202","DOIUrl":"https://doi.org/10.1145/3561202","url":null,"abstract":"This editorial summarizes the content of the Special Issue on Data Quality and Ethics of the Journal of Data and Information Quality (JDIQ). The issue accepted submissions from June 1 to July 30, 2021.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"22 1","pages":"1 - 3"},"PeriodicalIF":2.1,"publicationDate":"2022-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80991521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the spread of SARS-CoV-2, enormous amounts of information about the pandemic are disseminated through social media platforms such as Twitter. Social media posts often leverage the trust readers place in prestigious news agencies and cite news articles as a way of gaining credibility. Nevertheless, the cited article does not always support the claim made in the social media post. We present a cross-genre ad hoc pipeline to identify whether the information in a Twitter post (i.e., a “Tweet”) is indeed supported by the cited news article. Our approach is empirically based on a corpus of over 46.86 million Tweets and is divided into two tasks: (i) developing models to detect Tweets that contain a claim and are worth fact-checking, and (ii) verifying whether the claims made in a Tweet are supported by the newswire article it cites. Unlike previous studies that detect unsubstantiated information by post hoc analysis of the patterns of propagation, we seek to identify reliable support (or the lack of it) before the misinformation begins to spread. We discover that nearly half of the Tweets (43.4%) are not factual and hence not worth checking—a significant filter, given the sheer volume of social media posts on a platform such as Twitter. Moreover, we find that among the Tweets that contain a seemingly factual claim while citing a news article as supporting evidence, at least 1% are not actually supported by the cited news and are hence misleading.
{"title":"Seeing Should Probably Not Be Believing: The Role of Deceptive Support in COVID-19 Misinformation on Twitter","authors":"Chaoyuan Zuo, Ritwik Banerjee, H. Shirazi, Fateme Hashemi Chaleshtori, I. Ray","doi":"10.1145/3546914","DOIUrl":"https://doi.org/10.1145/3546914","url":null,"abstract":"With the spread of the SARS-CoV-2, enormous amounts of information about the pandemic are disseminated through social media platforms such as Twitter. Social media posts often leverage the trust readers have in prestigious news agencies and cite news articles as a way of gaining credibility. Nevertheless, it is not always the case that the cited article supports the claim made in the social media post. We present a cross-genre ad hoc pipeline to identify whether the information in a Twitter post (i.e., a “Tweet”) is indeed supported by the cited news article. Our approach is empirically based on a corpus of over 46.86 million Tweets and is divided into two tasks: (i) development of models to detect Tweets containing claim and worth to be fact-checked and (ii) verifying whether the claims made in a Tweet are supported by the newswire article it cites. Unlike previous studies that detect unsubstantiated information by post hoc analysis of the patterns of propagation, we seek to identify reliable support (or the lack of it) before the misinformation begins to spread. We discover that nearly half of the Tweets (43.4%) are not factual and hence not worth checking—a significant filter, given the sheer volume of social media posts on a platform such as Twitter. Moreover, we find that among the Tweets that contain a seemingly factual claim while citing a news article as supporting evidence, at least 1% are not actually supported by the cited news and are hence misleading.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"134 1","pages":"1 - 26"},"PeriodicalIF":2.1,"publicationDate":"2022-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85372275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Conceptual modeling is important for developing databases that maintain the integrity and quality of stored information. However, classical conceptual models have often assumed well-maintained, high-quality data. With the advancement and expansion of data science, this is no longer the case: the need to model and store data has emerged in settings with lower data quality, which creates the need to update and augment conceptual models to represent lower-quality data. In this paper, we focus on the intersection between data completeness (an important aspect of data quality) and complex class semantics (where a complex class entity represents information that spans more than one simple class entity). We propose a new disaggregation construct to allow the modeling of incomplete information. We demonstrate the use of our disaggregation construct for diverse modeling problems and discuss the anomalies that could occur without it. We provide formal definitions and thorough comparisons between various types of complex constructs to guide future application and prove the unique interpretation of our newly proposed disaggregation construct.
{"title":"Data Completeness and Complex Semantics in Conceptual Modeling: The Need for a Disaggregation Construct","authors":"Y. Li, Faiz Currim, S. Ram","doi":"10.1145/3532784","DOIUrl":"https://doi.org/10.1145/3532784","url":null,"abstract":"Conceptual modeling is important for developing databases that maintain the integrity and quality of stored information. However, classical conceptual models have often been assumed to work on well-maintained and high-quality data. With the advancement and expansion of data science, it is no longer the case. The need to model and store data has emerged for settings with lower data quality, which creates the need to update and augment conceptual models to represent lower-quality data. In this paper, we focus on the intersection between data completeness (an important aspect of data quality) and complex class semantics (where a complex class entity represents information that spans more than one simple class entity). We propose a new disaggregation construct to allow the modeling of incomplete information. We demonstrate the use of our disaggregation construct for diverse modeling problems and discuss the anomalies that could occur without this construct. We provide formal definitions and thorough comparisons between various types of complex constructs to guide future application and prove the unique interpretation of our newly proposed disaggregation construct.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"13 1","pages":"1 - 21"},"PeriodicalIF":2.1,"publicationDate":"2022-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72402632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data have become a fundamental element of the management and productive infrastructures of our society, fuelling the digitization of organizational and decision-making processes at an impressive speed. This transition shows lights and shadows, and the “bias in, bias out” problem is one of the most relevant issues; it encompasses technical, ethical, and social perspectives. We address this field of research by investigating how the balance of protected attributes in training data can be used to assess the risk of algorithmic unfairness. We identify four balance measures and test their ability to detect the risk of discriminatory classification by applying them to the training set. The results of this proof of concept show that the indexes can properly detect unfairness in software output. However, we found that the choice of balance measure has a relevant impact on the threshold above which data should be considered risky; further work is necessary to deepen knowledge of this aspect.
{"title":"Detecting Risk of Biased Output with Balance Measures","authors":"Mariachiara Mecati, A. Vetrò, Marco Torchiano","doi":"10.1145/3530787","DOIUrl":"https://doi.org/10.1145/3530787","url":null,"abstract":"Data have become a fundamental element of the management and productive infrastructures of our society, fuelling digitization of organizational and decision-making processes at an impressive speed. This transition shows lights and shadows, and the “bias in-bias out” problem is one of the most relevant issues, which encompasses technical, ethical, and social perspectives. We address this field of research by investigating how the balance of protected attributes in training data can be used to assess the risk of algorithmic unfairness. We identify four balance measures and test their ability to detect the risk of discriminatory classification by applying them to the training set. The results of this proof of concept show that the indexes can properly detect unfairness of software output. However, we found the choice of the balance measure has a relevant impact on the threshold to consider as risky; further work is necessary to deepen knowledge on this aspect.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"47 1","pages":"1 - 7"},"PeriodicalIF":2.1,"publicationDate":"2022-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88531974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Decisions based on algorithms and systems generated from data have become essential tools that pervade all aspects of our daily lives; for these advances to be reliable, the results should be accurate but should also respect all the facets of data equity [11]. In this context, the concepts of Fairness and Diversity have become relevant topics of discussion within the field of Data Science Ethics and, in general, in Data Science. Although data equity is desirable, reconciling this property with accurate decision-making is a critical tradeoff, because applying a repair procedure to restore equity might modify the original data in such a way that the final decision is inaccurate w.r.t. the ultimate objective of the analysis. In this work, we propose E-FAIR-DB, a novel solution that, exploiting the notion of Functional Dependency—a type of data constraint—aims at restoring data equity by discovering and solving discrimination in datasets. The proposed solution is implemented as a pipeline that, first, mines functional dependencies to detect and evaluate fairness and diversity in the input dataset, and then, based on these understandings and on the objective of the data analysis, mitigates data bias, minimizing the number of modifications. Our tool can identify, through the mined dependencies, the attributes of the database that encompass discrimination (e.g., gender, ethnicity, or religion); then, based on these dependencies, it determines the smallest amount of data that must be added and/or removed to mitigate such bias. We evaluate our proposal both through theoretical considerations and experiments on two real-world datasets.
{"title":"E-FAIR-DB: Functional Dependencies to Discover Data Bias and Enhance Data Equity","authors":"Fabio Azzalini, Chiara Criscuolo, L. Tanca","doi":"10.1145/3552433","DOIUrl":"https://doi.org/10.1145/3552433","url":null,"abstract":"Decisions based on algorithms and systems generated from data have become essential tools that pervade all aspects of our daily lives; for these advances to be reliable, the results should be accurate but should also respect all the facets of data equity [11]. In this context, the concepts of Fairness and Diversity have become relevant topics of discussion within the field of Data Science Ethics and, in general, in Data Science. Although data equity is desirable, reconciling this property with accurate decision-making is a critical tradeoff, because applying a repair procedure to restore equity might modify the original data in such a way that the final decision is inaccurate w.r.t. the ultimate objective of the analysis. In this work, we propose E-FAIR-DB, a novel solution that, exploiting the notion of Functional Dependency—a type of data constraint—aims at restoring data equity by discovering and solving discrimination in datasets. The proposed solution is implemented as a pipeline that, first, mines functional dependencies to detect and evaluate fairness and diversity in the input dataset, and then, based on these understandings and on the objective of the data analysis, mitigates data bias, minimizing the number of modifications. Our tool can identify, through the mined dependencies, the attributes of the database that encompass discrimination (e.g., gender, ethnicity, or religion); then, based on these dependencies, it determines the smallest amount of data that must be added and/or removed to mitigate such bias. We evaluate our proposal both through theoretical considerations and experiments on two real-world datasets.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"9 1","pages":"1 - 26"},"PeriodicalIF":2.1,"publicationDate":"2022-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74431949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatically detecting online misinformation at scale is a challenging and interdisciplinary problem. Deciding what is to be considered truthful information is sometimes controversial and difficult even for educated experts. As the scale of the problem increases, human-in-the-loop approaches to truthfulness that combine the scalability of machine learning (ML) with the accuracy of human contributions have been considered. In this work, we look at the potential to automatically combine machine-based systems with human-based systems. The former exploit supervised ML approaches; the latter involve either crowd workers (i.e., human non-experts) or human experts. Since both ML and crowdsourcing approaches can produce a score indicating the level of confidence in their truthfulness judgments (algorithmic or self-reported, respectively), we address the question of whether it is feasible to use such confidence scores to effectively and efficiently combine three approaches: (i) machine-based methods, (ii) crowd workers, and (iii) human experts. The three approaches differ significantly: they range from available, cheap, fast, and scalable but less accurate to scarce, expensive, slow, and not scalable but highly accurate.
{"title":"Combining Human and Machine Confidence in Truthfulness Assessment","authors":"Yunke Qu, Kevin Roitero, David La Barbera, Damiano Spina, Stefano Mizzaro, Gianluca Demartini","doi":"10.1145/3546916","DOIUrl":"https://doi.org/10.1145/3546916","url":null,"abstract":"Automatically detecting online misinformation at scale is a challenging and interdisciplinary problem. Deciding what is to be considered truthful information is sometimes controversial and also difficult for educated experts. As the scale of the problem increases, human-in-the-loop approaches to truthfulness that combine both the scalability of machine learning (ML) and the accuracy of human contributions have been considered. In this work, we look at the potential to automatically combine machine-based systems with human-based systems. The former exploit superviseds ML approaches; the latter involve either crowd workers (i.e., human non-experts) or human experts. Since both ML and crowdsourcing approaches can produce a score indicating the level of confidence on their truthfulness judgments (either algorithmic or self-reported, respectively), we address the question of whether it is feasible to make use of such confidence scores to effectively and efficiently combine three approaches: (i) machine-based methods, (ii) crowd workers, and (iii) human experts. The three approaches differ significantly, as they range from available, cheap, fast, scalable, but less accurate to scarce, expensive, slow, not scalable, but highly accurate.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"220 1","pages":"1 - 17"},"PeriodicalIF":2.1,"publicationDate":"2022-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75890630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automated fact-checking (AFC) systems exist to combat disinformation; however, their complexity usually makes them opaque to the end user, making it difficult to foster trust in the system. In this article, we introduce the E-BART model with the hope of making progress on this front. E-BART is able to provide a veracity prediction for a claim and jointly generate a human-readable explanation for this decision. We show that E-BART is competitive with the state of the art on the e-FEVER and e-SNLI tasks. In addition, we validate the joint-prediction architecture by showing (1) that generating explanations does not significantly impede the model from performing well in its main task of veracity prediction, and (2) that predicted veracity and explanations are more internally coherent when generated jointly than separately. We also calibrate the E-BART model, allowing the output of the final model to be correctly interpreted as the confidence of correctness. Finally, we conduct an extensive human evaluation of the impact of generated explanations and observe that explanations increase human ability to spot misinformation and make people more skeptical about claims, and that explanations generated by E-BART are competitive with ground-truth explanations.
{"title":"A Neural Model to Jointly Predict and Explain Truthfulness of Statements","authors":"Erik Brand, Kevin Roitero, Michael Soprano, A. Rahimi, Gianluca Demartini","doi":"10.1145/3546917","DOIUrl":"https://doi.org/10.1145/3546917","url":null,"abstract":"Automated fact-checking (AFC) systems exist to combat disinformation, however, their complexity usually makes them opaque to the end-user, making it difficult to foster trust in the system. In this article, we introduce the E-BART model with the hope of making progress on this front. E-BART is able to provide a veracity prediction for a claim and jointly generate a human-readable explanation for this decision. We show that E-BART is competitive with the state-of-the-art on the e-FEVER and e-SNLI tasks. In addition, we validate the joint-prediction architecture by showing (1) that generating explanations does not significantly impede the model from performing well in its main task of veracity prediction, and (2) that predicted veracity and explanations are more internally coherent when generated jointly than separately. We also calibrate the E-BART model, allowing the output of the final model to be correctly interpreted as the confidence of correctness. Finally, we also conduct an extensive human evaluation on the impact of generated explanations and observe that: Explanations increase human ability to spot misinformation and make people more skeptical about claims, and explanations generated by E-BART are competitive with ground truth explanations.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"7 1","pages":"1 - 19"},"PeriodicalIF":2.1,"publicationDate":"2022-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86355222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Social media networks have drastically changed how people communicate and seek information. Due to the scale of information on these platforms, newsfeed curation algorithms have been developed to sort through this information and curate what users see. However, these algorithms are opaque and it is difficult to understand their impact on human communication flows. Some papers have criticised newsfeed curation algorithms that, while promoting user engagement, heighten online polarisation, misinformation, and the formation of echo chambers. Agent-based modelling offers the opportunity to simulate the complex interactions between these algorithms, what users see, and the propagation of information on social media. This article uses agent-based modelling to compare the impact of four different newsfeed curation algorithms on the spread of misinformation and polarisation. This research has the following contributions: (1) implementing newsfeed curation algorithm logic in an agent-based model; (2) comparing the impact of different curation algorithm objectives on misinformation and polarisation; and (3) calibrating and empirically validating the model using real Twitter data. This research provides useful insights into the impact of curation algorithms on how information propagates and on content diversity on social media. Moreover, we show how agent-based modelling can reveal specific properties of curation algorithms, which can be used in improving such algorithms.
{"title":"Using Agent-Based Modelling to Evaluate the Impact of Algorithmic Curation on Social Media","authors":"A. Gausen, Wayne Luk, Ce Guo","doi":"10.1145/3546915","DOIUrl":"https://doi.org/10.1145/3546915","url":null,"abstract":"Social media networks have drastically changed how people communicate and seek information. Due to the scale of information on these platforms, newsfeed curation algorithms have been developed to sort through this information and curate what users see. However, these algorithms are opaque and it is difficult to understand their impact on human communication flows. Some papers have criticised newsfeed curation algorithms that, while promoting user engagement, heighten online polarisation, misinformation, and the formation of echo chambers. Agent-based modelling offers the opportunity to simulate the complex interactions between these algorithms, what users see, and the propagation of information on social media. This article uses agent-based modelling to compare the impact of four different newsfeed curation algorithms on the spread of misinformation and polarisation. This research has the following contributions: (1) implementing newsfeed curation algorithm logic on an agent-based model; (2) comparing the impact of different curation algorithm objectives on misinformation and polarisation; and (3) calibration and empirical validation using real Twitter data. This research provides useful insights into the impact of curation algorithms on how information propagates and on content diversity on social media. Moreover, we show how agent-based modelling can reveal specific properties of curation algorithms, which can be used in improving such algorithms.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"15 1","pages":"1 - 24"},"PeriodicalIF":2.1,"publicationDate":"2022-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78258605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}