Title: Preserving Missing Data Distribution in Synthetic Data
Authors: Xinyu Wang, H. Asif, Jaideep Vaidya
DOI: 10.1145/3543507.3583297
Venue: Proceedings of the ACM Web Conference 2023, April 30, 2023
Abstract: Data from Web artifacts, and from the Web more broadly, is often sensitive and cannot be shared directly for analysis. Synthetic data generated from the real data is therefore increasingly used as a privacy-preserving substitute. In many cases, real data from the Web has missing values, and the missingness itself carries important informational content that domain experts leverage to improve their analyses. However, this information is lost if imputation or deletion is applied before synthetic data generation. In this paper, we propose several methods to generate synthetic data that preserve both the observed and the missing data distributions. An extensive empirical evaluation over a range of carefully fabricated and real-world datasets demonstrates the effectiveness of our approach.
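The idea of treating missingness itself as signal can be illustrated with a minimal baseline (my own sketch, not one of the paper's proposed methods): sample a joint missingness pattern from its empirical distribution, then fill the observed slots from per-column value pools. The function name `synthesize` and its interface are hypothetical.

```python
import random
from collections import Counter

def synthesize(rows, n, seed=0):
    """Generate n synthetic rows that preserve the empirical joint
    missingness pattern as well as per-column observed-value frequencies.
    A row is a tuple whose entries may be None (missing)."""
    rng = random.Random(seed)
    # Empirical distribution over joint missingness patterns.
    patterns = Counter(tuple(v is None for v in row) for row in rows)
    pat_list, weights = zip(*patterns.items())
    # Per-column pools of observed (non-missing) values.
    ncols = len(rows[0])
    pools = [[row[c] for row in rows if row[c] is not None]
             for c in range(ncols)]
    out = []
    for _ in range(n):
        pat = rng.choices(pat_list, weights=weights)[0]
        out.append(tuple(None if miss else rng.choice(pools[c])
                         for c, miss in enumerate(pat)))
    return out
```

A real method would also need to preserve dependencies between values and missingness across columns; this sketch only matches the pattern frequencies and marginal value distributions.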
Title: CL-WSTC: Continual Learning for Weakly Supervised Text Classification on the Internet
Authors: Miao Li, Jiaqi Zhu, Xin Yang, Yi Yang, Qiang Gao, Hongan Wang
DOI: 10.1145/3543507.3583249
Venue: Proceedings of the ACM Web Conference 2023, April 30, 2023
Abstract: Continual text classification is an important research direction in Web mining. Existing works are limited to supervised approaches that rely on abundant labeled data, but in the open and dynamic environment of the Internet, with constant semantic change of known topics and the appearance of unknown topics, text annotations are hard to obtain in time for each period. This calls for weakly supervised text classification (WSTC), which requires only seed words for each category and has succeeded in static text classification tasks. However, no studies have yet applied WSTC methods in a continual learning paradigm to actually accommodate the open and evolving Internet. In this paper, we tackle this problem for the first time and propose a framework, Continual Learning for Weakly Supervised Text Classification (CL-WSTC), which can take any WSTC method as its base model. It consists of two modules: classification decision with delay and seed word updating. In the former, the probability threshold for each category in each period is adaptively learned to determine the acceptance or rejection of texts. In the latter, starting from candidate words output by the base model, seed words are added and deleted via reinforcement learning with immediate rewards, according to an empirically certified unsupervised measure. Extensive experiments show that our approach is highly general and achieves a better trade-off between classification accuracy and decision timeliness than non-continual counterparts, with intuitively interpretable seed word updates.
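The "classification decision with delay" module can be sketched as a simple per-period rule (an illustrative assumption on my part, not the paper's actual algorithm): accept the top category if its probability clears that category's adaptively learned threshold, otherwise defer the text, rejecting it after too many deferrals. All names and the `max_delay` parameter are hypothetical.

```python
def decide(prob_by_cat, thresholds, periods_waited=0, max_delay=3):
    """One period's accept/delay/reject decision for a text.
    prob_by_cat and thresholds both map category -> float."""
    best = max(prob_by_cat, key=prob_by_cat.get)
    if prob_by_cat[best] >= thresholds[best]:
        return best, "accept"
    if periods_waited + 1 >= max_delay:
        return None, "reject"
    # Not confident enough yet: revisit this text in the next period.
    return None, "delay"
```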
Title: Fair Graph Representation Learning via Diverse Mixture-of-Experts
Authors: Zheyuan Liu, Chunhui Zhang, Yijun Tian, Erchi Zhang, Chao Huang, Yanfang Ye, Chuxu Zhang
DOI: 10.1145/3543507.3583207
Venue: Proceedings of the ACM Web Conference 2023, April 30, 2023
Abstract: Graph Neural Networks (GNNs) have demonstrated great representation learning capability on graph data and have been utilized in various downstream applications. However, real-world data in web-based applications (e.g., recommendation and advertising) often contains bias, preventing GNNs from learning fair representations. Although many works have been proposed to address the fairness issue, they suffer from the significant problem of insufficient learnable knowledge, with only limited attributes remaining after debiasing. To address this problem, we develop Graph-Fairness Mixture of Experts (G-Fame), a novel plug-and-play method that helps any GNN learn distinguishable representations with unbiased attributes. Furthermore, based on G-Fame, we propose G-Fame++, which introduces three novel strategies to improve representation fairness from the node representation, model layer, and parameter redundancy perspectives. In particular, we first present an embedding diversification method to learn distinguishable node representations. Second, we design a layer diversification strategy to maximize the output difference of distinct model layers. Third, we introduce an expert diversification method that minimizes expert parameter similarities so as to learn diverse and complementary representations. Extensive experiments demonstrate the superiority of G-Fame and G-Fame++ in both accuracy and fairness, compared to state-of-the-art methods across multiple graph datasets.
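The expert diversification idea (minimizing expert parameter similarity) can be illustrated with a simple regularizer sketch, assuming experts are represented by flat parameter vectors; this is my own minimal rendering, not the paper's exact loss term.

```python
import math

def expert_diversity_penalty(experts):
    """Mean pairwise cosine similarity between expert parameter vectors.
    Adding this penalty to the training loss pushes experts toward
    diverse, complementary parameters.  experts: list of equal-length
    non-zero float lists."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    pairs = [(i, j) for i in range(len(experts))
             for j in range(i + 1, len(experts))]
    return sum(cos(experts[i], experts[j]) for i, j in pairs) / len(pairs)
```

Identical experts score 1.0 (maximum penalty), mutually orthogonal experts score 0.0.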
Title: Gender Pay Gap in Sports on a Fan-Request Celebrity Video Site
Authors: Nazanin Sabri, Stephen Reysen, Ingmar Weber
DOI: 10.1145/3543507.3583884
Venue: Proceedings of the ACM Web Conference 2023, April 30, 2023
Abstract: The internet is often thought of as a democratizer, enabling equality in aspects such as pay, and as a tool that introduces novel communication and monetization opportunities. In this study we examine athletes on Cameo, a website that enables bi-directional fan-celebrity interactions, asking whether the well-documented gender pay gaps in sports persist in this digital setting. Traditional studies of gender pay gaps in sports mostly concern a centralized setting where an organization decides the players' pay, whereas Cameo facilitates grass-roots fan engagement in which fans pay for video messages from their preferred athletes. The results show that gender pay gaps persist even on such a platform, both in cost per message and in the number of requests, proxied by the number of ratings. For instance, we find that female athletes have a median pay of $30 per video, while the corresponding figure for men is $40. The results also contribute to the study of parasocial relationships and personalized fan engagement at a distance, something that has become more relevant during the ongoing COVID-19 pandemic, when in-person fan engagement has often been limited.
Title: Is your digital neighbor a reliable investment advisor?
Authors: Daisuke Kawai, A. Cuevas, Bryan R. Routledge, K. Soska, Ariel Zetlin-Jones, Nicolas Christin
DOI: 10.1145/3543507.3583502
Venue: Proceedings of the ACM Web Conference 2023, April 30, 2023
Abstract: The web and social media platforms have drastically changed how investors produce and consume financial advice. Historically, individual investors often relied on newsletters and related prospectuses backed by the reputation and track record of their issuers. Nowadays, financial advice is frequently offered online, by anonymous or pseudonymous parties with little at stake. As such, a natural question is whether these modern financial “influencers” operate in good faith, or whether they might be intentionally misleading their followers. To start answering this question, we obtained data from a very large cryptocurrency derivatives exchange, from which we derived individual trading positions. Some of the investors on that platform elect to link to their Twitter profiles. We were thus able to compare the positions publicly espoused on Twitter with those actually taken in the market. We discovered that 1) staunchly “bullish” investors on Twitter often took much more moderate, if not outright opposite, positions in their own trades when the market was down, 2) their followers tended to align their positions with bullish Twitter outlooks, and 3) moderate voices on Twitter (and their own followers) were, on the other hand, far more consistent with their actual investment strategies. In other words, while social media advice may foster a sense of camaraderie among people of like-minded beliefs, this is merely an illusion, and it may result in financial losses for people blindly following the advice.
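The core measurement here, comparing publicly stated outlooks with actual market positions, reduces to a simple consistency statistic. As an illustrative sketch (my own simplification of the study's methodology), assume each investor has a stated stance (+1 bullish, -1 bearish) and a signed net position:

```python
def consistency_rate(stated, positions):
    """Fraction of investors whose stated social-media stance
    (+1 bullish, -1 bearish) matches the sign of their actual net
    market position (positive = long, negative = short)."""
    matches = sum(1 for s, p in zip(stated, positions)
                  if (s > 0) == (p > 0))
    return matches / len(stated)
```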
Title: Human-in-the-loop Regular Expression Extraction for Single Column Format Inconsistency
Authors: Shaochen Yu, Lei Han, M. Indulska, S. Sadiq, Gianluca Demartini
DOI: 10.1145/3543507.3583515
Venue: Proceedings of the ACM Web Conference 2023, April 30, 2023
Abstract: Format inconsistency is one of the most frequent data quality issues encountered during data cleaning. Existing automated approaches commonly lack applicability and generalisability, while approaches with human input typically require specialized skills such as writing regular expressions. This paper proposes a novel hybrid human-machine system, “Data-Scanner-4C”, which leverages crowdsourcing to effectively address syntactic format inconsistencies within a single column. We first ask crowd workers to create examples from single-column data through “data selection” and “result validation” tasks. Then, we propose and use a novel rule-based learning algorithm to infer regular expressions that propagate formats from the created examples to the entire column. Our system integrates crowdsourcing and algorithmic format-extraction techniques in a single workflow. Human experts no longer need to write regular expressions, which reduces both the time required and the opportunity for error. We conducted experiments on both synthetic and real-world datasets, and our results show that the proposed approach is applicable and effective across data types and formats.
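The format-propagation step can be illustrated with a toy version of regex inference (my own sketch, not the paper's rule-based learning algorithm): abstract each cell into a format signature, take the dominant signature as the column's expected format, and flag cells that deviate.

```python
import re
from collections import Counter

def cell_pattern(s):
    """Abstract a cell value into a regex-like format signature:
    runs of digits -> r'\d+', runs of letters -> '[A-Za-z]+',
    any other character kept literally."""
    out = []
    for m in re.finditer(r"(?P<d>[0-9]+)|(?P<a>[A-Za-z]+)|(?P<o>.)", s):
        out.append({"d": r"\d+", "a": "[A-Za-z]+"}.get(m.lastgroup, m.group()))
    return "".join(out)

def flag_inconsistent(column):
    """Return (dominant_pattern, indices of cells that deviate from it)."""
    pats = [cell_pattern(c) for c in column]
    dominant, _ = Counter(pats).most_common(1)[0]
    return dominant, [i for i, p in enumerate(pats) if p != dominant]
```

For a date column like `["12-Mar", "03-Apr", "2023/04/01"]`, the dominant signature is `\d+-[A-Za-z]+` and the third cell is flagged.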
Title: Multi-aspect Diffusion Network Inference
Authors: Hao Huang, Ke‐qi Han, Beicheng Xu, Ting Gan
DOI: 10.1145/3543507.3583228
Venue: Proceedings of the ACM Web Conference 2023, April 30, 2023
Abstract: To learn influence relationships between nodes in a diffusion network, most existing approaches rely on precise timestamps of historical node infections. The target network is customarily assumed to be a one-aspect diffusion network with homogeneous influence relationships. Nonetheless, tracing node infection timestamps is often infeasible due to high cost, and influence relationships may be heterogeneous because of the diversity of propagation media. In this work, we study how to infer a multi-aspect diffusion network with heterogeneous influence relationships, using only node infection statuses, which are more readily accessible in practice. Equipped with a probabilistic generative model, we iteratively conduct a posteriori quantitative analysis of the network's historical diffusion results, and infer the structure and strengths of the homogeneous influence relationships in each aspect. Extensive experiments on both synthetic and real-world networks verify the effectiveness and efficiency of our approach.
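To see what "using only infection statuses" buys, consider a much cruder proxy than the paper's probabilistic generative model (this sketch is my own illustration): estimate the influence strength of each ordered node pair from how often the two nodes are co-infected across cascades, with no timestamps involved.

```python
def coinfection_strengths(cascades, nodes):
    """Crude influence-strength proxy: for each ordered pair (u, v), the
    fraction of cascades containing u in which v is also infected.
    cascades: list of sets of infected nodes (statuses only, no times)."""
    strengths = {}
    for u in nodes:
        with_u = [c for c in cascades if u in c]
        for v in nodes:
            if u == v:
                continue
            strengths[(u, v)] = (
                sum(1 for c in with_u if v in c) / len(with_u)
                if with_u else 0.0)
    return strengths
```

This proxy is symmetric-ish and confounds correlation with influence; the appeal of a generative model is precisely that it can disentangle aspects and directionality that raw co-infection counts cannot.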
Title: Is IPFS Ready for Decentralized Video Streaming?
Authors: Zhengyu Wu, ChengHao Ryan Yang, Santiago Vargas, A. Balasubramanian
DOI: 10.1145/3543507.3583404
Venue: Proceedings of the ACM Web Conference 2023, April 30, 2023
Abstract: The InterPlanetary File System (IPFS) is a peer-to-peer protocol for decentralized content storage and retrieval. The IPFS platform has the potential to help users evade censorship and avoid a central point of failure. IPFS is seeing increasing adoption for distributing various kinds of files, including video. However, the performance of video streaming on IPFS has not been well studied. We conduct a measurement study of over 28,000 videos hosted on the IPFS network and find that video streaming experiences high stall rates due to relatively high round-trip times (RTTs). Further, videos are encoded at a single static quality, so streaming cannot adapt to different network conditions. A natural approach is to use adaptive bitrate (ABR) algorithms, which encode videos at multiple quality levels and stream according to the available throughput. However, traditional ABR algorithms perform poorly on IPFS because throughput cannot be estimated correctly: video segments can be retrieved from multiple sources, which makes estimation difficult. To overcome this issue, we have designed Telescope, an IPFS-aware ABR system. We conduct experiments on the IPFS network, with video providers geographically distributed across the globe. Our results show that Telescope significantly improves the Quality of Experience (QoE) of videos across a diverse set of network and cache conditions, compared to traditional ABR.
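The multi-source throughput problem can be made concrete with a conservative bitrate-selection sketch (my own assumption about how one might handle it, not Telescope's actual algorithm): keep per-provider throughput samples, estimate each provider by a harmonic mean, and pick the highest bitrate that even the slowest provider can sustain, with a safety margin.

```python
def pick_bitrate(bitrates, provider_samples, safety=0.8):
    """Choose the highest bitrate (kbps) sustainable when segments may
    come from several providers: harmonic-mean each provider's recent
    throughput samples (kbps), take the minimum across providers, and
    scale by a safety factor before matching against the bitrate ladder."""
    def harmonic_mean(xs):
        return len(xs) / sum(1.0 / x for x in xs)
    estimate = min(harmonic_mean(s) for s in provider_samples.values()) * safety
    feasible = [b for b in sorted(bitrates) if b <= estimate]
    return feasible[-1] if feasible else min(bitrates)
```

Taking the minimum across providers is deliberately pessimistic; a real system would weight providers by how likely each is to serve the next segment.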
Title: Web Structure Derived Clustering for Optimised Web Accessibility Evaluation
Authors: Alexander Hambley, Y. Yeşilada, Markel Vigo, S. Harper
DOI: 10.1145/3543507.3583508
Venue: Proceedings of the ACM Web Conference 2023, April 30, 2023
Abstract: Web accessibility evaluation is a costly and complex process due to limited time and resources, as well as ambiguity. To optimise the accessibility evaluation process, we aim to reduce the number of pages auditors must review by employing statistically representative pages, reducing a site of thousands of pages to a manageable review of archetypal pages. Our paper focuses on representativeness, one of six proposed metrics that form our methodology, to address the limitations we have identified in the W3C Website Accessibility Conformance Evaluation Methodology (WCAG-EM). These include the evaluative scope, the non-probabilistic sampling approach, and the potential for bias within the selected sample. Representativeness, in particular, is a metric that assesses the quality and coverage of sampling. To measure it, we systematically evaluate five web page representations on a website of 388 pages: tags, structure, the DOM tree, content, and a mixture of structure and content. Our findings highlight the importance of including structural components in representations. We validate our conclusions using the same methodology on three additional random sites of 500 pages each. As an exclusive attribute, we find that features derived from web content are suboptimal and can lead to lower-quality and more disparate clustering for optimised accessibility evaluation.
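A tag-based structural representation, the simplest of the five representations discussed, can be sketched as follows (an illustrative toy, not the paper's pipeline): represent each page by its HTML tag histogram and pick the page closest to the centroid as the archetype an auditor would review.

```python
import math
from collections import Counter

def tag_vector(tags, vocab):
    """Represent a page by its HTML tag frequency histogram over vocab."""
    c = Counter(tags)
    return [c.get(t, 0) for t in vocab]

def most_representative(pages):
    """Index of the page whose tag vector is closest (Euclidean) to the
    centroid of all pages -- a one-cluster stand-in for sampling one
    archetypal page per cluster.  pages: list of tag-name lists."""
    vocab = sorted({t for p in pages for t in p})
    vecs = [tag_vector(p, vocab) for p in pages]
    centroid = [sum(col) / len(vecs) for col in zip(*vecs)]
    def dist(v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, centroid)))
    return min(range(len(vecs)), key=lambda i: dist(vecs[i]))
```

Clustering pages first (e.g., k-means over these vectors) and taking one representative per cluster is the natural extension of this one-cluster sketch.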
Title: Anti-FakeU: Defending Shilling Attacks on Graph Neural Network based Recommender Model
Authors: X. You, Chi-Pan Li, Daizong Ding, Mi Zhang, Fuli Feng, Xudong Pan, Min Yang
DOI: 10.1145/3543507.3583289
Venue: Proceedings of the ACM Web Conference 2023, April 30, 2023
Abstract: Graph neural network (GNN) based recommendation models are observed to be more vulnerable to carefully designed malicious records injected into the system, i.e., shilling attacks, which manipulate the recommendations shown to ordinary users and thereby impair user trust. In this paper, we conduct the first systematic study of the vulnerability of GNN-based recommendation models to shilling attacks. With the aid of theoretical analysis, we attribute the root cause of this vulnerability to the neighborhood aggregation mechanism, which can make the negative impact of an attack propagate rapidly through the system. To restore the robustness of a GNN-based recommendation model, the key lies in detecting malicious records in the system and preventing the propagation of misinformation. To this end, we construct a user-user graph to capture the patterns of malicious behaviors and design a novel GNN-based detector to identify fake users. Furthermore, we develop a data augmentation strategy and a joint learning paradigm to train the recommender model together with the proposed detector. Extensive experiments on benchmark datasets validate the enhanced robustness of the proposed method in resisting various types of shilling attacks and identifying fake users; for example, it fully mitigates the impact of popularity attacks on target items and improves the accuracy of detecting fake users on the Gowalla dataset.
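The user-user behaviour graph idea can be illustrated with a minimal co-rating overlap score (my own sketch, not the paper's detector): coordinated shilling accounts injected together tend to rate near-identical item sets, so an unusually high maximum Jaccard overlap with another user is a simple suspicion signal and a natural edge weight for such a graph.

```python
def max_jaccard(user_items):
    """For each user, the maximum Jaccard overlap of rated-item sets with
    any other user.  user_items: dict mapping user -> set of item ids."""
    users = list(user_items)
    scores = {}
    for u in users:
        best = 0.0
        for v in users:
            if u == v:
                continue
            inter = len(user_items[u] & user_items[v])
            union = len(user_items[u] | user_items[v])
            best = max(best, inter / union if union else 0.0)
        scores[u] = best
    return scores
```

A learned GNN detector, as proposed in the paper, would go far beyond this single feature, but thresholding such overlap scores already separates obviously coordinated accounts from organic users.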