In Internet applications, network conversations are the primary form of communication between users and servers. To improve users’ Quality of Service, the server needs to return the corresponding service quickly and efficiently according to the conversation sent by the user. Conversational Information Seeking (CIS) has therefore become a hot research topic. In Cloud Computing (CC), a centralized service mode, the conversation is transmitted between the user and the remote cloud over a long distance. With the explosive growth of Internet applications, network congestion, long-distance communication, and single points of failure have brought new challenges to this centralized service mode. Edge Cloud Computing (ECC) was proposed to meet these challenges. As a distributed service mode, ECC extends CC by migrating services from the remote cloud to the network edge, closer to users, and thereby addresses the challenges above. In ECC, CIS is supported through edge caching, and current research focuses on designing edge cache strategies that make caching more predictable. In this paper, we propose an edge cache placement method, the Evolutionary Game based Caching Placement Strategy (EG-CPS). The method consists of three modules: a user preference prediction module, a content popularity calculation module, and a cache placement decision module. To maximize the predictability of the cache strategy, we focus on optimizing the cache hit rate and the service latency. Simulation experiments compare the proposed strategy with several other cache strategies. The results show that EG-CPS reduces the average content request latency by up to 2.4%, increases the average direct cache hit rate by 1.7%, and increases the average edge cache hit rate by 3.3%.
{"title":"Edge Caching Placement Strategy based on Evolutionary Game for Conversational Information Seeking in Edge Cloud Computing","authors":"Hongjian Shi, Meng Zhang, RuHui Ma, Liwei Lin, Rui Zhang, Haibing Guan","doi":"10.1145/3624985","DOIUrl":"https://doi.org/10.1145/3624985","url":null,"abstract":"In Internet applications, network conversation is the primary communication between the user and server. The server needs to efficiently and quickly return the corresponding service according to the conversation sent by the user to improve the users’ Quality of Service. Thus, Conversation Information Seeking (CIS) research has become a hot topic today. In Cloud Computing (CC), a central service mode, the conversation is transmitted between the user and the remote cloud over a long distance. With the explosive growth of Internet applications, network congestion, long-distance communication, and single point of failure have brought new challenges to the centralized service mode. People put forward Edge Cloud Computing (ECC) to meet the new challenges of the centralized service mode of CC. As a distributed service mode, ECC is an extension of CC. By migrating services from the remote cloud to the network edge closer to users, ECC can solve the above challenges in CC well. In ECC, people solve the problem of CIS through edge caching. The current research focuses on designing the edge cache strategy to achieve more predictable caching. In this paper, we propose an edge cache placement method Evolutionary Game based Caching Placement Strategy (EG-CPS). This method consists of three modules: the user preference prediction module, the content popularity calculation module, and the cache placement decision module. To maximize the predictability of the cache strategy, we are committed to optimizing the cache hit rate and service latency. The simulation experiment compares the proposed strategy with several other cache strategies. The experimental results illustrate that EG-CPS can reduce up to 2.4% of the original average content request latency, increase the average direct cache hit rate by 1.7%, and increase the average edge cache hit rate by 3.3%.","PeriodicalId":50940,"journal":{"name":"ACM Transactions on the Web","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136313803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Existing studies in conversational AI mostly treat task-oriented dialog (TOD) and question answering (QA) as separate tasks. Towards the goal of constructing a conversational agent that can complete user tasks and support information seeking, it is important to develop a system that can handle both TOD and QA with access to various external knowledge sources. In this work, we propose a new task, Open-Book TOD (OB-TOD), which combines TOD with QA and expands the external knowledge sources to include both explicit sources (e.g., the web) and implicit sources (e.g., pre-trained language models). We create a new dataset, OB-MultiWOZ, in which we enrich TOD sessions with a QA-like information-seeking experience grounded in external knowledge. We propose a unified model, OPERA (Open-book End-to-end Task-oriented Dialog), which can appropriately access explicit and implicit external knowledge to tackle the OB-TOD task. Experimental results show that OPERA outperforms closed-book baselines, highlighting the value of both types of knowledge.
{"title":"OPERA: Harmonizing Task-Oriented Dialogs and Information Seeking Experience","authors":"Miaoran Li, Baolin Peng, Jianfeng Gao, Zhu Zhang","doi":"10.1145/3623381","DOIUrl":"https://doi.org/10.1145/3623381","url":null,"abstract":"Existing studies in conversational AI mostly treat task-oriented dialog (TOD) and question answering (QA) as separate tasks. Towards the goal of constructing a conversational agent that can complete user tasks and support information seeking, it is important to develop a system that can handle both TOD and QA with access to various external knowledge sources. In this work, we propose a new task, Open-Book TOD (OB-TOD), which combines TOD with QA and expands the external knowledge sources to include both explicit sources (e.g., the web) and implicit sources (e.g., pre-trained language models). We create a new dataset OB-MultiWOZ, where we enrich TOD sessions with QA-like information-seeking experience grounded on external knowledge. We propose a unified model OPERA ( Op en-book E nd-to-end Task-o r iented Di a log) which can appropriately access explicit and implicit external knowledge to tackle the OB-TOD task. Experimental results show that OPERA outperforms closed-book baselines, highlighting the value of both types of knowledge.","PeriodicalId":50940,"journal":{"name":"ACM Transactions on the Web","volume":"139 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135980465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this article, we present a large-scale empirical analysis of the use of web storage in the wild. By using dynamic taint tracking at the level of JavaScript and performing an automated classification of the detected information flows, we shed light on the key characteristics of web storage use in the Tranco Top 10k. Our analysis shows that web storage is routinely accessed by third parties, including known web trackers, who are particularly eager to have both read and write access to persistent web storage information. We then take a deep dive into web tracking as a prominent case study: our analysis shows that web storage is not yet as popular as cookies for tracking purposes; however, taint tracking is useful for detecting potential new trackers not included in standard filter lists. Moreover, we observe that many websites do not comply with the General Data Protection Regulation (GDPR) directives when it comes to their use of web storage.
{"title":"An Empirical Analysis of Web Storage and its Applications to Web Tracking","authors":"Zubair Ahmad, Samuele Casarin, Stefano Calzavara","doi":"10.1145/3623382","DOIUrl":"https://doi.org/10.1145/3623382","url":null,"abstract":"In this article we present a large-scale empirical analysis of the use of web storage in the wild. By using dynamic taint tracking at the level of JavaScript and by performing an automated classification of the detected information flows, we shed light on the key characteristics of web storage uses in the Tranco Top 10k. Our analysis shows that web storage is routinely accessed by third parties, including known web trackers, who are particularly eager to have both read and write access to persistent web storage information. We then deep dive in web tracking as a prominent case study: our analysis shows that web storage is not yet as popular as cookies for tracking purposes, however taint tracking is useful to detect potential new trackers not included in standard filter lists. Moreover, we observe that many websites do not comply with the General Data Protection Regulation (GDPR) directives when it comes to their use of web storage.","PeriodicalId":50940,"journal":{"name":"ACM Transactions on the Web","volume":" ","pages":""},"PeriodicalIF":3.5,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43935855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Visual Question Answering (VQA) is the task of predicting an answer to a question based on the content of an image. However, recent VQA methods have relied more on language priors between the question and answer than on the image content. To address this issue, many debiasing methods have been proposed to reduce language bias in model reasoning. However, the bias can be divided into two categories: good bias and bad bias. Good bias can benefit answer prediction, while bad bias may associate the model with unrelated information. Therefore, instead of excluding good and bad bias indiscriminately, as existing debiasing methods do, we propose a bias discrimination module to distinguish them. Additionally, bad bias may reduce the model’s reliance on image content during answer reasoning and thus contribute little to image feature updating. To tackle this, we leverage Markov theory to construct a Markov field with image regions and question words as nodes. This helps update the features of both image regions and question words, thereby facilitating more accurate and comprehensive reasoning about both the image content and the question. We evaluate our network on the VQA v2 and VQA-CP v2 datasets and conduct extensive quantitative and qualitative studies to verify its effectiveness. Experimental results show that our network achieves significant improvements over previous state-of-the-art methods.
{"title":"Multi-stage reasoning on introspecting and revising bias for visual question answering","authors":"Anjin Liu, Zimu Lu, Ning Xu, Min Liu, Chenggang Yan, Bolun Zheng, Bo Lv, Yulong Duan, Zhuang Shao, Xuanya Li","doi":"10.1145/3616399","DOIUrl":"https://doi.org/10.1145/3616399","url":null,"abstract":"Visual Question Answering (VQA) is a task that involves predicting an answer to a question depending on the content of an image. However, recent VQA methods have relied more on language priors between the question and answer rather than the image content. To address this issue, many debiasing methods have been proposed to reduce language bias in model reasoning. However, the bias can be divided into two categories: good bias and bad bias. Good bias can benefit to the answer predication, while the bad bias may associate the models with the unrelated information. Therefore, instead of excluding good and bad bias indiscriminately in existing debiasing methods, we proposed a bias discrimination module to distinguish them. Additionally, bad bias may reduce the model’s reliance on image content during answer reasoning, and thus attend little on image features updating. To tackle this, we leverage Markov theory to construct a Markov field with image regions and question words as nodes. This helps with feature updating for both image regions and question words, thereby facilitating more accurate and comprehensive reasoning about both the image content and question. To verify the effectiveness of our network, we evaluate our network on VQA v2 and VQA cp v2 datasets and conduct extensive quantity and quality studies to verify the effectiveness of our proposed network. Experimental results show that our network achieves significant performance against the previous state-of-the-art methods.","PeriodicalId":50940,"journal":{"name":"ACM Transactions on the Web","volume":"24 45","pages":""},"PeriodicalIF":3.5,"publicationDate":"2023-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41312288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Massively multiplayer online games (MMOGs) played on the Web provide a new form of social, computer-mediated interaction that allows the connection of millions of players worldwide. The rules governing team-based MMOGs are typically complex and non-deterministic, giving rise to intricate dynamical behavior. However, due to the novelty and complexity of MMOGs, their behavior is understudied. In this paper, we investigate the MMOG World of Tanks (WOT) Blitz using a combined approach based on data science and complex adaptive systems. We analyze data at the population level to gain insight into the organizational principles of the game and its game mechanics. To this end, we study the scaling behavior and the predictability of system variables. As a result, we find power-law behavior at the population level, revealing long-range interactions between system variables. Furthermore, we identify and quantify the predictability of summary statistics of the game and its decomposition into explanatory variables. This reveals a heterogeneous progression through the tiers and identifies only a single system variable as the key driver of the win rate.
{"title":"Human team behavior and predictability in the massively multiplayer online game WOT Blitz","authors":"F. Emmert-Streib, S. Tripathi, M. Dehmer","doi":"10.1145/3617509","DOIUrl":"https://doi.org/10.1145/3617509","url":null,"abstract":"Massively multiplayer online games (MMOGs) played on the Web provide a new form of social, computer-mediated interactions that allow the connection of millions of players worldwide. The rules governing team-based MMOGs are typically complex and non-deterministic giving rise to an intricate dynamical behavior. However, due to the novelty and complexity of MMOGs their behavior is understudied. In this paper, we investigate the MMOG World of Tanks (WOT) Blitz by using a combined approach based on data science and complex adaptive systems. We analyze data on the population level to get insight into organizational principles of the game and its game mechanics. For this reason, we study the scaling behavior and the predictability of system variables. As a result, we find a power-law behavior on the population level revealing long-range interactions between system variables. Furthermore, we identify and quantify the predictability of summary statistics of the game and its decomposition into explanatory variables. This reveals a heterogeneous progression through the tiers and identifies only a single system variable as key driver for the win rate.","PeriodicalId":50940,"journal":{"name":"ACM Transactions on the Web","volume":" ","pages":""},"PeriodicalIF":3.5,"publicationDate":"2023-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47255856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In recent years, multi-behavior information has been utilized to address data sparsity and cold-start issues. General multi-behavior models capture multiple behaviors of users to make the representations of relevant features more fine-grained and informative. However, most current multi-behavior recommendation methods neglect the exploration of social relations between users. In fact, users’ potential social connections are critical for helping them filter multifarious messages, and may be one key for models to tap deeper into users’ interests. Additionally, existing models usually focus on users’ positive behaviors (e.g., click, follow, and purchase) and tend to ignore the value of negative behaviors (e.g., unfollow and badpost). In this work, we present a Multi-Behavior Graph (MBG) construction method based on user behaviors and social relationships, and then introduce a novel socially enhanced, behavior-aware graph neural network for behavior prediction. Specifically, we propose a Socially Enhanced Heterogeneous Graph Convolutional Network (SHGCN) model, which utilizes a behavior heterogeneous graph convolution module and a social graph convolution module to effectively incorporate behavior features and social information for precise multi-behavior prediction. In addition, an aggregation pooling mechanism is introduced to integrate the outputs of different graph convolution layers, and a dynamic adaptive loss (DAL) method is presented to explore the weight of each behavior. Experimental results on datasets from the e-commerce platforms Epinions and Ciao indicate the promising performance of SHGCN. Compared with the strongest baseline, SHGCN achieves 3.3% and 1.4% uplift in AUC on the Epinions and Ciao datasets, respectively. Further experiments, including model efficiency analysis, studies of the DAL mechanism, and ablation experiments, confirm the value of the multi-behavior information and the social enhancement.
{"title":"SHGCN: Socially Enhanced Heterogeneous Graph Convolutional Network for Multi-Behavior Prediction","authors":"Lei Zhang, Wuji Zhang, Likang Wu, Ming He, Hongke Zhao","doi":"10.1145/3617510","DOIUrl":"https://doi.org/10.1145/3617510","url":null,"abstract":"In recent years, multi-behavior information has been utilized to address data sparsity and cold-start issues. The general multi-behavior models capture multiple behaviors of users to make the representation of relevant features more fine-grained and informative. However, most current multi-behavior recommendation methods neglect the exploration of social relations between users. Actually, users’ potential social connections are critical to assist them in filtering multifarious messages, which may be one key for models to tap deeper into users’ interests. Additionally, existing models usually focus on the positive behaviors (e.g. click, follow and purchase) of users and tend to ignore the value of negative behaviors (e.g. unfollow and badpost). In this work, we present a Multi-Behavior Graph (MBG) construction method based on user behaviors and social relationships, and then introduce a novel socially enhanced and behavior-aware graph neural network for behavior prediction. Specifically, we propose a Socially Enhanced Heterogeneous Graph Convolutional Network (SHGCN) model, which utilizes behavior heterogeneous graph convolution module and social graph convolution module to effectively incorporate behavior features and social information to achieve precise multi-behavior prediction. In addition, the aggregation pooling mechanism is suggested to integrate the outputs of different graph convolution layers, and a dynamic adaptive loss (DAL) method is presented to explore the weight of each behavior. The experimental results on the datasets of the e-commerce platforms (i.e., Epinions and Ciao) indicate the promising performance of SHGCN. Compared with the most powerful baseline, SHGCN achieves 3.3% and 1.4% uplift in terms of AUC on the Epinions and Ciao datasets. Further experiments, including model efficiency analysis, DAL mechanism and ablation experiments, confirm the validity of the multi-behavior information and social enhancement.","PeriodicalId":50940,"journal":{"name":"ACM Transactions on the Web","volume":" ","pages":""},"PeriodicalIF":3.5,"publicationDate":"2023-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43629543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The presence of fake news on online social media is overwhelming and has impacted several aspects of people’s lives, from health to politics, the economy, and the response to natural disasters. Although significant effort has been made to mitigate the spread of fake news, current research focuses on single aspects of the problem, such as detecting fake news spreaders and classifying stories as either factual or fake. In this paper, we propose a new method that exploits the inter-relationships between stories, sources, and final users and integrates prior knowledge of these three entities to jointly estimate the credibility degree of each entity involved in the news ecosystem. Specifically, we develop a new graph convolutional network, namely Role-Relational Graph Convolutional Networks (Role-RGCN), which learns a unique node representation space for each node type (or role) and jointly connects the different representation spaces through edge relations. To test our approach, we conducted an experimental evaluation on the state-of-the-art FakeNewsNet-Politifact dataset and on a new dataset we collected with ground truth on news credibility degrees. Experimental results show the superior performance of the proposed Role-RGCN method at predicting the credibility degree of stories, sources, and users compared to state-of-the-art approaches and other baselines.
{"title":"Joint Credibility Estimation of News, User, and Publisher via Role-Relational Graph Convolutional Networks","authors":"Anu Shrestha, Jason Duran, Francesca Spezzano, Edoardo Serra","doi":"10.1145/3617418","DOIUrl":"https://doi.org/10.1145/3617418","url":null,"abstract":"The presence of fake news on online social media is overwhelming and is responsible for having impacted several aspects of people’s lives, from health to politics, the economy, and response to natural disasters. Although significant effort has been made to mitigate fake news spread, current research focuses on single aspects of the problem, such as detecting fake news spreaders and classifying stories as either factual or fake. In this paper, we propose a new method to exploit inter-relationships between stories, sources, and final users and integrate prior knowledge of these three entities to jointly estimate the credibility degree of each entity involved in the news ecosystem. Specifically, we develop a new graph convolutional network, namely Role-Relational Graph Convolutional Networks (Role-RGCN), to learn, for each node type (or role), a unique node representation space and jointly connect the different representation spaces with edge relations. To test our proposed approach, we conducted an experimental evaluation on the state-of-the-art FakeNewsNet-Politifact dataset and a new dataset with ground truth on news credibility degrees we collected. Experimental results show a superior performance of our Role-RGCN proposed method at predicting the credibility degree of stories, sources, and users compared to state-of-the-art approaches and other baselines.","PeriodicalId":50940,"journal":{"name":"ACM Transactions on the Web","volume":" ","pages":""},"PeriodicalIF":3.5,"publicationDate":"2023-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45331298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatically scraping relevant images from web pages is an error-prone and time-consuming task, leading experts to prefer manually preparing extraction patterns for a website. Existing web scraping tools are built on these patterns. However, this manual approach is laborious and requires specialized knowledge. Automatic extraction approaches, while a potential solution, require large training datasets and numerous features, including width, height, pixels, and file size, that can be difficult and time-consuming to obtain. To address these challenges, we propose a semi-automatic approach that does not require an expert, utilizes small training datasets, and has a low error rate while saving time and storage. Our approach involves clustering web pages from a website and suggesting several pages for a non-expert to annotate relevant images. The approach then uses these annotations to construct a learning model based on textual data from the HTML elements. In the experiments, we used a dataset of 635,015 images from 200 news websites, each containing 100 pages, with 22,632 relevant images. When comparing several machine learning methods for both automatic approaches and our proposed approach, the AdaBoost method yields the best performance results. When using automatic extraction approaches, the best f-Measure that can be achieved is 0.805 with a learning model constructed from a large training dataset consisting of 120 websites (12,000 web pages). In contrast, our approach achieved an average f-Measure of 0.958 for 200 websites with only six web pages annotated per website. This means that a non-expert only needs to examine 1,200 web pages to determine the relevant images for 200 websites. Our approach also saves time and storage space by not requiring the download of images and can be easily integrated into currently available web scraping tools because it is based on textual data.
{"title":"Scraping Relevant Images from Web Pages Without Download","authors":"Erdinç Uzun","doi":"10.1145/3616849","DOIUrl":"https://doi.org/10.1145/3616849","url":null,"abstract":"Automatically scraping relevant images from web pages is an error-prone and time-consuming task, leading experts to prefer manually preparing extraction patterns for a website. Existing web scraping tools are built on these patterns. However, this manual approach is laborious and requires specialized knowledge. Automatic extraction approaches, while a potential solution, require large training datasets and numerous features, including width, height, pixels, and file size, that can be difficult and time-consuming to obtain. To address these challenges, we propose a semi-automatic approach that does not require an expert, utilizes small training datasets, and has a low error rate while saving time and storage. Our approach involves clustering web pages from a website and suggesting several pages for a non-expert to annotate relevant images. The approach then uses these annotations to construct a learning model based on textual data from the HTML elements. In the experiments, we used a dataset of 635,015 images from 200 news websites, each containing 100 pages, with 22,632 relevant images. When comparing several machine learning methods for both automatic approaches and our proposed approach, the AdaBoost method yields the best performance results. When using automatic extraction approaches, the best f-Measure that can be achieved is 0.805 with a learning model constructed from a large training dataset consisting of 120 websites (12,000 web pages). In contrast, our approach achieved an average f-Measure of 0.958 for 200 websites with only six web pages annotated per website. This means that a non-expert only needs to examine 1,200 web pages to determine the relevant images for 200 websites. Our approach also saves time and storage space by not requiring the download of images and can be easily integrated into currently available web scraping tools because it is based on textual data.","PeriodicalId":50940,"journal":{"name":"ACM Transactions on the Web","volume":" ","pages":""},"PeriodicalIF":3.5,"publicationDate":"2023-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43978242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Social networks are platforms for individuals and organizations to connect with each other and to inform, advertise, spread ideas, and ultimately influence opinions. These platforms have been known to propel misinformation. We argue that this could be compounded by the recommender algorithms these platforms use to suggest items potentially of interest to their users, given the known bias and filter-bubble issues affecting recommender systems. While much has been studied about misinformation on social networks, research on the potential exacerbation caused by recommender algorithms in this environment is in its infancy. In this manuscript, we present the results of an in-depth analysis conducted on two datasets (the Politifact FakeNewsNet dataset and the HealthStory FakeHealth dataset) in order to deepen our understanding of the interconnection between recommender algorithms and misinformation spread on Twitter. In particular, we explore the degree to which well-known recommendation algorithms are prone to being impacted by misinformation. Via simulation, we also study misinformation diffusion on social networks as triggered by suggestions produced by these recommendation algorithms. Outcomes from this work show that misinformation does not affect all recommendation algorithms equally. Popularity-based and network-based recommender algorithms contribute the most to misinformation diffusion. Users known to be superspreaders directly impact algorithmic performance and misinformation spread in specific scenarios. Our findings result in a number of implications for researchers and practitioners to consider when designing and deploying recommender algorithms in social networks.
{"title":"Understanding the Contribution of Recommendation Algorithms on Misinformation Recommendation and Misinformation Dissemination on Social Networks","authors":"Royal Pathak, Francesca Spezzano, M. S. Pera","doi":"10.1145/3616088","DOIUrl":"https://doi.org/10.1145/3616088","url":null,"abstract":"Social networks are a platform for individuals and organizations to connect with each other and inform, advertise, spread ideas, and ultimately influence opinions. These platforms have been known to propel misinformation. We argue that this could be compounded by the recommender algorithms that these platforms use to suggest items potentially of interest to their users, given the known biases and filter bubbles issues affecting recommender systems. While much has been studied about misinformation on social networks, the potential exacerbation that could result from recommender algorithms in this environment is in its infancy. In this manuscript, we present the result of an in-depth analysis conducted on two datasets (Politifact FakeNewsNet dataset and HealthStory FakeHealth dataset) in order to deepen our understanding of the interconnection between recommender algorithms and misinformation spread on Twitter. In particular, we explore the degree to which well-known recommendation algorithms are prone to be impacted by misinformation. Via simulation, we also study misinformation diffusion on social networks, as triggered by suggestions produced by these recommendation algorithms. Outcomes from this work evidence that misinformation does not equally affect all recommendation algorithms. Popularity-based and network-based recommender algorithms contribute the most to misinformation diffusion. Users who are known to be superspreaders are known to directly impact algorithmic performance and misinformation spread in specific scenarios. Findings emerging from our exploration result in a number of implications for researchers and practitioners to consider when designing and deploying recommender algorithms in social networks.","PeriodicalId":50940,"journal":{"name":"ACM Transactions on the Web","volume":" ","pages":""},"PeriodicalIF":3.5,"publicationDate":"2023-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44082288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Graph Query Language (GraphQL) is a powerful language for API manipulation in web services. It has recently been introduced as an alternative solution for addressing the limitations of RESTful APIs. This paper introduces an automated solution for testing GraphQL APIs. We present a full framework for automated API testing, from schema extraction to test case generation. In addition, we consider two kinds of testing: white-box and black-box. White-box testing is performed when the source code of the GraphQL API is available. Our approach is based on evolutionary search: test cases are evolved to intelligently explore the solution space while maximizing code coverage and fault-finding criteria. Black-box testing does not require access to the source code of the GraphQL API. It is therefore of more general applicability, albeit with worse performance. In this context, we use a random search to generate GraphQL data. The proposed framework is implemented and integrated into the open-source EvoMaster tool. With white-box heuristics enabled (i.e., in white-box mode), experiments on 7 open-source GraphQL APIs and 3 search algorithms show a statistically significant improvement of the evolutionary approach over the baseline random search. In addition, experiments on 31 online GraphQL APIs reveal the ability of the black-box mode to detect real faults.
{"title":"Random Testing and Evolutionary Testing for Fuzzing GraphQL APIs","authors":"Asma Belhadi, Man Zhang, Andrea Arcuri","doi":"10.1145/3609427","DOIUrl":"https://doi.org/10.1145/3609427","url":null,"abstract":"The Graph Query Language (GraphQL) is a powerful language for APIs manipulation in web services. It has been recently introduced as an alternative solution for addressing the limitations of RESTful APIs. This paper introduces an automated solution for GraphQL APIs testing. We present a full framework for automated APIs testing, from the schema extraction to test case generation. In addition, we consider two kinds of testing: white-box and black-box testing. The white-box testing is performed when the source code of the GraphQL API is available. Our approach is based on evolutionary search. Test cases are evolved to intelligently explore the solution space while maximizing code coverage and fault-finding criteria. The black-box testing does not require access to the source code of the GraphQL API. It is therefore of more general applicability, albeit it has worse performance. In this context, we use a random search to generate GraphQL data. The proposed framework is implemented and integrated into the open-source EvoMaster tool. With enabled white-box heuristics, i.e., white-box mode, experiments on 7 open-source GraphQL APIs and 3 search algorithms show statistically significant improvement of the evolutionary approach compared to the baseline random search. In addition, experiments on 31 online GraphQL APIs reveal the ability of the black-box mode to detect real faults.","PeriodicalId":50940,"journal":{"name":"ACM Transactions on the Web","volume":" ","pages":""},"PeriodicalIF":3.5,"publicationDate":"2023-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47535933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}