
Proceedings of The Web Conference 2020: Latest Publications

The Structure of Social Influence in Recommender Networks
Pub Date: 2020-04-20 | DOI: 10.1145/3366423.3380020
P. Analytis, D. Barkoczi, Philipp Lorenz-Spreen, Stefan M. Herzog
People’s ability to influence others’ opinion on matters of taste varies greatly—both offline and in recommender systems. What are the mechanisms underlying these striking differences? Using the weighted k-nearest neighbors algorithm (k-nn) to represent an array of social learning strategies, we show—leveraging methods from network science—how the k-nn algorithm gives rise to networks of social influence in six real-world domains of taste. We show three novel results that apply both to offline advice taking and online recommender settings. First, influential individuals have mainstream tastes and high dispersion in their taste similarity with others. Second, the fewer people an individual or algorithm consults (i.e., the lower k is) or the larger the weight placed on the opinions of more similar others, the smaller the group of people with substantial influence. Third, the influence networks emerging from deploying the k-nn algorithm are hierarchically organized. Our results shed new light on classic empirical findings in communication and network science and can help improve the understanding of social influence offline and online.
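The weighted k-nn mechanism described above can be sketched in a few lines: each person consults their k most taste-similar others, and an influence network emerges from who gets consulted. A minimal illustration (the toy ratings matrix and the Pearson-similarity choice are assumptions, not the paper's exact setup):

```python
import numpy as np

# Hypothetical ratings matrix: rows are people, columns are items in one domain of taste.
ratings = np.array([
    [5, 3, 4, 4, 2],
    [4, 3, 5, 4, 1],
    [1, 5, 2, 1, 5],
    [5, 2, 4, 5, 2],
    [2, 5, 1, 2, 4],
], dtype=float)

def taste_similarity(a, b):
    """Pearson correlation between two people's rating vectors."""
    return float(np.corrcoef(a, b)[0, 1])

def knn_influence_network(ratings, k=2):
    """Each person links to their k most taste-similar others.
    Returns an adjacency dict: advisee -> list of advisors."""
    n = len(ratings)
    network = {}
    for i in range(n):
        sims = [(taste_similarity(ratings[i], ratings[j]), j)
                for j in range(n) if j != i]
        sims.sort(reverse=True)
        network[i] = [j for _, j in sims[:k]]
    return network

def influence_counts(network):
    """In-degree of the influence network: how many people each person advises."""
    counts = {i: 0 for i in network}
    for advisors in network.values():
        for j in advisors:
            counts[j] += 1
    return counts

net = knn_influence_network(ratings, k=2)
print(influence_counts(net))
```

Lowering k or concentrating the weights on the most similar advisors shrinks the set of people with nonzero in-degree, which is the paper's second result in miniature.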
Citations: 10
LOVBench: Ontology Ranking Benchmark
Pub Date: 2020-04-20 | DOI: 10.1145/3366423.3380245
Niklas Kolbe, P. Vandenbussche, S. Kubler, Yves Le Traon
Ontology search and ranking are key building blocks to establish and reuse shared conceptualizations of domain knowledge on the Web. However, the effectiveness of proposed ontology ranking models is difficult to compare since these are often evaluated on diverse datasets that are limited by their static nature and scale. In this paper, we first introduce the LOVBench dataset as a benchmark for ontology term ranking. With inferred relevance judgments for more than 7000 queries, LOVBench is large enough to perform a comparison study using learning to rank (LTR) with complex ontology ranking models. Instead of relying on relevance judgments from a few experts, we consider implicit feedback from many actual users collected from the Linked Open Vocabularies (LOV) platform. Our approach further enables continuous updates of the benchmark, capturing the evolution of ontologies’ relevance in an ever-changing data community. Second, we compare the performance of several feature configurations from the literature using LOVBench in LTR settings and discuss the results in the context of the observed real-world user behavior. Our experimental results show that feature configurations which are (i) well-suited to the user behavior, (ii) cover all features types, and (iii) consider decomposition of features can significantly improve the ranking performance.
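Ranking models evaluated on a benchmark like this are compared with rank-sensitive metrics over graded relevance judgments. As a hedged illustration (the metric choice and the toy grades below are assumptions, not details from the paper), here is a minimal NDCG computation over implicitly inferred relevance grades:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for a ranked list of graded relevance."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    """DCG normalized by the best achievable ordering of the same grades."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# Hypothetical grades (inferred from user feedback) of the ontology terms a
# model returned for one query, in the order the model ranked them.
system_ranking = [3, 2, 0, 1, 0]
print(round(ndcg(system_ranking), 3))
```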
Citations: 4
Twitter User Location Inference Based on Representation Learning and Label Propagation
Pub Date: 2020-04-20 | DOI: 10.1145/3366423.3380019
Hechan Tian, Meng Zhang, Xiangyang Luo, Fenlin Liu, Yaqiong Qiao
Social network user location inference technology has been widely used in various geospatial applications like public health monitoring and local advertising recommendation. Due to insufficient consideration of relationships between users and location indicative words, most existing inference methods estimate label propagation probabilities solely based on statistical features, resulting in large location inference error. In this paper, a Twitter user location inference method based on representation learning and label propagation is proposed. Firstly, the heterogeneous connection relation graph is constructed based on relationships between Twitter users and relationships between users and location indicative words, and relationships unrelated to geographic attributes are filtered. Then, vector representations of users are learnt from the connection relation graph. Finally, label propagation probabilities between adjacent users are calculated based on vector representations, and the locations of unknown users are predicted through iterative label propagation. Experiments on two representative Twitter datasets, GeoText and TwUs, show that the proposed method can accurately calculate label propagation probabilities based on vector representations and improve the accuracy of location inference. Compared with existing typical Twitter user location inference methods, GCN and MLP-TXT+NET, the median error distance of the proposed method is reduced by 18% and 16%, respectively.
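The final step, iterative label propagation with similarity-derived probabilities, can be sketched as follows. Known users stay clamped to their true locations while labels diffuse to unknown users; the toy graph, weights (standing in for similarities computed from learned vector representations), and city labels are illustrative assumptions:

```python
import numpy as np

# Hypothetical similarity weights between five users, assumed derived from
# their learned vector representations (symmetric, zero diagonal).
W = np.array([
    [0.0, 0.9, 0.1, 0.0, 0.0],
    [0.9, 0.0, 0.2, 0.1, 0.0],
    [0.1, 0.2, 0.0, 0.8, 0.3],
    [0.0, 0.1, 0.8, 0.0, 0.7],
    [0.0, 0.0, 0.3, 0.7, 0.0],
])
cities = ["NYC", "LA"]            # toy label set
known = {0: "NYC", 4: "LA"}       # users with known locations

# Label propagation probabilities: row-normalized similarities.
P = W / W.sum(axis=1, keepdims=True)

labels = np.zeros((5, len(cities)))
for u, city in known.items():
    labels[u, cities.index(city)] = 1.0

for _ in range(50):                        # iterative propagation
    labels = P @ labels
    for u, city in known.items():          # clamp known users to the truth
        labels[u] = 0.0
        labels[u, cities.index(city)] = 1.0

predicted = [cities[int(np.argmax(row))] for row in labels]
print(predicted)
```

Users 1-3 inherit the location of whichever known user they are most strongly connected to, directly or through intermediaries.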
Citations: 13
Leveraging Passage-level Cumulative Gain for Document Ranking
Pub Date: 2020-04-20 | DOI: 10.1145/3366423.3380305
Zhijing Wu, Jiaxin Mao, Yiqun Liu, Jingtao Zhan, Yukun Zheng, Min Zhang, Shaoping Ma
Document ranking is one of the most studied but challenging problems in information retrieval (IR) research. A number of existing document ranking models capture relevance signals at the whole document level. Recently, more and more research has begun to address this problem from fine-grained document modeling. Several works leveraged fine-grained passage-level relevance signals in ranking models. However, most of these works focus on context-independent passage-level relevance signals and ignore the context information, which may lead to inaccurate estimation of passage-level relevance. In this paper, we investigate how information gain accumulates with passages when users sequentially read a document. We propose the context-aware Passage-level Cumulative Gain (PCG), which aggregates relevance scores of passages and avoids the need to formally split a document into independent passages. Next, we incorporate the patterns of PCG into a BERT-based sequential model called Passage-level Cumulative Gain Model (PCGM) to predict the PCG sequence. Finally, we apply PCGM to the document ranking task. Experimental results on two public ad hoc retrieval benchmark datasets show that PCGM outperforms most existing ranking models and also indicates the effectiveness of PCG signals. We believe that this work contributes to improving ranking performance and providing more explainability for document ranking.
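The core PCG idea, that document-level gain accumulates as a reader moves passage by passage, can be illustrated with a deliberately simple aggregator. The paper learns this aggregation with a BERT-based sequence model (PCGM); the plain running sum and toy relevance scores below are stand-in assumptions:

```python
def passage_cumulative_gain(passage_scores):
    """Cumulative gain after reading each passage in order: a simple
    non-decreasing running aggregate of passage-level relevance."""
    pcg, total = [], 0.0
    for score in passage_scores:
        total += score
        pcg.append(total)
    return pcg

scores = [0.1, 0.7, 0.0, 0.4]        # hypothetical passage-level relevance
pcg = passage_cumulative_gain(scores)
document_gain = pcg[-1]              # gain after the last passage: the
print(pcg)                           # document-level score used for ranking
```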
Citations: 32
Scaling PageRank to 100 Billion Pages
Pub Date: 2020-04-20 | DOI: 10.1145/3366423.3380035
S. Stergiou
Distributed graph processing frameworks formulate tasks as sequences of supersteps within which communication is performed asynchronously by sending messages over the graph edges. PageRank’s communication pattern is identical across all its supersteps since each vertex sends messages to all its edges. We exploit this pattern to develop a new communication paradigm that allows us to exchange messages that include only edge payloads, dramatically reducing bandwidth requirements. Experiments on a web graph of 38 billion vertices and 3.1 trillion edges yield execution times of 34.4 seconds per iteration, suggesting more than an order of magnitude improvement over the state-of-the-art.
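The fixed communication pattern the paper exploits is visible even in a toy, single-machine PageRank: every superstep, each vertex sends rank / out-degree along all of its edges, so the set of messages never changes between supersteps. A minimal sketch (assuming every vertex has at least one out-edge; the distributed machinery and the edge-payload optimization are not modeled here):

```python
def pagerank(edges, n, damping=0.85, supersteps=30):
    """Toy superstep-style PageRank over a directed edge list."""
    out_deg = [0] * n
    for src, _ in edges:
        out_deg[src] += 1
    ranks = [1.0 / n] * n
    for _ in range(supersteps):
        incoming = [0.0] * n
        for src, dst in edges:           # identical message pattern every superstep
            incoming[dst] += ranks[src] / out_deg[src]
        ranks = [(1 - damping) / n + damping * r for r in incoming]
    return ranks

edges = [(0, 1), (0, 2), (1, 2), (2, 0)]   # tiny 3-vertex web graph
ranks = pagerank(edges, n=3)
print(ranks)
```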
Citations: 4
Metric Learning with Equidistant and Equidistributed Triplet-based Loss for Product Image Search
Pub Date: 2020-04-20 | DOI: 10.1145/3366423.3380094
Furong Xu, Wei Zhang, Yuan Cheng, Wei Chu
Product image search in E-commerce systems is a challenging task, because of a huge number of product classes, low intra-class similarity and high inter-class similarity. Deep metric learning, based on paired distances independent of the number of classes, aims to minimize intra-class variances and inter-class similarity in feature embedding space. Most existing approaches strictly restrict the distance between samples with fixed values to distinguish different classes of samples. However, the distance of paired samples has various magnitudes during different training stages. Therefore, it is difficult to directly restrict absolute distances with fixed values. In this paper, we propose a novel Equidistant and Equidistributed Triplet-based (EET) loss function to adjust the distance between samples with relative distance constraints. By optimizing the loss function, the algorithm progressively maximizes intra-class similarity and inter-class variances. Specifically, 1) the equidistant loss pulls the matched samples closer by adaptively constraining two samples of the same class to be equally distant from another one of a different class in each triplet, 2) the equidistributed loss pushes the mismatched samples farther away by guiding different classes to be uniformly distributed while keeping intra-class structure compact in embedding space. Extensive experimental results on product search benchmarks verify the improved performance of our method. We also achieve improvements on other retrieval datasets, which show superior generalization capacity of our method in image search.
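A hedged sketch of the two terms as described: a standard triplet margin term, plus an equidistance penalty that asks the anchor and positive (same class) to sit equally far from the negative. The exact formulation, margin, and weighting in the paper may differ, and the equidistributed term over whole classes is omitted here:

```python
import numpy as np

def eet_loss(anchor, positive, negative, margin=0.2, alpha=0.1):
    """Sketch of a triplet loss with an equidistance penalty.
    margin and alpha are illustrative values, not the paper's."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    d_pn = np.linalg.norm(positive - negative)
    # relative-distance constraint: matched pair closer than the negative, by a margin
    triplet = max(0.0, d_ap - min(d_an, d_pn) + margin)
    # equidistant term: anchor and positive equally distant from the negative
    equidistant = alpha * (d_an - d_pn) ** 2
    return triplet + equidistant

anchor   = np.array([0.0, 0.0])
positive = np.array([0.1, 0.0])   # same class as the anchor
far_neg  = np.array([5.0, 0.0])   # well-separated negative
near_neg = np.array([0.2, 0.0])   # negative inside the margin
print(eet_loss(anchor, positive, far_neg))
print(eet_loss(anchor, positive, near_neg))
```

Because only relative distances enter the margin term, the loss adapts across training stages where absolute pairwise distances vary in magnitude, which is the motivation stated in the abstract.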
Citations: 13
Adversarial Bandits Policy for Crawling Commercial Web Content
Pub Date: 2020-04-20 | DOI: 10.1145/3366423.3380125
Shuguang Han, Michael Bendersky, Przemek Gajda, Sergey Novikov, Marc Najork, Bernhard Brodowsky, Alexandrin Popescul
The rapid growth of commercial web content has driven the development of shopping search services to help users find product offers. Due to the dynamic nature of commercial content, an effective recrawl policy is a key component in a shopping search service; it ensures that users have access to up-to-date product details. Most existing strategies either relied on simple heuristics or overlooked resource budgets. To address this, Azar et al. [5] recently proposed an optimization strategy LambdaCrawl aiming to maximize content freshness within a given resource budget. In this paper, we demonstrate that the effectiveness of LambdaCrawl is governed in large part by how well future content change rate can be estimated. By adopting the state-of-the-art deep learning models for change rate prediction, we obtain a substantial increase of content freshness over the common LambdaCrawl implementation with change rate estimated from the past history. Moreover, we demonstrate that while LambdaCrawl is a significant advancement upon existing recrawl strategies, it can be further improved upon by a unified multi-strategy recrawl policy. To this end, we adopt the K-armed adversarial bandits algorithm that can provably optimize the overall freshness by combining multiple strategies. Empirical results over a large-scale production dataset confirm its superiority to LambdaCrawl, especially under tight resource budgets.
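The multi-strategy combination can be sketched with Exp3, the canonical K-armed adversarial-bandits algorithm; whether the paper uses Exp3 specifically is not stated here, and the fixed freshness rewards below are toy assumptions:

```python
import math
import random

def exp3(num_arms, reward_fn, rounds=2000, gamma=0.1, seed=0):
    """Minimal Exp3 sketch: pick an arm (recrawl strategy) each round,
    observe a reward in [0, 1], and exponentially reweight."""
    rng = random.Random(seed)
    weights = [1.0] * num_arms
    for _ in range(rounds):
        total = sum(weights)
        probs = [(1 - gamma) * w / total + gamma / num_arms for w in weights]
        arm = rng.choices(range(num_arms), weights=probs)[0]
        reward = reward_fn(arm)
        # importance-weighted update, so rarely played arms are not penalized
        weights[arm] *= math.exp(gamma * reward / (probs[arm] * num_arms))
    return weights

# Hypothetical recrawl strategies with fixed average freshness rewards.
mean_freshness = [0.2, 0.8, 0.4]
weights = exp3(num_arms=3, reward_fn=lambda arm: mean_freshness[arm])
print(weights.index(max(weights)))   # index of the dominant strategy
```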
Citations: 2
Deconstruct Densest Subgraphs
Pub Date: 2020-04-20 | DOI: 10.1145/3366423.3380033
Lijun Chang, Miao Qiao
In this paper, we aim to understand the distribution of the densest subgraphs of a given graph under the density notion of average-degree. We show that the structures, the relationships and the distributions of all the densest subgraphs of a graph G can be encoded in O(L) space in an index called the ds-Index. Here L denotes the maximum output size of a densest subgraph of G. More importantly, ds-Index can report all the minimal densest subgraphs of G collectively in O(L) time and can enumerate all the densest subgraphs of G with an O(L) delay. Besides, the construction of ds-Index costs no more than finding a single densest subgraph using the state-of-the-art approach. Our empirical study shows that for web-scale graphs with one billion edges, the ds-Index can be constructed in several minutes on an ordinary commercial machine.
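The density notion used throughout, average degree |E(S)| / |S| for a vertex subset S, is easy to compute directly. The toy graph below (an illustrative assumption) also shows that several distinct subgraphs can tie for the maximum density, which is exactly the multiplicity an index over all densest subgraphs has to encode:

```python
def density(edges, subset):
    """Average-degree density of a vertex subset: |E(S)| / |S|."""
    s = set(subset)
    internal = sum(1 for u, v in edges if u in s and v in s)
    return internal / len(s)

# A triangle {0, 1, 2} plus a pendant edge (2, 3).
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
print(density(edges, [0, 1, 2]))      # triangle: 3 edges / 3 vertices = 1.0
print(density(edges, [0, 1, 2, 3]))   # whole graph: 4 / 4 = 1.0 -- a density tie
```

Here the triangle is the minimal densest subgraph, while the whole graph is another densest subgraph containing it.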
Citations: 7
Dark Matter: Uncovering the DarkComet RAT Ecosystem
Pub Date: 2020-04-20 | DOI: 10.1145/3366423.3380277
Brown Farinholt, Mohammad Rezaeirad, Damon McCoy, Kirill Levchenko
Remote Access Trojans (RATs) are a persistent class of malware that give an attacker direct, interactive access to a victim’s personal computer, allowing the attacker to steal private data, spy on the victim in real-time using the camera and microphone, and verbally harass the victim through the speaker. To date, the users and victims of this pernicious form of malware have been challenging to observe in the wild due to the unobtrusive nature of infections. In this work, we report the results of a longitudinal study of the DarkComet RAT ecosystem. Using a known method for collecting victim log databases from DarkComet controllers, we present novel techniques for tracking RAT controllers across hostname changes and improve on established techniques for filtering spurious victim records caused by scanners and sandboxed malware executions. We downloaded 6,620 DarkComet databases from 1,029 unique controllers spanning over 5 years of operation. Our analysis shows that there have been at least 57,805 victims of DarkComet over this period, with 69 new victims infected every day; many of whose keystrokes have been captured, actions recorded, and webcams monitored during this time. Our methodologies for more precisely identifying campaigns and victims could potentially be useful for improving the efficiency and efficacy of victim cleanup efforts and prioritization of law enforcement investigations.
{"title":"Dark Matter: Uncovering the DarkComet RAT Ecosystem","authors":"Brown Farinholt, Mohammad Rezaeirad, Damon McCoy, Kirill Levchenko","doi":"10.1145/3366423.3380277","DOIUrl":"https://doi.org/10.1145/3366423.3380277","url":null,"abstract":"Remote Access Trojans (RATs) are a persistent class of malware that give an attacker direct, interactive access to a victim’s personal computer, allowing the attacker to steal private data, spy on the victim in real-time using the camera and microphone, and verbally harass the victim through the speaker. To date, the users and victims of this pernicious form of malware have been challenging to observe in the wild due to the unobtrusive nature of infections. In this work, we report the results of a longitudinal study of the DarkComet RAT ecosystem. Using a known method for collecting victim log databases from DarkComet controllers, we present novel techniques for tracking RAT controllers across hostname changes and improve on established techniques for filtering spurious victim records caused by scanners and sandboxed malware executions. We downloaded 6,620 DarkComet databases from 1,029 unique controllers spanning over 5 years of operation. Our analysis shows that there have been at least 57,805 victims of DarkComet over this period, with 69 new victims infected every day; many of whose keystrokes have been captured, actions recorded, and webcams monitored during this time. Our methodologies for more precisely identifying campaigns and victims could potentially be useful for improving the efficiency and efficacy of victim cleanup efforts and prioritization of law enforcement investigations.","PeriodicalId":20754,"journal":{"name":"Proceedings of The Web Conference 2020","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2020-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73131862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
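The abstract above mentions filtering spurious victim records produced by Internet scanners and sandboxed malware executions, but does not spell out the rules. The sketch below is a hypothetical heuristic filter: the hostname patterns and the cross-controller threshold are illustrative assumptions, not the paper's actual method.

```python
import re

# Hypothetical sandbox naming patterns -- illustrative assumptions only;
# the paper's real filtering criteria are not given in the abstract.
SANDBOX_HOSTNAME = re.compile(r"(sandbox|cuckoo|malware|virus|analysis)", re.I)

def filter_victims(records, controller_counts, max_controllers=3):
    """Drop victim records that look like sandboxes or scanners.

    records: iterable of dicts with 'hostname' and 'machine_id' keys.
    controller_counts: machine_id -> number of distinct controllers the
    machine appeared under; a machine seen across many unrelated
    controllers is more likely an analysis environment than a victim.
    """
    kept = []
    for rec in records:
        if SANDBOX_HOSTNAME.search(rec["hostname"]):
            continue  # hostname matches a sandbox-style naming pattern
        if controller_counts.get(rec["machine_id"], 0) > max_controllers:
            continue  # same machine contacted too many controllers
        kept.append(rec)
    return kept
```

A real pipeline would combine several such signals (timing, IP churn, executed commands); this shows only the record-level filtering shape.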
Extracting Knowledge from Web Text with Monte Carlo Tree Search
Pub Date : 2020-04-20 DOI: 10.1145/3366423.3380010
Guiliang Liu, Xu Li, Jiakang Wang, Mingming Sun, P. Li
Extracting knowledge from general web text requires building a domain-independent extractor that scales to the entire web corpus. This task is known as Open Information Extraction (OIE). This paper proposes to apply Monte-Carlo Tree Search (MCTS) to accomplish OIE. To achieve this goal, we define a Markov Decision Process for OIE and build a simulator to learn the reward signals, which provides a complete reinforcement learning framework for MCTS. Using this framework, MCTS explores candidate words (and symbols) under the guidance of a pre-trained Sequence-to-Sequence (Seq2Seq) predictor and generates abundant exploration samples during training. We apply the exploration samples to update the reward simulator and the predictor, based on which we implement another MCTS to search for the optimal predictions during inference. Empirical evaluation demonstrates that the MCTS inference substantially improves the accuracy of prediction (more than 10%) and achieves a leading performance over other state-of-the-art comparison models.
{"title":"Extracting Knowledge from Web Text with Monte Carlo Tree Search","authors":"Guiliang Liu, Xu Li, Jiakang Wang, Mingming Sun, P. Li","doi":"10.1145/3366423.3380010","DOIUrl":"https://doi.org/10.1145/3366423.3380010","url":null,"abstract":"To extract knowledge from general web text, it requires to build a domain-independent extractor that scales to the entire web corpus. This task is known as Open Information Extraction (OIE). This paper proposes to apply Monte-Carlo Tree Search (MCTS) to accomplish OIE. To achieve this goal, we define a Markov Decision Process for OIE and build a simulator to learn the reward signals, which provides a complete reinforcement learning framework for MCTS. Using this framework, MCTS explores candidate words (and symbols) under the guidance of a pre-trained Sequence-to-Sequence (Seq2Seq) predictor and generates abundant exploration samples during training. We apply the exploration samples to update the reward simulator and the predictor, based on which we implement another MCTS to search the optimal predictions during inference. Empirical evaluation demonstrates that the MCTS inference substantially improves the accuracy of prediction (more than 10%) and achieves a leading performance over other state-of-the-art comparison models.","PeriodicalId":20754,"journal":{"name":"Proceedings of The Web Conference 2020","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2020-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74328538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 24
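The MCTS described above is guided by a pretrained Seq2Seq predictor and a learned reward simulator, neither of which can be reconstructed from the abstract alone. The sketch below shows only the generic UCT search skeleton such a system plugs into; `actions_fn`, `step_fn`, and `reward_fn` are hypothetical stand-ins for the predictor and the reward simulator.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}          # action -> Node
        self.visits, self.value = 0, 0.0

def uct_search(root_state, actions_fn, step_fn, reward_fn, n_iter=200, c=1.4):
    """Generic UCT Monte-Carlo Tree Search (not the paper's exact system).

    actions_fn(state) -> legal actions; step_fn(state, a) -> next state;
    reward_fn(state) -> scalar reward (stand-in for a learned simulator).
    Returns the root action with the most visits.
    """
    root = Node(root_state)
    for _ in range(n_iter):
        node = root
        # 1. selection: descend while the node is fully expanded
        while node.children and len(node.children) == len(actions_fn(node.state)):
            node = max(node.children.values(),
                       key=lambda ch: ch.value / (ch.visits + 1e-9)
                       + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)))
        # 2. expansion: try one untried action, if any remain
        untried = [a for a in actions_fn(node.state) if a not in node.children]
        if untried:
            a = random.choice(untried)
            node.children[a] = Node(step_fn(node.state, a), parent=node)
            node = node.children[a]
        # 3. evaluation and 4. backpropagation
        r = reward_fn(node.state)
        while node:
            node.visits += 1
            node.value += r
            node = node.parent
    return max(root.children, key=lambda a: root.children[a].visits)
```

In the paper's setting, the Seq2Seq predictor would bias which candidate words are expanded and the reward simulator would replace `reward_fn`; the skeleton above keeps only the select-expand-evaluate-backpropagate loop.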