
Proceedings of The Web Conference 2020: Latest Publications

LOVBench: Ontology Ranking Benchmark
Pub Date: 2020-04-20, DOI: 10.1145/3366423.3380245
Niklas Kolbe, P. Vandenbussche, S. Kubler, Yves Le Traon
Ontology search and ranking are key building blocks to establish and reuse shared conceptualizations of domain knowledge on the Web. However, the effectiveness of proposed ontology ranking models is difficult to compare, since they are often evaluated on diverse datasets that are limited by their static nature and scale. In this paper, we first introduce the LOVBench dataset as a benchmark for ontology term ranking. With inferred relevance judgments for more than 7,000 queries, LOVBench is large enough to support a comparison study of complex ontology ranking models using learning to rank (LTR). Instead of relying on relevance judgments from a few experts, we consider implicit feedback from many actual users collected from the Linked Open Vocabularies (LOV) platform. Our approach further enables continuous updates of the benchmark, capturing the evolution of ontologies’ relevance in an ever-changing data community. Second, we compare the performance of several feature configurations from the literature using LOVBench in LTR settings and discuss the results in the context of the observed real-world user behavior. Our experimental results show that feature configurations that (i) are well suited to the user behavior, (ii) cover all feature types, and (iii) consider decomposition of features can significantly improve ranking performance.
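The abstract summarizes an LTR comparison study without code; as a minimal sketch of the pairwise learning-to-rank setup it refers to, the snippet below fits a linear scorer with a RankNet-style logistic loss on preference pairs. All concrete names and sizes (`features`, `pairs`, the 8-dimensional feature space) are illustrative assumptions, not LOVBench's actual feature configurations.

```python
import numpy as np

# Hypothetical setup: each candidate ontology term for a query is a feature
# vector; `pairs` lists (preferred, non-preferred) index pairs inferred from
# implicit user feedback, in the spirit of LOVBench's relevance judgments.
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 8))          # 100 candidates, 8 features
pairs = [(rng.integers(100), rng.integers(100)) for _ in range(500)]

w = np.zeros(8)                                # linear scoring model
lr = 0.1
for _ in range(200):                           # pairwise logistic (RankNet-style) updates
    for i, j in pairs:
        diff = features[i] - features[j]
        p = 1.0 / (1.0 + np.exp(-w @ diff))    # P(candidate i ranked above candidate j)
        w += lr * (1.0 - p) * diff             # gradient ascent on the pair log-likelihood

scores = features @ w                          # rank all candidates by learned score
ranking = np.argsort(-scores)
```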
Citations: 4
Next Point-of-Interest Recommendation on Resource-Constrained Mobile Devices
Pub Date: 2020-04-20, DOI: 10.1145/3366423.3380170
Qinyong Wang, Hongzhi Yin, Tong Chen, Zi Huang, Hao Wang, Yanchang Zhao, Nguyen Quoc Viet Hung
In the modern tourism industry, next point-of-interest (POI) recommendation is an important mobile service, as it effectively helps hesitant travelers decide the next POI to visit. Currently, most next-POI recommender systems are built on a cloud-based paradigm, where the recommendation models are trained and deployed on powerful cloud servers. When a user makes a recommendation request from a mobile device, the current contextual information is uploaded to the cloud servers to help the well-trained models generate personalized recommendation results. In reality, however, this paradigm relies heavily on high-quality network connectivity, incurs a high energy footprint in operation, and raises increasing privacy concerns among the public. To bypass these defects, we propose a novel Light Location Recommender System (LLRec) that performs next-POI recommendation locally on resource-constrained mobile devices. To make LLRec fully compatible with limited computing resources and memory space, we leverage FastGRNN, a lightweight but effective gated Recurrent Neural Network (RNN), as its main building block, and significantly compress the model size by adopting the tensor-train composition in the embedding layer. As a compact model, LLRec maintains its robustness via an innovative teacher-student training framework: a powerful teacher model is trained on the cloud to learn essential knowledge from available contextual data, and the simplified student model LLRec is trained under the guidance of the teacher model. The final LLRec is downloaded and deployed on users’ mobile devices to generate accurate recommendations using only users’ local data. As a result, LLRec significantly reduces the dependency on cloud servers, allowing next-POI recommendation in a stable, cost-effective, and secure way. Extensive experiments on two large-scale recommendation datasets further demonstrate the superiority of our proposed solution.
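As a rough sketch of the teacher-student training the abstract describes (not the authors' implementation), the snippet below combines a hard next-POI cross-entropy term with a temperature-softened KL term that distills the cloud-trained teacher's predictions into the compact student; the temperature `T` and mixing weight `alpha` are assumed values.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)      # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hard cross-entropy on true next-POI labels plus soft KL toward the teacher."""
    p_student = softmax(student_logits)
    hard = -np.log(p_student[np.arange(len(labels)), labels]).mean()
    ps_T = softmax(student_logits, T)           # temperature-softened distributions
    pt_T = softmax(teacher_logits, T)
    soft = (pt_T * (np.log(pt_T) - np.log(ps_T))).sum(axis=-1).mean() * T * T
    return alpha * hard + (1.0 - alpha) * soft

# Toy usage: 2 requests, 5 candidate POIs (all values are made up).
rng = np.random.default_rng(1)
loss = distillation_loss(rng.normal(size=(2, 5)), rng.normal(size=(2, 5)),
                         labels=np.array([0, 3]))
```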
Citations: 70
The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing
Pub Date: 2020-04-20, DOI: 10.1145/3366423.3380104
David Zeber, Sarah Bird, Camila Oliveira, Walter Rudametkin, I. Segall, Fredrik Wollsén, M. Lopatka
Large-scale Web crawls have emerged as the state of the art for studying characteristics of the Web. In particular, they are a core tool for online tracking research. Web crawling is an attractive approach to data collection, as crawls can be run at relatively low infrastructure cost and do not require handling sensitive user data such as browsing histories. However, the biases introduced by using crawls as a proxy for human browsing data have not been well studied. Crawls may fail to capture the diversity of user environments, and the snapshot view of the Web presented by one-time crawls does not reflect its constantly evolving nature, which hinders the reproducibility of crawl-based studies. In this paper, we quantify the repeatability and representativeness of Web crawls in terms of common tracking and fingerprinting metrics, considering both variation across crawls and divergence from human browser usage. We quantify the baseline variation of simultaneous crawls, then isolate the effects of time, of cloud versus residential IP addresses, and of operating system. This provides a foundation for assessing the agreement between crawls visiting a standard list of high-traffic websites and actual browsing behaviour measured from an opt-in sample of over 50,000 users of the Firefox Web browser. Our analysis reveals differences between the treatment of stateless crawling infrastructure and generally stateful human browsing, showing, for example, that crawlers tend to experience higher rates of third-party activity than human browser users when loading pages from the same domains.
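One simple way to quantify the baseline variation between simultaneous crawls, sketched below, is the set overlap of the third-party domains each crawl observes; this is an illustrative metric under assumed data, not necessarily the paper's exact measure.

```python
def jaccard(a: set, b: set) -> float:
    """Overlap between the sets of third-party domains two crawls observed."""
    return len(a & b) / len(a | b)

# Hypothetical domain sets from two simultaneous crawls of the same site list.
crawl_a = {"tracker.example", "cdn.example", "ads.example"}
crawl_b = {"tracker.example", "cdn.example", "analytics.example"}
print(jaccard(crawl_a, crawl_b))  # 0.5: variation even between "twin" crawls
```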
Citations: 28
How Do We Create a Fantabulous Password?
Pub Date: 2020-04-20, DOI: 10.1145/3366423.3380222
Simon S. Woo
Although pronounceability can improve password memorability, most existing password generation approaches have not properly integrated the pronounceability of passwords into their designs. In this work, we demonstrate several shortfalls of current pronounceable password generation approaches and then propose ProSemPass, a new method of generating passwords that are pronounceable and semantically meaningful. In our approach, users supply initial input words, and our system improves the pronounceability and meaning of the user-provided words by automatically creating a portmanteau. To measure the strength of our approach, we use attacker models in which attackers have complete knowledge of our password generation algorithms. We measure strength in guess numbers and compare those with other existing password generation approaches. Using a large-scale IRB-approved user study with 1,563 Amazon MTurkers over 9 different conditions, our approach achieves a 30% higher recall than current pronounceable password approaches, and is stronger than the offline guessing attack limit.
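The abstract does not specify how the portmanteau is formed; the sketch below shows one naive blending rule (join two user-supplied words at their longest suffix/prefix overlap) purely for illustration, and should not be read as the ProSemPass algorithm.

```python
def blend(a: str, b: str, min_overlap: int = 2) -> str:
    """Join two words at their longest suffix/prefix overlap, a naive rule
    inspired by blends like 'fantabulous' (fantastic + fabulous); real
    blending, including ProSemPass's, is more flexible than this."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-k:] == b[:k]:
            return a + b[k:]
    return a + b  # no overlap found: fall back to simple concatenation

print(blend("smoke", "fog"))        # 'smokefog' (no overlap)
print(blend("motor", "orchestra"))  # 'motorchestra'
```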
Citations: 2
Scaling PageRank to 100 Billion Pages
Pub Date: 2020-04-20, DOI: 10.1145/3366423.3380035
S. Stergiou
Distributed graph processing frameworks formulate tasks as sequences of supersteps within which communication is performed asynchronously by sending messages over the graph edges. PageRank’s communication pattern is identical across all its supersteps since each vertex sends messages to all its edges. We exploit this pattern to develop a new communication paradigm that allows us to exchange messages that include only edge payloads, dramatically reducing bandwidth requirements. Experiments on a web graph of 38 billion vertices and 3.1 trillion edges yield execution times of 34.4 seconds per iteration, suggesting more than an order of magnitude improvement over the state-of-the-art.
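As a single-machine illustration of the superstep pattern the abstract exploits, the sketch below runs PageRank where every vertex sends its rank share along all of its out-edges in each superstep; the toy four-vertex graph and iteration count are assumptions, and the paper's distributed message-encoding optimization itself is not reproduced here.

```python
# Toy directed graph as adjacency lists (assumed example, not the 38B-vertex web graph).
graph = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n, d = len(graph), 0.85                  # damping factor, standard PageRank choice

rank = {v: 1.0 / n for v in graph}
for _ in range(20):                      # each loop body is one "superstep"
    incoming = {v: 0.0 for v in graph}
    for v, outs in graph.items():        # every vertex messages all its edges,
        share = rank[v] / len(outs)      # the fixed pattern the paper exploits
        for u in outs:
            incoming[u] += share
    rank = {v: (1 - d) / n + d * incoming[v] for v in graph}
```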
Citations: 4
Metric Learning with Equidistant and Equidistributed Triplet-based Loss for Product Image Search
Pub Date: 2020-04-20, DOI: 10.1145/3366423.3380094
Furong Xu, Wei Zhang, Yuan Cheng, Wei Chu
Product image search in E-commerce systems is a challenging task because of the huge number of product classes, low intra-class similarity, and high inter-class similarity. Deep metric learning, based on paired distances independent of the number of classes, aims to minimize intra-class variance and inter-class similarity in the feature embedding space. Most existing approaches strictly restrict the distance between samples with fixed values to distinguish different classes of samples. However, the distances of paired samples have various magnitudes during different training stages, so it is difficult to directly restrict absolute distances with fixed values. In this paper, we propose a novel Equidistant and Equidistributed Triplet-based (EET) loss function that adjusts the distance between samples with relative distance constraints. By optimizing the loss function, the algorithm progressively maximizes intra-class similarity and inter-class variance. Specifically, 1) the equidistant loss pulls the matched samples closer by adaptively constraining the two samples of the same class in each triplet to be equally distant from the sample of a different class; 2) the equidistributed loss pushes mismatched samples farther away by guiding different classes to be uniformly distributed while keeping the intra-class structure compact in the embedding space. Extensive experimental results on product search benchmarks verify the improved performance of our method. We also achieve improvements on other retrieval datasets, which shows the superior generalization capacity of our method in image search.
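A hedged sketch of the equidistant idea described above: alongside a standard triplet term, it penalizes the squared gap between the two same-class points' distances to the negative. The paper's exact formulation and its equidistributed term differ; the margin and embedding dimension here are assumed.

```python
import numpy as np

def eet_equidistant_loss(anchor, positive, negative, margin=0.2):
    """Illustrative take on the 'equidistant' idea: pull the matched pair
    together while encouraging both same-class points to sit at equal
    distance from the negative. Not the paper's exact formulation."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    d_pn = np.linalg.norm(positive - negative)
    triplet = max(0.0, d_ap - min(d_an, d_pn) + margin)  # standard triplet term
    equidistant = (d_an - d_pn) ** 2                     # relative-distance constraint
    return triplet + equidistant

# Toy usage with random 128-dimensional embeddings.
a, p, n = np.random.default_rng(2).normal(size=(3, 128))
print(eet_equidistant_loss(a, p, n))
```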
Citations: 13
Adversarial Bandits Policy for Crawling Commercial Web Content
Pub Date: 2020-04-20, DOI: 10.1145/3366423.3380125
Shuguang Han, Michael Bendersky, Przemek Gajda, Sergey Novikov, Marc Najork, Bernhard Brodowsky, Alexandrin Popescul
The rapid growth of commercial web content has driven the development of shopping search services that help users find product offers. Due to the dynamic nature of commercial content, an effective recrawl policy is a key component of a shopping search service; it ensures that users have access to up-to-date product details. Most existing strategies either rely on simple heuristics or overlook resource budgets. To address this, Azar et al. [5] recently proposed an optimization strategy, LambdaCrawl, which aims to maximize content freshness within a given resource budget. In this paper, we demonstrate that the effectiveness of LambdaCrawl is governed in large part by how well the future content change rate can be estimated. By adopting state-of-the-art deep learning models for change rate prediction, we obtain a substantial increase in content freshness over the common LambdaCrawl implementation with change rates estimated from past history. Moreover, we demonstrate that while LambdaCrawl is a significant advancement over existing recrawl strategies, it can be further improved by a unified multi-strategy recrawl policy. To this end, we adopt a K-armed adversarial bandits algorithm that provably optimizes overall freshness by combining multiple strategies. Empirical results on a large-scale production dataset confirm its superiority to LambdaCrawl, especially under tight resource budgets.
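As a sketch of the K-armed adversarial bandits idea, the snippet below implements a minimal EXP3 loop that mixes several recrawl strategies and shifts weight toward whichever yields fresher content; the strategy names and the freshness reward signal are hypothetical stand-ins, not the paper's production setup.

```python
import math
import random

def exp3(strategies, reward, rounds, gamma=0.1):
    """Minimal EXP3 sketch over recrawl strategies. `reward` maps a chosen
    strategy to an observed freshness payoff in [0, 1] (a hypothetical signal)."""
    k = len(strategies)
    weights = [1.0] * k
    for _ in range(rounds):
        total = sum(weights)
        probs = [(1 - gamma) * w / total + gamma / k for w in weights]
        i = random.choices(range(k), weights=probs)[0]      # sample a strategy
        r = reward(strategies[i])                           # observe freshness
        weights[i] *= math.exp(gamma * (r / probs[i]) / k)  # importance-weighted update
    return weights

# Toy usage with made-up per-strategy freshness rates.
rates = {"lambda_crawl": 0.7, "change_rate_only": 0.5, "uniform": 0.3}
w = exp3(list(rates), lambda s: float(random.random() < rates[s]), rounds=1000)
```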
Citations: 2
Deconstruct Densest Subgraphs
Pub Date: 2020-04-20, DOI: 10.1145/3366423.3380033
Lijun Chang, Miao Qiao
In this paper, we aim to understand the distribution of the densest subgraphs of a given graph under the density notion of average degree. We show that the structures, the relationships, and the distributions of all the densest subgraphs of a graph G can be encoded in O(L) space in an index called the ds-Index. Here L denotes the maximum output size of a densest subgraph of G. More importantly, ds-Index can report all the minimal densest subgraphs of G collectively in O(L) time and can enumerate all the densest subgraphs of G with an O(L) delay. Besides, the construction of ds-Index costs no more than finding a single densest subgraph using the state-of-the-art approach. Our empirical study shows that for web-scale graphs with one billion edges, the ds-Index can be constructed in several minutes on an ordinary commercial machine.
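The ds-Index itself is not shown in the abstract; as a small illustration of the average-degree density notion it builds on, the sketch below runs the classic greedy peeling heuristic (Charikar's 2-approximation), which finds one dense subgraph rather than enumerating all densest subgraphs as ds-Index does.

```python
from collections import defaultdict

def densest_subgraph_peel(edges):
    """Greedy peeling for average-degree density rho(S) = |E(S)| / |S|:
    repeatedly remove the minimum-degree vertex, keeping the best prefix.
    A 2-approximation, shown only to illustrate the density notion."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    nodes, m = set(adj), len(edges)
    best, best_density = set(nodes), m / len(nodes)
    while len(nodes) > 1:
        v = min(nodes, key=lambda x: len(adj[x]))   # peel the min-degree vertex
        m -= len(adj[v])
        for u in adj[v]:
            adj[u].discard(v)
        nodes.discard(v)
        del adj[v]
        density = m / len(nodes)
        if density > best_density:
            best, best_density = set(nodes), density
    return best, best_density

# K4 on {0,1,2,3} plus a pendant vertex 4: the densest subgraph is K4, rho = 1.5.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
print(densest_subgraph_peel(edges))
```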
Citations: 7
Dark Matter: Uncovering the DarkComet RAT Ecosystem
Pub Date: 2020-04-20, DOI: 10.1145/3366423.3380277
Brown Farinholt, Mohammad Rezaeirad, Damon McCoy, Kirill Levchenko
Remote Access Trojans (RATs) are a persistent class of malware that give an attacker direct, interactive access to a victim’s personal computer, allowing the attacker to steal private data, spy on the victim in real time using the camera and microphone, and verbally harass the victim through the speaker. To date, the users and victims of this pernicious form of malware have been challenging to observe in the wild due to the unobtrusive nature of infections. In this work, we report the results of a longitudinal study of the DarkComet RAT ecosystem. Using a known method for collecting victim log databases from DarkComet controllers, we present novel techniques for tracking RAT controllers across hostname changes and improve on established techniques for filtering spurious victim records caused by scanners and sandboxed malware executions. We downloaded 6,620 DarkComet databases from 1,029 unique controllers spanning over 5 years of operation. Our analysis shows that there have been at least 57,805 victims of DarkComet over this period, with 69 new victims infected every day; during this time, many victims’ keystrokes were captured, actions recorded, and webcams monitored. Our methodologies for more precisely identifying campaigns and victims could be useful for improving the efficiency and efficacy of victim cleanup efforts and the prioritization of law enforcement investigations.
Citations: 5
Extracting Knowledge from Web Text with Monte Carlo Tree Search
Pub Date: 2020-04-20, DOI: 10.1145/3366423.3380010
Guiliang Liu, Xu Li, Jiakang Wang, Mingming Sun, P. Li
Extracting knowledge from general web text requires building a domain-independent extractor that scales to the entire web corpus. This task is known as Open Information Extraction (OIE). This paper proposes applying Monte-Carlo Tree Search (MCTS) to accomplish OIE. To achieve this goal, we define a Markov Decision Process for OIE and build a simulator to learn the reward signals, which provides a complete reinforcement learning framework for MCTS. Using this framework, MCTS explores candidate words (and symbols) under the guidance of a pre-trained Sequence-to-Sequence (Seq2Seq) predictor and generates abundant exploration samples during training. We apply the exploration samples to update the reward simulator and the predictor, based on which we implement another MCTS to search for the optimal predictions during inference. Empirical evaluation demonstrates that MCTS inference substantially improves prediction accuracy (by more than 10%) and achieves leading performance over other state-of-the-art comparison models.
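The skeleton below sketches the MCTS control flow the abstract outlines: selection guided by a Seq2Seq prior, expansion over candidate tokens, and backpropagation of a simulated reward. The `predictor` and `simulator` functions are stubs (placeholders for the neural components described in the paper), and the toy vocabulary is an assumption.

```python
import math
import random

VOCAB = ["<arg1>", "<rel>", "<arg2>", "word", "<eos>"]   # toy action space

def predictor(sequence):
    """Stub for the pre-trained Seq2Seq prior over next tokens (uniform here)."""
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

def simulator(sequence):
    """Stub for the learned reward simulator scoring an extraction."""
    return random.random()

class Node:
    def __init__(self, seq):
        self.seq, self.children, self.visits, self.value = seq, {}, 0, 0.0

def mcts(root_seq, iterations=200, c_puct=1.0, max_len=8):
    root = Node(root_seq)
    for _ in range(iterations):
        node, path = root, [root]
        # Selection: descend by a PUCT-style score mixing value and the Seq2Seq prior.
        while node.children:
            prior = predictor(node.seq)
            node = max(node.children.values(),
                       key=lambda ch: ch.value / (ch.visits + 1)
                       + c_puct * prior[ch.seq[-1]]
                       * math.sqrt(node.visits + 1) / (ch.visits + 1))
            path.append(node)
        # Expansion: add children unless the sequence is terminal.
        if node.seq[-1:] != ["<eos>"] and len(node.seq) < max_len:
            node.children = {t: Node(node.seq + [t]) for t in VOCAB}
        # Simulation + backpropagation: score the leaf with the reward simulator.
        reward = simulator(node.seq)
        for n in path:
            n.visits += 1
            n.value += reward
    return max(root.children.values(), key=lambda ch: ch.visits).seq

print(mcts(["<bos>"]))
```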
Citations: 24