Yaowei Zheng, Richong Zhang, Suyuchen Wang, Samuel Mensah, Yongyi Mao
Supervised learning relies heavily on readily available labelled data to infer an effective classification function. However, methods developed under the supervised learning paradigm face a scarcity of labelled data within a domain and often do not generalize well to other tasks. Transfer learning addresses these issues by allowing knowledge to be shared across domains and tasks. In this paper, we propose two transfer learning methods, Anchored Model Transfer (AMT) and Soft Instance Transfer (SIT). Both are based on multi-task learning, account for model transfer and instance transfer respectively, and can be combined into a common framework. We demonstrate the effectiveness of AMT and SIT for aspect-level sentiment classification, showing competitive performance against baseline models on benchmark datasets. Interestingly, the combination of the two methods, AMT+SIT, achieves state-of-the-art performance on the same task.
{"title":"Anchored Model Transfer and Soft Instance Transfer for Cross-Task Cross-Domain Learning: A Study Through Aspect-Level Sentiment Classification","authors":"Yaowei Zheng, Richong Zhang, Suyuchen Wang, Samuel Mensah, Yongyi Mao","doi":"10.1145/3366423.3380034","DOIUrl":"https://doi.org/10.1145/3366423.3380034","url":null,"abstract":"Supervised learning relies heavily on readily available labelled data to infer an effective classification function. However, proposed methods under the supervised learning paradigm are faced with the scarcity of labelled data within domains, and are not generalized enough to adapt to other tasks. Transfer learning has proved to be a worthy choice to address these issues, by allowing knowledge to be shared across domains and tasks. In this paper, we propose two transfer learning methods Anchored Model Transfer (AMT) and Soft Instance Transfer (SIT), which are both based on multi-task learning, and account for model transfer and instance transfer, and can be combined into a common framework. We demonstrate the effectiveness of AMT and SIT for aspect-level sentiment classification showing the competitive performance against baseline models on benchmark datasets. Interestingly, we show that the integration of both methods AMT+SIT achieves state-of-the-art performance on the same task.","PeriodicalId":20754,"journal":{"name":"Proceedings of The Web Conference 2020","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2020-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79881668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adithya Kumar, Iyswarya Narayanan, T. Zhu, A. Sivasubramaniam
Small and medium-sized enterprises use the cloud to run online, user-facing, tail-latency-sensitive applications with well-defined, fixed monthly budgets. For these applications, adequate system capacity must be provisioned to extract maximal performance despite uncertainty in load and request sizes. In this paper, we address the problem of capacity provisioning under fixed budget constraints with the goal of minimizing tail latency. To tackle this problem, we propose building systems using a heterogeneous mix of expensive, low-latency resources and cheap resources that provide high throughput per dollar. As load changes through the day, we use more of the faster resources to reduce tail latency during low-load periods and more of the cheaper resources to handle high-load periods. To realize these tail-latency benefits, we introduce novel heterogeneity-aware scheduling and autoscaling algorithms designed to minimize tail latency. Using software prototypes and experiments on the public cloud, we show that our approach can outperform existing capacity provisioning systems, reducing tail latency by as much as 45% under fixed-budget settings.
{"title":"The Fast and The Frugal: Tail Latency Aware Provisioning for Coping with Load Variations","authors":"Adithya Kumar, Iyswarya Narayanan, T. Zhu, A. Sivasubramaniam","doi":"10.1145/3366423.3380117","DOIUrl":"https://doi.org/10.1145/3366423.3380117","url":null,"abstract":"Small and medium sized enterprises use the cloud for running online, user-facing, tail latency sensitive applications with well-defined fixed monthly budgets. For these applications, adequate system capacity must be provisioned to extract maximal performance despite the challenges of uncertainties in load and request-sizes. In this paper, we address the problem of capacity provisioning under fixed budget constraints with the goal of minimizing tail latency. To tackle this problem, we propose building systems using a heterogeneous mix of low latency expensive resources and cheap resources that provide high throughput per dollar. As load changes through the day, we use more faster resources to reduce tail latency during low load periods and more cheaper resources to handle the high load periods. To achieve these tail latency benefits, we introduce novel heterogeneity-aware scheduling and autoscaling algorithms that are designed for minimizing tail latency. Using software prototypes and by running experiments on the public cloud, we show that our approach can outperform existing capacity provisioning systems by reducing the tail latency by as much as 45% under fixed-budget settings.","PeriodicalId":20754,"journal":{"name":"Proceedings of The Web Conference 2020","volume":"25 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2020-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77997205","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuxuan Shi, Gong Cheng, E. Kharlamov
Keyword search is a prominent approach to querying Web data. For graph-structured data, a widely accepted semantics for keyword queries is based on group Steiner trees. For this NP-hard problem, existing algorithms with provable quality guarantees have prohibitive run times on large graphs. In this paper, we propose practical approximation algorithms with a guaranteed quality of computed answers and very low run time. Our algorithms rely on Hub Labeling (HL), a structure that labels each vertex in a graph with a list of reachable vertices and is used to compute distances and shortest paths. We devise two HLs: a conventional static HL that uses a new heuristic to improve pruned landmark labeling, and a novel dynamic HL that inverts and aggregates query-relevant static labels to process vertex sets more efficiently. Our approach computes reasonably good approximate answers to keyword queries in milliseconds on million-scale knowledge graphs.
Keyword Search over Knowledge Graphs via Static and Dynamic Hub Labelings. Proceedings of The Web Conference 2020. https://doi.org/10.1145/3366423.3380110
Huda Nassar, Caitlin Kennedy, Shweta Jain, Austin R. Benson, D. Gleich
In the simplest setting, graph visualization is the problem of producing two-dimensional coordinates for each node that meaningfully show the connections and latent structure in a graph. Among other uses, a meaningful layout often helps interpret the results of network-science tasks such as community detection and link prediction. Several existing graph visualization techniques are based on spectral methods, graph embeddings, or optimizing graph distances. Despite the large number of methods, it is still often challenging or extremely time consuming to produce meaningful layouts of graphs with hundreds of thousands of vertices. Existing methods often either fail to produce a visualization in a reasonable amount of time or produce a layout colorfully called a “hairball”, which does not illustrate any internal structure in the graph. Here, we show that adding higher-order information based on cliques to a classic eigenvector-based graph visualization technique enables it to produce meaningful plots of large graphs. We further evaluate these visualizations along a number of graph visualization metrics and find that our method outperforms existing techniques on a metric that uses random walks to measure local structure. Finally, we show many examples in which our algorithm successfully produces layouts of large networks. Code to reproduce our results is available.
{"title":"Using Cliques with Higher-order Spectral Embeddings Improves Graph Visualizations","authors":"Huda Nassar, Caitlin Kennedy, Shweta Jain, Austin R. Benson, D. Gleich","doi":"10.1145/3366423.3380059","DOIUrl":"https://doi.org/10.1145/3366423.3380059","url":null,"abstract":"In the simplest setting, graph visualization is the problem of producing a set of two-dimensional coordinates for each node that meaningfully shows connections and latent structure in a graph. Among other uses, having a meaningful layout is often useful to help interpret the results from network science tasks such as community detection and link prediction. There are several existing graph visualization techniques in the literature that are based on spectral methods, graph embeddings, or optimizing graph distances. Despite the large number of methods, it is still often challenging or extremely time consuming to produce meaningful layouts of graphs with hundreds of thousands of vertices. Existing methods often either fail to produce a visualization in a meaningful time window, or produce a layout colorfully called a “hairball”, which does not illustrate any internal structure in the graph. Here, we show that adding higher-order information based on cliques to a classic eigenvector based graph visualization technique enables it to produce meaningful plots of large graphs. We further evaluate these visualizations along a number of graph visualization metrics and we find that it outperforms existing techniques on a metric that uses random walks to measure the local structure. Finally, we show many examples of how our algorithm successfully produces layouts of large networks. Code to reproduce our results is available.","PeriodicalId":20754,"journal":{"name":"Proceedings of The Web Conference 2020","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2020-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90276559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hamed Rezaei, Balajee Vamanan
Datacenters host a mix of applications: foreground applications perform distributed lookups to service user queries, while background applications perform batch-processing tasks such as data reorganization, backup, and replication. While background flows produce most of the load, foreground applications produce the largest number of flows. Because packets from both types of applications compete for network bandwidth at switches, application performance is sensitive to the scheduling mechanism. Existing schedulers use flow size to distinguish critical flows from non-critical flows. However, recent studies of datacenter workloads reveal that most flows are small (e.g., most flows consist of only a handful of packets). In light of these findings, we make the key observation that, because most flows are small, flow size alone is not sufficient to distinguish critical flows from non-critical flows, and existing flow schedulers therefore do not achieve the desired prioritization. In this paper, we introduce ResQueue, which uses a combination of flow size and packet history to calculate the priority of each flow. Our evaluation shows that ResQueue improves tail flow completion times of short flows by up to 60% over state-of-the-art flow scheduling mechanisms.
ResQueue: A Smarter Datacenter Flow Scheduler. Proceedings of The Web Conference 2020. https://doi.org/10.1145/3366423.3380012
Y. Gil
Future AI systems will be key contributors to science, but this is unlikely to happen unless we reinvent our current publications and embed our scientific records in the Web as structured Web objects. This implies that the scientific papers of the future will be complemented with explicit, structured descriptions of the experiments, software, data, and workflows used to reach new findings. These scientific papers of the future will not only fulfill the promise of open science and reproducible research, but also enable the creation of AI systems that can ingest and organize scientific methods and processes, re-run experiments and re-analyze results, and explore their own hypotheses in systematic and unbiased ways. In this talk, I will describe guidelines for writing scientific papers of the future that embed the scientific record on the Web, and our progress on AI systems capable of using them to systematically explore experiments. I will also outline a research agenda with seven key characteristics for creating AI scientists that will exploit the Web to independently make new discoveries [1]. AI scientists have the potential to transform science and the processes of scientific discovery [2, 3].
Embedding the Scientific Record on the Web: Towards Automating Scientific Discoveries. Proceedings of The Web Conference 2020. https://doi.org/10.1145/3366423.3382667
Chun Lo, Emilie de Longueau, Ankan Saha, S. Chatterjee
Social networks act as major content marketplaces where creators and consumers come together to share and consume various kinds of content. Content ranking applications (e.g., newsfeed, moments, notifications) and edge recommendation products (e.g., connect to members, follow celebrities or groups or hashtags) on such platforms aim at improving the consumer experience. In this work, we focus on the creator experience and specifically on improving edge recommendations to better serve creators in such ecosystems. The audience and reach of creators – individuals, celebrities, publishers and companies – are critically shaped by these edge recommendation products. Hence, incorporating creator utility in such recommendations can have a material impact on their success, and in turn, on the marketplace. In this paper, we (i) propose a general framework to incorporate creator utility in edge recommendations, (ii) devise a specific method to estimate edge-level creator utilities for currently unformed edges, (iii) outline the challenges of measurement and propose a practical experiment design, and finally (iv) discuss the implementation of our proposal at scale on LinkedIn, a professional network with 645M+ members, and report our findings.
{"title":"Edge formation in Social Networks to Nurture Content Creators","authors":"Chun Lo, Emilie de Longueau, Ankan Saha, S. Chatterjee","doi":"10.1145/3366423.3380267","DOIUrl":"https://doi.org/10.1145/3366423.3380267","url":null,"abstract":"Social networks act as major content marketplaces where creators and consumers come together to share and consume various kinds of content. Content ranking applications (e.g., newsfeed, moments, notifications) and edge recommendation products (e.g., connect to members, follow celebrities or groups or hashtags) on such platforms aim at improving the consumer experience. In this work, we focus on the creator experience and specifically on improving edge recommendations to better serve creators in such ecosystems. The audience and reach of creators – individuals, celebrities, publishers and companies – are critically shaped by these edge recommendation products. Hence, incorporating creator utility in such recommendations can have a material impact on their success, and in turn, on the marketplace. In this paper, we (i) propose a general framework to incorporate creator utility in edge recommendations, (ii) devise a specific method to estimate edge-level creator utilities for currently unformed edges, (iii) outline the challenges of measurement and propose a practical experiment design, and finally (iv) discuss the implementation of our proposal at scale on LinkedIn, a professional network with 645M+ members, and report our findings.","PeriodicalId":20754,"journal":{"name":"Proceedings of The Web Conference 2020","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2020-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82244935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Christof Naumzik, Patrick Zoechbauer, S. Feuerriegel
Points-of-interest (POIs; i.e., restaurants, bars, landmarks, and other entities) are common in web-mined data and go a long way toward explaining the spatial distributions of urban phenomena. The conventional modeling approach relies on feature engineering, yet it ignores the spatial structure among POIs. To overcome this shortcoming, the present paper proposes a novel spatial model for explaining spatial distributions based on web-mined POIs. Our key contributions are: (1) We present a rigorous yet highly interpretable formalization for modeling the influence of POIs on a given outcome variable. Specifically, we accommodate the spatial distributions of both the outcome and the POIs; in our case, this is modeled by a sum of latent Gaussian processes. (2) In contrast to previous literature, our model infers the influence of POIs without feature engineering; instead, we model the influence of POIs via distance-weighted kernel functions with fully learnable parameterizations. (3) We propose a scalable learning algorithm based on a sparse variational approximation. For this purpose, we derive a tailored evidence lower bound (ELBO) and, for appropriate likelihoods, show that an analytical expression can be obtained, allowing fast and accurate computation of the ELBO. Finally, the value of our approach for web mining is demonstrated in two real-world case studies. Our findings provide substantial improvements over state-of-the-art baselines with regard to both predictive and, in particular, explanatory performance. Altogether, this yields a novel spatial model for leveraging web-mined POIs. Within the context of location-based social networks, it promises an extensive range of new insights and use cases.
{"title":"Mining Points-of-Interest for Explaining Urban Phenomena: A Scalable Variational Inference Approach","authors":"Christof Naumzik, Patrick Zoechbauer, S. Feuerriegel","doi":"10.1145/3366423.3380298","DOIUrl":"https://doi.org/10.1145/3366423.3380298","url":null,"abstract":"Points-of-interest (POIs; i.e., restaurants, bars, landmarks, and other entities) are common in web-mined data: they greatly explain the spatial distributions of urban phenomena. The conventional modeling approach relies upon feature engineering, yet it ignores the spatial structure among POIs. In order to overcome this shortcoming, the present paper proposes a novel spatial model for explaining spatial distributions based on web-mined POIs. Our key contributions are: (1) We present a rigorous yet highly interpretable formalization in order to model the influence of POIs on a given outcome variable. Specifically, we accommodate the spatial distributions of both the outcome and POIs. In our case, this modeled by the sum of latent Gaussian processes. (2) In contrast to previous literature, our model infers the influence of POIs without feature engineering, instead we model the influence of POIs via distance-weighted kernel functions with fully learnable parameterizations. (3) We propose a scalable learning algorithm based on sparse variational approximation. For this purpose, we derive a tailored evidence lower bound (ELBO) and, for appropriate likelihoods, we even show that an analytical expression can be obtained. This allows fast and accurate computation of the ELBO. Finally, the value of our approach for web mining is demonstrated in two real-world case studies. Our findings provide substantial improvements over state-of-the-art baselines with regard to both predictive and, in particular, explanatory performance. Altogether, this yields a novel spatial model for leveraging web-mined POIs. Within the context of location-based social networks, it promises an extensive range of new insights and use cases.","PeriodicalId":20754,"journal":{"name":"Proceedings of The Web Conference 2020","volume":"505 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2020-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75214721","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
David Carmel, Elad Haramaty, Arnon Lazerson, L. Lewin-Eytan
Learning a ranking model in product search involves satisfying many requirements, such as maximizing the relevance of retrieved products with respect to the user query as well as maximizing the purchase likelihood of these products. Multi-Objective Ranking Optimization (MORO) is the task of learning a ranking model from training examples while optimizing multiple objectives simultaneously. Label aggregation is a popular approach to multi-objective optimization: it reduces the problem to a single-objective optimization problem by aggregating the multiple labels of each training example, each related to a different objective, into a single label. In this work we explore several label aggregation methods for MORO in product search. We propose a novel stochastic label aggregation method that randomly selects a label per training example according to a given distribution over the labels. We provide a theoretical proof that stochastic label aggregation is superior to alternative aggregation approaches, in the sense that any optimal solution of the MORO problem can be generated by a proper parameter setting of the stochastic aggregation process. We experiment on three datasets: two from the voice product search domain and one publicly available dataset from the Web product search domain. We demonstrate empirically on these three datasets that MORO with stochastic label aggregation provides a family of ranking models that fully dominates the set of MORO models built using deterministic label aggregation.
{"title":"Multi-Objective Ranking Optimization for Product Search Using Stochastic Label Aggregation","authors":"David Carmel, Elad Haramaty, Arnon Lazerson, L. Lewin-Eytan","doi":"10.1145/3366423.3380122","DOIUrl":"https://doi.org/10.1145/3366423.3380122","url":null,"abstract":"Learning a ranking model in product search involves satisfying many requirements such as maximizing the relevance of retrieved products with respect to the user query, as well as maximizing the purchase likelihood of these products. Multi-Objective Ranking Optimization (MORO) is the task of learning a ranking model from training examples while optimizing multiple objectives simultaneously. Label aggregation is a popular solution approach for multi-objective optimization, which reduces the problem into a single objective optimization problem, by aggregating the multiple labels of the training examples, related to the different objectives, to a single label. In this work we explore several label aggregation methods for MORO in product search. We propose a novel stochastic label aggregation method which randomly selects a label per training example according to a given distribution over the labels. We provide a theoretical proof showing that stochastic label aggregation is superior to alternative aggregation approaches, in the sense that any optimal solution of the MORO problem can be generated by a proper parameter setting of the stochastic aggregation process. We experiment on three different datasets: two from the voice product search domain, and one publicly available dataset from the Web product search domain. We demonstrate empirically over these three datasets that MORO with stochastic label aggregation provides a family of ranking models that fully dominates the set of MORO models built using deterministic label aggregation.","PeriodicalId":20754,"journal":{"name":"Proceedings of The Web Conference 2020","volume":"39 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2020-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73630719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
P. Jenkins, Jennifer Zhao, Heath Vinicombe, Anant Subramanian, Arun Prasad, Atillia Dobi, E. Li, Yunsong Guo
Understanding content at scale is a difficult but important problem for many platforms. Many previous studies focus on content understanding to optimize engagement with existing users, but little work studies how to leverage better content understanding to attract new users. In this work, we build a framework for generating natural language content annotations and show how they can be used for search engine optimization. The proposed framework relies on an XGBoost model that labels “pins” with high-probability phrases and a logistic regression layer that learns to rank aggregated annotations for groups of content. The pipeline identifies keywords that are descriptive and contextually meaningful. We perform a large-scale production experiment deployed on the Pinterest platform and show that natural language annotations cause a statistically significant 1-2% increase in traffic from leading search engines. Finally, we explore and interpret the characteristics of our annotations framework.
{"title":"Natural Language Annotations for Search Engine Optimization","authors":"P. Jenkins, Jennifer Zhao, Heath Vinicombe, Anant Subramanian, Arun Prasad, Atillia Dobi, E. Li, Yunsong Guo","doi":"10.1145/3366423.3380049","DOIUrl":"https://doi.org/10.1145/3366423.3380049","url":null,"abstract":"Understanding content at scale is a difficult but important problem for many platforms. Many previous studies focus on content understanding to optimize engagement with existing users. However, little work studies how to leverage better content understanding to attract new users. In this work, we build a framework for generating natural language content annotations and show how they can be used for search engine optimization. The proposed framework relies on an XGBoost model that labels “pins” with high probability phrases, and a logistic regression layer that learns to rank aggregated annotations for groups of content. The pipeline identifies keywords that are descriptive and contextually meaningful. We perform a large-scale production experiment deployed on the Pinterest platform and show that natural language annotations cause a 1-2% increase in traffic from leading search engines. This increase is statistically significant. Finally, we explore and interpret the characteristics of our annotations framework.","PeriodicalId":20754,"journal":{"name":"Proceedings of The Web Conference 2020","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2020-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76820148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}