
Proceedings of The Web Conference 2020: Latest Publications

TRAP: Two-level Regularized Autoencoder-based Embedding for Power-law Distributed Data
Pub Date : 2020-04-20 DOI: 10.1145/3366423.3380233
Dongmin Park, Hwanjun Song, Minseok Kim, Jae-Gil Lee
Recently, autoencoder (AE)-based embedding approaches have achieved state-of-the-art performance in many tasks, especially in top-k recommendation with user embedding or node classification with node embedding. However, we find that many real-world data follow a power-law distribution with respect to data object sparsity. When learning AE-based embeddings of these data, dense inputs move away from sparse inputs in the embedding space even when they are highly correlated. This phenomenon, which we call polarization, clearly distorts the embedding. In this paper, we propose TRAP, which leverages two-level regularizers to effectively alleviate the polarization problem. The macroscopic regularizer generally prevents dense input objects from being distant from other sparse input objects, and the microscopic regularizer individually attracts each object to correlated neighbor objects rather than uncorrelated ones. Importantly, TRAP is a meta-algorithm that can be easily coupled with existing AE-based embedding methods with a simple modification. In extensive experiments on two representative embedding tasks using six real-world datasets, TRAP boosted the performance of state-of-the-art algorithms by up to 31.53% and 94.99%, respectively.
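As a rough illustration of the two-level idea, the following sketch (our own simplification, not the authors' code; the density weighting and the neighbor sets are assumptions) adds a macroscopic and a microscopic penalty on top of an autoencoder's reconstruction loss:

```python
# Illustrative sketch of two-level regularization against polarization.
# The weighting scheme and function names are assumptions for exposition.

def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def macroscopic_penalty(embeddings, densities):
    """Pull dense objects toward the centroid of all embeddings so they
    do not drift away from the (mostly sparse) bulk of the data."""
    dim = len(next(iter(embeddings.values())))
    centroid = [sum(e[d] for e in embeddings.values()) / len(embeddings)
                for d in range(dim)]
    return sum(densities[k] * sq_dist(e, centroid)
               for k, e in embeddings.items())

def microscopic_penalty(embeddings, neighbors):
    """Pull each object toward its correlated neighbors only."""
    total = 0.0
    for k, nbrs in neighbors.items():
        for n in nbrs:
            total += sq_dist(embeddings[k], embeddings[n])
    return total

def regularized_loss(recon_loss, embeddings, densities, neighbors,
                     lam_macro=0.1, lam_micro=0.1):
    """Total loss = reconstruction + the two penalty levels."""
    return (recon_loss
            + lam_macro * macroscopic_penalty(embeddings, densities)
            + lam_micro * microscopic_penalty(embeddings, neighbors))
```

In a real AE these penalties would be differentiated alongside the reconstruction loss during training; here they are evaluated on fixed embeddings only to show the two levels.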
{"title":"TRAP: Two-level Regularized Autoencoder-based Embedding for Power-law Distributed Data","authors":"Dongmin Park, Hwanjun Song, Minseok Kim, Jae-Gil Lee","doi":"10.1145/3366423.3380233","DOIUrl":"https://doi.org/10.1145/3366423.3380233","url":null,"abstract":"Recently, autoencoder (AE)-based embedding approaches have achieved state-of-the-art performance in many tasks, especially in top-k recommendation with user embedding or node classification with node embedding. However, we find that many real-world data follow the power-law distribution with respect to the data object sparsity. When learning AE-based embeddings of these data, dense inputs move away from sparse inputs in an embedding space even when they are highly correlated. This phenomenon, which we call polarization, obviously distorts the embedding. In this paper, we propose TRAP that leverages two-level regularizers to effectively alleviate the polarization problem. The macroscopic regularizer generally prevents dense input objects from being distant from other sparse input objects, and the microscopic regularizer individually attracts each object to correlated neighbor objects rather than uncorrelated ones. Importantly, TRAP is a meta-algorithm that can be easily coupled with existing AE-based embedding methods with a simple modification. 
In extensive experiments on two representative embedding tasks using six-real world datasets, TRAP boosted the performance of the state-of-the-art algorithms by up to 31.53% and 94.99% respectively.","PeriodicalId":20754,"journal":{"name":"Proceedings of The Web Conference 2020","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2020-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80507857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
Beyond Rank-1: Discovering Rich Community Structure in Multi-Aspect Graphs
Pub Date : 2020-04-20 DOI: 10.1145/3366423.3380129
Ekta Gujral, Ravdeep Pasricha, E. Papalexakis
How are communities in real multi-aspect or multi-view graphs structured? How can we effectively and concisely summarize and explore those communities in a high-dimensional, multi-aspect graph without losing important information? State-of-the-art studies have focused on patterns in single graphs, identifying structures in a single snapshot of a large network or in time-evolving graphs and stitching them over time. However, to the best of our knowledge, there is no method that discovers and summarizes community structure from a multi-aspect graph by jointly leveraging information from all aspects. The state of the art in multi-aspect/tensor community extraction is limited to discovering clique structure in the extracted communities, or even worse, imposing clique structure where it does not exist. In this paper we bridge that gap by empowering tensor-based methods to extract rich community structure from multi-aspect graphs. In particular, we introduce cLL1, a novel constrained Block Term Tensor Decomposition that is generally capable of extracting higher-than-rank-1 but still interpretable structure from a multi-aspect dataset. Subsequently, we propose RichCom, a community structure extraction and summarization algorithm that leverages cLL1 to identify rich community structure (e.g., cliques, stars, chains) while exploiting higher-order correlations between the different aspects of the graph.
Our contributions are four-fold: (a) Novel algorithm: we develop cLL1, an efficient framework to extract rich and interpretable structure from general multi-aspect data; (b) Graph summarization and exploration: we provide RichCom, a summarization and encoding scheme to discover and explore structures of communities identified by cLL1; (c) Multi-aspect graph generator: we provide a simple and effective synthetic multi-aspect graph generator; and (d) Real-world utility: we present empirical results on small and large real datasets that demonstrate performance on par with or superior to the existing state of the art.
{"title":"Beyond Rank-1: Discovering Rich Community Structure in Multi-Aspect Graphs","authors":"Ekta Gujral, Ravdeep Pasricha, E. Papalexakis","doi":"10.1145/3366423.3380129","DOIUrl":"https://doi.org/10.1145/3366423.3380129","url":null,"abstract":"How are communities in real multi-aspect or multi-view graphs structured? How we can effectively and concisely summarize and explore those communities in a high-dimensional, multi-aspect graph without losing important information? State-of-the-art studies focused on patterns in single graphs, identifying structures in a single snapshot of a large network or in time evolving graphs and stitch them over time. However, to the best of our knowledge, there is no method that discovers and summarizes community structure from a multi-aspect graph, by jointly leveraging information from all aspects. State-of-the-art in multi-aspect/tensor community extraction is limited to discovering clique structure in the extracted communities, or even worse, imposing clique structure where it does not exist. In this paper we bridge that gap by empowering tensor-based methods to extract rich community structure from multi-aspect graphs. In particular, we introduce cLL1, a novel constrained Block Term Tensor Decomposition, that is generally capable of extracting higher than rank-1 but still interpretable structure from a multi-aspect dataset. Subsequently, we propose RichCom, a community structure extraction and summarization algorithm that leverages cLL1to identify rich community structure (e.g., cliques, stars, chains, etc) while leveraging higher-order correlations between the different aspects of the graph. 
Our contributions are four-fold: (a) Novel algorithm: we develop cLL1, an efficient framework to extract rich and interpretable structure from general multi-aspect data; (b) Graph summarization and exploration: we provide RichCom, a summarization and encoding scheme to discover and explore structures of communities identified by cLL1; (c) Multi-aspect graph generator: we provide a simple and effective synthetic multi-aspect graph generator, and (d) Real-world utility: we present empirical results on small and large real datasets that demonstrate performance on par or superior to existing state-of-the-art.","PeriodicalId":20754,"journal":{"name":"Proceedings of The Web Conference 2020","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2020-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87473033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
Characterizing Search-Engine Traffic to Internet Research Agency Web Properties
Pub Date : 2020-04-20 DOI: 10.1145/3366423.3380290
Alexander Spangher, G. Ranade, Besmira Nushi, Adam Fourney, E. Horvitz
The Russia-based Internet Research Agency (IRA) carried out a broad information campaign in the U.S. before and after the 2016 presidential election. The organization created an expansive set of internet properties: web domains, Facebook pages, and Twitter bots, which received traffic via purchased Facebook ads, tweets, and search engines indexing their domains. In this paper, we focus on IRA activities that received exposure through search engines, by joining data from Facebook and Twitter with logs from the Internet Explorer 11 and Edge browsers and the Bing.com search engine. We find that a substantial volume of Russian content was apolitical and emotionally neutral in nature. Our observations demonstrate that such content gave IRA web properties considerable exposure through search engines and brought readers to websites hosting inflammatory content and engagement hooks. Our findings show that, like social media, web search also directed traffic to IRA-generated web content, and the resultant traffic patterns are distinct from those of social media.
{"title":"Characterizing Search-Engine Traffic to Internet Research Agency Web Properties","authors":"Alexander Spangher, G. Ranade, Besmira Nushi, Adam Fourney, E. Horvitz","doi":"10.1145/3366423.3380290","DOIUrl":"https://doi.org/10.1145/3366423.3380290","url":null,"abstract":"The Russia-based Internet Research Agency (IRA) carried out a broad information campaign in the U.S. before and after the 2016 presidential election. The organization created an expansive set of internet properties: web domains, Facebook pages, and Twitter bots, which received traffic via purchased Facebook ads, tweets, and search engines indexing their domains. In this paper, we focus on IRA activities that received exposure through search engines, by joining data from Facebook and Twitter with logs from the Internet Explorer 11 and Edge browsers and the Bing.com search engine. We find that a substantial volume of Russian content was apolitical and emotionally-neutral in nature. Our observations demonstrate that such content gave IRA web-properties considerable exposure through search-engines and brought readers to websites hosting inflammatory content and engagement hooks. Our findings show that, like social media, web search also directed traffic to IRA generated web content, and the resultant traffic patterns are distinct from those of social media.","PeriodicalId":20754,"journal":{"name":"Proceedings of The Web Conference 2020","volume":"79 6","pages":""},"PeriodicalIF":0.0,"publicationDate":"2020-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91449412","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
PARS: Peers-aware Recommender System
Pub Date : 2020-04-20 DOI: 10.1145/3366423.3380013
Huiqiang Mao, Yanzhi Li, Chenliang Li, Di Chen, Xiaoqing Wang, Yuming Deng
The presence or absence of one item in a recommendation list will affect the demand for other items, because customers are often willing to switch to other items if their most preferred items are not available. This cross-item influence, called the “peers effect”, has been largely ignored in the literature. In this paper, we develop a peers-aware recommender system named PARS. We apply a ranking-based choice model to capture the cross-item influence and solve the resultant MaxMin problem with a decomposition algorithm. The MaxMin model solves for the recommendation decision while estimating users’ preferences toward the items, which yields high-quality recommendations robust to input data variation. Experimental results illustrate that PARS outperforms a few frequently used methods in practice. An online evaluation with a flash-sales scenario at Taobao also shows that PARS delivers significant improvements in terms of both conversion rates and user value.
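A ranking-based choice model of the kind the abstract mentions can be sketched as follows (a toy illustration under our own assumptions, not the PARS formulation): each customer type is a preference ranking over items, and a customer buys the highest-ranked item that is actually offered.

```python
# Toy ranking-based choice model; customer types and weights are invented.

def choice_demand(rankings, weights, offered):
    """Expected demand per item when `offered` is the recommendation list.

    rankings: list of item rankings (most preferred first), one per type
    weights:  probability of each customer type
    offered:  set of items shown to the customer
    """
    demand = {item: 0.0 for item in offered}
    for ranking, w in zip(rankings, weights):
        for item in ranking:          # walk down the preference list
            if item in offered:       # first available item wins the sale
                demand[item] += w
                break
    return demand
```

Removing an item shifts its demand to the next-preferred offered item, which is exactly the cross-item “peers effect” the paper models.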
{"title":"PARS: Peers-aware Recommender System","authors":"Huiqiang Mao, Yanzhi Li, Chenliang Li, Di Chen, Xiaoqing Wang, Yuming Deng","doi":"10.1145/3366423.3380013","DOIUrl":"https://doi.org/10.1145/3366423.3380013","url":null,"abstract":"The presence or absence of one item in a recommendation list will affect the demand for other items because customers are often willing to switch to other items if their most preferred items are not available. The cross-item influence, called “peers effect”, has been largely ignored in the literature. In this paper, we develop a peers-aware recommender system, named PARS. We apply a ranking-based choice model to capture the cross-item influence and solve the resultant MaxMin problem with a decomposition algorithm. The MaxMin model solves for the recommendation decision in the meanwhile of estimating users’ preferences towards the items, which yields high-quality recommendations robust to input data variation. Experimental results illustrate that PARS outperforms a few frequently used methods in practice. An online evaluation with a flash sales scenario at Taobao also shows that PARS delivers significant improvements in terms of both conversion rates and user value.","PeriodicalId":20754,"journal":{"name":"Proceedings of The Web Conference 2020","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2020-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87391534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Active Domain Transfer on Network Embedding
Pub Date : 2020-04-20 DOI: 10.1145/3366423.3380024
Lichen Jin, Yizhou Zhang, Guojie Song, Yilun Jin
Recent works show that end-to-end, (semi-)supervised network embedding models can generate satisfactory vectors to represent network topology, and are even applicable to unseen graphs through inductive learning. However, domain mismatch between the training and testing networks in inductive learning, as well as a lack of labeled data, often compromises the outcome of such methods. To make matters worse, while transfer learning and active learning techniques, which can address these problems respectively, have been well studied on regular i.i.d. data, relatively little attention has been paid to networks. Consequently, we propose in this paper a method for active transfer learning on networks named active-transfer network embedding, abbreviated ATNE. In ATNE we jointly consider the influence of each node on the network from the perspectives of transfer and active learning, and hence design novel and effective influence scores combining both aspects in the training process to facilitate node selection. We demonstrate that ATNE is efficient and decoupled from the actual model used. Further extensive experiments show that ATNE outperforms state-of-the-art active node selection methods and shows versatility in different situations.
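The abstract does not give the influence-score formula; purely as an illustration of combining the two perspectives, one could blend an active-learning signal (model uncertainty) with a transfer signal (domain shift), then select the top-scoring nodes for labeling. Both signals and the blending weight below are our assumptions.

```python
# Hypothetical combination of active and transfer signals for node selection.

def influence_score(uncertainty, domain_shift, alpha=0.5):
    """Score a node for labeling: high model uncertainty (active view) and
    high train/test domain mismatch (transfer view) both raise priority."""
    return alpha * uncertainty + (1 - alpha) * domain_shift

def select_nodes(scores, budget):
    """Pick the `budget` highest-influence nodes to query for labels."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:budget]
```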
{"title":"Active Domain Transfer on Network Embedding","authors":"Lichen Jin, Yizhou Zhang, Guojie Song, Yilun Jin","doi":"10.1145/3366423.3380024","DOIUrl":"https://doi.org/10.1145/3366423.3380024","url":null,"abstract":"Recent works show that end-to-end, (semi-) supervised network embedding models can generate satisfactory vectors to represent network topology, and are even applicable to unseen graphs by inductive learning. However, domain mismatch between training and testing network for inductive learning, as well as lack of labeled data often compromises the outcome of such methods. To make matters worse, while transfer learning and active learning techniques, being able to solve such problems correspondingly, have been well studied on regular i.i.d data, relatively few attention has been paid on networks. Consequently, we propose in this paper a method for active transfer learning on networks named active-transfer network embedding, abbreviated ATNE. In ATNE we jointly consider the influence of each node on the network from the perspectives of transfer and active learning, and hence design novel and effective influence scores combining both aspects in the training process to facilitate node selection. We demonstrate that ATNE is efficient and decoupled from the actual model used. 
Further extensive experiments show that ATNE outperforms state-of-the-art active node selection methods and shows versatility in different situations.","PeriodicalId":20754,"journal":{"name":"Proceedings of The Web Conference 2020","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2020-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89768377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Real-Time Clustering for Large Sparse Online Visitor Data
Pub Date : 2020-04-20 DOI: 10.1145/3366423.3380183
G. Chan, F. Du, Ryan A. Rossi, Anup B. Rao, Eunyee Koh, Cláudio T. Silva, J. Freire
Online visitor behaviors are often modeled as a large sparse matrix, where rows represent visitors and columns represent behaviors. To discover customer segments with different hierarchies, marketers often need to cluster the data at different splits. Such analyses require the clustering algorithm to provide real-time responses to user parameter changes, which current techniques cannot support. In this paper, we propose a real-time clustering algorithm, sparse density peaks, for large-scale sparse data. It pre-processes the input points to compute annotations and a hierarchy for cluster assignment. While the assignment is only a single scan of the points, a naive pre-processing step requires measuring all pairwise distances, which incurs quadratic computational overhead and is infeasible for any moderately sized data. Thus, we propose a new approach based on MinHash and LSH that provides fast and accurate estimations. We also describe an efficient implementation on Spark that addresses data skew and memory usage. Our experiments show that our approach (1) provides a better approximation than a straightforward MinHash and LSH implementation in terms of accuracy on real datasets, (2) achieves a 20× speedup in the end-to-end clustering pipeline, and (3) can maintain computations within a small memory budget. Finally, we present an interface to explore customer segments from millions of online visitor records in real time.
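The MinHash building block mentioned above can be sketched in a few lines (the hashing scheme and parameters are our assumptions, not the paper's Spark implementation): the fraction of matching signature slots estimates the Jaccard similarity between two sparse behavior sets, avoiding all-pairs exact distance computation.

```python
# Minimal MinHash sketch for estimating Jaccard similarity of sparse sets.
import random

def minhash_signature(items, num_hashes=128, seed=7):
    """One min-value per random hash h(x) = (a*x + b) mod p."""
    rng = random.Random(seed)
    p = 2_147_483_647  # a prime larger than the item universe
    params = [(rng.randrange(1, p), rng.randrange(p))
              for _ in range(num_hashes)]
    return [min((a * hash(x) + b) % p for x in items) for a, b in params]

def estimated_jaccard(sig1, sig2):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    matches = sum(1 for s1, s2 in zip(sig1, sig2) if s1 == s2)
    return matches / len(sig1)
```

Banding the signatures into LSH buckets, as the paper does, then limits candidate pairs to items that collide in at least one bucket.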
{"title":"Real-Time Clustering for Large Sparse Online Visitor Data","authors":"G. Chan, F. Du, Ryan A. Rossi, Anup B. Rao, Eunyee Koh, Cláudio T. Silva, J. Freire","doi":"10.1145/3366423.3380183","DOIUrl":"https://doi.org/10.1145/3366423.3380183","url":null,"abstract":"Online visitor behaviors are often modeled as a large sparse matrix, where rows represent visitors and columns represent behavior. To discover customer segments with different hierarchies, marketers often need to cluster the data in different splits. Such analyses require the clustering algorithm to provide real-time responses on user parameter changes, which the current techniques cannot support. In this paper, we propose a real-time clustering algorithm, sparse density peaks, for large-scale sparse data. It pre-processes the input points to compute annotations and a hierarchy for cluster assignment. While the assignment is only a single scan of the points, a naive pre-processing requires measuring all pairwise distances, which incur a quadratic computation overhead and is infeasible for any moderately sized data. Thus, we propose a new approach based on MinHash and LSH that provides fast and accurate estimations. We also describe an efficient implementation on Spark that addresses data skew and memory usage. Our experiments show that our approach (1) provides a better approximation compared to a straightforward MinHash and LSH implementation in terms of accuracy on real datasets, (2) achieves a 20 × speedup in the end-to-end clustering pipeline, and (3) can maintain computations with a small memory. 
Finally, we present an interface to explore customer segments from millions of online visitor records in real-time.","PeriodicalId":20754,"journal":{"name":"Proceedings of The Web Conference 2020","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2020-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75558719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Crowd Teaching with Imperfect Labels
Pub Date : 2020-04-20 DOI: 10.1145/3366423.3380099
Yao Zhou, A. R. Nelakurthi, Ross Maciejewski, Wei Fan, Jingrui He
The need for annotated labels to train machine learning models has led to a surge in crowdsourcing: collecting labels from non-experts. Instead of annotating from scratch, given an imperfect labeled set, how can we leverage the label information obtained from amateur crowd workers to improve data quality? Furthermore, is there a way to teach the amateur crowd workers using this imperfect labeled set in order to improve their labeling performance? In this paper, we aim to answer both questions via a novel interactive teaching framework, which uses visual explanations to simultaneously teach and gauge the confidence level of the crowd workers. Motivated by the huge demand for fine-grained label information in real-world applications, we start from the realistic yet challenging assumption that neither the teacher nor the crowd workers are perfect. Then, we propose an adaptive scheme that can improve both of them through a sequence of interactions: the teacher teaches the workers using labeled data, and in return, the workers provide labels and the associated confidence levels based on their own expertise. In particular, the teacher performs teaching using an empirical risk minimizer learned from an imperfect labeled set; the workers are assumed to exhibit forgetting behavior during learning, and their learning rate depends on the interpretation difficulty of the teaching item. Furthermore, depending on the workers' level of confidence when labeling, we also show that the empirical risk minimizer used by the teacher is a reliable and realistic substitute for the unknown target concept, by utilizing the unbiased surrogate loss. Finally, the performance of the proposed framework is demonstrated through experiments on multiple real-world image and text data sets.
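The abstract does not spell out its unbiased surrogate loss; the sketch below shows the standard construction for class-conditional label noise (labels in {-1, +1} flipped with known rates), offered only as an example of what such a surrogate can look like, not as the paper's exact formulation.

```python
# Standard unbiased surrogate loss under class-conditional label noise:
# its expectation over the noise equals the loss on the clean label.

def surrogate_loss(base_loss, pred, noisy_label, rho_pos, rho_neg):
    """Corrected loss for a noisy label in {-1, +1}.

    rho_pos: P(flip | clean label is +1)
    rho_neg: P(flip | clean label is -1)
    """
    if noisy_label == 1:
        num = (1 - rho_neg) * base_loss(pred, 1) - rho_pos * base_loss(pred, -1)
    else:
        num = (1 - rho_pos) * base_loss(pred, -1) - rho_neg * base_loss(pred, 1)
    return num / (1 - rho_pos - rho_neg)
```

Averaging this surrogate over noisy crowd labels lets the teacher minimize (in expectation) the same risk it would have minimized with clean labels.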
{"title":"Crowd Teaching with Imperfect Labels","authors":"Yao Zhou, A. R. Nelakurthi, Ross Maciejewski, Wei Fan, Jingrui He","doi":"10.1145/3366423.3380099","DOIUrl":"https://doi.org/10.1145/3366423.3380099","url":null,"abstract":"The need for annotated labels to train machine learning models led to a surge in crowdsourcing - collecting labels from non-experts. Instead of annotating from scratch, given an imperfect labeled set, how can we leverage the label information obtained from amateur crowd workers to improve the data quality? Furthermore, is there a way to teach the amateur crowd workers using this imperfect labeled set in order to improve their labeling performance? In this paper, we aim to answer both questions via a novel interactive teaching framework, which uses visual explanations to simultaneously teach and gauge the confidence level of the crowd workers. Motivated by the huge demand for fine-grained label information in real-world applications, we start from the realistic and yet challenging assumption that neither the teacher nor the crowd workers are perfect. Then, we propose an adaptive scheme that could improve both of them through a sequence of interactions: the teacher teaches the workers using labeled data, and in return, the workers provide labels and the associated confidence level based on their own expertise. In particular, the teacher performs teaching using an empirical risk minimizer learned from an imperfect labeled set; the workers are assumed to have a forgetting behavior during learning and their learning rate depends on the interpretation difficulty of the teaching item. Furthermore, depending on the level of confidence when the workers perform labeling, we also show that the empirical risk minimizer used by the teacher is a reliable and realistic substitute of the unknown target concept by utilizing the unbiased surrogate loss. 
Finally, the performance of the proposed framework is demonstrated through experiments on multiple real-world image and text data sets.","PeriodicalId":20754,"journal":{"name":"Proceedings of The Web Conference 2020","volume":"129 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2020-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73401030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
Directional and Explainable Serendipity Recommendation
Pub Date : 2020-04-20 DOI: 10.1145/3366423.3380100
Xueqi Li, Wenjun Jiang, Weiguang Chen, Jie Wu, Guojun Wang, Kenli Li
Serendipity recommendation has attracted more and more attention in recent years; it is committed to providing recommendations that not only cater to users’ demands but also broaden their horizons. However, existing approaches usually measure user-item relevance with a scalar instead of a vector, ignoring the direction of user preference, which increases the risk of unrelated recommendations. In addition, reasonable explanations increase users’ trust and acceptance, but no existing work provides explanations for serendipitous recommendations. To address these limitations, we propose a Directional and Explainable Serendipity Recommendation method named DESR. Specifically, we first extract users’ long-term preferences with an unsupervised method based on a GMM (Gaussian Mixture Model) and capture their short-term demands with a capsule network. Then, we propose the serendipity vector to combine long-term preferences with short-term demands, and use it to generate directionally serendipitous recommendations. Finally, a back-routing scheme is exploited to offer explanations. Extensive experiments on real-world datasets show that DESR can effectively improve serendipity and explainability, and give impetus to diversity, compared with existing serendipity-based methods.
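The "serendipity vector" idea can be sketched as follows (a hedged toy illustration; the blending weight, cosine ranking, and item vectors are our assumptions, not DESR's architecture): blend a long-term preference vector with a short-term demand vector, then rank candidate items by similarity to the blend.

```python
# Toy sketch: rank items against a blended long-term/short-term vector.
import math

def blend(long_term, short_term, beta=0.5):
    """Combine the two preference directions into one 'serendipity' vector."""
    return [beta * l + (1 - beta) * s for l, s in zip(long_term, short_term)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def recommend(items, long_term, short_term, k=2):
    """Return the k candidate items closest to the blended vector."""
    s_vec = blend(long_term, short_term)
    ranked = sorted(items, key=lambda name: cosine(items[name], s_vec),
                    reverse=True)
    return ranked[:k]
```

Because the blend is a direction rather than a scalar score, items aligned with it but outside the user's recent history can surface, which is the directional-serendipity intuition.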
Xueqi Li, Wenjun Jiang, Weiguang Chen, Jie Wu, Guojun Wang, Kenli Li. “Directional and Explainable Serendipity Recommendation.” Proceedings of The Web Conference 2020. Pub Date : 2020-04-20 DOI: 10.1145/3366423.3380100
Citations: 27
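A minimal numpy sketch of the serendipity-vector idea described above: combine a long-term preference representation with a short-term demand, then rank candidates by preference *direction* rather than a scalar relevance. Everything here is a toy stand-in, not the authors' implementation — a single history mean replaces the paper's GMM component means, the recent-interaction average replaces the capsule-network demand, and `alpha` is an invented knob.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Toy embeddings of items in the user's interaction history.
history = rng.normal(size=(200, dim))

# Long-term preference center: the paper fits a GMM over the history and
# keeps several component means; a single mean stands in for them here.
long_term = history.mean(axis=0)

# Short-term demand: derived with a capsule network in the paper; the
# mean of the most recent interactions stands in for it here.
short_term = history[-5:].mean(axis=0)

# Serendipity vector: start from the short-term demand and push away from
# the long-term center, giving a direction that is novel yet still
# anchored in the user's history. alpha is an illustrative knob.
alpha = 0.5
serendipity = short_term + alpha * (short_term - long_term)

# Rank candidates by cosine similarity with that direction, so preference
# *direction* (not just a scalar relevance) drives the recommendation.
candidates = rng.normal(size=(50, dim))
cos = candidates @ serendipity / (
    np.linalg.norm(candidates, axis=1) * np.linalg.norm(serendipity))
top5 = np.argsort(-cos)[:5]
print(top5)
```

Candidates whose embeddings align with the serendipity direction score highest, so recommendations drift away from the long-term comfort zone while staying tied to recent interests.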
Valve: Securing Function Workflows on Serverless Computing Platforms
Pub Date : 2020-04-20 DOI: 10.1145/3366423.3380173
P. Datta, P. Kumar, Tristan Morris, M. Grace, Amir Rahmati, Adam Bates
Serverless computing has quickly emerged as a dominant cloud computing paradigm, allowing developers to rapidly prototype event-driven applications using compositions of small functions that each perform a single logical task. However, many such application workflows are based in part on publicly available functions developed by third parties, creating the potential for functions to behave in unexpected, or even malicious, ways. At present, developers are not in total control of where and how their data flows, creating significant security and privacy risks in growth markets that have embraced serverless (e.g., IoT). As a practical means of addressing this problem, we present Valve, a serverless platform that enables developers to exert complete, fine-grained control over the information flows in their applications. Valve enables workflow developers to reason about function behaviors, and to specify restrictions, through auditing of network-layer information flows. By proxying network requests and propagating taint labels across network flows, Valve is able to restrict function behavior without code modification. We demonstrate that Valve is able to defend against known serverless attack behaviors, including container-reuse-based persistence and data exfiltration over cloud platform APIs, with less than 2.8% runtime overhead, 6.25% deployment overhead, and 2.35% teardown overhead.
Citations: 37
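The proxy-and-taint mechanism described above can be illustrated with a minimal flow-policy gate: every inter-function request carries a set of taint labels, and a per-workflow policy decides which sinks each label may reach. All names here (`POLICY`, `allow_flow`, `billing_fn`, …) are illustrative assumptions, not Valve's actual interface; the real system enforces this transparently at the network layer, without wrapping calls or modifying function code.

```python
# Toy flow-policy gate in the spirit of Valve's taint tracking.
POLICY = {
    "user_pii": {"billing_fn"},  # PII may flow only to the billing function
    "public": {"billing_fn", "analytics_fn", "external_api"},
}

def allow_flow(labels: set, destination: str) -> bool:
    """Permit the request only if every taint label admits the sink."""
    return all(destination in POLICY.get(label, set()) for label in labels)

def proxy_request(labels: set, destination: str) -> str:
    # The real system transparently proxies network traffic and propagates
    # labels across flows; this just gates a single hop.
    verdict = "FORWARDED" if allow_flow(labels, destination) else "BLOCKED"
    return f"{verdict}: {sorted(labels)} -> {destination}"

print(proxy_request({"public"}, "analytics_fn"))              # forwarded
print(proxy_request({"user_pii", "public"}, "external_api"))  # blocked: exfiltration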
Collective Multi-type Entity Alignment Between Knowledge Graphs
Pub Date : 2020-04-20 DOI: 10.1145/3366423.3380289
Qi Zhu, Hao Wei, Bunyamin Sisman, Da Zheng, C. Faloutsos, Xin Dong, Jiawei Han
A knowledge graph (e.g., Freebase, YAGO) is a multi-relational graph representing rich factual information among entities of various types. Entity alignment is the key step towards knowledge graph integration from multiple sources: it aims to identify entities across different knowledge graphs that refer to the same real-world entity. However, current entity alignment systems overlook the sparsity of different knowledge graphs and cannot align multi-type entities with a single model. In this paper, we present a Collective Graph neural network for Multi-type entity Alignment, called CG-MuAlign. Different from previous work, CG-MuAlign jointly aligns multiple types of entities, collectively leverages neighborhood information, and generalizes to unlabeled entity types. Specifically, we propose a novel collective aggregation function tailored for this task that (1) relieves the incompleteness of knowledge graphs via both cross-graph and self attentions and (2) scales up efficiently with a mini-batch training paradigm and an effective neighborhood sampling strategy. We conduct experiments on real-world knowledge graphs with millions of entities and observe performance superior to existing methods. In addition, the running time of our approach is much less than that of current state-of-the-art deep learning methods.
Citations: 38
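A toy numpy sketch of the collective-aggregation idea above: an entity's embedding combines its own features, an attention-weighted view of its neighbors, and a cross-graph attention term that up-weights neighbors with plausible counterparts in the other graph's neighborhood. The features are random stand-ins and the attention forms are simplified guesses at the paper's design, not its trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4

# Random stand-in features for one entity and its neighbors in each of
# two knowledge graphs (a trained model would supply these).
self_a, nbrs_a = rng.normal(size=d), rng.normal(size=(3, d))
self_b, nbrs_b = rng.normal(size=d), rng.normal(size=(5, d))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def collective_embed(self_x, nbrs_x, nbrs_other):
    # Self attention: weight the entity's own neighbors by relevance to it.
    a_self = softmax(nbrs_x @ self_x)
    h_self = a_self @ nbrs_x
    # Cross-graph attention: up-weight neighbors that match something in
    # the candidate counterpart's neighborhood -- the "collective" evidence.
    affinity = nbrs_x @ nbrs_other.T          # pairwise neighbor affinity
    a_cross = softmax(affinity.max(axis=1))
    h_cross = a_cross @ nbrs_x
    return np.concatenate([self_x, h_self, h_cross])

h_a = collective_embed(self_a, nbrs_a, nbrs_b)
h_b = collective_embed(self_b, nbrs_b, nbrs_a)

# Alignment score: cosine similarity of the two collective embeddings.
score = h_a @ h_b / (np.linalg.norm(h_a) * np.linalg.norm(h_b))
print(round(float(score), 3))
```

Because both entities attend to each other's neighborhoods, a pair with many matching neighbors is pulled together even when the entities' own features are sparse — the collective effect the paper exploits.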