The World Wide Web Conference: Latest Papers (Page 2)

Quality-Sensitive Training! Social Advertisement Generation by Leveraging User Click Behavior

The World Wide Web Conference

Pub Date : 2019-05-13 DOI: 10.1145/3308558.3313536

Yongzhen Wang, Heng Huang, Yuliang Yan, Xiaozhong Liu

Social advertisement has emerged as a viable means of improving purchase sharing in e-commerce. However, manually writing large numbers of advertising scripts can be prohibitively costly for both e-platforms and online sellers, and developing an automatic generator requires substantial gold-standard training samples. In this paper, we put forward a novel seq2seq model that generates social advertisements automatically, with a quality-sensitive loss function based on user click behavior that differentiates training samples of varied quality. Our motivation is to use clickthrough data as a quality indicator that quantitatively measures the textual fitness of each training sample, so that only ground truths that satisfy social media users are considered eligible to optimize social advertisement generation. Specifically, in the qualified case the ground truth should supervise the whole training phase as much as possible, whereas in the opposite case the generated result should preserve the semantics of the original input to the greatest extent. Simulation experiments on a large-scale dataset demonstrate that our approach significantly outperforms two existing distant-supervision methods and three state-of-the-art NLG solutions.
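The abstract describes two regimes: supervise with the ground truth when click behavior marks it as good, and otherwise keep the output close to the input. A minimal sketch of how such a blend might look is given below. The sigmoid gate, the `threshold` and `sharpness` values, and all function names are illustrative assumptions; the abstract does not state the actual loss.

```python
import math


def quality_weight(ctr, threshold=0.05, sharpness=100.0):
    """Hypothetical soft gate in (0, 1): close to 1 when the click-through
    rate is well above the threshold, close to 0 when well below it."""
    return 1.0 / (1.0 + math.exp(-sharpness * (ctr - threshold)))


def quality_sensitive_loss(supervised_loss, preservation_loss, ctr):
    """Blend two already-computed losses for one training sample.

    supervised_loss:   e.g. cross-entropy against the ground-truth ad text
    preservation_loss: e.g. a penalty for drifting from the input semantics
    ctr:               click-through rate observed for the ground-truth ad
    """
    w = quality_weight(ctr)
    return w * supervised_loss + (1.0 - w) * preservation_loss
```

With these made-up settings, a sample whose ground truth has a high click-through rate is trained almost entirely on the supervised term, while a rarely clicked one falls back to the preservation term.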

Citations: 6

Efficient (α, β)-core Computation: an Index-based Approach

The World Wide Web Conference

Pub Date : 2019-05-13 DOI: 10.1145/3308558.3313522

Boge Liu, Long Yuan, Xuemin Lin, Lu Qin, W. Zhang, Jingren Zhou

Computing the (α, β)-core of a bipartite graph for given α and β is a fundamental problem in bipartite graph analysis, with applications such as online group recommendation and fraudster detection. The existing solution must traverse the entire bipartite graph once for each query. Because real bipartite graphs can be very large and (α, β)-core queries can be issued frequently in real applications, this is too expensive. In this paper, we present an efficient algorithm based on a novel index that runs in time linear in the result size (and is therefore optimal, since outputting the result already takes at least linear time). We prove that the index requires only O(m) space, where m is the number of edges in the bipartite graph. Moreover, we devise an efficient index-construction algorithm with time complexity O(δ·m), where δ is bounded by √m and is much smaller than √m in practice. We also discuss efficient algorithms for maintaining the index when the bipartite graph is dynamically updated, as well as a parallel implementation of the index-construction algorithm. Experimental results on real and synthetic graphs (more than 1 billion edges) demonstrate that, compared with existing techniques, our algorithms achieve up to 5 orders of magnitude speedup for computing the (α, β)-core and up to 3 orders of magnitude speedup for index construction.
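For orientation, the graph-traversing baseline that the abstract contrasts with can be sketched as a peeling procedure. This is not the paper's index-based algorithm. It assumes the usual definition, which the abstract does not restate: the (α, β)-core is the maximal subgraph in which every upper-layer vertex has degree at least α and every lower-layer vertex has degree at least β. The function name and input format are illustrative.

```python
from collections import defaultdict, deque


def alpha_beta_core(edges, alpha, beta):
    """Return (upper_vertices, lower_vertices) of the (alpha, beta)-core.

    edges: iterable of (u, v) pairs with u in the upper layer and v in the
    lower layer. Vertices violating their degree bound are peeled repeatedly.
    """
    adj_upper = defaultdict(set)
    adj_lower = defaultdict(set)
    for u, v in edges:
        adj_upper[u].add(v)
        adj_lower[v].add(u)

    removed_upper, removed_lower = set(), set()
    queue = deque()
    for u, nbrs in adj_upper.items():
        if len(nbrs) < alpha:
            removed_upper.add(u)
            queue.append(("upper", u))
    for v, nbrs in adj_lower.items():
        if len(nbrs) < beta:
            removed_lower.add(v)
            queue.append(("lower", v))

    # Removing a vertex lowers its neighbours' degrees, which may
    # push them below their own bound in turn.
    while queue:
        side, x = queue.popleft()
        if side == "upper":
            for v in adj_upper[x]:
                adj_lower[v].discard(x)
                if v not in removed_lower and len(adj_lower[v]) < beta:
                    removed_lower.add(v)
                    queue.append(("lower", v))
        else:
            for u in adj_lower[x]:
                adj_upper[u].discard(x)
                if u not in removed_upper and len(adj_upper[u]) < alpha:
                    removed_upper.add(u)
                    queue.append(("upper", u))

    return set(adj_upper) - removed_upper, set(adj_lower) - removed_lower
```

Each call touches every edge at least once while building the adjacency lists, which is the per-query cost that an index is meant to avoid.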

Citations: 81

Spatio-Temporal Capsule-based Reinforcement Learning for Mobility-on-Demand Network Coordination

The World Wide Web Conference

Pub Date : 2019-05-13 DOI: 10.1145/3308558.3313401

Suining He, K. Shin

As an alternative means of convenient and smart transportation, mobility-on-demand (MOD), typified by online ride-sharing and connected taxicabs, has been growing rapidly and spreading worldwide. The large volume of complex traffic and the uncertainty of market supply and demand have made it essential for many MOD service providers to proactively dispatch vehicles towards ride-seekers. To meet this need effectively, we propose STRide, an MOD coordination-learning mechanism reinforced spatio-temporally with capsules. We formalize the adaptive coordination of vehicles as a reinforcement learning framework. STRide incorporates the spatial and temporal distributions of supplies (vehicles) and demands (ride requests), customers' preferences and other external factors. A novel spatio-temporal capsule neural network is designed to predict the provider's rewards based on MOD network states, vehicles and their dispatch actions. This way, the MOD platform adapts itself to the supply-demand dynamics with the best potential rewards. We have conducted extensive data analytics and experimental evaluation with three large-scale datasets (~21 million rides from Uber, Yellow Taxis and Didi). STRide is shown to outperform state-of-the-art methods, substantially reducing the request-rejection rate and passenger waiting time and increasing the service provider's profits, often achieving a 30% improvement over the state of the art.

Citations: 46

Learning Novelty-Aware Ranking of Answers to Complex Questions

The World Wide Web Conference

Pub Date : 2019-05-13 DOI: 10.1145/3308558.3313457

Shahar Harel, S. Albo, Eugene Agichtein, Kira Radinsky

Result ranking diversification has become an important issue for web search, summarization, and question answering. For more complex questions with multiple aspects, such as those on community-based question answering (CQA) sites, a retrieval system should provide a diversified set of relevant results that addresses the different aspects of the query while minimizing redundancy or repetition. We present a new method, DRN, which learns novelty-related features from unlabeled data with minimal social signals to emphasize diversity in ranking. Specifically, DRN parameterizes question-answer interactions via an LSTM representation coupled with an extension of the neural tensor network, which in turn is combined with a novelty-driven sampling approach to automatically generate training data. DRN provides a novel and general approach to diversification in complex question answering and suggests promising directions for search improvements.

Citations: 7

Generalists and Specialists: Using Community Embeddings to Quantify Activity Diversity in Online Platforms

The World Wide Web Conference

Pub Date : 2019-05-13 DOI: 10.1145/3308558.3313729

Isaac Waller, Ashton Anderson

In many online platforms, people must choose how broadly to allocate their energy. Should one concentrate on a narrow area of focus, and become a specialist, or apply oneself more broadly, and become a generalist? In this work, we propose a principled measure of how generalist or specialist a user is, and study behavior in online platforms through this lens. To do this, we construct highly accurate community embeddings that represent communities in a high-dimensional space. We develop sets of community analogies and use them to optimize our embeddings so that they encode community relationships extremely well. Based on these embeddings, we introduce a natural measure of activity diversity, the GS-score. Applying our embedding-based measure to online platforms, we observe a broad spectrum of user activity styles, from extreme specialists to extreme generalists, in both community membership on Reddit and programming contributions on GitHub. We find that activity diversity is related to many important phenomena of user behavior. For example, specialists are much more likely to stay in communities they contribute to, but generalists are much more likely to remain on platforms as a whole. We also find that generalists engage with significantly more diverse sets of users than specialists do. Furthermore, our methodology leads to a simple algorithm for community recommendation, matching state-of-the-art methods like collaborative filtering. Our methods and results introduce an important new dimension of online user behavior and shed light on many aspects of online platform use.
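The abstract does not give the GS-score formula. The sketch below shows one plausible embedding-based reading, stated here as an assumption: the activity-weighted mean cosine similarity between the embeddings of the communities a user is active in and that user's weighted centre of mass. Under this reading, a score near 1 indicates a specialist and lower scores indicate a generalist. The function name and argument layout are illustrative.

```python
import math


def _norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def _unit(vec):
    n = _norm(vec)
    return [x / n for x in vec]


def gs_score(community_vectors, weights):
    """Weighted mean cosine similarity between community embeddings
    and the user's centre of mass (an assumed definition).

    community_vectors: one embedding per community the user is active in
    weights:           the user's activity count in each community
    Undefined when the centre of mass is the zero vector.
    """
    units = [_unit(vec) for vec in community_vectors]
    total = sum(weights)
    dim = len(units[0])
    center = [
        sum(w * u[i] for u, w in zip(units, weights)) / total
        for i in range(dim)
    ]
    center_norm = _norm(center)
    similarities = [
        sum(a * b for a, b in zip(u, center)) / center_norm for u in units
    ]
    return sum(w * s for w, s in zip(weights, similarities)) / total
```

A user active in a single community scores exactly 1, and a user splitting activity evenly between two orthogonal communities scores about 0.71.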

Citations: 38

GhostLink: Latent Network Inference for Influence-aware Recommendation

The World Wide Web Conference

Pub Date : 2019-05-13 DOI: 10.1145/3308558.3313449

Subhabrata Mukherjee, Stephan Günnemann

Social influence plays a vital role in shaping a user's behavior in online communities dealing with items of fine taste like movies, food, and beer. For online recommendation, this implies that users' preferences and ratings are influenced by other individuals. Given only time-stamped reviews of users, can we find out who influences whom, and the characteristics of the underlying influence network? Can we use this network to improve recommendation? While prior work in social-aware recommendation has leveraged social interaction by considering the observed social network of users, many communities like Amazon, Beeradvocate, and Ratebeer do not have explicit user-user links. Therefore, we propose GhostLink, an unsupervised probabilistic graphical model, to automatically learn the latent influence network underlying a review community, given only the temporal traces (timestamps) of users' posts and their content. Based on extensive experiments with four real-world datasets with 13 million reviews, we show that GhostLink improves item recommendation by around 23% over state-of-the-art methods that do not consider this influence. As additional use cases, we show that GhostLink can be used to differentiate between users' latent preferences and influenced ones, as well as to detect influential users based on the learned influence graph.

Citations: 8

ε-Diagnosis: Unsupervised and Real-time Diagnosis of Small-window Long-tail Latency in Large-scale Microservice Platforms

The World Wide Web Conference

Pub Date : 2019-05-13 DOI: 10.1145/3308558.3313653

Huasong Shan, Yuan Chen, Haifeng Liu, Yunpeng Zhang, Xiao Xiao, Xiaofeng He, Min Li, Wei Ding

Microservice architectures and container technologies are broadly adopted by giant internet companies to support their web services, which typically have strict service-level objectives (SLOs) on tail latency rather than average latency. However, diagnosing SLO violations, e.g., long tail latency problems, is non-trivial for large-scale web applications on shared microservice platforms because of million-level operational data and complex operational environments. We identify a new type of tail latency problem for web services, small-window long-tail latency (SWLT), which is typically aggregated over a small statistical window (e.g., 1 minute or 1 second). We observe that SWLT usually occurs in a small number of containers in microservice clusters and shifts sharply among different containers at different time points. To diagnose the root causes of SWLT, we propose an unsupervised and low-cost diagnosis algorithm, ε-Diagnosis, which uses a two-sample test algorithm and ε-statistics for measuring the similarity of time series to identify root-cause metrics from millions of metrics. We implement and deploy a real-time diagnosis system in our production microservice platforms. The evaluation using real web application datasets demonstrates that ε-Diagnosis can identify all the actual root causes at runtime and significantly reduce the candidate problem space, outperforming other root-cause analysis algorithms based on time-series distance.
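The abstract names a two-sample test and a statistic for measuring the similarity of time series, without giving formulas. A minimal sketch of one such test follows, assuming the statistic is the energy distance between a metric's samples in a normal window and in an anomalous window, with a permutation test for significance. The function names, the number of permutation rounds, and the fixed seed are illustrative choices, not details from the paper.

```python
import random


def mean_abs_diff(a, b):
    """Average |x - y| over all pairs drawn from samples a and b."""
    return sum(abs(x - y) for x in a for y in b) / (len(a) * len(b))


def energy_distance(x, y):
    """Energy distance 2 E|X - Y| - E|X - X'| - E|Y - Y'| for 1-D samples.
    It is zero for identical samples and grows as the samples diverge."""
    return 2.0 * mean_abs_diff(x, y) - mean_abs_diff(x, x) - mean_abs_diff(y, y)


def permutation_p_value(x, y, rounds=200, seed=0):
    """Estimate how often a random relabelling of the pooled samples
    yields an energy distance at least as large as the observed one."""
    rng = random.Random(seed)
    observed = energy_distance(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        if energy_distance(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    return (hits + 1) / (rounds + 1)
```

A metric whose normal-window and anomalous-window samples yield a small p-value changed significantly between the two windows and would be kept as a root-cause candidate; metrics with large p-values would be discarded.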

Citations: 27

DPLink: User Identity Linkage via Deep Neural Network From Heterogeneous Mobility Data

The World Wide Web Conference

Pub Date : 2019-05-13 DOI: 10.1145/3308558.3313424

J. Feng, Mingyang Zhang, Huandong Wang, Zeyu Yang, Chao Zhang, Yong Li, Depeng Jin

Online services play critical roles in almost all aspects of users' lives. Users usually have multiple online identities (IDs) across different online services. To fuse the separate user data held by multiple services for better business intelligence, it is critical for service providers to link online IDs belonging to the same user. Meanwhile, the popularity of mobile networks and GPS-equipped smart devices has provided a generic way to link IDs, i.e., by using the mobility traces of IDs. However, linking IDs based on their mobility traces has been a challenging problem because mobility data across services is highly heterogeneous, incomplete and noisy. In this paper, we propose DPLink, an end-to-end deep-learning-based framework, to complete the user identity linkage task for heterogeneous mobility data collected from different services with different properties. DPLink is made up of a feature extractor, which includes a location encoder and a trajectory encoder to extract representative features from trajectories, and a comparator, which compares two trajectories and decides whether to link them as belonging to the same user. In particular, we propose a pre-training strategy with a simple task to train the DPLink model, overcoming the training difficulties introduced by the highly heterogeneous nature of mobility data from different sources. In addition, we introduce a multi-modal embedding network and a co-attention mechanism in DPLink to deal with the low quality of mobility data. By conducting extensive experiments on two real-life ground-truth mobility datasets with eight baselines, we demonstrate that DPLink outperforms the state-of-the-art solutions by more than 15% in terms of hit-precision. Moreover, it can be extended with external geographical context data and works stably with heterogeneous, noisy mobility traces. Our code is publicly available.

Citations: 68

Understanding Urban Dynamics via State-sharing Hidden Markov Model

The World Wide Web Conference

Pub Date : 2019-05-13 DOI: 10.1145/3308558.3313453

Tong Xia, Yue Yu, Fengli Xu, Funing Sun, Diansheng Guo, Depeng Jin, Yong Li

Modeling people's activities in urban space is a crucial socio-economic task, but it remains extremely challenging because suitable methods are lacking. To model the temporal dynamics of human activities concisely and specifically, we present the State-sharing Hidden Markov Model (SSHMM). First, it extracts urban states from the whole city, which capture the volume of population flows as well as the visit frequency of each type of Point of Interest (PoI). Second, it characterizes the urban dynamics of each region as state transitions over the shared states, which reveals distinct daily rhythms of urban activities. We evaluate our method on a large-scale real-life mobility dataset, and the results demonstrate that SSHMM learns semantics-rich urban dynamics that are highly correlated with the functions of the region. In addition, it recovers the urban dynamics in different time slots with an error of 0.0793, outperforming the general HMM by 54.2%.
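The state-sharing idea can be sketched in a few lines: every region uses the same set of hidden states and the same emission distributions, while each region keeps its own initial distribution and transition matrix. The sketch below is a simplification under stated assumptions and not the authors' implementation: it uses two states, discrete observation symbols (0 = high flow, 1 = low flow), hand-picked parameters, and hypothetical names such as `SHARED_EMIT` and `REGION_PARAMS`. It only evaluates likelihoods with the standard scaled forward algorithm and does not learn parameters.

```python
import math


def forward_loglik(obs, init, trans, emit):
    """Scaled forward algorithm: log P(obs) for one discrete-emission HMM."""
    n = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n)]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transition matrix, then weight by emission.
            alpha = [
                sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][obs[t]]
                for s in range(n)
            ]
        scale = sum(alpha)
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return loglik


# Emission table shared by all regions (hypothetical values).
# Row = hidden state, column = observation symbol.
SHARED_EMIT = [[0.9, 0.1], [0.2, 0.8]]

# Region-specific dynamics over the shared states (hypothetical values).
REGION_PARAMS = {
    "region_a": {"init": [0.5, 0.5], "trans": [[0.9, 0.1], [0.1, 0.9]]},  # sticky
    "region_b": {"init": [0.5, 0.5], "trans": [[0.1, 0.9], [0.9, 0.1]]},  # alternating
}


def region_logliks(obs):
    """Log-likelihood of one observation sequence under each region's dynamics."""
    return {
        name: forward_loglik(obs, p["init"], p["trans"], SHARED_EMIT)
        for name, p in REGION_PARAMS.items()
    }


scores = region_logliks([0, 0, 0, 0])
print(scores)
```

Because the emission table is shared, a state means the same thing in every region, so differences between regions show up only in their transition matrices. Here a persistent run of high-flow observations is more likely under the sticky dynamics of `region_a` than under the alternating dynamics of `region_b`.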

Citations: 15

Multiple Treatment Effect Estimation using Deep Generative Model with Task Embedding

The World Wide Web Conference

Pub Date : 2019-05-13 DOI: 10.1145/3308558.3313744

S. Saini, Sunny Dhamnani, Aakash Srinivasan, A. A. Ibrahim, Prithviraj Chavan

Causal inference using observational data on multiple treatments is an important problem in a wide variety of fields. However, the existing literature tends to focus only on causal inference in the case of binary or multinoulli treatments. These models are either incompatible with multiple treatments, or extending them to multiple treatments is computationally expensive. We build on a previous formulation of causal inference using a variational autoencoder (VAE) and propose a novel architecture to estimate the causal effect of any subset of the treatments. The higher-order effects of multiple treatments are captured through a task embedding, which allows the model to scale to multiple treatments. The model is applied to a real digital marketing dataset to evaluate the next best set of marketing actions. For evaluation, the model is compared against competitive baseline models on two semi-synthetic datasets created using the covariates from the real dataset. Performance is measured along four evaluation metrics considered in the causal inference literature and one proposed by us. The proposed evaluation metric measures the loss in the expected outcome when a particular model is used for decision making, as compared to the ground truth. The proposed model outperforms the baselines along all five evaluation metrics, beating the best baseline by over 30%. The proposed approach is also shown to be robust when a subset of the confounders is not observed. The results on real data show the importance of the flexible modeling approach provided by the proposed model.
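The abstract describes the proposed metric only in words: the loss in expected outcome when a model drives the decision, compared with the ground truth. One plausible reading, offered here as an assumption and not as the paper's exact formula, is a decision regret: for each unit, take the best true outcome over all treatment subsets and subtract the true outcome of the subset the model would choose. The function names and toy numbers below are hypothetical.

```python
from itertools import product


def treatment_subsets(n_treatments):
    """All 2**n on/off combinations of n treatments, as tuples of 0/1."""
    return list(product((0, 1), repeat=n_treatments))


def expected_outcome_loss(true_outcomes, predicted_outcomes):
    """Mean gap between the best achievable true outcome and the true
    outcome of the treatment subset the model would pick for each unit."""
    total = 0.0
    for truth, pred in zip(true_outcomes, predicted_outcomes):
        chosen = max(pred, key=pred.get)  # model's recommended subset
        total += max(truth.values()) - truth[chosen]
    return total / len(true_outcomes)


# Two treatments give four subsets: (0,0), (0,1), (1,0), (1,1).
subsets = treatment_subsets(2)

# Hypothetical ground-truth potential outcomes for two units.
true_outcomes = [
    dict(zip(subsets, [1.0, 2.0, 1.5, 3.0])),
    dict(zip(subsets, [0.5, 0.7, 2.0, 1.0])),
]
# Hypothetical model predictions for the same units.
predicted_outcomes = [
    dict(zip(subsets, [0.8, 2.5, 1.0, 2.2])),  # picks (0,1); best is (1,1)
    dict(zip(subsets, [0.4, 0.9, 1.9, 1.2])),  # picks (1,0); this is the best
]

loss = expected_outcome_loss(true_outcomes, predicted_outcomes)
print(loss)
```

A metric of this kind needs the true outcome under every treatment subset, which is only known when outcomes are simulated. That is consistent with the abstract's use of semi-synthetic datasets for evaluation.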

Citations: 15