Data cleaning and preparation have been a long-standing challenge in data science: dirty data can lead to incorrect results and misleading conclusions. For a given dataset and a given machine learning-based task, a plethora of data preprocessing techniques and alternative data curation strategies may lead to dramatically different outputs with unequal quality performance. Most current work on data cleaning and automated machine learning, however, focuses on developing either cleaning algorithms or user-guided systems, rather than on a principled method for selecting the sequence of data preprocessing steps that leads to optimal quality performance. In this paper, we propose Learn2Clean, a method based on Q-Learning, a model-free reinforcement learning technique, that selects, for a given dataset, an ML model, and a quality performance metric, the optimal sequence of tasks for preprocessing the data such that the quality of the ML model result is maximized. As a preliminary validation of our approach in the context of Web data analytics, we present some promising results on data preparation for clustering, regression, and classification on real-world data.
{"title":"Learn2Clean: Optimizing the Sequence of Tasks for Web Data Preparation","authors":"Laure Berti-Équille","doi":"10.1145/3308558.3313602","DOIUrl":"https://doi.org/10.1145/3308558.3313602","url":null,"abstract":"Data cleaning and preparation has been a long-standing challenge in data science to avoid incorrect results and misleading conclusions obtained from dirty data. For a given dataset and a given machine learning-based task, a plethora of data preprocessing techniques and alternative data curation strategies may lead to dramatically different outputs with unequal quality performance. Most current work on data cleaning and automated machine learning, however, focus on developing either cleaning algorithms or user-guided systems or argue to rely on a principled method to select the sequence of data preprocessing steps that can lead to the optimal quality performance of. In this paper, we propose Learn2Clean, a method based on Q-Learning, a model-free reinforcement learning technique that selects, for a given dataset, a ML model, and a quality performance metric, the optimal sequence of tasks for preprocessing the data such that the quality of the ML model result is maximized. As a preliminary validation of our approach in the context of Web data analytics, we present some promising results on data preparation for clustering, regression, and classification on real-world data.","PeriodicalId":23013,"journal":{"name":"The World Wide Web Conference","volume":"380 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80660923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
How do learners schedule their online learning? This question concerns both course instructors and researchers, especially in the context of self-paced online learning environments. Many indicators and methods have been proposed to understand and improve the time management of learning activities; however, there are few tools for visualizing, comparing, and exploring time management to gain an intuitive understanding. In this demo, we introduce LearnExp, an interactive visual analytics system designed to explore the temporal patterns of learning activities and to explain the relationships between academic performance and these patterns. This system will help instructors comparatively explore the distribution of learner activities from multiple aspects, and visually explain the time management of different learner groups with the prediction of learning performance.
{"title":"LearnerExp: Exploring and Explaining the Time Management of Online Learning Activity","authors":"Huan He, Q. Zheng, Bo Dong","doi":"10.1145/3308558.3314140","DOIUrl":"https://doi.org/10.1145/3308558.3314140","url":null,"abstract":"How do learners schedule their online learning? This issue is concerned by both course instructors and researchers, especially in the context of self-paced online learning environment. Many indicators and methods have been proposed to understand and improve the time management of learning activities, however, there are few tools of visualizing, comparing and exploring the time management to gain intuitive understanding. In this demo, we introduce the LearnExp, an interactive visual analytic system designed to explore the temporal patterns of learning activities and explain the relationships between academic performance and these patterns. This system will help instructors to comparatively explore the distribution of learner activities from multiple aspects, and to visually explain the time management of different learner groups with the prediction of learning performance.","PeriodicalId":23013,"journal":{"name":"The World Wide Web Conference","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82424072","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jishnu Ray Chowdhury, Cornelia Caragea, Doina Caragea
While keyphrase extraction has received considerable attention in recent years, relatively few studies exist on extracting keyphrases from social media platforms such as Twitter, and even fewer on extracting disaster-related keyphrases from such sources. During a disaster, keyphrases can be extremely useful for filtering relevant tweets that can enhance situational awareness. Previously, joint training of two different layers of a stacked Recurrent Neural Network for keyword discovery and keyphrase extraction had been shown to be effective in extracting keyphrases from general Twitter data. We improve the model's performance on both general Twitter data and disaster-related Twitter data by incorporating contextual word embeddings, POS tags, phonetics, and phonological features. Moreover, we discuss the shortcomings of the often-used F1-measure for evaluating the quality of predicted keyphrases with respect to the ground truth annotations. Instead of the F1-measure, we propose the use of embedding-based metrics to better capture the correctness of the predicted keyphrases. In addition, we present a novel extension of an embedding-based metric that allows one to better control the penalty for the difference in the number of ground-truth and predicted keyphrases.
{"title":"Keyphrase Extraction from Disaster-related Tweets","authors":"Jishnu Ray Chowdhury, Cornelia Caragea, Doina Caragea","doi":"10.1145/3308558.3313696","DOIUrl":"https://doi.org/10.1145/3308558.3313696","url":null,"abstract":"While keyphrase extraction has received considerable attention in recent years, relatively few studies exist on extracting keyphrases from social media platforms such as Twitter, and even fewer for extracting disaster-related keyphrases from such sources. During a disaster, keyphrases can be extremely useful for filtering relevant tweets that can enhance situational awareness. Previously, joint training of two different layers of a stacked Recurrent Neural Network for keyword discovery and keyphrase extraction had been shown to be effective in extracting keyphrases from general Twitter data. We improve the model's performance on both general Twitter data and disaster-related Twitter data by incorporating contextual word embeddings, POS-tags, phonetics, and phonological features. Moreover, we discuss the shortcomings of the often used F1-measure for evaluating the quality of predicted keyphrases with respect to the ground truth annotations. Instead of the F1-measure, we propose the use of embedding-based metrics to better capture the correctness of the predicted keyphrases. In addition, we also present a novel extension of an embedding-based metric. The extension allows one to better control the penalty for the difference in the number of ground-truth and predicted keyphrases.","PeriodicalId":23013,"journal":{"name":"The World Wide Web Conference","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78811666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alessandro Nuara, Nicola Sosio, F. Trovò, Maria Chiara Zaccardi, N. Gatti, Marcello Restelli
In 2017, Internet ad spending reached 209 billion USD worldwide, while, e.g., TV ads brought in 178 billion USD. An Internet advertising campaign includes up to thousands of sub-campaigns on multiple channels, e.g., search, social, display, whose parameters (bid and daily budget) need to be optimized every day, subject to a (cumulative) budget constraint. Such a process is often unaffordable for humans and its automation is crucial. As also shown by marketing funnel models, the sub-campaigns are usually interdependent: e.g., display ads induce awareness, increasing the number of impressions (and, thus, also the number of conversions) of search ads. This interdependence is widely exploited by humans in the optimization process, whereas, to the best of our knowledge, no algorithm takes it into account. In this paper, we provide the first model capturing the interdependence among sub-campaigns. We also provide the IDIL algorithm, which, employing Granger Causality and Gaussian Processes, learns from past data and returns an optimal stationary bid/daily budget allocation. We prove theoretical guarantees on the loss of IDIL w.r.t. the clairvoyant solution, and we show empirical evidence of its superiority in both realistic and real-world settings when compared with existing approaches.
{"title":"Dealing with Interdependencies and Uncertainty in Multi-Channel Advertising Campaigns Optimization","authors":"Alessandro Nuara, Nicola Sosio, F. Trovò, Maria Chiara Zaccardi, N. Gatti, Marcello Restelli","doi":"10.1145/3308558.3313470","DOIUrl":"https://doi.org/10.1145/3308558.3313470","url":null,"abstract":"In 2017, Internet ad spending reached 209 billion USD worldwide, while, e.g., TV ads brought in 178 billion USD. An Internet advertising campaign includes up to thousands of sub-campaigns on multiple channels, e.g., search, social, display, whose parameters (bid and daily budget) need to be optimized every day, subject to a (cumulative) budget constraint. Such a process is often unaffordable for humans and its automation is crucial. As also shown by marketing funnel models, the sub-campaigns are usually interdependent, e.g., display ads induce awareness, increasing the number of impressions-and, thus, also the number of conversions-of search ads. This interdependence is widely exploited by humans in the optimization process, whereas, to the best of our knowledge, no algorithm takes it into account. In this paper, we provide the first model capturing the sub-campaigns interdependence. We also provide the IDIL algorithm, which, employing Granger Causality and Gaussian Processes, learns from past data, and returns an optimal stationary bid/daily budget allocation. We prove theoretical guarantees on the loss of IDIL w.r.t. the clairvoyant solution, and we show empirical evidence of its superiority in both realistic and real-world settings when compared with existing approaches.","PeriodicalId":23013,"journal":{"name":"The World Wide Web Conference","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86885265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An important task in Location-based Social Network applications is to predict mobility - specifically, a user's next point-of-interest (POI) - which is challenging due to the implicit feedback of footprints, the sparsity of generated check-ins, and the joint impact of historical periodicity and recent check-ins. Motivated by the recent success of deep variational inference, we propose VANext (Variational Attention based Next POI prediction): a latent variable model for inferring a user's next footprint, with historical mobility attention. The variational encoding captures latent features of recent mobility, followed by a search over similar historical trajectories for periodic patterns. A trajectory convolutional network is then used to learn historical mobility, significantly improving efficiency over the commonly used recurrent networks. A novel variational attention mechanism is proposed to exploit the periodicity of historical mobility patterns, combined with recent check-in preferences to predict the next POIs. We also implement a semi-supervised variant, VANext-S, which relies on variational encoding to pre-train all current trajectories in an unsupervised manner, and uses the latent variables to initialize the current trajectory learning. Experiments conducted on real-world datasets demonstrate that VANext and VANext-S outperform state-of-the-art human mobility prediction models.
{"title":"Predicting Human Mobility via Variational Attention","authors":"Qiang Gao, Fan Zhou, Goce Trajcevski, Kunpeng Zhang, Ting Zhong, Fengli Zhang","doi":"10.1145/3308558.3313610","DOIUrl":"https://doi.org/10.1145/3308558.3313610","url":null,"abstract":"An important task in Location based Social Network applications is to predict mobility - specifically, user's next point-of-interest (POI) - challenging due to the implicit feedback of footprints, sparsity of generated check-ins, and the joint impact of historical periodicity and recent check-ins. Motivated by recent success of deep variational inference, we propose VANext (Variational Attention based Next) POI prediction: a latent variable model for inferring user's next footprint, with historical mobility attention. The variational encoding captures latent features of recent mobility, followed by searching the similar historical trajectories for periodical patterns. A trajectory convolutional network is then used to learn historical mobility, significantly improving the efficiency over often used recurrent networks. A novel variational attention mechanism is proposed to exploit the periodicity of historical mobility patterns, combined with recent check-in preference to predict next POIs. We also implement a semi-supervised variant - VANext-S, which relies on variational encoding for pre-training all current trajectories in an unsupervised manner, and uses the latent variables to initialize the current trajectory learning. Experiments conducted on real-world datasets demonstrate that VANext and VANext-S outperform the state-of-the-art human mobility prediction models.","PeriodicalId":23013,"journal":{"name":"The World Wide Web Conference","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89095619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yujie Lin, Pengjie Ren, Zhumin Chen, Z. Ren, Jun Ma, M. de Rijke
The task of fashion recommendation includes two main challenges: visual understanding and visual matching. Visual understanding aims to extract effective visual features. Visual matching aims to model a human notion of compatibility to compute a match between fashion items. Most previous studies rely on recommendation loss alone to guide visual understanding and matching. Although the features captured by these methods describe basic characteristics (e.g., color, texture, shape) of the input items, they are not directly related to the visual signals of the output items (to be recommended). This is problematic because the aesthetic characteristics (e.g., style, design), from which we can directly infer the output items, are lacking: features are learned under the recommendation loss alone, where the supervision signal is simply whether two given items are matched or not. To address this problem, we propose a neural co-supervision learning framework, called the FAshion Recommendation Machine (FARM). FARM improves visual understanding by incorporating the supervision of a generation loss, which we hypothesize to be able to better encode aesthetic information. FARM enhances visual matching by introducing a novel layer-to-layer matching mechanism to fuse aesthetic information more effectively, while avoiding paying too much attention to generation quality at the expense of recommendation performance. Extensive experiments on two publicly available datasets show that FARM outperforms state-of-the-art models on outfit recommendation in terms of AUC and MRR. Detailed analyses of generated and recommended items demonstrate that FARM can encode better features and generate high-quality images as references to improve recommendation performance.
{"title":"Improving Outfit Recommendation with Co-supervision of Fashion Generation","authors":"Yujie Lin, Pengjie Ren, Zhumin Chen, Z. Ren, Jun Ma, M. de Rijke","doi":"10.1145/3308558.3313614","DOIUrl":"https://doi.org/10.1145/3308558.3313614","url":null,"abstract":"The task of fashion recommendation includes two main challenges: visual understanding and visual matching. Visual understanding aims to extract effective visual features. Visual matching aims to model a human notion of compatibility to compute a match between fashion items. Most previous studies rely on recommendation loss alone to guide visual understanding and matching. Although the features captured by these methods describe basic characteristics (e.g., color, texture, shape) of the input items, they are not directly related to the visual signals of the output items (to be recommended). This is problematic because the aesthetic characteristics (e.g., style, design), based on which we can directly infer the output items, are lacking. Features are learned under the recommendation loss alone, where the supervision signal is simply whether the given two items are matched or not. To address this problem, we propose a neural co-supervision learning framework, called the FAshion Recommendation Machine (FARM). FARM improves visual understanding by incorporating the supervision of generation loss, which we hypothesize to be able to better encode aesthetic information. FARM enhances visual matching by introducing a novel layer-to-layer matching mechanism to fuse aesthetic information more effectively, and meanwhile avoiding paying too much attention to the generation quality and ignoring the recommendation performance. Extensive experiments on two publicly available datasets show that FARM outperforms state-of-the-art models on outfit recommendation, in terms of AUC and MRR. Detailed analyses of generated and recommended items demonstrate that FARM can encode better features and generate high quality images as references to improve recommendation performance.","PeriodicalId":23013,"journal":{"name":"The World Wide Web Conference","volume":"88 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81226225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fan Zhou, Xiaoli Yue, Goce Trajcevski, Ting Zhong, Kunpeng Zhang
Unveiling human mobility patterns is an important task for many downstream applications such as point-of-interest (POI) recommendation and personalized trip planning. Compelling results have been obtained with various sequential modeling methods and representation techniques. However, discovering and exploiting the context of trajectories, in terms of abstract topics associated with the motion, can provide a more comprehensive understanding of the dynamics of patterns. We propose a new paradigm for moving pattern mining based on learning trajectory context, and a method - Context-Aware Variational Trajectory Encoding and Human Mobility Inference (CATHI) - for learning user trajectory representations via a framework consisting of: (1) a variational encoder and a recurrent encoder; (2) a variational attention layer; and (3) two decoders. We simultaneously tackle two subtasks: (T1) recovering user routes (trajectory reconstruction); and (T2) predicting the trip that the user will travel (trajectory prediction). We show that the encoded contextual trajectory vectors efficiently characterize the hierarchical mobility semantics, from which one can decode the implicit meanings of trajectories. We evaluate our method on several public datasets and demonstrate that the proposed CATHI effectively improves the performance of both subtasks, compared to state-of-the-art approaches.
{"title":"Context-aware Variational Trajectory Encoding and Human Mobility Inference","authors":"Fan Zhou, Xiaoli Yue, Goce Trajcevski, Ting Zhong, Kunpeng Zhang","doi":"10.1145/3308558.3313608","DOIUrl":"https://doi.org/10.1145/3308558.3313608","url":null,"abstract":"Unveiling human mobility patterns is an important task for many downstream applications like point-of-interest (POI) recommendation and personalized trip planning. Compelling results exist in various sequential modeling methods and representation techniques. However, discovering and exploiting the context of trajectories in terms of abstract topics associated with the motion can provide a more comprehensive understanding of the dynamics of patterns. We propose a new paradigm for moving pattern mining based on learning trajectory context, and a method - Context-Aware Variational Trajectory Encoding and Human Mobility Inference (CATHI) - for learning user trajectory representation via a framework consisting of: (1) a variational encoder and a recurrent encoder; (2) a variational attention layer; (3) two decoders. We simultaneously tackle two subtasks: (T1) recovering user routes (trajectory reconstruction); and (T2) predicting the trip that the user would travel (trajectory prediction). We show that the encoded contextual trajectory vectors efficiently characterize the hierarchical mobility semantics, from which one can decode the implicit meanings of trajectories. We evaluate our method on several public datasets and demonstrate that the proposed CATHI can efficiently improve the performance of both subtasks, compared to state-of-the-art approaches.","PeriodicalId":23013,"journal":{"name":"The World Wide Web Conference","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86225839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yufei Xie, Shuchun Liu, Tangren Yao, Yao Peng, Zhao Lu
Answer ranking is an important task in Community Question Answering (CQA), in which “Good” answers should be ranked ahead of “Bad” or “Potentially Useful” answers. The state of the art is the attention-based classification framework that learns the mapping between questions and answers. However, we observe that existing attention-based methods perform poorly on complicated question-answer pairs. One major reason is that existing methods cannot obtain accurate alignments between questions and answers for such pairs. We call this phenomenon “attention divergence”. In this paper, we propose a new attention mechanism, called the Focusing Attention Network (FAN), which can automatically draw back the divergent attention by adding semantic and metadata features. Our model can focus on the most important part of the sentence and therefore improve answer ranking performance. Experimental results on the CQA datasets of SemEval-2016 and SemEval-2017 demonstrate that our method attains MAP scores of 79.38 and 88.72, respectively, and outperforms the top-1 system in each shared task by 0.19 and 0.29.
{"title":"Focusing Attention Network for Answer Ranking","authors":"Yufei Xie, Shuchun Liu, Tangren Yao, Yao Peng, Zhao Lu","doi":"10.1145/3308558.3313518","DOIUrl":"https://doi.org/10.1145/3308558.3313518","url":null,"abstract":"Answer ranking is an important task in Community Question Answering (CQA), by which “Good” answers should be ranked in the front of “Bad” or “Potentially Useful” answers. The state of the art is the attention-based classification framework that learns the mapping between the questions and the answers. However, we observe that existing attention-based methods perform poorly on complicated question-answer pairs. One major reason is that existing methods cannot get accurate alignments between questions and answers for such pairs. We call the phenomenon “attention divergence”. In this paper, we propose a new attention mechanism, called Focusing Attention Network(FAN), which can automatically draw back the divergent attention by adding the semantic, and metadata features. Our Model can focus on the most important part of the sentence and therefore improve the answer ranking performance. Experimental results on the CQA dataset of SemEval-2016 and SemEval-2017 demonstrate that our method respectively attains 79.38 and 88.72 on MAP and outperforms the Top-1 system in the shared task by 0.19 and 0.29.","PeriodicalId":23013,"journal":{"name":"The World Wide Web Conference","volume":"70 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86275771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Online advertising is one of the primary funding sources for various content, services, and applications on both web and mobile platforms. Mobile in-app advertising reuses many existing web technologies under the same ad-serving model (i.e., users - publishers - ad networks - advertisers). Nevertheless, mobile in-app advertising differs from traditional web advertising in many aspects. For example, malicious app developers can generate fraudulent ad clicks in an automated fashion, whereas malicious web publishers have to launch click fraud with bots. In spite of using the same underlying web infrastructure, advertising threats behave differently on the two platforms. Existing works have studied click fraud and malvertising separately in the mobile setting. However, it is unknown whether there exists a relationship between these two dominant threats. In this paper, we present an ad collection framework – MAdLife – on Android that captures all the in-app ad traffic generated during an ad's entire lifespan. MAdLife allows us to revisit both threats in a fine-grained manner and study the relationship between them. It further enables the exploration of other threats related to ad landing pages. We analyzed 5.7K Android apps crawled from the Google Play Store, and collected 83K ads and their landing pages using MAdLife. Similar to traditional web ads, 58K ads landed on web pages. We discovered 37 click-fraud apps, and found that 1.49% of the 58K ads were malicious. We also revealed a strong correlation between fraudulent apps and malicious ads. Specifically, 15.44% of malicious ads originated from the fraudulent apps. Conversely, 18.36% of the ads served in the fraudulent apps were malicious, while only 1.28% were malicious in the remaining apps. This suggests that users of fraudulent apps are much more likely (14x) to encounter malicious ads. Additionally, we discovered that 243 popular JavaScript snippets embedded in over 10% of the landing pages were malicious. Finally, we conducted the first analysis of inappropriate mobile in-app ads.
{"title":"Revisiting Mobile Advertising Threats with MAdLife","authors":"Gong Chen, W. Meng, J. Copeland","doi":"10.1145/3308558.3313549","DOIUrl":"https://doi.org/10.1145/3308558.3313549","url":null,"abstract":"Online advertising is one of the primary funding sources for various of content, services, and applications on both web and mobile platforms. Mobile in-app advertising reuses many existing web technologies under the same ad-serving model (i.e., users - publishers - ad networks - advertisers). Nevertheless, mobile in-app advertising is different from the traditional web advertising in many aspects. For example, malicious app developers can generate fraudulent ad clicks in an automated fashion, but malicious web publishers have to launch click fraud with bots. In spite of using the same underlying web infrastructure, advertising threats behave differently on the two platforms. Existing works have studied separately click fraud and malvertising in the mobile setting. However, it is unknown if there exists a relationship between these two dominant threats. In this paper, we present an ad collection framework – MAdLife – on Android to capture all the in-app ad traffic generated during an ad's entire lifespan. MAdLife allows us to revisit both threats in a fine-grained manner and study the relationship between them. It further enables the exploration of other threats related to ad landing pages. We analyzed 5.7K Android apps crawled from the Google Play Store, and collected 83K ads and their landing pages using MAdLife. Similar to traditional web ads, 58K ads landed on web pages. We discovered 37 click-fraud apps, and found that 1.49% of the 58K ads were malicious. We also revealed a strong correlation between fraudulent apps and malicious ads. Specifically, 15.44% of malicious ads originated from the fraudulent apps. Conversely, 18.36% of the ads served in the fraudulent apps were malicious, while only 1.28% were malicious in the rest apps. This suggests that users of fraudulent apps are much more (14x) likely to encounter malicious ads. Additionally, we discovered that 243 popular JavaScript snippets embedded by over 10% of the landing pages were malicious. Finally, we conducted the first analysis on inappropriate mobile in-app ads.","PeriodicalId":23013,"journal":{"name":"The World Wide Web Conference","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88786332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fenglong Ma, Yaliang Li, Chenwei Zhang, Jing Gao, Nan Du, Wei Fan
Relation classification is a basic yet important task in natural language processing. Existing relation classification approaches mainly rely on distant supervision, which assumes that a bag of sentences mentioning a pair of entities and extracted from a given corpus should express the same relation type for this entity pair. The training of these models needs a large amount of high-quality bag-level data. However, in some specific domains, such as the medical domain, it is difficult to obtain sufficient and high-quality sentences in a text corpus that mention two entities with a certain medical relation between them. In such a case, it is hard for existing discriminative models to capture the representative features (i.e., common patterns) from diversely expressed entity pairs with a given relation. Thus, the classification performance cannot be guaranteed when limited features are obtained from the corpus. To address this challenge, in this paper, we propose to employ a generative model, called the conditional variational autoencoder (CVAE), to handle the pattern sparsity. We let each relation have an individually learned latent distribution over all possible sentences expressing this relation. As these distributions are learned for the purpose of input reconstruction, the model's classification ability may not be strong enough and should be improved. By distinguishing the differences among different relation distributions, a margin-based regularizer is designed, which leads to a margin-based CVAE (MCVAE) that can significantly enhance the classification ability. Besides, MCVAE can automatically generate semantically meaningful patterns that describe the given relations. Experiments on two real-world datasets validate the effectiveness of the proposed MCVAE on the tasks of relation classification and relation-specific pattern generation.
{"title":"MCVAE: Margin-based Conditional Variational Autoencoder for Relation Classification and Pattern Generation","authors":"Fenglong Ma, Yaliang Li, Chenwei Zhang, Jing Gao, Nan Du, Wei Fan","doi":"10.1145/3308558.3313436","DOIUrl":"https://doi.org/10.1145/3308558.3313436","url":null,"abstract":"Relation classification is a basic yet important task in natural language processing. Existing relation classification approaches mainly rely on distant supervision, which assumes that a bag of sentences mentioning a pair of entities and extracted from a given corpus should express the same relation type of this entity pair. The training of these models needs a lot of high-quality bag-level data. However, in some specific domains, such as medical domain, it is difficult to obtain sufficient and high-quality sentences in a text corpus that mention two entities with a certain medical relation between them. In such a case, it is hard for existing discriminative models to capture the representative features (i.e., common patterns) from diversely expressed entity pairs with a given relation. Thus, the classification performance cannot be guaranteed when limited features are obtained from the corpus. To address this challenge, in this paper, we propose to employ a generative model, called conditional variational autoencoder (CVAE), to handle the pattern sparsity. We define that each relation has an individually learned latent distribution from all possible sentences expressing this relation. As these distributions are learned based on the purpose of input reconstruction, the model's classification ability may not be strong enough and should be improved. By distinguishing the differences among different relation distributions, a margin-based regularizer is designed, which leads to a margin-based CVAE (MCVAE) that can significantly enhance the classification ability. Besides, MCVAE can automatically generate semantically meaningful patterns that describe the given relations. Experiments on two real-world datasets validate the effectiveness of the proposed MCVAE on the tasks of relation classification and relation-specific pattern generation.","PeriodicalId":23013,"journal":{"name":"The World Wide Web Conference","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88808786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}