Data cleaning and preparation have been a long-standing challenge in data science: dirty data can lead to incorrect results and misleading conclusions. For a given dataset and a given machine learning-based task, a plethora of data preprocessing techniques and alternative data curation strategies may lead to dramatically different outputs with unequal quality performance. Most current work on data cleaning and automated machine learning, however, focuses on developing either cleaning algorithms or user-guided systems, rather than on a principled method for selecting the sequence of data preprocessing steps that leads to optimal quality performance. In this paper, we propose Learn2Clean, a method based on Q-Learning, a model-free reinforcement learning technique, that selects, for a given dataset, an ML model, and a quality performance metric, the optimal sequence of tasks for preprocessing the data such that the quality of the ML model result is maximized. As a preliminary validation of our approach in the context of Web data analytics, we present some promising results on data preparation for clustering, regression, and classification on real-world data.
{"title":"Learn2Clean: Optimizing the Sequence of Tasks for Web Data Preparation","authors":"Laure Berti-Équille","doi":"10.1145/3308558.3313602","DOIUrl":"https://doi.org/10.1145/3308558.3313602","url":null,"abstract":"Data cleaning and preparation has been a long-standing challenge in data science to avoid incorrect results and misleading conclusions obtained from dirty data. For a given dataset and a given machine learning-based task, a plethora of data preprocessing techniques and alternative data curation strategies may lead to dramatically different outputs with unequal quality performance. Most current work on data cleaning and automated machine learning, however, focus on developing either cleaning algorithms or user-guided systems or argue to rely on a principled method to select the sequence of data preprocessing steps that can lead to the optimal quality performance of. In this paper, we propose Learn2Clean, a method based on Q-Learning, a model-free reinforcement learning technique that selects, for a given dataset, a ML model, and a quality performance metric, the optimal sequence of tasks for preprocessing the data such that the quality of the ML model result is maximized. As a preliminary validation of our approach in the context of Web data analytics, we present some promising results on data preparation for clustering, regression, and classification on real-world data.","PeriodicalId":23013,"journal":{"name":"The World Wide Web Conference","volume":"380 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80660923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
How do learners schedule their online learning? This question concerns both course instructors and researchers, especially in the context of self-paced online learning environments. Many indicators and methods have been proposed to understand and improve the time management of learning activities; however, there are few tools for visualizing, comparing, and exploring time management to gain an intuitive understanding. In this demo, we introduce LearnExp, an interactive visual analytics system designed to explore the temporal patterns of learning activities and to explain the relationships between academic performance and these patterns. This system will help instructors comparatively explore the distribution of learner activities from multiple aspects, and visually explain the time management of different learner groups with the prediction of learning performance.
{"title":"LearnerExp: Exploring and Explaining the Time Management of Online Learning Activity","authors":"Huan He, Q. Zheng, Bo Dong","doi":"10.1145/3308558.3314140","DOIUrl":"https://doi.org/10.1145/3308558.3314140","url":null,"abstract":"How do learners schedule their online learning? This issue is concerned by both course instructors and researchers, especially in the context of self-paced online learning environment. Many indicators and methods have been proposed to understand and improve the time management of learning activities, however, there are few tools of visualizing, comparing and exploring the time management to gain intuitive understanding. In this demo, we introduce the LearnExp, an interactive visual analytic system designed to explore the temporal patterns of learning activities and explain the relationships between academic performance and these patterns. This system will help instructors to comparatively explore the distribution of learner activities from multiple aspects, and to visually explain the time management of different learner groups with the prediction of learning performance.","PeriodicalId":23013,"journal":{"name":"The World Wide Web Conference","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82424072","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jishnu Ray Chowdhury, Cornelia Caragea, Doina Caragea
While keyphrase extraction has received considerable attention in recent years, relatively few studies exist on extracting keyphrases from social media platforms such as Twitter, and even fewer on extracting disaster-related keyphrases from such sources. During a disaster, keyphrases can be extremely useful for filtering relevant tweets that can enhance situational awareness. Previously, joint training of two different layers of a stacked Recurrent Neural Network for keyword discovery and keyphrase extraction had been shown to be effective in extracting keyphrases from general Twitter data. We improve the model's performance on both general Twitter data and disaster-related Twitter data by incorporating contextual word embeddings, POS tags, phonetics, and phonological features. Moreover, we discuss the shortcomings of the often-used F1-measure for evaluating the quality of predicted keyphrases with respect to the ground truth annotations. Instead of the F1-measure, we propose the use of embedding-based metrics to better capture the correctness of the predicted keyphrases. In addition, we present a novel extension of an embedding-based metric that allows one to better control the penalty for the difference in the number of ground-truth and predicted keyphrases.
{"title":"Keyphrase Extraction from Disaster-related Tweets","authors":"Jishnu Ray Chowdhury, Cornelia Caragea, Doina Caragea","doi":"10.1145/3308558.3313696","DOIUrl":"https://doi.org/10.1145/3308558.3313696","url":null,"abstract":"While keyphrase extraction has received considerable attention in recent years, relatively few studies exist on extracting keyphrases from social media platforms such as Twitter, and even fewer for extracting disaster-related keyphrases from such sources. During a disaster, keyphrases can be extremely useful for filtering relevant tweets that can enhance situational awareness. Previously, joint training of two different layers of a stacked Recurrent Neural Network for keyword discovery and keyphrase extraction had been shown to be effective in extracting keyphrases from general Twitter data. We improve the model's performance on both general Twitter data and disaster-related Twitter data by incorporating contextual word embeddings, POS-tags, phonetics, and phonological features. Moreover, we discuss the shortcomings of the often used F1-measure for evaluating the quality of predicted keyphrases with respect to the ground truth annotations. Instead of the F1-measure, we propose the use of embedding-based metrics to better capture the correctness of the predicted keyphrases. In addition, we also present a novel extension of an embedding-based metric. The extension allows one to better control the penalty for the difference in the number of ground-truth and predicted keyphrases.","PeriodicalId":23013,"journal":{"name":"The World Wide Web Conference","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78811666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alessandro Nuara, Nicola Sosio, F. Trovò, Maria Chiara Zaccardi, N. Gatti, Marcello Restelli
In 2017, Internet ad spending reached 209 billion USD worldwide, while, e.g., TV ads brought in 178 billion USD. An Internet advertising campaign includes up to thousands of sub-campaigns on multiple channels, e.g., search, social, display, whose parameters (bid and daily budget) need to be optimized every day, subject to a (cumulative) budget constraint. Such a process is often unaffordable for humans and its automation is crucial. As also shown by marketing funnel models, the sub-campaigns are usually interdependent: e.g., display ads induce awareness, increasing the number of impressions (and, thus, also the number of conversions) of search ads. This interdependence is widely exploited by humans in the optimization process, whereas, to the best of our knowledge, no algorithm takes it into account. In this paper, we provide the first model capturing the interdependence among sub-campaigns. We also provide the IDIL algorithm, which, employing Granger Causality and Gaussian Processes, learns from past data and returns an optimal stationary bid/daily budget allocation. We prove theoretical guarantees on the loss of IDIL w.r.t. the clairvoyant solution, and we show empirical evidence of its superiority in both realistic and real-world settings when compared with existing approaches.
{"title":"Dealing with Interdependencies and Uncertainty in Multi-Channel Advertising Campaigns Optimization","authors":"Alessandro Nuara, Nicola Sosio, F. Trovò, Maria Chiara Zaccardi, N. Gatti, Marcello Restelli","doi":"10.1145/3308558.3313470","DOIUrl":"https://doi.org/10.1145/3308558.3313470","url":null,"abstract":"In 2017, Internet ad spending reached 209 billion USD worldwide, while, e.g., TV ads brought in 178 billion USD. An Internet advertising campaign includes up to thousands of sub-campaigns on multiple channels, e.g., search, social, display, whose parameters (bid and daily budget) need to be optimized every day, subject to a (cumulative) budget constraint. Such a process is often unaffordable for humans and its automation is crucial. As also shown by marketing funnel models, the sub-campaigns are usually interdependent, e.g., display ads induce awareness, increasing the number of impressions-and, thus, also the number of conversions-of search ads. This interdependence is widely exploited by humans in the optimization process, whereas, to the best of our knowledge, no algorithm takes it into account. In this paper, we provide the first model capturing the sub-campaigns interdependence. We also provide the IDIL algorithm, which, employing Granger Causality and Gaussian Processes, learns from past data, and returns an optimal stationary bid/daily budget allocation. We prove theoretical guarantees on the loss of IDIL w.r.t. the clairvoyant solution, and we show empirical evidence of its superiority in both realistic and real-world settings when compared with existing approaches.","PeriodicalId":23013,"journal":{"name":"The World Wide Web Conference","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86885265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An important task in Location-based Social Network applications is to predict mobility - specifically, a user's next point-of-interest (POI) - which is challenging due to the implicit feedback of footprints, the sparsity of generated check-ins, and the joint impact of historical periodicity and recent check-ins. Motivated by the recent success of deep variational inference, we propose VANext (Variational Attention based Next POI prediction): a latent variable model for inferring a user's next footprint, with historical mobility attention. The variational encoding captures latent features of recent mobility, followed by a search over similar historical trajectories for periodic patterns. A trajectory convolutional network is then used to learn historical mobility, significantly improving efficiency over the commonly used recurrent networks. A novel variational attention mechanism is proposed to exploit the periodicity of historical mobility patterns, combined with recent check-in preferences to predict the next POIs. We also implement a semi-supervised variant, VANext-S, which relies on variational encoding to pre-train all current trajectories in an unsupervised manner, and uses the latent variables to initialize the current trajectory learning. Experiments conducted on real-world datasets demonstrate that VANext and VANext-S outperform state-of-the-art human mobility prediction models.
{"title":"Predicting Human Mobility via Variational Attention","authors":"Qiang Gao, Fan Zhou, Goce Trajcevski, Kunpeng Zhang, Ting Zhong, Fengli Zhang","doi":"10.1145/3308558.3313610","DOIUrl":"https://doi.org/10.1145/3308558.3313610","url":null,"abstract":"An important task in Location based Social Network applications is to predict mobility - specifically, user's next point-of-interest (POI) - challenging due to the implicit feedback of footprints, sparsity of generated check-ins, and the joint impact of historical periodicity and recent check-ins. Motivated by recent success of deep variational inference, we propose VANext (Variational Attention based Next) POI prediction: a latent variable model for inferring user's next footprint, with historical mobility attention. The variational encoding captures latent features of recent mobility, followed by searching the similar historical trajectories for periodical patterns. A trajectory convolutional network is then used to learn historical mobility, significantly improving the efficiency over often used recurrent networks. A novel variational attention mechanism is proposed to exploit the periodicity of historical mobility patterns, combined with recent check-in preference to predict next POIs. We also implement a semi-supervised variant - VANext-S, which relies on variational encoding for pre-training all current trajectories in an unsupervised manner, and uses the latent variables to initialize the current trajectory learning. Experiments conducted on real-world datasets demonstrate that VANext and VANext-S outperform the state-of-the-art human mobility prediction models.","PeriodicalId":23013,"journal":{"name":"The World Wide Web Conference","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89095619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yujie Lin, Pengjie Ren, Zhumin Chen, Z. Ren, Jun Ma, M. de Rijke
The task of fashion recommendation includes two main challenges: visual understanding and visual matching. Visual understanding aims to extract effective visual features. Visual matching aims to model a human notion of compatibility to compute a match between fashion items. Most previous studies rely on recommendation loss alone to guide visual understanding and matching. Although the features captured by these methods describe basic characteristics (e.g., color, texture, shape) of the input items, they are not directly related to the visual signals of the output items (to be recommended). This is problematic because the aesthetic characteristics (e.g., style, design), from which we can directly infer the output items, are lacking: features are learned under the recommendation loss alone, where the supervision signal is simply whether two given items are matched or not. To address this problem, we propose a neural co-supervision learning framework, called the FAshion Recommendation Machine (FARM). FARM improves visual understanding by incorporating the supervision of a generation loss, which we hypothesize to be able to better encode aesthetic information. FARM enhances visual matching by introducing a novel layer-to-layer matching mechanism to fuse aesthetic information more effectively, while avoiding paying too much attention to generation quality at the expense of recommendation performance. Extensive experiments on two publicly available datasets show that FARM outperforms state-of-the-art models on outfit recommendation in terms of AUC and MRR. Detailed analyses of generated and recommended items demonstrate that FARM can encode better features and generate high-quality images as references to improve recommendation performance.
{"title":"Improving Outfit Recommendation with Co-supervision of Fashion Generation","authors":"Yujie Lin, Pengjie Ren, Zhumin Chen, Z. Ren, Jun Ma, M. de Rijke","doi":"10.1145/3308558.3313614","DOIUrl":"https://doi.org/10.1145/3308558.3313614","url":null,"abstract":"The task of fashion recommendation includes two main challenges: visual understanding and visual matching. Visual understanding aims to extract effective visual features. Visual matching aims to model a human notion of compatibility to compute a match between fashion items. Most previous studies rely on recommendation loss alone to guide visual understanding and matching. Although the features captured by these methods describe basic characteristics (e.g., color, texture, shape) of the input items, they are not directly related to the visual signals of the output items (to be recommended). This is problematic because the aesthetic characteristics (e.g., style, design), based on which we can directly infer the output items, are lacking. Features are learned under the recommendation loss alone, where the supervision signal is simply whether the given two items are matched or not. To address this problem, we propose a neural co-supervision learning framework, called the FAshion Recommendation Machine (FARM). FARM improves visual understanding by incorporating the supervision of generation loss, which we hypothesize to be able to better encode aesthetic information. FARM enhances visual matching by introducing a novel layer-to-layer matching mechanism to fuse aesthetic information more effectively, and meanwhile avoiding paying too much attention to the generation quality and ignoring the recommendation performance. Extensive experiments on two publicly available datasets show that FARM outperforms state-of-the-art models on outfit recommendation, in terms of AUC and MRR. Detailed analyses of generated and recommended items demonstrate that FARM can encode better features and generate high quality images as references to improve recommendation performance.","PeriodicalId":23013,"journal":{"name":"The World Wide Web Conference","volume":"88 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81226225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fan Zhou, Xiaoli Yue, Goce Trajcevski, Ting Zhong, Kunpeng Zhang
Unveiling human mobility patterns is an important task for many downstream applications such as point-of-interest (POI) recommendation and personalized trip planning. Compelling results have been obtained with various sequential modeling methods and representation techniques. However, discovering and exploiting the context of trajectories, in terms of abstract topics associated with the motion, can provide a more comprehensive understanding of the dynamics of patterns. We propose a new paradigm for moving pattern mining based on learning trajectory context, and a method - Context-Aware Variational Trajectory Encoding and Human Mobility Inference (CATHI) - for learning user trajectory representations via a framework consisting of: (1) a variational encoder and a recurrent encoder; (2) a variational attention layer; and (3) two decoders. We simultaneously tackle two subtasks: (T1) recovering user routes (trajectory reconstruction); and (T2) predicting the trip that the user will travel (trajectory prediction). We show that the encoded contextual trajectory vectors efficiently characterize the hierarchical mobility semantics, from which one can decode the implicit meanings of trajectories. We evaluate our method on several public datasets and demonstrate that the proposed CATHI effectively improves the performance of both subtasks, compared to state-of-the-art approaches.
{"title":"Context-aware Variational Trajectory Encoding and Human Mobility Inference","authors":"Fan Zhou, Xiaoli Yue, Goce Trajcevski, Ting Zhong, Kunpeng Zhang","doi":"10.1145/3308558.3313608","DOIUrl":"https://doi.org/10.1145/3308558.3313608","url":null,"abstract":"Unveiling human mobility patterns is an important task for many downstream applications like point-of-interest (POI) recommendation and personalized trip planning. Compelling results exist in various sequential modeling methods and representation techniques. However, discovering and exploiting the context of trajectories in terms of abstract topics associated with the motion can provide a more comprehensive understanding of the dynamics of patterns. We propose a new paradigm for moving pattern mining based on learning trajectory context, and a method - Context-Aware Variational Trajectory Encoding and Human Mobility Inference (CATHI) - for learning user trajectory representation via a framework consisting of: (1) a variational encoder and a recurrent encoder; (2) a variational attention layer; (3) two decoders. We simultaneously tackle two subtasks: (T1) recovering user routes (trajectory reconstruction); and (T2) predicting the trip that the user would travel (trajectory prediction). We show that the encoded contextual trajectory vectors efficiently characterize the hierarchical mobility semantics, from which one can decode the implicit meanings of trajectories. We evaluate our method on several public datasets and demonstrate that the proposed CATHI can efficiently improve the performance of both subtasks, compared to state-of-the-art approaches.","PeriodicalId":23013,"journal":{"name":"The World Wide Web Conference","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86225839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yufei Xie, Shuchun Liu, Tangren Yao, Yao Peng, Zhao Lu
Answer ranking is an important task in Community Question Answering (CQA), in which “Good” answers should be ranked ahead of “Bad” or “Potentially Useful” answers. The state of the art is the attention-based classification framework that learns the mapping between questions and answers. However, we observe that existing attention-based methods perform poorly on complicated question-answer pairs. One major reason is that existing methods cannot obtain accurate alignments between questions and answers for such pairs. We call this phenomenon “attention divergence”. In this paper, we propose a new attention mechanism, called the Focusing Attention Network (FAN), which can automatically draw back the divergent attention by adding semantic and metadata features. Our model can focus on the most important part of the sentence and therefore improve answer ranking performance. Experimental results on the CQA datasets of SemEval-2016 and SemEval-2017 demonstrate that our method attains MAP scores of 79.38 and 88.72, respectively, and outperforms the top-1 system in each shared task by 0.19 and 0.29.
{"title":"Focusing Attention Network for Answer Ranking","authors":"Yufei Xie, Shuchun Liu, Tangren Yao, Yao Peng, Zhao Lu","doi":"10.1145/3308558.3313518","DOIUrl":"https://doi.org/10.1145/3308558.3313518","url":null,"abstract":"Answer ranking is an important task in Community Question Answering (CQA), by which “Good” answers should be ranked in the front of “Bad” or “Potentially Useful” answers. The state of the art is the attention-based classification framework that learns the mapping between the questions and the answers. However, we observe that existing attention-based methods perform poorly on complicated question-answer pairs. One major reason is that existing methods cannot get accurate alignments between questions and answers for such pairs. We call the phenomenon “attention divergence”. In this paper, we propose a new attention mechanism, called Focusing Attention Network(FAN), which can automatically draw back the divergent attention by adding the semantic, and metadata features. Our Model can focus on the most important part of the sentence and therefore improve the answer ranking performance. Experimental results on the CQA dataset of SemEval-2016 and SemEval-2017 demonstrate that our method respectively attains 79.38 and 88.72 on MAP and outperforms the Top-1 system in the shared task by 0.19 and 0.29.","PeriodicalId":23013,"journal":{"name":"The World Wide Web Conference","volume":"70 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86275771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Online advertising is one of the primary funding sources for various content, services, and applications on both web and mobile platforms. Mobile in-app advertising reuses many existing web technologies under the same ad-serving model (i.e., users - publishers - ad networks - advertisers). Nevertheless, mobile in-app advertising differs from traditional web advertising in many aspects. For example, malicious app developers can generate fraudulent ad clicks in an automated fashion, whereas malicious web publishers have to launch click fraud with bots. In spite of using the same underlying web infrastructure, advertising threats behave differently on the two platforms. Existing works have studied click fraud and malvertising separately in the mobile setting. However, it is unknown whether there exists a relationship between these two dominant threats. In this paper, we present an ad collection framework – MAdLife – on Android that captures all the in-app ad traffic generated during an ad's entire lifespan. MAdLife allows us to revisit both threats in a fine-grained manner and study the relationship between them. It further enables the exploration of other threats related to ad landing pages. We analyzed 5.7K Android apps crawled from the Google Play Store, and collected 83K ads and their landing pages using MAdLife. Similar to traditional web ads, 58K ads landed on web pages. We discovered 37 click-fraud apps, and found that 1.49% of the 58K ads were malicious. We also revealed a strong correlation between fraudulent apps and malicious ads. Specifically, 15.44% of malicious ads originated from the fraudulent apps. Conversely, 18.36% of the ads served in the fraudulent apps were malicious, while only 1.28% were malicious in the remaining apps. This suggests that users of fraudulent apps are much more likely (14x) to encounter malicious ads. Additionally, we discovered that 243 popular JavaScript snippets embedded in over 10% of the landing pages were malicious. Finally, we conducted the first analysis of inappropriate mobile in-app ads.
{"title":"Revisiting Mobile Advertising Threats with MAdLife","authors":"Gong Chen, W. Meng, J. Copeland","doi":"10.1145/3308558.3313549","DOIUrl":"https://doi.org/10.1145/3308558.3313549","url":null,"abstract":"Online advertising is one of the primary funding sources for various of content, services, and applications on both web and mobile platforms. Mobile in-app advertising reuses many existing web technologies under the same ad-serving model (i.e., users - publishers - ad networks - advertisers). Nevertheless, mobile in-app advertising is different from the traditional web advertising in many aspects. For example, malicious app developers can generate fraudulent ad clicks in an automated fashion, but malicious web publishers have to launch click fraud with bots. In spite of using the same underlying web infrastructure, advertising threats behave differently on the two platforms. Existing works have studied separately click fraud and malvertising in the mobile setting. However, it is unknown if there exists a relationship between these two dominant threats. In this paper, we present an ad collection framework – MAdLife – on Android to capture all the in-app ad traffic generated during an ad's entire lifespan. MAdLife allows us to revisit both threats in a fine-grained manner and study the relationship between them. It further enables the exploration of other threats related to ad landing pages. We analyzed 5.7K Android apps crawled from the Google Play Store, and collected 83K ads and their landing pages using MAdLife. Similar to traditional web ads, 58K ads landed on web pages. We discovered 37 click-fraud apps, and found that 1.49% of the 58K ads were malicious. We also revealed a strong correlation between fraudulent apps and malicious ads. Specifically, 15.44% of malicious ads originated from the fraudulent apps. Conversely, 18.36% of the ads served in the fraudulent apps were malicious, while only 1.28% were malicious in the rest apps. This suggests that users of fraudulent apps are much more (14x) likely to encounter malicious ads. Additionally, we discovered that 243 popular JavaScript snippets embedded by over 10% of the landing pages were malicious. Finally, we conducted the first analysis on inappropriate mobile in-app ads.","PeriodicalId":23013,"journal":{"name":"The World Wide Web Conference","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88786332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fenglong Ma, Yaliang Li, Chenwei Zhang, Jing Gao, Nan Du, Wei Fan
Relation classification is a basic yet important task in natural language processing. Existing relation classification approaches mainly rely on distant supervision, which assumes that a bag of sentences mentioning a pair of entities and extracted from a given corpus should express the same relation type for this entity pair. The training of these models needs a large amount of high-quality bag-level data. However, in some specific domains, such as the medical domain, it is difficult to obtain sufficient and high-quality sentences in a text corpus that mention two entities with a certain medical relation between them. In such a case, it is hard for existing discriminative models to capture the representative features (i.e., common patterns) from diversely expressed entity pairs with a given relation. Thus, the classification performance cannot be guaranteed when limited features are obtained from the corpus. To address this challenge, in this paper, we propose to employ a generative model, called the conditional variational autoencoder (CVAE), to handle the pattern sparsity. We let each relation have an individually learned latent distribution over all possible sentences expressing this relation. As these distributions are learned for the purpose of input reconstruction, the model's classification ability may not be strong enough and should be improved. By distinguishing the differences among different relation distributions, a margin-based regularizer is designed, which leads to a margin-based CVAE (MCVAE) that can significantly enhance the classification ability. Besides, MCVAE can automatically generate semantically meaningful patterns that describe the given relations. Experiments on two real-world datasets validate the effectiveness of the proposed MCVAE on the tasks of relation classification and relation-specific pattern generation.
{"title":"MCVAE: Margin-based Conditional Variational Autoencoder for Relation Classification and Pattern Generation","authors":"Fenglong Ma, Yaliang Li, Chenwei Zhang, Jing Gao, Nan Du, Wei Fan","doi":"10.1145/3308558.3313436","DOIUrl":"https://doi.org/10.1145/3308558.3313436","url":null,"abstract":"Relation classification is a basic yet important task in natural language processing. Existing relation classification approaches mainly rely on distant supervision, which assumes that a bag of sentences mentioning a pair of entities and extracted from a given corpus should express the same relation type of this entity pair. The training of these models needs a lot of high-quality bag-level data. However, in some specific domains, such as medical domain, it is difficult to obtain sufficient and high-quality sentences in a text corpus that mention two entities with a certain medical relation between them. In such a case, it is hard for existing discriminative models to capture the representative features (i.e., common patterns) from diversely expressed entity pairs with a given relation. Thus, the classification performance cannot be guaranteed when limited features are obtained from the corpus. To address this challenge, in this paper, we propose to employ a generative model, called conditional variational autoencoder (CVAE), to handle the pattern sparsity. We define that each relation has an individually learned latent distribution from all possible sentences expressing this relation. As these distributions are learned based on the purpose of input reconstruction, the model's classification ability may not be strong enough and should be improved. By distinguishing the differences among different relation distributions, a margin-based regularizer is designed, which leads to a margin-based CVAE (MCVAE) that can significantly enhance the classification ability. Besides, MCVAE can automatically generate semantically meaningful patterns that describe the given relations. Experiments on two real-world datasets validate the effectiveness of the proposed MCVAE on the tasks of relation classification and relation-specific pattern generation.","PeriodicalId":23013,"journal":{"name":"The World Wide Web Conference","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88808786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}