Travel time estimation for a given route under real-time traffic conditions is extremely useful for many applications, such as route planning. We argue that it is even more useful to estimate the travel time distribution, from which we can derive both the expected travel time and its uncertainty. In this paper, we develop a deep generative model, DeepGTT, to learn the travel time distribution for any route by conditioning on real-time traffic. DeepGTT interprets the generation of travel time using a three-layer hierarchical probabilistic model. In the first layer, we present two techniques, amortization and spatial smoothness embeddings, to share statistical strength among different road segments; a convolutional neural network-based representation learning component is also proposed to capture the dynamically changing real-time traffic condition. In the middle layer, a nonlinear factorization model is developed to generate an auxiliary random variable, namely speed. The introduction of this middle layer separates the static spatial features from the dynamically changing real-time traffic conditions, allowing us to incorporate heterogeneous influencing factors into a single model. In the last layer, an attention-based function is proposed to collectively generate the observed travel time. DeepGTT describes the generation process in a reasonable manner, and thus it not only produces more accurate results but is also more efficient. On a real-world large-scale dataset, we show that DeepGTT produces substantially better results than state-of-the-art alternatives on two tasks: travel time estimation and route recovery from sparse trajectory data.
{"title":"Learning Travel Time Distributions with Deep Generative Model","authors":"Xiucheng Li, G. Cong, Aixin Sun, Yun Cheng","doi":"10.1145/3308558.3313418","DOIUrl":"https://doi.org/10.1145/3308558.3313418","url":null,"abstract":"Travel time estimation of a given route with respect to real-time traffic condition is extremely useful for many applications like route planning. We argue that it is even more useful to estimate the travel time distribution, from which we can derive the expected travel time as well as the uncertainty. In this paper, we develop a deep generative model - DeepGTT - to learn the travel time distribution for any route by conditioning on the real-time traffic. DeepGTT interprets the generation of travel time using a three-layer hierarchical probabilistic model. In the first layer, we present two techniques, amortization and spatial smoothness embeddings, to share statistical strength among different road segments; a convolutional neural net based representation learning component is also proposed to capture the dynamically changing real-time traffic condition. In the middle layer, a nonlinear factorization model is developed to generate auxiliary random variable i.e., speed. The introduction of this middle layer separates the statical spatial features from the dynamically changing real-time traffic conditions, allowing us to incorporate the heterogeneous influencing factors into a single model. In the last layer, an attention mechanism based function is proposed to collectively generate the observed travel time. DeepGTT describes the generation process in a reasonable manner, and thus it not only produces more accurate results but also is more efficient. On a real-world large-scale data set, we show that DeepGTT produces substantially better results than state-of-the-art alternatives in two tasks: travel time estimation and route recovery from sparse trajectory data.","PeriodicalId":23013,"journal":{"name":"The World Wide Web Conference","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90353418","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
One of the important yet insufficiently studied subjects in fair allocation is the effect of externalities among agents. In a resource allocation problem, externalities imply that the share allocated to one agent may affect the utilities of other agents. In this paper, we study the fair allocation of indivisible goods when externalities are not negligible. Inspired by models of network diffusion, we present a simple and natural model, namely network externalities, to capture these externalities. To evaluate fairness in the network externalities model, we generalize the idea behind the notion of maximin-share (MMS) to obtain a new criterion, namely extended-maximin-share (EMMS). Next, we consider two problems concerning our model. First, we discuss the computational aspects of finding the value of EMMS for every agent. To this end, we introduce a generalized form of the partitioning problem that includes many well-known partitioning problems such as maximin, minimax, and leximin, and we show that a 1/2-approximation algorithm exists for this partitioning problem. Next, we investigate finding approximately optimal allocations, i.e., allocations that guarantee each agent a utility of at least an α fraction of their extended-maximin-share. We show that, under the natural assumption that the agents are α-self-reliant, an α/2-EMMS allocation always exists. Combining this with the former result yields a polynomial-time α/4-EMMS allocation algorithm.
{"title":"Externalities and Fairness","authors":"Masoud Seddighin, Hamed Saleh, M. Ghodsi","doi":"10.1145/3308558.3313670","DOIUrl":"https://doi.org/10.1145/3308558.3313670","url":null,"abstract":"One of the important yet insufficiently studied subjects in fair allocation is the externality effect among agents. For a resource allocation problem, externalities imply that the share allocated to an agent may affect the utilities of other agents. In this paper, we conduct a study of fair allocation of indivisible goods when the externalities are not negligible. Inspired by the models in the context of network diffusion, we present a simple and natural model, namely network externalities, to capture the externalities. To evaluate fairness in the network externalities model, we generalize the idea behind the notion of maximin-share () to achieve a new criterion, namely, extended-maximin-share (). Next, we consider two problems concerning our model. First, we discuss the computational aspects of finding the value of for every agent. For this, we introduce a generalized form of partitioning problem that includes many famous partitioning problems such as maximin, minimax, and leximin. We further show that a 1/2-approximation algorithm exists for this partitioning problem. Next, we investigate on finding approximately optimal allocations, i.e., allocations that guarantee each agent a utility of at least a fraction of his extended-maximin-share. We show that under a natural assumption that the agents are a-self-reliant, an a/2- allocation always exists. The combination of this with the former result yields a polynomial-time a/4- allocation algorithm.","PeriodicalId":23013,"journal":{"name":"The World Wide Web Conference","volume":"53 72 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90374289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xuan-Son Vu, Addi Ait-Mlouk, E. Elmroth, Lili Jiang
Given the increasing amount of heterogeneous data stored in relational databases, file systems, and cloud environments, such data needs to be easily accessed and semantically connected for further data analytics. The potential of data federation is largely untapped; this paper presents an interactive data federation system (https://vimeo.com/319473546) that applies large-scale techniques, including heterogeneous data federation, natural language processing, association rules, and semantic web technologies, to perform data retrieval and analytics on social network data. The system first creates a Virtual Database (VDB) to virtually integrate data from multiple data sources. Next, an RDF generator is built to unify the data, and SPARQL queries support semantic search over the text data processed by natural language processing (NLP). Association rule analysis is used to discover patterns and recognize the most important co-occurrences of variables across multiple data sources. We demonstrate how the system facilitates interactive data analytics in different application scenarios (e.g., sentiment analysis, privacy-concern analysis, community detection).
{"title":"Graph-based Interactive Data Federation System for Heterogeneous Data Retrieval and Analytics","authors":"Xuan-Son Vu, Addi Ait-Mlouk, E. Elmroth, Lili Jiang","doi":"10.1145/3308558.3314138","DOIUrl":"https://doi.org/10.1145/3308558.3314138","url":null,"abstract":"Given the increasing number of heterogeneous data stored in relational databases, file systems or cloud environment, it needs to be easily accessed and semantically connected for further data analytic. The potential of data federation is largely untapped, this paper presents an interactive data federation system (https://vimeo.com/319473546) by applying large-scale techniques including heterogeneous data federation, natural language processing, association rules and semantic web to perform data retrieval and analytics on social network data. The system first creates a Virtual Database (VDB) to virtually integrate data from multiple data sources. Next, a RDF generator is built to unify data, together with SPARQL queries, to support semantic data search over the processed text data by natural language processing (NLP). Association rule analysis is used to discover the patterns and recognize the most important co-occurrences of variables from multiple data sources. The system demonstrates how it facilitates interactive data analytic towards different application scenarios (e.g., sentiment analysis, privacy-concern analysis, community detection).","PeriodicalId":23013,"journal":{"name":"The World Wide Web Conference","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90579829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
David Carmel, Yaroslav Fyodorov, Saar Kuzi, Avihai Mejer, Fiana Raiber, Elad Rainshmidt
Enriching the content of news articles with auxiliary resources is a technique often employed by online news services to keep articles up-to-date and thereby increase users' engagement. We address the task of enriching news articles with related search queries, which are extracted from a search engine's query log. Clicking on a recommended query invokes a search session that allows the user to further explore content related to the article. We present a three-phase retrieval framework for query recommendation that incorporates various article-dependent and article-independent relevance signals. Evaluation based on an offline experiment, performed using annotations by professional editors, and a large-scale online experiment, conducted with real users, demonstrates the merits of our approach. In addition, a comprehensive analysis of our online experiment reveals interesting characteristics of the types of queries users tend to click on, and the nature of their interaction with the resulting search engine results page.
{"title":"Enriching News Articles with Related Search Queries","authors":"David Carmel, Yaroslav Fyodorov, Saar Kuzi, Avihai Mejer, Fiana Raiber, Elad Rainshmidt","doi":"10.1145/3308558.3313588","DOIUrl":"https://doi.org/10.1145/3308558.3313588","url":null,"abstract":"Enriching the content of news articles with auxiliary resources is a technique often employed by online news services to keep articles up-to-date and thereby increase users' engagement. We address the task of enriching news articles with related search queries, which are extracted from a search engine's query log. Clicking on a recommended query invokes a search session that allows the user to further explore content related to the article. We present a three-phase retrieval framework for query recommendation that incorporates various article-dependent and article-independent relevance signals. Evaluation based on an offline experiment, performed using annotations by professional editors, and a large-scale online experiment, conducted with real users, demonstrates the merits of our approach. In addition, a comprehensive analysis of our online experiment reveals interesting characteristics of the type of queries users tend to click and the nature of their interaction with the resultant search engine results page.","PeriodicalId":23013,"journal":{"name":"The World Wide Web Conference","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90630693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Music is known to exhibit different characteristics, depending on genre and style. While most research that studies such differences takes a musicological perspective and analyzes acoustic properties of individual pieces or artists, we conduct a large-scale analysis using various web resources. Exploiting content information from song lyrics, contextual information reflected in music artists' Wikipedia articles, and listening information, we particularly study the aspects of popularity, length, repetitiveness, and readability of lyrics and Wikipedia articles. We measure popularity in terms of song play count (PC) and listener count (LC), length in terms of character and word count, repetitiveness in terms of text compression ratio, and readability in terms of the Simple Measure of Gobbledygook (SMOG). Extending datasets of music listening histories and genre annotations from Last.fm, we extract and analyze 424,476 song lyrics by 18,724 artists from LyricWiki. We set out to answer whether there exist significant genre differences in song lyrics (RQ1) and artist Wikipedia articles (RQ2) in terms of repetitiveness and readability. We also assess whether we can find evidence to support the cliché that lyrics of very popular artists are particularly simple and repetitive (RQ3). We further investigate whether the characteristics of popularity, length, repetitiveness, and readability correlate within and between lyrics and Wikipedia articles (RQ4). We identify substantial differences in repetitiveness and readability of lyrics between music genres. In contrast, no significant differences between genres are found for artists' Wikipedia pages. Also, we find that lyrics of highly popular artists are repetitive but not necessarily simple in terms of readability. Furthermore, we uncover weak correlations between length of lyrics and of Wikipedia pages of the same artist, weak correlations between lyrics' reading difficulty and their length, and moderate correlations between artists' popularity and length of their lyrics.
{"title":"Genre Differences of Song Lyrics and Artist Wikis: An Analysis of Popularity, Length, Repetitiveness, and Readability","authors":"M. Schedl","doi":"10.1145/3308558.3313604","DOIUrl":"https://doi.org/10.1145/3308558.3313604","url":null,"abstract":"Music is known to exhibit different characteristics, depending on genre and style. While most research that studies such differences takes a musicological perspective and analyzes acoustic properties of individual pieces or artists, we conduct a large-scale analysis using various web resources. Exploiting content information from song lyrics, contextual information reflected in music artists' Wikipedia articles, and listening information, we particularly study the aspects of popularity, length, repetitiveness, and readability of lyrics and Wikipedia articles. We measure popularity in terms of song play count (PC) and listener count (LC), length in terms of character and word count, repetitiveness in terms of text compression ratio, and readability in terms of the Simple Measure of Gobbledygook (SMOG). Extending datasets of music listening histories and genre annotations from Last.fm, we extract and analyze 424,476 song lyrics by 18,724 artists from LyricWiki. We set out to answer whether there exist significant genre differences in song lyrics (RQ1) and artist Wikipedia articles (RQ2) in terms of repetitiveness and readability. We also assess whether we can find evidence to support the cliche´ that lyrics of very popular artists are particularly simple and repetitive (RQ3). We further investigate whether the characteristics of popularity, length, repetitiveness, and readability correlate within and between lyrics and Wikipedia articles (RQ4). We identify substantial differences in repetitiveness and readability of lyrics between music genres. In contrast, no significant differences between genres are found for artists' Wikipedia pages. Also, we find that lyrics of highly popular artists are repetitive but not necessarily simple in terms of readability. Furthermore, we uncover weak correlations between length of lyrics and of Wikipedia pages of the same artist, weak correlations between lyrics' reading difficulty and their length, and moderate correlations between artists' popularity and length of their lyrics.","PeriodicalId":23013,"journal":{"name":"The World Wide Web Conference","volume":"29 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85283158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jinsong Guo, Valter Crescenzi, Tim Furche, G. Grasso, G. Gottlob
Data-driven websites are mostly accessed through search interfaces. Such sites follow a common publishing pattern that, surprisingly, has not been fully exploited for unsupervised data extraction yet: the result of a search is presented as a paginated list of result records. Each result record contains the main attributes of a single object and links to a page dedicated to the details of that object. We present RED, an automatic approach and a prototype system for extracting data records from sites following this publishing pattern. RED leverages the inherent redundancy between result records and the corresponding detail pages to design an effective, yet fully unsupervised and domain-independent, method. It is able to extract from result pages all the attributes of the objects that appear both in the result records and in the corresponding detail pages. With respect to previous unsupervised methods, our method does not require any a priori domain-dependent knowledge (e.g., an ontology) and achieves significantly higher accuracy while automatically selecting only object attributes, a task that is out of the scope of traditional fully unsupervised approaches. With respect to previous supervised or semi-supervised methods, RED reaches similar accuracy in many domains (e.g., job postings) without requiring supervision for each domain, let alone each website.
{"title":"RED: Redundancy-Driven Data Extraction from Result Pages?","authors":"Jinsong Guo, Valter Crescenzi, Tim Furche, G. Grasso, G. Gottlob","doi":"10.1145/3308558.3313529","DOIUrl":"https://doi.org/10.1145/3308558.3313529","url":null,"abstract":"Data-driven websites are mostly accessed through search interfaces. Such sites follow a common publishing pattern that, surprisingly, has not been fully exploited for unsupervised data extraction yet: the result of a search is presented as a paginated list of result records. Each result record contains the main attributes about one single object, and links to a page dedicated to the details of that object. We present red, an automatic approach and a prototype system for extracting data records from sites following this publishing pattern. red leverages the inherent redundancy between result records and corresponding detail pages to design an effective, yet fully-unsupervised and domain-independent method. It is able to extract from result pages all the attributes of the objects that appear both in the result records and in the corresponding detail pages. With respect to previous unsupervised methods, our method does not require any a priori domain-dependent knowledge (e.g, an ontology), can achieve a significantly higher accuracy while automatically selecting only object attributes, a task which is out of the scope of traditional fully unsupervised approaches. With respect to previous supervised or semi-supervised methods, red can reach similar accuracy in many domains (e.g., job postings) without requiring supervision for each domain, let alone each website.","PeriodicalId":23013,"journal":{"name":"The World Wide Web Conference","volume":"88 4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84053858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Satadal Sengupta, Niloy Ganguly, Pradipta De, Sandip Chakraborty
Network traffic classification is an important tool for network administrators, enabling monitoring and service provisioning. Traditional traffic classification techniques do not work well for mobile app traffic due to the lack of unique signatures. Encryption renders this task even more difficult, since packet content is no longer available to parse. More recent techniques based on statistical analysis of parameters such as packet size and packet arrival time have shown promise; such techniques have been shown to classify traffic from a small number of applications with a high degree of accuracy. However, we show that when applied to a large number of applications, their performance falls short of satisfactory. In this paper, we propose a novel set of bit-sequence-based features that exploit differences in the randomness of data generated by different applications. These differences, which originate from dissimilarities in the applications' encryption implementations, leave footprints on the data they generate. We validate that these features can differentiate data encrypted with various ciphers (89% accuracy) and key sizes (83% accuracy). Our evaluation shows that such features can not only differentiate traffic originating from different categories of mobile apps (90% accuracy), but can also classify 175 individual applications with 95% accuracy.
{"title":"Exploiting Diversity in Android TLS Implementations for Mobile App Traffic Classification","authors":"Satadal Sengupta, Niloy Ganguly, Pradipta De, Sandip Chakraborty","doi":"10.1145/3308558.3313738","DOIUrl":"https://doi.org/10.1145/3308558.3313738","url":null,"abstract":"Network traffic classification is an important tool for network administrators in enabling monitoring and service provisioning. Traditional techniques employed in classifying traffic do not work well for mobile app traffic due to lack of unique signatures. Encryption renders this task even more difficult since packet content is no longer available to parse. More recent techniques based on statistical analysis of parameters such as packet-size and arrival time of packets have shown promise; such techniques have been shown to classify traffic from a small number of applications with a high degree of accuracy. However, we show that when employed to a large number of applications, the performance falls short of satisfactory. In this paper, we propose a novel set of bit-sequence based features which exploit differences in randomness of data generated by different applications. These differences originating due to dissimilarities in encryption implementations by different applications leave footprints on the data generated by them. We validate that these features can differentiate data encrypted with various ciphers (89% accuracy) and key-sizes (83% accuracy). Our evaluation shows that such features can not only differentiate traffic originating from different categories of mobile apps (90% accuracy), but can also classify 175 individual applications with 95% accuracy.","PeriodicalId":23013,"journal":{"name":"The World Wide Web Conference","volume":"65 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87964431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Amin Kharraz, Zane Ma, Paul Murley, Chaz Lever, Joshua Mason, Andrew K. Miller, N. Borisov, M. Antonakakis, Michael Bailey
In-browser cryptojacking is a form of resource abuse that leverages end-users' machines to mine cryptocurrency without obtaining the users' consent. In this paper, we design, implement, and evaluate Outguard, an automated cryptojacking detection system. We construct a large ground-truth dataset, extract several features using an instrumented web browser, and ultimately select seven distinctive features that are used to build an SVM classification model. Outguard achieves a 97.9% TPR and 1.1% FPR and is reasonably tolerant to adversarial evasions. We utilized Outguard in the wild by deploying it across the Alexa Top 1M websites and found 6,302 cryptojacking sites, of which 3,600 are new detections that were absent from the training data. These cryptojacking sites paint a broad picture of the cryptojacking ecosystem, with particular emphasis on the prevalence of cryptojacking websites and the shared infrastructure that provides clues to the operators behind the cryptojacking phenomenon.
{"title":"Outguard: Detecting In-Browser Covert Cryptocurrency Mining in the Wild","authors":"Amin Kharraz, Zane Ma, Paul Murley, Chaz Lever, Joshua Mason, Andrew K. Miller, N. Borisov, M. Antonakakis, Michael Bailey","doi":"10.1145/3308558.3313665","DOIUrl":"https://doi.org/10.1145/3308558.3313665","url":null,"abstract":"In-browser cryptojacking is a form of resource abuse that leverages end-users' machines to mine cryptocurrency without obtaining the users' consent. In this paper, we design, implement, and evaluate Outguard, an automated cryptojacking detection system. We construct a large ground-truth dataset, extract several features using an instrumented web browser, and ultimately select seven distinctive features that are used to build an SVM classification model. Outguardachieves a 97.9% TPR and 1.1% FPR and is reasonably tolerant to adversarial evasions. We utilized Outguardin the wild by deploying it across the Alexa Top 1M websites and found 6,302 cryptojacking sites, of which 3,600 are new detections that were absent from the training data. These cryptojacking sites paint a broad picture of the cryptojacking ecosystem, with particular emphasis on the prevalence of cryptojacking websites and the shared infrastructure that provides clues to the operators behind the cryptojacking phenomenon.","PeriodicalId":23013,"journal":{"name":"The World Wide Web Conference","volume":"56 4","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91443773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Antonio Pastor, Matti Antero Parssinen, Patricia Callejo, Pelayo Vallina, R. C. Rumín, Ángel Cuevas, M. Kotila, A. Azcorra
Invalid ad traffic is an inherent problem of programmatic advertising that has not yet been properly addressed. Traditionally, it has been considered that invalid ad traffic harms only the interests of advertisers, who pay for the cost of invalid ad impressions, while other industry stakeholders earn revenue through commissions regardless of the quality of the impressions. Our first contribution is evidence showing that Demand Side Platforms (DSPs), among the most important intermediaries in the programmatic advertising supply chain, may also suffer economic losses due to invalid ad traffic. Addressing the problem of invalid traffic at DSPs requires a highly scalable solution that can identify invalid traffic in real time, at the level of individual bid requests. Our second and main contribution is the design and implementation of such a solution: a system that can be seamlessly integrated into the current programmatic ecosystem by DSPs. Our system has been released under an open-source license, becoming the first auditable solution for invalid ad traffic detection. The intrinsic transparency of our solution, along with the good results obtained in industrial trials, has led the World Federation of Advertisers to endorse it.
{"title":"Nameles: An intelligent system for Real-Time Filtering of Invalid Ad Traffic","authors":"Antonio Pastor, Matti Antero Parssinen, Patricia Callejo, Pelayo Vallina, R. C. Rumín, Ángel Cuevas, M. Kotila, A. Azcorra","doi":"10.1145/3308558.3313601","DOIUrl":"https://doi.org/10.1145/3308558.3313601","url":null,"abstract":"Invalid ad traffic is an inherent problem of programmatic advertising that has not been properly addressed so far. Traditionally, it has been considered that invalid ad traffic only harms the interests of advertisers, which pay for the cost of invalid ad impressions while other industry stakeholders earn revenue through commissions regardless of the quality of the impression. Our first contribution consists of providing evidence that shows how the Demand Side Platforms (DSPs), one of the most important intermediaries in the programmatic advertising supply chain, may be suffering from economic losses due to invalid ad traffic. Addressing the problem of invalid traffic at DSPs requires a highly scalable solution that can identify invalid traffic in real time at the individual bid request level. The second and main contribution is the design and implementation of a solution for the invalid traffic problem, a system that can be seamlessly integrated into the current programmatic ecosystem by the DSPs. Our system has been released under an open source license, becoming the first auditable solution for invalid ad traffic detection. The intrinsic transparency of our solution along with the good results obtained in industrial trials have led the World Federation of Advertisers to endorse it.","PeriodicalId":23013,"journal":{"name":"The World Wide Web Conference","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90681472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cen Chen, Minghui Qiu, Yinfei Yang, Jun Zhou, Jun Huang, Xiaolong Li, F. S. Bao
Consumers today face too many reviews to read when shopping online. Presenting them with the most helpful reviews, instead of all reviews, greatly eases purchase decision making. Most existing studies on review helpfulness prediction focus on domains with rich labels and are not suitable for domains with insufficient labels. In response, we explore a multi-domain approach that learns domain relationships to help the task by transferring knowledge from data-rich domains to data-deficient domains. To better model domain differences, our approach gates multi-granularity embeddings in a neural network (NN)-based transfer learning framework to reflect the domain-variant importance of words. Extensive experiments demonstrate that our model outperforms state-of-the-art baselines as well as NN-based methods without gating on this task. Our approach facilitates more effective knowledge transfer between domains, especially when the target domain dataset is small. Meanwhile, the learned domain relationships and domain-specific embedding gates are insightful and interpretable.
{"title":"Multi-Domain Gated CNN for Review Helpfulness Prediction","authors":"Cen Chen, Minghui Qiu, Yinfei Yang, Jun Zhou, Jun Huang, Xiaolong Li, F. S. Bao","doi":"10.1145/3308558.3313587","DOIUrl":"https://doi.org/10.1145/3308558.3313587","url":null,"abstract":"Consumers today face too many reviews to read when shopping online. Presenting the most helpful reviews, instead of all, to them will greatly ease purchase decision making. Most of the existing studies on review helpfulness prediction focused on domains with rich labels, not suitable for domains with insufficient labels. In response, we explore a multi-domain approach that learns domain relationships to help the task by transferring knowledge from data-rich domains to data-deficient domains. To better model domain differences, our approach gates multi-granularity embeddings in a Neural Network (NN) based transfer learning framework to reflect the domain-variant importance of words. Extensive experiments empirically demonstrate that our model outperforms the state-of-the-art baselines and NN-based methods without gating on this task. Our approach facilitates more effective knowledge transfer between domains, especially when the target domain dataset is small. Meanwhile, the domain relationship and domain-specific embedding gating are insightful and interpretable.","PeriodicalId":23013,"journal":{"name":"The World Wide Web Conference","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82003505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}