The World Wide Web Conference: Latest Papers

Learning Travel Time Distributions with Deep Generative Model
Pub Date : 2019-05-13 DOI: 10.1145/3308558.3313418
Xiucheng Li, G. Cong, Aixin Sun, Yun Cheng
Travel time estimation of a given route with respect to real-time traffic conditions is extremely useful for many applications such as route planning. We argue that it is even more useful to estimate the travel time distribution, from which we can derive the expected travel time as well as the uncertainty. In this paper, we develop a deep generative model, DeepGTT, to learn the travel time distribution for any route by conditioning on the real-time traffic. DeepGTT interprets the generation of travel time using a three-layer hierarchical probabilistic model. In the first layer, we present two techniques, amortization and spatial smoothness embeddings, to share statistical strength among different road segments; a convolutional neural network based representation learning component is also proposed to capture the dynamically changing real-time traffic conditions. In the middle layer, a nonlinear factorization model is developed to generate an auxiliary random variable, i.e., speed. The introduction of this middle layer separates the static spatial features from the dynamically changing real-time traffic conditions, allowing us to incorporate the heterogeneous influencing factors into a single model. In the last layer, an attention mechanism based function is proposed to collectively generate the observed travel time. DeepGTT describes the generation process in a reasonable manner, and thus it not only produces more accurate results but also is more efficient. On a real-world large-scale data set, we show that DeepGTT produces substantially better results than state-of-the-art alternatives in two tasks: travel time estimation and route recovery from sparse trajectory data.
Citations: 54
Externalities and Fairness
Pub Date : 2019-05-13 DOI: 10.1145/3308558.3313670
Masoud Seddighin, Hamed Saleh, M. Ghodsi
One of the important yet insufficiently studied subjects in fair allocation is the externality effect among agents. For a resource allocation problem, externalities imply that the share allocated to an agent may affect the utilities of other agents. In this paper, we conduct a study of fair allocation of indivisible goods when the externalities are not negligible. Inspired by the models in the context of network diffusion, we present a simple and natural model, namely network externalities, to capture the externalities. To evaluate fairness in the network externalities model, we generalize the idea behind the notion of maximin-share to obtain a new criterion, namely, extended-maximin-share. Next, we consider two problems concerning our model. First, we discuss the computational aspects of finding the value of the extended-maximin-share for every agent. For this, we introduce a generalized form of partitioning problem that includes many famous partitioning problems such as maximin, minimax, and leximin. We further show that a 1/2-approximation algorithm exists for this partitioning problem. Next, we investigate finding approximately optimal allocations, i.e., allocations that guarantee each agent a utility of at least a fraction of his extended-maximin-share. We show that under a natural assumption that the agents are a-self-reliant, an allocation guaranteeing each agent an a/2 fraction of his extended-maximin-share always exists. The combination of this with the former result yields a polynomial-time algorithm that guarantees each agent an a/4 fraction.
Citations: 9
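As a rough illustration of the classical maximin-share notion that the "Externalities and Fairness" abstract generalizes, the sketch below computes one agent's maximin share by brute force. It assumes additive, non-negative item values and ignores externalities entirely; the function name `maximin_share` and the sample values are hypothetical and are not taken from the paper.

```python
from itertools import product


def maximin_share(values, n_agents):
    """Brute-force maximin share of one agent with additive, non-negative values.

    The agent splits the items into n_agents bundles and receives the least
    valuable one; the maximin share is the best value it can guarantee.
    """
    best = 0
    # Enumerate every assignment of items to bundles (n_agents ** len(values) cases).
    for assignment in product(range(n_agents), repeat=len(values)):
        bundles = [0] * n_agents
        for value, bundle in zip(values, assignment):
            bundles[bundle] += value
        best = max(best, min(bundles))
    return best


if __name__ == "__main__":
    item_values = [7, 5, 4, 3, 1]  # hypothetical valuations of five indivisible goods
    print(maximin_share(item_values, 2))
    print(maximin_share(item_values, 3))
```

The enumeration is exponential in the number of items, so it is only practical for tiny instances; that cost is consistent with the abstract's interest in approximation algorithms for the underlying partitioning problem.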
Graph-based Interactive Data Federation System for Heterogeneous Data Retrieval and Analytics
Pub Date : 2019-05-13 DOI: 10.1145/3308558.3314138
Xuan-Son Vu, Addi Ait-Mlouk, E. Elmroth, Lili Jiang
Given the increasing amount of heterogeneous data stored in relational databases, file systems, or cloud environments, the data needs to be easily accessed and semantically connected for further data analytics. The potential of data federation is largely untapped; this paper presents an interactive data federation system (https://vimeo.com/319473546) that applies large-scale techniques, including heterogeneous data federation, natural language processing, association rules, and the semantic web, to perform data retrieval and analytics on social network data. The system first creates a Virtual Database (VDB) to virtually integrate data from multiple data sources. Next, an RDF generator is built to unify data, together with SPARQL queries, to support semantic data search over the text data processed by natural language processing (NLP). Association rule analysis is used to discover patterns and recognize the most important co-occurrences of variables from multiple data sources. The system demonstrates how it facilitates interactive data analytics for different application scenarios (e.g., sentiment analysis, privacy-concern analysis, community detection).
Citations: 11
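The association rule analysis mentioned in the data federation abstract rests on two standard measures, support and confidence. The sketch below shows how they are computed; the record contents and the function names `support` and `confidence` are hypothetical, and nothing here reflects how the described system actually implements the step.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent); assumes the antecedent occurs at least once."""
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / support(transactions, antecedent)


if __name__ == "__main__":
    # Hypothetical per-post labels merged from several data sources.
    records = [
        {"privacy", "negative"},
        {"privacy", "negative"},
        {"privacy", "positive"},
        {"sports", "positive"},
    ]
    print(support(records, {"privacy", "negative"}))
    print(confidence(records, {"privacy"}, {"negative"}))
```

A rule such as "privacy implies negative" is typically kept only when both measures exceed chosen thresholds.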
Enriching News Articles with Related Search Queries
Pub Date : 2019-05-13 DOI: 10.1145/3308558.3313588
David Carmel, Yaroslav Fyodorov, Saar Kuzi, Avihai Mejer, Fiana Raiber, Elad Rainshmidt
Enriching the content of news articles with auxiliary resources is a technique often employed by online news services to keep articles up-to-date and thereby increase users' engagement. We address the task of enriching news articles with related search queries, which are extracted from a search engine's query log. Clicking on a recommended query invokes a search session that allows the user to further explore content related to the article. We present a three-phase retrieval framework for query recommendation that incorporates various article-dependent and article-independent relevance signals. Evaluation based on an offline experiment, performed using annotations by professional editors, and a large-scale online experiment, conducted with real users, demonstrates the merits of our approach. In addition, a comprehensive analysis of our online experiment reveals interesting characteristics of the type of queries users tend to click and the nature of their interaction with the resultant search engine results page.
Citations: 1
Genre Differences of Song Lyrics and Artist Wikis: An Analysis of Popularity, Length, Repetitiveness, and Readability
Pub Date : 2019-05-13 DOI: 10.1145/3308558.3313604
M. Schedl
Music is known to exhibit different characteristics, depending on genre and style. While most research that studies such differences takes a musicological perspective and analyzes acoustic properties of individual pieces or artists, we conduct a large-scale analysis using various web resources. Exploiting content information from song lyrics, contextual information reflected in music artists' Wikipedia articles, and listening information, we particularly study the aspects of popularity, length, repetitiveness, and readability of lyrics and Wikipedia articles. We measure popularity in terms of song play count (PC) and listener count (LC), length in terms of character and word count, repetitiveness in terms of text compression ratio, and readability in terms of the Simple Measure of Gobbledygook (SMOG). Extending datasets of music listening histories and genre annotations from Last.fm, we extract and analyze 424,476 song lyrics by 18,724 artists from LyricWiki. We set out to answer whether there exist significant genre differences in song lyrics (RQ1) and artist Wikipedia articles (RQ2) in terms of repetitiveness and readability. We also assess whether we can find evidence to support the cliché that lyrics of very popular artists are particularly simple and repetitive (RQ3). We further investigate whether the characteristics of popularity, length, repetitiveness, and readability correlate within and between lyrics and Wikipedia articles (RQ4). We identify substantial differences in repetitiveness and readability of lyrics between music genres. In contrast, no significant differences between genres are found for artists' Wikipedia pages. Also, we find that lyrics of highly popular artists are repetitive but not necessarily simple in terms of readability. Furthermore, we uncover weak correlations between length of lyrics and of Wikipedia pages of the same artist, weak correlations between lyrics' reading difficulty and their length, and moderate correlations between artists' popularity and length of their lyrics.
Citations: 5
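The two text measures named in the lyrics abstract, repetitiveness via text compression ratio and readability via SMOG, can be sketched as follows. This is a minimal sketch under assumptions the abstract does not settle: the ratio is taken as raw size divided by zlib-compressed size, the standard SMOG grade formula is used, and syllables are counted with a crude vowel-group heuristic. The function names are hypothetical.

```python
import math
import re
import zlib


def compression_ratio(text):
    """Raw size over compressed size; higher means more repetitive (assumed direction)."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw, 9))


def count_syllables(word):
    """Crude heuristic: one syllable per group of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def smog_grade(text):
    """Standard SMOG grade: 1.0430 * sqrt(polysyllables * 30 / sentences) + 3.1291."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    polysyllables = sum(count_syllables(w) >= 3 for w in words)
    return 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291


if __name__ == "__main__":
    chorus = "la la la " * 100
    prose = "Music is known to exhibit different characteristics, depending on genre and style."
    print(compression_ratio(chorus), compression_ratio(prose))
    print(smog_grade(prose))
```

Very short texts compress poorly because of fixed compressor overhead, so comparing ratios across lyrics of very different lengths would need some normalization; how the paper handles this is not stated in the abstract.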
RED: Redundancy-Driven Data Extraction from Result Pages?
Pub Date : 2019-05-13 DOI: 10.1145/3308558.3313529
Jinsong Guo, Valter Crescenzi, Tim Furche, G. Grasso, G. Gottlob
Data-driven websites are mostly accessed through search interfaces. Such sites follow a common publishing pattern that, surprisingly, has not been fully exploited for unsupervised data extraction yet: the result of a search is presented as a paginated list of result records. Each result record contains the main attributes of a single object and links to a page dedicated to the details of that object. We present RED, an automatic approach and a prototype system for extracting data records from sites following this publishing pattern. RED leverages the inherent redundancy between result records and corresponding detail pages to design an effective, yet fully unsupervised and domain-independent method. It is able to extract from result pages all the attributes of the objects that appear both in the result records and in the corresponding detail pages. Compared with previous unsupervised methods, our method does not require any a priori domain-dependent knowledge (e.g., an ontology) and can achieve significantly higher accuracy while automatically selecting only object attributes, a task that is out of the scope of traditional fully unsupervised approaches. Compared with previous supervised or semi-supervised methods, RED can reach similar accuracy in many domains (e.g., job postings) without requiring supervision for each domain, let alone each website.
Citations: 5
Exploiting Diversity in Android TLS Implementations for Mobile App Traffic Classification
Pub Date : 2019-05-13 DOI: 10.1145/3308558.3313738
Satadal Sengupta, Niloy Ganguly, Pradipta De, Sandip Chakraborty
Network traffic classification is an important tool for network administrators in enabling monitoring and service provisioning. Traditional techniques employed in classifying traffic do not work well for mobile app traffic due to the lack of unique signatures. Encryption renders this task even more difficult since packet content is no longer available to parse. More recent techniques based on statistical analysis of parameters such as packet size and packet arrival time have shown promise; such techniques have been shown to classify traffic from a small number of applications with a high degree of accuracy. However, we show that when applied to a large number of applications, their performance is unsatisfactory. In this paper, we propose a novel set of bit-sequence based features which exploit differences in the randomness of data generated by different applications. These differences, which originate from dissimilarities in the encryption implementations of different applications, leave footprints on the data the applications generate. We validate that these features can differentiate data encrypted with various ciphers (89% accuracy) and key sizes (83% accuracy). Our evaluation shows that such features can not only differentiate traffic originating from different categories of mobile apps (90% accuracy), but can also classify 175 individual applications with 95% accuracy.
Citations: 17
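The TLS traffic abstract does not spell out its bit-sequence features, so the sketch below only illustrates the general idea of measuring how random an encrypted payload looks. It uses two generic statistics, byte-level Shannon entropy and the fraction of one bits; both are stand-ins chosen for illustration, not the paper's feature set, and the function names are hypothetical.

```python
import math
from collections import Counter


def byte_entropy(payload: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte (0.0 to 8.0)."""
    total = len(payload)
    counts = Counter(payload)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def ones_fraction(payload: bytes) -> float:
    """Fraction of bits set to 1; close to 0.5 for random-looking data."""
    ones = sum(bin(b).count("1") for b in payload)
    return ones / (8 * len(payload))


if __name__ == "__main__":
    uniform = bytes(range(256))  # every byte value exactly once
    constant = b"\x00" * 64      # no randomness at all
    for name, data in (("uniform", uniform), ("constant", constant)):
        print(name, byte_entropy(data), ones_fraction(data))
```

In a classifier, statistics like these would be computed per packet or per flow and passed on as feature vectors; both functions assume a non-empty payload.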
Outguard: Detecting In-Browser Covert Cryptocurrency Mining in the Wild
Pub Date : 2019-05-13 DOI: 10.1145/3308558.3313665
Amin Kharraz, Zane Ma, Paul Murley, Chaz Lever, Joshua Mason, Andrew K. Miller, N. Borisov, M. Antonakakis, Michael Bailey
In-browser cryptojacking is a form of resource abuse that leverages end-users' machines to mine cryptocurrency without obtaining the users' consent. In this paper, we design, implement, and evaluate Outguard, an automated cryptojacking detection system. We construct a large ground-truth dataset, extract several features using an instrumented web browser, and ultimately select seven distinctive features that are used to build an SVM classification model. Outguard achieves a 97.9% TPR and 1.1% FPR and is reasonably tolerant to adversarial evasions. We utilized Outguard in the wild by deploying it across the Alexa Top 1M websites and found 6,302 cryptojacking sites, of which 3,600 are new detections that were absent from the training data. These cryptojacking sites paint a broad picture of the cryptojacking ecosystem, with particular emphasis on the prevalence of cryptojacking websites and the shared infrastructure that provides clues to the operators behind the cryptojacking phenomenon.
Citations: 63
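The Outguard abstract above reports its detection quality as a true positive rate (TPR) and a false positive rate (FPR). As a reminder of how those two rates are defined, here is a minimal sketch in plain Python. The function name `rates`, the label convention (1 = cryptojacking, 0 = benign), and the toy labels are all assumptions made for illustration; none of them come from the paper, and the toy numbers are unrelated to the reported 97.9% and 1.1%.

```python
def rates(y_true, y_pred):
    """Return (TPR, FPR) for binary labels.

    Assumed convention (hypothetical, for illustration only):
    1 = cryptojacking site, 0 = benign site.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed miners
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign passed
    # Guard against empty classes so the sketch never divides by zero.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Toy data, invented for this example.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
toy_tpr, toy_fpr = rates(toy_true, toy_pred)
```

On the toy labels, three of the four positive sites are flagged and one of the six benign sites is flagged, so the TPR is 3/4 and the FPR is 1/6.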
Nameles: An intelligent system for Real-Time Filtering of Invalid Ad Traffic
Pub Date : 2019-05-13 DOI: 10.1145/3308558.3313601
Antonio Pastor, Matti Antero Parssinen, Patricia Callejo, Pelayo Vallina, R. C. Rumín, Ángel Cuevas, M. Kotila, A. Azcorra
Invalid ad traffic is an inherent problem of programmatic advertising that has not been properly addressed so far. Traditionally, it has been considered that invalid ad traffic only harms the interests of advertisers, which pay for the cost of invalid ad impressions while other industry stakeholders earn revenue through commissions regardless of the quality of the impression. Our first contribution consists of providing evidence that shows how the Demand Side Platforms (DSPs), one of the most important intermediaries in the programmatic advertising supply chain, may be suffering from economic losses due to invalid ad traffic. Addressing the problem of invalid traffic at DSPs requires a highly scalable solution that can identify invalid traffic in real time at the individual bid request level. The second and main contribution is the design and implementation of a solution for the invalid traffic problem, a system that can be seamlessly integrated into the current programmatic ecosystem by the DSPs. Our system has been released under an open source license, becoming the first auditable solution for invalid ad traffic detection. The intrinsic transparency of our solution along with the good results obtained in industrial trials have led the World Federation of Advertisers to endorse it.
Citations: 5
Multi-Domain Gated CNN for Review Helpfulness Prediction
Pub Date : 2019-05-13 DOI: 10.1145/3308558.3313587
Cen Chen, Minghui Qiu, Yinfei Yang, Jun Zhou, Jun Huang, Xiaolong Li, F. S. Bao
Consumers today face too many reviews to read when shopping online. Presenting the most helpful reviews, instead of all, to them will greatly ease purchase decision making. Most of the existing studies on review helpfulness prediction focused on domains with rich labels, not suitable for domains with insufficient labels. In response, we explore a multi-domain approach that learns domain relationships to help the task by transferring knowledge from data-rich domains to data-deficient domains. To better model domain differences, our approach gates multi-granularity embeddings in a Neural Network (NN) based transfer learning framework to reflect the domain-variant importance of words. Extensive experiments empirically demonstrate that our model outperforms the state-of-the-art baselines and NN-based methods without gating on this task. Our approach facilitates more effective knowledge transfer between domains, especially when the target domain dataset is small. Meanwhile, the domain relationship and domain-specific embedding gating are insightful and interpretable.
Citations: 32
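The Multi-Domain Gated CNN abstract above says that embeddings are gated to reflect the domain-variant importance of words, but it does not give the form of the gate. The sketch below shows one generic way such a gate can work: a sigmoid gate, computed from the embedding itself with per-domain parameters, scales each embedding dimension. This is an assumed, textbook-style gate and not the authors' architecture; `gate_embedding`, `gate_weights`, and `gate_bias` are hypothetical names.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gate_embedding(embedding, gate_weights, gate_bias):
    """Scale each embedding dimension by a sigmoid gate.

    gate_weights: hypothetical per-domain square matrix (list of rows).
    gate_bias:    hypothetical per-domain bias vector.
    Generic gating sketch, not the method from the paper.
    """
    gated = []
    for row, bias, value in zip(gate_weights, gate_bias, embedding):
        # The gate for this dimension depends on the whole embedding.
        pre_activation = sum(w * e for w, e in zip(row, embedding)) + bias
        gated.append(sigmoid(pre_activation) * value)
    return gated


# Toy example: all-zero gate parameters give a gate of exactly 0.5.
toy_embedding = [2.0, -4.0]
zero_weights = [[0.0, 0.0], [0.0, 0.0]]
zero_bias = [0.0, 0.0]
toy_gated = gate_embedding(toy_embedding, zero_weights, zero_bias)
```

Because the gate parameters would be learned separately per domain, the same word embedding can be passed through almost unchanged in one domain and damped in another, which is one plausible reading of "domain-variant importance".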