What Links Alice and Bob?: Matching and Ranking Semantic Patterns in Heterogeneous Networks
Jiongqian Liang, Deepak Ajwani, Patrick K. Nicholson, A. Sala, S. Parthasarathy
DOI: 10.1145/2872427.2883007
An increasing number of applications are modeled and analyzed in network form, where nodes represent entities of interest and edges represent interactions or relationships between entities. Commonly, such relationship analysis tools assume homogeneity in both node type and edge type. Recent research has sought to redress the assumption of homogeneity and focused on mining heterogeneous information networks (HINs), where both nodes and edges can be of different types. Building on such efforts, in this work we articulate a novel approach for mining relationships across entities in such networks while accounting for user preference (prioritization) over relationship type and interestingness metric. We formalize the problem as a top-k lightest paths problem, contextualized in a real-world communication network, and seek to find the k most interesting path instances matching the preferred relationship type. Our solution, the PROphetic HEuristic Algorithm for Path Searching (PRO-HEAPS), leverages a combination of novel graph preprocessing techniques, well-designed heuristics, and the venerable A* search algorithm. We run our algorithm on real-world large-scale graphs and show that it significantly outperforms a wide variety of baseline approaches, with speedups as large as 100x. We also conduct a case study and demonstrate valuable applications of our algorithm.
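The abstract above searches for the k lightest (minimum total weight) paths that match a preferred relationship type. The sketch below is not PRO-HEAPS itself but a minimal Python illustration of that search pattern: best-first expansion of typed paths with a pluggable heuristic (an admissible lower bound makes it A*-like; the default zero heuristic reduces it to uniform-cost search). The toy graph, node types, and function names are invented for the example.

```python
import heapq

def k_lightest_typed_paths(graph, node_type, src, dst, type_pattern, k, h=lambda v, depth: 0.0):
    """Enumerate up to k minimum-weight simple paths from src to dst whose
    node types follow type_pattern, using best-first search with an optional
    admissible heuristic h (A*-style); h = 0 gives uniform-cost search.

    graph: dict node -> list of (neighbor, edge_weight)
    node_type: dict node -> type label
    type_pattern: sequence of type labels the path's nodes must match
    """
    if node_type[src] != type_pattern[0]:
        return []
    results = []
    # Priority queue entries: (cost + heuristic, cost, path)
    frontier = [(h(src, 0), 0.0, [src])]
    while frontier and len(results) < k:
        _, cost, path = heapq.heappop(frontier)
        v, depth = path[-1], len(path) - 1
        if depth == len(type_pattern) - 1:
            if v == dst:
                results.append((cost, path))
            continue
        for u, w in graph.get(v, []):
            if u not in path and node_type[u] == type_pattern[depth + 1]:
                new_cost = cost + w
                heapq.heappush(frontier, (new_cost + h(u, depth + 1), new_cost, path + [u]))
    return results

# Toy heterogeneous graph: persons (P) connected through emails (E).
graph = {
    "alice": [("email1", 1.0), ("email2", 3.0)],
    "email1": [("bob", 1.0)],
    "email2": [("bob", 0.5)],
}
node_type = {"alice": "P", "email1": "E", "email2": "E", "bob": "P"}
print(k_lightest_typed_paths(graph, node_type, "alice", "bob", ["P", "E", "P"], k=2))
```

With non-negative weights and a zero heuristic, paths are popped in order of total weight, so the first k matching paths reaching the destination are the k lightest.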
{"title":"What Links Alice and Bob?: Matching and Ranking Semantic Patterns in Heterogeneous Networks","authors":"Jiongqian Liang, Deepak Ajwani, Patrick K. Nicholson, A. Sala, S. Parthasarathy","doi":"10.1145/2872427.2883007","DOIUrl":"https://doi.org/10.1145/2872427.2883007","url":null,"abstract":"An increasing number of applications are modeled and analyzed in network form, where nodes represent entities of interest and edges represent interactions or relationships between entities. Commonly, such relationship analysis tools assume homogeneity in both node type and edge type. Recent research has sought to redress the assumption of homogeneity and focused on mining heterogeneous information networks (HINs) where both nodes and edges can be of different types. Building on such efforts, in this work we articulate a novel approach for mining relationships across entities in such networks while accounting for user preference (prioritization) over relationship type and interestingness metric. We formalize the problem as a top-$k$ lightest paths problem, contextualized in a real-world communication network, and seek to find the $k$ most interesting path instances matching the preferred relationship type. Our solution, PROphetic HEuristic Algorithm for Path Searching (PRO-HEAPS), leverages a combination of novel graph preprocessing techniques, well designed heuristics and the venerable A* search algorithm. We run our algorithm on real-world large-scale graphs and show that our algorithm significantly outperforms a wide variety of baseline approaches with speedups as large as 100X. We also conduct a case study and demonstrate valuable applications of our algorithm.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87047172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Profiling the Potential of Web Tables for Augmenting Cross-domain Knowledge Bases
Dominique Ritze, O. Lehmberg, Yaser Oulabi, Christian Bizer
DOI: 10.1145/2872427.2883017
Cross-domain knowledge bases such as DBpedia, YAGO, or the Google Knowledge Graph have gained increasing attention over the last few years and are starting to be deployed within various use cases. However, the content of such knowledge bases is far from complete, far from always correct, and suffers from deprecation (e.g., population numbers become outdated after some time). Hence, there are efforts to leverage various types of Web data to complement, update, and extend such knowledge bases. A source of Web data with potentially very wide coverage is the millions of relational HTML tables found on the Web. Existing work on using data from Web tables to augment cross-domain knowledge bases reports only aggregated performance numbers; the actual content of the Web tables and the topical areas of the knowledge bases that can be complemented using the tables remain unclear. In this paper, we match a large, publicly available Web table corpus to the DBpedia knowledge base. Based on the matching results, we profile the potential of Web tables for augmenting different parts of cross-domain knowledge bases and report detailed statistics about classes, properties, and instances for which missing values can be filled using Web table data as evidence. In order to estimate the potential quality of the new values, we empirically examine the Local Closed World Assumption and use it to determine the maximal number of correct facts that an ideal data fusion strategy could generate. Using this as ground truth, we compare three data fusion strategies and conclude that knowledge-based trust outperforms PageRank- and voting-based fusion.
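The closing claim contrasts fusion strategies. As a toy illustration only (not the paper's implementation), the Python sketch below shows the difference between simple majority voting and a trust-weighted vote when several Web tables assert conflicting values for the same fact; the table names, trust scores, and values are made up.

```python
from collections import defaultdict

def fuse_by_voting(candidates):
    """candidates: list of (source_id, value). Return the value asserted
    by the largest number of sources (simple majority voting)."""
    votes = defaultdict(set)
    for source, value in candidates:
        votes[value].add(source)
    return max(votes.items(), key=lambda kv: len(kv[1]))[0]

def fuse_by_trust(candidates, trust):
    """Weight each source's vote by an estimated trust score, e.g. the
    fraction of its previous values that agreed with the knowledge base."""
    scores = defaultdict(float)
    for source, value in candidates:
        scores[value] += trust.get(source, 0.5)  # unknown sources get a neutral prior
    return max(scores.items(), key=lambda kv: kv[1])[0]

# Three tables assert a population value for the same city.
candidates = [("table_a", 620000), ("table_b", 620000), ("table_c", 634000)]
trust = {"table_a": 0.3, "table_b": 0.4, "table_c": 0.95}
print(fuse_by_voting(candidates))        # 620000 -- two sources agree
print(fuse_by_trust(candidates, trust))  # 634000 -- the highly trusted source wins
```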
{"title":"Profiling the Potential of Web Tables for Augmenting Cross-domain Knowledge Bases","authors":"Dominique Ritze, O. Lehmberg, Yaser Oulabi, Christian Bizer","doi":"10.1145/2872427.2883017","DOIUrl":"https://doi.org/10.1145/2872427.2883017","url":null,"abstract":"Cross-domain knowledge bases such as DBpedia, YAGO, or the Google Knowledge Graph have gained increasing attention over the last years and are starting to be deployed within various use cases. However, the content of such knowledge bases is far from being complete, far from always being correct, and suffers from deprecation (i.e. population numbers become outdated after some time). Hence, there are efforts to leverage various types of Web data to complement, update and extend such knowledge bases. A source of Web data that potentially provides a very wide coverage are millions of relational HTML tables that are found on the Web. The existing work on using data from Web tables to augment cross-domain knowledge bases reports only aggregated performance numbers. The actual content of the Web tables and the topical areas of the knowledge bases that can be complemented using the tables remain unclear. In this paper, we match a large, publicly available Web table corpus to the DBpedia knowledge base. Based on the matching results, we profile the potential of Web tables for augmenting different parts of cross-domain knowledge bases and report detailed statistics about classes, properties, and instances for which missing values can be filled using Web table data as evidence. In order to estimate the potential quality of the new values, we empirically examine the Local Closed World Assumption and use it to determine the maximal number of correct facts that an ideal data fusion strategy could generate. Using this as ground truth, we compare three data fusion strategies and conclude that knowledge-based trust outperforms PageRank- and voting-based fusion.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85361829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Discovering Structure in the Universe of Attribute Names
A. Halevy, Natasha Noy, Sunita Sarawagi, Steven Euijong Whang, Xiao Yu
DOI: 10.1145/2872427.2882975
Recently, search engines have invested significant effort in answering entity-attribute queries from structured data, but have focused mostly on queries for frequent attributes. In parallel, several research efforts have demonstrated that there is a long tail of attributes, often thousands per class of entities, that are of interest to users. Researchers are beginning to leverage these new collections of attributes to expand the ontologies that power search engines and to recognize entity-attribute queries. Because of the sheer number of potential attributes, such tasks require us to impose some structure on this long and heavy tail of attributes. This paper introduces the problem of organizing the attributes by expressing the compositional structure of their names as a rule-based grammar. These rules offer a compact and rich semantic interpretation of multi-word attributes, while generalizing from the observed attributes to new, unseen ones. The paper describes an unsupervised learning method to generate such a grammar automatically from a large set of attribute names. Experiments show that our method can discover a precise grammar over 100,000 attributes of Countries while providing a 40-fold compaction over the attribute names. Furthermore, our grammar enables us to increase the precision of attributes from 47% to more than 90% with only a minimal curation effort. Thus, our approach provides an efficient and scalable way to expand ontologies with attributes of user interest.
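To make the idea of a rule-based grammar over attribute names concrete, here is a deliberately simplified Python sketch that groups multi-word attribute names by a shared head token and emits one rule per frequent head. It only hints at the unsupervised grammar induction described above; the attribute list and the head-is-last-token assumption are illustrative, not the paper's method.

```python
from collections import defaultdict

def induce_head_rules(attribute_names, min_support=2):
    """Group multi-word attribute names by their head word (assumed here to
    be the last token) and emit a rule per frequent head, e.g.
    A -> MOD 'population' covering 'urban population', 'rural population', ...
    The grammar induction in the paper is considerably richer."""
    by_head = defaultdict(list)
    for name in attribute_names:
        tokens = name.lower().split()
        if len(tokens) < 2:
            continue
        by_head[tokens[-1]].append(" ".join(tokens[:-1]))
    rules = {}
    for head, modifiers in by_head.items():
        if len(modifiers) >= min_support:
            rules[head] = sorted(set(modifiers))
    return rules

attrs = ["urban population", "rural population", "female population",
         "gdp per capita", "gni per capita", "capital city"]
for head, mods in induce_head_rules(attrs).items():
    print(f"A -> MOD '{head}'   where MOD in {mods}")
```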
{"title":"Discovering Structure in the Universe of Attribute Names","authors":"A. Halevy, Natasha Noy, Sunita Sarawagi, Steven Euijong Whang, Xiao Yu","doi":"10.1145/2872427.2882975","DOIUrl":"https://doi.org/10.1145/2872427.2882975","url":null,"abstract":"Recently, search engines have invested significant effort to answering entity--attribute queries from structured data, but have focused mostly on queries for frequent attributes. In parallel, several research efforts have demonstrated that there is a long tail of attributes, often thousands per class of entities, that are of interest to users. Researchers are beginning to leverage these new collections of attributes to expand the ontologies that power search engines and to recognize entity--attribute queries. Because of the sheer number of potential attributes, such tasks require us to impose some structure on this long and heavy tail of attributes. This paper introduces the problem of organizing the attributes by expressing the compositional structure of their names as a rule-based grammar. These rules offer a compact and rich semantic interpretation of multi-word attributes, while generalizing from the observed attributes to new unseen ones. The paper describes an unsupervised learning method to generate such a grammar automatically from a large set of attribute names. Experiments show that our method can discover a precise grammar over 100,000 attributes of {sc Countries} while providing a 40-fold compaction over the attribute names. Furthermore, our grammar enables us to increase the precision of attributes from 47% to more than 90% with only a minimal curation effort. Thus, our approach provides an efficient and scalable way to expand ontologies with attributes of user interest.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82329793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modeling a Retweet Network via an Adaptive Bayesian Approach
Bin Bi, Junghoo Cho
DOI: 10.1145/2872427.2882985
Twitter (and similar microblogging services) has become a central nexus for discussion of the topics of the day. Twitter data contains rich content and structured information on users' topics of interest and behavior patterns. Correctly analyzing and modeling Twitter data enables the prediction of user behavior and preferences in a variety of practical applications, such as tweet recommendation and followee recommendation. Although a number of models have been developed on Twitter data in prior work, most of them only model the tweets from users, while neglecting the valuable retweet information in the data. Models could enhance their predictive power by incorporating users' retweet content as well as their retweet behavior. In this paper, we propose two novel Bayesian nonparametric models, URM and UCM, for retweet data. Both are able to integrate the analysis of tweet text and users' retweet behavior in the same probabilistic framework, and both jointly model users' interests in tweets and retweets. As nonparametric models, URM and UCM can automatically determine their parameters from the input data, avoiding arbitrary parameter settings. Extensive experiments on real-world Twitter data show that both URM and UCM are superior to all the baselines, while UCM further outperforms URM, confirming the appropriateness of our models for retweet modeling.
{"title":"Modeling a Retweet Network via an Adaptive Bayesian Approach","authors":"Bin Bi, Junghoo Cho","doi":"10.1145/2872427.2882985","DOIUrl":"https://doi.org/10.1145/2872427.2882985","url":null,"abstract":"Twitter (and similar microblogging services) has become a central nexus for discussion of the topics of the day. Twitter data contains rich content and structured information on users' topics of interest and behavior patterns. Correctly analyzing and modeling Twitter data enables the prediction of the user behavior and preference in a variety of practical applications, such as tweet recommendation and followee recommendation. Although a number of models have been developed on Twitter data in prior work, most of these only model the tweets from users, while neglecting their valuable retweet information in the data. Models would enhance their predictive power by incorporating users' retweet content as well as their retweet behavior. In this paper, we propose two novel Bayesian nonparametric models, URM and UCM, on retweet data. Both of them are able to integrate the analysis of tweet text and users' retweet behavior in the same probabilistic framework. Moreover, they both jointly model users' interest in tweet and retweet. As nonparametric models, URM and UCM can automatically determine the parameters of the models based on input data, avoiding arbitrary parameter settings. Extensive experiments on real-world Twitter data show that both URM and UCM are superior to all the baselines, while UCM further outperforms URM, confirming the appropriateness of our models in retweet modeling.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87179120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abusive Language Detection in Online User Content
Chikashi Nobata, Joel R. Tetreault, A. Thomas, Yashar Mehdad, Yi Chang
DOI: 10.1145/2872427.2883062
Detection of abusive language in user-generated online content has become an issue of increasing importance in recent years. Most current commercial methods make use of blacklists and regular expressions; however, these measures fall short when contending with more subtle, less ham-fisted examples of hate speech. In this work, we develop a machine-learning-based method to detect hate speech in online user comments from two domains, which outperforms a state-of-the-art deep learning approach. We also develop a corpus of user comments annotated for abusive language, the first of its kind. Finally, we use our detection tool to analyze abusive language over time and in different settings to further enhance our knowledge of this behavior.
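As a rough illustration of the kind of machine-learning classifier the abstract refers to (not the authors' feature set or model), the following scikit-learn sketch combines word and character n-gram features with logistic regression; the example comments and labels are placeholders, not from the paper's corpus.

```python
# pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, make_union

# Word and character n-gram features, as commonly used for this task;
# the paper combines several richer feature families.
features = make_union(
    TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
)
model = make_pipeline(features, LogisticRegression(max_iter=1000))

# Tiny illustrative training set; real corpora contain many thousands
# of annotated comments.
comments = ["you are a wonderful person", "thanks for the helpful answer",
            "you are worthless trash", "get lost you idiot"]
labels = [0, 0, 1, 1]  # 0 = clean, 1 = abusive
model.fit(comments, labels)
print(model.predict(["what a lovely day", "you idiot"]))
```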
{"title":"Abusive Language Detection in Online User Content","authors":"Chikashi Nobata, Joel R. Tetreault, A. Thomas, Yashar Mehdad, Yi Chang","doi":"10.1145/2872427.2883062","DOIUrl":"https://doi.org/10.1145/2872427.2883062","url":null,"abstract":"Detection of abusive language in user generated online content has become an issue of increasing importance in recent years. Most current commercial methods make use of blacklists and regular expressions, however these measures fall short when contending with more subtle, less ham-fisted examples of hate speech. In this work, we develop a machine learning based method to detect hate speech on online user comments from two domains which outperforms a state-of-the-art deep learning approach. We also develop a corpus of user comments annotated for abusive language, the first of its kind. Finally, we use our detection tool to analyze abusive language over time and in different settings to further enhance our knowledge of this behavior.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82193485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Linking Users Across Domains with Location Data: Theory and Validation
Christopher J. Riederer, Yunsung Kim, A. Chaintreau, Nitish Korula, Silvio Lattanzi
DOI: 10.1145/2872427.2883002
Linking accounts of the same user across datasets, even when personally identifying information is removed or unavailable, is an important open problem studied in many contexts. Beyond its many practical applications (such as cross-domain analysis, recommendation, and link prediction), understanding this problem more generally informs us about the privacy implications of data disclosure. Previous work has typically addressed this question using either different portions of the same dataset or the same behavior observed across thematically similar domains. In contrast, the general cross-domain case, where users have different profiles independently generated from a common but unknown pattern, raises new challenges, including difficulties in validation, and remains under-explored. In this paper, we address the reconciliation problem for location-based datasets and introduce a robust method for this general setting. Location datasets are a particularly fruitful domain to study: such records are frequently produced by users in an increasing number of applications and are highly sensitive, especially when linked to other datasets. Our main contribution is a generic and self-tunable algorithm that leverages any pair of sporadic location-based datasets to determine the most likely matching between the users they contain. While making very general assumptions on the patterns of mobile users, we show that the maximum weight matching we compute is provably correct. Although true cross-domain datasets are a rarity, our experimental evaluation uses two entirely new data collections, including one we crawled, on an unprecedented scale. The method we design outperforms naive rules and prior heuristics. As it combines both sparse and dense properties of location-based data and accounts for the probabilistic dynamics of observation, it can be shown to be robust even when the data becomes sparse.
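The core computational step mentioned above, a maximum weight matching over cross-dataset user pairs, can be illustrated with the short Python sketch below. The pair score used here (a count of shared location/time events) is a crude stand-in for the paper's probabilistic score, and all user ids and visits are invented.

```python
# pip install numpy scipy
import numpy as np
from scipy.optimize import linear_sum_assignment

def link_users(visits_a, visits_b):
    """visits_a, visits_b: dict user -> set of (location, time_bin) events.
    Score every cross-dataset pair by the number of shared events and
    return the maximum weight one-to-one matching."""
    users_a, users_b = list(visits_a), list(visits_b)
    scores = np.zeros((len(users_a), len(users_b)))
    for i, a in enumerate(users_a):
        for j, b in enumerate(users_b):
            scores[i, j] = len(visits_a[a] & visits_b[b])
    rows, cols = linear_sum_assignment(scores, maximize=True)
    return [(users_a[i], users_b[j], scores[i, j]) for i, j in zip(rows, cols)]

visits_a = {"u1": {("cafe", 9), ("gym", 18)}, "u2": {("office", 10), ("bar", 21)}}
visits_b = {"x1": {("office", 10), ("bar", 21)}, "x2": {("cafe", 9), ("park", 12)}}
print(link_users(visits_a, visits_b))  # u1 <-> x2, u2 <-> x1
```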
{"title":"Linking Users Across Domains with Location Data: Theory and Validation","authors":"Christopher J. Riederer, Yunsung Kim, A. Chaintreau, Nitish Korula, Silvio Lattanzi","doi":"10.1145/2872427.2883002","DOIUrl":"https://doi.org/10.1145/2872427.2883002","url":null,"abstract":"Linking accounts of the same user across datasets -- even when personally identifying information is removed or unavailable -- is an important open problem studied in many contexts. Beyond many practical applications, (such as cross domain analysis, recommendation, and link prediction), understanding this problem more generally informs us on the privacy implications of data disclosure. Previous work has typically addressed this question using either different portions of the same dataset or observing the same behavior across thematically similar domains. In contrast, the general cross-domain case where users have different profiles independently generated from a common but unknown pattern raises new challenges, including difficulties in validation, and remains under-explored. In this paper, we address the reconciliation problem for location-based datasets and introduce a robust method for this general setting. Location datasets are a particularly fruitful domain to study: such records are frequently produced by users in an increasing number of applications and are highly sensitive, especially when linked to other datasets. Our main contribution is a generic and self-tunable algorithm that leverages any pair of sporadic location-based datasets to determine the most likely matching between the users it contains. While making very general assumptions on the patterns of mobile users, we show that the maximum weight matching we compute is provably correct. Although true cross-domain datasets are a rarity, our experimental evaluation uses two entirely new data collections, including one we crawled, on an unprecedented scale. The method we design outperforms naive rules and prior heuristics. As it combines both sparse and dense properties of location-based data and accounts for probabilistic dynamics of observation, it can be shown to be robust even when data gets sparse.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74833354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pushing the Frontier: Exploring the African Web Ecosystem
Rodérick Fanou, Gareth Tyson, Pierre François, A. Sathiaseelan
DOI: 10.1145/2872427.2882997
It is well known that Africa's mobile and fixed Internet infrastructure is progressing at a rapid pace. A flurry of recent research has quantified this, highlighting the expansion of its underlying connectivity network. However, improving the infrastructure is not useful without appropriately provisioned services to utilise it. This paper measures the availability of web content infrastructure in Africa. Whereas others have explored web infrastructure in developed regions, we shed light on practices in developing regions. To achieve this, we apply a comprehensive measurement methodology to collect data from a variety of sources. We focus on a large content delivery network to reveal that Africa's content infrastructure is, indeed, expanding. However, we find much web content is still served from the US and Europe. We discover that many of the problems faced are actually caused by significant inter-AS delays in Africa, which contribute to local ISPs not sharing their cache capacity. We discover that a related problem is the poor DNS configuration used by some ISPs, which confounds the attempts of providers to optimise their delivery. We then explore a number of other websites to show that large web infrastructure deployments are a rarity in Africa and that even regional websites host their services abroad. We conclude by making suggestions for improvements.
{"title":"Pushing the Frontier: Exploring the African Web Ecosystem","authors":"Rodérick Fanou, Gareth Tyson, Pierre François, A. Sathiaseelan","doi":"10.1145/2872427.2882997","DOIUrl":"https://doi.org/10.1145/2872427.2882997","url":null,"abstract":"It is well known that Africa's mobile and fixed Internet infrastructure is progressing at a rapid pace. A flurry of recent research has quantified this, highlighting the expansion of its underlying connectivity network. However, improving the infrastructure is not useful without appropriately provisioned services to utilise it. This paper measures the availability of web content infrastructure in Africa. Whereas others have explored web infrastructure in developed regions, we shed light on practices in developing regions. To achieve this, we apply a comprehensive measurement methodology to collect data from a variety of sources. We focus on a large content delivery network to reveal that Africa's content infrastructure is, indeed, expanding. However, we find much web content is still served from the US and Europe. We discover that many of the problems faced are actually caused by significant inter-AS delays in Africa, which contribute to local ISPs not sharing their cache capacity. We discover that a related problem is the poor DNS configuration used by some ISPs, which confounds the attempts of providers to optimise their delivery. We then explore a number of other websites to show that large web infrastructure deployments are a rarity in Africa and that even regional websites host their services abroad. We conclude by making suggestions for improvements.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79601139","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tracking the Trackers
Zhonghao Yu, S. Macbeth, Konark Modi, J. M. Pujol
DOI: 10.1145/2872427.2883028
Online tracking poses a serious privacy challenge that has drawn significant attention in both academia and industry. Existing approaches for preventing user tracking, based on curated blocklists, suffer from limited coverage and coarse-grained classification, rely on exceptions that impact sites' functionality and appearance, and require significant manual maintenance. In this paper we propose a novel approach, based on concepts borrowed from k-anonymity, in which users collectively identify unsafe data elements, which have the potential to uniquely identify an individual user, and remove them from requests. We deployed our system to 200,000 German users running the Cliqz Browser or the Cliqz Firefox extension to evaluate its efficiency and feasibility. Results indicate that our approach achieves better privacy protection than blocklists, as provided by Disconnect, while keeping site breakage to a minimum, even lower than the community-optimized AdBlock Plus. We also provide evidence of the prevalence and reach of trackers across over 21 million pages from 350,000 unique sites, the largest-scale empirical evaluation to date: 95% of the pages visited contain third-party requests to potential trackers and 78% attempt to transfer unsafe data. Tracker organizations are also ranked, showing that a single organization can reach up to 42% of all page visits in Germany.
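A minimal sketch of the k-anonymity-style filtering described above: values that too few users ever send are treated as potentially identifying and dropped from outgoing requests. This illustrates the idea only; the class name, the single shared counter, and the threshold are assumptions, not the deployed system's design, which aggregates observations collectively across the user population.

```python
from collections import defaultdict

class SafeValueFilter:
    """Drop request parameters whose values have been seen from fewer
    than k distinct users, since rare values are potentially identifying."""
    def __init__(self, k=10):
        self.k = k
        self.seen = defaultdict(set)  # (param, value) -> set of user ids

    def observe(self, user_id, params):
        for key, value in params.items():
            self.seen[(key, value)].add(user_id)

    def sanitize(self, params):
        return {key: value for key, value in params.items()
                if len(self.seen[(key, value)]) >= self.k}

f = SafeValueFilter(k=3)
for uid in range(5):
    f.observe(uid, {"q": "weather berlin"})             # popular query, safe
f.observe(0, {"q": "weather berlin", "uid": "abc123"})  # rare token, unsafe
print(f.sanitize({"q": "weather berlin", "uid": "abc123"}))  # {'q': 'weather berlin'}
```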
{"title":"Tracking the Trackers","authors":"Zhonghao Yu, S. Macbeth, Konark Modi, J. M. Pujol","doi":"10.1145/2872427.2883028","DOIUrl":"https://doi.org/10.1145/2872427.2883028","url":null,"abstract":"Online tracking poses a serious privacy challenge that has drawn significant attention in both academia and industry. Existing approaches for preventing user tracking, based on curated blocklists, suffer from limited coverage and coarse-grained resolution for classification, rely on exceptions that impact sites' functionality and appearance, and require significant manual maintenance. In this paper we propose a novel approach, based on the concepts leveraged from $k$-Anonymity, in which users collectively identify unsafe data elements, which have the potential to identify uniquely an individual user, and remove them from requests. We deployed our system to 200,000 German users running the Cliqz Browser or the Cliqz Firefox extension to evaluate its efficiency and feasibility. Results indicate that our approach achieves better privacy protection than blocklists, as provided by Disconnect, while keeping the site breakage to a minimum, even lower than the community-optimized AdBlock Plus. We also provide evidence of the prevalence and reach of trackers to over 21 million pages of 350,000 unique sites, the largest scale empirical evaluation to date. 95% of the pages visited contain 3rd party requests to potential trackers and 78% attempt to transfer unsafe data. Tracker organizations are also ranked, showing that a single organization can reach up to 42% of all page visits in Germany.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76119833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Did You Say U2 or YouTube?: Inferring Implicit Transcripts from Voice Search Logs
Milad Shokouhi, Umut Ozertem, Nick Craswell
DOI: 10.1145/2872427.2882994
Web search via voice is becoming increasingly popular, taking advantage of recent advances in automatic speech recognition. Speech recognition systems are trained using audio transcripts, which can be generated by a paid annotator listening to audio and manually transcribing it. This paper considers an alternative source of training data for speech recognition, called implicit transcription, which is based on Web search clicks and reformulations that can be interpreted as validating or correcting the recognition done during a real Web search. This can yield a large amount of free training data that matches the exact characteristics of real incoming voice searches, and the implicit transcriptions can better reflect the needs of real users because they come from the user who generated the audio. Overall, we demonstrate that the new training data has value in improving speech recognition. We further show that in-context feedback from real users allows the speech recognizer to exploit contextual signals and reduce the recognition error rate further, by up to 23%.
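As a toy illustration of how clicks and reformulations might be turned into implicit transcripts (a simplification of the paper's approach, with invented session field names), consider the following Python sketch: a click on the recognized query's results validates the recognition, while a click after a typed reformulation suggests the reformulation as a corrected transcript.

```python
def implicit_transcripts(sessions):
    """sessions: list of dicts with the assumed fields
        audio_id      -- id of the stored voice query audio
        recognized    -- text produced by the speech recognizer
        reformulation -- typed follow-up query, or None
        clicked       -- whether the user clicked a result after the last query
    Return (audio_id, transcript) training pairs. A real pipeline would add
    many more filters (string similarity, timing, query popularity)."""
    pairs = []
    for s in sessions:
        if not s["clicked"]:
            continue
        if s["reformulation"] is None:
            pairs.append((s["audio_id"], s["recognized"]))   # recognition validated
        else:
            pairs.append((s["audio_id"], s["reformulation"]))  # recognition corrected
    return pairs

sessions = [
    {"audio_id": "a1", "recognized": "you too songs", "reformulation": "u2 songs", "clicked": True},
    {"audio_id": "a2", "recognized": "youtube", "reformulation": None, "clicked": True},
    {"audio_id": "a3", "recognized": "weather", "reformulation": None, "clicked": False},
]
print(implicit_transcripts(sessions))  # [('a1', 'u2 songs'), ('a2', 'youtube')]
```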
{"title":"Did You Say U2 or YouTube?: Inferring Implicit Transcripts from Voice Search Logs","authors":"Milad Shokouhi, Umut Ozertem, Nick Craswell","doi":"10.1145/2872427.2882994","DOIUrl":"https://doi.org/10.1145/2872427.2882994","url":null,"abstract":"Web search via voice is becoming increasingly popular, taking advantage of recent advances in automatic speech recognition. Speech recognition systems are trained using audio transcripts, which can be generated by a paid annotator listening to some audio and manually transcribing it. This paper considers an alternative source of training data for speech recognition, called implicit transcription. This is based on Web search clicks and reformulations, which can be interpreted as validating or correcting the recognition done during a real Web search. This can give a large amount of free training data that matches the exact characteristics of real incoming voice searches and the implicit transcriptions can better reflect the needs of real users because they come from the user who generated the audio. On an overall basis we demonstrate that the new training data has value in improving speech recognition. We further show that the in-context feedback from real users can allow the speech recognizer to exploit contextual signals, and reduce the recognition error rate further by up to 23%.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91168250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Immersive Recommendation: News and Event Recommendations Using Personal Digital Traces
C. Hsieh, Longqi Yang, Honghao Wei, Mor Naaman, D. Estrin
DOI: 10.1145/2872427.2883006
We propose a new user-centric recommendation model, called Immersive Recommendation, that incorporates cross-platform and diverse personal digital traces into recommendations. Our context-aware topic modeling algorithm systematically profiles users' interests based on their traces from different contexts, and our hybrid recommendation algorithm makes high-quality recommendations by fusing users' personal profiles, item profiles, and existing ratings. Specifically, in this work we target personalized news and local event recommendations, given their utility and societal importance. We evaluated the model with a large-scale offline evaluation leveraging users' public Twitter traces. In addition, we conducted a direct evaluation of the model's recommendations in a 33-participant study using Twitter, Facebook, and email traces. In both cases, the proposed model showed significant improvement over state-of-the-art algorithms, suggesting the value of this new user-centric recommendation model for improving recommendation quality, including in cold-start situations.
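The hybrid step described above, fusing a trace-derived user profile with item profiles and existing ratings, can be sketched as a simple weighted blend. The cosine-plus-linear combination below is an illustrative assumption, not the paper's exact model; when no ratings exist (cold start), the collaborative score can be set to zero and the content term dominates.

```python
import numpy as np

def hybrid_score(user_topics, item_topics, cf_score, alpha=0.5):
    """Blend a content score (cosine similarity between the user's
    trace-derived topic profile and the item's topic profile) with a
    collaborative-filtering score learned from existing ratings."""
    content = float(np.dot(user_topics, item_topics) /
                    (np.linalg.norm(user_topics) * np.linalg.norm(item_topics) + 1e-12))
    return alpha * content + (1 - alpha) * cf_score

user = np.array([0.7, 0.2, 0.1])     # e.g. a politics-heavy profile built from Twitter/email traces
article = np.array([0.8, 0.1, 0.1])  # a politics-focused news article
print(round(hybrid_score(user, article, cf_score=0.3), 3))
```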
{"title":"Immersive Recommendation: News and Event Recommendations Using Personal Digital Traces","authors":"C. Hsieh, Longqi Yang, Honghao Wei, Mor Naaman, D. Estrin","doi":"10.1145/2872427.2883006","DOIUrl":"https://doi.org/10.1145/2872427.2883006","url":null,"abstract":"We propose a new user-centric recommendation model, called Immersive Recommendation, that incorporates cross-platform and diverse personal digital traces into recommendations. Our context-aware topic modeling algorithm systematically profiles users' interests based on their traces from different contexts, and our hybrid recommendation algorithm makes high-quality recommendations by fusing users' personal profiles, item profiles, and existing ratings. Specifically, in this work we target personalized news and local event recommendations for their utility and societal importance. We evaluated the model with a large-scale offline evaluation leveraging users' public Twitter traces. In addition, we conducted a direct evaluation of the model's recommendations in a 33-participant study using Twitter, Facebook and email traces. In the both cases, the proposed model showed significant improvement over the state-of-the-art algorithms, suggesting the value of using this new user-centric recommendation model to improve recommendation quality, including in cold-start situations.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89816563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}