
Proceedings of the 25th International Conference on World Wide Web: Latest Publications

Where Can I Buy a Boulder?: Searching for Offline Retail Locations
Pub Date : 2016-04-11 DOI: 10.1145/2872427.2882998
Sandro Bauer, Filip Radlinski, Ryen W. White
People commonly need to purchase things in person, from large garden supplies to home decor. Although modern search systems are very effective at finding online products, little research attention has been paid to helping users find places that sell a specific product offline. For instance, users searching for an apron are not typically directed to a nearby kitchen store by a standard search engine. In this paper, we investigate "where can I buy"-style queries related to in-person purchases of products and services. Answering these queries is challenging since little is known about the range of products sold in many stores, especially those which are smaller in size. To better understand this class of queries, we first present an in-depth analysis of typical offline purchase needs as observed by a major search engine, producing an ontology of such needs. We then propose ranking features for this new problem, and learn a ranking function that returns stores most likely to sell a queried item or service, even if there is very little information available online about some of the stores. Our final contribution is a new evaluation framework that combines distance with store relevance in measuring the effectiveness of such a search system. We evaluate our method using this approach and show that it outperforms a modern web search engine.
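The ranking function itself is not described in this listing; as a rough illustration of the general idea — trading off a store's estimated likelihood of stocking the queried item against its distance from the user, which is also the spirit of the proposed evaluation framework — a minimal sketch might look like the following (the scoring form, parameter names, and decay constant are illustrative assumptions, not the authors' method):

```python
import math

def rank_stores(stores, sell_probability, distance_km, decay_km=5.0):
    """Rank candidate stores for a 'where can I buy X' query.

    sell_probability -- dict: store -> estimated probability the store sells the item
    distance_km      -- dict: store -> distance from the user in kilometres
    decay_km         -- assumed constant controlling how fast utility decays with distance
    """
    def score(store):
        # Exponential distance discount: nearby, relevant stores score highest.
        return sell_probability[store] * math.exp(-distance_km[store] / decay_km)

    return sorted(stores, key=score, reverse=True)
```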
Citations: 3
Identifying Web Queries with Question Intent
Pub Date : 2016-04-11 DOI: 10.1145/2872427.2883058
Gilad Tsur, Yuval Pinter, Idan Szpektor, David Carmel
Vertical selection is the task of predicting relevant verticals for a Web query so as to enrich the Web search results with complementary vertical results. We investigate a novel variant of this task, where the goal is to detect queries with a question intent. Specifically, we address queries for which the user would like an answer with a human touch. We call these CQA-intent queries, since answers to them are typically found in community question answering (CQA) sites. A typical approach in vertical selection is to use a vertical-specific language model of relevant queries and compute the query likelihood for each vertical as a selection criterion. This works quite well for many domains like Shopping, Local and Travel. Yet, we claim that queries with CQA intent are harder to distinguish by modeling content alone, since they cover many different topics. We propose to also take the structure of queries into consideration, reasoning that queries with question intent have quite a different structure from other queries. We present a supervised classification scheme, a random forest over word clusters for variable-length texts, which can model the query structure. Our experiments show that it substantially improves classification performance in the CQA-intent selection task compared to content-based classification, especially as query length grows.
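As a concrete illustration of the query-likelihood criterion mentioned above, the following sketch scores a query under an add-alpha smoothed unigram language model built from a vertical's relevant queries (the smoothing scheme and function names are assumptions for illustration, not the paper's exact setup):

```python
import math
from collections import Counter

def build_vertical_lm(relevant_queries):
    """Unigram term counts for one vertical, built from its relevant queries."""
    counts = Counter(term for q in relevant_queries for term in q.lower().split())
    return counts, sum(counts.values())

def query_log_likelihood(query, counts, total, vocab_size, alpha=1.0):
    """log P(query | vertical) under an add-alpha smoothed unigram model."""
    return sum(
        math.log((counts[term] + alpha) / (total + alpha * vocab_size))
        for term in query.lower().split()
    )

# Usage: compute the log-likelihood per vertical and route the query to the
# vertical whose model scores it highest (or above a tuned threshold).
```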
Citations: 33
Entity Disambiguation with Linkless Knowledge Bases
Pub Date : 2016-04-11 DOI: 10.1145/2872427.2883068
Yang Li, Shulong Tan, Huan Sun, Jiawei Han, D. Roth, Xifeng Yan
Named Entity Disambiguation is the task of disambiguating named entity mentions in natural language text and linking them to their corresponding entries in a reference knowledge base (e.g. Wikipedia). Such disambiguation can help add semantics to plain text and distinguish homonymous entities. Previous research has tackled this problem by making use of two types of context-aware features derived from the reference knowledge base, namely, context similarity and semantic relatedness. Both features heavily rely on the cross-document hyperlinks within the knowledge base: the semantic relatedness feature is directly measured via those hyperlinks, while the context similarity feature implicitly makes use of those hyperlinks to expand entity candidates' descriptions and then compares them against the query context. Unfortunately, cross-document hyperlinks are rarely available in many closed-domain knowledge bases and it is very expensive to manually add such links. Therefore, few algorithms work well on linkless knowledge bases. In this work, we propose the challenging Named Entity Disambiguation with Linkless Knowledge Bases (LNED) problem and tackle it by leveraging the useful disambiguation evidence scattered across the reference knowledge base. We propose a generative model to automatically mine such evidence out of noisy information. The mined evidence can mimic the role of the missing links and help boost LNED performance. Experimental results show that our proposed method substantially improves the disambiguation accuracy over the baseline approaches.
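For intuition about the context-similarity feature discussed above, here is a minimal bag-of-words cosine similarity between a mention's context and each candidate entity's description; this is a generic sketch of the baseline feature, not the paper's LNED generative model:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between two texts represented as term-frequency vectors."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def disambiguate(mention_context, candidate_descriptions):
    """Return the candidate entity whose description best matches the mention context.

    candidate_descriptions -- dict: entity name -> description text
    """
    return max(candidate_descriptions,
               key=lambda e: cosine_similarity(mention_context, candidate_descriptions[e]))
```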
Citations: 24
When do Recommender Systems Work the Best?: The Moderating Effects of Product Attributes and Consumer Reviews on Recommender Performance
Pub Date : 2016-04-11 DOI: 10.1145/2872427.2882976
Dokyun Lee, K. Hosanagar
We investigate the moderating effect of product attributes and consumer reviews on the efficacy of a collaborative filtering recommender system on an e-commerce site. We run a randomized field experiment on a top North American retailer's website with 184,375 users split into a recommender-treated group and a control group, with 37,215 unique products in the dataset. By augmenting the dataset with Amazon Mechanical Turk-tagged product attributes and consumer review data from the website, we study their moderating influence on recommenders in generating conversion. We first confirm that the use of recommenders increases the baseline conversion rate by 5.9%. We find that the recommenders act as substitutes for high average review ratings, with the effect of using recommenders on the conversion rate comparable to about 1.4 additional average star ratings. Additionally, we find that the positive impacts on conversion from recommenders are greater for hedonic products compared to utilitarian products, while search-experience quality did not have any impact. We also find that the higher the price, the lower the positive impact of recommenders, while having lengthier product descriptions and higher review volumes increased the recommender's effectiveness. More findings are discussed in the Results. For managers, we 1) identify the products and product attributes for which the recommenders work well, and 2) show how other product information sources on e-commerce sites interact with recommenders. Additionally, the insights from the results could inform novel recommender algorithm designs that are aware of strengths and shortcomings. From an academic standpoint, we provide insight into the underlying mechanism behind how recommenders cause consumers to purchase.
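The headline figures above are simple rate comparisons; a sketch of the arithmetic behind a statement such as "recommenders increase the baseline conversion rate by 5.9%" is shown below (a generic lift computation, with no data from the study baked in):

```python
def relative_conversion_lift(treated_buyers, treated_users, control_buyers, control_users):
    """Relative lift of the treated group's conversion rate over the control group's."""
    treated_rate = treated_buyers / treated_users
    control_rate = control_buyers / control_users
    return treated_rate / control_rate - 1.0  # 0.059 would correspond to a 5.9% lift
```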
Citations: 12
Which to View: Personalized Prioritization for Broadcast Emails
Pub Date : 2016-04-11 DOI: 10.1145/2872427.2883049
Beidou Wang, M. Ester, Jiajun Bu, Yu Zhu, Ziyu Guan, Deng Cai
Email is one of the most important communication tools today, but email overload resulting from the large number of unimportant or irrelevant emails causes trillion-level economic losses every year. Thus, personalized email prioritization algorithms are urgently needed. Despite much previous effort on this topic, broadcast email, an important type of email, has been overlooked in the literature. Broadcast emails are significantly different from normal emails, introducing both new challenges and opportunities. On one hand, the lack of real senders and limited user interactions invalidate the key features exploited by traditional email prioritization algorithms; on the other hand, thousands of receivers for one broadcast email bring us the opportunity to predict importance through collaborative filtering. However, broadcast emails face a severe cold-start problem, which hinders the direct application of collaborative filtering. In this paper, we propose the first framework for broadcast email prioritization by designing a novel active learning model that considers the collaborative filtering, implicit feedback, and time-sensitive responsiveness features of broadcast emails. Our method is thoroughly evaluated on a large-scale, real-world industrial dataset from Samsung Electronics. Our method proves highly effective and outperforms state-of-the-art personalized email prioritization methods.
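The paper's active learning model is not detailed in this listing; as a generic stand-in for the idea of choosing which broadcast emails to solicit feedback on, the sketch below uses plain uncertainty sampling over predicted view probabilities (function and parameter names are assumptions, not the authors' algorithm):

```python
def select_emails_for_feedback(email_ids, predicted_view_prob, budget=10):
    """Uncertainty sampling: pick broadcast emails whose predicted importance is least certain.

    predicted_view_prob -- dict: email id -> model's probability that the user views it
    budget              -- number of emails to solicit feedback on
    """
    most_uncertain = sorted(email_ids, key=lambda e: abs(predicted_view_prob[e] - 0.5))
    return most_uncertain[:budget]
```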
Citations: 24
Effective Construction of Relative Lempel-Ziv Dictionaries
Pub Date : 2016-04-11 DOI: 10.1145/2872427.2883042
Kewen Liao, M. Petri, Alistair Moffat, Anthony Wirth
Web crawls generate vast quantities of text, retained and archived by the search services that initiate them. To store such data and to allow storage costs to be minimized, while still providing some level of random access to the compressed data, efficient and effective compression techniques are critical. The Relative Lempel Ziv (RLZ) scheme provides fast decompression and retrieval of documents from within large compressed collections, and even with a relatively small RAM-resident dictionary, is competitive relative to adaptive compression schemes. To date, the dictionaries required by RLZ compression have been formed from concatenations of substrings regularly sampled from the underlying document collection, then pruned in a manner that seeks to retain only the high-use sections. In this work, we develop new dictionary design heuristics, based on effective construction, rather than on pruning; we identify dictionary construction as a (string) covering problem. To avoid the complications of string covering algorithms on large collections, we focus on k-mers and their frequencies. First, with a reservoir sampler, we efficiently identify the most common k-mers. Then, since a collection typically comprises regions of local similarity, we select in each "epoch" a segment whose k-mers together achieve, locally, the highest coverage score. The dictionary is formed from the concatenation of these epoch-derived segments. Our selection process is inspired by the greedy approach to the Set Cover problem. Compared with the best existing pruning method, CARE, our scheme has a similar construction time, but achieves better compression effectiveness. Over several multi-gigabyte document collections, there are relative gains of up to 27%.
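To make the construction idea concrete, here is a toy version of the two steps described: find the most common k-mers, then greedily pick one segment per epoch by how many of those k-mers it covers (exhaustive counting stands in for the reservoir sampler, and all parameters are illustrative assumptions):

```python
from collections import Counter

def top_kmers(collection_text, k, top_n):
    """Most frequent k-mers in the collection (exhaustive count in place of reservoir sampling)."""
    counts = Counter(collection_text[i:i + k] for i in range(len(collection_text) - k + 1))
    return {kmer for kmer, _ in counts.most_common(top_n)}

def build_dictionary(epochs, k, common_kmers, segment_len):
    """Concatenate, for each epoch, the segment that covers the most common k-mers."""
    chosen = []
    for epoch in epochs:  # each epoch is one contiguous chunk of the collection
        best_segment, best_cover = "", -1
        for start in range(0, max(1, len(epoch) - segment_len + 1), segment_len):
            segment = epoch[start:start + segment_len]
            cover = len({segment[i:i + k] for i in range(len(segment) - k + 1)} & common_kmers)
            if cover > best_cover:
                best_segment, best_cover = segment, cover
        chosen.append(best_segment)
    return "".join(chosen)
```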
Citations: 23
The Effect of Recommendations on Network Structure
Pub Date : 2016-04-11 DOI: 10.1145/2872427.2883040
Jessica Su, Aneesh Sharma, Sharad Goel
Online social networks regularly offer users personalized, algorithmic suggestions of whom to connect to. Here we examine the aggregate effects of such recommendations on network structure, focusing on whether these recommendations increase the popularity of niche users or, conversely, those who are already popular. We investigate this issue by empirically and theoretically analyzing abrupt changes in Twitter's network structure around the mid-2010 introduction of its "Who to Follow" feature. We find that users across the popularity spectrum benefitted from the recommendations; however, the most popular users profited substantially more than average. We trace this "rich get richer" phenomenon to three intertwined factors. First, as is typical of network recommenders, the system relies on a "friend-of-friend"-style algorithm, which we show generally results in users being recommended proportional to their degree. Second, we find that the baseline growth rate of users is sublinear in degree. This mismatch between the recommender and the natural network dynamics thus alters the structural evolution of the network. Finally, we find that people are much more likely to respond positively to recommendations for popular users---perhaps because of their greater name recognition---further amplifying the cumulative advantage of well-known individuals.
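For reference, a "friend-of-friend"-style candidate generator of the kind analyzed in the paper can be sketched in a few lines: candidates are the accounts followed by the accounts a user already follows, ranked by how many such paths reach them (a generic sketch, not Twitter's "Who to Follow" implementation):

```python
from collections import Counter

def friend_of_friend_candidates(user, follows, top_n=10):
    """Rank candidates by how many of the user's followees already follow them.

    follows -- dict: account -> set of accounts that account follows
    """
    followees = follows.get(user, set())
    counts = Counter()
    for friend in followees:
        for candidate in follows.get(friend, set()):
            if candidate != user and candidate not in followees:
                counts[candidate] += 1
    return [candidate for candidate, _ in counts.most_common(top_n)]
```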
Citations: 76
Discovery of Topical Authorities in Instagram
Pub Date : 2016-04-11 DOI: 10.1145/2872427.2883078
Aditya Pal, Amac Herdagdelen, Sourav Chatterji, Sumit Taank, Deepayan Chakrabarti
Instagram has more than 400 million monthly active accounts who share more than 80 million pictures and videos daily. This large volume of user-generated content is the application's notable strength, but also makes the problem of finding the authoritative users for a given topic challenging. Discovering topical authorities can be useful for providing relevant recommendations to the users. In addition, it can aid in building a catalog of topics and top topical authorities in order to engage new users, and hence provide a solution to the cold-start problem. In this paper, we present a novel approach that we call the Authority Learning Framework (ALF) to find topical authorities in Instagram. ALF is based on the self-described interests of the follower base of popular accounts. We infer regular users' interests from their self-reported biographies that are publicly available and use Wikipedia pages to ground these interests as fine-grained, disambiguated concepts. We propose a generalized label propagation algorithm to propagate the interests over the follower graph to the popular accounts. We show that even if biography-based interests are sparse at an individual user level they provide strong signals to infer the topical authorities and let us obtain a high precision authority list per topic. Our experiments demonstrate that ALF performs significantly better at user recommendation task compared to fine-tuned and competitive methods, via controlled experiments, in-the-wild tests, and over an expert-curated list of topical authorities.
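As an illustration of propagating self-reported interests over the follower graph, the following is a bare-bones aggregation step from followers to the accounts they follow; the paper's generalized label propagation algorithm is more involved, so treat this as a simplified sketch with assumed data structures:

```python
from collections import Counter, defaultdict

def propagate_interests(followers_of, follower_interests, top_n=5):
    """One propagation step: aggregate followers' interest labels onto each popular account.

    followers_of       -- dict: popular account -> iterable of its followers
    follower_interests -- dict: follower -> set of interests mined from their biography
    """
    scores = defaultdict(Counter)
    for account, followers in followers_of.items():
        for follower in followers:
            scores[account].update(follower_interests.get(follower, ()))
    return {account: [topic for topic, _ in topics.most_common(top_n)]
            for account, topics in scores.items()}
```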
Citations: 32
Cracking Classifiers for Evasion: A Case Study on the Google's Phishing Pages Filter
Pub Date : 2016-04-11 DOI: 10.1145/2872427.2883060
Bin Liang, Miaoqiang Su, Wei You, Wenchang Shi, Gang Yang
Various classifiers based on machine learning techniques have been widely used in security applications. Meanwhile, they have also become an attack target for adversaries. Many existing studies have paid much attention to evasion attacks on online classifiers and discussed defensive methods. However, the security of the classifiers deployed in the client environment has not received the attention it deserves. Besides, earlier studies concentrated only on experimental classifiers developed for research purposes. The security of widely-used commercial classifiers still remains unclear. In this paper, we use Google's phishing pages filter (GPPF), a classifier deployed in the Chrome browser, which has over one billion users, as a case study to investigate the security challenges for client-side classifiers. We present a new attack methodology targeting client-side classifiers, called classifier cracking. With the methodology, we successfully cracked the classification model of GPPF and extracted sufficient knowledge that can be exploited for evasion attacks, including the classification algorithm, scoring rules, and features. Most importantly, we completely reverse engineered 84.8% of the scoring rules, covering most of the high-weighted rules. Based on the cracked information, we performed two kinds of evasion attacks against GPPF, using 100 real phishing pages for evaluation. The experiments show that all the phishing pages (100%) can be easily manipulated to bypass the detection of GPPF. Our study demonstrates that existing client-side classifiers are very vulnerable to classifier cracking attacks.
Citations: 67
Mining Aspect-Specific Opinion using a Holistic Lifelong Topic Model
Pub Date : 2016-04-11 DOI: 10.1145/2872427.2883086
Shuai Wang, Zhiyuan Chen, Bing Liu
Aspect-level sentiment analysis or opinion mining consists of several core sub-tasks: aspect extraction, opinion identification, polarity classification, and separation of general and aspect-specific opinions. Various topic models have been proposed by researchers to address some of these sub-tasks. However, there is little work on modeling all of them together. In this paper, we first propose a holistic fine-grained topic model, called the JAST (Joint Aspect-based Sentiment Topic) model, that can simultaneously model all of above problems under a unified framework. To further improve it, we incorporate the idea of lifelong machine learning and propose a more advanced model, called the LAST (Lifelong Aspect-based Sentiment Topic) model. LAST automatically mines the prior knowledge of aspect, opinion, and their correspondence from other products or domains. Such knowledge is automatically extracted and incorporated into the proposed LAST model without any human involvement. Our experiments using reviews of a large number of product domains show major improvements of the proposed models over state-of-the-art baselines.
Citations: 128