Answering aggregate queries is a key requirement of emerging applications of Semantic Technologies, such as data warehousing, business intelligence and sensor networks. In order to fulfill the requirements of such applications, the standardisation of SPARQL 1.1 led to the introduction of a wide range of constructs that enable value computation, aggregation, and query nesting. In this paper we provide an in-depth formal analysis of the semantics and expressive power of these new constructs as defined in the SPARQL 1.1 specification, and hence lay the necessary foundations for the development of robust, scalable and extensible query engines supporting complex numerical and analytics tasks.
{"title":"Semantics and Expressive Power of Subqueries and Aggregates in SPARQL 1.1","authors":"M. Kaminski, Egor V. Kostylev, B. C. Grau","doi":"10.1145/2872427.2883022","DOIUrl":"https://doi.org/10.1145/2872427.2883022","url":null,"abstract":"Answering aggregate queries is a key requirement of emerging applications of Semantic Technologies, such as data warehousing, business intelligence and sensor networks. In order to fulfill the requirements of such applications, the standardisation of SPARQL 1.1 led to the introduction of a wide range of constructs that enable value computation, aggregation, and query nesting. In this paper we provide an in-depth formal analysis of the semantics and expressive power of these new constructs as defined in the SPARQL 1.1 specification, and hence lay the necessary foundations for the development of robust, scalable and extensible query engines supporting complex numerical and analytics tasks.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82161377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Indicators of Compromise (IOCs) are forensic artifacts that are used as signs that a system has been compromised by an attack or that it has been infected with particular malicious software. In this paper we propose for the first time an automated technique to extract and validate IOCs for web applications, by analyzing the information collected by a high-interaction honeypot. Our approach has several advantages compared with traditional techniques used to detect malicious websites. First of all, not all compromised web pages are malicious or harmful to the user. Some may be defaced to advertise products or services, and some may be part of affiliate programs to redirect users toward (more or less legitimate) online shopping websites. In any case, it is important to detect these pages to inform their owners and to alert users to the fact that the content of the page has been compromised and cannot be trusted. Also, in the case of more traditional drive-by-download pages, the use of IOCs allows for prompt detection and correlation of infected pages, even before they may be blocked by traditional URL blacklists. Our experiments show that our system is able to automatically generate web indicators of compromise that have been used by attackers for several months (and sometimes years) in the wild without being detected. So far, these apparently harmless scripts have been able to stay under the radar of existing detection methodologies -- despite being hosted for a long time on public websites.
{"title":"Automatic Extraction of Indicators of Compromise for Web Applications","authors":"Onur Catakoglu, Marco Balduzzi, D. Balzarotti","doi":"10.1145/2872427.2883056","DOIUrl":"https://doi.org/10.1145/2872427.2883056","url":null,"abstract":"Indicators of Compromise (IOCs) are forensic artifacts that are used as signs that a system has been compromised by an attack or that it has been infected with a particular malicious software. In this paper we propose for the first time an automated technique to extract and validate IOCs for web applications, by analyzing the information collected by a high-interaction honeypot. Our approach has several advantages compared with traditional techniques used to detect malicious websites. First of all, not all the compromised web pages are malicious or harmful for the user. Some may be defaced to advertise product or services, and some may be part of affiliate programs to redirect users toward (more or less legitimate) online shopping websites. In any case, it is important to detect these pages to inform their owners and to alert the users on the fact that the content of the page has been compromised and cannot be trusted. Also in the case of more traditional drive-by-download pages, the use of IOCs allows for a prompt detection and correlation of infected pages, even before they may be blocked by more traditional URLs blacklists. Our experiments show that our system is able to automatically generate web indicators of compromise that have been used by attackers for several months (and sometimes years) in the wild without being detected. So far, these apparently harmless scripts were able to stay under the radar of the existing detection methodologies -- despite being hosted for a long time on public web sites.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83127716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We would like to understand the meaning of content on the web. But where should that meaning come from? From markup language annotations created by the authors of the content? Crowdsourced from readers of the content? Automatically extracted by machine learning algorithms? This talk investigates the possibilities.
{"title":"The Semantic Web and the Semantics of the Web: Where Does Meaning Come From?","authors":"Peter Norvig","doi":"10.1145/2872427.2874818","DOIUrl":"https://doi.org/10.1145/2872427.2874818","url":null,"abstract":"We would like to understand the meaning of content on the web. Bit where should that meaning come from? From markup language annotations created by the authors of the content? Crowdsourced from readers of the content? Automatically extracted by machine learning algorithms? This talk investigates the possibilities.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76269138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Our proposal, N-gram over Context (NOC), is a nonparametric topic model that aims to aid understanding of a given corpus and can be applied to many text mining applications. Like other topic models, NOC represents each document as a mixture of topics and generates each word from one topic. Unlike these models, NOC focuses both on topic structure, as an internal linguistic structure, and on N-grams, as an external linguistic structure. To improve the quality of topic-specific N-grams, NOC reveals a tree of topics that captures the semantic relationships between topics in a given corpus as context, and forms N-grams via power-law distributions for word frequencies on this topic tree. To obtain both linguistic structures efficiently, NOC learns them from a given corpus in a unified manner. By accessing this entire tree at the word level in the generative process of each document, NOC enables each document to maintain thematic coherence and form N-grams over context. We develop a parallelizable inference algorithm, D-NOC, to support large data sets. Experiments on review articles, papers, and tweets show that NOC is useful as a generative model for discovering both the topic structure and the corresponding N-grams, and that it complements human experts and domain-specific knowledge well. D-NOC can process large data sets while preserving full generative model performance, with the help of an open-source distributed machine learning framework.
{"title":"N-gram over Context","authors":"N. Kawamae","doi":"10.1145/2872427.2882981","DOIUrl":"https://doi.org/10.1145/2872427.2882981","url":null,"abstract":"Our proposal, $N$-gram over Context (NOC), is a nonparametric topic model that aims to help our understanding of a given corpus, and be applied to many text mining applications. Like other topic models, NOC represents each document as a mixture of topics and generates each word from one topic. Unlike these models, NOC focuses on both a topic structure as an internal linguistic structure, and N-gram as an external linguistic structure. To improve the quality of topic specific N-grams, NOC reveals a tree of topics that captures the semantic relationship between topics from a given corpus as context, and forms $N$-gram by offering power-law distributions for word frequencies on this topic tree. To gain both these linguistic structures efficiently, NOC learns them from a given corpus in a unified manner. By accessing this entire tree at the word level in the generative process of each document, NOC enables each document to maintain a thematic coherence and form $N$-grams over context. We develop a parallelizable inference algorithm, D-NOC, to support large data sets. Experiments on review articles/papers/tweet show that NOC is useful as a generative model to discover both the topic structure and the corresponding N-grams, and well complements human experts and domain specific knowledge. D-NOC can process large data sets while preserving full generative model performance, by the help of an open-source distributed machine learning framework.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80739840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Random walks are an important tool in many graph mining applications, including estimating graph parameters, sampling portions of the graph, and extracting dense communities. In this paper we consider the problem of sampling nodes from a large graph according to a prescribed distribution, using random walks as the basic primitive. Our goal is to obtain algorithms that make a small number of queries to the graph but output a node that is sampled according to the prescribed distribution. Focusing on the uniform distribution case, we study the query complexity of three algorithms and show a near-tight bound expressed in terms of parameters of the graph such as the average degree and the mixing time. Both theoretically and empirically, we show that some algorithms are preferable in practice to others. We also extend our study to the problem of sampling nodes according to some polynomial function of their degrees; this has implications for designing efficient algorithms for applications such as triangle counting.
{"title":"On Sampling Nodes in a Network","authors":"Flavio Chierichetti, Anirban Dasgupta, Ravi Kumar, Silvio Lattanzi, Tamás Sarlós","doi":"10.1145/2872427.2883045","DOIUrl":"https://doi.org/10.1145/2872427.2883045","url":null,"abstract":"Random walk is an important tool in many graph mining applications including estimating graph parameters, sampling portions of the graph, and extracting dense communities. In this paper we consider the problem of sampling nodes from a large graph according to a prescribed distribution by using random walk as the basic primitive. Our goal is to obtain algorithms that make a small number of queries to the graph but output a node that is sampled according to the prescribed distribution. Focusing on the uniform distribution case, we study the query complexity of three algorithms and show a near-tight bound expressed in terms of the parameters of the graph such as average degree and the mixing time. Both theoretically and empirically, we show that some algorithms are preferable in practice than the others. We also extend our study to the problem of sampling nodes according to some polynomial function of their degrees; this has implications for designing efficient algorithms for applications such as triangle counting.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90593052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Web shells are malicious scripts that attackers upload to a compromised web server in order to remotely execute arbitrary commands, maintain their access, and elevate their privileges. Despite their high prevalence in practice and heavy involvement in security breaches, web shells have never been the direct subject of any study. Instead, web shells have been treated as malicious black boxes that need to be detected and removed, rather than as malicious pieces of software that need to be analyzed and understood in detail. In this paper, we report on the first comprehensive study of web shells. By utilizing different static and dynamic analysis methods, we discover and quantify the visible and invisible features offered by popular malicious shells, and we discuss how attackers can take advantage of these features. Among the visible features, we find password bruteforcers, SQL database clients, port scanners, and checks for the presence of security software installed on the compromised server. In terms of invisible features, we find that about half of the analyzed shells contain an authentication mechanism, but this mechanism can be bypassed in a third of the cases. Furthermore, we find that about a third of the analyzed shells perform homephoning, i.e., the shells, upon execution, surreptitiously communicate with various third parties with the intent of revealing the location of new shell installations. By setting up honeypots, we quantify the number of third-party attackers benefiting from shell installations and show how an attacker, by merely registering the appropriate domains, can completely take over all installations of specific vulnerable shells.
{"title":"No Honor Among Thieves: A Large-Scale Analysis of Malicious Web Shells","authors":"Oleksii Starov, J. Dahse, Syed Sharique Ahmad, Thorsten Holz, Nick Nikiforakis","doi":"10.1145/2872427.2882992","DOIUrl":"https://doi.org/10.1145/2872427.2882992","url":null,"abstract":"Web shells are malicious scripts that attackers upload to a compromised web server in order to remotely execute arbitrary commands, maintain their access, and elevate their privileges. Despite their high prevalence in practice and heavy involvement in security breaches, web shells have never been the direct subject of any study. In contrast, web shells have been treated as malicious blackboxes that need to be detected and removed, rather than malicious pieces of software that need to be analyzed and, in detail, understood. In this paper, we report on the first comprehensive study of web shells. By utilizing different static and dynamic analysis methods, we discover and quantify the visible and invisible features offered by popular malicious shells, and we discuss how attackers can take advantage of these features. For visible features, we find the presence of password bruteforcers, SQL database clients, portscanners, and checks for the presence of security software installed on the compromised server. In terms of invisible features, we find that about half of the analyzed shells contain an authentication mechanism, but this mechanism can be bypassed in a third of the cases. Furthermore, we find that about a third of the analyzed shells perform homephoning, i.e., the shells, upon execution, surreptitiously communicate to various third parties with the intent of revealing the location of new shell installations. By setting up honeypots, we quantify the number of third-party attackers benefiting from shell installations and show how an attacker, by merely registering the appropriate domains, can completely take over all installations of specific vulnerable shells.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89547560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We study sequences of consumption in which the same item may be consumed multiple times. We identify two macroscopic behavior patterns of repeated consumptions. First, in a given user's lifetime, very few items live for a long time. Second, the last consumptions of an item exhibit growing inter-arrival gaps consistent with the notion of increasing boredom leading up to eventual abandonment. We then present what is to our knowledge the first holistic model of sequential repeated consumption, covering all observed aspects of this behavior. Our simple and purely combinatorial model includes no planted notion of lifetime distributions or user boredom; nonetheless, the model correctly predicts both of these phenomena. Further, we provide theoretical analysis of the behavior of the model confirming these phenomena. Additionally, the model quantitatively matches a number of microscopic phenomena across a broad range of datasets. Intriguingly, these findings suggest that the observation in a variety of domains of increasing user boredom leading to abandonment may be explained simply by probabilistic conditioning on an extinction event in a simple model, without resort to explanations based on complex human dynamics.
{"title":"Modeling User Consumption Sequences","authors":"Austin R. Benson, Ravi Kumar, A. Tomkins","doi":"10.1145/2872427.2883024","DOIUrl":"https://doi.org/10.1145/2872427.2883024","url":null,"abstract":"We study sequences of consumption in which the same item may be consumed multiple times. We identify two macroscopic behavior patterns of repeated consumptions. First, in a given user's lifetime, very few items live for a long time. Second, the last consumptions of an item exhibit growing inter-arrival gaps consistent with the notion of increasing boredom leading up to eventual abandonment. We then present what is to our knowledge the first holistic model of sequential repeated consumption, covering all observed aspects of this behavior. Our simple and purely combinatorial model includes no planted notion of lifetime distributions or user boredom; nonetheless, the model correctly predicts both of these phenomena. Further, we provide theoretical analysis of the behavior of the model confirming these phenomena. Additionally, the model quantitatively matches a number of microscopic phenomena across a broad range of datasets. Intriguingly, these findings suggest that the observation in a variety of domains of increasing user boredom leading to abandonment may be explained simply by probabilistic conditioning on an extinction event in a simple model, without resort to explanations based on complex human dynamics.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90101182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper we focus on estimating post-click engagement on native ads by predicting the dwell time on the corresponding ad landing pages. To infer relationships between features of the ads and dwell time, we resort to survival analysis techniques, which allow us to estimate the distribution of the length of time that a user will spend on the ad. This information is then integrated into the ad ranking function with the goal of promoting the rank of ads that are likely to be clicked and consumed by users (dwell time greater than a given threshold). An online evaluation over live traffic shows that considering post-click engagement has a consistently positive effect: it improves CTR, decreases the number of bounces, and increases the average dwell time, hence leading to a better post-click user experience.
{"title":"Improving Post-Click User Engagement on Native Ads via Survival Analysis","authors":"Nicola Barbieri, F. Silvestri, M. Lalmas","doi":"10.1145/2872427.2883092","DOIUrl":"https://doi.org/10.1145/2872427.2883092","url":null,"abstract":"In this paper we focus on estimating the post-click engagement on native ads by predicting the dwell time on the corresponding ad landing pages. To infer relationships between features of the ads and dwell time we resort to the application of survival analysis techniques, which allow us to estimate the distribution of the length of time that the user will spend on the ad. This information is then integrated into the ad ranking function with the goal of promoting the rank of ads that are likely to be clicked and consumed by users (dwell time greater than a given threshold). The online evaluation over live traffic shows that considering post-click engagement has a consistent positive effect on both CTR, decreases the number of bounces and increases the average dwell time, hence leading to a better user post-click experience.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77062879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We address the problem of real-time recommendation of streaming Twitter hashtags to an incoming stream of news articles. The technical challenge can be framed as large-scale topic classification, where the set of topics (i.e., hashtags) is huge and highly dynamic. Our main applications come from digital journalism, e.g., promoting original content to Twitter communities and social indexing of news to enable better retrieval and story tracking. In contrast to the state of the art, which focuses on topic modelling approaches, we propose a learning-to-rank approach for modelling hashtag relevance. This enables us to deal with the dynamic nature of the problem, since a relevance model is stable over time, while a topic model needs to be continuously retrained. We present the data collection and processing pipeline, as well as our methodology for achieving low-latency, high-precision recommendations. Our empirical results show that our method outperforms the state of the art, delivering more than 80% precision. Our techniques are implemented in a real-time system that is currently under user trial with a major news organisation.
{"title":"Learning-to-Rank for Real-Time High-Precision Hashtag Recommendation for Streaming News","authors":"Bichen Shi, Georgiana Ifrim, N. Hurley","doi":"10.1145/2872427.2882982","DOIUrl":"https://doi.org/10.1145/2872427.2882982","url":null,"abstract":"We address the problem of real-time recommendation of streaming Twitter hashtags to an incoming stream of news articles. The technical challenge can be framed as large scale topic classification where the set of topics (i.e., hashtags) is huge and highly dynamic. Our main applications come from digital journalism, e.g., promoting original content to Twitter communities and social indexing of news to enable better retrieval and story tracking. In contrast to the state-of-the-art that focuses on topic modelling approaches, we propose a learning-to-rank approach for modelling hashtag relevance. This enables us to deal with the dynamic nature of the problem, since a relevance model is stable over time, while a topic model needs to be continuously retrained. We present the data collection and processing pipeline, as well as our methodology for achieving low latency, high precision recommendations. Our empirical results show that our method outperforms the state-of-the-art, delivering more than 80% precision. Our techniques are implemented in a real-time system that is currently under user trial with a big news organisation.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81129587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Most Collaborative Filtering (CF) algorithms are optimized using a dataset of isolated user-item tuples. However, in commercial applications recommended items are usually served as an ordered list of several items rather than as isolated items. In this setting, inter-item interactions have an effect on the list's Click-Through Rate (CTR) that is unaccounted for by traditional CF approaches. Most CF approaches also ignore additional important factors such as click propensity variation and item fatigue. In this work, we introduce the list recommendation problem. We present useful insights gleaned from user behavior and consumption patterns in a large-scale real-world recommender system. We then propose a novel two-layered framework that builds upon existing CF algorithms to optimize a list's click probability. Our approach accounts for inter-item interactions as well as additional information such as item fatigue, trendiness patterns, and contextual information. Finally, we evaluate our approach using a novel adaptation of Inverse Propensity Scoring (IPS), which facilitates off-policy estimation of our method's CTR and showcases its effectiveness in real-world settings.
{"title":"Beyond Collaborative Filtering: The List Recommendation Problem","authors":"Oren Sar Shalom, Noam Koenigstein, U. Paquet, Hastagiri P. Vanchinathan","doi":"10.1145/2872427.2883057","DOIUrl":"https://doi.org/10.1145/2872427.2883057","url":null,"abstract":"Most Collaborative Filtering (CF) algorithms are optimized using a dataset of isolated user-item tuples. However, in commercial applications recommended items are usually served as an ordered list of several items and not as isolated items. In this setting, inter-item interactions have an effect on the list's Click-Through Rate (CTR) that is unaccounted for using traditional CF approaches. Most CF approaches also ignore additional important factors like click propensity variation, item fatigue, etc. In this work, we introduce the list recommendation problem. We present useful insights gleaned from user behavior and consumption patterns from a large scale real world recommender system. We then propose a novel two-layered framework that builds upon existing CF algorithms to optimize a list's click probability. Our approach accounts for inter-item interactions as well as additional information such as item fatigue, trendiness patterns, contextual information etc. Finally, we evaluate our approach using a novel adaptation of Inverse Propensity Scoring (IPS) which facilitates off-policy estimation of our method's CTR and showcases its effectiveness in real-world settings.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82767309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}