
Latest publications: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management

HyPerInsight: Data Exploration Deep Inside HyPer
N. Hubig, Linnea Passing, Maximilian E. Schüle, Dimitri Vorona, A. Kemper, Thomas Neumann
Nowadays we are drowning in data of all varieties. For all these mixed types and categories of data there exist even more analysis approaches, often implemented as one-off hand-written solutions. We propose to extend HyPer, a main-memory database system, into a uniform data agent platform following the one-system-fits-all approach for solving a wide variety of data analysis problems. We achieve this by applying a flexible operator concept to a set of important data exploration algorithms. With that, HyPer solves analytical questions using clustering, classification, association rule mining, and graph mining alongside standard HTAP (Hybrid Transaction and Analytical Processing) workloads on the same database state. It enables handling the full variety and volume of HTAP extended for data exploration (HTAPx), and only requires knowledge of the already introduced SQL extensions, which are automatically optimized by the database's standard optimizer. In this demo we focus on the benefits and flexibility gained by using the SQL extensions for several well-known mining workloads. In our interactive web interface for this project, named HyPerInsight, we demonstrate how HyPer outperforms the best open-source competitor, Apache Spark, in common use cases in social media, geo-data, recommender systems, and several others.
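The abstract does not show the concrete syntax of the SQL mining extensions, but the kind of clustering operator HyPer pushes into its query plans can be sketched in plain Python. The following minimal Lloyd's k-means over relational tuples is an illustration only; none of the names here are HyPer's actual API.

```python
import random

def dist2(p, q):
    # squared Euclidean distance between two tuples
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's k-means over tuples, the kind of clustering an in-database operator could run."""
    rnd = random.Random(seed)
    centroids = list(rnd.sample(points, k))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[nearest].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # keep the old centroid if a cluster runs empty
                centroids[i] = tuple(sum(col) / len(cl) for col in zip(*cl))
    return centroids
```

Run over two well-separated point groups, the operator recovers one centroid per group.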
DOI: https://doi.org/10.1145/3132847.3133167 (published 2017-11-06)
Citations: 13
SemFacet: Making Hard Faceted Search Easier
E. Kharlamov, Luca Giacomelli, Evgeny Sherkhonov, B. C. Grau, Egor V. Kostylev, Ian Horrocks
Faceted search is a prominent search paradigm that has become the standard in many Web applications and has also recently been proposed as a suitable paradigm for exploring and querying RDF graphs. One of the main challenges that hampers the usability of faceted search systems, especially in the RDF context, is information overload, that is, when the size of the faceted interface becomes comparable to the size of the data over which the search is performed. In this demo we present (an extension of) our faceted search system SemFacet and focus on features that address information overload: ranking, aggregation, and reachability. Demo attendees will be able to try our system on an RDF graph that models online shopping over catalogs with up to millions of products.
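Facet aggregation of the kind SemFacet performs can be illustrated with a toy sketch: given RDF-style triples and the currently selected entities, count and rank the (property, value) pairs that would populate the faceted interface. This is an illustrative sketch, not SemFacet's actual implementation.

```python
from collections import Counter

def facet_counts(triples, selected):
    """Aggregate (property, value) counts over the currently selected entities."""
    return Counter((p, o) for s, p, o in triples if s in selected)

def ranked_facets(counts, top=3):
    """Rank facet values by how many selected entities they cover."""
    return counts.most_common(top)
```

Ranking facets by coverage is one simple way to keep the interface small relative to the data.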
DOI: https://doi.org/10.1145/3132847.3133192 (published 2017-11-06)
Citations: 20
A Matrix-Vector Recurrent Unit Model for Capturing Compositional Semantics in Phrase Embeddings
Rui Wang, Wei Liu, C. McDonald
The meaning of a multi-word phrase depends not only on the meaning of its constituent words, but also on the rules for composing them, giving the so-called compositional semantics. However, many deep learning models for learning compositional semantics target specific NLP tasks such as sentiment classification. Consequently, the word embeddings encode the lexical semantics, while the weights of the networks are optimised for the classification task. Such models have no mechanism to explicitly encode the compositional rules, and hence they are insufficient for capturing the semantics of phrases. We present a novel recurrent computational mechanism that specifically learns compositionality by encoding the compositional rule of each word into a matrix. The network uses a recurrent architecture to capture the order of words in phrases of various lengths without requiring extra preprocessing such as part-of-speech tagging. The model is thoroughly evaluated on both supervised and unsupervised NLP tasks including phrase similarity, noun-modifier questions, sentiment distribution prediction, and domain-specific term identification. We demonstrate that our model consistently outperforms LSTM and CNN deep learning models, simple algebraic compositions, and other popular baselines on different datasets.
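One plausible reading of the matrix-vector recurrent idea, sketched under the assumption that each word carries a vector embedding plus a composition matrix (the paper's exact gating and parameterization may differ):

```python
import numpy as np

def phrase_embedding(words, vecs, mats):
    """Left-to-right composition: h_t = tanh(M_{w_t} @ h_{t-1} + v_{w_t}).

    vecs maps each word to its vector embedding, mats to its composition matrix;
    the matrix encodes how the word transforms the phrase built so far.
    """
    h = vecs[words[0]]
    for w in words[1:]:
        h = np.tanh(mats[w] @ h + vecs[w])
    return h
```

Because the matrices are applied in sequence, reordering the words changes the resulting phrase embedding, which is what simple additive compositions cannot capture.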
DOI: https://doi.org/10.1145/3132847.3132984 (published 2017-11-06)
Citations: 3
Boolean Matrix Decomposition by Formal Concept Sampling
P. Osicka, Martin Trnecka
Finding interesting patterns is a classical problem in data mining. Boolean matrix decomposition is nowadays a standard tool for finding a set of patterns, also called factors, in Boolean data that explain the data well. We describe and experimentally evaluate a probabilistic algorithm for the Boolean matrix decomposition problem. The algorithm is derived from the GreCon algorithm, which uses formal concepts (maximal rectangles, or tiles) as factors in order to find a decomposition. We change the core of GreCon by substituting a sampling procedure for the deterministic computation of suitable formal concepts. This allows us to alleviate the greedy nature of GreCon, creates a possibility to bypass some of its pitfalls, and preserves its features, e.g. the ability to explain the entire data.
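The sampling idea can be sketched as follows: repeatedly pick a random uncovered 1, grow it into an all-ones rectangle (a formal-concept-like tile), and add the tile as a factor until every 1 is covered. This is a simplified illustration of concept sampling, not the paper's exact procedure.

```python
import random
import numpy as np

def sample_tile(X, covered, rnd):
    """Grow a random uncovered 1 of X into an all-ones rectangle (a tile)."""
    ones = [(i, j) for i in range(X.shape[0]) for j in range(X.shape[1])
            if X[i, j] and not covered[i, j]]
    i, j = rnd.choice(ones)
    rows, cols = {i}, {j}
    for c in rnd.sample(range(X.shape[1]), X.shape[1]):
        if all(X[r, c] for r in rows):  # column fits every row so far
            cols.add(c)
    for r in rnd.sample(range(X.shape[0]), X.shape[0]):
        if all(X[r, c] for c in cols):  # row fits every chosen column
            rows.add(r)
    return rows, cols

def boolean_decompose(X, seed=0):
    """Sample tiles until every 1 of X is covered; the tiles are the factors."""
    rnd = random.Random(seed)
    covered = np.zeros(X.shape, dtype=bool)
    tiles = []
    while ((X == 1) & ~covered).any():
        rows, cols = sample_tile(X, covered, rnd)
        tiles.append((rows, cols))
        for r in rows:
            for c in cols:
                covered[r, c] = True
    return tiles
```

Since every tile lies entirely inside the 1s of X and the loop runs until all 1s are covered, the Boolean union of the tiles reconstructs X exactly, mirroring the from-below, explain-the-entire-data property mentioned above.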
DOI: https://doi.org/10.1145/3132847.3133054 (published 2017-11-06)
Citations: 2
BoostVHT: Boosting Distributed Streaming Decision Trees
Theodore Vasiloudis, F. Beligianni, G. D. F. Morales
Online boosting improves the accuracy of classifiers on unbounded streams of data by chaining them into an ensemble. Due to its sequential nature, boosting has proven hard to parallelize, even more so in the online setting. This paper introduces BoostVHT, a technique to parallelize online boosting algorithms. Our proposal leverages a recently developed model-parallel learning algorithm for streaming decision trees as a base learner. This design allows the model boosting to be neatly separated from its training. As a result, BoostVHT provides a flexible learning framework that can employ any existing online boosting algorithm, while at the same time leveraging the computing power of modern parallel and distributed cluster environments. We implement our technique on Apache SAMOA, an open-source platform for mining big data streams that can run on several distributed execution engines, and demonstrate order-of-magnitude speedups compared to the state of the art.
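The online boosting being parallelized here follows the Oza-Russell scheme: each example flows through the ensemble with a weight λ that is re-sampled via a Poisson distribution for training and adjusted after each member's prediction. A single-machine sketch, with a perceptron standing in for the streaming decision tree base learner:

```python
import math
import random

class Perceptron:
    """Tiny online base learner standing in for the streaming decision tree."""
    def __init__(self, dim):
        self.w = [0.0] * dim
        self.b = 0.0
    def predict(self, x):
        s = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if s >= 0 else -1
    def update(self, x, y, k=1):
        if self.predict(x) != y:
            for _ in range(k):  # a weight-k example is replayed k times
                for i, xi in enumerate(x):
                    self.w[i] += y * xi
                self.b += y

class OnlineBoost:
    """Oza-Russell style online boosting: per-member Poisson(lambda) training weight."""
    def __init__(self, n_members, dim, seed=0):
        self.members = [Perceptron(dim) for _ in range(n_members)]
        self.sc = [0.0] * n_members  # accumulated weight of correctly classified examples
        self.sw = [0.0] * n_members  # accumulated weight of misclassified examples
        self.rnd = random.Random(seed)
    def _poisson(self, lam):
        limit, k, p = math.exp(-lam), 0, 1.0
        while True:  # Knuth's Poisson sampler
            p *= self.rnd.random()
            if p <= limit:
                return k
            k += 1
    def learn(self, x, y):
        lam = 1.0
        for m, h in enumerate(self.members):
            k = self._poisson(lam)
            if k:
                h.update(x, y, k)
            if h.predict(x) == y:  # reweight the example for the next member
                self.sc[m] += lam
                lam *= (self.sc[m] + self.sw[m]) / (2 * self.sc[m])
            else:
                self.sw[m] += lam
                lam *= (self.sc[m] + self.sw[m]) / (2 * self.sw[m])
    def predict(self, x):
        vote = 0.0
        for m, h in enumerate(self.members):
            eps = min(max(self.sw[m] / (self.sc[m] + self.sw[m] + 1e-12), 1e-6), 1 - 1e-6)
            vote += math.log((1 - eps) / eps) * h.predict(x)
        return 1 if vote >= 0 else -1
```

BoostVHT keeps exactly this sequential weight-update logic and parallelizes the training inside each base learner, which is why any such boosting algorithm can be plugged in.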
DOI: https://doi.org/10.1145/3132847.3132974 (published 2017-11-06)
Citations: 8
Learning Knowledge Embeddings by Combining Limit-based Scoring Loss
Xiaofei Zhou, Qiannan Zhu, Ping Liu, Li Guo
In knowledge graph embedding models, the margin-based ranking loss is commonly used to encourage discrimination between golden triplets and incorrect triplets, and has proved effective in many translation-based models for knowledge graph embedding. However, we find that this loss function cannot ensure that the scores of correct triplets are low enough to fulfill the translation. In this paper, we present a limit-based scoring loss to provide lower scoring of a golden triplet, and then extend two basic translation models, TransE and TransH, to TransE-RS and TransH-RS respectively, by combining the limit-based scoring loss with the margin-based ranking loss. Both presented models have low parameter complexity, which benefits application on large-scale graphs. In experiments, we evaluate our models on two typical tasks, triplet classification and link prediction, and also analyze the score distributions of positive and negative triplets produced by different models. Experimental results show that the introduced limit-based scoring loss is effective in improving the capacity of knowledge graph embedding.
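The combined objective can be sketched as a margin-based ranking term plus a limit term that additionally pushes the golden triplet's score below a fixed cap. A minimal sketch with a TransE-style score; the hyperparameter names and values here are illustrative, not the paper's settings:

```python
import numpy as np

def transe_score(h, r, t):
    """TransE dissimilarity ||h + r - t||; lower means more plausible."""
    return np.linalg.norm(h + r - t)

def limit_based_loss(pos, neg, margin=2.0, limit=1.5, lam=0.5):
    """Margin-based ranking term plus a limit term that caps the golden score."""
    s_pos = transe_score(*pos)
    s_neg = transe_score(*neg)
    ranking = max(0.0, s_pos - s_neg + margin)  # separate golden from corrupted
    scoring = max(0.0, s_pos - limit)           # push the golden score below `limit`
    return ranking + lam * scoring
```

The ranking term alone is zero whenever the gap to the corrupted triplet exceeds the margin, even if the golden score itself is large; the limit term closes exactly that loophole.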
DOI: https://doi.org/10.1145/3132847.3132939 (published 2017-11-06)
Citations: 37
Maintaining Densest Subsets Efficiently in Evolving Hypergraphs
Shuguang Hu, Xiaowei Wu, T-H. Hubert Chan
In this paper we study the densest subgraph problem, which plays a key role in many graph mining applications. The goal is to find a subset of nodes that induces a graph with maximum average degree. The problem has been extensively studied in the past few decades under a variety of different settings, and several exact and approximation algorithms have been proposed. However, as normal graphs can only model objects with pairwise relationships, the densest subgraph problem fails to identify communities under relationships that involve more than 2 objects, e.g., in a network connecting authors by publications. In this work we consider the densest subgraph problem in hypergraphs, which generalizes the problem to a wider class of networks in which edges may have different cardinalities and contain more than 2 nodes. We present two exact algorithms and a near-linear-time r-approximation algorithm for the problem, where r is the maximum cardinality of an edge in the hypergraph. We also consider the dynamic version of the problem, in which an adversary can insert or delete an edge of the hypergraph in each round and the goal is to efficiently maintain an approximation of the densest subgraph. We present two dynamic approximation algorithms with amortized poly-logarithmic update time, for any ε > 0. For the insertion-only case, the approximation ratio we maintain is r(1+ε), while for the fully dynamic case, the ratio is r²(1+ε). Extensive experiments are performed on large real datasets to validate the effectiveness and efficiency of our algorithms.
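For ordinary graphs (r = 2), the classic greedy peeling algorithm that repeatedly removes a minimum-degree node yields the kind of approximation the paper generalizes to hypergraphs. A minimal static sketch (the paper's dynamic algorithms maintain this under edge insertions and deletions):

```python
from collections import defaultdict

def densest_subgraph_peel(edges):
    """Greedy peeling: repeatedly drop a minimum-degree node, remember the best density.

    Density |E|/|V| is half the average degree, so maximizing one maximizes the other.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    nodes = set(adj)
    live = {frozenset(e) for e in edges}  # edges among the remaining nodes
    best, best_nodes = 0.0, set(nodes)
    while nodes:
        density = len(live) / len(nodes)
        if density > best:
            best, best_nodes = density, set(nodes)
        u = min(nodes, key=lambda n: len(adj[n]))
        nodes.remove(u)
        for v in adj.pop(u):
            adj[v].discard(u)
            live.discard(frozenset((u, v)))
    return best, best_nodes
```

On a 4-clique with one pendant node, peeling drops the pendant first and correctly reports the clique as the densest subset.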
DOI: https://doi.org/10.1145/3132847.3132907 (published 2017-11-06)
Citations: 36
A Neural Collaborative Filtering Model with Interaction-based Neighborhood
Ting Bai, Ji-Rong Wen, Jun Zhang, Wayne Xin Zhao
Recently, deep neural networks have been widely applied to recommender systems. A representative line of work utilizes deep learning to model complex user-item interactions. However, like traditional latent factor models that factorize user-item interactions, such models tend to be ineffective at capturing localized information. Localized information, such as neighborhood, is important to recommender systems in complementing the user-item interaction data. Based on this consideration, we propose a novel Neighborhood-based Neural Collaborative Filtering model (NNCF). To the best of our knowledge, this is the first time that neighborhood information has been integrated into neural collaborative filtering methods. Extensive experiments on three real-world datasets demonstrate the effectiveness of our model on the implicit recommendation task.
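The abstract does not give the architecture, but the core idea of complementing the user-item interaction term with a neighborhood term can be sketched as below; the fusion shown (two dot products plus a sigmoid) is an illustrative simplification, not NNCF's actual network.

```python
import numpy as np

def nncf_score(user_vec, item_vecs, item_id, neighbors):
    """Interaction term (user x item) plus a neighborhood term (items the user touched)."""
    q_i = item_vecs[item_id]
    if len(neighbors):
        q_nb = item_vecs[neighbors].mean(axis=0)  # pooled embedding of interacted items
    else:
        q_nb = np.zeros_like(q_i)
    logit = user_vec @ q_i + q_nb @ q_i
    return 1.0 / (1.0 + np.exp(-logit))  # implicit-feedback probability
```

The neighborhood term lets two users with disjoint latent factors but overlapping interaction histories still score similar items highly, which is the localized signal pure factorization misses.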
DOI: https://doi.org/10.1145/3132847.3133083 (published 2017-11-06)
Citations: 93
Content Recommendation by Noise Contrastive Transfer Learning of Feature Representation
Yiyang Li, Guanyu Tao, Weinan Zhang, Yong Yu, Jun Wang
Personalized recommendation has proved effective as a content discovery tool for many online news publishers. As fresh news articles frequently enter the system while old ones fade away quickly, building a consistent and coherent feature representation over the ever-changing article pool is fundamental to recommendation performance. However, learning a good feature representation is challenging, especially for small publishers that typically have fewer than 10,000 articles each year. In this paper, we consider transferring knowledge from a larger text corpus. In our proposed solution, an effective article recommendation engine can be established with a small number of target-publisher articles by transferring knowledge from a large corpus of text with a different distribution. Specifically, we leverage noise contrastive estimation techniques to learn the conditional distribution of a word given its context words, where the noise conditional distribution is pre-trained from the large corpus. Our solution has been deployed in a commercial recommendation service. Large-scale online A/B testing on two commercial publishers demonstrates up to a 9.97% relative overall performance gain for our proposed model on the recommendation click-through rate metric over non-transfer-learning baselines.
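Noise contrastive estimation turns density estimation into classification: distinguish the observed word from k noise samples drawn from a noise distribution q. A sketch of the per-example loss, taking the model scores and noise probabilities as given (how the paper parameterizes and pre-trains them is not shown here):

```python
import numpy as np

def nce_loss(score_pos, scores_neg, q_pos, q_neg, k):
    """Per-example NCE loss: classify the true word against k noise samples from q.

    score_* are unnormalized model log-scores; q_* are noise probabilities.
    """
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    # true word should be classified as "data", each noise sample as "noise"
    loss = -np.log(sig(score_pos - np.log(k * q_pos)))
    for s, q in zip(scores_neg, q_neg):
        loss -= np.log(sig(-(s - np.log(k * q))))
    return loss
```

The loss decreases as the model's score for the observed word rises relative to the noise baseline, which is the training signal without ever computing a softmax over the vocabulary.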
DOI: https://doi.org/10.1145/3132847.3132855 (published 2017-11-06)
Citations: 4
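The noise-contrastive estimation step the abstract describes — scoring an observed word against noise words drawn from a distribution pre-trained on the large corpus — can be sketched as a binary-classification loss. This is a minimal illustration of NCE itself, not the deployed system: the scoring function, values, and array shapes are invented for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nce_loss(score_true, scores_noise, q_true, q_noise, k):
    """Negative NCE objective for one (context, word) pair.

    score_true   -- model score s(w, c) for the observed word
    scores_noise -- model scores for the k sampled noise words
    q_true/q_noise -- noise probabilities q(.), here standing in for a
                      distribution pre-trained on the large corpus
    k            -- number of noise samples per observed word
    """
    # P(D=1 | w, c): the observed word should beat the noise baseline
    pos = np.log(sigmoid(score_true - np.log(k * q_true)))
    # P(D=0 | w', c): each sampled noise word should fall below it
    neg = np.sum(np.log(sigmoid(-(scores_noise - np.log(k * q_noise)))))
    return -(pos + neg)
```

Minimizing this loss pushes the model's score for true (context, word) pairs above the log-noise baseline, which is what lets a small target corpus reuse a noise distribution estimated elsewhere.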
A Robust Named-Entity Recognition System Using Syllable Bigram Embedding with Eojeol Prefix Information
Sunjae Kwon, Youngjoong Ko, Jungyun Seo
Korean named-entity recognition (NER) systems have been developed mainly on the morphological-level, and they are commonly based on a pipeline framework that identifies named-entities (NEs) following the morphological analysis. However, this framework can mean that the performance of NER systems is degraded, because errors from the morphological analysis propagate into NER systems. This paper proposes a novel syllable-level NER system, which does not require a morphological analysis and can achieve a similar or better performance compared with the morphological-level NER systems. In addition, because the proposed system does not require a morphological analysis step, its processing speed is about 1.9 times faster than those of the previous morphological-level NER systems.
DOI: 10.1145/3132847.3133105 (published 2017-11-06)
Citations: 4
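The syllable-level representation the abstract describes can be pictured as follows: split each eojeol (space-delimited Korean word unit) into syllables, mark which syllables begin an eojeol, and form bigram features over adjacent syllables. This is a hedged sketch of the general idea only — the paper's exact feature scheme is not given here, and the choice to form bigrams across eojeol boundaries is an illustrative assumption.

```python
def syllable_bigrams(sentence):
    """Sketch: syllable bigram features with an eojeol-prefix flag.

    Returns a list of (bigram, starts_eojeol) pairs, where the flag
    records whether the bigram's first syllable opens an eojeol.
    """
    units = []
    for eojeol in sentence.split():
        for i, syllable in enumerate(eojeol):
            units.append((syllable, i == 0))  # (syllable, prefix flag)
    # Adjacent syllables form the bigram features; crossing eojeol
    # boundaries here is an assumption made for this sketch.
    return [(units[i][0] + units[i + 1][0], units[i][1])
            for i in range(len(units) - 1)]
```

Because features are built directly from syllables, no morphological analyzer runs beforehand, which is why errors from that step cannot propagate into the NER model.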
Journal: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management