Inverted indexing is a ubiquitous technique used in retrieval systems, including web search. Despite its popularity, it has a drawback: query retrieval time is highly variable and grows with the corpus size. In this work we propose an alternative technique, permutation indexing, where retrieval cost is strictly bounded and has only logarithmic dependence on the corpus size. Our approach is based on two novel techniques: (a) partitioning of the term space into overlapping clusters of terms that frequently co-occur in queries, and (b) a data structure for compactly encoding the results of all queries composed of terms in a cluster as continuous sequences of document ids. Query results are then retrieved by fetching a few small chunks of these sequences. There is a price, though: the encoding is lossy and thus returns approximate result sets. The fraction of the true results returned, the recall, is controlled by the level of redundancy: the more space is allocated to the permutation index, the higher the recall. We analyze permutation indexing both theoretically, under simplified document and query models, and empirically, on realistic document and query collections. We show that although permutation indexing cannot replace traditional retrieval methods, since high recall cannot be guaranteed on all queries, it covers up to 77% of tail queries and can be used to speed up retrieval for these queries.
{"title":"Permutation indexing: fast approximate retrieval from large corpora","authors":"M. Gurevich, Tamás Sarlós","doi":"10.1145/2505515.2505646","DOIUrl":"https://doi.org/10.1145/2505515.2505646","url":null,"abstract":"Inverted indexing is a ubiquitous technique used in retrieval systems including web search. Despite its popularity, it has a drawback - query retrieval time is highly variable and grows with the corpus size. In this work we propose an alternative technique, permutation indexing, where retrieval cost is strictly bounded and has only logarithmic dependence on the corpus size. Our approach is based on two novel techniques: (a) partitioning of the term space into overlapping clusters of terms that frequently co-occur in queries, and (b) a data structure for compactly encoding results of all queries composed of terms in a cluster as continuous sequences of document ids. Then, query results are retrieved by fetching few small chunks of these sequences. There is a price though: our encoding is lossy and thus returns approximate result sets. The fraction of the true results returned, recall, is controlled by the level of redundancy. The more space is allocated for the permutation index the higher is the recall. We analyze permutation indexing both theoretically under simplified document and query models, and empirically on a realistic document and query collections. We show that although permutation indexing can not replace traditional retrieval methods, since high recall cannot be guaranteed on all queries, it covers up to 77% of tail queries and can be used to speed up retrieval for these queries.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"34 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81487966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
While images are commonly used in search result presentation for vertical domains such as shopping and news, web search result surrogates remain primarily text-based. In this paper, we present the results of two large-scale user studies that examine the effects of augmenting text-based surrogates with images extracted from the underlying webpage. We evaluate effectiveness and efficiency both at the individual surrogate level and at the results page level. Additionally, we investigate the influence of two factors: the goodness of the image in terms of representing the underlying page content, and the diversity of the results on a results page. Our results show that at the individual surrogate level, good images provide only a small benefit in judgment accuracy over text-only surrogates, with a slight increase in judgment time. At the results page level, surrogates with good images had effectiveness and efficiency similar to the text-only condition. However, in situations where the results page items had diverse senses, surrogates with images had higher click precision than text-only ones. The results of these studies show the tradeoffs in using images in web search surrogates, and highlight particular situations where they can provide benefits.
{"title":"Augmenting web search surrogates with images","authors":"Robert G. Capra, Jaime Arguello, Falk Scholer","doi":"10.1145/2505515.2505714","DOIUrl":"https://doi.org/10.1145/2505515.2505714","url":null,"abstract":"While images are commonly used in search result presentation for vertical domains such as shopping and news, web search results surrogates remain primarily text-based. In this paper, we present results of two large-scale user studies to examine the effects of augmenting text-based surrogates with images extracted from the underlying webpage. We evaluate effectiveness and efficiency at both the individual surrogate level and at the results page level. Additionally, we investigate the influence of two factors: the goodness of the image in terms of representing the underlying page content, and the diversity of the results on a results page. Our results show that at the individual surrogate level, good images provide only a small benefit in judgment accuracy versus text-only surrogates, with a slight increase in judgment time. At the results page level, surrogates with good images had similar effectiveness and efficiency compared to the text-only condition. However, in situations where the results page items had diverse senses, surrogates with images had higher click precision versus text-only ones. Results of these studies show tradeoffs in the use of images in web search surrogates, and highlight particular situations where they can provide benefits.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"54 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81673012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we present QSEGMENT, a real-life query segmentation system for eCommerce queries. QSEGMENT uses frequency data from the query log, which we call buyers' data, and frequency data from product titles, which we call sellers' data. We exploit the taxonomical structure of the marketplace to build domain-specific frequency models. Using this approach, QSEGMENT performs better than previously described baselines for query segmentation. We also perform a large-scale evaluation using an unsupervised IR metric that we refer to as user-intent-score. We discuss the overall architecture of QSEGMENT as well as various use cases and interesting observations around segmenting eCommerce queries.
{"title":"On segmentation of eCommerce queries","authors":"Nish Parikh, P. Sriram, M. Hasan","doi":"10.1145/2505515.2505721","DOIUrl":"https://doi.org/10.1145/2505515.2505721","url":null,"abstract":"In this paper, we present QSEGMENT, a real-life query segmentation system for eCommerce queries. QSEGMENT uses frequency data from the query log which we call buyers' data and also frequency data from product titles what we call sellers' data. We exploit the taxonomical structure of the marketplace to build domain specific frequency models. Using such an approach, QSEGMENT performs better than previously described baselines for query segmentation. Also, we perform a large scale evaluation by using an unsupervised IR metric which we refer to as user-intent-score. We discuss the overall architecture of QSEGMENT as well as various use cases and interesting observations around segmenting eCommerce queries.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90751177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Existing algorithms for trajectory-based clustering usually rely on simplex representation and a single proximity-related distance (or similarity) measure. Consequently, additional information markers (e.g., social interactions or the semantics of the spatial layout) are usually ignored, leading to an inability to fully discover the communities in the trajectory database. This is especially true for human-generated trajectories, where additional fine-grained markers (e.g., movement velocity at certain locations, or the sequence of semantic spaces visited) can help capture latent relationships between cluster members. To address this limitation, we propose TODMIS: a general framework for Trajectory cOmmunity Discovery using Multiple Information Sources. TODMIS combines additional information with raw trajectory data and creates multiple similarity metrics. In our proposed approach, we first develop a novel method for computing semantic-level similarity by constructing a Markov Random Walk model from the semantically labeled trajectory data and then measuring similarity at the distribution level. In addition, we extract and compute pairwise similarity measures related to three additional markers, namely trajectory-level spatial alignment (proximity), temporal patterns, and multi-scale velocity statistics. Finally, after creating a single similarity metric from the weighted combination of these multiple measures, we apply dense sub-graph detection to discover the set of distinct communities. We evaluated TODMIS extensively using traces of (i) student movement data on a campus, (ii) customer trajectories in a shopping mall, and (iii) city-scale taxi movement data. Experimental results demonstrate that TODMIS correctly and efficiently discovers the real grouping behaviors in these diverse settings.
{"title":"TODMIS: mining communities from trajectories","authors":"Siyuan Liu, Shuhui Wang, Kasthuri Jayarajah, Archan Misra, R. Krishnan","doi":"10.1145/2505515.2505552","DOIUrl":"https://doi.org/10.1145/2505515.2505552","url":null,"abstract":"Existing algorithms for trajectory-based clustering usually rely on simplex representation and a single proximity-related distance (or similarity) measure. Consequently, additional information markers (e.g., social interactions or the semantics of the spatial layout) are usually ignored, leading to the inability to fully discover the communities in the trajectory database. This is especially true for human-generated trajectories, where additional fine-grained markers (e.g., movement velocity at certain locations, or the sequence of semantic spaces visited) can help capture latent relationships between cluster members. To address this limitation, we propose TODMIS: a general framework for Trajectory cOmmunity Discovery using Multiple Information Sources. TODMIS combines additional information with raw trajectory data and creates multiple similarity metrics. In our proposed approach, we first develop a novel approach for computing semantic level similarity by constructing a Markov Random Walk model from the semantically-labeled trajectory data, and then measuring similarity at the distribution level. In addition, we also extract and compute pair-wise similarity measures related to three additional markers, namely trajectory level spatial alignment (proximity), temporal patterns and multi-scale velocity statistics. Finally, after creating a single similarity metric from the weighted combination of these multiple measures, we apply dense sub-graph detection to discover the set of distinct communities. We evaluated TODMIS extensively using traces of (i) student movement data in a campus, (ii) customer trajectories in a shopping mall, and (iii) city-scale taxi movement data. Experimental results demonstrate that TODMIS correctly and efficiently discovers the real grouping behaviors in these diverse settings.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87316883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper focuses on the problem of Question Routing (QR) in Community Question Answering (CQA), which aims to route newly posted questions to the potential answerers who are most likely to answer them. Traditional methods for this problem only consider the text similarity features between the newly posted question and the user profile, while ignoring important statistical features, including the question-specific statistical feature and the user-specific statistical features. Moreover, traditional methods are based on unsupervised learning, which makes it difficult to incorporate rich features. This paper proposes a general framework based on learning to rank concepts for QR. Training sets consisting of triples (q, asker, answerers) are first collected. Then, by introducing the intrinsic relationships between the asker and the answerers in each CQA session to capture the intrinsic labels/orders of the users with respect to their degree of expertise on the question q, two different methods, an SVM-based and a RankingSVM-based method, are presented to learn models with different example creation processes from the training set. Finally, the potential answerers are ranked using the trained models. Extensive experiments conducted on a real-world CQA dataset from Stack Overflow show that our two proposed methods both outperform the traditional query likelihood language model (QLLM) as well as the state-of-the-art Latent Dirichlet Allocation based model (LDA). Specifically, the RankingSVM-based method achieves statistically significant improvements over the SVM-based method and attains the best performance.
{"title":"Learning to rank for question routing in community question answering","authors":"Zongcheng Ji, Bin Wang","doi":"10.1145/2505515.2505670","DOIUrl":"https://doi.org/10.1145/2505515.2505670","url":null,"abstract":"This paper focuses on the problem of Question Routing (QR) in Community Question Answering (CQA), which aims to route newly posted questions to the potential answerers who are most likely to answer them. Traditional methods to solve this problem only consider the text similarity features between the newly posted question and the user profile, while ignoring the important statistical features, including the question-specific statistical feature and the user-specific statistical features. Moreover, traditional methods are based on unsupervised learning, which is not easy to introduce the rich features into them. This paper proposes a general framework based on the learning to rank concepts for QR. Training sets consist of triples (q, asker, answerers) are first collected. Then, by introducing the intrinsic relationships between the asker and the answerers in each CQA session to capture the intrinsic labels/orders of the users about their expertise degree of the question q, two different methods, including the SVM-based and RankingSVM-based methods, are presented to learn the models with different example creation processes from the training set. Finally, the potential answerers are ranked using the trained models. Extensive experiments conducted on a real world CQA dataset from Stack Overflow show that our proposed two methods can both outperform the traditional query likelihood language model (QLLM) as well as the state-of-the-art Latent Dirichlet Allocation based model (LDA). Specifically, the RankingSVM-based method achieves statistical significant improvements over the SVM-based method and has gained the best performance.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87789641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
One important assumption in information extraction is that extractions occurring more frequently are more likely to be correct. Sparse information extraction is challenging because, no matter how big a corpus is, there are extractions supported by only a small amount of evidence in the corpus. A pioneering work known as REALM learns HMMs to model the context of a semantic relationship for assessing the extractions. This is quite costly, and the semantics revealed for the context are not explicit. In this work, we introduce a lightweight, explicit semantic approach for sparse information extraction. We use a large semantic network consisting of millions of concepts, entities, and attributes to explicitly model the context of semantic relationships. Experiments show that our approach improves the F-score of extraction by at least 11.2% over state-of-the-art, HMM-based approaches while being more efficient.
{"title":"Assessing sparse information extraction using semantic contexts","authors":"Peipei Li, Haixun Wang, Hongsong Li, Xindong Wu","doi":"10.1145/2505515.2505598","DOIUrl":"https://doi.org/10.1145/2505515.2505598","url":null,"abstract":"One important assumption of information extraction is that extractions occurring more frequently are more likely to be correct. Sparse information extraction is challenging because no matter how big a corpus is, there are extractions supported by only a small amount of evidence in the corpus. A pioneering work known as REALM learns HMMs to model the context of a semantic relationship for assessing the extractions. This is quite costly and the semantics revealed for the context are not explicit. In this work, we introduce a lightweight, explicit semantic approach for sparse information extraction. We use a large semantic network consisting of millions of concepts, entities, and attributes to explicitly model the context of semantic relationships. Experiments show that our approach improves the F-score of extraction by at least 11.2% over state-of-the-art, HMM based approaches while maintaining more efficiency.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"96 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84428864","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The booming of e-commerce in recent years has led to the generation of large amounts of product search log data. Product search logs are a unique new kind of data, containing valuable information and knowledge about user preferences over product attributes that is often hard to obtain from other sources. While regular search logs (e.g., Web search logs) contain click-throughs on unstructured text documents (e.g., web pages), product search logs contain click-throughs on structured entities defined by a set of attributes and their values. For instance, a laptop can be defined by its size, color, CPU, RAM, etc. Such structure in product entities offers us opportunities to mine and discover detailed, useful knowledge about user preferences at the attribute level, but it also raises significant challenges for mining due to the lack of attribute-level observations. In this paper, we propose a novel probabilistic mixture model for attribute-level analysis of product search logs. The model is based on a generative process in which queries are generated by a mixture of unigram language models defined by each attribute-value pair of a clicked entity. The model can be efficiently estimated using the Expectation-Maximization (EM) algorithm. The estimated parameters, including the attribute-value language models and attribute-value preference models, can be directly used to improve product search accuracy, or aggregated to reveal knowledge for understanding user intent and supporting business intelligence. Evaluation of the proposed model on a commercial product search log shows that the model is effective for mining and analyzing product search logs to discover various kinds of useful knowledge.
{"title":"A probabilistic mixture model for mining and analyzing product search log","authors":"Huizhong Duan, ChengXiang Zhai, Jinxing Cheng, A. Gattani","doi":"10.1145/2505515.2505578","DOIUrl":"https://doi.org/10.1145/2505515.2505578","url":null,"abstract":"The booming of e-commerce in recent years has led to the generation of large amounts of product search log data. Product search log is a unique new data with much valuable information and knowledge about user preferences over product attributes that is often hard to obtain from other sources. While regular search logs (e.g., Web search logs) contain click-throughs for unstructured text documents (e.g., web pages), product search logs contain clickth-roughs for structured entities defined by a set of attributes and their values. For instance, a laptop can be defined by its size, color, cpu, ram, etc. Such structures in product entities offer us opportunities to mine and discover detailed useful knowledge about user preferences at the attribute level, but they also raise significant challenges for mining due to the lack of attribute-level observations. In this paper, we propose a novel probabilistic mixture model for attribute-level analysis of product search logs. The model is based on a generative process where queries are generated by a mixture of unigram language models defined by each attribute-value pair of a clicked entity. The model can be efficiently estimated using the Expectation-Maximization (EM) algorithm. The estimated parameters, including the attribute-value language models and attribute-value preference models, can be directly used to improve product search accuracy, or aggregated to reveal knowledge for understanding user intent and supporting business intelligence. Evaluation of the proposed model on a commercial product search log shows that the model is effective for mining and analyzing product search logs to discover various kinds of useful knowledge.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"84 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88523606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning to rank methods have been proposed for practical applications in the field of information retrieval. When they are employed in microblog retrieval, however, the significant interactions among the involved features are rarely considered. In this paper, we propose a Ranking Factorization Machine (Ranking FM) model, which applies the Factorization Machine model to microblog ranking on the basis of pairwise classification. In this way, our proposed model combines the generality of the learning to rank framework with the advantages of factorization models in estimating interactions between features, leading to better retrieval performance. Moreover, three groups of features (content relevance features, semantic expansion features, and quality features) and their interactions are utilized in the Ranking FM model, which is optimized with stochastic gradient descent and adaptive regularization. Experimental results demonstrate its superiority over several baseline systems on a real Twitter dataset in terms of the P@30 and MAP metrics. Furthermore, it outperforms the best performing results in the TREC'12 Real-Time Search Task.
{"title":"Exploiting ranking factorization machines for microblog retrieval","authors":"Runwei Qiang, Feng Liang, Jianwu Yang","doi":"10.1145/2505515.2505648","DOIUrl":"https://doi.org/10.1145/2505515.2505648","url":null,"abstract":"Learning to rank method has been proposed for practical application in the field of information retrieval. When employing it in microblog retrieval, the significant interactions of various involved features are rarely considered. In this paper, we propose a Ranking Factorization Machine (Ranking FM) model, which applies Factorization Machine model to microblog ranking on basis of pairwise classification. In this way, our proposed model combines the generality of learning to rank framework with the advantages of factorization models in estimating interactions between features, leading to better retrieval performance. Moreover, three groups of features (content relevance features, semantic expansion features and quality features) and their interactions are utilized in the Ranking FM model with the methods of stochastic gradient descent and adaptive regularization for optimization. Experimental results demonstrate its superiority over several baseline systems on a real Twitter dataset in terms of P@30 and MAP metrics. Furthermore, it outperforms the best performing results in the TREC'12 Real-Time Search Task.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88152695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We consider the task of automatically phrasing and computing top-k rankings over the information contained in common knowledge bases (KBs), such as YAGO or DBPedia. We assemble the thematic focus and ranking criteria of rankings by inspecting the Subject, Predicate, Object (SPO) triples present in the KB. Making use of numerical attributes contained in the KB, we are also able to compute the actual ranking content, i.e., the entities and their performances. We further discuss the integration of existing rankings into the ranking generation process for increased coverage and ranking quality. We report on first results obtained using the YAGO knowledge base.
{"title":"The essence of knowledge (bases) through entity rankings","authors":"Evica Milchevski, S. Michel, A. Stupar","doi":"10.1145/2505515.2507838","DOIUrl":"https://doi.org/10.1145/2505515.2507838","url":null,"abstract":"We consider the task of automatically phrasing and computing top-k rankings over the information contained in common knowledge bases (KBs), such as YAGO or DBPedia. We assemble the thematic focus and ranking criteria of rankings by inspecting the present Subject, Predicate, Object (SPO) triples. Making use of numerical attributes contained in the KB we are also able to compute the actual ranking content, i.e., entities and their performances. We further discuss the integration of existing rankings into the ranking generation process for increased coverage and ranking quality. We report on first results obtained using the YAGO knowledge base.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88201332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With much of today's data being generated by people or referring to people, researchers increasingly require data that contain personal identifying information to evaluate their new algorithms. In areas such as record matching and de-duplication, fraud detection, cloud computing, and health informatics, issues such as data entry errors, typographical mistakes, noise, or recording variations can all significantly affect the outcomes of data integration, processing, and mining projects. However, privacy concerns make it challenging to obtain real data that contain personal details. An alternative to using sensitive real data is to create synthetic data that exhibit similar characteristics. The advantages of synthetic data are that (1) they can be generated with well-defined characteristics; (2) it is known which created entity each record represents (this is often unknown in real data); and (3) the generated data and the generator program itself can be published. We present a sophisticated data generation and corruption tool that allows the creation of various types of data, ranging from names and addresses, dates, social security and credit card numbers, to numerical values such as salary or blood pressure. Our tool can model dependencies between attributes, and it allows the corruption of values in various ways. We describe the overall architecture and main components of our tool, and illustrate how a user can easily extend it with novel functionalities.
{"title":"Flexible and extensible generation and corruption of personal data","authors":"P. Christen, Dinusha Vatsalan","doi":"10.1145/2505515.2507815","DOIUrl":"https://doi.org/10.1145/2505515.2507815","url":null,"abstract":"With much of today's data being generated by people or referring to people, researchers increasingly require data that contain personal identifying information to evaluate their new algorithms. In areas such as record matching and de-duplication, fraud detection, cloud computing, and health informatics, issues such as data entry errors, typographical mistakes, noise, or recording variations, can all significantly affect the outcomes of data integration, processing, and mining projects. However, privacy concerns make it challenging to obtain real data that contain personal details. An alternative to using sensitive real data is to create synthetic data which follow similar characteristics. The advantages of synthetic data are that (1) they can be generated with well defined characteristics; (2) it is known which records represent an individual created entity (this is often unknown in real data); and (3) the generated data and the generator program itself can be published. We present a sophisticated data generation and corruption tool that allows the creation of various types of data, ranging from names and addresses, dates, social security and credit card numbers, to numerical values such as salary or blood pressure. Our tool can model dependencies between attributes, and it allows the corruption of values in various ways. We describe the overall architecture and main components of our tool, and illustrate how a user can easily extend this tool with novel functionalities.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"21 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88322954","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}