Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval: latest articles

User model-based metrics for offline query suggestion evaluation

Pub Date : 2013-07-28 DOI: 10.1145/2484028.2484041

E. Kharitonov, C. Macdonald, P. Serdyukov, I. Ounis

Query suggestion or auto-completion mechanisms are widely used by search engines and are increasingly attracting interest from the research community. However, the lack of a commonly accepted evaluation methodology and metrics means that it is not possible to compare results and approaches from the literature. Moreover, the metrics used to evaluate query suggestions are often adapted from other domains without proper justification. Hence, it is not necessarily clear whether the improvements reported in the literature would result in an actual improvement in the users' experience. Inspired by the cascade user models and state-of-the-art evaluation metrics in the web search domain, we address query suggestion evaluation by first studying users' behaviour in a search engine's query log and thereby deriving a new family of user models describing users' interaction with a query suggestion mechanism. Next, assuming a query log-based evaluation approach, we propose two new metrics to evaluate query suggestions, pSaved and eSaved. Both metrics are parameterised by a user model. pSaved is defined as the probability of using the query suggestions while submitting a query. eSaved equates to the expected relative amount of effort (keypresses) a user can avoid due to the deployed query suggestion mechanism. Finally, we experiment with both metrics using four user model instantiations, as well as with metrics previously used in the literature, on a dataset of 6.1M sessions. Our results demonstrate that pSaved and eSaved show the best alignment with users' satisfaction amongst the considered metrics.
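
The two metrics can be illustrated with a minimal sketch. It assumes a deliberately simple user model that is not taken from the paper: after each typed character the user examines the suggestion at rank r with probability `exam_prob(r)` and accepts the intended query as soon as it is examined. The names `saved_metrics`, `suggest` and `exam_prob` are hypothetical.

```python
def saved_metrics(query, suggest, exam_prob):
    """Return (pSaved, eSaved) for one query under a toy user model.

    query     -- the full query the user intends to submit
    suggest   -- hypothetical function: prefix -> ranked list of suggestions
    exam_prob -- hypothetical user model: rank (0-based) -> examination probability
    """
    n = len(query)
    p_still_typing = 1.0  # probability that no suggestion has been accepted yet
    p_saved = 0.0
    e_saved = 0.0
    for i in range(1, n + 1):
        ranked = suggest(query[:i])
        if query not in ranked:
            continue
        # Probability that the user accepts the suggestion after i keypresses.
        p_accept = p_still_typing * exam_prob(ranked.index(query))
        p_saved += p_accept
        e_saved += p_accept * (1.0 - i / n)  # fraction of keypresses avoided
        p_still_typing -= p_accept
    return p_saved, e_saved


# Toy suggester: the intended query appears at rank 0 once two characters are typed.
def toy_suggest(prefix):
    return ["abcd"] if len(prefix) >= 2 else []


print(saved_metrics("abcd", toy_suggest, lambda rank: 0.5))  # (0.875, 0.3125)
```

A richer user model would only change `exam_prob`, for example by making the examination probability depend on the prefix length as well as the rank.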

Citations: 25

An incremental approach to efficient pseudo-relevance feedback

Pub Date : 2013-07-28 DOI: 10.1145/2484028.2484051

Hao Wu, Hui Fang

Pseudo-relevance feedback is an important strategy to improve search accuracy. It is often implemented as a two-round retrieval process: the first round retrieves an initial set of documents relevant to an original query, and the second round retrieves the final results using the original query expanded with terms selected from the previously retrieved documents. This two-round retrieval process is clearly time-consuming, which could arguably be one of the main reasons that hinder the wide adoption of pseudo-relevance feedback methods in real-world IR systems. In this paper, we study how to improve the efficiency of pseudo-relevance feedback methods. The basic idea is to reduce the time needed for the second round of retrieval by leveraging the query processing results of the first round. Specifically, instead of processing the expanded query as a newly submitted query, we propose an incremental approach, which resumes from the query processing results (i.e. document accumulators) of the first round of retrieval and processes the second round of retrieval mainly as a step of adjusting the scores in the accumulators. Experimental results on TREC Terabyte collections show that the proposed incremental approach can improve the efficiency of pseudo-relevance feedback methods by a factor of two without sacrificing their effectiveness.
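
The idea of resuming from the first-round accumulators can be sketched as follows. This is an illustrative toy, not the authors' implementation: it assumes a term-at-a-time scorer whose score is a weighted sum of term frequencies, and the names `accumulate`, `incremental_feedback` and `expand` are hypothetical.

```python
from collections import defaultdict


def accumulate(index, weighted_terms, accumulators=None):
    """Term-at-a-time scoring: add weight * tf to each document's accumulator."""
    acc = defaultdict(float) if accumulators is None else accumulators
    for term, weight in weighted_terms.items():
        for doc_id, tf in index.get(term, {}).items():
            acc[doc_id] += weight * tf
    return acc


def incremental_feedback(index, query, expand, k=2):
    # Round 1: score the original query and keep the accumulators.
    acc = accumulate(index, query)
    top_docs = sorted(acc, key=lambda d: (-acc[d], d))[:k]
    expanded = expand(top_docs)  # hypothetical: top documents -> {term: weight}
    # Round 2: only process the terms whose weight changed, reusing round 1.
    delta = {}
    for term in set(query) | set(expanded):
        change = expanded.get(term, 0.0) - query.get(term, 0.0)
        if change != 0.0:
            delta[term] = change
    return accumulate(index, delta, acc), expanded


# Toy inverted index: term -> {doc_id: term frequency}.
index = {"apple": {1: 2, 2: 1}, "pie": {1: 1, 3: 1}, "recipe": {2: 3, 3: 1}}
query = {"apple": 1.0, "pie": 1.0}
scores, expanded = incremental_feedback(
    index, query, lambda top: {"apple": 1.0, "pie": 1.0, "recipe": 0.5}
)
print(dict(scores))  # {1: 3.0, 2: 2.5, 3: 1.5}
```

In this toy setting the incremental result equals a from-scratch evaluation of the expanded query, while the second round touches only the posting list of the new term `recipe`.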

Citations: 23

Sopra: a new social personalized ranking function for improving web search

Pub Date : 2013-07-28 DOI: 10.1145/2484028.2484131

Mohamed Reda Bouadjenek, Hakim Hacid, M. Bouzeghoub

We present in this paper a contribution to IR modeling by proposing a new ranking function called SoPRa that considers the social dimension of the Web. This social dimension is any social information that surrounds documents along with the social context of users. Currently, our approach relies on folksonomies for extracting these social contexts, but it can be extended to use any social meta-data, e.g. comments, ratings, tweets, etc. The evaluation performed on our approach shows its benefits for personalized search.

Citations: 43

An effective implicit relevance feedback technique using affective, physiological and behavioural features

Pub Date : 2013-07-28 DOI: 10.1145/2484028.2484074

Yashar Moshfeghi, J. Jose

The effectiveness of various behavioural signals for implicit relevance feedback models has been exhaustively studied. Despite the advantages of such techniques for a real-time information retrieval system, most of the behavioural signals are noisy and therefore not reliable enough to be employed. Among them, a combination of dwell time and task information has been shown to be effective for relevance judgement prediction. However, the task information might not be available to the system at all times. Thus, there is a need for other sources of information which can be used as a substitute for task information. Recently, affective and physiological signals have shown promise as a potential source of information for relevance judgement prediction. However, their accuracy is not high enough to be applicable on their own. This paper investigates whether affective and physiological signals can be used as a complementary source of information for behavioural signals (i.e. dwell time) to create a reliable signal for relevance judgement prediction. Using a video retrieval system as a use case, we study and compare the effectiveness of the affective and physiological signals on their own, as well as in combination with behavioural signals, for the relevance judgement prediction task across four different search intentions: seeking information, re-finding a particular information object, and two different entertainment intentions (i.e. entertainment by adjusting arousal level, and entertainment by adjusting mood). Our experimental results show that the effectiveness of the studied signals varies across different search intentions, and that a significant improvement can be achieved when affective and physiological signals are combined with dwell time. Overall, these findings will help to implement better search engines in the future.

Citations: 58

Pseudo test collections for training and tuning microblog rankers

Pub Date : 2013-07-28 DOI: 10.1145/2484028.2484063

R. Berendsen, M. Tsagkias, W. Weerkamp, M. de Rijke

Recent years have witnessed a persistent interest in generating pseudo test collections, both for training and evaluation purposes. We describe a method for generating queries and relevance judgments for microblog search in an unsupervised way. Our starting point is this intuition: tweets with a hashtag are relevant to the topic covered by the hashtag and hence to a suitable query derived from the hashtag. Our baseline method selects all commonly used hashtags, and all associated tweets as relevance judgments; we then generate a query from these tweets. Next, we generate a timestamp for each query, allowing us to use temporal information in the training process. We then enrich the generation process with knowledge derived from an editorial test collection for microblog search. We use our pseudo test collections in two ways. First, we tune parameters of a variety of well-known retrieval methods on them. Correlations with parameter sweeps on an editorial test collection are high on average, with a large variance over retrieval algorithms. Second, we use the pseudo test collections as training sets in a learning-to-rank scenario. Performance close to training on an editorial test collection is achieved in many cases. Our results demonstrate the utility of tuning and training microblog search algorithms on automatically generated training material.
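
The baseline described above (commonly used hashtags, their tweets as relevance judgments, a query generated from those tweets) might be sketched like this. The query generation step here is an assumption made for illustration: it simply picks the most frequent non-hashtag terms, which is not necessarily what the paper does. The names `build_pseudo_collection`, `min_tweets` and `query_len` are hypothetical.

```python
import re
from collections import Counter, defaultdict

HASHTAG = re.compile(r"#(\w+)")
TOKEN = re.compile(r"[a-z0-9]+")


def build_pseudo_collection(tweets, min_tweets=2, query_len=2):
    """tweets: {tweet_id: text}. Returns {hashtag: {"query": [...], "relevant": [...]}}."""
    by_tag = defaultdict(set)
    for tweet_id, text in tweets.items():
        for tag in HASHTAG.findall(text.lower()):
            by_tag[tag].add(tweet_id)

    collection = {}
    for tag, tweet_ids in by_tag.items():
        if len(tweet_ids) < min_tweets:
            continue  # keep only commonly used hashtags
        counts = Counter()
        for tweet_id in tweet_ids:
            body = HASHTAG.sub(" ", tweets[tweet_id].lower())  # drop the hashtags
            counts.update(TOKEN.findall(body))
        # Assumed query generation: most frequent terms, ties broken alphabetically.
        ranked = sorted(counts, key=lambda t: (-counts[t], t))
        collection[tag] = {"query": ranked[:query_len], "relevant": sorted(tweet_ids)}
    return collection


tweets = {
    1: "Great goal tonight #football",
    2: "What a goal by the striker #football",
    3: "New phone launch #tech",
}
print(build_pseudo_collection(tweets, min_tweets=2, query_len=1))
# {'football': {'query': ['goal'], 'relevant': [1, 2]}}
```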

Citations: 34

Hybrid retrieval approaches to geospatial music recommendation

Pub Date : 2013-07-28 DOI: 10.1145/2484028.2484146

M. Schedl, Dominik Schnitzer

Recent advances in music retrieval and recommendation algorithms highlight the necessity to follow multimodal approaches in order to transcend limits imposed by methods that solely use audio, web, or collaborative filtering data. In this paper, we propose hybrid music recommendation algorithms that combine information on the music content, the music context, and the user context, in particular, integrating location-aware weighting of similarities. Using state-of-the-art techniques to extract audio features and contextual web features, and a novel standardized data set of music listening activities inferred from microblogs (MusicMicro), we propose several multimodal retrieval functions. The main contributions of this paper are (i) a systematic evaluation of mixture coefficients between state-of-the-art audio features and web features, using the first standardized microblog data set of music listening events for retrieval purposes and (ii) novel geospatial music recommendation approaches using location information of microblog users, and a comprehensive evaluation thereof.

Citations: 38

Competition-based networks for expert finding

Pub Date : 2013-07-28 DOI: 10.1145/2484028.2484183

Çigdem Aslay, Neil O'Hare, L. Aiello, A. Jaimes

Finding experts in question answering platforms has important applications, such as question routing or identification of best answers. Addressing the problem of ranking users with respect to their expertise, we propose Competition-Based Expertise Networks (CBEN), a novel community expertise network structure based on the principle of competition among the answerers of a question. We evaluate our approach on a very large dataset from Yahoo! Answers using a variety of centrality measures. We show that it outperforms state-of-the-art network structures and, unlike previous methods, is able to consistently outperform simple metrics like best answer count. We also analyse question answering forums in Yahoo! Answers, and show that they can be characterised by factual or subjective information seeking behavior, social discussions and the conducting of polls or surveys. We find that the ability to identify experts greatly depends on the type of forum, which is directly reflected in the structural properties of the expertise networks.
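
A competition-based network can be sketched in a few lines. The construction below is an assumption made for illustration, since the abstract does not give the exact edge definition: every answerer who did not give the best answer gets a directed edge to the best answerer, and users are ranked by weighted in-degree, one of many possible centrality measures. The names `competition_edges` and `indegree_ranking` are hypothetical.

```python
from collections import Counter


def competition_edges(questions):
    """Build weighted edges (loser -> winner) from question threads."""
    edges = Counter()
    for question in questions:
        winner = question["best_answerer"]
        for user in question["answerers"]:
            if user != winner:
                edges[(user, winner)] += 1
    return edges


def indegree_ranking(edges):
    """Rank users by weighted in-degree, ties broken alphabetically."""
    score = Counter()
    for (loser, winner), weight in edges.items():
        score[winner] += weight
        score[loser] += 0  # make sure users who never won are still ranked
    return sorted(score, key=lambda user: (-score[user], user))


questions = [
    {"answerers": ["a", "b", "c"], "best_answerer": "a"},
    {"answerers": ["a", "b"], "best_answerer": "b"},
    {"answerers": ["a", "c"], "best_answerer": "a"},
]
print(indegree_ranking(competition_edges(questions)))  # ['a', 'b', 'c']
```

Unlike a plain best answer count, this score also reflects how many competitors each best answer beat: user `a` has two best answers but an in-degree of 3.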

Citations: 43

An information-theoretic account of static index pruning

Pub Date : 2013-07-28 DOI: 10.1145/2484028.2484061

Ruey-Cheng Chen, Chia-Jung Lee

In this paper, we recast static index pruning as a model induction problem under the framework of Kullback's principle of minimum cross-entropy. We show that static index pruning has an approximate analytical solution in the form of a convex integer program. Further analysis of computational feasibility suggests that one of its surrogate models can be solved efficiently. This result has led to the rediscovery of uniform pruning, a simple yet powerful pruning method proposed in 2001 and later largely ignored by many of us. To empirically verify this result, we conducted experiments under a new design in which the prune ratio is strictly controlled. Our results on standard ad-hoc retrieval benchmarks confirm that uniform pruning is robust to high prune ratios and that its performance is currently state of the art.
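
Uniform pruning applies one global cutoff to every posting in the index, regardless of the term it belongs to. The sketch below assumes that each posting already carries a numeric importance score (how that score is computed is left open here) and controls the prune ratio exactly by dropping the lowest-scoring fraction of postings. The name `uniform_prune` and the toy index are hypothetical.

```python
def uniform_prune(index, prune_ratio):
    """Drop the lowest-scoring fraction of postings across the whole index.

    index       -- {term: {doc_id: score}}, scores assumed to be precomputed
    prune_ratio -- fraction of postings to remove, between 0 and 1
    """
    postings = [
        (score, term, doc_id)
        for term, plist in index.items()
        for doc_id, score in plist.items()
    ]
    postings.sort()  # ascending by score; term and doc_id only break ties
    n_drop = int(prune_ratio * len(postings))
    pruned = {}
    for score, term, doc_id in postings[n_drop:]:
        pruned.setdefault(term, {})[doc_id] = score
    return pruned


index = {"a": {1: 0.9, 2: 0.1}, "b": {1: 0.5, 3: 0.2}}
print(uniform_prune(index, 0.5))  # {'b': {1: 0.5}, 'a': {1: 0.9}}
```

Sorting all postings makes the achieved prune ratio exact, at the cost of a global sort; a threshold estimated from a sample of scores would be a cheaper approximation.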

Citations: 12

Indexing and querying overlapping structures

Pub Date : 2013-07-28 DOI: 10.1145/2484028.2484234

Faegheh Hasibi

Structural information retrieval is mostly based on hierarchy. However, in real life information is not purely hierarchical, and structural elements may overlap each other. The most common example is a document with two distinct structural views, where the logical view is section/subsection/paragraph and the physical view is page/line. Each single structural view of this document is a hierarchy, and its components are either disjoint or nested inside each other. The overlapping issue arises when one structural element cannot be neatly nested into others, for instance when a paragraph starts on one page and ends on the next. Similar situations can appear in videos and other multimedia content, where temporal or spatial constituents of a media file may overlap each other. Querying over overlapping structures is one of the challenges of large-scale search engines. For instance, FSIS (FAST Search for Internet Sites) [1] is a Microsoft search platform that encounters overlaps while analysing the content of textual data. FSIS uses a pipeline process to extract structural and semantic information from documents. The pipeline contains several components, each of which writes annotations to the input data. These annotations consist of structural elements, and some of them may overlap each other. Handling overlapping structures in search engines will add a novel search capability, where users can ask queries such as "Find all the words that overlap two lines" or "Find the music played during Intro scene of Avatar movie". There are also other use cases, where the user of the search engine is not a person but a specific program with complex, non-traditional information retrieval needs. This research attempts to index overlapping structures and provide efficient query processing for large-scale search engines.
The current research on overlapping structures revolves around encoding and modelling data, while indexing and query processing methods still need investigation. Moreover, due to the intrinsic complexity of overlaps, XML indexing and query processing techniques cannot be used for overlapping structures. Hence, my research on overlapping structures comprises three main parts: (1) an indexing method that supports both hierarchies and overlaps; (2) a query processing method based on the indexing technique; and (3) a query language that is close to natural language and supports both full-text and structural queries. Our approach for indexing overlaps is to adapt the PrePost [3] XML indexing method to overlapping structures. This method labels each node with its start and end positions and requires modest storage space. However, PrePost indexing cannot be used for overlapping nodes. To overcome this issue, we need to define a data model for overlapping structures. Since hierarchies are not sufficient to describe overlapping components, several data structures have been introduced by scholars. One of the most interesting data models is GODDAG [2]. GODDAG is a tree-like graph in which nodes can have multiple parents. The model supports both simple inheritance and overlaps. Our proposed data model for indexing overlaps is such a tree-like structure, in which we can define overlap, parent-child and ancestor-descendant relationships.
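
Labelling each element with its start and end positions is enough to tell nesting, disjointness and overlap apart. The following is a minimal sketch under that assumption only; it is not the indexing method proposed in the abstract, and the names `Element`, `relation` and `overlapping` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    name: str
    start: int  # position of the first token covered
    end: int    # position of the last token covered (inclusive)


def relation(a, b):
    """Classify how element a relates to element b using only their positions."""
    if a.end < b.start or b.end < a.start:
        return "disjoint"
    if a.start <= b.start and b.end <= a.end:
        return "contains"
    if b.start <= a.start and a.end <= b.end:
        return "inside"
    return "overlaps"  # the spans intersect but neither nests in the other


def overlapping(elements, others):
    """All (a, b) name pairs where a overlaps b without nesting."""
    return [
        (a.name, b.name)
        for a in elements
        for b in others
        if relation(a, b) == "overlaps"
    ]


# Physical view (pages) and logical view (paragraphs) over the same 20 tokens.
pages = [Element("page1", 0, 9), Element("page2", 10, 19)]
paragraphs = [Element("para1", 0, 6), Element("para2", 7, 13), Element("para3", 14, 19)]
print(overlapping(paragraphs, pages))  # [('para2', 'page1'), ('para2', 'page2')]
```

Here `para2` starts on the first page and ends on the second, which is exactly the case a single hierarchy cannot represent.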

Citations: 1
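The start/end labelling mentioned in the abstract above can be illustrated with a minimal Python sketch. It is not the PrePost method or the author's index: the `Span` class, the half-open character offsets, and the example positions are all assumptions made for illustration. It only shows how start and end labels distinguish nesting (within one hierarchy) from the partial overlap that arises across two structural views.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A structural element labelled with hypothetical character offsets."""
    name: str
    start: int  # inclusive
    end: int    # exclusive


def nests(outer: Span, inner: Span) -> bool:
    """True if `inner` lies entirely inside `outer` (hierarchical case)."""
    return outer.start <= inner.start and inner.end <= outer.end


def intersects(a: Span, b: Span) -> bool:
    """True if the two spans share at least one position."""
    return a.start < b.end and b.start < a.end


def overlaps(a: Span, b: Span) -> bool:
    """True if the spans intersect but neither nests inside the other."""
    return intersects(a, b) and not nests(a, b) and not nests(b, a)


# Physical view: two pages and a line. Logical view: one paragraph.
# All offsets are invented for this example.
page1 = Span("page 1", 0, 100)
page2 = Span("page 2", 100, 200)
line1 = Span("line 1", 0, 50)
paragraph = Span("paragraph", 90, 130)  # starts on page 1, ends on page 2

if __name__ == "__main__":
    for element in (line1, paragraph):
        for page in (page1, page2):
            print(element.name, "vs", page.name,
                  "nested" if nests(page, element) else
                  "overlapping" if overlaps(page, element) else "disjoint")
```

Under these assumed offsets the line nests inside the first page, while the paragraph neither nests in nor is disjoint from either page, which is the case a purely hierarchical labelling cannot represent.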

Kernel-based learning to rank with syntactic and semantic structures

Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval

Pub Date : 2013-07-28 DOI: 10.1145/2484028.2484196

Alessandro Moschitti

Kernel Methods (KMs) are powerful machine learning techniques that alleviate the data representation problem: they replace the scalar product between feature vectors with similarity functions (kernels) defined directly between data instances, e.g., syntactic trees, so explicit features are no longer needed. This tutorial aims to introduce the essential, simplified theory of Support Vector Machines and KMs for the design of practical applications. It will describe effective kernels for easily engineering automatic classifiers and learning-to-rank algorithms using structured data and semantic processing. Examples will be drawn from Question Answering, Passage Re-ranking, Short and Long Text Categorization, Relation Extraction, Named Entity Recognition, and Co-Reference Resolution. Moreover, some practical demonstrations will be given using the SVM-Light-TK (tree kernel) toolkit.


Citations: 2
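The idea in the abstract above, a similarity function defined directly between syntactic trees in place of a scalar product between feature vectors, can be sketched in Python. The code below is a small tree kernel in the style of the Collins and Duffy subset-tree recursion. It is an illustrative assumption, not the tutorial's material and not the SVM-Light-TK implementation: the nested-tuple tree encoding, the function names, and the toy parse trees are all hypothetical.

```python
# A tree is a nested tuple: (label, child, child, ...); a leaf word is a str.

def production(node):
    """Grammar production at a node: its label plus its children's labels.

    Each child is tagged so that a word and a nonterminal with the same
    spelling are never confused.
    """
    children = tuple(
        (c, "word") if isinstance(c, str) else (c[0], "nonterminal")
        for c in node[1:]
    )
    return (node[0], children)


def internal_nodes(tree):
    """All non-leaf nodes of a tree, in pre-order."""
    if isinstance(tree, str):
        return []
    nodes = [tree]
    for child in tree[1:]:
        nodes.extend(internal_nodes(child))
    return nodes


def delta(n1, n2, lam):
    """Weighted count of common fragments rooted at n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    score = lam
    for c1, c2 in zip(n1[1:], n2[1:]):
        if isinstance(c1, tuple) and isinstance(c2, tuple):
            score *= 1.0 + delta(c1, c2, lam)
    return score


def tree_kernel(t1, t2, lam=1.0):
    """Sum delta over all pairs of internal nodes; lam is a decay factor."""
    return sum(
        delta(a, b, lam)
        for a in internal_nodes(t1)
        for b in internal_nodes(t2)
    )


# Two toy parse trees that differ only in the subject noun (invented data).
t1 = ("S", ("NP", "dogs"), ("VP", ("V", "bark")))
t2 = ("S", ("NP", "cats"), ("VP", ("V", "bark")))

if __name__ == "__main__":
    print(tree_kernel(t1, t2))
```

No feature vector is ever built here: the kernel value is computed from the two trees themselves, and a kernel machine such as an SVM would use that value wherever a scalar product would otherwise appear.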
