{"title":"Session details: Session 5A: Deep Learning","authors":"Berthier Ribeiro-Neto","doi":"10.1145/3255927","DOIUrl":"https://doi.org/10.1145/3255927","url":null,"abstract":"","PeriodicalId":297035,"journal":{"name":"Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125120979","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Many generative language and relevance models assume conditional independence between the likelihoods of observing individual terms. This assumption is obviously naive, but it is also hard to replace or relax. Only very few term pairs actually show significant conditional dependencies, while the vast majority of co-located terms have no implications for the document's topical nature or its relevance to a given topic. It is exactly this situation that we capture in a formal framework: a limited number of meaningful dependencies in a system of largely independent observations. Using the formal copula framework, we describe the strength of causal dependency in terms of a number of established term co-occurrence metrics. Our experiments on the well-known ClueWeb'12 corpus and TREC 2013 topics indicate significant gains in retrieval performance when we formally account for the dependency structure underlying natural language text.
{"title":"Modelling Term Dependence with Copulas","authors":"Carsten Eickhoff, A. D. Vries, Thomas Hofmann","doi":"10.1145/2766462.2767831","DOIUrl":"https://doi.org/10.1145/2766462.2767831","url":null,"abstract":"Many generative language and relevance models assume conditional independence between the likelihood of observing individual terms. This assumption is obviously naive, but also hard to replace or relax. There are only very few term pairs that actually show significant conditional dependencies while the vast majority of co-located terms has no implications on the document's topical nature or relevance towards a given topic. It is exactly this situation that we capture in a formal framework: A limited number of meaningful dependencies in a system of largely independent observations. Making use of the formal copula framework, we describe the strength of causal dependency in terms of a number of established term co-occurrence metrics. Our experiments based on the well known ClueWeb'12 corpus and TREC 2013 topics indicate significant performance gains in terms of retrieval performance when we formally account for the dependency structure underlying pieces of natural language text.","PeriodicalId":297035,"journal":{"name":"Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"141 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125223440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Seyyed Hadi Hashemi, C. Clarke, Adriel Dean-Hall, J. Kamps, Julia Kiseleva
Creating test collections for modern search tasks is increasingly challenging due to the growing scale and dynamic nature of content, and the need for richer contextualization of the statements of request. To address these issues, the TREC Contextual Suggestion Track explored an open test collection, where participants were allowed to submit any web page as a result for a personalized venue recommendation task. This raises questions about the reusability of the resulting test collection: How does the open nature affect the pooling process? Can participants reliably evaluate variant runs with the resulting qrels? Can other teams reliably evaluate new runs? In short, does the set of pooled and judged documents effectively produce a post hoc test collection? Our main findings are the following. First, while there is a strongly significant rank correlation, the effect of pooling is notable and results in an underestimation of performance, implying that the evaluation of non-pooled systems should be done with great care. Second, we extensively analyze the impact of the open corpus on the fraction of judged documents, explaining how low recall affects reusability, and how personalization and low pooling depth aggravate that problem. Third, we outline a potential solution by deriving a fixed corpus from the open web submissions.
{"title":"On the Reusability of Open Test Collections","authors":"Seyyed Hadi Hashemi, C. Clarke, Adriel Dean-Hall, J. Kamps, Julia Kiseleva","doi":"10.1145/2766462.2767788","DOIUrl":"https://doi.org/10.1145/2766462.2767788","url":null,"abstract":"Creating test collections for modern search tasks is increasingly more challenging due to the growing scale and dynamic nature of content, and need for richer contextualization of the statements of request. To address these issues, the TREC Contextual Suggestion Track explored an open test collection, where participants were allowed to submit any web page as a result for a personalized venue recommendation task. This prompts the question on the reusability of the resulting test collection: How does the open nature affect the pooling process? Can participants reliably evaluate variant runs with the resulting qrels? Can other teams evaluate new runs reliably? In short, does the set of pooled and judged documents effectively produce a post hoc test collection? Our main findings are the following: First, while there is a strongly significant rank correlation, the effect of pooling is notable and results in underestimation of performance, implying the evaluation of non-pooled systems should be done with great care. Second, we extensively analyze impacts of open corpus on the fraction of judged documents, explaining how low recall affects the reusability, and how the personalization and low pooling depth aggravate that problem. Third, we outline a potential solution by deriving a fixed corpus from open web submissions.","PeriodicalId":297035,"journal":{"name":"Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"34 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131151302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper describes our deep learning system for sentiment analysis of tweets. The main contribution of this work is a new model for initializing the parameter weights of the convolutional neural network, which is crucial for training an accurate model without injecting any additional features. Briefly, we use an unsupervised neural language model to train initial word embeddings that are further tuned by our deep learning model on a distantly supervised corpus. In the final stage, the pre-trained parameters of the network are used to initialize the model, which we then train on the supervised training data recently made available by the official evaluation campaign on Twitter Sentiment Analysis organized at SemEval-2015. A comparison between the results of our approach and the systems participating in the challenge on the official test sets suggests that our model would rank in the first two positions in both the phrase-level subtask A (among 11 teams) and the message-level subtask B (among 40 teams). This is important evidence of the practical value of our solution.
{"title":"Twitter Sentiment Analysis with Deep Convolutional Neural Networks","authors":"Aliaksei Severyn, Alessandro Moschitti","doi":"10.1145/2766462.2767830","DOIUrl":"https://doi.org/10.1145/2766462.2767830","url":null,"abstract":"This paper describes our deep learning system for sentiment analysis of tweets. The main contribution of this work is a new model for initializing the parameter weights of the convolutional neural network, which is crucial to train an accurate model while avoiding the need to inject any additional features. Briefly, we use an unsupervised neural language model to train initial word embeddings that are further tuned by our deep learning model on a distant supervised corpus. At a final stage, the pre-trained parameters of the network are used to initialize the model. We train the latter on the supervised training data recently made available by the official system evaluation campaign on Twitter Sentiment Analysis organized by Semeval-2015. A comparison between the results of our approach and the systems participating in the challenge on the official test sets, suggests that our model could be ranked in the first two positions in both the phrase-level subtask A (among 11 teams) and on the message-level subtask B (among 40 teams). This is an important evidence on the practical value of our solution.","PeriodicalId":297035,"journal":{"name":"Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131484161","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Session 7A: Assessing","authors":"J. Zobel","doi":"10.1145/3255934","DOIUrl":"https://doi.org/10.1145/3255934","url":null,"abstract":"","PeriodicalId":297035,"journal":{"name":"Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133327849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
There are many informational queries that could be answered with a text passage, so that the searcher does not need to access the full web document. When building manual annotations of answer passages for TREC queries, Keikha et al. [6] confirmed that many such queries can be answered with just passages. By presenting the answers directly in the search result page, user information needs are addressed more rapidly, which reduces user interaction (clicks) with the search result page [3] and has a significant positive effect on user satisfaction [2, 7]. In the context of general web search, the problem of finding answer passages has not been explored extensively. Retrieving relevant passages has been studied in the TREC HARD track [1] and in INEX [5], but relevant passages are not required to contain answers. One of the tasks in the TREC genomics track [4] was to find answer passages in biomedical literature. Previous work has shown that current passage retrieval methods that focus on topical relevance are not effective at finding answers [6]. Therefore, more knowledge is required to identify answers in a document. Bernstein et al. [2] studied an approach to extract inline direct answers for search results using a paid crowdsourcing service. Such an approach, however, is expensive and impractical to apply to all possible information needs. A fully automatic process for finding answers remains a research challenge. The aim of this thesis is to find passages in documents that contain answers to a user's query. In this research, we propose to use a summarization technique that takes advantage of Community Question Answering (CQA) content. In our previous work, we showed the benefit of using social media to generate more accurate summaries of web documents [8], but that work was not designed to present answers in the summary. Given the high volume of questions and answers posted in CQA, we believe that many questions previously asked in CQA are the same as, or related to, actual web queries, and that their best answers can guide us in extracting answers from the document. As initial work, we proposed using term distributions extracted from the best answers of top matching questions in one of the leading CQA sites, Yahoo! Answers (Y!A), to generate answer summaries. An experiment comparing our summaries with reference answers built in previous work [6] found some level of success. A manuscript has been prepared for this result. Next, as an extension of the work above, we were interested in whether documents that have better-quality answer summaries should be ranked higher in the result list. A set of features is derived from the answer summaries to re-rank documents in the result list. Our experiment shows that answer summaries can be used to improve state-of-the-art document ranking. The method also outperforms a current re-ranking approach that uses comprehensive document quality features. A manuscript has been submitted for this result. In future work, we plan a deeper analysis of the top matching questions from Y!A and their corresponding best answers, in order to better understand their benefit for the generated summaries and the re-ranking results; for example, how do best answers from Y!A with different relevance levels differ in the summaries they produce? There are also opportunities to improve the answer summaries generated from Y!A, for example by predicting the quality of the best answer from Y!A that corresponds to the query. We also intend to incorporate relevant Y!A pages into the initial result list when there are questions from Y!A that match the query well. Finally, it is important to consider methods for generating answer summaries for queries that have no relevant results from CQA.
{"title":"Finding Answers in Web Search","authors":"E. Yulianti","doi":"10.1145/2766462.2767846","DOIUrl":"https://doi.org/10.1145/2766462.2767846","url":null,"abstract":"There are many informational queries that could be answered with a text passage, thereby not requiring the searcher to access the full web document. When building manual annotations of answer passages for TREC queries, Keikha et al. [6] confirmed that many such queries can be answered with just passages. By presenting the answers directly in the search result page, user information needs will be addressed more rapidly so that reduces user interaction (click) with the search result page [3] and gives a significant positive effect on user satisfaction [2, 7]. In the context of general web search, the problem of finding answer passages has not been explored extensively. Retrieving relevant passages has been studied in TREC HARD track [1] and in INEX [5], but relevant passages are not required to contain answers. One of the tasks in the TREC genomics track [4] was to find answer passages on biomedical literature. Previous work has shown that current passage retrieval methods that focus on topical relevance are not effective at finding answers [6]. Therefore, more knowledge is required to identify answers in a document. Bernstein et al. [2] has studied an approach to extract inline direct answers for search result using paid crowdsourcing service. Such an approach, however, is expensive and not practical to be applied for all possible information needs. A fully automatic process in finding answers remains a research challenge. The aim of this thesis is to find passages in the documents that contain answers to a user's query. In this research, we proposed to use a summarization technique through taking advantage of Community Question Answering (CQA) content. In our previous work, we have shown the benefit of using social media to generate more accurate summaries of web documents [8], but this was not designed to present answer in the summary. With the high volume of questions and answers posted in CQA, we believe that there are many questions that have been previously asked in CQA that are the same as or related to actual web queries, for which their best answers can guide us to extract answers in the document. As an initial work, we proposed using term distributions extracted from best answers for top matching questions in one of leading CQA sites, Yahoo! Answers (Y!A), for answer summaries generation. An experiment was done by comparing our summaries with reference answers built in previous work [6], finding some level of success. A manuscript is prepared for this result. Next, as an extension of our work above, we were interested to see whether the documents that have better quality answer summaries should be ranked higher in the result list. A set of features are derived from answer summaries to re-rank documents in the result list. Our experiment shows that answer summaries can be used to improve state-of-the-art document ranking. The method is also shown to outperform a current re-ranking approach using comprehensive document quality features. 
A ","PeriodicalId":297035,"journal":{"name":"Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133774279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
B-CUBED metrics have recently been adopted for the evaluation of clustering results as well as for many other related tasks. However, this family of metrics is not well suited to unbalanced datasets. This issue is extremely frequent in Web results, where classes follow a strongly unbalanced distribution. In this paper, we present a modified version of the B-CUBED metrics to overcome this situation. Results on toy and real datasets indicate that the proposed adaptation correctly accounts for the particularities of unbalanced cases.
{"title":"Adapted B-CUBED Metrics to Unbalanced Datasets","authors":"Jose G. Moreno, G. Dias","doi":"10.1145/2766462.2767836","DOIUrl":"https://doi.org/10.1145/2766462.2767836","url":null,"abstract":"B-CUBED metrics have recently been adopted in the evaluation of clustering results as well as in many other related tasks. However, this family of metrics is not well adapted when datasets are unbalanced. This issue is extremely frequent in Web results, where classes are distributed following a strong unbalanced pattern. In this paper, we present a modified version of B-CUBED metrics to overcome this situation. Results in toy and real datasets indicate that the proposed adaptation correctly considers the particularities of unbalanced cases.","PeriodicalId":297035,"journal":{"name":"Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114026371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
N. Spirin, Mikhail Kuznetsov, Julia Kiseleva, Yaroslav V. Spirin, Pavel A. Izhutov
Sorting tuples by an attribute value is a common search scenario, and many search engines support such capabilities, e.g. price-based sorting in e-commerce or time-based sorting on a job or social media website. However, sorting purely by the attribute value might lead to a poor user experience because relevance is not taken into account; at the top of the list, users might see irrelevant results. In this paper we choose a different approach: rather than just returning the entire list of results sorted by the attribute value, we additionally suggest relevance-aware (post-)filtering of the search results. Following this approach, we develop a new algorithm based on dynamic programming that directly optimizes a given search quality metric. It can be seamlessly integrated as the final step of a query processing pipeline and provides a theoretical guarantee of optimality. We conduct a comprehensive evaluation of our algorithm on synthetic data and real learning-to-rank data sets. Based on the experimental results, we conclude that the proposed algorithm is superior to typically used heuristics and has clear practical value for search and related applications.
{"title":"Relevance-aware Filtering of Tuples Sorted by an Attribute Value via Direct Optimization of Search Quality Metrics","authors":"N. Spirin, Mikhail Kuznetsov, Julia Kiseleva, Yaroslav V. Spirin, Pavel A. Izhutov","doi":"10.1145/2766462.2767822","DOIUrl":"https://doi.org/10.1145/2766462.2767822","url":null,"abstract":"Sorting tuples by an attribute value is a common search scenario and many search engines support such capabilities, e.g. price-based sorting in e-commerce, time-based sorting on a job or social media website. However, sorting purely by the attribute value might lead to poor user experience because the relevance is not taken into account. Hence, at the top of the list the users might see irrelevant results. In this paper we choose a different approach. Rather than just returning the entire list of results sorted by the attribute value, additionally we suggest doing the relevance-aware search results (post-) filtering. Following this approach, we develop a new algorithm based on the dynamic programming that directly optimizes a given search quality metric. It can be seamlessly integrated as the final step of a query processing pipeline and provides a theoretical guarantee on optimality. We conduct a comprehensive evaluation of our algorithm on synthetic data and real learning to rank data sets. Based on the experimental results, we conclude that the proposed algorithm is superior to typically used heuristics and has a clear practical value for the search and related applications.","PeriodicalId":297035,"journal":{"name":"Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115558903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Effective postings list compression techniques, and the efficiency of postings list processing schemes such as WAND, have significantly improved the practical performance of ranked document retrieval using inverted indexes. Recently, suffix array-based index structures have been proposed as a complementary tool, to support phrase searching. The relative merits of these alternative approaches to ranked querying using phrase components are, however, unclear. Here we provide: (1) an overview of existing phrase indexing techniques; (2) a description of how to incorporate recent advances in list compression and processing; and (3) an empirical evaluation of state-of-the-art suffix-array and inverted file-based phrase retrieval indexes using a standard IR test collection.
{"title":"On the Cost of Phrase-Based Ranking","authors":"M. Petri, Alistair Moffat","doi":"10.1145/2766462.2767769","DOIUrl":"https://doi.org/10.1145/2766462.2767769","url":null,"abstract":"Effective postings list compression techniques, and the efficiency of postings list processing schemes such as WAND, have significantly improved the practical performance of ranked document retrieval using inverted indexes. Recently, suffix array-based index structures have been proposed as a complementary tool, to support phrase searching. The relative merits of these alternative approaches to ranked querying using phrase components are, however, unclear. Here we provide: (1) an overview of existing phrase indexing techniques; (2) a description of how to incorporate recent advances in list compression and processing; and (3) an empirical evaluation of state-of-the-art suffix-array and inverted file-based phrase retrieval indexes using a standard IR test collection.","PeriodicalId":297035,"journal":{"name":"Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123030964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xirong Li, Shuai Liao, Weiyu Lan, Xiaoyong Du, Gang Yang
Given the difficulty of acquiring labeled examples for many fine-grained visual classes, there is increasing interest in zero-shot image tagging, which aims to tag images with novel labels for which no training examples are present. Using a semantic space trained by a neural language model, the current state-of-the-art embeds both images and labels into the space, wherein cross-media similarity is computed. However, for labels of relatively low occurrence, their similarities to images and to other labels can be unreliable. This paper proposes Hierarchical Semantic Embedding (HierSE), a simple model that exploits the WordNet hierarchy to improve label embedding and consequently image embedding. Moreover, we identify two good tricks, namely training the neural language model on Flickr tags instead of web documents, and using partial match instead of full match when vectorizing a WordNet node. All this lets us outperform the state-of-the-art. On a test set of over 1,500 visual object classes and 1.3 million images, the proposed model beats the current best result (hit@1 of 18.3% versus 9.4%).
{"title":"Zero-shot Image Tagging by Hierarchical Semantic Embedding","authors":"Xirong Li, Shuai Liao, Weiyu Lan, Xiaoyong Du, Gang Yang","doi":"10.1145/2766462.2767773","DOIUrl":"https://doi.org/10.1145/2766462.2767773","url":null,"abstract":"Given the difficulty of acquiring labeled examples for many fine-grained visual classes, there is an increasing interest in zero-shot image tagging, aiming to tag images with novel labels that have no training examples present. Using a semantic space trained by a neural language model, the current state-of-the-art embeds both images and labels into the space, wherein cross-media similarity is computed. However, for labels of relatively low occurrence, its similarity to images and other labels can be unreliable. This paper proposes Hierarchical Semantic Embedding (HierSE), a simple model that exploits the WordNet hierarchy to improve label embedding and consequently image embedding. Moreover, we identify two good tricks, namely training the neural language model using Flickr tags instead of web documents, and using partial match instead of full match for vectorizing a WordNet node. All this lets us outperform the state-of-the-art. On a test set of over 1,500 visual object classes and 1.3 million images, the proposed model beats the current best results (18.3% versus 9.4% in hit@1).","PeriodicalId":297035,"journal":{"name":"Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123541649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}