Broder et al.'s [3] shingling algorithm and Charikar's [4] random projection based approach are considered "state-of-the-art" algorithms for finding near-duplicate web pages. Both algorithms were either developed at or used by popular web search engines. We compare the two algorithms on a very large scale, namely on a set of 1.6B distinct web pages. The results show that neither of the algorithms works well for finding near-duplicate pairs on the same site, while both achieve high precision for near-duplicate pairs on different sites. Since Charikar's algorithm finds more near-duplicate pairs on different sites, it achieves a better precision overall, namely 0.50 versus 0.38 for Broder et al.'s algorithm. We present a combined algorithm which achieves precision 0.79 with 79% of the recall of the other algorithms.
{"title":"Finding near-duplicate web pages: a large-scale evaluation of algorithms","authors":"M. Henzinger","doi":"10.1145/1148170.1148222","DOIUrl":"https://doi.org/10.1145/1148170.1148222","url":null,"abstract":"Broder et al.'s [3] shingling algorithm and Charikar's [4] random projection based approach are considered \"state-of-the-art\" algorithms for finding near-duplicate web pages. Both algorithms were either developed at or used by popular web search engines. We compare the two algorithms on a very large scale, namely on a set of 1.6B distinct web pages. The results show that neither of the algorithms works well for finding near-duplicate pairs on the same site, while both achieve high precision for near-duplicate pairs on different sites. Since Charikar's algorithm finds more near-duplicate pairs on different sites, it achieves a better precision overall, namely 0.50 versus 0.38 for Broder et al.'s algorithm. We present a combined algorithm which achieves precision 0.79 with 79% of the recall of the other algorithms.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125444175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Collaborative filtering techniques have become popular in the past decade as an effective way to help people deal with information overload. Recent research has identified significant vulnerabilities in collaborative filtering techniques. Shilling attacks, in which attackers introduce biased ratings to influence recommendation systems, have been shown to be effective against memory-based collaborative filtering algorithms. We examine the effectiveness of two popular shilling attacks (the random attack and the average attack) on a model-based algorithm that uses Singular Value Decomposition (SVD) to learn a low-dimensional linear model. Our results show that the SVD-based algorithm is much more resistant to shilling attacks than memory-based algorithms. Furthermore, we develop an attack detection method directly built on the SVD-based algorithm and show that this method detects random shilling attacks with high detection rates and very low false alarm rates.
{"title":"Analysis of a low-dimensional linear model under recommendation attacks","authors":"Sheng Zhang, Ouyang Yi, J. Ford, F. Makedon","doi":"10.1145/1148170.1148259","DOIUrl":"https://doi.org/10.1145/1148170.1148259","url":null,"abstract":"Collaborative filtering techniques have become popular in the past decade as an effective way to help people deal with information overload. Recent research has identified significant vulnerabilities in collaborative filtering techniques. Shilling attacks, in which attackers introduce biased ratings to influence recommendation systems, have been shown to be effective against memory-based collaborative filtering algorithms. We examine the effectiveness of two popular shilling attacks (the random attack and the average attack) on a model-based algorithm that uses Singular Value Decomposition (SVD) to learn a low-dimensional linear model. Our results show that the SVD-based algorithm is much more resistant to shilling attacks than memory-based algorithms. Furthermore, we develop an attack detection method directly built on the SVD-based algorithm and show that this method detects random shilling attacks with high detection rates and very low false alarm rates.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"146 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123381124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This work is an initial study on the utility of automatically generated queries for evaluating known-item retrieval and how such queries compare to real queries. The main advantage of automatically generating queries is that for any given test collection numerous queries can be produced at minimal cost. For evaluation, this has huge ramifications as state-of-the-art algorithms can be tested on different types of generated queries which mimic particular querying styles that a user may adopt. Our approach draws upon previous research in IR which has probabilistically generated simulated queries for other purposes [2, 3].
{"title":"Automatic construction of known-item finding test beds","authors":"L. Azzopardi, M. de Rijke","doi":"10.1145/1148170.1148276","DOIUrl":"https://doi.org/10.1145/1148170.1148276","url":null,"abstract":"This work is an initial study on the utility of automatically generated queries for evaluating known-item retrieval and how such queries compare to real queries. The main advantage of automatically generating queries is that for any given test collection numerous queries can be produced at minimal cost. For evaluation, this has huge ramifications as state-of-the-art algorithms can be tested on different types of generated queries which mimic particular querying styles that a user may adopt. Our approach draws upon previous research in IR which has probabilistically generated simulated queries for other purposes [2, 3].","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"2011 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128062725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present a novel language modeling approach to capturing the query reformulation behavior of Web search users. Based on a framework that categorizes eight different types of "user moves" (adding/removing query terms, etc.), we treat search sessions as sequence data and build n-gram language models to capture user behavior. We evaluated our models in a prediction task. The results suggest that useful patterns of activity can be extracted from user histories. Furthermore, by examining prediction performance under different order n-gram models, we gained insight into the amount of history/context that is associated with different types of user actions. Our work serves as the basis for more refined user models.
{"title":"Action modeling: language models that predict query behavior","authors":"G. C. Murray, Jimmy J. Lin, Abdur Chowdhury","doi":"10.1145/1148170.1148315","DOIUrl":"https://doi.org/10.1145/1148170.1148315","url":null,"abstract":"We present a novel language modeling approach to capturing the query reformulation behavior of Web search users. Based on a framework that categorizes eight different types of \"user moves\" (adding/removing query terms, etc.), we treat search sessions as sequence data and build n-gram language models to capture user behavior. We evaluated our models in a prediction task. The results suggest that useful patterns of activity can be extracted from user histories. Furthermore, by examining prediction performance under different order n-gram models, we gained insight into the amount of history/context that is associated with different types of user actions. Our work serves as the basis for more refined user models.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126351642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this work, we study similarity measures for text-centric XML documents based on an extended vector space model, which considers both document content and structure. Experimental results based on a benchmark showed superior performance of the proposed measure over the baseline which ignores structural knowledge of XML documents.
{"title":"Measuring similarity of semi-structured documents with context weights","authors":"Christopher C. Yang, Nan Liu","doi":"10.1145/1148170.1148334","DOIUrl":"https://doi.org/10.1145/1148170.1148334","url":null,"abstract":"In this work, we study similarity measures for text-centric XML documents based on an extended vector space model, which considers both document content and structure. Experimental results based on a benchmark showed superior performance of the proposed measure over the baseline which ignores structural knowledge of XML documents.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123391092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present a novel framework for answering complex questions that relies on question decomposition. Complex questions are decomposed by a procedure that operates on a Markov chain, by following a random walk on a bipartite graph of relations established between concepts related to the topic of a complex question and subquestions derived from topic-relevant passages that manifest these relations. Decomposed questions discovered during this random walk are then submitted to a state-of-the-art Question Answering (Q/A) system in order to retrieve a set of passages that can later be merged into a comprehensive answer by a Multi-Document Summarization (MDS) system. In our evaluations, we show that access to the decompositions generated using this method can significantly enhance the relevance and comprehensiveness of summary-length answers to complex questions.
{"title":"Answering complex questions with random walk models","authors":"S. Harabagiu, V. Lacatusu, Andrew Hickl","doi":"10.1145/1148170.1148211","DOIUrl":"https://doi.org/10.1145/1148170.1148211","url":null,"abstract":"We present a novel framework for answering complex questions that relies on question decomposition. Complex questions are decomposed by a procedure that operates on a Markov chain, by following a random walk on a bipartite graph of relations established between concepts related to the topic of a complex question and subquestions derived from topic-relevant passages that manifest these relations. Decomposed questions discovered during this random walk are then submitted to a state-of-the-art Question Answering (Q/A) system in order to retrieve a set of passages that can later be merged into a comprehensive answer by a Multi-Document Summarization (MDS) system. In our evaluations, we show that access to the decompositions generated using this method can significantly enhance the relevance and comprehensiveness of summary-length answers to complex questions.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123767142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The aim of this study is to investigate whether element retrieval (as opposed to full-text retrieval) is meaningful and useful for searchers when carrying out information-seeking tasks. Our results suggest that searchers find the structural breakdown of documents useful when browsing within retrieved documents, and provide support for the usefulness of element retrieval in interactive settings.
{"title":"Is XML retrieval meaningful to users?: searcher preferences for full documents vs. elements","authors":"Birger Larsen, A. Tombros, Saadia Malik","doi":"10.1145/1148170.1148306","DOIUrl":"https://doi.org/10.1145/1148170.1148306","url":null,"abstract":"The aim of this study is to investigate whether element retrieval (as opposed to full-text retrieval) is meaningful and useful for searchers when carrying out information-seeking tasks. Our results suggest that searchers find the structural breakdown of documents useful when browsing within retrieved documents, and provide support for the usefulness of element retrieval in interactive settings.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121813669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Despite its intuitive appeal, the hypothesis that retrieval at the level of "concepts" should outperform purely term-based approaches remains unverified empirically. In addition, the use of "knowledge" has not consistently resulted in performance gains. After identifying possible reasons for previous negative results, we present a novel framework for "conceptual retrieval" that articulates the types of knowledge that are important for information seeking. We instantiate this general framework in the domain of clinical medicine based on the principles of evidence-based medicine (EBM). Experiments show that an EBM-based scoring algorithm dramatically outperforms a state-of-the-art baseline that employs only term statistics. Ablation studies further yield a better understanding of the performance contributions of different components. Finally, we discuss how other domains can benefit from knowledge-based approaches.
{"title":"The role of knowledge in conceptual retrieval: a study in the domain of clinical medicine","authors":"Jimmy J. Lin, Dina Demner-Fushman","doi":"10.1145/1148170.1148191","DOIUrl":"https://doi.org/10.1145/1148170.1148191","url":null,"abstract":"Despite its intuitive appeal, the hypothesis that retrieval at the level of \"concepts\" should outperform purely term-based approaches remains unverified empirically. In addition, the use of \"knowledge\" has not consistently resulted in performance gains. After identifying possible reasons for previous negative results, we present a novel framework for \"conceptual retrieval\" that articulates the types of knowledge that are important for information seeking. We instantiate this general framework in the domain of clinical medicine based on the principles of evidence-based medicine (EBM). Experiments show that an EBM-based scoring algorithm dramatically outperforms a state-of-the-art baseline that employs only term statistics. Ablation studies further yield a better understanding of the performance contributions of different components. Finally, we discuss how other domains can benefit from knowledge-based approaches.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121924237","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present a new family of hybrid index maintenance strategies to be used in on-line index construction for monotonically growing text collections. These new strategies improve upon recent results for hybrid index maintenance in dynamic text retrieval systems. Like previous techniques, our new method distinguishes between short and long posting lists: While short lists are maintained using a merge strategy, long lists are kept separate and are updated in-place. This way, costly relocations of long posting lists are avoided.We discuss the shortcomings of previous hybrid methods and give an experimental evaluation of the new technique, showing that its index maintenance performance is superior to that of the earlier methods, especially when the amount of main memory available to the indexing system is small. We also present a complexity analysis which proves that, under a Zipfian term distribution, the asymptotical number of disk accesses performed by the best hybrid maintenance strategy is linear in the size of the text collection, implying the asymptotical optimality of the proposed strategy.
{"title":"Hybrid index maintenance for growing text collections","authors":"Stefan Büttcher, C. Clarke, Brad Lushman","doi":"10.1145/1148170.1148233","DOIUrl":"https://doi.org/10.1145/1148170.1148233","url":null,"abstract":"We present a new family of hybrid index maintenance strategies to be used in on-line index construction for monotonically growing text collections. These new strategies improve upon recent results for hybrid index maintenance in dynamic text retrieval systems. Like previous techniques, our new method distinguishes between short and long posting lists: While short lists are maintained using a merge strategy, long lists are kept separate and are updated in-place. This way, costly relocations of long posting lists are avoided.We discuss the shortcomings of previous hybrid methods and give an experimental evaluation of the new technique, showing that its index maintenance performance is superior to that of the earlier methods, especially when the amount of main memory available to the indexing system is small. We also present a complexity analysis which proves that, under a Zipfian term distribution, the asymptotical number of disk accesses performed by the best hybrid maintenance strategy is linear in the size of the text collection, implying the asymptotical optimality of the proposed strategy.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130382901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Given a user keyword query, current Web search engines return a list of pages ranked by their “goodness” with respect to the query. However, this technique misses results whose contents are distributed across multiple physical pages and are connected via hyperlinks and frames [3]. That is, it is often the case that no single page contains all query keywords. Li et al. [3] make a first step towards this problem by returning a tree of hyperlinked pages that collectively contain all query keywords. The limitation of this approach is that it operates at the page-level granularity, which ignores the specific context where the keywords are found within the pages. More importantly, it is cumbersome for the user to locate the most desirable tree of pages due to the amount of data in each page tree and a large number of page trees.
{"title":"Searching the web using composed pages","authors":"R. Varadarajan, Vagelis Hristidis, Tao Li","doi":"10.1145/1148170.1148331","DOIUrl":"https://doi.org/10.1145/1148170.1148331","url":null,"abstract":"Given a user keyword query, current Web search engines return a list of pages ranked by their “goodness” with respect to the query. However, this technique misses results whose contents are distributed across multiple physical pages and are connected via hyperlinks and frames [3]. That is, it is often the case that no single page contains all query keywords. Li et al. [3] make a first step towards this problem by returning a tree of hyperlinked pages that collectively contain all query keywords. The limitation of this approach is that it operates at the page-level granularity, which ignores the specific context where the keywords are found within the pages. More importantly, it is cumbersome for the user to locate the most desirable tree of pages due to the amount of data in each page tree and a large number of page trees.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133859759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}