Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management最新文献
Given a set S of servers and a set C of clients, an optimal-location query returns a location where a new server can attract the greatest number of clients. Optimal-location queries are important in a lot of real-life applications, such as mobile service planning or resource distribution in an area. Previous studies assume that a client always visits its nearest server, which is too strict to be true in reality. In this paper, we relax this assumption and propose a new model to tackle this problem. We further generalize the problem to finding top-k optimal locations. The main challenge is that, even the fastest approach in existing studies needs to take hours to answer an optimal-location query on a typical real world dataset, which significantly limits the applications of the query. Using our relaxed model, we design an efficient grid-based approximation algorithm called FILM (Fast Influential Location Miner) to the queries, which is orders of magnitude faster than the best-known previous work and the number of clients attracted by a new server in the result location often exceeds 98% of the optimal. The algorithm is extended to finding k influential locations. Extensive experiments are conducted to show the efficiency and effectiveness of FILM on both real and synthetic datasets.
{"title":"Efficient methods for finding influential locations with adaptive grids","authors":"D. Yan, R. C. Wong, Wilfred Ng","doi":"10.1145/2063576.2063788","DOIUrl":"https://doi.org/10.1145/2063576.2063788","url":null,"abstract":"Given a set S of servers and a set C of clients, an optimal-location query returns a location where a new server can attract the greatest number of clients. Optimal-location queries are important in a lot of real-life applications, such as mobile service planning or resource distribution in an area. Previous studies assume that a client always visits its nearest server, which is too strict to be true in reality. In this paper, we relax this assumption and propose a new model to tackle this problem. We further generalize the problem to finding top-k optimal locations. The main challenge is that, even the fastest approach in existing studies needs to take hours to answer an optimal-location query on a typical real world dataset, which significantly limits the applications of the query. Using our relaxed model, we design an efficient grid-based approximation algorithm called FILM (Fast Influential Location Miner) to the queries, which is orders of magnitude faster than the best-known previous work and the number of clients attracted by a new server in the result location often exceeds 98% of the optimal. The algorithm is extended to finding k influential locations. Extensive experiments are conducted to show the efficiency and effectiveness of FILM on both real and synthetic datasets.","PeriodicalId":74507,"journal":{"name":"Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management","volume":"30 1","pages":"1475-1484"},"PeriodicalIF":0.0,"publicationDate":"2011-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82234173","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semi-structured textual formats are gaining increasing popularity for the storage of document collections and rich logs. Their flexibility comes at the cost of having to load and parse a document entirely even if just a small part of it needs to be accessed. For instance, in data analytics massive collections are usually scanned sequentially, selecting a small number of attributes from each document. We propose a technique to attach to a raw, unparsed document (even in compressed form) a "semi-index": a succinct data structure that supports operations on the document tree at speed comparable with an in-memory deserialized object, thus bridging textual formats with binary formats. After describing the general technique, we focus on the JSON format: our experiments show that avoiding the full loading and parsing step can give speedups of up to 12 times for on-disk documents using a small space overhead.
{"title":"Semi-indexing semi-structured data in tiny space","authors":"G. Ottaviano, R. Grossi","doi":"10.1145/2063576.2063790","DOIUrl":"https://doi.org/10.1145/2063576.2063790","url":null,"abstract":"Semi-structured textual formats are gaining increasing popularity for the storage of document collections and rich logs. Their flexibility comes at the cost of having to load and parse a document entirely even if just a small part of it needs to be accessed. For instance, in data analytics massive collections are usually scanned sequentially, selecting a small number of attributes from each document. We propose a technique to attach to a raw, unparsed document (even in compressed form) a \"semi-index\": a succinct data structure that supports operations on the document tree at speed comparable with an in-memory deserialized object, thus bridging textual formats with binary formats. After describing the general technique, we focus on the JSON format: our experiments show that avoiding the full loading and parsing step can give speedups of up to 12 times for on-disk documents using a small space overhead.","PeriodicalId":74507,"journal":{"name":"Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management","volume":"5 1","pages":"1485-1494"},"PeriodicalIF":0.0,"publicationDate":"2011-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81600276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Imagine you are a social network user who wants to search, in a list of potential candidates, for the best candidate for a job on the basis of their PageRank-induced importance ranking. Is it possible to compute this ranking for a low cost, by visiting only small subnetworks around the nodes that represent each candidate? The fundamental problem underpinning this question, i.e. computing locally the PageRank ranking of k nodes in an $n$-node graph, was first raised by Chen et al. (CIKM 2004) and then restated by Bar-Yossef and Mashiach (CIKM 2008). In this paper we formalize and provide the first analysis of the problem, proving that any local algorithm that computes a correct ranking must take into consideration Ω(√(kn)) nodes -- even when ranking the top $k$ nodes of the graph, even if their PageRank scores are "well separated", and even if the algorithm is randomized (and we prove a stronger Ω(n) bound for deterministic algorithms). Experiments carried out on large, publicly available crawls of the web and of a social network show that also in practice the fraction of the graph to be visited to compute the ranking may be considerable, both for algorithms that are always correct and for algorithms that employ (efficient) local score approximations.
{"title":"Local computation of PageRank: the ranking side","authors":"M. Bressan, Luca Pretto","doi":"10.1145/2063576.2063670","DOIUrl":"https://doi.org/10.1145/2063576.2063670","url":null,"abstract":"Imagine you are a social network user who wants to search, in a list of potential candidates, for the best candidate for a job on the basis of their PageRank-induced importance ranking. Is it possible to compute this ranking for a low cost, by visiting only small subnetworks around the nodes that represent each candidate? The fundamental problem underpinning this question, i.e. computing locally the PageRank ranking of k nodes in an $n$-node graph, was first raised by Chen et al. (CIKM 2004) and then restated by Bar-Yossef and Mashiach (CIKM 2008). In this paper we formalize and provide the first analysis of the problem, proving that any local algorithm that computes a correct ranking must take into consideration Ω(√(kn)) nodes -- even when ranking the top $k$ nodes of the graph, even if their PageRank scores are \"well separated\", and even if the algorithm is randomized (and we prove a stronger Ω(n) bound for deterministic algorithms). Experiments carried out on large, publicly available crawls of the web and of a social network show that also in practice the fraction of the graph to be visited to compute the ranking may be considerable, both for algorithms that are always correct and for algorithms that employ (efficient) local score approximations.","PeriodicalId":74507,"journal":{"name":"Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management","volume":"28 1","pages":"631-640"},"PeriodicalIF":0.0,"publicationDate":"2011-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82341758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Konstantin Tretyakov, Abel Armas-Cervantes, L. García-Bañuelos, J. Vilo, M. Dumas
Computing the shortest path between a pair of vertices in a graph is a fundamental primitive in graph algorithmics. Classical exact methods for this problem do not scale up to contemporary, rapidly evolving social networks with hundreds of millions of users and billions of connections. A number of approximate methods have been proposed, including several landmark-based methods that have been shown to scale up to very large graphs with acceptable accuracy. This paper presents two improvements to existing landmark-based shortest path estimation methods. The first improvement relates to the use of shortest-path trees (SPTs). Together with appropriate short-cutting heuristics, the use of SPTs allows to achieve higher accuracy with acceptable time and memory overhead. Furthermore, SPTs can be maintained incrementally under edge insertions and deletions, which allows for a fully-dynamic algorithm. The second improvement is a new landmark selection strategy that seeks to maximize the coverage of all shortest paths by the selected landmarks. The improved method is evaluated on the DBLP, Orkut, Twitter and Skype social networks.
{"title":"Fast fully dynamic landmark-based estimation of shortest path distances in very large graphs","authors":"Konstantin Tretyakov, Abel Armas-Cervantes, L. García-Bañuelos, J. Vilo, M. Dumas","doi":"10.1145/2063576.2063834","DOIUrl":"https://doi.org/10.1145/2063576.2063834","url":null,"abstract":"Computing the shortest path between a pair of vertices in a graph is a fundamental primitive in graph algorithmics. Classical exact methods for this problem do not scale up to contemporary, rapidly evolving social networks with hundreds of millions of users and billions of connections. A number of approximate methods have been proposed, including several landmark-based methods that have been shown to scale up to very large graphs with acceptable accuracy. This paper presents two improvements to existing landmark-based shortest path estimation methods. The first improvement relates to the use of shortest-path trees (SPTs). Together with appropriate short-cutting heuristics, the use of SPTs allows to achieve higher accuracy with acceptable time and memory overhead. Furthermore, SPTs can be maintained incrementally under edge insertions and deletions, which allows for a fully-dynamic algorithm. The second improvement is a new landmark selection strategy that seeks to maximize the coverage of all shortest paths by the selected landmarks. The improved method is evaluated on the DBLP, Orkut, Twitter and Skype social networks.","PeriodicalId":74507,"journal":{"name":"Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management","volume":"15 1","pages":"1785-1794"},"PeriodicalIF":0.0,"publicationDate":"2011-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82523921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present a new image search and ranking algorithm for retrieving unannotated images by collaboratively mining online search results which consist of online image and text search results. The online image search results are leveraged as reference examples to perform content-based image search over unannotated images. The online text search results are utilized to estimate the reference images' relevance to the search query. The key feature of our method is its capability to deal with unreliable online image search results through jointly mining visual and textual aspects of online search results. Through such collaborative mining, our algorithm infers the relevance of an online search result image to a text query. Once we obtain the estimate of query relevance score for each online image search result, we can selectively use query specific online search result images as reference examples for retrieving and ranking unannotated images. We tested our algorithm both on the standard public image datasets and several modestly sized personal photo collections. We also compared our method with two well-known peer methods. The results indicate that our algorithm is superior to existing content-based image search algorithms for retrieving and ranking unannotated images.
{"title":"Retrieving and ranking unannotated images through collaboratively mining online search results","authors":"Songhua Xu, Hao Jiang, F. Lau","doi":"10.1145/2063576.2063650","DOIUrl":"https://doi.org/10.1145/2063576.2063650","url":null,"abstract":"We present a new image search and ranking algorithm for retrieving unannotated images by collaboratively mining online search results which consist of online image and text search results. The online image search results are leveraged as reference examples to perform content-based image search over unannotated images. The online text search results are utilized to estimate the reference images' relevance to the search query. The key feature of our method is its capability to deal with unreliable online image search results through jointly mining visual and textual aspects of online search results. Through such collaborative mining, our algorithm infers the relevance of an online search result image to a text query. Once we obtain the estimate of query relevance score for each online image search result, we can selectively use query specific online search result images as reference examples for retrieving and ranking unannotated images. We tested our algorithm both on the standard public image datasets and several modestly sized personal photo collections. We also compared our method with two well-known peer methods. The results indicate that our algorithm is superior to existing content-based image search algorithms for retrieving and ranking unannotated images.","PeriodicalId":74507,"journal":{"name":"Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management","volume":"25 1","pages":"485-494"},"PeriodicalIF":0.0,"publicationDate":"2011-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80909168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Andrey Styskin, Fedor Romanenko, F. Vorobyev, P. Serdyukov
In this paper, we propose a web search retrieval approach which automatically detects recency sensitive queries and increases the freshness of the ordinary document ranking by a degree proportional to the probability of the need in recent content. We propose to solve the recency ranking problem by using result diversification principles and deal with the query's non-topical ambiguity appearing when the need in recent content can be detected only with uncertainty. Our offine and online experiments with millions of queries from real search engine users demonstrate the significant increase in satisfaction of users presented with a search result generated by our approach.
{"title":"Recency ranking by diversification of result set","authors":"Andrey Styskin, Fedor Romanenko, F. Vorobyev, P. Serdyukov","doi":"10.1145/2063576.2063862","DOIUrl":"https://doi.org/10.1145/2063576.2063862","url":null,"abstract":"In this paper, we propose a web search retrieval approach which automatically detects recency sensitive queries and increases the freshness of the ordinary document ranking by a degree proportional to the probability of the need in recent content. We propose to solve the recency ranking problem by using result diversification principles and deal with the query's non-topical ambiguity appearing when the need in recent content can be detected only with uncertainty. Our offine and online experiments with millions of queries from real search engine users demonstrate the significant increase in satisfaction of users presented with a search result generated by our approach.","PeriodicalId":74507,"journal":{"name":"Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management","volume":"25 1","pages":"1949-1952"},"PeriodicalIF":0.0,"publicationDate":"2011-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83308498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Managing Interoperability and Complexity in Health Systems, MIXHS'11, aims to be a forum focussing on recent research and technical results in knowledge management and information systems in bio-medical and electronic health systems. The workshop will provide an opportunity for sharing practical experiences and best practices in e-Health information infrastructure development and management. Of particular interest to the workshop themes are technical solutions to recurring practical systems deployment issues, including harnessing the complexity of bio-medical domain knowledge and the interoperability of heterogeneous health systems. The workshop will gather experts, researchers, system developers, practitioners and policymakers designing and implementing solutions for managing clinical data and integrating existing and future electronic health systems infrastructures.
{"title":"Managing interoperability and complexity inhealth systems: MIXHS'11 workshop summary","authors":"M. Bouamrane, C. Tao","doi":"10.1145/2063576.2064050","DOIUrl":"https://doi.org/10.1145/2063576.2064050","url":null,"abstract":"Managing Interoperability and Complexity in Health Systems, MIXHS'11, aims to be a forum focussing on recent research and technical results in knowledge management and information systems in bio-medical and electronic health systems. The workshop will provide an opportunity for sharing practical experiences and best practices in e-Health information infrastructure development and management. Of particular interest to the workshop themes are technical solutions to recurring practical systems deployment issues, including harnessing the complexity of bio-medical domain knowledge and the interoperability of heterogeneous health systems. The workshop will gather experts, researchers, system developers, practitioners and policymakers designing and implementing solutions for managing clinical data and integrating existing and future electronic health systems infrastructures.","PeriodicalId":74507,"journal":{"name":"Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management","volume":"115 1","pages":"2635-2636"},"PeriodicalIF":0.0,"publicationDate":"2011-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89313577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Detecting and eliminating duplicates in databases is a task of critical importance in many applications. Although solutions for traditional models, such as relational data, have been widely studied, recently there has been some focus on solutions for more complex hierarchical structures as, for instance, XML data. Such data presents many different challenges, among which is the issue of how to exploit the schema structure to determine if two objects are duplicates. In this paper, we argue that structure can indeed have a significant impact on the process of duplicate detection. We propose a novel method that automatically restructures database objects in order to take full advantage of the relations between its attributes. This new structure reflects the relative importance of the attributes in the database and avoids the need to perform a manual selection. To test our approach we applied it to an existing duplicate detection system. Experiments performed on several datasets show that, using the new learned structure, we consistently outperform both the results obtained with the original database structure and those obtained by letting a knowledgeable user manually choose the attributes to compare.
{"title":"Duplicate detection through structure optimization","authors":"Luís Leitão, P. Calado","doi":"10.1145/2063576.2063644","DOIUrl":"https://doi.org/10.1145/2063576.2063644","url":null,"abstract":"Detecting and eliminating duplicates in databases is a task of critical importance in many applications. Although solutions for traditional models, such as relational data, have been widely studied, recently there has been some focus on solutions for more complex hierarchical structures as, for instance, XML data. Such data presents many different challenges, among which is the issue of how to exploit the schema structure to determine if two objects are duplicates. In this paper, we argue that structure can indeed have a significant impact on the process of duplicate detection. We propose a novel method that automatically restructures database objects in order to take full advantage of the relations between its attributes. This new structure reflects the relative importance of the attributes in the database and avoids the need to perform a manual selection. To test our approach we applied it to an existing duplicate detection system. Experiments performed on several datasets show that, using the new learned structure, we consistently outperform both the results obtained with the original database structure and those obtained by letting a knowledgeable user manually choose the attributes to compare.","PeriodicalId":74507,"journal":{"name":"Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management","volume":"12 1","pages":"443-452"},"PeriodicalIF":0.0,"publicationDate":"2011-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89826857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sundararajan Sellamanickam, Priyanka Garg, S. Keerthi
A large fraction of binary classification problems arising in web applications are of the type where the positive class is well defined and compact while the negative class comprises everything else in the distribution for which the classifier is developed; it is hard to represent and sample from such a broad negative class. Classifiers based only on positive and unlabeled examples reduce human annotation effort significantly by removing the burden of choosing a representative set of negative examples. Various methods have been proposed in the literature for building such classifiers. Of these, the state of the art methods are Biased SVM and Elkan & Noto's methods. While these methods often work well in practice, they are computationally expensive since hyperparameter tuning is very important, particularly when the size of labeled positive examples set is small and class imbalance is high. In this paper we propose a pairwise ranking based approach to learn from positive and unlabeled examples (LPU) and we give a theoretical justification for it. We present a pairwise RankSVM (RSVM) based method for our approach. The method is simple, efficient, and its hyperparameters are easy to tune. A detailed experimental study using several benchmark datasets shows that the proposed method gives competitive classification performance compared to the mentioned state of the art methods, while training 3-10 times faster. We also propose an efficient AUC based feature selection technique in the LPU setting and demonstrate its usefulness on the datasets. To get an idea of the goodness of the LPU methods we compare them against supervised learning (SL) methods that also make use of negative examples in training. SL methods give a slightly better performance than LPU methods when there is a rich set of negative examples; however, they are inferior when the number of negative training examples is not large enough.
{"title":"A pairwise ranking based approach to learning with positive and unlabeled examples","authors":"Sundararajan Sellamanickam, Priyanka Garg, S. Keerthi","doi":"10.1145/2063576.2063675","DOIUrl":"https://doi.org/10.1145/2063576.2063675","url":null,"abstract":"A large fraction of binary classification problems arising in web applications are of the type where the positive class is well defined and compact while the negative class comprises everything else in the distribution for which the classifier is developed; it is hard to represent and sample from such a broad negative class. Classifiers based only on positive and unlabeled examples reduce human annotation effort significantly by removing the burden of choosing a representative set of negative examples. Various methods have been proposed in the literature for building such classifiers. Of these, the state of the art methods are Biased SVM and Elkan & Noto's methods. While these methods often work well in practice, they are computationally expensive since hyperparameter tuning is very important, particularly when the size of labeled positive examples set is small and class imbalance is high. In this paper we propose a pairwise ranking based approach to learn from positive and unlabeled examples (LPU) and we give a theoretical justification for it. We present a pairwise RankSVM (RSVM) based method for our approach. The method is simple, efficient, and its hyperparameters are easy to tune. A detailed experimental study using several benchmark datasets shows that the proposed method gives competitive classification performance compared to the mentioned state of the art methods, while training 3-10 times faster. We also propose an efficient AUC based feature selection technique in the LPU setting and demonstrate its usefulness on the datasets. To get an idea of the goodness of the LPU methods we compare them against supervised learning (SL) methods that also make use of negative examples in training. SL methods give a slightly better performance than LPU methods when there is a rich set of negative examples; however, they are inferior when the number of negative training examples is not large enough.","PeriodicalId":74507,"journal":{"name":"Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management","volume":"7 1","pages":"663-672"},"PeriodicalIF":0.0,"publicationDate":"2011-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89862696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A growing number of applications are built on top of search engines and issue complex structured queries. This paper contributes a customisable ranking-based processing of such queries, specifically SQL. Similar to how term-based statistics are exploited by term-based retrieval models, ranking-aware processing of SQL queries exploits tuple-based statistics that are derived from sources or, more precisely, derived from the relations specified in the SQL query. To implement this ranking-based processing, we leverage PSQL, a probabilistic variant of SQL, to facilitate probability estimation and the generalisation of document retrieval models to be used for tuple retrieval. The result is a general-purpose framework that can interpret any SQL query and then assign a probabilistic retrieval model to rank the results of that query. The evaluation on the IMDB and Monster benchmarks proves that the PSQL-based approach is applicable to (semi-)structured and unstructured data and structured queries.
{"title":"Ranking-based processing of SQL queries","authors":"H. Azzam, T. Roelleke, Sirvan Yahyaei","doi":"10.1145/2063576.2063614","DOIUrl":"https://doi.org/10.1145/2063576.2063614","url":null,"abstract":"A growing number of applications are built on top of search engines and issue complex structured queries. This paper contributes a customisable ranking-based processing of such queries, specifically SQL. Similar to how term-based statistics are exploited by term-based retrieval models, ranking-aware processing of SQL queries exploits tuple-based statistics that are derived from sources or, more precisely, derived from the relations specified in the SQL query. To implement this ranking-based processing, we leverage PSQL, a probabilistic variant of SQL, to facilitate probability estimation and the generalisation of document retrieval models to be used for tuple retrieval. The result is a general-purpose framework that can interpret any SQL query and then assign a probabilistic retrieval model to rank the results of that query. The evaluation on the IMDB and Monster benchmarks proves that the PSQL-based approach is applicable to (semi-)structured and unstructured data and structured queries.","PeriodicalId":74507,"journal":{"name":"Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management","volume":"35 1","pages":"231-236"},"PeriodicalIF":0.0,"publicationDate":"2011-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86505244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management