Efficient two-party private blocking based on sorted nearest neighborhood clustering
Dinusha Vatsalan, P. Christen, Vassilios S. Verykios
CIKM 2013. doi:10.1145/2505515.2505757

Integrating data from diverse sources with the aim of identifying similar records that refer to the same real-world entities, without compromising the privacy of these entities, is an emerging research problem in various domains. This problem is known as privacy-preserving record linkage (PPRL). The scalability of PPRL is a main challenge due to the growing size of data in real-world applications. Private blocking techniques have been used in PPRL to address this challenge by reducing the number of record pair comparisons that need to be conducted. Many of these private blocking techniques require a trusted third party to perform the blocking. One main threat in three-party solutions is collusion between parties to identify the private data of another party. We introduce a novel two-party private blocking technique for PPRL based on sorted nearest neighborhood clustering. Privacy is addressed by a combination of two privacy techniques: k-anonymous clustering and public reference values. Experiments conducted on two real-world databases validate that our approach is scalable to large databases and effective in generating candidate record pairs that correspond to true matches, while preserving k-anonymous privacy characteristics. Our approach also performs as well as or better than three other state-of-the-art private blocking techniques in terms of scalability, blocking quality, and privacy, and it can achieve private blocking up to two orders of magnitude faster than those approaches.
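
The sorted neighborhood method that the paper builds on is easiest to see in its non-private form: sort records by a blocking key and compare only records that fall within a sliding window. A minimal sketch follows; the records, key, and window size are illustrative, and the paper's actual contribution is performing this privately across two parties:

```python
# Non-private sorted neighborhood blocking: sort by a blocking key and
# generate candidate pairs only within a sliding window of the sorted order.
def sorted_neighborhood_pairs(records, key, window=3):
    """Pair up records that fall within `window` positions of each other
    after sorting by the blocking key."""
    ordered = sorted(records, key=key)
    pairs = []
    for i in range(len(ordered)):
        for j in range(i + 1, min(i + window, len(ordered))):
            pairs.append((ordered[i], ordered[j]))
    return pairs

records = [{"name": "smith"}, {"name": "smyth"}, {"name": "jones"}, {"name": "johns"}]
candidates = sorted_neighborhood_pairs(records, key=lambda r: r["name"], window=2)
# Similar names (e.g., smith/smyth) end up adjacent and are compared,
# instead of comparing all O(n^2) pairs.
print(candidates)
```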
{"title":"Efficient two-party private blocking based on sorted nearest neighborhood clustering","authors":"Dinusha Vatsalan, P. Christen, Vassilios S. Verykios","doi":"10.1145/2505515.2505757","DOIUrl":"https://doi.org/10.1145/2505515.2505757","url":null,"abstract":"Integrating data from diverse sources with the aim to identify similar records that refer to the same real-world entities without compromising privacy of these entities is an emerging research problem in various domains. This problem is known as privacy-preserving record linkage (PPRL). Scalability of PPRL is a main challenge due to growing data size in real-world applications. Private blocking techniques have been used in PPRL to address this challenge by reducing the number of record pair comparisons that need to be conducted. Many of these private blocking techniques require a trusted third party to perform the blocking. One main threat with three-party solutions is the collusion between parties to identify the private data of another party. We introduce a novel two-party private blocking technique for PPRL based on sorted nearest neighborhood clustering. Privacy is addressed by a combination of the privacy techniques k-anonymous clustering and public reference values. Experiments conducted on two real-world databases validate that our approach is scalable to large databases and effective in generating candidate record pairs that correspond to true matches, while preserving k-anonymous privacy characteristics. Our approach also performs equal or superior compared to three other state-of-the-art private blocking techniques in terms of scalability, blocking quality, and privacy. It can achieve private blocking up-to two magnitudes faster than other state-of-the art private blocking approaches.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"92 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78056721","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Efficient parsing-based search over structured data
Aditya G. Parameswaran, R. Kaushik, A. Arasu
CIKM 2013. doi:10.1145/2505515.2505764

Parsing-based search, i.e., parsing keyword search queries using grammars, is often used to override the traditional "bag-of-words" semantics in web search and enterprise search scenarios. Compared to the "bag-of-words" semantics, parsing-based semantics is richer and more customizable. While a formalism for parsing-based semantics for keyword search has been proposed in prior work and ad hoc implementations exist, the problem of designing efficient algorithms to support these semantics is largely unstudied. In this paper, we present a suite of efficient algorithms and auxiliary indexes for this problem. Our algorithms work for a broad class of grammars used in practice, and cover a variety of database matching functions (set- and substring-containment, approximate and exact equality) and scoring functions (to filter and rank different parses). We formally analyze the time complexity of our algorithms and provide an empirical evaluation over real-world data to show that our algorithms scale well with the size of the database and grammar.
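
To illustrate what parsing-based semantics adds over bag-of-words, here is a toy sketch in which a grammar rule's terminals are matched against database values (exact equality only); the grammar, tables, and query are invented for illustration and are far simpler than the grammar classes the paper supports:

```python
# Toy grammar: QUERY -> PRODUCT "in" CITY, where PRODUCT and CITY match by
# exact-equality lookup against database value sets.
CITIES = {"paris", "london"}
PRODUCTS = {"hotels", "flights"}

def parse(tokens):
    """Return a structured interpretation if the grammar matches,
    otherwise None (caller would fall back to bag-of-words retrieval)."""
    if (len(tokens) == 3 and tokens[0] in PRODUCTS
            and tokens[1] == "in" and tokens[2] in CITIES):
        return {"product": tokens[0], "city": tokens[2]}
    return None

print(parse("hotels in paris".split()))  # {'product': 'hotels', 'city': 'paris'}
print(parse("cheap red shoes".split())) # None -> bag-of-words fallback
```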
{"title":"Efficient parsing-based search over structured data","authors":"Aditya G. Parameswaran, R. Kaushik, A. Arasu","doi":"10.1145/2505515.2505764","DOIUrl":"https://doi.org/10.1145/2505515.2505764","url":null,"abstract":"Parsing-based search, i.e., parsing keyword search queries using grammars, is often used to override the traditional \"bag-of-words'\" semantics in web search and enterprise search scenarios. Compared to the \"bag-of-words\" semantics, the parsing-based semantics is richer and more customizable. While a formalism for parsing-based semantics for keyword search has been proposed in prior work and ad-hoc implementations exist, the problem of designing efficient algorithms to support the semantics is largely unstudied. In this paper, we present a suite of efficient algorithms and auxiliary indexes for this problem. Our algorithms work for a broad classes of grammars used in practice, and cover a variety of database matching functions (set- and substring-containment, approximate and exact equality) and scoring functions (to filter and rank different parses). We formally analyze the time complexity of our algorithms and provide an empirical evaluation over real-world data to show that our algorithms scale well with the size of the database and grammar.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"197 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72821445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Location prediction in social media based on tie strength
Jeffrey McGee, James Caverlee, Zhiyuan Cheng
CIKM 2013. doi:10.1145/2505515.2505544

We propose a novel network-based approach for location estimation in social media that integrates evidence of the social tie strength between users to improve location estimation. Concretely, we propose a location estimator, FriendlyLocation, that leverages the relationship between the strength of the tie between a pair of users and the distance between the pair. Based on an examination of over 100 million geo-encoded tweets and 73 million Twitter user profiles, we identify several factors, such as the number of followers and how the users interact, that can strongly reveal the distance between a pair of users. We use these factors to train a decision tree to distinguish between pairs of users who are likely to live nearby and pairs of users who are likely to live in different areas. We use the results of this decision tree as the input to a maximum likelihood estimator to predict a user's location. We find that this proposed method significantly improves the results of location estimation relative to a state-of-the-art technique. Our system reduces the average error distance for 80% of Twitter users from 40 miles to 21 miles using only information from the user's friends and friends-of-friends, which has great significance for augmenting traditional social media and enriching location-based services with more refined and accurate location estimates.
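
A rough sketch of the two-stage design described above, under several assumptions: the tie-strength features, training data, and likelihood form are invented here, and scikit-learn's DecisionTreeClassifier merely stands in for the paper's tree learner:

```python
# Stage 1: a decision tree predicts from tie-strength features whether a
# pair of users lives nearby. Stage 2: the tree's probabilities feed a
# simple maximum-likelihood choice over the friends' known locations.
import math
from collections import defaultdict
from sklearn.tree import DecisionTreeClassifier

# Pair features: [followers in common, mentions between pair]; label 1 = nearby.
X = [[50, 9], [3, 0], [40, 5], [1, 1], [60, 12], [2, 0]]
y = [1, 0, 1, 0, 1, 0]
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Friends of the target user: (features of the pair, friend's known city).
friends = [([45, 7], "austin"), ([2, 0], "boston"), ([55, 10], "austin")]

score = defaultdict(float)
for features, city in friends:
    p_nearby = tree.predict_proba([features])[0][1]   # P(pair lives nearby)
    score[city] += math.log(max(p_nearby, 1e-9))      # log-likelihood of `city`
print(max(score, key=score.get))  # most likely location given the friends
```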
{"title":"Location prediction in social media based on tie strength","authors":"Jeffrey McGee, James Caverlee, Zhiyuan Cheng","doi":"10.1145/2505515.2505544","DOIUrl":"https://doi.org/10.1145/2505515.2505544","url":null,"abstract":"We propose a novel network-based approach for location estimation in social media that integrates evidence of the social tie strength between users for improved location estimation. Concretely, we propose a location estimator -- FriendlyLocation -- that leverages the relationship between the strength of the tie between a pair of users, and the distance between the pair. Based on an examination of over 100 million geo-encoded tweets and 73 million Twitter user profiles, we identify several factors such as the number of followers and how the users interact that can strongly reveal the distance between a pair of users. We use these factors to train a decision tree to distinguish between pairs of users who are likely to live nearby and pairs of users who are likely to live in different areas. We use the results of this decision tree as the input to a maximum likelihood estimator to predict a user's location. We find that this proposed method significantly improves the results of location estimation relative to a state-of-the-art technique. Our system reduces the average error distance for 80% of Twitter users from 40 miles to 21 miles using only information from the user's friends and friends-of-friends, which has great significance for augmenting traditional social media and enriching location-based services with more refined and accurate location estimates.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"56 4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76808050","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Query matching for report recommendation
Veronika Thost, Konrad Voigt, Daniel Schuster
CIKM 2013. doi:10.1145/2505515.2505562

Today, reporting is an essential part of everyday business life. But the preparation of complex business intelligence data, by formulating relevant queries and presenting them in meaningful visualizations, so-called reports, is a challenging task for non-expert database users. To support these users in report creation, we leverage existing queries and present a system for query recommendation in a reporting environment based on query matching. Targeting large-scale, real-world reporting scenarios, we propose a scalable, index-based query matching approach. Moreover, schema matching is applied for a more fine-grained, structural comparison of the queries. In addition to interactively providing content-based query recommendations of good quality, the system works independently of particular data sources and query languages. We evaluate our system on an empirical data set and show that it achieves an F1-measure of 0.56 and outperforms the approaches applied by state-of-the-art reporting tools (e.g., keyword search) by up to 30%.
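
A minimal sketch of index-based query matching under simplifying assumptions: saved report queries are reduced to sets of terms (e.g., referenced columns), indexed with an inverted index, and ranked by Jaccard similarity. The real system additionally applies schema matching for the structural comparison:

```python
# Inverted index over saved report queries, each reduced to a term set;
# a new query is matched via the index and ranked by Jaccard similarity.
from collections import defaultdict

saved_queries = {
    "q1": {"revenue", "region", "quarter"},
    "q2": {"revenue", "product", "year"},
    "q3": {"headcount", "department"},
}

index = defaultdict(set)  # term -> ids of queries containing it
for qid, terms in saved_queries.items():
    for t in terms:
        index[t].add(qid)

def recommend(new_query, k=2):
    """Look up candidate queries via the index and rank them by Jaccard."""
    candidates = set().union(*(index[t] for t in new_query if t in index))
    jaccard = lambda a, b: len(a & b) / len(a | b)
    return sorted(candidates,
                  key=lambda qid: -jaccard(new_query, saved_queries[qid]))[:k]

print(recommend({"revenue", "region", "year"}))  # e.g. ['q1', 'q2']
```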
{"title":"Query matching for report recommendation","authors":"Veronika Thost, Konrad Voigt, Daniel Schuster","doi":"10.1145/2505515.2505562","DOIUrl":"https://doi.org/10.1145/2505515.2505562","url":null,"abstract":"Today, reporting is an essential part of everyday business life. But the preparation of complex Business Intelligence data by formulating relevant queries and presenting them in meaningful visualizations, so-called reports, is a challenging task for non-expert database users. To support these users with report creation, we leverage existing queries and present a system for query recommendation in a reporting environment, which is based on query matching. Targeting at large-scale, real-world reporting scenarios, we propose a scalable, index-based query matching approach. Moreover, schema matching is applied for a more fine-grained, structural comparison of the queries. In addition to interactively providing content-based query recommendations of good quality, the system works independent of particular data sources or query languages. We evaluate our system with an empirical data set and show that it achieves an F1-Measure of 0.56 and outperforms the approaches applied by state-of-the-art reporting tools (e.g., keyword search) by up to 30%.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78659540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

PAGE: a partition aware graph computation engine
Yingxia Shao, Junjie Yao, B. Cui, Lin Ma
CIKM 2013. doi:10.1145/2505515.2505617

Graph partitioning is one of the key components of parallel graph computation, and partition quality significantly affects overall computing performance. In existing graph computing systems, "good" partition schemes are preferred, as they have a smaller edge cut ratio and hence reduce the communication cost among working nodes. However, in an empirical study on Giraph [1], we found that performance on a well-partitioned graph might be up to two times worse than on simple partitions. The cause is that the local message processing cost in graph computing systems may surpass the communication cost in several cases. In this paper, we analyze the cost of parallel graph computing systems as well as the relationship between this cost and the underlying graph partitioning. Based on these observations, we propose a novel partition-aware graph computation engine named PAGE. PAGE is equipped with two newly designed modules: a communication module with a dual concurrent message processor, and a partition-aware module that monitors the system's status. The monitored information is used to dynamically adjust the concurrency of the dual concurrent message processor with a novel Dynamic Concurrency Control Model (DCCM). The DCCM applies several heuristic rules to determine the optimal concurrency for the message processor. We have implemented a prototype of PAGE and conducted extensive studies on a moderately sized cluster. The experimental results clearly demonstrate PAGE's robustness under different graph partition qualities and show its advantages over existing systems, with up to 59% improvement.
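
The abstract does not spell out DCCM's heuristic rules, so the following is only a guessed illustration of the general idea: monitor local versus remote message volume and split a fixed thread budget between the two message processors in proportion to the observed traffic:

```python
# Illustrative concurrency-control heuristic (not PAGE's actual DCCM rules):
# assign threads to local vs. remote message processing proportionally to
# the message volume observed in the last superstep.
def adjust_concurrency(local_msgs, remote_msgs, total_threads=8):
    """Split a fixed thread budget between local and remote message
    processors in proportion to observed message counts."""
    total = local_msgs + remote_msgs
    if total == 0:
        return total_threads // 2, total_threads - total_threads // 2
    local_threads = max(1, round(total_threads * local_msgs / total))
    local_threads = min(local_threads, total_threads - 1)  # keep >=1 remote thread
    return local_threads, total_threads - local_threads

# A poorly partitioned graph generates mostly remote messages:
print(adjust_concurrency(local_msgs=2_000, remote_msgs=14_000))  # -> (1, 7)
```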
{"title":"PAGE: a partition aware graph computation engine","authors":"Yingxia Shao, Junjie Yao, B. Cui, Lin Ma","doi":"10.1145/2505515.2505617","DOIUrl":"https://doi.org/10.1145/2505515.2505617","url":null,"abstract":"Graph partitioning is one of the key components in parallel graph computation, and the partition quality significantly affects the overall computing performance. In the existing graph computing systems, ``good'' partition schemes are preferred as they have smaller edge cut ratio and hence reduce the communication cost among working nodes. However, in an empirical study on Giraph[1], we found that the performance over well partitioned graph might be even two times worse than simple partitions. The cause is that the local message processing cost in graph computing systems may surpass the communication cost in several cases. In this paper, we analyse the cost of parallel graph computing systems as well as the relationship between the cost and underlying graph partitioning. Based on these observation, we propose a novel Partition Aware Graph computation Engine named PAGE. PAGE is equipped with two newly designed modules, i.e., the communication module with a dual concurrent message processor, and a partition aware one to monitor the system's status. The monitored information can be utilized to dynamically adjust the concurrency of dual concurrent message processor with a novel Dynamic Concurrency Control Model (DCCM). The DCCM applies several heuristic rules to determine the optimal concurrency for the message processor. We have implemented a prototype of PAGE and conducted extensive studies on a moderate size of cluster. The experimental results clearly demonstrate the PAGE's robustness under different graph partition qualities and show its advantages over existing systems with up to 59% improvement.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76440846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

An analysis of crowd workers mistakes for specific and complex relevance assessment task
J. Anderton, Maryam Bashir, Virgil Pavlu, J. Aslam
CIKM 2013. doi:10.1145/2505515.2507884

The TREC 2012 Crowdsourcing track asked participants to crowdsource relevance assessments with the goal of replicating costly expert judgments with relatively fast, inexpensive, but less reliable judgments from anonymous online workers. The track used 10 "ad hoc" queries that are highly specific and complex (compared to web search). The crowdsourced assessments were evaluated against expert judgments made by highly trained and capable human analysts in 1999 as part of the ad hoc track collection construction. Since most crowdsourcing approaches submitted to the TREC 2012 track produced assessment sets nowhere close to the expert judgments, we decided to analyze the crowdsourcing mistakes made on this task using data we collected via Amazon's Mechanical Turk service. We investigate two types of crowdsourcing approaches: one that asks for nominal relevance grades for each document, and one that asks for preferences on many (though not all) pairs of documents.
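
To make the two elicitation designs concrete, here is a toy contrast with invented data (this is not the track's official aggregation procedure): nominal grades aggregated by majority vote versus pairwise preferences aggregated into a ranking by win counts:

```python
# Two ways to aggregate crowd input on document relevance.
from collections import Counter

# Design 1: each worker assigns a nominal grade per document; take the majority.
grades = {"doc1": ["relevant", "relevant", "not"],
          "doc2": ["not", "not", "relevant"]}
majority = {d: Counter(g).most_common(1)[0][0] for d, g in grades.items()}
print(majority)  # {'doc1': 'relevant', 'doc2': 'not'}

# Design 2: workers compare pairs of documents; rank documents by wins.
prefs = [("doc1", "doc2"), ("doc1", "doc3"), ("doc3", "doc2")]  # (winner, loser)
docs = {"doc1", "doc2", "doc3"}
wins = Counter(w for w, _ in prefs)
print(sorted(docs, key=lambda d: -wins[d]))  # ['doc1', 'doc3', 'doc2']
```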
{"title":"An analysis of crowd workers mistakes for specific and complex relevance assessment task","authors":"J. Anderton, Maryam Bashir, Virgil Pavlu, J. Aslam","doi":"10.1145/2505515.2507884","DOIUrl":"https://doi.org/10.1145/2505515.2507884","url":null,"abstract":"The TREC 2012 Crowdsourcing track asked participants to crowdsource relevance assessments with the goal of replicating costly expert judgements with relatively fast, inexpensive, but less reliable judgements from anonymous online workers. The track used 10 \"ad-hoc\" queries, highly specific and complex (as compared to web search). The crowdsourced assessments were evaluated against expert judgments made by highly trained and capable human analysts in 1999 as part of ad hoc track collection construction. Since most crowdsourcing approaches submitted to the TREC 2012 track produced assessment sets nowhere close to the expert judgements, we decided to analyze crowdsourcing mistakes made on this task using data we collected via Amazon's Mechanical Turk service. We investigate two types of crowdsourcing approaches: one that asks for nominal relevance grades for each document, and the other that asks for preferences on many (not all) pairs of documents.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"20 3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76009689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Mining diabetes complication and treatment patterns for clinical decision support
Lu Liu, Jie Tang, Yu Cheng, Ankit Agrawal, W. Liao, A. Choudhary
CIKM 2013. doi:10.1145/2505515.2505549

The rapid development of hospital information systems (HIS) produces a large volume of electronic medical records, which provides a comprehensive source for exploratory analysis and statistics to support clinical decision-making. In this paper, we investigate how to utilize these heterogeneous medical records to aid the clinical treatment of diabetes mellitus. Diabetes mellitus, or simply diabetes, is a group of metabolic diseases that is often accompanied by many complications. We propose a Symptom-Diagnosis-Treatment model to mine diabetes complication patterns and to unveil the latent association mechanism between treatments and symptoms from a large volume of electronic medical records. Furthermore, we study the demographic statistics of the patient population w.r.t. complication patterns in real data and observe several interesting phenomena. The discovered complication and treatment patterns can help physicians better understand their specialty and learn from previous experiences. Our experiments on a collection of one year of diabetes clinical records from a well-known geriatric hospital demonstrate the effectiveness of our approaches.
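
One plausible ingredient of complication-pattern mining, sketched here with invented records and an invented support threshold, is frequent-itemset-style counting of diagnoses that co-occur across patients; the paper's Symptom-Diagnosis-Treatment model is richer than this:

```python
# Count co-occurring diagnosis pairs across patient records and keep the
# pairs that meet a minimum support threshold (frequent-itemset style).
from collections import Counter
from itertools import combinations

patients = [
    {"diabetes", "hypertension", "nephropathy"},
    {"diabetes", "hypertension"},
    {"diabetes", "retinopathy", "hypertension"},
]

pair_counts = Counter()
for diagnoses in patients:
    for pair in combinations(sorted(diagnoses), 2):
        pair_counts[pair] += 1

min_support = 2
patterns = {p: c for p, c in pair_counts.items() if c >= min_support}
print(patterns)  # {('diabetes', 'hypertension'): 3}
```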
{"title":"Mining diabetes complication and treatment patterns for clinical decision support","authors":"Lu Liu, Jie Tang, Yu Cheng, Ankit Agrawal, W. Liao, A. Choudhary","doi":"10.1145/2505515.2505549","DOIUrl":"https://doi.org/10.1145/2505515.2505549","url":null,"abstract":"The fast development of hospital information systems (HIS) produces a large volume of electronic medical records, which provides a comprehensive source for exploratory analysis and statistics to support clinical decision-making. In this paper, we investigate how to utilize the heterogeneous medical records to aid the clinical treatments of diabetes mellitus. Diabetes mellitus, simply diabetes, is a group of metabolic diseases, which is often accompanied with many complications. We propose a Symptom-Diagnosis-Treatment model to mine the diabetes complication patterns and to unveil the latent association mechanism between treatments and symptoms from large volume of electronic medical records. Furthermore, we study the demographic statistics of patient population w.r.t. complication patterns in real data and observe several interesting phenomena. The discovered complication and treatment patterns can help physicians better understand their specialty and learn previous experiences. Our experiments on a collection of one-year diabetes clinical records from a famous geriatric hospital demonstrate the effectiveness of our approaches.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"34 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86920016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

DeExcelerator: a framework for extracting relational data from partially structured documents
Julian Eberius, Christoper Werner, Maik Thiele, Katrin Braunschweig, Lars Dannecker, Wolfgang Lehner
CIKM 2013. doi:10.1145/2505515.2508210

Of the structured data published on the web, for instance as datasets on open data platforms such as data.gov, but also in the form of HTML tables on the general web, only a small part is in a relational form. Instead, the data is intermingled with formatting, layout, and textual metadata, i.e., it is contained in partially structured documents. This makes transformation into a true relational form necessary, which is a precondition for most forms of data analysis and data integration. Studying data.gov as an example source of partially structured documents, we present a classification of typical normalization problems. We then present DeExcelerator, a framework for extracting relations from partially structured documents such as spreadsheets and HTML tables.
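
Two of the normalization problems this class of documents poses can be illustrated with simplified heuristics; the rules and grid below are assumptions for illustration, not DeExcelerator's actual algorithm:

```python
# Toy spreadsheet normalization: drop empty rows/columns, then detect a
# header row with a crude "all non-numeric" heuristic.
def normalize(grid):
    # Drop fully empty rows.
    rows = [r for r in grid if any(c != "" for c in r)]
    # Drop fully empty columns.
    keep = [j for j in range(len(rows[0])) if any(r[j] != "" for r in rows)]
    rows = [[r[j] for j in keep] for r in rows]
    # Heuristic: the first row is the header if none of its cells look numeric.
    is_num = lambda c: c.replace(".", "", 1).isdigit()
    if not any(is_num(c) for c in rows[0]):
        return rows[0], rows[1:]
    return None, rows

grid = [["", "", ""],
        ["Year", "Sales", ""],
        ["2011", "3.5", ""],
        ["2012", "4.1", ""]]
print(normalize(grid))  # (['Year', 'Sales'], [['2011', '3.5'], ['2012', '4.1']])
```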
{"title":"DeExcelerator: a framework for extracting relational data from partially structured documents","authors":"Julian Eberius, Christoper Werner, Maik Thiele, Katrin Braunschweig, Lars Dannecker, Wolfgang Lehner","doi":"10.1145/2505515.2508210","DOIUrl":"https://doi.org/10.1145/2505515.2508210","url":null,"abstract":"Of the structured data published on the web, for instance as datasets on Open Data Platforms such as data.gov, but also in the form of HTML tables on the general web, only a small part is in a relational form. Instead the data is intermingled with formatting, layout and textual metadata, i.e., it is contained in partially structured documents. This makes transformation into a true relational form necessary, which is a precondition for most forms of data analysis and data integration. Studying data.gov as an example source for partially structured documents, we present a classification of typical normalization problems. We then present the DeExcelerator, which is a framework for extracting relations from partially structured documents such as spreadsheets and HTML tables.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"92 10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87741648","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

The first workshop on user engagement optimization
Liangjie Hong, Shuang-Hong Yang
CIKM 2013. doi:10.1145/2505515.2505816

Online user engagement optimization is key to many Internet businesses. Several research areas are related to online user engagement optimization, including machine learning, data mining, information retrieval, recommender systems, online A/B (bucket) testing, and psychology. In the past, research efforts in this direction have been pursued in separate communities and conferences, yielding potentially disconnected and duplicated results. In addition, researchers and practitioners are sometimes exposed to only a specific aspect of the topic, which can give an incomplete and suboptimal view of the whole picture. Here, we organize the first workshop on online user engagement optimization, explicitly targeting the topic as a whole and bringing researchers and practitioners together to foster the field. We invited two leading researchers from industry to give keynote talks about online machine learning and online experimentation. In addition, several invited talks from industry and academic researchers cover the topics of content personalization, online experimentation platforms, and recommender systems. Also, six novel submissions are included as short papers in the workshop, so that new results are discussed and shared among participants.
{"title":"The first workshop on user engagement optimization","authors":"Liangjie Hong, Shuang-Hong Yang","doi":"10.1145/2505515.2505816","DOIUrl":"https://doi.org/10.1145/2505515.2505816","url":null,"abstract":"Online user engagement optimization is key to many Internet business. Several research areas are related to the concept of online user engagement optimization, including machine learning, data mining, information retrieval, recommender systems, online A/B (bucket) testing and psychology. In the past, research efforts in this direction are pursued in separate communities and conferences, yielding potential disconnected and repeated results. In addition, researchers and practitioners are sometimes only exposed to a specific aspect of the topic, which might be incomplete and suboptimal to the whole picture. Here, we organize the first workshop on the topic of online user engagement optimization, explicitly targeting the topic as a whole and bring researchers and practitioners together to foster the field. We invite two leading researchers from industry to give keynote talks about online machine learning and online experimentations. In addition, several invited talks from industry and academic researchers have covered the topics of content personalization, online experimental platforms and recommender systems. Also, six novel submissions are included as short papers in the workshop such that new results are discussed and shared among the workshop.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"2010 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86272759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Efficient pruning algorithm for top-K ranking on dataset with value uncertainty
Jianwen Chen, Ling Feng
CIKM 2013. doi:10.1145/2505515.2505735

A top-K ranking query in uncertain databases aims to find the top-K tuples according to a ranking function. The interplay between score and uncertainty makes top-K ranking in uncertain databases an intriguing issue, leading to rich query semantics. Recently, a unified ranking framework based on parameterized ranking functions (PRFs) has been formulated, which generalizes many previously proposed ranking semantics. Under the PRF-based ranking framework, efficient pruning approaches for top-K ranking on datasets with tuple uncertainty have been well studied in the literature. However, these cannot be applied to top-K ranking on datasets with value uncertainty (described through the attribute-level uncertain data model), which is often natural and useful in analyzing uncertain data in many applications. This paper aims to develop efficient pruning techniques for top-K ranking on datasets with value uncertainty under the PRF-based ranking framework, which has not been well studied in the literature. We present the mathematics of deriving the pruning techniques and the corresponding algorithms. Experimental results on both real and synthetic data demonstrate the effectiveness and efficiency of the proposed pruning techniques.
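
As a reminder of the framework being extended, the parameterized ranking function from the prior work referenced here scores a tuple by a position-weighted sum over its possible ranks; a sketch of the definition as we understand it (notation may differ from the original formulation):

```latex
% A parameterized ranking function scores tuple t by weighting the
% probability (taken across possible worlds) that t appears at rank i:
\Upsilon_{\omega}(t) \;=\; \sum_{i > 0} \omega(t, i)\,\Pr\bigl(r(t) = i\bigr)
% Top-K ranking returns the K tuples with the highest \Upsilon_{\omega}(t);
% particular choices of \omega, such as \omega(t, i) = \alpha^{i},
% recover previously proposed ranking semantics.
```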
{"title":"Efficient pruning algorithm for top-K ranking on dataset with value uncertainty","authors":"Jianwen Chen, Ling Feng","doi":"10.1145/2505515.2505735","DOIUrl":"https://doi.org/10.1145/2505515.2505735","url":null,"abstract":"Top-K ranking query in uncertain databases aims to find the top-K tuples according to a ranking function. The interplay between score and uncertainty makes top-K ranking in uncertain databases an intriguing issue, leading to rich query semantics. Recently, a unified ranking framework based on parameterized ranking functions (PRFs) is formulated, which generalizes many previously proposed ranking semantics. Under the PRFs based ranking framework, efficient pruning approach for Top-K ranking on dataset with tuple uncertainty has been well studied in the literature. However, this cannot be applied to top-K ranking on dataset with value uncertainty (described through attribute-level uncertain data model), which are often natural and useful in analyzing uncertain data in many applications. This paper aims to develop efficient pruning techniques for top-K ranking on dataset with value uncertainty under the PRFs based ranking framework, which has not been well studied in the literature. We present the mathematics of deriving the pruning techniques and the corresponding algorithms. The experimental results on both real and synthetic data demonstrate the effectiveness and efficiency of the proposed pruning techniques.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83857240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}