Towards exploratory hypothesis testing and analysis
Pub Date: 2011-04-11 | DOI: 10.1109/ICDE.2011.5767907
Guimei Liu, Mengling Feng, Yue Wang, L. Wong, See-Kiong Ng, Tzia Liang Mah, E. Lee
Hypothesis testing is a well-established tool for scientific discovery. Conventional hypothesis testing is carried out in a hypothesis-driven manner: a scientist must first formulate a hypothesis based on his or her knowledge and experience, and then devise a variety of experiments to test it. Given the rapid growth of data, it has become virtually impossible for a person to manually inspect all the data to find all the interesting hypotheses for testing. In this paper, we propose and develop a data-driven system for automatic hypothesis testing and analysis. We define a hypothesis as a comparison between two or more sub-populations. We find sub-populations for comparison using frequent pattern mining techniques and then pair them up for statistical testing. We also generate additional information for further analysis of the hypotheses that are deemed significant. We conducted a set of experiments to show the efficiency of the proposed algorithms and the usefulness of the generated hypotheses. The results show that our system can help users (1) identify significant hypotheses; (2) isolate the reasons behind significant hypotheses; and (3) find confounding factors that form Simpson's paradoxes with discovered significant hypotheses.
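As a rough illustration of the pipeline described above, the sketch below pairs attribute-defined sub-populations and runs a statistical test on each pair. It is a minimal sketch, not the authors' system: the records, attribute names, and the choice of the Mann-Whitney U test are all assumptions for illustration, and real frequent pattern mining would prune the candidate patterns by support.

```python
from itertools import combinations
from scipy.stats import mannwhitneyu

# Toy records: (attributes, outcome). Attribute values define sub-populations.
records = [
    ({"gender": "M", "smoker": "yes"}, 61.0),
    ({"gender": "M", "smoker": "no"},  74.5),
    ({"gender": "F", "smoker": "yes"}, 66.2),
    ({"gender": "F", "smoker": "no"},  79.1),
    # ... many more records in a real dataset
]

def subpopulation(pattern):
    """Outcome values of all records matching an attribute pattern."""
    return [y for attrs, y in records
            if all(attrs.get(k) == v for k, v in pattern.items())]

# Candidate patterns; frequent pattern mining would enumerate and prune these.
patterns = [{"smoker": "yes"}, {"smoker": "no"}]

# Pair up comparable sub-populations and test each hypothesis.
for p1, p2 in combinations(patterns, 2):
    a, b = subpopulation(p1), subpopulation(p2)
    if len(a) >= 2 and len(b) >= 2:
        stat, pval = mannwhitneyu(a, b)
        print(f"H: {p1} vs {p2} -> p = {pval:.4f}")
```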
{"title":"Towards exploratory hypothesis testing and analysis","authors":"Guimei Liu, Mengling Feng, Yue Wang, L. Wong, See-Kiong Ng, Tzia Liang Mah, E. Lee","doi":"10.1109/ICDE.2011.5767907","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767907","url":null,"abstract":"Hypothesis testing is a well-established tool for scientific discovery. Conventional hypothesis testing is carried out in a hypothesis-driven manner. A scientist must first formulate a hypothesis based on his/her knowledge and experience, and then devise a variety of experiments to test it. Given the rapid growth of data, it has become virtually impossible for a person to manually inspect all the data to find all the interesting hypotheses for testing. In this paper, we propose and develop a data-driven system for automatic hypothesis testing and analysis. We define a hypothesis as a comparison between two or more sub-populations. We find sub-populations for comparison using frequent pattern mining techniques and then pair them up for statistical testing. We also generate additional information for further analysis of the hypotheses that are deemed significant. We conducted a set of experiments to show the efficiency of the proposed algorithms, and the usefulness of the generated hypotheses. The results show that our system can help users (1) identify significant hypotheses; (2) isolate the reasons behind significant hypotheses; and (3) find confounding factors that form Simpson's Paradoxes with discovered significant hypotheses.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130608714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SkyEngine: Efficient Skyline search engine for Continuous Skyline computations
Pub Date: 2011-04-11 | DOI: 10.1109/ICDE.2011.5767944
Yu-Ling Hsueh, Roger Zimmermann, Wei-Shinn Ku, Yifan Jin
Skyline query processing has become an important feature in multi-dimensional, data-intensive applications. Such computations are especially challenging under dynamic conditions, when either snapshot queries need to be answered with short user response times or continuous skyline queries need to be maintained efficiently over a set of objects that are frequently updated. To achieve high performance, we have recently designed the ESC algorithm, an Efficient update approach for Skyline Computations. ESC creates a pre-computed candidate skyline set behind the first skyline (a “second line of defense,” so to speak) that facilitates an incremental, two-stage skyline update strategy, resulting in quicker query response times for the user. Our demonstration presents the two-threaded SkyEngine system, which builds upon and extends the base features of the ESC algorithm with innovative, user-oriented functionalities termed SkyAlert and AutoAdjust. These functions enable a data or service provider to be informed about, and given the opportunity to automatically promote, its data records so that they remain part of the skyline, if so desired. The SkyEngine demonstration includes both a server and a web-browser-based client. Finally, the SkyEngine system also provides visualizations that reveal its internal performance statistics.
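To make the “second line of defense” idea concrete, here is a minimal sketch (not the ESC algorithm itself) that computes a skyline under minimization dominance together with a candidate set of points dominated only by skyline points; the point data is made up.

```python
def dominates(p, q):
    """p dominates q if p <= q in every dimension and < in at least one (minimization)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline_with_candidates(points):
    """Split points into the skyline and a 'second line': points dominated
    only by skyline points, kept as promotion candidates for fast updates."""
    sky = [p for p in points if not any(dominates(q, p) for q in points if q != p)]
    rest = [p for p in points if p not in sky]
    second = [p for p in rest
              if not any(dominates(q, p) for q in rest if q != p)]
    return sky, second

points = [(1, 9), (3, 3), (4, 2), (5, 5), (9, 1), (6, 6)]
sky, cand = skyline_with_candidates(points)
# When a skyline point is deleted, only `cand` needs re-checking,
# not the whole dataset -- the essence of a two-stage update strategy.
print(sky, cand)   # [(1, 9), (3, 3), (4, 2), (9, 1)] [(5, 5)]
```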
{"title":"SkyEngine: Efficient Skyline search engine for Continuous Skyline computations","authors":"Yu-Ling Hsueh, Roger Zimmermann, Wei-Shinn Ku, Yifan Jin","doi":"10.1109/ICDE.2011.5767944","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767944","url":null,"abstract":"Skyline query processing has become an important feature in multi-dimensional, data-intensive applications. Such computations are especially challenging under dynamic conditions, when either snapshot queries need to be answered with short user response times or when continuous skyline queries need to be maintained efficiently over a set of objects that are frequently updated. To achieve high performance, we have recently designed the ESC algorithm, an Efficient update approach for Skyline Computations. ESC creates a pre-computed candidate skyline set behind the first skyline (a “second line of defense,” so to speak) that facilitates an incremental, two-stage skyline update strategy which results in a quicker query response time for the user. Our demonstration presents the two-threaded SkyEngine system that builds upon and extends the base-features of the ESC algorithm with innovative, user-oriented functionalities that are termed SkyAlert and AutoAdjust. These functions enable a data or service provider to be informed about and gain the opportunity of automatically promoting its data records to remain part of the skyline, if so desired. The SkyEngine demonstration includes both a server and a web browser based client. Finally, the SkyEngine system also provides visualizations that reveal its internal performance statistics.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128796465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Integrating code search into the development session
Pub Date: 2011-04-11 | DOI: 10.1109/ICDE.2011.5767948
Mu-Woong Lee, Seung-won Hwang, Sunghun Kim
To support rapid and efficient software development, we demonstrate our tool, which integrates code search into the software development process. For example, a developer, while writing a module, can find a code piece sharing the same syntactic structure from a large code corpus representing the wisdom of other developers on the same team (or in the universe of open-source code). While commercial code search engines over the code universe exist, they treat software as text (and are thus oblivious to syntactic structure) and fail at finding semantically related code. Meanwhile, existing tools that search for syntactic clones do not focus on efficiency, targeting instead a “post-mortem” usage scenario of detecting clones after code development is completed. In clear contrast, we focus on optimizing the efficiency of syntactic code search and making this search interactive over a large-scale corpus, complementing the two existing lines of research. Our demonstration shows how such interactive search supports rapid software development, as has recently been argued in the SE and HCI communities [1], [2]. As an enabling technology, we design efficient index building and traversal techniques, optimized for the code corpus and the code search workload. Our tool can identify relevant code in a corpus of 1.7 million code pieces with sub-second response time, without compromising the accuracy obtained by a state-of-the-art tool, as we report in our extensive evaluation results in [3].
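The following sketch illustrates one plausible way to index code by syntactic structure, assuming Python source and Python's ast module; it is not the authors' index, which is engineered for a 1.7-million-piece corpus. Identifiers and literals are ignored so that structurally identical snippets collide in the same bucket.

```python
import ast
from collections import defaultdict

def shape(node):
    """Serialize the AST shape, ignoring identifiers and literal values,
    so structurally identical code produces the same key."""
    return (type(node).__name__,
            tuple(shape(c) for c in ast.iter_child_nodes(node)))

index = defaultdict(list)

def add_snippet(snippet_id, source):
    index[hash(shape(ast.parse(source)))].append(snippet_id)

def search(query_source):
    """Return ids of indexed snippets sharing the query's syntactic structure."""
    return index.get(hash(shape(ast.parse(query_source))), [])

add_snippet("s1", "def f(x):\n    return x + 1")
print(search("def g(y):\n    return y + 2"))   # ['s1']: same structure, different names
```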
{"title":"Integrating code search into the development session","authors":"Mu-Woong Lee, Seung-won Hwang, Sunghun Kim","doi":"10.1109/ICDE.2011.5767948","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767948","url":null,"abstract":"To support rapid and efficient software development, we propose to demonstrate our tool, integrating code search into software development process. For example, a developer, right during writing a module, can find a code piece sharing the same syntactic structure from a large code corpus representing the wisdom of other developers in the same team (or in the universe of open-source code). While there exist commercial code search engines on the code universe, they treat software as text (thus oblivious of syntactic structure), and fail at finding semantically related code. Meanwhile, existing tools, searching for syntactic clones, do not focus on efficiency, focusing on “post-mortem” usage scenario of detecting clones “after” the code development is completed. In clear contrast, we focus on optimizing efficiency for syntactic code search and making this search “interactive” for large-scale corpus, to complement the existing two lines of research. From our demonstration, we will show how such interactive search supports rapid software development, as similarly claimed lately in SE and HCI communities [1], [2]. As an enabling technology, we design efficient index building and traversal techniques, optimized for code corpus and code search workload. Our tool can identify relevant code in the corpus of 1.7 million code pieces in a sub-second response time, without compromising any accuracy obtained by a state-of-the-art tool, as we report our extensive evaluation results in [3].","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123097662","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Processing private queries over untrusted data cloud through privacy homomorphism
Pub Date: 2011-04-11 | DOI: 10.1109/ICDE.2011.5767862
Haibo Hu, Jianliang Xu, C. Ren, Byron Choi
Query processing that preserves both the data privacy of the owner and the query privacy of the client is a new research problem. It is of increasing importance as cloud computing drives more businesses to outsource their data and querying services. However, most existing studies, including those on data outsourcing, address data privacy and query privacy separately and cannot be applied to this problem. In this paper, we propose a holistic and efficient solution that comprises a secure traversal framework and an encryption scheme based on privacy homomorphism. The framework is scalable to large datasets by leveraging an index-based approach. Based on this framework, we devise secure protocols for processing typical queries such as k-nearest-neighbor (kNN) queries on an R-tree index. Moreover, several optimization techniques are presented to improve the efficiency of the query processing protocols. Our solution is verified by both theoretical analysis and a performance study.
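As a toy illustration of privacy homomorphism (not the authors' secure traversal protocol), the snippet below uses the open-source python-paillier library (phe), whose ciphertexts support addition with plaintexts, so a server can compute on encrypted values it cannot read.

```python
# pip install phe   (python-paillier)
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# Client encrypts a query coordinate; the server never sees the plaintext.
enc_q = public_key.encrypt(42)

# Server computes on the ciphertext against its own plaintext value,
# exploiting the additive homomorphism E(a) + b = E(a + b).
enc_diff = enc_q + (-30)                      # still encrypted
assert private_key.decrypt(enc_diff) == 12    # only the client can decrypt
```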
{"title":"Processing private queries over untrusted data cloud through privacy homomorphism","authors":"Haibo Hu, Jianliang Xu, C. Ren, Byron Choi","doi":"10.1109/ICDE.2011.5767862","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767862","url":null,"abstract":"Query processing that preserves both the data privacy of the owner and the query privacy of the client is a new research problem. It shows increasing importance as cloud computing drives more businesses to outsource their data and querying services. However, most existing studies, including those on data outsourcing, address the data privacy and query privacy separately and cannot be applied to this problem. In this paper, we propose a holistic and efficient solution that comprises a secure traversal framework and an encryption scheme based on privacy homomorphism. The framework is scalable to large datasets by leveraging an index-based approach. Based on this framework, we devise secure protocols for processing typical queries such as k-nearest-neighbor queries (kNN) on R-tree index. Moreover, several optimization techniques are presented to improve the efficiency of the query processing protocols. Our solution is verified by both theoretical analysis and performance study.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126303259","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Implementing sentinels in the TARGIT BI suite
Pub Date: 2011-04-11 | DOI: 10.1109/ICDE.2011.5767828
Morten Middelfart, T. Pedersen
This paper describes the implementation of so-called sentinels in the TARGIT BI Suite. Sentinels are a novel type of rule that can warn a user if changes in one or more measures in a multi-dimensional data cube are expected to cause a change to another measure critical to the user. Sentinels notify users based on previous observations, e.g., that revenue might drop within two months if an increase in customer problems combined with a decrease in website traffic is observed. In this paper we show how users, without any prior technical knowledge, can mine and use sentinels in the TARGIT BI Suite. We present in detail how sentinels are mined from data and how they are scored. We describe in detail how the sentinel mining algorithm is implemented in the TARGIT BI Suite, and show that our implementation is able to discover strong and useful sentinels that could not be found using sequential pattern mining or correlation techniques. We demonstrate, through extensive experiments, that mining and using sentinels is feasible with good performance for typical users on a real, operational data warehouse.
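A minimal sketch of the sentinel intuition, not the TARGIT implementation: it discretizes two measure series into up/down/steady changes and measures how often a change in a source measure is followed, after a lag, by an opposite change in a target measure. The measures, threshold, and data are all hypothetical.

```python
def changes(series, threshold):
    """Direction of relative change between consecutive periods: +1, -1, or 0."""
    out = []
    for prev, cur in zip(series, series[1:]):
        r = (cur - prev) / abs(prev) if prev else 0.0
        out.append(1 if r > threshold else -1 if r < -threshold else 0)
    return out

def sentinel_confidence(source, target, lag, threshold=0.05):
    """Fraction of source-measure changes followed, `lag` periods later,
    by an opposite-direction change in the target measure."""
    src, tgt = changes(source, threshold), changes(target, threshold)
    hits = trials = 0
    for t, s in enumerate(src):
        if s != 0 and t + lag < len(tgt):
            trials += 1
            hits += (tgt[t + lag] == -s)
    return hits / trials if trials else 0.0

problems = [100, 120, 150, 155, 190, 230]   # customer problems per month
revenue  = [900, 910, 870, 820, 830, 760]   # revenue per month
print(sentinel_confidence(problems, revenue, lag=2))   # 0.5
```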
{"title":"Implementing sentinels in the TARGIT BI suite","authors":"Morten Middelfart, T. Pedersen","doi":"10.1109/ICDE.2011.5767828","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767828","url":null,"abstract":"This paper describes the implementation of so-called sentinels in the TARGIT BI Suite. Sentinels are a novel type of rules that can warn a user if one or more measure changes in a multi-dimensional data cube are expected to cause a change to another measure critical to the user. Sentinels notify users based on previous observations, e.g., that revenue might drop within two months if an increase in customer problems combined with a decrease in website traffic is observed. In this paper we show how users, without any prior technical knowledge, can mine and use sentinels in the TARGIT BI Suite. We present in detail how sentinels are mined from data, and how sentinels are scored. We describe in detail how the sentinel mining algorithm is implemented in the TARGIT BI Suite, and show that our implementation is able to discover strong and useful sentinels that could not be found when using sequential pattern mining or correlation techniques. We demonstrate, through extensive experiments, that mining and usage of sentinels is feasible with good performance for the typical users on a real, operational data warehouse.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126590700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Relational databases, virtualization, and the cloud
Pub Date: 2011-04-11 | DOI: 10.1109/ICDE.2011.5767966
M. Ahrens, G. Alonso
Existing relational databases are facing significant challenges as the hardware infrastructure and the underlying platform change from single CPUs to virtualized multicore machines arranged in large clusters. The problems are both technical and related to the licensing models currently in place. In this short abstract we briefly outline the challenges faced by organizations trying to virtualize and bring existing relational databases into the cloud.
{"title":"Relational databases, virtualization, and the cloud","authors":"M. Ahrens, G. Alonso","doi":"10.1109/ICDE.2011.5767966","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767966","url":null,"abstract":"Existing relational databases are facing significant challenges as the hardware infrastructure and the underlying platform change from single CPUs to virtualized multicore machines arranged in large clusters. The problems are both technical and related to the licensing models currently in place. In this short abstract we briefly outline the challenges faced by organizations trying to virtualize and bring existing relational databases into the cloud.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126609081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Locality Sensitive Outlier Detection: A ranking driven approach
Pub Date: 2011-04-11 | DOI: 10.1109/ICDE.2011.5767852
Ye Wang, S. Parthasarathy, S. Tatikonda
Outlier detection is fundamental to a variety of database and analytic tasks. Recently, distance-based outlier detection has emerged as a viable and scalable alternative to traditional statistical and geometric approaches. In this article we explore the role of ranking in the efficient discovery of distance-based outliers from large, high-dimensional data sets. Specifically, we develop a lightweight ranking scheme, powered by locality sensitive hashing, that reorders the database points according to their likelihood of being an outlier. We provide theoretical arguments to justify the rationale for the approach and subsequently conduct an extensive empirical study highlighting the effectiveness of our approach over extant solutions. We show that our ranking scheme improves the efficiency of the distance-based outlier discovery process by up to 5-fold. Furthermore, we find that using our approach the top outliers can often be isolated very quickly, typically by scanning less than 3% of the data set.
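One plausible instantiation of such a ranking, assuming random-hyperplane LSH (the paper's exact hash family and parameters may differ): points whose hash buckets are sparsely occupied are scanned first, since isolation in hash space correlates with being a distance-based outlier.

```python
from collections import Counter
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
X[0] += 8.0                      # plant an outlier far from the bulk

# Random-projection LSH: a point's bucket key is the sign pattern of its
# projections onto a few random hyperplanes.
planes = rng.normal(size=(20, 6))
keys = [tuple(row) for row in (X @ planes > 0).astype(int)]
occupancy = Counter(keys)

# Rank points sparsest-bucket-first: near-empty buckets hold the most
# promising outlier candidates, so they are examined first.
order = sorted(range(len(X)), key=lambda i: occupancy[keys[i]])
print(order[:5])                 # the planted outlier should rank near the top
```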
{"title":"Locality Sensitive Outlier Detection: A ranking driven approach","authors":"Ye Wang, S. Parthasarathy, S. Tatikonda","doi":"10.1109/ICDE.2011.5767852","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767852","url":null,"abstract":"Outlier detection is fundamental to a variety of database and analytic tasks. Recently, distance-based outlier detection has emerged as a viable and scalable alternative to traditional statistical and geometric approaches. In this article we explore the role of ranking for the efficient discovery of distance-based outliers from large high dimensional data sets. Specifically, we develop a light-weight ranking scheme that is powered by locality sensitive hashing, which reorders the database points according to their likelihood of being an outlier. We provide theoretical arguments to justify the rationale for the approach and subsequently conduct an extensive empirical study highlighting the effectiveness of our approach over extant solutions. We show that our ranking scheme improves the efficiency of the distance-based outlier discovery process by up to 5-fold. Furthermore, we find that using our approach the top outliers can often be isolated very quickly, typically by scanning less than 3% of the data set.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116110166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic generation of mediated schemas through reasoning over data dependencies
Pub Date: 2011-04-11 | DOI: 10.1109/ICDE.2011.5767913
Xiang Li, C. Quix, D. Kensche, Sandra Geisler, Lisong Guo
Mediated schemas lie at the center of the well-recognized data integration architecture. Classical data integration systems rely on a mediated schema created by human experts through an intensive design process; automatic generation of mediated schemas is still a goal to be achieved. We generate mediated schemas by merging multiple source schemas interrelated by tuple-generating dependencies (tgds). Schema merging is the process of consolidating multiple schemas into a unified view. The task becomes particularly challenging when the schemas are highly heterogeneous and autonomous. Existing approaches fall short in various aspects: restricted expressiveness of input mappings, lack of a data-level interpretation, output mappings that are not in a logical language (or are not given at all), and confinement to binary merging. We present here a novel system that performs native n-ary schema merging using P2P-style tgds as input. Suited to the scenario of generating mediated schemas for data integration, the system opts for a minimal schema signature retaining all certain answers of conjunctive queries. Logical output mappings are generated to support the mediated schemas, enabling query answering and, in some cases, query rewriting.
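For concreteness, a tgd input to such a system might be represented as below; this is only an illustrative encoding (the relation and variable names are made up), not the system's actual input format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Atom:
    relation: str
    variables: tuple      # e.g. ("pid", "name")

@dataclass(frozen=True)
class Tgd:
    """A tuple-generating dependency: body(x) -> exists y. head(x, y)."""
    body: tuple           # atoms over one peer's schema
    head: tuple           # atoms over another peer's schema

# A P2P-style tgd relating two peers' schemas:
# Patient(pid, name) AND Visit(pid, date) -> exists d. Record(name, date, d)
t = Tgd(
    body=(Atom("Patient", ("pid", "name")), Atom("Visit", ("pid", "date"))),
    head=(Atom("Record", ("name", "date", "d")),),
)
```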
{"title":"Automatic generation of mediated schemas through reasoning over data dependencies","authors":"Xiang Li, C. Quix, D. Kensche, Sandra Geisler, Lisong Guo","doi":"10.1109/ICDE.2011.5767913","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767913","url":null,"abstract":"Mediated schemas lie at the center of the well recognized data integration architecture. Classical data integration systems rely on a mediated schema created by human experts through an intensive design process. Automatic generation of mediated schemas is still a goal to be achieved. We generate mediated schemas by merging multiple source schemas interrelated by tuple-generating dependencies (tgds). Schema merging is the process to consolidate multiple schemas into a unified view. The task becomes particularly challenging when the schemas are highly heterogeneous and autonomous. Existing approaches fall short in various aspects, such as restricted expressiveness of input mappings, lacking data level interpretation, the output mapping is not in a logical language (or not given at all), and being confined to binary merging. We present here a novel system which is able to perform native n-ary schema merging using P2P style tgds as input. Suited in the scenario of generating mediated schemas for data integration, the system opts for a minimal schema signature retaining all certain answers of conjunctive queries. Logical output mappings are generated to support the mediated schemas, which enable query answering and, in some cases, query rewriting.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125081309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Precisely Serializable Snapshot Isolation (PSSI)
Pub Date: 2011-04-11 | DOI: 10.1109/ICDE.2011.5767853
Stephen Revilak, P. O'Neil, E. O'Neil
Many popular database management systems provide snapshot isolation (SI) for concurrency control, either in addition to or in place of full serializability based on locking. Snapshot isolation was introduced in 1995 [2], with noted anomalies that can lead to serializability violations. Full serializability was provided in 2008 [4] and improved in 2009 [5] by aborting transactions in dangerous structures, which had been shown in 2005 [9] to be precursors to potential SI anomalies. This approach resulted in a runtime environment guaranteeing a serializable form of snapshot isolation (which we call SSI [4] or ESSI [5]) for arbitrary applications. But transactions in a dangerous structure frequently do not cause true anomalies, so, as the authors point out, their method is conservative: it can cause unnecessary aborts. In the current paper, we present our PSSI algorithm, which detects cycles in the snapshot isolation dependency graph and aborts transactions to break the cycle. This algorithm provides a much more precise criterion for performing aborts. We have implemented our algorithm in an open-source production database system (MySQL/InnoDB), and our performance study shows that PSSI throughput improves on ESSI, with significantly fewer aborts.
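The core mechanism, testing for a dependency cycle and aborting a transaction on it, can be sketched as follows. This toy version (not the MySQL/InnoDB implementation) stores explicit edges and runs a DFS from a given transaction.

```python
from collections import defaultdict

class DependencyGraph:
    """Toy transaction dependency graph: an edge T1 -> T2 records that T2
    depends on T1 (e.g., an rw-conflict). A cycle means a non-serializable
    execution, which PSSI-style detection breaks by aborting one member."""
    def __init__(self):
        self.edges = defaultdict(set)

    def add_dependency(self, t1, t2):
        self.edges[t1].add(t2)

    def find_cycle(self, start):
        """DFS from `start`; return the transactions on a cycle through it, or None."""
        stack = [(start, [start])]
        while stack:
            node, path = stack.pop()
            for nxt in self.edges[node]:
                if nxt == start:
                    return path              # closed a cycle back to start
                if nxt not in path:
                    stack.append((nxt, path + [nxt]))
        return None

g = DependencyGraph()
g.add_dependency("T1", "T2")
g.add_dependency("T2", "T3")
g.add_dependency("T3", "T1")
cycle = g.find_cycle("T1")
if cycle:
    victim = cycle[-1]      # e.g., abort the last transaction on the cycle
    print(f"cycle {cycle}; abort {victim} to restore serializability")
```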
{"title":"Precisely Serializable Snapshot Isolation (PSSI)","authors":"Stephen Revilak, P. O'Neil, E. O'Neil","doi":"10.1109/ICDE.2011.5767853","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767853","url":null,"abstract":"Many popular database management systems provide snapshot isolation (SI) for concurrency control, either in addition to or in place of full serializability based on locking. Snapshot isolation was introduced in 1995 [2], with noted anomalies that can lead to serializability violations. Full serializability was provided in 2008 [4] and improved in 2009 [5] by aborting transactions in dangerous structures, which had been shown in 2005 [9] to be precursors to potential SI anomalies. This approach resulted in a runtime environment guaranteeing a serializable form of snapshot isolation (which we call SSI [4] or ESSI [5]) for arbitrary applications. But transactions in a dangerous structure frequently do not cause true anomalies so, as the authors point out, their method is conservative: it can cause unnecessary aborts. In the current paper, we demonstrate our PSSI algorithm to detect cycles in a snapshot isolation dependency graph and abort transactions to break the cycle. This algorithm provides a much more precise criterion to perform aborts. We have implemented our algorithm in an open source production database system (MySQL/InnoDB), and our performance study shows that PSSI throughput improves on ESSI, with significantly fewer aborts.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122780700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
XClean: Providing valid spelling suggestions for XML keyword queries
Pub Date: 2011-04-11 | DOI: 10.1109/ICDE.2011.5767847
Yifei Lu, Wei Wang, Jianxin Li, Chengfei Liu
An important facility for aiding keyword search on XML data is suggesting alternative queries when user queries contain typographical errors. Query suggestion can thus improve users' search experience by avoiding empty results or results of poor quality. In this paper, we study the problem of effectively and efficiently providing quality query suggestions for keyword queries on an XML document. We illustrate certain biases in previous work and propose a principled and general framework, XClean, based on the state-of-the-art language model. Compared with previous methods, XClean can accommodate different error models and XML keyword query semantics without losing rigor. Algorithms have been developed that compute the top-k suggestions efficiently. We performed an extensive experimental study using two large-scale real datasets. The results demonstrate the effectiveness and efficiency of the proposed methods.
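A crude stand-in for the suggestion pipeline, not XClean's language model: generate candidates within a small edit distance of the mistyped keyword and rank them by corpus frequency, a simplistic noisy-channel prior. The vocabulary and frequencies are hypothetical.

```python
from difflib import get_close_matches

# Keyword vocabulary with corpus frequencies, standing in for the language
# model XClean builds over the XML document (values are made up).
vocab = {"database": 120, "databases": 45, "datalog": 8, "mining": 60}

def suggest(term, k=3):
    """Top-k suggestions: near-matches of `term`, ranked by corpus frequency."""
    cands = get_close_matches(term, list(vocab), n=10, cutoff=0.7)
    return sorted(cands, key=lambda w: -vocab[w])[:k]

print(suggest("databse"))   # ['database', 'databases']
```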
{"title":"XClean: Providing valid spelling suggestions for XML keyword queries","authors":"Yifei Lu, Wei Wang, Jianxin Li, Chengfei Liu","doi":"10.1109/ICDE.2011.5767847","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767847","url":null,"abstract":"An important facility to aid keyword search on XML data is suggesting alternative queries when user queries contain typographical errors. Query suggestion thus can improve users' search experience by avoiding returning empty result or results of poor qualities. In this paper, we study the problem of effectively and efficiently providing quality query suggestions for keyword queries on an XML document. We illustrate certain biases in previous work and propose a principled and general framework, XClean, based on the state-of-the-art language model. Compared with previous methods, XClean can accommodate different error models and XML keyword query semantics without losing rigor. Algorithms have been developed that compute the top-k suggestions efficiently. We performed an extensive experiment study using two large-scale real datasets. The experiment results demonstrate the effectiveness and efficiency of the proposed methods.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132639440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}