Pub Date : 2011-04-11DOI: 10.1109/ICDE.2011.5767864
Laure Berti-Équille, T. Dasu, D. Srivastava
Quantitative Data Cleaning (QDC) is the use of statistical and other analytical techniques to detect, quantify, and correct data quality problems (or glitches). Current QDC approaches focus on addressing each category of data glitch individually. However, in real-world data, different types of data glitches co-occur in complex patterns. These patterns and interactions between glitches offer valuable clues for developing effective domain-specific quantitative cleaning strategies. In this paper, we address the shortcomings of the extant QDC methods by proposing a novel framework, the DEC (Detect-Explore-Clean) framework. It is a comprehensive approach for the definition, detection and cleaning of complex, multi-type data glitches. We exploit the distributions and interactions of different types of glitches to develop data-driven cleaning strategies that may offer significant advantages over blind strategies. The DEC framework is a statistically rigorous methodology for evaluating and scoring glitches and selecting the quantitative cleaning strategies that result in cleaned data sets that are statistically proximal to user specifications. We demonstrate the efficacy and scalability of the DEC framework on very large real-world and synthetic data sets.
{"title":"Discovery of complex glitch patterns: A novel approach to Quantitative Data Cleaning","authors":"Laure Berti-Équille, T. Dasu, D. Srivastava","doi":"10.1109/ICDE.2011.5767864","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767864","url":null,"abstract":"Quantitative Data Cleaning (QDC) is the use of statistical and other analytical techniques to detect, quantify, and correct data quality problems (or glitches). Current QDC approaches focus on addressing each category of data glitch individually. However, in real-world data, different types of data glitches co-occur in complex patterns. These patterns and interactions between glitches offer valuable clues for developing effective domain-specific quantitative cleaning strategies. In this paper, we address the shortcomings of the extant QDC methods by proposing a novel framework, the DEC (Detect-Explore-Clean) framework. It is a comprehensive approach for the definition, detection and cleaning of complex, multi-type data glitches. We exploit the distributions and interactions of different types of glitches to develop data-driven cleaning strategies that may offer significant advantages over blind strategies. The DEC framework is a statistically rigorous methodology for evaluating and scoring glitches and selecting the quantitative cleaning strategies that result in cleaned data sets that are statistically proximal to user specifications. We demonstrate the efficacy and scalability of the DEC framework on very large real-world and synthetic data sets.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127623665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
How similar are two data-cubes? In other words, the question under consideration is: given two sets of points in a multidimensional hierarchical space, what is the distance value between them? In this paper we explore various distance functions that can be used over multidimensional hierarchical spaces. We organize the discussed functions with respect to the properties of the dimension hierarchies, levels and values. In order to discover which distance functions are more suitable and meaningful to the users, we conducted two user study analysis. The first user study analysis concerns the most preferred distance function between two values of a dimension. The findings of this user study indicate that the functions that seem to fit better the user needs are characterized by the tendency to consider as closest to a point in a multidimensional space, points with the smallest shortest path with respect to the same dimension hierarchy. The second user study aimed in discovering which distance function between two data cubes, is mostly preferred by users. The two functions that drew the attention of users where (a) the summation of distances between every cell of a cube with the most similar cell of another cube and (b) the Hausdorff distance function. Overall, the former function was preferred by users than the latter; however the individual scores of the tests indicate that this advantage is rather narrow.
{"title":"Similarity measures for multidimensional data","authors":"Eftychia Baikousi, Georgios Rogkakos, Panos Vassiliadis","doi":"10.1109/ICDE.2011.5767869","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767869","url":null,"abstract":"How similar are two data-cubes? In other words, the question under consideration is: given two sets of points in a multidimensional hierarchical space, what is the distance value between them? In this paper we explore various distance functions that can be used over multidimensional hierarchical spaces. We organize the discussed functions with respect to the properties of the dimension hierarchies, levels and values. In order to discover which distance functions are more suitable and meaningful to the users, we conducted two user study analysis. The first user study analysis concerns the most preferred distance function between two values of a dimension. The findings of this user study indicate that the functions that seem to fit better the user needs are characterized by the tendency to consider as closest to a point in a multidimensional space, points with the smallest shortest path with respect to the same dimension hierarchy. The second user study aimed in discovering which distance function between two data cubes, is mostly preferred by users. The two functions that drew the attention of users where (a) the summation of distances between every cell of a cube with the most similar cell of another cube and (b) the Hausdorff distance function. Overall, the former function was preferred by users than the latter; however the individual scores of the tests indicate that this advantage is rather narrow.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127984425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2011-04-11DOI: 10.1109/ICDE.2011.5767863
Bin Bi, Sau-dan. Lee, B. Kao, Reynold Cheng
In a social tagging system, resources (such as photos, video and web pages) are associated with tags. These tags allow the resources to be effectively searched through tag-based keyword matching using traditional IR techniques. We note that in many such systems, tags of a resource are often assigned by a diverse audience of causal users (taggers). This leads to two issues that gravely affect the effectiveness of resource retrieval: (1) Noise: tags are picked from an uncontrolled vocabulary and are assigned by untrained taggers. The tags are thus noisy features in resource retrieval. (2) A multitude of aspects: different taggers focus on different aspects of a resource. Representing a resource using a flattened bag of tags ignores this important diversity of taggers. To improve the effectiveness of resource retrieval in social tagging systems, we propose CubeLSI — a technique that extends traditional LSI to include taggers as another dimension of feature space of resources. We compare CubeLSI against a number of other tag-based retrieval models and show that CubeLSI significantly outperforms the other models in terms of retrieval accuracy. We also prove two interesting theorems that allow CubeLSI to be very efficiently computed despite the much enlarged feature space it employs.
{"title":"CubeLSI: An effective and efficient method for searching resources in social tagging systems","authors":"Bin Bi, Sau-dan. Lee, B. Kao, Reynold Cheng","doi":"10.1109/ICDE.2011.5767863","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767863","url":null,"abstract":"In a social tagging system, resources (such as photos, video and web pages) are associated with tags. These tags allow the resources to be effectively searched through tag-based keyword matching using traditional IR techniques. We note that in many such systems, tags of a resource are often assigned by a diverse audience of causal users (taggers). This leads to two issues that gravely affect the effectiveness of resource retrieval: (1) Noise: tags are picked from an uncontrolled vocabulary and are assigned by untrained taggers. The tags are thus noisy features in resource retrieval. (2) A multitude of aspects: different taggers focus on different aspects of a resource. Representing a resource using a flattened bag of tags ignores this important diversity of taggers. To improve the effectiveness of resource retrieval in social tagging systems, we propose CubeLSI — a technique that extends traditional LSI to include taggers as another dimension of feature space of resources. We compare CubeLSI against a number of other tag-based retrieval models and show that CubeLSI significantly outperforms the other models in terms of retrieval accuracy. We also prove two interesting theorems that allow CubeLSI to be very efficiently computed despite the much enlarged feature space it employs.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132215673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2011-04-11DOI: 10.1109/ICDE.2011.5767921
V. Borkar, M. Carey, Raman Grover, Nicola Onose, R. Vernica
Hyracks is a new partitioned-parallel software platform designed to run data-intensive computations on large shared-nothing clusters of computers. Hyracks allows users to express a computation as a DAG of data operators and connectors. Operators operate on partitions of input data and produce partitions of output data, while connectors repartition operators' outputs to make the newly produced partitions available at the consuming operators. We describe the Hyracks end user model, for authors of dataflow jobs, and the extension model for users who wish to augment Hyracks' built-in library with new operator and/or connector types. We also describe our initial Hyracks implementation. Since Hyracks is in roughly the same space as the open source Hadoop platform, we compare Hyracks with Hadoop experimentally for several different kinds of use cases. The initial results demonstrate that Hyracks has significant promise as a next-generation platform for data-intensive applications.
{"title":"Hyracks: A flexible and extensible foundation for data-intensive computing","authors":"V. Borkar, M. Carey, Raman Grover, Nicola Onose, R. Vernica","doi":"10.1109/ICDE.2011.5767921","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767921","url":null,"abstract":"Hyracks is a new partitioned-parallel software platform designed to run data-intensive computations on large shared-nothing clusters of computers. Hyracks allows users to express a computation as a DAG of data operators and connectors. Operators operate on partitions of input data and produce partitions of output data, while connectors repartition operators' outputs to make the newly produced partitions available at the consuming operators. We describe the Hyracks end user model, for authors of dataflow jobs, and the extension model for users who wish to augment Hyracks' built-in library with new operator and/or connector types. We also describe our initial Hyracks implementation. Since Hyracks is in roughly the same space as the open source Hadoop platform, we compare Hyracks with Hadoop experimentally for several different kinds of use cases. The initial results demonstrate that Hyracks has significant promise as a next-generation platform for data-intensive applications.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130263670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2011-04-11DOI: 10.1109/ICDE.2011.5767920
Senjuti Basu Roy, Gautam Das, S. Amer-Yahia, Cong Yu
Planning an itinerary when traveling to a city involves substantial effort in choosing Points-of-Interest (POIs), deciding in which order to visit them, and accounting for the time it takes to visit each POI and transit between them. Several online services address different aspects of itinerary planning but none of them provides an interactive interface where users give feedbacks and iteratively construct their itineraries based on personal interests and time budget. In this paper, we formalize interactive itinerary planning as an iterative process where, at each step: (1) the user provides feedback on POIs selected by the system, (2) the system recommends the best itineraries based on all feedback so far, and (3) the system further selects a new set of POIs, with optimal utility, to solicit feedback for, at the next step. This iterative process stops when the user is satisfied with the recommended itinerary. We show that computing an itinerary is NP-complete even for simple itinerary scoring functions, and that POI selection is NP-complete. We develop heuristics and optimizations for a specific case where the score of an itinerary is proportional to the number of desired POIs it contains. Our extensive experiments show that our algorithms are efficient and return high quality itineraries.
{"title":"Interactive itinerary planning","authors":"Senjuti Basu Roy, Gautam Das, S. Amer-Yahia, Cong Yu","doi":"10.1109/ICDE.2011.5767920","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767920","url":null,"abstract":"Planning an itinerary when traveling to a city involves substantial effort in choosing Points-of-Interest (POIs), deciding in which order to visit them, and accounting for the time it takes to visit each POI and transit between them. Several online services address different aspects of itinerary planning but none of them provides an interactive interface where users give feedbacks and iteratively construct their itineraries based on personal interests and time budget. In this paper, we formalize interactive itinerary planning as an iterative process where, at each step: (1) the user provides feedback on POIs selected by the system, (2) the system recommends the best itineraries based on all feedback so far, and (3) the system further selects a new set of POIs, with optimal utility, to solicit feedback for, at the next step. This iterative process stops when the user is satisfied with the recommended itinerary. We show that computing an itinerary is NP-complete even for simple itinerary scoring functions, and that POI selection is NP-complete. We develop heuristics and optimizations for a specific case where the score of an itinerary is proportional to the number of desired POIs it contains. Our extensive experiments show that our algorithms are efficient and return high quality itineraries.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133587768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2011-04-11DOI: 10.1109/ICDE.2011.5767872
Stefan Aulbach, Michael Seibold, D. Jacobs, A. Kemper
Software-as-a-Service applications commonly consolidate multiple businesses into the same database to reduce costs. This practice makes it harder to implement several essential features of enterprise applications. The first is support for master data, which should be shared rather than replicated for each tenant. The second is application modification and extension, which applies both to the database schema and master data it contains. The third is evolution of the schema and master data, which occurs as the application and its extensions are upgraded. These features cannot be easily implemented in a traditional DBMS and, to the extent that they are currently offered at all, they are generally implemented within the application layer. This approach reduces the DBMS to a ‘dumb data repository’ that only stores data rather than managing it. In addition, it complicates development of the application since many DBMS features have to be re-implemented. Instead, a next-generation multi-tenant DBMS should provide explicit support for Extensibility, Data Sharing and Evolution. As these three features are strongly related, they cannot be implemented independently from each other. Therefore, we propose FLEXSCHEME which captures all three aspects in one integrated model. In this paper, we focus on efficient storage mechanisms for this model and present a novel versioning mechanism, called XOR Delta, which is based on XOR encoding and is optimized for main-memory DBMSs.
{"title":"Extensibility and Data Sharing in evolving multi-tenant databases","authors":"Stefan Aulbach, Michael Seibold, D. Jacobs, A. Kemper","doi":"10.1109/ICDE.2011.5767872","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767872","url":null,"abstract":"Software-as-a-Service applications commonly consolidate multiple businesses into the same database to reduce costs. This practice makes it harder to implement several essential features of enterprise applications. The first is support for master data, which should be shared rather than replicated for each tenant. The second is application modification and extension, which applies both to the database schema and master data it contains. The third is evolution of the schema and master data, which occurs as the application and its extensions are upgraded. These features cannot be easily implemented in a traditional DBMS and, to the extent that they are currently offered at all, they are generally implemented within the application layer. This approach reduces the DBMS to a ‘dumb data repository’ that only stores data rather than managing it. In addition, it complicates development of the application since many DBMS features have to be re-implemented. Instead, a next-generation multi-tenant DBMS should provide explicit support for Extensibility, Data Sharing and Evolution. As these three features are strongly related, they cannot be implemented independently from each other. Therefore, we propose FLEXSCHEME which captures all three aspects in one integrated model. In this paper, we focus on efficient storage mechanisms for this model and present a novel versioning mechanism, called XOR Delta, which is based on XOR encoding and is optimized for main-memory DBMSs.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133619714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2011-04-11DOI: 10.1109/ICDE.2011.5767842
P. Gulhane, Amit Madaan, Rupesh R. Mehta, J. Ramamirtham, R. Rastogi, Sandeepkumar Satpal, Srinivasan H. Sengamedu, Ashwin Tengli, Charu Tiwari
Vertex is a Wrapper Induction system developed at Yahoo! for extracting structured records from template-based Web pages. To operate at Web scale, Vertex employs a host of novel algorithms for (1) Grouping similar structured pages in a Web site, (2) Picking the appropriate sample pages for wrapper inference, (3) Learning XPath-based extraction rules that are robust to variations in site structure, (4) Detecting site changes by monitoring sample pages, and (5) Optimizing editorial costs by reusing rules, etc. The system is deployed in production and currently extracts more than 250 million records from more than 200 Web sites. To the best of our knowledge, Vertex is the first system to do high-precision information extraction at Web scale.
{"title":"Web-scale information extraction with vertex","authors":"P. Gulhane, Amit Madaan, Rupesh R. Mehta, J. Ramamirtham, R. Rastogi, Sandeepkumar Satpal, Srinivasan H. Sengamedu, Ashwin Tengli, Charu Tiwari","doi":"10.1109/ICDE.2011.5767842","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767842","url":null,"abstract":"Vertex is a Wrapper Induction system developed at Yahoo! for extracting structured records from template-based Web pages. To operate at Web scale, Vertex employs a host of novel algorithms for (1) Grouping similar structured pages in a Web site, (2) Picking the appropriate sample pages for wrapper inference, (3) Learning XPath-based extraction rules that are robust to variations in site structure, (4) Detecting site changes by monitoring sample pages, and (5) Optimizing editorial costs by reusing rules, etc. The system is deployed in production and currently extracts more than 250 million records from more than 200 Web sites. To the best of our knowledge, Vertex is the first system to do high-precision information extraction at Web scale.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114519654","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2011-04-11DOI: 10.1109/ICDE.2011.5767942
Rubi Boim, T. Milo, Slava Novgorodov
In this demo we present DiRec, a plug-in that allows Collaborative Filtering (CF) Recommender systems to diversify the recommendations that they present to users. DiRec estimates items diversity by comparing the rankings that different users gave to the items, thereby enabling diversification even in common scenarios where no semantic information on the items is available. Items are clustered based on a novel notion of priority-medoids that provides a natural balance between the need to present highly ranked items vs. highly diverse ones. We demonstrate the operation of DiRec in the context of a movie recommendation system. We show the advantage of recommendation diversification and its feasibility even in the absence of semantic information.
{"title":"DiRec: Diversified recommendations for semantic-less Collaborative Filtering","authors":"Rubi Boim, T. Milo, Slava Novgorodov","doi":"10.1109/ICDE.2011.5767942","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767942","url":null,"abstract":"In this demo we present DiRec, a plug-in that allows Collaborative Filtering (CF) Recommender systems to diversify the recommendations that they present to users. DiRec estimates items diversity by comparing the rankings that different users gave to the items, thereby enabling diversification even in common scenarios where no semantic information on the items is available. Items are clustered based on a novel notion of priority-medoids that provides a natural balance between the need to present highly ranked items vs. highly diverse ones. We demonstrate the operation of DiRec in the context of a movie recommendation system. We show the advantage of recommendation diversification and its feasibility even in the absence of semantic information.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122150963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2011-04-11DOI: 10.1109/ICDE.2011.5767899
R. Moussalli, Mariam Salloum, W. Najjar, V. Tsotras
In recent years, XML-based Publish-Subscribe Systems have become popular due to the increased demand of timely event-notification. Users (or subscribers) pose complex profiles on the structure and content of the published messages. If a profile matches the message, the message is forwarded to the interested subscriber. As the amount of published content continues to grow, current software-based systems will not scale. We thus propose a novel architecture to exploit parallelism of twig matching on FPGAs. This approach yields up to three orders of magnitude higher throughput when compared to conventional approaches bound by the sequential aspect of software computing. This paper, presents a novel method for performing unordered holistic twig matching on FPGAs without any false positives, and whose throughput is independent of the complexity of the user queries or the characteristics of the input XML stream. Furthermore, we present experimental comparison of different granularities of twig matching, namely path-based (root-to-leaf) and pair-based (parent-child or ancestor-descendant).We provide comprehensive experiments that compare the throughput, area utilization and the accuracy of matching (percent of false positives) of our holistic, path-based and pair-based FPGA approaches.
{"title":"Massively parallel XML twig filtering using dynamic programming on FPGAs","authors":"R. Moussalli, Mariam Salloum, W. Najjar, V. Tsotras","doi":"10.1109/ICDE.2011.5767899","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767899","url":null,"abstract":"In recent years, XML-based Publish-Subscribe Systems have become popular due to the increased demand of timely event-notification. Users (or subscribers) pose complex profiles on the structure and content of the published messages. If a profile matches the message, the message is forwarded to the interested subscriber. As the amount of published content continues to grow, current software-based systems will not scale. We thus propose a novel architecture to exploit parallelism of twig matching on FPGAs. This approach yields up to three orders of magnitude higher throughput when compared to conventional approaches bound by the sequential aspect of software computing. This paper, presents a novel method for performing unordered holistic twig matching on FPGAs without any false positives, and whose throughput is independent of the complexity of the user queries or the characteristics of the input XML stream. Furthermore, we present experimental comparison of different granularities of twig matching, namely path-based (root-to-leaf) and pair-based (parent-child or ancestor-descendant).We provide comprehensive experiments that compare the throughput, area utilization and the accuracy of matching (percent of false positives) of our holistic, path-based and pair-based FPGA approaches.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128459043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2011-04-11DOI: 10.1109/ICDE.2011.5767961
G. Graefe
In the context of data management, robustness is usually associated with resilience against failure, recovery, redundancy, disaster preparedness, etc. Robust query processing, on the other hand, is about robustness of performance and of scalability. It is more than progress reporting or predictability. A system that fails predictably or obviously performs poorly may be better than an unpredictable one, but it is not robust.
{"title":"Robust query processing","authors":"G. Graefe","doi":"10.1109/ICDE.2011.5767961","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767961","url":null,"abstract":"In the context of data management, robustness is usually associated with resilience against failure, recovery, redundancy, disaster preparedness, etc. Robust query processing, on the other hand, is about robustness of performance and of scalability. It is more than progress reporting or predictability. A system that fails predictably or obviously performs poorly may be better than an unpredictable one, but it is not robust.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128898116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}