Scientific and statistical database management: International Conference, SSDBM ...: proceedings. Latest publications
Data management systems on GPUs: promises and challenges
Yi-Cheng Tu, Anand Kumar, Di Yu, Ran Rui, Ryan Wheeler
DOI: 10.1145/2484838.2484871 | Pages: 33:1-33:4 | Published: 2013-07-29

The past decade has witnessed the rise of push-based data management systems, in which the query executor passively receives data either from remote data sources (e.g., sensors) or from I/O processes that scan database tables and files on local storage. Unlike traditional relational database management system (RDBMS) architectures, which are mostly I/O-bound, push-based database systems often become heavily computation-bound because the data arrival rate can be very high. In this paper, we argue that modern multi-core hardware, especially Graphics Processing Units (GPUs), provides the most cost-effective computing platform for keeping up with the large volumes of data streamed into a push-based database system. Building on this argument, we open a discussion of how to design and implement a query processing engine for such systems on GPUs.
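For readers less familiar with the push model contrasted here with the pull-based iterator model of traditional RDBMSs, the sketch below shows a minimal push-based operator pipeline in Python: the data source drives execution by pushing rows into its consumers. The class and function names are ours for illustration; the sketch says nothing about the GPU engine the paper goes on to discuss.

```python
# Minimal sketch of the push-based execution model described above.
# Operator names and interfaces are illustrative, not taken from the paper.

class Sink:
    """Terminal operator: collects tuples pushed to it."""
    def __init__(self):
        self.results = []

    def push(self, row):
        self.results.append(row)


class Filter:
    """Push-based selection: forwards only rows satisfying a predicate."""
    def __init__(self, predicate, consumer):
        self.predicate = predicate
        self.consumer = consumer

    def push(self, row):
        if self.predicate(row):
            self.consumer.push(row)


def producer(rows, consumer):
    """The data source (e.g., a sensor feed or table scan) drives execution
    by pushing rows downstream, instead of being pulled by the executor."""
    for row in rows:
        consumer.push(row)


if __name__ == "__main__":
    sink = Sink()
    plan = Filter(lambda r: r["temp"] > 30.0, sink)
    producer([{"temp": 25.0}, {"temp": 31.5}, {"temp": 40.2}], plan)
    print(sink.results)  # [{'temp': 31.5}, {'temp': 40.2}]
```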
Best of both worlds: relational databases and statistics
H. Mühleisen, T. Lumley
DOI: 10.1145/2484838.2484869 | Pages: 32:1-32:4 | Published: 2013-07-29

Statistics software packages and relational database systems overlap considerably in the area of data loading, handling, and transformation. However, of the two, only databases are primarily optimized for high performance in this area. In this paper, we present our approach to bringing the best of these two worlds together: we integrate the analytics-optimized database MonetDB and the R environment for statistical computing in an unobtrusive, transparent, and compatible way.
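The division of labor the abstract alludes to can be illustrated with a small Python sketch: data handling and aggregation are pushed into the database instead of shipping raw rows to the statistics environment. It uses Python's built-in sqlite3 module purely as a stand-in; the paper's actual contribution is a MonetDB/R integration, which is not shown here.

```python
# Illustration of the general idea behind the integration described above:
# let the database do the data handling/aggregation instead of shipping raw
# rows to the statistics environment. Uses Python's built-in sqlite3 as a
# stand-in; the paper itself integrates MonetDB with R.

import sqlite3
import statistics

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE measurements (station TEXT, value REAL)")
con.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 14.0)],
)

# Statistics-package style: pull every row, then compute in the client.
rows = con.execute("SELECT value FROM measurements WHERE station = 'a'").fetchall()
client_mean = statistics.mean(v for (v,) in rows)

# Database style: push the aggregation into SQL, move only the result.
(db_mean,) = con.execute(
    "SELECT AVG(value) FROM measurements WHERE station = 'a'"
).fetchone()

print(client_mean, db_mean)  # both 2.0, but the second transfers one value
```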
Semantic query reformulation: the NIF experience
Amarnath Gupta, A. Bandrowski, C. Condit, Xufei Qian, J. Grethe, M. Martone
DOI: 10.1145/2484838.2484839 | Pages: 35:1-35:4 | Published: 2013-07-29

The NIF system is a semantic search engine that uses an ontology to improve search quality. In this experience paper we present SKEYQL, our semantic keyword query language, and describe a number of ontology-based query reformulation strategies that go beyond standard query expansion techniques. We also present a set of lessons learned, including strategies that did not work, and reaffirm the importance of pre-annotating data to ensure high-quality query results.
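As a rough illustration of ontology-based reformulation going beyond plain keyword matching, the toy Python sketch below expands a query term to its subclasses in a small is-a hierarchy. The ontology fragment and the expansion rule are our own assumptions and are far simpler than the SKEYQL strategies the paper describes.

```python
# A toy illustration of ontology-based query reformulation, in the spirit of
# the strategies described above. The tiny ontology and the expansion rule are
# ours and are much simpler than the NIF/SKEYQL strategies reported in the paper.

# Map each concept to its more specific subclasses (an is-a hierarchy).
SUBCLASSES = {
    "neuron": ["pyramidal cell", "purkinje cell", "granule cell"],
    "pyramidal cell": [],
    "purkinje cell": [],
    "granule cell": [],
}

def expand(term, ontology):
    """Replace a keyword by itself plus all of its transitive subclasses."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(ontology.get(t, []))
    return seen

def reformulate(keywords, ontology):
    """Turn a keyword query into a disjunction of ontology-expanded terms."""
    return {kw: sorted(expand(kw, ontology)) for kw in keywords}

print(reformulate(["neuron"], SUBCLASSES))
# {'neuron': ['granule cell', 'neuron', 'purkinje cell', 'pyramidal cell']}
```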
Nearest group queries
Dongxiang Zhang, C. Chan, K. Tan
DOI: 10.1145/2484838.2484866 | Pages: 7:1-7:12 | Published: 2013-07-29

k nearest neighbor (kNN) search is an important problem in a vast number of applications, including clustering, pattern recognition, image retrieval, and recommendation systems. It finds the k elements from a data source D that are closest to a given query point q in a metric space. In this paper, we extend the kNN query to retrieve the closest elements from multiple data sources. This new type of query is named the k nearest group (kNG) query; it finds the k groups of elements that are closest to q, with each group containing one object from each data source. The kNG query is useful in many location-based services. To efficiently process kNG queries, we propose a baseline algorithm using the R-tree as well as an improved version using the Hilbert R-tree. We also study a variant of the kNG query, named kNG Join, which is analogous to kNN Join: given a set of query points Q, kNG Join returns the k nearest groups for each point in Q. Such a query is useful in publish/subscribe systems to find matching items for a collection of subscribers. A comprehensive performance study was conducted on both synthetic and real datasets, and the experimental results show that the Hilbert R-tree achieves significantly better performance than the R-tree in answering both the kNG query and kNG Join.
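To make the kNG semantics concrete, here is a brute-force Python sketch that enumerates all groups (one object per data source) and keeps the k closest to q. It assumes the distance of a group is the sum of its members' distances to q, which is one natural choice but may not match the paper's exact definition, and it does not reflect the R-tree or Hilbert R-tree algorithms proposed there.

```python
# Brute-force baseline for the k nearest group (kNG) query described above,
# suitable for small inputs only. We assume the "distance" of a group to the
# query point is the sum of its members' Euclidean distances to q; the paper's
# definition and its R-tree / Hilbert R-tree algorithms may differ and are far
# more efficient than this enumeration.

from heapq import nsmallest
from itertools import product
from math import dist

def kng(q, sources, k):
    """Return the k groups (one object per source) closest to q."""
    def group_distance(group):
        return sum(dist(q, obj) for obj in group)
    # Enumerate every combination of one object from each data source.
    return nsmallest(k, product(*sources), key=group_distance)

restaurants = [(1.0, 1.0), (4.0, 4.0)]
parkings    = [(1.5, 1.0), (5.0, 5.0)]
print(kng((0.0, 0.0), [restaurants, parkings], k=2))
# [((1.0, 1.0), (1.5, 1.0)), ((4.0, 4.0), (1.5, 1.0))]
```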
A multidimensional data model with subcategories for flexibly capturing summarizability
S. Ariyan, L. Bertossi
DOI: 10.1145/2484838.2484857 | Pages: 6:1-6:12 | Published: 2013-07-29

In multidimensional (MD) databases and data warehouses, we commonly prefer instances that have summarizable dimensions, because they have good properties for query answering. Most typically, with summarizable dimensions, precomputed and materialized aggregate query results at lower levels of the dimension hierarchy can be used to correctly compute results at higher levels of the same hierarchy, improving efficiency. Since summarizability is such a desirable property, we argue that some established MD models cannot properly capture the summarizability condition, a consequence of the limited expressive power of their modeling languages. We propose an extension of the Hurtado-Mendelzon (HM) MD model with subcategories, the EHM model, and show that it can capture summarizability. We propose an efficient algorithm that, for a given cube view (i.e., MD aggregate query) in an EHM database, determines the minimal subset of precomputed cube views from which it can be correctly computed. Finally, we show how EHM databases can be implemented with minor modifications to familiar ROLAP schemas.
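The efficiency argument behind summarizability can be seen in a small Python sketch: when each member of a lower level rolls up to exactly one member of a higher level, a materialized lower-level aggregate answers the higher-level query without touching the facts. The toy City-to-Country hierarchy and sales figures are our own illustration, not the EHM model itself.

```python
# A small illustration of why summarizability matters, as discussed above:
# when every city rolls up to exactly one country, precomputed city-level
# aggregates can be reused to answer country-level queries. The toy data and
# hierarchy are ours, not from the paper.

from collections import defaultdict

# Dimension hierarchy: City -> Country (each city maps to exactly one country).
CITY_TO_COUNTRY = {"Ottawa": "Canada", "Toronto": "Canada", "Lima": "Peru"}

# Fact data: (city, sales).
FACTS = [("Ottawa", 100), ("Toronto", 250), ("Lima", 80), ("Ottawa", 20)]

# Precomputed, materialized cube view at the City level.
sales_by_city = defaultdict(int)
for city, amount in FACTS:
    sales_by_city[city] += amount

# Country-level query answered from the City-level view (no fact scan needed).
sales_by_country = defaultdict(int)
for city, subtotal in sales_by_city.items():
    sales_by_country[CITY_TO_COUNTRY[city]] += subtotal

# Same query answered directly from the facts, for comparison.
check = defaultdict(int)
for city, amount in FACTS:
    check[CITY_TO_COUNTRY[city]] += amount

print(dict(sales_by_country) == dict(check))  # True: the roll-up is correct
```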
Autonomous clustering for wireless sensor networks
Fabian D. Winter, Peer Kröger, Johannes Niedermayer, M. Renz
DOI: 10.1145/2484838.2484841 | Pages: 36:1-36:4 | Published: 2013-07-29

Most algorithms treat Wireless Sensor Networks (WSNs) merely as generators of data without any autonomy. In contrast, we propose the ACIDE framework: a completely decentralized, bottom-up clustering and information-exchange process that does not depend on given infrastructure such as fixed root nodes. While it places slightly higher requirements on the nodes, its dynamic and independent nature has many advantages, such as allowing the user to initiate queries from any point in the network rather than being limited to querying it through an a priori fixed sink node. The framework can cope with changing environments and energy depletion. Through careful abstraction, we also support customization and adaptation to different environments.
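As a very rough illustration of decentralized, bottom-up cluster formation with only local knowledge, the Python sketch below lets every node decide from its neighbors' state whether to become a cluster head or to attach to a neighbor. The heuristic (highest residual energy in the neighborhood becomes a head) is our own stand-in and is not the ACIDE protocol.

```python
# A generic, heavily simplified illustration of decentralized, bottom-up
# cluster formation in a sensor network: every node decides locally, using
# only its neighbors' state, whether to act as a cluster head or to attach to
# a neighbor. This is a toy heuristic of ours, not the ACIDE protocol.

import random

random.seed(0)

NUM_NODES, RADIO_RANGE = 20, 0.35
nodes = {i: (random.random(), random.random()) for i in range(NUM_NODES)}
energy = {i: random.random() for i in range(NUM_NODES)}  # residual energy

def neighbors(i):
    """Nodes within radio range of node i (local knowledge only)."""
    xi, yi = nodes[i]
    return [j for j, (xj, yj) in nodes.items()
            if j != i and (xi - xj) ** 2 + (yi - yj) ** 2 <= RADIO_RANGE ** 2]

cluster_of = {}
for i in nodes:
    nbrs = neighbors(i)
    # Local rule: a node with the highest residual energy in its neighborhood
    # declares itself a cluster head; everyone else attaches to its
    # highest-energy neighbor.
    if not nbrs or energy[i] >= max(energy[j] for j in nbrs):
        cluster_of[i] = i
    else:
        cluster_of[i] = max(nbrs, key=lambda j: energy[j])

heads = sorted(i for i in nodes if cluster_of[i] == i)
print("cluster heads:", heads)
```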
Providing multi-scale consistency for multi-scale geospatial data
João Sávio C. Longo, C. B. Medeiros
DOI: 10.1145/2484838.2484867 | Pages: 8:1-8:12 | Published: 2013-07-29

We are immersed in a world in which we constantly deal (and cope) with objects and phenomena at a variety of scales in space and time. With the increase in collaborative and interdisciplinary research, there is a growing need to handle data at multiple scales and in multiple representations within a single environment. Such multi-scale environments must support the manipulation of information while ensuring consistency. This paper is concerned with the challenges of managing data at multiple scales while preserving consistency across scales. Its main contributions are: (a) the specification of generic, extensible multi-scale integrity constraints; and (b) the implementation of a prototype based on data versioning, which supports the maintenance of these constraints. The prototype was tested using watershed data from Brazil.
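One simple form such a multi-scale integrity constraint could take is sketched below in Python: an attribute stored at a coarse scale must agree with the aggregate of its finer-scale parts. The schema, numbers, and the specific constraint are illustrative assumptions, not the constraints or watershed data from the paper.

```python
# A minimal sketch of one kind of multi-scale integrity constraint mentioned
# above: an attribute stored at a coarse scale must stay consistent with the
# values of its finer-scale parts. The constraint, schema, and numbers are
# illustrative assumptions, not the paper's constraints or watershed data.

# Fine scale: area (km^2) of each sub-basin; coarse scale: area of the basin.
fine_scale = {"sub_basin_1": 120.0, "sub_basin_2": 75.5, "sub_basin_3": 44.5}
coarse_scale = {"basin_A": 240.0}
PARTS_OF = {"basin_A": ["sub_basin_1", "sub_basin_2", "sub_basin_3"]}

def check_area_consistency(tolerance=1e-6):
    """Constraint: a basin's area equals the sum of its sub-basins' areas."""
    violations = []
    for basin, area in coarse_scale.items():
        fine_total = sum(fine_scale[p] for p in PARTS_OF[basin])
        if abs(fine_total - area) > tolerance:
            violations.append((basin, area, fine_total))
    return violations

print(check_area_consistency())  # [] -> consistent across the two scales
```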
Towards efficient discovery of coverage patterns in transactional databases
R. U. Kiran, Masashi Toyoda, M. Kitsuregawa
DOI: 10.1145/2484838.2484850 | Pages: 38:1-38:4 | Published: 2013-07-29

Coverage pattern mining is an important model in data mining. It provides useful information about sets of items in a transactional database whose coverage is interesting to the user. Coverage patterns do not satisfy the anti-monotonic property, which increases the search space in the itemset lattice and, in turn, the computational cost of mining these patterns. An Apriori-like algorithm known as CMine has been proposed in the literature to discover the patterns; it employs a pruning technique to reduce the search space. We observe that there is further scope for reducing the search space effectively. In this paper, we theoretically analyze different measures used in the pattern model and introduce a novel pruning technique to reduce the search space. We also propose an Apriori-like algorithm, called CMine++, to discover the patterns. Our performance study shows that mining coverage patterns with CMine++ is efficient.
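For intuition, the Python sketch below computes one commonly used notion of coverage, the fraction of transactions containing at least one item of a pattern, and shows why such a measure is not anti-monotonic. This is a simplified reading for illustration; it omits the overlap constraints of the full pattern model and the CMine/CMine++ pruning techniques.

```python
# A small illustration of the notion of coverage behind coverage pattern
# mining, as discussed above. Here the coverage support of an item set is
# taken to be the fraction of transactions containing at least one of its
# items; this is for illustration only and does not reproduce the full model
# (e.g., overlap constraints) or the CMine/CMine++ algorithms from the paper.

transactions = [
    {"a", "b"},
    {"a", "c"},
    {"b", "d"},
    {"c"},
    {"d", "e"},
]

def coverage_support(itemset, db):
    """Fraction of transactions covered by (containing any item of) itemset."""
    covered = sum(1 for t in db if t & set(itemset))
    return covered / len(db)

print(coverage_support({"a"}, transactions))       # 0.4
print(coverage_support({"a", "d"}, transactions))  # 0.8: adding an item can
# only keep or increase coverage, so the measure is not anti-monotonic, unlike
# the support of frequent itemsets.
```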
Mining multidimensional contextual outliers from categorical relational data
Guanting Tang, J. Bailey, J. Pei, Guozhu Dong
DOI: 10.1145/2484838.2484883 | Pages: 43:1-43:4 | Published: 2013-07-29

A wide range of methods have been proposed for detecting different types of outliers in full space and subspaces. However, the interpretability of outliers, that is, explaining in what ways and to what extent an object is an outlier, remains a critical open issue. In this paper, we develop a notion of contextual outliers on categorical data. Intuitively, a contextual outlier is a small group of objects that share strong similarity with a significantly larger reference group of objects on some attributes, but deviate dramatically on some other attributes. We develop a detection algorithm, and conduct experiments to evaluate our approach.
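The intuition can be illustrated with a toy Python sketch: fix a context given by some attributes, group objects by that context, and flag values of another attribute that are rare within a group dominated by a much larger reference value. The data, thresholds, and the fixed choice of context are our own assumptions; the paper's algorithm searches over contexts rather than taking one as given.

```python
# A toy illustration of the intuition behind contextual outliers described
# above: within a context defined by some attributes, a few objects deviate on
# another attribute from a much larger reference group. The context choice,
# thresholds, and data are ours; the paper's detection algorithm discovers
# such contexts systematically rather than taking them as given.

from collections import Counter

rows = [
    {"dept": "sales", "office": "NY", "laptop": "win"},
    {"dept": "sales", "office": "NY", "laptop": "win"},
    {"dept": "sales", "office": "NY", "laptop": "win"},
    {"dept": "sales", "office": "NY", "laptop": "mac"},  # deviates in context
    {"dept": "eng",   "office": "SF", "laptop": "mac"},
    {"dept": "eng",   "office": "SF", "laptop": "mac"},
]

def contextual_outliers(data, context_attrs, behave_attr, max_ratio=0.3):
    """Values of behave_attr that are rare within a given context group."""
    groups = {}
    for r in data:
        key = tuple(r[a] for a in context_attrs)
        groups.setdefault(key, []).append(r[behave_attr])
    outliers = []
    for key, values in groups.items():
        counts = Counter(values)
        majority = counts.most_common(1)[0][1]
        for value, c in counts.items():
            if c < majority and c / len(values) <= max_ratio:
                outliers.append((dict(zip(context_attrs, key)), value, c))
    return outliers

print(contextual_outliers(rows, ["dept", "office"], "laptop"))
# [({'dept': 'sales', 'office': 'NY'}, 'mac', 1)]
```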
On the combination of relative clustering validity criteria
L. Vendramin, P. Jaskowiak, R. Campello
DOI: 10.1145/2484838.2484844 | Pages: 4:1-4:12 | Published: 2013-07-29

Many different relative clustering validity criteria exist that are very useful as quantitative measures for assessing the quality of data partitions. These criteria have particular features that may make each of them more suitable for specific classes of problems. Nevertheless, the performance of each criterion is usually unknown a priori to the user, so choosing a specific criterion is not a trivial task. A possible way to circumvent this drawback is to combine different relative criteria in order to obtain more robust evaluations. However, this approach has so far been applied only in an ad hoc fashion, and its real potential is not well understood. In this paper, we present an extensive study of the combination of relative criteria on both synthetic and real datasets. The experiments involved 28 criteria and 4 different combination strategies applied to a varied collection of data partitions produced by 5 clustering algorithms. In total, 427,680 partitions of 972 synthetic datasets and 14,000 partitions of a collection of 400 image datasets were considered. Based on the results, we discuss the shortcomings and possible benefits of combining different relative criteria into a committee.
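One simple committee strategy of the kind studied above can be sketched in Python with scikit-learn: score several candidate partitions with a few relative criteria and aggregate their per-criterion rankings by mean rank. The three criteria and the aggregation rule here are our illustrative choices, not the 28 criteria and 4 strategies evaluated in the paper, and the sketch assumes scikit-learn is available.

```python
# A small sketch of one way to combine relative validity criteria into a
# committee, in the spirit of the study above: score several candidate
# partitions with multiple criteria and aggregate their rankings. The three
# criteria and the mean-rank aggregation are illustrative choices only.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Candidate partitions: k-means with different numbers of clusters.
partitions = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
              for k in range(2, 7)}

# Relative criteria; the sign flip makes "larger is better" hold for all.
criteria = {
    "silhouette": lambda X, y: silhouette_score(X, y),
    "calinski_harabasz": lambda X, y: calinski_harabasz_score(X, y),
    "davies_bouldin": lambda X, y: -davies_bouldin_score(X, y),  # lower is better
}

def mean_rank(partitions, criteria, X):
    """Rank partitions under each criterion, then average the ranks."""
    ranks = {k: 0.0 for k in partitions}
    for score in criteria.values():
        ordered = sorted(partitions, key=lambda k: score(X, partitions[k]),
                         reverse=True)
        for rank, k in enumerate(ordered, start=1):
            ranks[k] += rank / len(criteria)
    return ranks

ranks = mean_rank(partitions, criteria, X)
print(min(ranks, key=ranks.get))  # the committee's preferred number of clusters
```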