
Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management: latest articles

Data management systems on GPUs: promises and challenges
Yi-Cheng Tu, Anand Kumar, Di Yu, Ran Rui, Ryan Wheeler
The past decade has witnessed the rise of push-based data management systems, in which the query executor passively receives data from either remote data sources (e.g., sensors) or I/O processes that scan database tables/files from local storage. Unlike traditional relational database management system (RDBMS) architectures, which are mostly I/O-bound, push-based database systems often become heavily computation-bound because the data arrival rate can be very high. In this paper, we argue that modern multi-core hardware, especially Graphics Processing Units (GPUs), provides the most cost-effective computing platform for keeping up with the large volume of data streamed into a push-based database system. Based on that, we open a discussion on how to design and implement a query processing engine for such systems that runs on GPUs.
DOI: https://doi.org/10.1145/2484838.2484871 | Published: 2013-07-29 | Pages: 33:1-33:4
Citations: 11
Best of both worlds: relational databases and statistics
H. Mühleisen, T. Lumley
Statistics software packages and relational database systems overlap considerably in data loading, handling, and transformation. However, of the two, only databases are optimized primarily for high performance in this area. In this paper, we present our approach to bringing the best of these two worlds together. We integrate the analytics-optimized database MonetDB and the R environment for statistical computing in a non-obtrusive, transparent and compatible way.
DOI: https://doi.org/10.1145/2484838.2484869 | Published: 2013-07-29 | Pages: 32:1-32:4
Citations: 11
Semantic query reformulation: the NIF experience
Amarnath Gupta, A. Bandrowski, C. Condit, Xufei Qian, J. Grethe, M. Martone
The NIF system is a semantic search engine that uses an ontology to improve search quality. In this experience paper we present SKEYQL, our semantic keyword query language, and describe a number of ontology-based query reformulation strategies that go beyond standard query expansion techniques. We also present a set of lessons learnt and strategies that did not work. We reaffirm the importance of pre-annotating data to ensure quality query results.
DOI: https://doi.org/10.1145/2484838.2484839 | Published: 2013-07-29 | Pages: 35:1-35:4
Citations: 0
Nearest group queries
Dongxiang Zhang, C. Chan, K. Tan
k nearest neighbor (kNN) search is an important problem in a vast number of applications, including clustering, pattern recognition, image retrieval and recommendation systems. It finds k elements from a data source D that are closest to a given query point q in a metric space. In this paper, we extend the kNN query to retrieve closest elements from multiple data sources. This new type of query is named the k nearest group (kNG) query, which finds k groups of elements that are closest to q, with each group containing one object from each data source. The kNG query is useful in many location-based services. To efficiently process kNG queries, we propose a baseline algorithm using an R-tree as well as an improved version using a Hilbert R-tree. We also study a variant of the kNG query, named kNG Join, which is analogous to kNN Join. Given a set of query points Q, kNG Join returns the k nearest groups for each point in Q. Such a query is useful in publish/subscribe systems to find matching items for a collection of subscribers. A comprehensive performance study was conducted on both synthetic and real datasets, and the experimental results show that the Hilbert R-tree achieves significantly better performance than the R-tree in answering both kNG queries and kNG Joins.
DOI: https://doi.org/10.1145/2484838.2484866 | Published: 2013-07-29 | Pages: 7:1-7:12
Citations: 29
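To make the kNG query in the entry above concrete, here is a minimal brute-force sketch in Python. It is not the R-tree or Hilbert R-tree algorithm from the paper; it simply enumerates every group. The function name `k_nearest_groups`, the sample data, and in particular the choice of "sum of member distances" as a group's distance to q are assumptions made for illustration, since the abstract does not define the aggregate.

```python
import heapq
import itertools
import math


def k_nearest_groups(q, sources, k):
    """Return the k groups with the smallest aggregate distance to q.

    Each group contains exactly one object from each data source.
    ASSUMPTION: a group's distance to q is the sum of its members'
    Euclidean distances to q; the abstract does not define this.
    """
    # Enumerate every combination of one object per source (brute force).
    groups = itertools.product(*sources)
    return heapq.nsmallest(
        k, groups, key=lambda g: sum(math.dist(p, q) for p in g)
    )


# Hypothetical data: two sources of 2-D points, e.g. cafes and bookshops.
cafes = [(0.0, 1.0), (5.0, 5.0)]
bookshops = [(1.0, 0.0), (0.0, 3.0)]

result = k_nearest_groups((0.0, 0.0), [cafes, bookshops], k=2)
print(result)
```

The brute-force cost grows with the product of the source sizes, which is presumably why an index-based method is needed for realistic inputs.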
A multidimensional data model with subcategories for flexibly capturing summarizability
S. Ariyan, L. Bertossi
In multidimensional (MD) databases and data warehouses we commonly prefer instances that have summarizable dimensions. This is because they have good properties for query answering. Most typically, with summarizable dimensions, precomputed and materialized aggregate query results at lower levels of the dimension hierarchy can be used to correctly compute results at higher levels of the same hierarchy, improving efficiency. Summarizability being such a desirable property, we argue that some established MD models cannot properly model the summarizability condition, and that this is a consequence of the limited expressive power of the modeling languages. We propose an extension of the Hurtado-Mendelzon (HM) MD model with subcategories, the EHM model, and show that it allows summarizability to be captured. We propose an efficient algorithm that, for a given cube view (i.e. MD aggregate query) in an EHM database, determines from which minimal subset of precomputed cube views it can be correctly computed. Finally, we show how the EHM model can be implemented with minor modifications to the familiar ROLAP schemas.
DOI: https://doi.org/10.1145/2484838.2484857 | Published: 2013-07-29 | Pages: 6:1-6:12
Citations: 8
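The summarizability problem described in the entry above can be illustrated with a small Python sketch. It does not implement the EHM model or the authors' algorithm; it only shows how reusing a precomputed lower-level aggregate can give a wrong higher-level result when a dimension is not summarizable. The City, State, Country hierarchy, the sales figures, and the helper `roll_up` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact table: sales per city.
sales = {"Toronto": 10, "Ottawa": 5, "Vancouver": 7, "Washington": 4}

# Child -> parent links. In this made-up instance Washington has no
# State-level parent and links straight to its country, so the State
# level does not cover every city.
city_to_state = {"Toronto": "Ontario", "Ottawa": "Ontario", "Vancouver": "BC"}
state_to_country = {"Ontario": "Canada", "BC": "Canada"}
city_to_country = {
    "Toronto": "Canada",
    "Ottawa": "Canada",
    "Vancouver": "Canada",
    "Washington": "USA",
}


def roll_up(totals, parent_of):
    """Sum child totals into their parents; children without a parent are skipped."""
    out = defaultdict(int)
    for child, value in totals.items():
        if child in parent_of:
            out[parent_of[child]] += value
    return dict(out)


by_state = roll_up(sales, city_to_state)         # precomputed cube view
via_state = roll_up(by_state, state_to_country)  # reuse the State view
direct = roll_up(sales, city_to_country)         # computed from base data

print(via_state)  # country totals derived from the State view
print(direct)     # country totals derived from the base data
```

Here the two country-level results disagree because Washington's sales never reach the State view, which is the kind of incorrect reuse that a summarizable dimension rules out.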
Autonomous clustering for wireless sensor networks
Fabian D. Winter, Peer Kröger, Johannes Niedermayer, M. Renz
Most algorithms treat Wireless Sensor Networks (WSNs) only as a generator of data without any autonomy. In contrast to this approach, we propose the ACIDE framework: a completely decentralized, bottom-up clustering process and information exchange that does not depend on given infrastructure such as fixed root nodes. While it has slightly higher requirements for the nodes, its dynamic and independent nature has many advantages, such as the user being able to initiate queries from any point in the network rather than being limited to querying the network through an a priori fixed sink node. The framework can deal with changing environments and energy depletion. Through careful abstraction, we also support customization and adaptation to different environments.
DOI: https://doi.org/10.1145/2484838.2484841 | Published: 2013-07-29 | Pages: 36:1-36:4
Citations: 0
Providing multi-scale consistency for multi-scale geospatial data
João Sávio C. Longo, C. B. Medeiros
We are immersed in a world in which we constantly deal (and cope) with objects and phenomena at a variety of scales in space and time. With the increase in collaborative and inter-disciplinary research, there is a growing need for handling data at multiple scales and in multiple representations within a single environment. These so-called multi-scale environments must support the manipulation of information while ensuring consistency. This paper is concerned with the challenges of managing data at multiple scales while preserving consistency across scales. Its main contributions are the following: (a) the specification of generic, extensible multi-scale integrity constraints; and (b) the implementation of a prototype based on data versioning, which supports the maintenance of these constraints. This prototype was tested using watershed data from Brazil.
DOI: https://doi.org/10.1145/2484838.2484867 | Published: 2013-07-29 | Pages: 8:1-8:12
Citations: 4
Towards efficient discovery of coverage patterns in transactional databases
R. U. Kiran, Masashi Toyoda, M. Kitsuregawa
Coverage pattern mining is an important model in data mining. It provides useful information pertaining to the sets of items that have coverage interesting to the users in a transactional database. The coverage patterns do not satisfy the anti-monotonic property. This increases the search space in the itemset lattice, which in turn increases the computational cost of mining these patterns. An Apriori-like algorithm known as CMine has been proposed in the literature to discover the patterns. It employs a pruning technique to reduce the search space. We have observed that there exists further scope for reducing the search space effectively. In this paper, we theoretically analyze different measures used in the pattern model, and introduce a novel pruning technique to reduce the search space. An Apriori-like algorithm, called CMine++, has also been proposed to discover the patterns. The performance study shows that mining coverage patterns with CMine++ is efficient.
DOI: https://doi.org/10.1145/2484838.2484850 | Published: 2013-07-29 | Pages: 38:1-38:4
Citations: 0
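The remark in the entry above that coverage patterns do not satisfy the anti-monotonic property can be illustrated with a small Python sketch. It is not CMine or CMine++. It assumes, as is common in the coverage-pattern literature but not stated in the abstract, that the coverage of an itemset is the fraction of transactions containing at least one of its items. The function name `coverage_support` and the toy transactions are hypothetical.

```python
def coverage_support(itemset, transactions):
    """Fraction of transactions that contain at least one item of `itemset`.

    ASSUMPTION: this "at least one item" definition is not given in the
    abstract; it is used here only for illustration.
    """
    covered = sum(1 for t in transactions if t & itemset)
    return covered / len(transactions)


# Hypothetical transactional database with four transactions.
transactions = [{"a"}, {"a", "b"}, {"b"}, {"c"}]

cs_a = coverage_support({"a"}, transactions)
cs_ab = coverage_support({"a", "b"}, transactions)
cs_abc = coverage_support({"a", "b", "c"}, transactions)

print(cs_a, cs_ab, cs_abc)
```

Under this assumed definition, coverage can only grow as items are added. With a minimum coverage threshold of, say, 0.7, the itemset {a} fails while its superset {a, b} passes, so the usual Apriori rule of discarding all supersets of a failing itemset cannot be applied directly, and other pruning criteria are needed.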
Mining multidimensional contextual outliers from categorical relational data
Guanting Tang, J. Bailey, J. Pei, Guozhu Dong
A wide range of methods have been proposed for detecting different types of outliers in full space and subspaces. However, the interpretability of outliers, that is, explaining in what ways and to what extent an object is an outlier, remains a critical open issue. In this paper, we develop a notion of contextual outliers on categorical data. Intuitively, a contextual outlier is a small group of objects that share strong similarity with a significantly larger reference group of objects on some attributes, but deviate dramatically on some other attributes. We develop a detection algorithm, and conduct experiments to evaluate our approach.
DOI: https://doi.org/10.1145/2484838.2484883 | Published: 2013-07-29 | Pages: 43:1-43:4
Citations: 50
On the combination of relative clustering validity criteria
L. Vendramin, P. Jaskowiak, R. Campello
Many different relative clustering validity criteria exist that are very useful as quantitative measures for assessing the quality of data partitions. These criteria are endowed with particular features that may make each of them more suitable for specific classes of problems. Nevertheless, the performance of each criterion is usually unknown to the user a priori. Hence, choosing a specific criterion is not a trivial task. A possible approach to circumvent this drawback consists of combining different relative criteria in order to obtain more robust evaluations. However, this approach has so far been applied in an ad hoc fashion only; its real potential is not well understood. In this paper, we present an extensive study on the combination of relative criteria considering both synthetic and real datasets. The experiments involved 28 criteria and 4 different combination strategies applied to a varied collection of data partitions produced by 5 clustering algorithms. In total, 427,680 partitions of 972 synthetic datasets and 14,000 partitions of a collection of 400 image datasets were considered. Based on the results, we discuss the shortcomings and possible benefits of combining different relative criteria into a committee.
DOI: https://doi.org/10.1145/2484838.2484844 | Published: 2013-07-29 | Pages: 4:1-4:12
Citations: 28
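The idea of combining relative validity criteria into a committee, as described in the entry above, can be sketched in Python. The abstract does not name the 4 combination strategies or the 28 criteria, so the mean-rank rule below is only one plausible strategy chosen for illustration. The criteria used (silhouette, Davies-Bouldin, Calinski-Harabasz), the scores, and the function names `ranks` and `combine_by_mean_rank` are hypothetical.

```python
def ranks(scores, higher_is_better=True):
    """Rank candidate partitions under one criterion (rank 1 is best).

    Ties are broken by input order, which is a simplification.
    """
    order = sorted(
        range(len(scores)), key=lambda i: scores[i], reverse=higher_is_better
    )
    result = [0] * len(scores)
    for position, idx in enumerate(order, start=1):
        result[idx] = position
    return result


def combine_by_mean_rank(criteria):
    """Average each partition's rank over all criteria (lower is better).

    `criteria` is a list of (scores, higher_is_better) pairs.
    """
    all_ranks = [ranks(scores, hib) for scores, hib in criteria]
    n = len(all_ranks[0])
    return [sum(r[i] for r in all_ranks) / len(all_ranks) for i in range(n)]


# Hypothetical scores for three candidate partitions of one dataset.
silhouette = [0.62, 0.48, 0.55]             # higher is better
davies_bouldin = [0.90, 0.70, 1.40]         # lower is better
calinski_harabasz = [310.0, 250.0, 280.0]   # higher is better

mean_ranks = combine_by_mean_rank(
    [(silhouette, True), (davies_bouldin, False), (calinski_harabasz, True)]
)
best = min(range(len(mean_ranks)), key=mean_ranks.__getitem__)
print(mean_ranks, best)
```

Working on ranks rather than raw scores sidesteps the fact that criteria live on different scales and may be maximized or minimized, although it discards how large the differences between partitions are.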