
Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management: latest publications

A study of partitioning and parallel UDF execution with the SAP HANA database
Philippe Grosse, Norman May, Wolfgang Lehner
Large-scale data analysis relies on custom code both for preparing the data for analysis and for the core analysis algorithms. The map-reduce framework offers a simple model for parallelizing custom code, but it does not integrate well with relational databases. Likewise, the literature on optimizing queries in relational databases has largely ignored user-defined functions (UDFs). In this paper, we discuss annotations for user-defined functions that facilitate optimizations considering both relational operators and UDFs. We focus on optimizations that enable the parallel execution of relational operators and UDFs for a number of typical patterns. A study on real-world data investigates the opportunities for parallelizing complex data flows containing both relational operators and UDFs.
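The partition-parallel pattern the abstract describes can be pictured with a small sketch. This is illustrative Python, not SAP HANA's actual UDF API; `partition_by`, `per_partition_udf`, and `parallel_apply` are hypothetical names. The idea: a UDF declared safe to run per partition of a key column is applied to each partition independently and the results are unioned.

```python
# Hypothetical sketch of partition-wise parallel UDF execution.
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby
from operator import itemgetter

def partition_by(rows, key):
    # Split the input relation into partitions by the key column.
    rows = sorted(rows, key=itemgetter(key))
    return [list(g) for _, g in groupby(rows, key=itemgetter(key))]

def per_partition_udf(part):
    # Toy UDF: aggregate a value column within one partition.
    return {"key": part[0]["k"], "total": sum(r["v"] for r in part)}

def parallel_apply(rows, key, udf):
    # Run the UDF on each partition in parallel and union the results.
    parts = partition_by(rows, key)
    with ThreadPoolExecutor() as ex:
        return list(ex.map(udf, parts))

rows = [{"k": "a", "v": 1}, {"k": "b", "v": 2}, {"k": "a", "v": 3}]
print(sorted(parallel_apply(rows, "k", per_partition_udf),
             key=lambda r: r["key"]))
# → [{'key': 'a', 'total': 4}, {'key': 'b', 'total': 2}]
```

The annotation-driven point of the paper is that the optimizer can only choose this plan if it knows the UDF tolerates partitioned input.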
DOI: 10.1145/2618243.2618274 · Pages: 36:1-36:4 · Published: 2014-06-30
Citations: 14
Mining statistically sound co-location patterns at multiple distances
Sajib Barua, J. Sander
Existing co-location mining algorithms require a user-provided distance threshold at which prevalent patterns are searched. Since spatial interactions may, in reality, happen at different distances, finding the right distance threshold to mine all true patterns is not easy, and a single appropriate threshold may not even exist. A standard co-location mining algorithm also requires a prevalence-measure threshold to find prevalent patterns. The prevalence-measure values of true co-location patterns occurring at different distances may vary, and finding a prevalence-measure threshold that mines all true patterns without reporting random patterns is not easy and sometimes not even possible. In this paper, we propose an algorithm to mine true co-location patterns at multiple distances. Our approach is based on a statistical test and requires thresholds neither for the prevalence measure nor for the interaction distance. We evaluate the efficacy of our algorithm on synthetic and real data sets, comparing it with the state-of-the-art co-location mining approach.
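The prevalence measure the abstract refers to is commonly the participation index; a minimal sketch of its computation at one fixed distance d follows (this is the standard textbook definition, not the authors' statistical-test algorithm, and the function names are illustrative). For a pattern {A, B}, the participation ratio of A is the fraction of A-instances with a B-instance within distance d; the index is the minimum of the two ratios.

```python
# Participation index of a two-feature co-location pattern at distance d.
from math import hypot

def participation_index(points, pattern, d):
    # points: list of (feature_label, (x, y)) instances.
    def ratio(feat, other):
        inst = [p for f, p in points if f == feat]
        near = [p for p in inst
                if any(f == other and hypot(p[0]-q[0], p[1]-q[1]) <= d
                       for f, q in points)]
        return len(near) / len(inst)
    a, b = pattern
    return min(ratio(a, b), ratio(b, a))

pts = [("A", (0, 0)), ("A", (5, 5)), ("B", (0, 1)), ("B", (5, 6))]
print(participation_index(pts, ("A", "B"), 1.5))  # → 1.0
```

The paper's difficulty is visible here: the result depends directly on the chosen d, which motivates testing patterns across multiple distances against a null model instead of thresholding.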
DOI: 10.1145/2618243.2618261 · Pages: 7:1-7:12 · Published: 2014-06-30
Citations: 9
Detecting correlated columns in relational databases with mixed data types
H. Nguyen, Emmanuel Müller, Periklis Andritsos, Klemens Böhm
In a database, besides known dependencies among columns (e.g., foreign-key and primary-key constraints), there are many other correlations unknown to the database users. Extracting such hidden correlations is known to be useful for various tasks in database optimization and data analytics. However, the task is challenging due to the lack of measures to quantify column correlations. Correlations may exist among columns of different data types and value domains, which makes techniques based on value matching inapplicable. Besides, a column may have multiple semantics, which does not allow disjoint partitioning of columns. Finally, from a computational perspective, one has to consider a huge search space that grows exponentially with the number of columns. In this paper, we present a novel method for detecting column correlations (DeCoRel). It aims at discovering overlapping groups of correlated columns with mixed data types in relational databases. To handle the heterogeneity of data types, we propose a new correlation measure that combines the good features of Shannon entropy and cumulative entropy. To address the huge search space, we introduce an efficient algorithm for the column grouping. We show our method to be more general than one of the most recent approaches in the database literature. Experiments reveal that our method achieves both higher quality and better scalability than existing techniques.
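The measure in DeCoRel combines Shannon entropy (suited to categorical columns) with cumulative entropy (suited to numeric ones). A sketch of the two ingredients, using the standard base-2 definitions rather than the paper's exact estimator:

```python
# Shannon entropy of a discrete column and cumulative entropy of a numeric
# column (empirical-CDF form), both in bits.
from collections import Counter
from math import log2

def shannon_entropy(col):
    n = len(col)
    return -sum((c / n) * log2(c / n) for c in Counter(col).values())

def cumulative_entropy(col):
    # CE(X) = -sum over sorted gaps of (x_{i+1} - x_i) * F(x_i) * log2 F(x_i),
    # where F is the empirical CDF; no discretization of the values needed.
    xs = sorted(col)
    n = len(xs)
    h = 0.0
    for i in range(n - 1):
        F = (i + 1) / n
        h -= (xs[i + 1] - xs[i]) * F * log2(F)
    return h

print(shannon_entropy(["a", "a", "b", "b"]))  # → 1.0
print(cumulative_entropy([0.0, 1.0]))         # → 0.5
```

Cumulative entropy is defined directly on the ordered numeric values, which is what lets a single measure bridge categorical and continuous columns.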
DOI: 10.1145/2618243.2618251 · Pages: 30:1-30:12 · Published: 2014-06-30
Citations: 14
New approaches to storing and manipulating multi-dimensional sparse arrays
E. Otoo, Hairong Wang, Gideon Nimako
In this paper, we introduce storage schemes for multi-dimensional sparse arrays (MDSAs) that handle the sparsity of the array with two primary goals: reducing the storage overhead and maintaining efficient data-element access. Four schemes are proposed: i.) the PATRICIA trie compressed storage method (PTCS), which uses a PATRICIA trie to store the valid non-zero array elements; ii.) the extended compressed row storage (xCRS), which extends the CRS method for sparse-matrix storage to sparse arrays of higher dimensions and achieves the best data-element access efficiency of all the methods; iii.) the bit-encoded xCRS (BxCRS), which optimizes the storage utilization of xCRS by applying data compression with run-length encoding while maintaining its data-access efficiency; and iv.) a hybrid approach that provides a desired balance between storage utilization and data-manipulation efficiency by combining xCRS and the Bit Encoded Sparse Storage (BESS). These storage schemes were evaluated and compared on three basic array operations, namely constructing the storage scheme, accessing a random element, and retrieving a sub-array, using a set of synthetic sparse multi-dimensional arrays.
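Classic CRS, the scheme xCRS extends to higher dimensions, can be sketched in a few lines. The 2-D version below shows the three arrays (values, column indices, row pointers) and the O(row-length) element access that the paper's variants build on; it is a generic illustration, not the authors' code.

```python
# Compressed row storage (CRS) for a 2-D sparse matrix.
def to_crs(dense):
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))   # row i occupies values[row_ptr[i]:row_ptr[i+1]]
    return values, col_idx, row_ptr

def crs_get(crs, i, j):
    # Scan only the stored entries of row i.
    values, col_idx, row_ptr = crs
    for k in range(row_ptr[i], row_ptr[i + 1]):
        if col_idx[k] == j:
            return values[k]
    return 0

m = [[0, 5, 0], [0, 0, 0], [7, 0, 9]]
crs = to_crs(m)
print(crs)                 # → ([5, 7, 9], [1, 0, 2], [0, 1, 1, 3])
print(crs_get(crs, 2, 2))  # → 9
```

xCRS generalizes the row-pointer idea hierarchically: one pointer array per dimension, with the innermost level indexing the non-zero values.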
DOI: 10.1145/2618243.2618281 · Pages: 41:1-41:4 · Published: 2014-06-30
Citations: 6
Simulation workflow design tailor-made for scientists
P. Reimann, H. Schwarz
Scientific workflows have to deal with highly heterogeneous data environments. In particular, they have to carry out complex data-provisioning tasks that filter and transform heterogeneous input data in such a way that underlying tools or services can ingest them. This results in a high complexity of workflow design. Scientists often want to design their workflows on their own, but usually do not have the necessary skills to cope with this complexity. Therefore, we have developed a pattern-based approach to workflow design, focusing mainly on workflows that realize numeric simulations [4]. This approach relieves scientists of the burden of specifying low-level details of data provisioning. In this demonstration, we apply a prototype implementation of our approach to various use cases and show how it makes simulation workflow design tailor-made for scientists.
DOI: 10.1145/2618243.2618291 · Pages: 49:1-49:4 · Published: 2014-06-30
Citations: 4
Point cloud databases
L. Dobos, I. Csabai, J. Szalai-Gindl, T. Budavári, A. Szalay
We introduce the concept of the point cloud database, a new kind of database system aimed primarily at scientific applications. Many scientific observations, experiments, feature-extraction algorithms and large-scale simulations produce enormous amounts of data that are better represented as sparse (but often highly clustered) points in a k-dimensional (k ≲ 10) metric space than on a multi-dimensional grid. Dimensionality-reduction techniques, such as principal components, are also widely used to project high-dimensional data into similarly low-dimensional spaces. Analysis techniques developed to work on multi-dimensional data points are usually implemented as in-memory algorithms and need to be modified to work in distributed cluster environments and on large amounts of disk-resident data. We conclude that the relational model, with certain additions, is appropriate for point clouds, but point cloud databases must also provide a unique set of spatial search and proximity join operators, indexing schemes, and query-language constructs that make them a distinct class of database systems.
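Among the operators the abstract names, the proximity join is the most database-specific. A toy grid-hash version for k = 2 dimensions (bucketing points by cell of side eps and probing only the 3x3 cell neighborhood) sketches how such an operator avoids all-pairs comparison; it is an illustration of the general technique, not code from the paper.

```python
# Grid-hash proximity join: pairs (p, q) with p in A, q in B, dist(p, q) <= eps.
from collections import defaultdict
from math import dist, floor

def proximity_join(pts_a, pts_b, eps):
    # Bucket B-points by grid cell of side eps.
    grid = defaultdict(list)
    for q in pts_b:
        grid[(floor(q[0] / eps), floor(q[1] / eps))].append(q)
    out = []
    for p in pts_a:
        cx, cy = floor(p[0] / eps), floor(p[1] / eps)
        # Any q within eps of p must lie in p's cell or an adjacent one.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for q in grid[(cx + dx, cy + dy)]:
                    if dist(p, q) <= eps:
                        out.append((p, q))
    return out

a = [(0.0, 0.0), (10.0, 10.0)]
b = [(0.5, 0.0), (3.0, 3.0)]
print(proximity_join(a, b, 1.0))  # → [((0.0, 0.0), (0.5, 0.0))]
```

For the k ≲ 10 dimensionality the paper targets, real systems replace the grid with space-filling-curve or tree indexes, but the cell-pruning idea is the same.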
DOI: 10.1145/2618243.2618275 · Pages: 33:1-33:4 · Published: 2014-06-30
Citations: 4
DistillFlow: removing redundancy in scientific workflows
Jiuqiang Chen, Sarah Cohen Boulakia, C. Froidevaux, C. Goble, P. Missier, Alan R. Williams
Scientific workflow management systems are increasingly used by scientists to specify complex data processing pipelines. Workflows are represented using a graph structure, where nodes represent tasks and links represent the dataflow. However, the complexity of workflow structures is increasing over time, reducing the rate of scientific workflow reuse. Here, we introduce DistillFow, a tool based on effective methods for workflow design, with a focus on the Taverna model. DistillFlow is able to detect "anti-patterns" in the structure of workflows (idiomatic forms that lead to over-complicated design) and replace them with different patterns to reduce the workflow's overall structural complexity. Rewriting workflows in this way is beneficial both in terms of user experience and workflow maintenance.
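Anti-pattern removal can be pictured as graph rewriting. The toy example below (plain Python, not Taverna or DistillFlow code) merges duplicate task nodes that perform the same operation on the same inputs, one class of redundancy such a tool could eliminate; the dictionary/edge-set encoding is an assumption made for illustration.

```python
# Merge workflow tasks that are duplicates: same operation, same input set.
def merge_duplicates(tasks, edges):
    # tasks: id -> operation name; edges: set of (src, dst) dataflow links.
    inputs = {t: frozenset(s for s, d in edges if d == t) for t in tasks}
    seen, canon = {}, {}
    for t in sorted(tasks):
        key = (tasks[t], inputs[t])
        canon[t] = seen.setdefault(key, t)   # first task with this signature wins
    new_edges = {(canon[s], canon[d]) for s, d in edges}
    new_tasks = {t: op for t, op in tasks.items() if canon[t] == t}
    return new_tasks, new_edges

tasks = {"a": "load", "b": "filter", "c": "filter", "d": "join"}
edges = {("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")}
nt, ne = merge_duplicates(tasks, edges)
print(sorted(nt))  # → ['a', 'b', 'd']
```

A single pass suffices here; a full rewriter would iterate, since merging nodes can expose further duplicates upstream.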
DOI: 10.1145/2618243.2618287 · Pages: 46:1-46:4 · Published: 2014-06-30
Citations: 3
Subspace anytime stream clustering
Marwan Hassani, P. Kranen, Rajveer Saini, T. Seidl
Clustering of high-dimensional streaming data is an emerging field of research. A real-life data stream imposes many challenges on the clustering task, as an endless amount of data arrives constantly. A lot of research has been done on full-space stream clustering. To handle the varying speeds of the data stream, "anytime" algorithms have been proposed, but so far only for full-space stream clustering. However, data streams from many application domains contain an abundance of dimensions; the clusters often exist only in specific subspaces (subsets of dimensions) and do not show up in the full feature space. In this paper, the first algorithm that considers both the high dimensionality and the varying speeds of streaming data is proposed. The algorithm, called SubClusTree, can flexibly adapt to different stream speeds and makes the best use of the available time to provide a high-quality subspace clustering. The experimental results prove the effectiveness of our anytime subspace concept.
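SubClusTree itself is not given as pseudocode here; the sketch below shows the generic bottom-up subspace step that subspace clustering methods share (CLIQUE-style): find dense 1-D grid cells per dimension, then join them into 2-D candidates. All names and the grid parameters are illustrative.

```python
# Bottom-up subspace clustering step: dense units per dimension, then 2-D join.
from collections import Counter
from itertools import combinations

def dense_units(points, dim, width, min_pts):
    # 1-D grid cells along `dim` holding at least min_pts points.
    cells = Counter(int(p[dim] // width) for p in points)
    return {c for c, n in cells.items() if n >= min_pts}

def dense_2d(points, width, min_pts):
    # Candidate 2-D dense cells: both projections must already be dense.
    out = {}
    dims = range(len(points[0]))
    for d1, d2 in combinations(dims, 2):
        u1 = dense_units(points, d1, width, min_pts)
        u2 = dense_units(points, d2, width, min_pts)
        cells = Counter((int(p[d1] // width), int(p[d2] // width))
                        for p in points)
        out[(d1, d2)] = {c for c, n in cells.items()
                         if n >= min_pts and c[0] in u1 and c[1] in u2}
    return out

pts = [(0.1, 0.2, 9.0), (0.2, 0.1, 1.0), (0.3, 0.3, 5.0), (7.0, 7.1, 2.0)]
print(dense_2d(pts, 1.0, 3)[(0, 1)])  # → {(0, 0)}
```

The cluster exists in the subspace (dims 0, 1) but not in any subspace involving dim 2, which illustrates why full-space methods miss it; the "anytime" contribution of the paper is orthogonal, interleaving such refinement with stream arrival.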
DOI: 10.1145/2618243.2618286 · Pages: 37:1-37:4 · Published: 2014-06-30
Citations: 23
Communication-efficient preference top-k monitoring queries via subscriptions
Kamalas Udomlamlert, T. Hara, S. Nishio
With the increase of data generation in distributed settings such as peer-to-peer systems and sensor networks, top-k query processing, which returns only a small set of data that satisfies many users' preferences, becomes a substantial issue. When data are periodically updated in each epoch (e.g., weather information), a naive solution is to aggregate all data and their updates to ensure the correctness of the final answers; however, this is too costly in terms of data transfer, especially for data-aggregator nodes. In this paper, we propose a top-k monitoring query processing method for 2-tier distributed systems based on a publish-subscribe scheme. A set of top-k subscriptions specifying the summary scope of users' interests is sent to aggregators to limit the number of transferred data records per epoch. In addition, instead of issuing subscriptions for all queries, our method identifies a small set of minimal subscriptions, resulting in lower communication overhead. Our experiments show that our technique is efficient and outperforms other comparative reactive methods.
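The filtering effect of a subscription can be sketched in a few lines. This is an illustrative simplification, not the paper's protocol: the monitor pushes the current k-th best score down as a subscription threshold, and per epoch a node forwards only records that could still enter the top-k.

```python
# Threshold-subscription sketch for communication-efficient top-k monitoring.
import heapq

def epoch_transfer(node_records, score, threshold):
    # Node side: ship only candidates that can beat the current k-th best.
    return [r for r in node_records if score(r) > threshold]

def monitor(nodes, score, k):
    threshold = float("-inf")
    topk = []                       # min-heap of (score, record)
    for records in nodes:           # one "epoch" per node batch, for illustration
        for r in epoch_transfer(records, score, threshold):
            heapq.heappush(topk, (score(r), r))
            if len(topk) > k:
                heapq.heappop(topk)
        if len(topk) == k:
            threshold = topk[0][0]  # new k-th best, pushed out as subscription
    return sorted((s for s, _ in topk), reverse=True)

nodes = [[3, 9, 1], [8, 2], [7, 4]]
print(monitor(nodes, lambda r: r, 2))  # → [9, 8]
```

In the run above only 4 of the 7 records cross the wire (3, 9, 1 in the first epoch, then 8), which is the record-transfer saving the abstract claims.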
DOI: 10.1145/2618243.2618284 · Pages: 44:1-44:4 · Published: 2014-06-30
Citations: 2
Local context selection for outlier ranking in graphs with multiple numeric node attributes
Patricia Iglesias Sánchez, Emmanuel Müller, Oretta Irmler, Klemens Böhm
Outlier ranking aims at the distinction between exceptional outliers and regular objects by measuring deviation of individual objects. In graphs with multiple numeric attributes, not all the attributes are relevant or show dependencies with the graph structure. Considering both graph structure and all given attributes, one cannot measure a clear deviation of objects. This is because the existence of irrelevant attributes clearly hinders the detection of outliers. Thus, one has to select local outlier contexts including only those attributes showing a high contrast between regular and deviating objects. It is an open challenge to detect meaningful local contexts for each node in attributed graphs. In this work, we propose a novel local outlier ranking model for graphs with multiple numeric node attributes. For each object, our technique determines its subgraph and its statistically relevant subset of attributes locally. This context selection enables a high contrast between an outlier and the regular objects. Out of this context, we compute the outlierness score by incorporating both the attribute value deviation and the graph structure. In our evaluation on real and synthetic data, we show that our approach is able to detect contextual outliers that are missed by other outlier models.
{"title":"Local context selection for outlier ranking in graphs with multiple numeric node attributes","authors":"Patricia Iglesias Sánchez, Emmanuel Müller, Oretta Irmler, Klemens Böhm","doi":"10.1145/2618243.2618266","DOIUrl":"https://doi.org/10.1145/2618243.2618266","url":null,"abstract":"Outlier ranking aims at the distinction between exceptional outliers and regular objects by measuring deviation of individual objects. In graphs with multiple numeric attributes, not all the attributes are relevant or show dependencies with the graph structure. Considering both graph structure and all given attributes, one cannot measure a clear deviation of objects. This is because the existence of irrelevant attributes clearly hinders the detection of outliers. Thus, one has to select local outlier contexts including only those attributes showing a high contrast between regular and deviating objects. It is an open challenge to detect meaningful local contexts for each node in attributed graphs.\u0000 In this work, we propose a novel local outlier ranking model for graphs with multiple numeric node attributes. For each object, our technique determines its subgraph and its statistically relevant subset of attributes locally. This context selection enables a high contrast between an outlier and the regular objects. Out of this context, we compute the outlierness score by incorporating both the attribute value deviation and the graph structure. In our evaluation on real and synthetic data, we show that our approach is able to detect contextual outliers that are missed by other outlier models.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"4 1","pages":"16:1-16:12"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84587883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited: 42
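The local-context idea in the abstract above — per node, keep only the attributes that show high contrast in its neighborhood, then score deviation on that subset — can be sketched as below. This is a minimal illustration under assumed simplifications: the relevance test (low spread among neighbors) and the z-score-style deviation are stand-ins, not the paper's actual statistical selection.

```python
import statistics

def relevant_attributes(neighbor_vals, max_spread=1.0):
    # Assumed relevance test: keep attributes whose values are
    # concentrated among the neighbors (low standard deviation),
    # since only those offer contrast against a deviating node.
    return [a for a, vals in neighbor_vals.items()
            if statistics.pstdev(vals) <= max_spread]

def outlierness(node_attrs, neighbor_vals):
    # Score = mean absolute z-like deviation of the node from its
    # local context, restricted to the selected attribute subset.
    selected = relevant_attributes(neighbor_vals)
    if not selected:
        return 0.0
    devs = []
    for a in selected:
        vals = neighbor_vals[a]
        mu = statistics.mean(vals)
        sigma = statistics.pstdev(vals) or 1.0
        devs.append(abs(node_attrs[a] - mu) / sigma)
    return sum(devs) / len(devs)
```

A node whose selected attributes sit far from its neighbors' values receives a high score; attributes that vary wildly in the neighborhood are excluded from the context and cannot mask the outlier.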
Journal
Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management