
Scientific and statistical database management: International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management: Latest Publications

Exploring subspace clustering for recommendations
Katharina Rausch, Eirini Ntoutsi, K. Stefanidis, H. Kriegel
Typically, recommendations are computed by considering users similar to the user in question. However, scanning the whole database of users to locate similar users is expensive. Existing approaches build user profiles by employing full-dimensional clustering to find sets of similar users. As the datasets we deal with are high-dimensional and incomplete, full-dimensional clustering is not the best option. To this end, we explore a fault-tolerant subspace clustering approach that detects clusters of similar users in subspaces of the original feature space and also allows for missing values. Our experiments on real movie datasets show that diversifying the similar users through subspace clustering results in better recommendations compared with traditional collaborative filtering and full-dimensional clustering approaches.
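The key ingredient of clustering incomplete rating data is a distance that tolerates missing values by comparing users only on jointly observed dimensions. A minimal sketch of that idea (the function name and rescaling choice are illustrative, not the authors' implementation):

```python
import numpy as np

def masked_distance(u, v):
    """Distance over the dimensions both users have rated.

    Missing ratings are NaN; only jointly observed dimensions
    contribute, rescaled to the full dimensionality so pairs with
    little overlap are not unfairly favored.
    """
    mask = ~np.isnan(u) & ~np.isnan(v)
    if not mask.any():
        return np.inf  # no overlap: treat as maximally dissimilar
    d = u[mask] - v[mask]
    return np.sqrt(np.sum(d * d) * len(u) / mask.sum())

# Ratings matrix: rows are users, NaN marks an unrated movie.
ratings = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [5.0, np.nan, 2.0, 1.0],
    [1.0, 1.0, 5.0, np.nan],
])

# Users 0 and 1 agree on every movie they both rated, so they are
# closer to each other than to the dissimilar user 2.
assert masked_distance(ratings[0], ratings[1]) < masked_distance(ratings[0], ratings[2])
```

A subspace clustering over such a distance can then group users on the dimensions where they actually overlap, rather than discarding incomplete profiles.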
{"title":"Exploring subspace clustering for recommendations","authors":"Katharina Rausch, Eirini Ntoutsi, K. Stefanidis, H. Kriegel","doi":"10.1145/2618243.2618283","DOIUrl":"https://doi.org/10.1145/2618243.2618283","url":null,"abstract":"Typically, recommendations are computed by considering users similar to the user in question. However, scanning the whole database of users for locating similar users is expensive. Existing approaches build user profiles by employing full-dimensional clustering to find sets of similar users. As the datasets we deal with are high-dimensional and incomplete, full-dimensional clustering is not the best option. To this end, we explore the fault tolerance subspace clustering approach that detects clusters of similar users in subspaces of the original feature space and also allows for missing values. Our experiments on real movie datasets show that the diversification of the similar users through subspace clustering results in better recommendations comparing to traditional collaborative filtering and full dimensional clustering approaches.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"5 1","pages":"42:1-42:4"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82128493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
Maintaining a microbial genome & metagenome data analysis system in an academic setting
I. Chen, V. Markowitz, E. Szeto, Krishna Palaniappan, Ken Chu
The Integrated Microbial Genomes (IMG) system integrates microbial community aggregate genomes (metagenomes) with genomes from all domains of life. IMG provides tools for analyzing and reviewing the structural and functional annotations of metagenomes and genomes in a comparative context. At the core of the IMG system is a data warehouse that contains genome and metagenome datasets provided by scientific users, as well as public bacterial, archaeal, eukaryotic, and viral genomes from the US National Center for Biotechnology Information genomic archive and a rich set of engineered, environmental, and host-associated metagenomes. Genome and metagenome datasets are processed using IMG's microbial genome and metagenome sequence data processing pipelines and are then integrated into the data warehouse using IMG's data integration toolkit. Microbial genome and metagenome application-specific user interfaces provide access to different subsets of IMG's data and analysis toolkits. Genome and metagenome analysis is a gene-centric iterative process that involves a sequence (composition) of data exploration and comparative analysis operations, with individual operations expected to have rapid response times. From its first release in 2005, IMG has grown from an initial content of about 300 genomes with a total of 2 million genes to 22,578 bacterial, archaeal, eukaryotic, and viral genomes and 4,188 metagenome samples, with about 24.6 billion genes as of May 1st, 2014. IMG's database architecture is continuously revised in order to cope with the rapid increase in the number and size of the genome and metagenome datasets, maintain good query performance, and accommodate new data types. We present in this paper IMG's new database architecture, developed over the past three years in the context of the limited financial, engineering, and data management resources customary to academic database systems.
We discuss the alternative commercial and open source database management systems we considered and experimented with and describe the hybrid architecture we devised for sustaining IMG's rapid growth.
{"title":"Maintaining a microbial genome & metagenome data analysis system in an academic setting","authors":"I. Chen, V. Markowitz, E. Szeto, Krishna Palaniappan, Ken Chu","doi":"10.1145/2618243.2618244","DOIUrl":"https://doi.org/10.1145/2618243.2618244","url":null,"abstract":"The Integrated Microbial Genomes (IMG) system integrates microbial community aggregate genomes (metagenomes) with genomes from all domains of life. IMG provides tools for analyzing and reviewing the structural and functional annotations of metagenomes and genomes in a comparative context. At the core of the IMG system is a data warehouse that contains genome and metagenome datasets provided by scientific users, as well as public bacterial, archaeal, eukaryotic, and viral genomes from the US National Center for Biotechnology Information genomic archive and a rich set of engineered, environmental and host associated metagenomes. Genomes and metagenome datasets are processed using IMG's microbial genome and metagenome sequence data processing pipelines and then are integrated into the data warehouse using IMG's data integration toolkit. Microbial genome and metagenome application specific user interfaces provide access to different subsets of IMG's data and analysis toolkits. Genome and metagenome analysis is a gene centric iterative process that involves a sequence (composition) of data exploration and comparative analysis operations, with individual operations expected to have rapid response time.\u0000 From its first release in 2005, IMG has grown from an initial content of about 300 genomes with a total of 2 million genes, to 22,578 bacterial, archaeal, eukaryotic and viral genomes, and 4,188 metagenome samples, with about 24.6 billion genes as of May 1st, 2014. IMG's database architecture is continuously revised in order to cope with the rapid increase in the number and size of the genome and metagenome datasets, maintain good query performance, and accommodate new data types. 
We present in this paper IMG's new database architecture developed over the past three years in the context of limited financial, engineering and data management resources customary to academic database systems. We discuss the alternative commercial and open source database management systems we considered and experimented with and describe the hybrid architecture we devised for sustaining IMG's rapid growth.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"1 1","pages":"3:1-3:11"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74815845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
Data movement in hybrid analytic systems: a case for automation
Patrick Leyshock, D. Maier, K. Tufte
Hybrid data analysis systems integrate an analytic tool and a data management tool. While hybrid systems have benefits, data movement between the two components must be minimized for them to be effective. Through experimental results we demonstrate that under workloads whose inputs vary in size, shape, and location, automation is the only practical way to manage data movement in hybrid systems.
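The case for automation rests on a placement decision that depends on the workload: for each operator, move the data to the computation or the computation to the data, whichever crosses the process boundary with fewer bytes. A toy cost rule in that spirit (an illustrative sketch, not the paper's algorithm; names and parameters are hypothetical):

```python
def choose_site(input_bytes, result_bytes, op_supported_in_db):
    """Decide where to evaluate an operator in a hybrid analytic system.

    Illustrative cost rule: run the operator on the side that makes the
    smaller payload cross the boundary between the database and the
    analytic tool.
    """
    if not op_supported_in_db:
        return "analytic-tool"          # must ship the input out
    # Run where less data has to cross the boundary.
    return "database" if result_bytes < input_bytes else "analytic-tool"

# A large aggregation shrinks its input: evaluate it in the database.
assert choose_site(10**9, 10**3, True) == "database"
# A tool-only operator forces the input to move regardless of size.
assert choose_site(10**9, 10**3, False) == "analytic-tool"
```

Because input sizes, shapes, and locations vary per query, hand-tuning such decisions does not scale, which is the argument for automating them.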
{"title":"Data movement in hybrid analytic systems: a case for automation","authors":"Patrick Leyshock, D. Maier, K. Tufte","doi":"10.1145/2618243.2618273","DOIUrl":"https://doi.org/10.1145/2618243.2618273","url":null,"abstract":"Hybrid data analysis systems integrate an analytic tool and a data management tool. While hybrid systems have benefits, in order to be effective data movement between the two hybrid components must be minimized. Through experimental results we demonstrate that under workloads whose inputs vary in size, shape, and location, automation is the only practical way to manage data movement in hybrid systems.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"192 1","pages":"39:1-39:4"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76567383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
A subspace filter supporting the discovery of small clusters in very noisy datasets
F. Höppner
Feature selection becomes crucial when exploring high-dimensional datasets via clustering, because the data is unlikely to group jointly in all dimensions, yet clustering algorithms treat all attributes equally. A new subspace filter approach is presented that copes with the difficult situation of finding small clusters embedded in a very noisy environment (more noise than clustered data) and is not misled by dense, high-dimensional spots caused by density fluctuations of single attributes. Experimental evaluation on artificial and real datasets demonstrates good performance and high efficiency.
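The distinction the filter must draw is between a region that is dense because several attributes agree jointly and one that is dense only because a single attribute fluctuates. One way to see the difference is to compare the observed count in a subspace box against the count expected if the attributes were independent noise (an illustrative statistic, not the paper's filter; all names are assumptions):

```python
import numpy as np

def subspace_excess(data, dims, center, width):
    """Ratio of observed to expected point count in a subspace box.

    Under attribute-independent noise, the expected count in a box is
    n times the product of the per-attribute marginal fractions; a
    ratio well above 1 signals a genuine multi-attribute cluster,
    while a spot produced by one attribute's density fluctuation
    scores close to 1.
    """
    n = len(data)
    in_box = np.ones(n, dtype=bool)
    expected_frac = 1.0
    for d in dims:
        hit = np.abs(data[:, d] - center[d]) <= width
        in_box &= hit
        expected_frac *= hit.mean()   # marginal fraction per attribute
    expected = max(n * expected_frac, 1e-12)
    return in_box.sum() / expected

rng = np.random.default_rng(0)
noise = rng.uniform(0, 1, size=(900, 5))                # mostly noise
cluster = rng.normal(0.5, 0.02, size=(100, 5))          # small, tight cluster
data = np.vstack([noise, cluster])
center = np.full(5, 0.5)

# The joint 5-D box is far denser than independence predicts...
assert subspace_excess(data, range(5), center, 0.05) > 5
# ...while a single attribute's density alone shows no excess.
assert subspace_excess(data, [0], center, 0.05) < 1.5
```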
{"title":"A subspace filter supporting the discovery of small clusters in very noisy datasets","authors":"F. Höppner","doi":"10.1145/2618243.2618260","DOIUrl":"https://doi.org/10.1145/2618243.2618260","url":null,"abstract":"Feature selection becomes crucial when exploring high-dimensional datasets via clustering, because it is unlikely that the data groups jointly in all dimensions but clustering algorithms treat all attributes equally. A new subspace filter approach is presented that is capable of coping with the difficult situation of finding small clusters embedded in a very noisy environment (more noise than clustering data), which is not mislead by dense, high-dimensional spots caused by density fluctuations of single attributes. Experimental evaluation on artificial and real datasets demonstrate good performance and high efficiency.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"299 1","pages":"14:1-14:12"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75434661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Matching dominance: capture the semantics of dominance for multi-dimensional uncertain objects
Ying Zhang, W. Zhang, Xuemin Lin, M. A. Cheema, Chengqi Zhang
The dominance operator plays an important role in a wide spectrum of multi-criteria decision making applications. Generally speaking, a dominance operator is a partial order on a set O of objects, and we say the dominance operator has the monotonic property with respect to a family of ranking functions F if o1 dominates o2 implies f(o1) ≥ f(o2) for any ranking function f ∈ F and objects o1, o2 ∈ O. The dominance operator on multi-dimensional points is well defined and has the monotonic property with respect to any monotonic ranking (scoring) function. Due to the uncertain nature of data in many emerging applications, a variety of existing works have studied the semantics of ranking queries on uncertain objects. However, the problem of defining a dominance operator for multi-dimensional uncertain objects remains open. Although there have been several attempts to propose dominance operators for multi-dimensional uncertain objects, none of them establishes the monotonic property with respect to these ranking approaches. Motivated by this, in this paper we propose a novel matching-based dominance operator, namely matching dominance, to capture the semantics of dominance for multi-dimensional uncertain objects, so that the new dominance operator has the monotonic property with respect to the monotonic parameterized ranking function, which can unify other popular ranking approaches for uncertain objects. We then develop a layer indexing technique, Matching Dominance based Band (MDB), to facilitate top-k queries on multi-dimensional uncertain objects based on the matching dominance operator proposed in this paper. Efficient algorithms are proposed to compute the MDB index. Comprehensive experiments convincingly demonstrate the effectiveness and efficiency of our indexing techniques.
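The monotonic property the abstract defines is easy to see on certain (non-uncertain) points, which is the well-defined baseline case the paper generalizes from. A minimal sketch, assuming larger values are better in every dimension:

```python
def dominates(o1, o2):
    """Classical dominance on multi-dimensional points: o1 dominates o2
    if it is at least as good in every dimension (here, larger is
    better) and strictly better in at least one."""
    return (all(a >= b for a, b in zip(o1, o2))
            and any(a > b for a, b in zip(o1, o2)))

# Monotonic property: any monotonic scoring function f must agree with
# dominance, i.e. dominates(o1, o2) implies f(o1) >= f(o2).
f = lambda o: 0.7 * o[0] + 0.3 * o[1]   # one monotonic ranking function
o1, o2 = (5, 4), (3, 4)
assert dominates(o1, o2) and f(o1) >= f(o2)
```

For uncertain objects, which are distributions rather than single points, this check no longer applies directly; establishing an operator that preserves the same implication is the problem the paper addresses.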
{"title":"Matching dominance: capture the semantics of dominance for multi-dimensional uncertain objects","authors":"Ying Zhang, W. Zhang, Xuemin Lin, M. A. Cheema, Chengqi Zhang","doi":"10.1145/2618243.2618246","DOIUrl":"https://doi.org/10.1145/2618243.2618246","url":null,"abstract":"The dominance operator plays an important role in a wide spectrum of multi-criteria decision making applications. Generally speaking, a dominance operator is a <i>partial order</i> on a set O of objects, and we say the dominance operator has the monotonic property regarding a family of ranking functions F if <i>o</i><sub>1</sub> <i>dominates</i> <i>o</i><sub>2</sub> implies <i>f</i>(<i>o</i><sub>1</sub>) ≥ <i>f</i>(<i>o</i><sub>2</sub>) for any ranking function <i>f</i> ∈ F and objects <i>o</i><sub>1</sub>, <i>o</i><sub>2</sub> ∈ O. The dominance operator on the multi-dimensional points is well defined, which has the monotonic property regarding any monotonic ranking (scoring) function. Due to the uncertain nature of data in many emerging applications, a variety of existing works have studied the semantics of ranking query on uncertain objects. However, the problem of dominance operator against multi-dimensional uncertain objects remains open. Although there are several attempts to propose dominance operator on multi-dimensional uncertain objects, none of them claims the monotonic property on these ranking approaches.\u0000 Motivated by this, in this paper we propose a novel <i>matching</i> based <i>dominance</i> operator, namely <b>matching dominance</b>, to capture the semantics of the dominance for multi-dimensional uncertain objects so that the new dominance operator has the monotonic property regarding the monotonic <i>parameterized ranking</i> function, which can unify other popular ranking approaches for uncertain objects. 
Then we develop a layer indexing technique, Matching Dominance based Band (<b>MDB</b>), to facilitate the top <i>k</i> queries on multi-dimensional uncertain objects based on the <i>matching dominance</i> operator proposed in this paper. Efficient algorithms are proposed to compute the MDB index. Comprehensive experiments convincingly demonstrate the effectiveness and efficiency of our indexing techniques.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"11 1","pages":"18:1-18:12"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78363261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Efficient data management and statistics with zero-copy integration
Jonathan Lajus, H. Mühleisen
Statistical analysts have long been struggling with ever-growing data volumes. While specialized data management systems such as relational databases are able to handle the data, statistical analysis tools are far more convenient for expressing complex data analyses. An integration of these two classes of systems has the potential to overcome the data management issue while keeping analysis convenient. However, one must keep a careful eye on implementation overheads such as serialization. In this paper, we propose the in-process integration of data management and analytical tools. Furthermore, we argue that a zero-copy integration is feasible due to the omnipresence of C-style arrays containing native types. We discuss the general concept and present a prototype of this integration based on the columnar relational database MonetDB and the R environment for statistical computing. We evaluate the performance of this prototype in a series of micro-benchmarks of common data management tasks.
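The zero-copy argument is that a C-style array of native values can be handed between the database and the analysis environment as a shared view rather than a serialized copy. The same principle can be demonstrated with ctypes and NumPy (illustrative of the idea only, not MonetDB's or R's actual mechanism):

```python
import ctypes
import numpy as np

# A C-style double array, as a columnar engine might expose a column.
n = 4
raw = (ctypes.c_double * n)(1.0, 2.0, 3.0, 4.0)

# Wrap it as a NumPy vector without copying: both sides share memory.
col = np.frombuffer(raw, dtype=np.float64)
assert col.base is not None          # a view, not an owning copy

raw[0] = 42.0                        # mutate through the C array...
assert col[0] == 42.0                # ...the NumPy view sees it instantly
```

No serialization happens at the boundary; the cost of exchanging a column is constant regardless of its length, which is what makes the in-process hybrid attractive.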
{"title":"Efficient data management and statistics with zero-copy integration","authors":"Jonathan Lajus, H. Mühleisen","doi":"10.1145/2618243.2618265","DOIUrl":"https://doi.org/10.1145/2618243.2618265","url":null,"abstract":"Statistical analysts have long been struggling with evergrowing data volumes. While specialized data management systems such as relational databases would be able to handle the data, statistical analysis tools are far more convenient to express complex data analyses. An integration of these two classes of systems has the potential to overcome the data management issue while at the same time keeping analysis convenient. However, one must keep a careful eye on implementation overheads such as serialization. In this paper, we propose the in-process integration of data management and analytical tools. Furthermore, we argue that a zero-copy integration is feasible due to the omnipresence of C-style arrays containing native types. We discuss the general concept and present a prototype of this integration based on the columnar relational database MonetDB and the R environment for statistical computing. We evaluate the performance of this prototype in a series of micro-benchmarks of common data management tasks.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"27 1","pages":"12:1-12:10"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73999681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
Geometric graph matching and similarity: a probabilistic approach
Ayser Armiti, Michael Gertz
Finding common structures is vital for many graph-based applications, such as road network analysis, pattern recognition, or drug discovery. Such a task is formalized as the inexact graph matching problem, which is known to be NP-hard. Several graph matching algorithms have been proposed to find approximate solutions. However, such algorithms still face many problems in terms of memory consumption, runtime, and tolerance to changes in graph structure or labels. In this paper, we propose a solution to the inexact graph matching problem for geometric graphs in 2D space. Geometric graphs provide a suitable modeling framework for applications like the above, where vertices are located in some 2D space. The main idea of our approach is to formalize the graph matching problem in a maximum likelihood estimation framework. Then, the expectation maximization technique is used to estimate the match between two graphs. We propose a novel density function that estimates the similarity between the vertices of different graphs. It is computed based on both 1) the spatial properties of a vertex and its direct neighbors, and 2) the shortest paths that connect a vertex to other vertices in a graph. To guarantee scalability, we propose to compute the density function based on the properties of sub-structures of the graph. Using representative geometric graphs from several application domains, we show that our approach outperforms existing graph matching algorithms in terms of matching quality, runtime, and memory consumption.
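In the maximum-likelihood framing, each vertex of one graph gets a probability distribution over possible correspondents in the other, refined by expectation maximization. An E-step-flavored sketch using only the spatial part of the density (the paper's density additionally uses direct neighbors and shortest paths; function names are illustrative):

```python
import numpy as np

def match_probabilities(pos_a, pos_b, sigma=1.0):
    """Soft correspondence between two geometric graphs in 2-D space.

    Each vertex of graph A gets a probability distribution over the
    vertices of graph B, from a Gaussian density on the distance
    between vertex positions, as in the E-step of an EM matcher.
    """
    # Squared distances between every pair of vertices (|A| x |B|).
    diff = pos_a[:, None, :] - pos_b[None, :, :]
    logp = -np.sum(diff**2, axis=2) / (2 * sigma**2)
    # Normalize per A-vertex (log-sum-exp trick for stability).
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)

a = np.array([[0.0, 0.0], [5.0, 5.0]])
b = np.array([[0.1, 0.0], [5.0, 4.9], [9.0, 9.0]])
P = match_probabilities(a, b)
# Each A-vertex is most likely matched to its spatially nearest B-vertex.
assert P[0].argmax() == 0 and P[1].argmax() == 1
```

An M-step would then re-estimate the transformation or density parameters from these responsibilities and iterate until the match stabilizes.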
{"title":"Geometric graph matching and similarity: a probabilistic approach","authors":"Ayser Armiti, Michael Gertz","doi":"10.1145/2618243.2618259","DOIUrl":"https://doi.org/10.1145/2618243.2618259","url":null,"abstract":"Finding common structures is vital for many graph-based applications, such as road network analysis, pattern recognition, or drug discovery. Such a task is formalized as the inexact graph matching problem, which is known to be NP-hard. Several graph matching algorithms have been proposed to find approximate solutions. However, such algorithms still face many problems in terms of memory consumption, runtime, and tolerance to changes in graph structure or labels.\u0000 In this paper, we propose a solution to the inexact graph matching problem for geometric graphs in 2D space. Geometric graphs provide a suitable modeling framework for applications like the above, where vertices are located in some 2D space. The main idea of our approach is to formalize the graph matching problem in a maximum likelihood estimation framework. Then, the expectation maximization technique is used to estimate the match between two graphs. We propose a novel density function that estimates the similarity between the vertices of different graphs. It is computed based on both 1) the spatial properties of a vertex and its direct neighbors, and 2) the shortest paths that connect a vertex to other vertices in a graph. To guarantee scalability, we propose to compute the density function based on the properties of sub-structures of the graph. Using representative geometric graphs from several application domains, we show that our approach outperforms existing graph matching algorithms in terms of matching quality, runtime, and memory consumption.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. 
International Conference on Scientific and Statistical Database Management","volume":"55 1","pages":"27:1-27:12"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83777622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
Node classification in uncertain graphs
Michele Dallachiesa, C. Aggarwal, Themis Palpanas
In many real applications that use and analyze networked data, the links in the network graph may be erroneous, or derived from probabilistic techniques. In such cases, the node classification problem can be challenging, since the unreliability of the links may affect the final results of the classification process. In this paper, we focus on situations that require the analysis of the uncertainty that is present in the graph structure. We study the novel problem of node classification in uncertain graphs, by treating uncertainty as a first-class citizen. We propose two techniques based on a Bayes model, and show the benefits of incorporating uncertainty in the classification process as a first-class citizen. The experimental results demonstrate the effectiveness of our approaches.
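Treating uncertainty as a first-class citizen means an unreliable link should sway the classification less than a reliable one. A minimal sketch of that principle, where each neighbor's vote is weighted by its edge's existence probability (an illustrative scheme under assumed names, not the paper's exact Bayes model):

```python
from collections import defaultdict

def classify(node, edges, labels, classes, prior):
    """Expected-vote classifier for a node in an uncertain graph.

    Each edge (u, v, p) exists with probability p; a neighbor's label
    contributes p instead of a hard count, so unreliable links
    influence the decision proportionally less.
    """
    score = defaultdict(float)
    for c in classes:
        score[c] = prior[c]
    for u, v, p in edges:
        if node in (u, v):
            other = v if u == node else u
            if other in labels:
                score[labels[other]] += p   # probability-weighted vote
    return max(classes, key=lambda c: score[c])

edges = [("x", "a", 0.9), ("x", "b", 0.2), ("x", "c", 0.3)]
labels = {"a": "spam", "b": "ham", "c": "ham"}
prior = {"spam": 0.0, "ham": 0.0}

# One reliable spam neighbor outweighs two unreliable ham neighbors,
# even though ham neighbors are the majority by hard count.
assert classify("x", edges, labels, ["spam", "ham"], prior) == "spam"
```

A hard-count majority vote on the same graph would answer "ham", which is exactly the failure mode that motivates modeling the uncertainty explicitly.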
{"title":"Node classification in uncertain graphs","authors":"Michele Dallachiesa, C. Aggarwal, Themis Palpanas","doi":"10.1145/2618243.2618277","DOIUrl":"https://doi.org/10.1145/2618243.2618277","url":null,"abstract":"In many real applications that use and analyze networked data, the links in the network graph may be erroneous, or derived from probabilistic techniques. In such cases, the node classification problem can be challenging, since the unreliability of the links may affect the final results of the classification process. In this paper, we focus on situations that require the analysis of the uncertainty that is present in the graph structure. We study the novel problem of node classification in uncertain graphs, by treating uncertainty as a first-class citizen. We propose two techniques based on a Bayes model, and show the benefits of incorporating uncertainty in the classification process as a first-class citizen. The experimental results demonstrate the effectiveness of our approaches.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"82 1","pages":"32:1-32:4"},"PeriodicalIF":0.0,"publicationDate":"2014-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85597296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
Tuning large scale deduplication with reduced effort
Guilherme Dal Bianco, R. Galante, C. Heuser, Marcos André Gonçalves
Deduplication is the task of identifying which objects in a data repository potentially refer to the same entity. It usually demands user intervention in several steps of the process, mainly to identify pairs representing matches and non-matches. This information is then used to help identify other potentially duplicated records. When deduplication is applied to very large datasets, performance and matching quality depend on expert users configuring the most important steps of the process (e.g., blocking and classification). In this paper, we propose a new framework, called FS-Dedup, that helps tune the deduplication process on large datasets with reduced user effort: the user is only required to label a small, automatically selected subset of pairs. FS-Dedup exploits Signature-Based Deduplication (Sig-Dedup) algorithms in its deduplication core. Sig-Dedup is characterized by high efficiency and scalability on large datasets but requires an expert user to tune several parameters. FS-Dedup helps overcome this drawback by providing a framework that demands no specialized user knowledge about the dataset or thresholds to produce high effectiveness. Our evaluation over large real and synthetic datasets (containing millions of records) shows that FS-Dedup is able to reach or even surpass the maximal matching quality obtained by Sig-Dedup techniques with reduced manual effort from the user.
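The efficiency of signature-based deduplication comes from blocking: records sharing a token signature land in the same block, so only within-block pairs are compared instead of all n² pairs. A minimal sketch of the idea (a prefix signature is one common choice; the framework additionally tunes its thresholds from the few labeled pairs, which is not shown here):

```python
from itertools import combinations

def signature_blocks(records, sig_size=2):
    """Group record ids by a token-prefix signature; records that share
    a signature become candidate duplicates."""
    blocks = {}
    for rid, text in records.items():
        tokens = sorted(set(text.lower().split()))
        sig = tuple(tokens[:sig_size])       # prefix signature
        blocks.setdefault(sig, set()).add(rid)
    return blocks

def candidate_pairs(blocks):
    """Only pairs within the same block are ever compared in detail."""
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

records = {
    1: "acme corp ltd",
    2: "acme corp limited",
    3: "zebra industries",
}
pairs = candidate_pairs(signature_blocks(records))
# The two 'acme corp' variants are compared; the unrelated record is not.
assert pairs == {(1, 2)}
```

With millions of records the pruning is what keeps the workload tractable; the quality of the result then hinges on choosing signature and similarity thresholds well, which is the tuning burden FS-Dedup aims to lift from the user.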
DOI: 10.1145/2484838.2484873 · Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management, vol. 32(1), pp. 18:1-18:12, published 2013-07-29.
Citations: 9
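The signature-based blocking at the heart of Sig-Dedup can be illustrated with a minimal sketch. Everything below — the record strings, the fixed two-token alphabetical prefix, and the Jaccard threshold — is a simplification for illustration, not the paper's implementation: real Sig-Dedup orders tokens by global rarity and derives the prefix size from the similarity threshold.

```python
from itertools import combinations

def tokens(record):
    """Lowercase word tokens of a record string."""
    return set(record.lower().split())

def signature(record, prefix_size=2):
    """Prefix signature: the first tokens of the record in sorted order.

    Simplified: a real system orders tokens by global rarity and sizes
    the prefix from the similarity threshold.
    """
    return sorted(tokens(record))[:prefix_size]

def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

def dedup_candidates(records, threshold=0.6):
    """Block records by shared signature tokens, then verify each
    candidate pair with Jaccard similarity."""
    blocks = {}
    for i, rec in enumerate(records):
        for sig in signature(rec):
            blocks.setdefault(sig, set()).add(i)
    seen, matches = set(), []
    for ids in blocks.values():
        for i, j in combinations(sorted(ids), 2):
            if (i, j) in seen:
                continue  # pair already verified via another block
            seen.add((i, j))
            if jaccard(tokens(records[i]), tokens(records[j])) >= threshold:
                matches.append((i, j))
    return sorted(matches)

records = [
    "tuning large scale deduplication",
    "tuning large scale deduplication effort",
    "learning to explore workflow repositories",
]
print(dedup_candidates(records))  # [(0, 1)]
```

The point of the prefix filter is that two records sharing no signature token cannot reach the threshold, so only pairs co-occurring in some block are ever compared — this is what makes the approach scale to millions of records.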
Learning to explore scientific workflow repositories
Julia Stoyanovich, Paramveer S. Dhillon, S. Davidson, Brian Lyons
Scientific workflows are gaining popularity, and repositories of workflows are starting to emerge. In this paper we describe TopicsExplorer, a data exploration approach for myExperiment.org, a collaborative platform for the exchange of scientific workflows and experimental plans. Our approach uses a variant of topic modeling with tags as features, and generates a browsable view of the repository. TopicsExplorer has been fully integrated into the open-source platform of myExperiment.org, and is available to users at www.myexperiment.org/topics. We also present our recently developed personalization component that customizes topics based on user feedback. Finally, we discuss our ongoing performance optimization efforts that make computing and managing personalized topic views of the myExperiment.org repository feasible.
DOI: 10.1145/2484838.2484848 · Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management, vol. 27(1), pp. 31:1-31:4, published 2013-07-29.
Citations: 0
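As a rough illustration of what tag-driven topic grouping buys — not TopicsExplorer's actual model, which applies a topic-modeling variant with tags as features — the sketch below anchors each "topic" on a workflow's most globally frequent tag to produce a browsable grouping. The workflow names and tags are invented for the example.

```python
from collections import Counter, defaultdict

def tag_topics(workflows):
    """Crude stand-in for tag-based topic modeling: each 'topic' is
    anchored by a tag, and every workflow is assigned to the topic of
    its most globally frequent tag (ties broken alphabetically)."""
    freq = Counter(t for tags in workflows.values() for t in tags)
    topics = defaultdict(list)
    for name, tags in workflows.items():
        anchor = max(tags, key=lambda t: (freq[t], t))
        topics[anchor].append(name)
    return dict(topics)

workflows = {
    "blast-pipeline": ["bioinformatics", "sequence", "blast"],
    "protein-fold":   ["bioinformatics", "structure"],
    "tweet-miner":    ["text-mining", "social"],
}
print(tag_topics(workflows))
# {'bioinformatics': ['blast-pipeline', 'protein-fold'], 'text-mining': ['tweet-miner']}
```

A real topic model (e.g., LDA over tag occurrences) would let a workflow belong to several topics with weights, which is what makes user-feedback-driven personalization of the topic view possible.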