Scientific and statistical database management : International Conference, SSDBM ... : proceedings. Latest Publications
Exploring subspace clustering for recommendations
Katharina Rausch, Eirini Ntoutsi, K. Stefanidis, H. Kriegel
Typically, recommendations are computed by considering users similar to the user in question. However, scanning the whole user database to locate similar users is expensive. Existing approaches build user profiles by employing full-dimensional clustering to find sets of similar users. As the datasets we deal with are high-dimensional and incomplete, full-dimensional clustering is not the best option. To this end, we explore a fault-tolerant subspace clustering approach that detects clusters of similar users in subspaces of the original feature space and also allows for missing values. Our experiments on real movie datasets show that diversifying the similar users through subspace clustering results in better recommendations compared to traditional collaborative filtering and full-dimensional clustering approaches.
{"title":"Exploring subspace clustering for recommendations","authors":"Katharina Rausch, Eirini Ntoutsi, K. Stefanidis, H. Kriegel","doi":"10.1145/2618243.2618283","DOIUrl":"https://doi.org/10.1145/2618243.2618283","url":null,"abstract":"Typically, recommendations are computed by considering users similar to the user in question. However, scanning the whole database of users for locating similar users is expensive. Existing approaches build user profiles by employing full-dimensional clustering to find sets of similar users. As the datasets we deal with are high-dimensional and incomplete, full-dimensional clustering is not the best option. To this end, we explore the fault tolerance subspace clustering approach that detects clusters of similar users in subspaces of the original feature space and also allows for missing values. Our experiments on real movie datasets show that the diversification of the similar users through subspace clustering results in better recommendations comparing to traditional collaborative filtering and full dimensional clustering approaches.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"5 1","pages":"42:1-42:4"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82128493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Maintaining a microbial genome & metagenome data analysis system in an academic setting
I. Chen, V. Markowitz, E. Szeto, Krishna Palaniappan, Ken Chu
The Integrated Microbial Genomes (IMG) system integrates microbial community aggregate genomes (metagenomes) with genomes from all domains of life. IMG provides tools for analyzing and reviewing the structural and functional annotations of metagenomes and genomes in a comparative context. At the core of the IMG system is a data warehouse that contains genome and metagenome datasets provided by scientific users, as well as public bacterial, archaeal, eukaryotic, and viral genomes from the US National Center for Biotechnology Information genomic archive and a rich set of engineered, environmental, and host-associated metagenomes. Genome and metagenome datasets are processed using IMG's microbial genome and metagenome sequence data processing pipelines and are then integrated into the data warehouse using IMG's data integration toolkit. Application-specific user interfaces for microbial genomes and metagenomes provide access to different subsets of IMG's data and analysis toolkits. Genome and metagenome analysis is a gene-centric iterative process that involves a sequence (composition) of data exploration and comparative analysis operations, with individual operations expected to have rapid response times.

From its first release in 2005, IMG has grown from an initial content of about 300 genomes with a total of 2 million genes to 22,578 bacterial, archaeal, eukaryotic, and viral genomes and 4,188 metagenome samples, with about 24.6 billion genes, as of May 1, 2014. IMG's database architecture is continuously revised in order to cope with the rapid increase in the number and size of genome and metagenome datasets, maintain good query performance, and accommodate new data types. We present in this paper IMG's new database architecture, developed over the past three years in the context of the limited financial, engineering, and data management resources customary to academic database systems. We discuss the alternative commercial and open-source database management systems we considered and experimented with, and describe the hybrid architecture we devised to sustain IMG's rapid growth.
{"title":"Maintaining a microbial genome & metagenome data analysis system in an academic setting","authors":"I. Chen, V. Markowitz, E. Szeto, Krishna Palaniappan, Ken Chu","doi":"10.1145/2618243.2618244","DOIUrl":"https://doi.org/10.1145/2618243.2618244","url":null,"abstract":"The Integrated Microbial Genomes (IMG) system integrates microbial community aggregate genomes (metagenomes) with genomes from all domains of life. IMG provides tools for analyzing and reviewing the structural and functional annotations of metagenomes and genomes in a comparative context. At the core of the IMG system is a data warehouse that contains genome and metagenome datasets provided by scientific users, as well as public bacterial, archaeal, eukaryotic, and viral genomes from the US National Center for Biotechnology Information genomic archive and a rich set of engineered, environmental and host associated metagenomes. Genomes and metagenome datasets are processed using IMG's microbial genome and metagenome sequence data processing pipelines and then are integrated into the data warehouse using IMG's data integration toolkit. Microbial genome and metagenome application specific user interfaces provide access to different subsets of IMG's data and analysis toolkits. Genome and metagenome analysis is a gene centric iterative process that involves a sequence (composition) of data exploration and comparative analysis operations, with individual operations expected to have rapid response time.\u0000 From its first release in 2005, IMG has grown from an initial content of about 300 genomes with a total of 2 million genes, to 22,578 bacterial, archaeal, eukaryotic and viral genomes, and 4,188 metagenome samples, with about 24.6 billion genes as of May 1st, 2014. IMG's database architecture is continuously revised in order to cope with the rapid increase in the number and size of the genome and metagenome datasets, maintain good query performance, and accommodate new data types. We present in this paper IMG's new database architecture developed over the past three years in the context of limited financial, engineering and data management resources customary to academic database systems. We discuss the alternative commercial and open source database management systems we considered and experimented with and describe the hybrid architecture we devised for sustaining IMG's rapid growth.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"1 1","pages":"3:1-3:11"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74815845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data movement in hybrid analytic systems: a case for automation
Patrick Leyshock, D. Maier, K. Tufte
Hybrid data analysis systems integrate an analytic tool with a data management tool. While hybrid systems have benefits, to be effective they must minimize data movement between the two components. Through experimental results we demonstrate that under workloads whose inputs vary in size, shape, and location, automation is the only practical way to manage data movement in hybrid systems.
{"title":"Data movement in hybrid analytic systems: a case for automation","authors":"Patrick Leyshock, D. Maier, K. Tufte","doi":"10.1145/2618243.2618273","DOIUrl":"https://doi.org/10.1145/2618243.2618273","url":null,"abstract":"Hybrid data analysis systems integrate an analytic tool and a data management tool. While hybrid systems have benefits, in order to be effective data movement between the two hybrid components must be minimized. Through experimental results we demonstrate that under workloads whose inputs vary in size, shape, and location, automation is the only practical way to manage data movement in hybrid systems.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"192 1","pages":"39:1-39:4"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76567383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A subspace filter supporting the discovery of small clusters in very noisy datasets
F. Höppner
Feature selection becomes crucial when exploring high-dimensional datasets via clustering, because it is unlikely that the data groups jointly in all dimensions, yet clustering algorithms treat all attributes equally. We present a new subspace filter approach that is capable of coping with the difficult situation of finding small clusters embedded in a very noisy environment (more noise than clustering data) and that is not misled by dense, high-dimensional spots caused by density fluctuations of single attributes. Experimental evaluation on artificial and real datasets demonstrates good performance and high efficiency.
{"title":"A subspace filter supporting the discovery of small clusters in very noisy datasets","authors":"F. Höppner","doi":"10.1145/2618243.2618260","DOIUrl":"https://doi.org/10.1145/2618243.2618260","url":null,"abstract":"Feature selection becomes crucial when exploring high-dimensional datasets via clustering, because it is unlikely that the data groups jointly in all dimensions but clustering algorithms treat all attributes equally. A new subspace filter approach is presented that is capable of coping with the difficult situation of finding small clusters embedded in a very noisy environment (more noise than clustering data), which is not mislead by dense, high-dimensional spots caused by density fluctuations of single attributes. Experimental evaluation on artificial and real datasets demonstrate good performance and high efficiency.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"299 1","pages":"14:1-14:12"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75434661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Matching dominance: capture the semantics of dominance for multi-dimensional uncertain objects
Ying Zhang, W. Zhang, Xuemin Lin, M. A. Cheema, Chengqi Zhang
The dominance operator plays an important role in a wide spectrum of multi-criteria decision making applications. Generally speaking, a dominance operator is a partial order on a set O of objects, and we say the dominance operator has the monotonic property regarding a family of ranking functions F if o1 dominates o2 implies f(o1) ≥ f(o2) for any ranking function f ∈ F and objects o1, o2 ∈ O. The dominance operator on multi-dimensional points is well defined and has the monotonic property regarding any monotonic ranking (scoring) function. Due to the uncertain nature of data in many emerging applications, a variety of existing works have studied the semantics of ranking queries on uncertain objects. However, the problem of a dominance operator over multi-dimensional uncertain objects remains open. Although there have been several attempts to propose dominance operators on multi-dimensional uncertain objects, none of them achieves the monotonic property with respect to these ranking approaches.

Motivated by this, in this paper we propose a novel matching-based dominance operator, namely matching dominance, to capture the semantics of dominance for multi-dimensional uncertain objects, such that the new operator has the monotonic property regarding the monotonic parameterized ranking function, which can unify other popular ranking approaches for uncertain objects. We then develop a layer indexing technique, the Matching Dominance based Band (MDB), to facilitate top-k queries on multi-dimensional uncertain objects based on the matching dominance operator. Efficient algorithms are proposed to compute the MDB index. Comprehensive experiments convincingly demonstrate the effectiveness and efficiency of our indexing techniques.
{"title":"Matching dominance: capture the semantics of dominance for multi-dimensional uncertain objects","authors":"Ying Zhang, W. Zhang, Xuemin Lin, M. A. Cheema, Chengqi Zhang","doi":"10.1145/2618243.2618246","DOIUrl":"https://doi.org/10.1145/2618243.2618246","url":null,"abstract":"The dominance operator plays an important role in a wide spectrum of multi-criteria decision making applications. Generally speaking, a dominance operator is a <i>partial order</i> on a set O of objects, and we say the dominance operator has the monotonic property regarding a family of ranking functions F if <i>o</i><sub>1</sub> <i>dominates</i> <i>o</i><sub>2</sub> implies <i>f</i>(<i>o</i><sub>1</sub>) ≥ <i>f</i>(<i>o</i><sub>2</sub>) for any ranking function <i>f</i> ∈ F and objects <i>o</i><sub>1</sub>, <i>o</i><sub>2</sub> ∈ O. The dominance operator on the multi-dimensional points is well defined, which has the monotonic property regarding any monotonic ranking (scoring) function. Due to the uncertain nature of data in many emerging applications, a variety of existing works have studied the semantics of ranking query on uncertain objects. However, the problem of dominance operator against multi-dimensional uncertain objects remains open. Although there are several attempts to propose dominance operator on multi-dimensional uncertain objects, none of them claims the monotonic property on these ranking approaches.\u0000 Motivated by this, in this paper we propose a novel <i>matching</i> based <i>dominance</i> operator, namely <b>matching dominance</b>, to capture the semantics of the dominance for multi-dimensional uncertain objects so that the new dominance operator has the monotonic property regarding the monotonic <i>parameterized ranking</i> function, which can unify other popular ranking approaches for uncertain objects. Then we develop a layer indexing technique, Matching Dominance based Band (<b>MDB</b>), to facilitate the top <i>k</i> queries on multi-dimensional uncertain objects based on the <i>matching dominance</i> operator proposed in this paper. Efficient algorithms are proposed to compute the MDB index. Comprehensive experiments convincingly demonstrate the effectiveness and efficiency of our indexing techniques.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"11 1","pages":"18:1-18:12"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78363261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient data management and statistics with zero-copy integration
Jonathan Lajus, H. Mühleisen
Statistical analysts have long been struggling with ever-growing data volumes. While specialized data management systems such as relational databases would be able to handle the data, statistical analysis tools are far more convenient for expressing complex data analyses. An integration of these two classes of systems has the potential to overcome the data management issue while keeping analysis convenient. However, one must keep a careful eye on implementation overheads such as serialization. In this paper, we propose the in-process integration of data management and analytical tools. Furthermore, we argue that a zero-copy integration is feasible due to the omnipresence of C-style arrays containing native types. We discuss the general concept and present a prototype of this integration based on the columnar relational database MonetDB and the R environment for statistical computing. We evaluate the performance of this prototype in a series of micro-benchmarks of common data management tasks.
{"title":"Efficient data management and statistics with zero-copy integration","authors":"Jonathan Lajus, H. Mühleisen","doi":"10.1145/2618243.2618265","DOIUrl":"https://doi.org/10.1145/2618243.2618265","url":null,"abstract":"Statistical analysts have long been struggling with evergrowing data volumes. While specialized data management systems such as relational databases would be able to handle the data, statistical analysis tools are far more convenient to express complex data analyses. An integration of these two classes of systems has the potential to overcome the data management issue while at the same time keeping analysis convenient. However, one must keep a careful eye on implementation overheads such as serialization. In this paper, we propose the in-process integration of data management and analytical tools. Furthermore, we argue that a zero-copy integration is feasible due to the omnipresence of C-style arrays containing native types. We discuss the general concept and present a prototype of this integration based on the columnar relational database MonetDB and the R environment for statistical computing. We evaluate the performance of this prototype in a series of micro-benchmarks of common data management tasks.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"27 1","pages":"12:1-12:10"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73999681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Geometric graph matching and similarity: a probabilistic approach
Ayser Armiti, Michael Gertz
Finding common structures is vital for many graph-based applications, such as road network analysis, pattern recognition, or drug discovery. Such a task is formalized as the inexact graph matching problem, which is known to be NP-hard. Several graph matching algorithms have been proposed to find approximate solutions. However, such algorithms still face many problems in terms of memory consumption, runtime, and tolerance to changes in graph structure or labels.

In this paper, we propose a solution to the inexact graph matching problem for geometric graphs in 2D space. Geometric graphs provide a suitable modeling framework for applications like those above, where vertices are located in some 2D space. The main idea of our approach is to formalize the graph matching problem in a maximum likelihood estimation framework. Then, the expectation maximization technique is used to estimate the match between two graphs. We propose a novel density function that estimates the similarity between the vertices of different graphs. It is computed based on both 1) the spatial properties of a vertex and its direct neighbors, and 2) the shortest paths that connect a vertex to other vertices in a graph. To guarantee scalability, we propose to compute the density function based on the properties of sub-structures of the graph. Using representative geometric graphs from several application domains, we show that our approach outperforms existing graph matching algorithms in terms of matching quality, runtime, and memory consumption.
{"title":"Geometric graph matching and similarity: a probabilistic approach","authors":"Ayser Armiti, Michael Gertz","doi":"10.1145/2618243.2618259","DOIUrl":"https://doi.org/10.1145/2618243.2618259","url":null,"abstract":"Finding common structures is vital for many graph-based applications, such as road network analysis, pattern recognition, or drug discovery. Such a task is formalized as the inexact graph matching problem, which is known to be NP-hard. Several graph matching algorithms have been proposed to find approximate solutions. However, such algorithms still face many problems in terms of memory consumption, runtime, and tolerance to changes in graph structure or labels.\u0000 In this paper, we propose a solution to the inexact graph matching problem for geometric graphs in 2D space. Geometric graphs provide a suitable modeling framework for applications like the above, where vertices are located in some 2D space. The main idea of our approach is to formalize the graph matching problem in a maximum likelihood estimation framework. Then, the expectation maximization technique is used to estimate the match between two graphs. We propose a novel density function that estimates the similarity between the vertices of different graphs. It is computed based on both 1) the spatial properties of a vertex and its direct neighbors, and 2) the shortest paths that connect a vertex to other vertices in a graph. To guarantee scalability, we propose to compute the density function based on the properties of sub-structures of the graph. Using representative geometric graphs from several application domains, we show that our approach outperforms existing graph matching algorithms in terms of matching quality, runtime, and memory consumption.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"55 1","pages":"27:1-27:12"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83777622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Node classification in uncertain graphs
Michele Dallachiesa, C. Aggarwal, Themis Palpanas
In many real applications that use and analyze networked data, the links in the network graph may be erroneous or derived from probabilistic techniques. In such cases, the node classification problem can be challenging, since the unreliability of the links may affect the final results of the classification process. In this paper, we focus on situations that require the analysis of the uncertainty that is present in the graph structure. We study the novel problem of node classification in uncertain graphs, treating uncertainty as a first-class citizen. We propose two techniques based on a Bayes model and show the benefits of incorporating uncertainty into the classification process. The experimental results demonstrate the effectiveness of our approaches.
{"title":"Node classification in uncertain graphs","authors":"Michele Dallachiesa, C. Aggarwal, Themis Palpanas","doi":"10.1145/2618243.2618277","DOIUrl":"https://doi.org/10.1145/2618243.2618277","url":null,"abstract":"In many real applications that use and analyze networked data, the links in the network graph may be erroneous, or derived from probabilistic techniques. In such cases, the node classification problem can be challenging, since the unreliability of the links may affect the final results of the classification process. In this paper, we focus on situations that require the analysis of the uncertainty that is present in the graph structure. We study the novel problem of node classification in uncertain graphs, by treating uncertainty as a first-class citizen. We propose two techniques based on a Bayes model, and show the benefits of incorporating uncertainty in the classification process as a first-class citizen. The experimental results demonstrate the effectiveness of our approaches.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"82 1","pages":"32:1-32:4"},"PeriodicalIF":0.0,"publicationDate":"2014-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85597296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tuning large scale deduplication with reduced effort
Guilherme Dal Bianco, R. Galante, C. Heuser, Marcos André Gonçalves
Deduplication is the task of identifying which objects are potentially the same in a data repository. It usually demands user intervention in several steps of the process, mainly to identify pairs representing matches and non-matches. This information is then used to help identify other potentially duplicated records. When deduplication is applied to very large datasets, performance and matching quality depend on expert users configuring the most important steps of the process (e.g., blocking and classification). In this paper, we propose a new framework, FS-Dedup, that helps tune the deduplication process on large datasets with reduced effort from the user, who is only required to label a small, automatically selected subset of pairs. FS-Dedup exploits Signature-Based Deduplication (Sig-Dedup) algorithms in its deduplication core. Sig-Dedup is characterized by high efficiency and scalability on large datasets but requires an expert user to tune several parameters. FS-Dedup addresses this drawback by providing a framework that demands neither specialized user knowledge about the dataset nor manually chosen thresholds to produce high effectiveness. Our evaluation over large real and synthetic datasets (containing millions of records) shows that FS-Dedup is able to reach or even surpass the maximal matching quality obtained by Sig-Dedup techniques with reduced manual effort from the user.
{"title":"Tuning large scale deduplication with reduced effort","authors":"Guilherme Dal Bianco, R. Galante, C. Heuser, Marcos André Gonçalves","doi":"10.1145/2484838.2484873","DOIUrl":"https://doi.org/10.1145/2484838.2484873","url":null,"abstract":"Deduplication is the task of identifying which objects are potentially the same in a data repository. It usually demands user intervention in several steps of the process, mainly to identify some pairs representing matchings and non-matchings. This information is then used to help in identifying other potentially duplicated records. When deduplication is applied to very large datasets, the performance and matching quality depends on expert users to configure the most important steps of the process (e.g., blocking and classification). In this paper, we propose a new framework called FS-Dedup able to help tuning the deduplication process on large datasets with a reduced effort from the user, who is only required to label a small, automatically selected, subset of pairs. FS-Dedup exploits Signature-Based Deduplication (Sig-Dedup) algorithms in its deduplication core. Sig-Dedup is characterized by high efficiency and scalability in large datasets but requires an expert user to tune several parameters. FS-Dedup helps in solving this drawback by providing a framework that does not demand specialized user knowledge about the dataset or thresholds to produce high effectiveness. Our evaluation over large real and synthetic datasets (containing millions of records) shows that FS-Dedup is able to reach or even surpass the maximal matching quality obtained by Sig-Dedup techniques with a reduced manual effort from the user.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"32 1","pages":"18:1-18:12"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77256395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning to explore scientific workflow repositories
Julia Stoyanovich, Paramveer S. Dhillon, S. Davidson, Brian Lyons
Scientific workflows are gaining popularity, and repositories of workflows are starting to emerge. In this paper we describe TopicsExplorer, a data exploration approach for myExperiment.org, a collaborative platform for the exchange of scientific workflows and experimental plans. Our approach uses a variant of topic modeling with tags as features, and generates a browsable view of the repository. TopicsExplorer has been fully integrated into the open-source platform of myExperiment.org, and is available to users at www.myexperiment.org/topics. We also present our recently developed personalization component that customizes topics based on user feedback. Finally, we discuss our ongoing performance optimization efforts that make computing and managing personalized topic views of the myExperiment.org repository feasible.
{"title":"Learning to explore scientific workflow repositories","authors":"Julia Stoyanovich, Paramveer S. Dhillon, S. Davidson, Brian Lyons","doi":"10.1145/2484838.2484848","DOIUrl":"https://doi.org/10.1145/2484838.2484848","url":null,"abstract":"Scientific workflows are gaining popularity, and repositories of workflows are starting to emerge. In this paper we describe TopicsExplorer, a data exploration approach for myExperiment.org, a collaborative platform for the exchange of scientific workflows and experimental plans. Our approach uses a variant of topic modeling with tags as features, and generates a browsable view of the repository. TopicsExplorer has been fully integrated into the open-source platform of myExperiment.org, and is available to users at www.myexperiment.org/topics. We also present our recently developed personalization component that customizes topics based on user feedback. Finally, we discuss our ongoing performance optimization efforts that make computing and managing personalized topic views of the myExperiment.org repository feasible.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"27 1","pages":"31:1-31:4"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85378695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}