Asking the Right Questions in Crowd Data Sourcing
Rubi Boim, Ohad Greenshpan, T. Milo, Slava Novgorodov, N. Polyzotis, W. Tan
doi:10.1109/ICDE.2012.122
Crowd-based data sourcing is a new and powerful data procurement paradigm that engages Web users to collectively contribute information. In this work, we target the problem of gathering data from the crowd in an economical and principled fashion. We present Ask It!, a system that allows interactive data-sourcing applications to determine which questions should be directed to which users so as to reduce the uncertainty about the collected data. Ask It! uses a set of novel algorithms to minimize the number of probes (questions) required from the different users. We demonstrate the challenge and our solution in the context of a multiple-choice question game played by the ICDE'12 attendees, designed to gather information on the conference's publications, authors and colleagues.
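The abstract does not spell out Ask It!'s algorithms; one common baseline for "ask the question that most reduces uncertainty" is to probe the item whose answers collected so far have the highest entropy. A minimal sketch of that baseline (entirely our illustration, not the paper's method):

```python
import math
from collections import Counter

def entropy(votes):
    """Shannon entropy of the empirical answer distribution for one question."""
    total = sum(votes.values())
    return -sum((c / total) * math.log2(c / total) for c in votes.values())

def next_question(vote_log):
    """Pick the question whose answers collected so far are most uncertain."""
    return max(vote_log, key=lambda q: entropy(vote_log[q]))

votes = {
    "Who authored paper X?": Counter({"Alice": 3, "Bob": 3}),          # split vote
    "Which track is paper Y in?": Counter({"Demo": 5, "Research": 1}),
}
print(next_question(votes))  # -> "Who authored paper X?"
```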
{"title":"Asking the Right Questions in Crowd Data Sourcing","authors":"Rubi Boim, Ohad Greenshpan, T. Milo, Slava Novgorodov, N. Polyzotis, W. Tan","doi":"10.1109/ICDE.2012.122","DOIUrl":"https://doi.org/10.1109/ICDE.2012.122","url":null,"abstract":"Crowd-based data sourcing is a new and powerful data procurement paradigm that engages Web users to collectively contribute information. In this work, we target the problem of gathering data from the crowd in an economical and principled fashion. We present Ask It!, a system that allows interactive data sourcing applications to effectively determine which questions should be directed to which users for reducing the uncertainty about the collected data. Ask It! uses a set of novel algorithms for minimizing the number of probing (questions) required from the different users. We demonstrate the challenge and our solution in the context of a multiple-choice question game played by the ICDE'12 attendees, targeted to gather information on the conference's publications, authors and colleagues.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"415 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124820172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NYAYA: A System Supporting the Uniform Management of Large Sets of Semantic Data
R. D. Virgilio, G. Orsi, L. Tanca, Riccardo Torlone
doi:10.1109/ICDE.2012.133
We present NYAYA, a flexible system for the management of large-scale semantic data that couples a general-purpose storage mechanism with efficient ontological query answering. NYAYA rapidly imports semantic data expressed in different formalisms into semantic data kiosks. Each kiosk exposes its native ontological constraints in a uniform fashion using Datalog±, a very general rule-based language for the representation of ontological constraints. A group of kiosks forms a semantic data market in which the data in each kiosk can be uniformly accessed using conjunctive queries and users can specify their own constraints over the data. NYAYA is easily extensible, is robust to updates of both data and metadata in the kiosks, and readily adapts to different logical organizations of the persistent storage. In the demonstration, we will show the capabilities of NYAYA on real-world case studies and demonstrate its efficiency on well-known benchmarks.
{"title":"NYAYA: A System Supporting the Uniform Management of Large Sets of Semantic Data","authors":"R. D. Virgilio, G. Orsi, L. Tanca, Riccardo Torlone","doi":"10.1109/ICDE.2012.133","DOIUrl":"https://doi.org/10.1109/ICDE.2012.133","url":null,"abstract":"We present NYAYA, a flexible system for the management of large-scale semantic data which couples a general-purpose storage mechanism with efficient ontological query answering. NYAYA rapidly imports semantic data expressed in different formalisms into semantic data kiosks. Each kiosk exposes the native ontological constraints in a uniform fashion using data log±, a very general rule-based language for the representation of ontological constraints. A group of kiosks forms a semantic data market where the data in each kiosk can be uniformly accessed using conjunctive queries and where users can specify user-defined constraints over the data. NYAYA is easily extensible and robust to updates of both data and meta-data in the kiosk and can readily adapt to different logical organizations of the persistent storage. In the demonstration, we will show the capabilities of NYAYA over real-world case studies and demonstrate its efficiency over well-known benchmarks.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121648231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Branch Code: A Labeling Scheme for Efficient Query Answering on Trees
Yanghua Xiao, Ji Hong, Wanyun Cui, Zhenying He, Wei Wang, Guodong Feng
doi:10.1109/ICDE.2012.71
Labeling schemes lie at the core of query processing for tree-structured data such as the XML data flooding the web. A labeling scheme that can simultaneously and efficiently support relationship queries on trees (parent/child, descendant/ancestor, etc.), computation of lowest common ancestors (LCAs), and tree updates is essential for effective and efficient management of tree-structured data. Although a variety of labeling schemes (prefix-based, interval-based and prime-based, along with their variants) are available for encoding static and dynamic trees, each shows weakness in one aspect or another. In this paper, we propose an integer-based labeling scheme, the branch code, together with a compressed version, to simultaneously support efficient query processing on both static and dynamic ordered trees at affordable storage cost. The branch code answers common queries on ordered trees in constant time, at the cost of O(N log N) storage. To reduce the storage cost to O(N), we further develop a compressed branch code and give a relationship-determination algorithm that uses the compressed code alone and, as our experiments verify, produces false positives with very low probability. With the support of splay trees, the branch code also handles dynamic trees, so that updates and queries run in O(log N) amortized time. All the results above are either proved theoretically or verified by experimental studies.
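For context, here is a minimal sketch of interval-based labeling, one of the baseline schemes the abstract mentions: a DFS assigns each node an interval, and ancestry reduces to interval containment (illustrative only; this is not the branch code itself):

```python
def label(tree, root):
    """Assign each node a (start, end) DFS interval; tree maps node -> children."""
    labels, counter = {}, 0

    def dfs(node):
        nonlocal counter
        start = counter
        counter += 1
        for child in tree.get(node, []):
            dfs(child)
        labels[node] = (start, counter)
        counter += 1

    dfs(root)
    return labels

def is_ancestor(labels, u, v):
    """u is a proper ancestor of v iff u's interval strictly contains v's."""
    (us, ue), (vs, ve) = labels[u], labels[v]
    return us < vs and ve < ue

tree = {"a": ["b", "c"], "b": ["d"]}
L = label(tree, "a")
print(is_ancestor(L, "a", "d"))  # True
print(is_ancestor(L, "b", "c"))  # False
```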
{"title":"Branch Code: A Labeling Scheme for Efficient Query Answering on Trees","authors":"Yanghua Xiao, Ji Hong, Wanyun Cui, Zhenying He, Wei Wang, Guodong Feng","doi":"10.1109/ICDE.2012.71","DOIUrl":"https://doi.org/10.1109/ICDE.2012.71","url":null,"abstract":"Labeling schemes lie at the core of query processing for many tree-structured data such as XML data that is flooding the web. A labeling scheme that can simultaneously and efficiently support various relationship queries on trees (such as parent/children, descendant/ancestor, etc.), computation of lowest common ancestors (LCA) and update of trees, is desired for effective and efficient management of tree-structured data. Although a variety of labeling schemes such as prefix-based labeling, interval-based labeling and prime-based labeling as well as their variants have been available to us for encoding static and dynamic trees, these labeling schemes usually show weakness in one aspect or another. In this paper, we propose an integer-based labeling scheme branch code as well as its compressed version as our major solution to simultaneously support efficient query processing on both static and dynamic ordered trees with affordable storage cost. The proposed branch code can answer common queries on ordered trees in constant time, which comes at the cost of consuming O(N log N) storage. To reduce storage cost to O(N), a compressed branch code is further developed. We also give a relationship determination algorithm purely using compressed branch code, which is of quite low possibility to produce false positive results as verified by experimental results. With the support of splay trees, branch code can also support dynamic trees so that updates and queries can be implemented with O(log N) amortized cost. All the results above are either theoretically proved or verified by experimental studies.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124759687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Searching Uncertain Data Represented by Non-axis Parallel Gaussian Mixture Models
K. Haegler, F. Fiedler, C. Böhm
doi:10.1109/ICDE.2012.7
Efficient similarity search in uncertain data is a central problem in many modern applications such as biometric identification, stock market analysis, sensor networks, and medical imaging. In such applications, the feature vector of an object is not known exactly but is instead defined by a probability density function such as a Gaussian Mixture Model (GMM). Previous work is limited to axis-parallel Gaussian distributions; hence, correlations between different features are not considered in the similarity search. In this paper, we propose SUDN, a novel, efficient similarity search technique for general GMMs that makes no independence assumption on the attributes and approximates the actual components of a GMM in a conservative but tight way. A filter-refinement architecture guarantees no false dismissals (due to the conservativeness of the approximations) as well as good filter selectivity (due to their tightness). An extensive experimental evaluation of SUDN demonstrates a considerable speed-up of similarity queries on general GMMs and an increase in accuracy compared to existing approaches.
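The distinction the abstract draws is between axis-parallel (diagonal-covariance) Gaussians and general ones whose full covariance matrices capture feature correlations. A brief sketch of evaluating a full-covariance GMM density, the kind of object SUDN searches over (this illustrates the data model, not the SUDN approximation itself):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_pdf(x, weights, means, covs):
    """Density of a GMM with full (non-axis-parallel) covariance matrices."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

weights = [0.6, 0.4]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.array([[1.0, 0.8],    # off-diagonal entries encode the feature
                  [0.8, 1.0]]),  # correlations that axis-parallel models drop
        np.eye(2)]
print(gmm_pdf(np.array([0.5, 0.5]), weights, means, covs))
```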
{"title":"Searching Uncertain Data Represented by Non-axis Parallel Gaussian Mixture Models","authors":"K. Haegler, F. Fiedler, C. Böhm","doi":"10.1109/ICDE.2012.7","DOIUrl":"https://doi.org/10.1109/ICDE.2012.7","url":null,"abstract":"Efficient similarity search in uncertain data is a central problem in many modern applications such as biometric identification, stock market analysis, sensor networks, medical imaging, etc. In such applications, the feature vector of an object is not exactly known but is rather defined by a probability density function like a Gaussian Mixture Model (GMM). Previous work is limited to axis-parallel Gaussian distributions, hence, correlations between different features are not considered in the similarity search. In this paper, we propose a novel, efficient similarity search technique for general GMMs without independence assumption for the attributes, named SUDN, which approximates the actual components of a GMM in a conservative but tight way. A filter-refinement architecture guarantees no false dismissals, due to conservativity, as well as a good filter selectivity, due to the tightness of our approximations. An extensive experimental evaluation of SUDN demonstrates a considerable speed-up of similarity queries on general GMMs and an increase in accuracy compared to existing approaches.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131714904","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scalable Multi-query Optimization for SPARQL
Wangchao Le, Anastasios Kementsietsidis, S. Duan, Feifei Li
doi:10.1109/ICDE.2012.37
This paper revisits the classical problem of multi-query optimization in the context of RDF/SPARQL. We show that the techniques developed for relational and semi-structured data and query languages are hard, if not impossible, to extend to the RDF data model and to the graph query patterns expressed in SPARQL. In light of the NP-hardness of multi-query optimization for SPARQL, we propose heuristic algorithms that partition the input batch of queries into groups such that each group of queries can be optimized together. An essential component of the optimization is an efficient algorithm for discovering the common sub-structures of multiple SPARQL queries, together with an effective cost model for comparing candidate execution plans. Since our optimization techniques make no assumptions about the underlying SPARQL query engine, they are portable across different RDF stores. Extensive experimental studies on three popular RDF stores show that the proposed techniques are effective, efficient and scalable.
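The grouping step can be pictured with a simple proxy: treat each query as the set of its triple patterns and cluster queries that share patterns, so the shared part need only be evaluated once. A toy sketch under that assumption (the greedy clustering and the Jaccard threshold are our simplifications, not the paper's algorithm):

```python
def jaccard(a, b):
    """Overlap of two queries viewed as sets of triple patterns."""
    return len(a & b) / len(a | b)

def group_queries(queries, threshold=0.3):
    """Greedily cluster queries that share enough triple patterns."""
    groups = []
    for q in queries:
        for g in groups:
            if any(jaccard(q, other) >= threshold for other in g):
                g.append(q)
                break
        else:
            groups.append([q])
    return groups

q1 = frozenset({("?x", "rdf:type", ":Paper"), ("?x", ":author", "?a")})
q2 = frozenset({("?x", "rdf:type", ":Paper"), ("?x", ":year", "?y")})
q3 = frozenset({("?s", ":cites", "?t")})
for g in group_queries([q1, q2, q3]):
    print(len(g))  # q1 and q2 fall into one group (shared pattern); q3 stands alone
```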
{"title":"Scalable Multi-query Optimization for SPARQL","authors":"Wangchao Le, Anastasios Kementsietsidis, S. Duan, Feifei Li","doi":"10.1109/ICDE.2012.37","DOIUrl":"https://doi.org/10.1109/ICDE.2012.37","url":null,"abstract":"This paper revisits the classical problem of multi-query optimization in the context of RDF/SPARQL. We show that the techniques developed for relational and semi-structured data/query languages are hard, if not impossible, to be extended to account for RDF data model and graph query patterns expressed in SPARQL. In light of the NP-hardness of the multi-query optimization for SPARQL, we propose heuristic algorithms that partition the input batch of queries into groups such that each group of queries can be optimized together. An essential component of the optimization incorporates an efficient algorithm to discover the common sub-structures of multiple SPARQL queries and an effective cost model to compare candidate execution plans. Since our optimization techniques do not make any assumption about the underlying SPARQL query engine, they have the advantage of being portable across different RDF stores. The extensive experimental studies, performed on three popular RDF stores, show that the proposed techniques are effective, efficient and scalable.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"341 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122280751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating Probabilistic Queries over Uncertain Matching
Reynold Cheng, Jian Gong, D. Cheung, Jiefeng Cheng
doi:10.1109/ICDE.2012.14
A matching between two database schemas, generated by machine-learning techniques (e.g., COMA++), is often uncertain. Handling the uncertainty of schema matching has recently attracted a lot of research interest, because the quality of applications relies on the matching result. We study query evaluation over an inexact schema matching, represented as a set of "possible mappings" together with the probabilities that they are correct. Since the number of possible mappings can be large, evaluating queries through these mappings can be expensive. Observing that the possible mappings between two schemas often exhibit a high degree of overlap, we develop two efficient solutions. We also present a fast algorithm to compute the answers with the k highest probabilities. An extensive evaluation on real schemas shows that our approaches improve query performance by almost an order of magnitude.
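The underlying semantics can be sketched directly: evaluate the query under each possible mapping and accumulate each answer's probability over the mappings that produce it. The naive enumeration below is exactly what the paper's overlap-exploiting solutions avoid; names and data are ours:

```python
from collections import defaultdict

def query_answers(mappings, evaluate):
    """Possible-mapping semantics: P(answer) = sum of P(mapping) over the
    mappings under which the query returns that answer."""
    probs = defaultdict(float)
    for mapping, p in mappings:
        for answer in evaluate(mapping):
            probs[answer] += p
    return dict(probs)

# Toy setup: the uncertain mapping decides which source column feeds 'name'.
source = {"col_a": ["Ann", "Bo"], "col_b": ["Ann"]}
mappings = [({"name": "col_a"}, 0.75), ({"name": "col_b"}, 0.25)]
evaluate = lambda m: source[m["name"]]   # "SELECT name" under a given mapping
print(query_answers(mappings, evaluate))
# {'Ann': 1.0, 'Bo': 0.75} -- 'Ann' is returned under both possible mappings
```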
{"title":"Evaluating Probabilistic Queries over Uncertain Matching","authors":"Reynold Cheng, Jian Gong, D. Cheung, Jiefeng Cheng","doi":"10.1109/ICDE.2012.14","DOIUrl":"https://doi.org/10.1109/ICDE.2012.14","url":null,"abstract":"A matching between two database schemas, generated by machine learning techniques (e.g., COMA++), is often uncertain. Handling the uncertainty of schema matching has recently raised a lot of research interest, because the quality of applications rely on the matching result. We study query evaluation over an inexact schema matching, which is represented as a set of ``possible mappings'', as well as the probabilities that they are correct. Since the number of possible mappings can be large, evaluating queries through these mappings can be expensive. By observing the fact that the possible mappings between two schemas often exhibit a high degree of overlap, we develop two efficient solutions. We also present a fast algorithm to compute answers with the k highest probabilities. An extensive evaluation on real schemas shows that our approaches improve the query performance by almost an order of magnitude.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129985010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R2DB: A System for Querying and Visualizing Weighted RDF Graphs
Songling Liu, J. P. Cedeño, K. Candan, M. Sapino, Shengyu Huang, Xinsheng Li
doi:10.1109/ICDE.2012.134
Existing RDF query languages and RDF stores fail to support a large class of knowledge applications that associate utilities or costs with the available knowledge statements. A recent proposal includes (a) a ranked RDF (R2DF) specification that enhances RDF triples with application-specific weights and (b) a SPARankQL query language specification, which provides novel primitives on top of SPARQL to express top-k queries using traditional query patterns as well as novel flexible path predicates. We introduce and demonstrate R2DB, a database system for querying weighted RDF graphs. R2DB relies on the AR2Q query processing engine, which leverages novel index structures to support efficient ranked path search and includes query optimization strategies based on proximity and sub-result inter-arrival times. In addition to being the first data management system for the R2DF data model, R2DB provides an innovative features-of-interest (FoI) based method for visualizing large sets of query results (i.e., subgraphs of the data graph).
{"title":"R2DB: A System for Querying and Visualizing Weighted RDF Graphs","authors":"Songling Liu, J. P. Cedeño, K. Candan, M. Sapino, Shengyu Huang, Xinsheng Li","doi":"10.1109/ICDE.2012.134","DOIUrl":"https://doi.org/10.1109/ICDE.2012.134","url":null,"abstract":"Existing RDF query languages and RDF stores fail to support a large class of knowledge applications which associate utilities or costs on the available knowledge statements. A recent proposal includes (a) a ranked RDF (R2DF) specification to enhance RDF triples with an application specific weights and (b) a SPA Rank QL query language specification, which provides novel primitives on top of the SPARQL language to express top-k queries using traditional query patterns as well as novel flexible path predicates. We introduce and demonstrate R2DB, a database system for querying weighted RDF graphs. R2DB relies on the AR2Q query processing engine, which leverages novel index structures to support efficient ranked path search and includes query optimization strategies based on proximity and sub-result inter-arrival times. In addition to being the first data management system for the R2DF data model, R2DB also provides an innovative features-of-interest (FoI) based method for visualizing large sets of query results (i.e., sub graphs of the data graph).","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128392639","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel Top-K Similarity Join Algorithms Using MapReduce
Younghoon Kim, Kyuseok Shim
doi:10.1109/ICDE.2012.87
A wide range of applications requires finding the top-k most similar pairs of records in a given database. Computing such top-k similarity joins is challenging today, however, as a growing number of applications must deal with vast amounts of data. For such data-intensive applications, parallel execution on large clusters of commodity machines using the MapReduce paradigm has recently received a lot of attention. In this paper, we investigate how top-k similarity join algorithms can benefit from the popular MapReduce framework. We first develop divide-and-conquer and branch-and-bound algorithms. We then propose the all-pair partitioning and essential-pair partitioning methods to minimize the amount of data transferred between the map and reduce functions. Finally, experiments on both synthetic and real-life datasets confirm the effectiveness and scalability of our MapReduce algorithms.
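To ground the MapReduce framing: mappers compute local top-k candidate pairs over their partitions, and a reducer merges the local lists into a global top-k. A single-machine sketch of that flow (the partitioning here is naive and drops cross-partition pairs; the paper's all-pair and essential-pair methods are designed precisely to cover them efficiently):

```python
import heapq
from itertools import combinations

def similarity(a, b):
    """Toy similarity: the closer two numeric records, the higher the score."""
    return -abs(a - b)

def map_topk(partition, k):
    """Map phase: emit the local top-k most similar pairs within one partition."""
    pairs = ((similarity(a, b), (a, b)) for a, b in combinations(partition, 2))
    return heapq.nlargest(k, pairs)

def reduce_topk(local_results, k):
    """Reduce phase: merge local candidate lists into the global top-k."""
    return heapq.nlargest(k, (p for local in local_results for p in local))

records = [1, 2, 9, 10, 10, 4]
# Naive disjoint split: cross-partition pairs are missed here; all-pair
# partitioning replicates records so that every pair is covered.
partitions = [records[:3], records[3:]]
print(reduce_topk([map_topk(p, 2) for p in partitions], k=2))
```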
{"title":"Parallel Top-K Similarity Join Algorithms Using MapReduce","authors":"Younghoon Kim, Kyuseok Shim","doi":"10.1109/ICDE.2012.87","DOIUrl":"https://doi.org/10.1109/ICDE.2012.87","url":null,"abstract":"There is a wide range of applications that require finding the top-k most similar pairs of records in a given database. However, computing such top-k similarity joins is a challenging problem today, as there is an increasing trend of applications that expect to deal with vast amounts of data. For such data-intensive applications, parallel executions of programs on a large cluster of commodity machines using the MapReduce paradigm have recently received a lot of attention. In this paper, we investigate how the top-k similarity join algorithms can get benefits from the popular MapReduce framework. We first develop the divide-and-conquer and branch-and-bound algorithms. We next propose the all pair partitioning and essential pair partitioning methods to minimize the amount of data transfers between map and reduce functions. We finally perform the experiments with not only synthetic but also real-life data sets. Our performance study confirms the effectiveness and scalability of our MapReduce algorithms.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131187138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-version Concurrency via Timestamp Range Conflict Management
D. Lomet, A. Fekete, Rui Wang, Peter Ward
doi:10.1109/ICDE.2012.10
A database supporting multiple versions of records may use the versions to answer queries about past states or to increase concurrency by allowing reads and writes to proceed concurrently. We introduce a new concurrency control approach that enables all SQL isolation levels, including serializability, to exploit multiple versions for increased concurrency while also supporting transaction-time database functionality. The key insight is to manage, for each transaction, a range of possible timestamps that captures the impact of the conflicts that have occurred. Using these ranges as constraints often permits concurrent access where lock-based concurrency control would block, and it can substitute blocking for some of the aborts that are common in earlier multi-version concurrency techniques. Timestamp ranges can also be used to detect deadlocks conservatively without graph-based cycle detection. Thus, our multi-version support can enhance the performance of current-time data access through improved concurrency while supporting transaction-time functionality.
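The core mechanism, as the abstract describes it, is a per-transaction range of admissible commit timestamps that conflicts progressively narrow; an empty range signals the transaction cannot proceed as-is. A minimal model under our own naming (the paper additionally covers all isolation levels and deadlock handling):

```python
class Txn:
    """Transaction carrying a range [low, high) of admissible commit timestamps."""
    def __init__(self, low, high):
        self.low, self.high = low, high

    def must_follow(self, other):
        """A conflict forces self to commit after other: raise the lower bound."""
        self.low = max(self.low, other.low + 1)
        return self.low < self.high      # False: range empty -> block or abort

    def must_precede(self, other):
        """A conflict forces self to commit before other: cap the upper bound."""
        self.high = min(self.high, other.high - 1)
        return self.low < self.high

t1, t2 = Txn(0, 10), Txn(0, 2)
print(t2.must_follow(t1))          # True: t2's range narrows to [1, 2)
print(t2.must_follow(Txn(5, 10)))  # False: range empties, so t2 blocks or aborts
```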
{"title":"Multi-version Concurrency via Timestamp Range Conflict Management","authors":"D. Lomet, A. Fekete, Rui Wang, Peter Ward","doi":"10.1109/ICDE.2012.10","DOIUrl":"https://doi.org/10.1109/ICDE.2012.10","url":null,"abstract":"A database supporting multiple versions of records may use the versions to support queries of the past or to increase concurrency by enabling reads and writes to be concurrent. We introduce a new concurrency control approach that enables all SQL isolation levels including serializability to utilize multiple versions to increase concurrency while also supporting transaction time database functionality. The key insight is to manage a range of possible timestamps for each transaction that captures the impact of conflicts that have occurred. Using these ranges as constraints often permits concurrent access where lock based concurrency control would block. This can also allow blocking instead of some aborts that are common in earlier multi-version concurrency techniques. Also, timestamp ranges can be used to conservatively find deadlocks without graph based cycle detection. Thus, our multi-version support can enhance performance of current time data access via improved concurrency, while supporting transaction time functionality.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"108 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134389331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient Graph Similarity Joins with Edit Distance Constraints
Xiang Zhao, Chuan Xiao, Xuemin Lin, Wei Wang
doi:10.1109/ICDE.2012.91
Graphs are widely used to model complicated data semantics in applications such as bioinformatics, chemistry, social networks and pattern recognition. A recent trend is to tolerate noise arising from various sources, such as erroneous data entry, and to find similarity matches. In this paper, we study the graph similarity join problem, which returns pairs of graphs whose edit distances are no larger than a given threshold. Inspired by the q-gram idea for the string similarity problem, our solution extracts paths from graphs as features for indexing. We establish a lower bound on the number of common features required for a pair to be a candidate. An efficient algorithm is proposed that exploits both matching and mismatching features to improve the filtering and verification of candidates. Extensive experiments on publicly available datasets demonstrate that the proposed algorithm significantly outperforms existing approaches.
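The count-filtering idea mirrors string q-grams: if two graphs are within edit distance tau, they must share at least a certain number of path features, so pairs sharing too few are pruned before expensive edit-distance verification. A sketch under our own simplifications (edges as length-1 path features, and an assumed per-edit damage bound; the paper derives the exact bound for longer paths):

```python
from collections import Counter

def edge_features(edges):
    """Multiset of edges as crude stand-ins for the paper's path features."""
    return Counter(tuple(sorted(e)) for e in edges)

def count_filter(f1, f2, tau, per_edit=2):
    """Candidate iff the common-feature count meets the lower bound.
    per_edit = assumed maximum number of features one edit can destroy."""
    common = sum((f1 & f2).values())
    bound = max(sum(f1.values()), sum(f2.values())) - per_edit * tau
    return common >= bound

g1 = [("a", "b"), ("b", "c"), ("c", "d")]
g2 = [("a", "b"), ("b", "c"), ("c", "e")]
g3 = [("x", "y")]
print(count_filter(edge_features(g1), edge_features(g2), tau=1))  # True: kept for verification
print(count_filter(edge_features(g1), edge_features(g3), tau=1))  # False: pruned early
```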
{"title":"Efficient Graph Similarity Joins with Edit Distance Constraints","authors":"Xiang Zhao, Chuan Xiao, Xuemin Lin, Wei Wang","doi":"10.1109/ICDE.2012.91","DOIUrl":"https://doi.org/10.1109/ICDE.2012.91","url":null,"abstract":"Graphs are widely used to model complicated data semantics in many applications in bioinformatics, chemistry, social networks, pattern recognition, etc. A recent trend is to tolerate noise arising from various sources, such as erroneous data entry, and find similarity matches. In this paper, we study the graph similarity join problem that returns pairs of graphs such that their edit distances are no larger than a threshold. Inspired by the q-gram idea for string similarity problem, our solution extracts paths from graphs as features for indexing. We establish a lower bound of common features to generate candidates. An efficient algorithm is proposed to exploit both matching and mismatching features to improve the filtering and verification on candidates. We demonstrate the proposed algorithm significantly outperforms existing approaches with extensive experiments on publicly available datasets.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132800625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}