Asking the Right Questions in Crowd Data Sourcing
Rubi Boim, Ohad Greenshpan, T. Milo, Slava Novgorodov, N. Polyzotis, W. Tan
doi:10.1109/ICDE.2012.122
Crowd-based data sourcing is a new and powerful data procurement paradigm that engages Web users to collectively contribute information. In this work, we target the problem of gathering data from the crowd in an economical and principled fashion. We present Ask It!, a system that allows interactive data-sourcing applications to determine which questions should be directed to which users so as to reduce the uncertainty about the collected data. Ask It! uses a set of novel algorithms to minimize the number of probes (questions) required from the different users. We demonstrate the challenge and our solution in the context of a multiple-choice question game played by the ICDE'12 attendees, designed to gather information on the conference's publications, authors and colleagues.
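The abstract does not spell out Ask It!'s algorithms; one common baseline for "ask the question that most reduces uncertainty" is to probe the item whose answers collected so far have the highest entropy. A minimal sketch of that baseline (entirely our illustration, not the paper's method):

```python
import math
from collections import Counter

def entropy(votes):
    """Shannon entropy of the empirical answer distribution for one question."""
    total = sum(votes.values())
    return -sum((c / total) * math.log2(c / total) for c in votes.values())

def next_question(vote_log):
    """Pick the question whose answers collected so far are most uncertain."""
    return max(vote_log, key=lambda q: entropy(vote_log[q]))

votes = {
    "Who authored paper X?": Counter({"Alice": 3, "Bob": 3}),          # split vote
    "Which track is paper Y in?": Counter({"Demo": 5, "Research": 1}),
}
print(next_question(votes))  # -> "Who authored paper X?"
```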
{"title":"Asking the Right Questions in Crowd Data Sourcing","authors":"Rubi Boim, Ohad Greenshpan, T. Milo, Slava Novgorodov, N. Polyzotis, W. Tan","doi":"10.1109/ICDE.2012.122","DOIUrl":"https://doi.org/10.1109/ICDE.2012.122","url":null,"abstract":"Crowd-based data sourcing is a new and powerful data procurement paradigm that engages Web users to collectively contribute information. In this work, we target the problem of gathering data from the crowd in an economical and principled fashion. We present Ask It!, a system that allows interactive data sourcing applications to effectively determine which questions should be directed to which users for reducing the uncertainty about the collected data. Ask It! uses a set of novel algorithms for minimizing the number of probing (questions) required from the different users. We demonstrate the challenge and our solution in the context of a multiple-choice question game played by the ICDE'12 attendees, targeted to gather information on the conference's publications, authors and colleagues.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"415 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124820172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NYAYA: A System Supporting the Uniform Management of Large Sets of Semantic Data
R. D. Virgilio, G. Orsi, L. Tanca, Riccardo Torlone
doi:10.1109/ICDE.2012.133
We present NYAYA, a flexible system for the management of large-scale semantic data that couples a general-purpose storage mechanism with efficient ontological query answering. NYAYA rapidly imports semantic data expressed in different formalisms into semantic data kiosks. Each kiosk exposes its native ontological constraints in a uniform fashion using Datalog±, a very general rule-based language for the representation of ontological constraints. A group of kiosks forms a semantic data market in which the data in each kiosk can be uniformly accessed using conjunctive queries and users can specify their own constraints over the data. NYAYA is easily extensible, is robust to updates of both data and metadata in the kiosks, and readily adapts to different logical organizations of the persistent storage. In the demonstration, we will show the capabilities of NYAYA on real-world case studies and demonstrate its efficiency on well-known benchmarks.
{"title":"NYAYA: A System Supporting the Uniform Management of Large Sets of Semantic Data","authors":"R. D. Virgilio, G. Orsi, L. Tanca, Riccardo Torlone","doi":"10.1109/ICDE.2012.133","DOIUrl":"https://doi.org/10.1109/ICDE.2012.133","url":null,"abstract":"We present NYAYA, a flexible system for the management of large-scale semantic data which couples a general-purpose storage mechanism with efficient ontological query answering. NYAYA rapidly imports semantic data expressed in different formalisms into semantic data kiosks. Each kiosk exposes the native ontological constraints in a uniform fashion using data log±, a very general rule-based language for the representation of ontological constraints. A group of kiosks forms a semantic data market where the data in each kiosk can be uniformly accessed using conjunctive queries and where users can specify user-defined constraints over the data. NYAYA is easily extensible and robust to updates of both data and meta-data in the kiosk and can readily adapt to different logical organizations of the persistent storage. In the demonstration, we will show the capabilities of NYAYA over real-world case studies and demonstrate its efficiency over well-known benchmarks.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121648231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Branch Code: A Labeling Scheme for Efficient Query Answering on Trees
Yanghua Xiao, Ji Hong, Wanyun Cui, Zhenying He, Wei Wang, Guodong Feng
doi:10.1109/ICDE.2012.71
Labeling schemes lie at the core of query processing for tree-structured data such as the XML data flooding the web. A labeling scheme that can simultaneously and efficiently support relationship queries on trees (parent/child, descendant/ancestor, etc.), computation of lowest common ancestors (LCAs), and tree updates is essential for effective and efficient management of tree-structured data. Although a variety of labeling schemes (prefix-based, interval-based and prime-based, along with their variants) are available for encoding static and dynamic trees, each shows weakness in one aspect or another. In this paper, we propose an integer-based labeling scheme, the branch code, together with a compressed version, to simultaneously support efficient query processing on both static and dynamic ordered trees at affordable storage cost. The branch code answers common queries on ordered trees in constant time, at the cost of O(N log N) storage. To reduce the storage cost to O(N), we further develop a compressed branch code and give a relationship-determination algorithm that uses the compressed code alone and, as our experiments verify, produces false positives with very low probability. With the support of splay trees, the branch code also handles dynamic trees, so that updates and queries run in O(log N) amortized time. All the results above are either proved theoretically or verified by experimental studies.
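For context, here is a minimal sketch of interval-based labeling, one of the baseline schemes the abstract mentions: a DFS assigns each node an interval, and ancestry reduces to interval containment (illustrative only; this is not the branch code itself):

```python
def label(tree, root):
    """Assign each node a (start, end) DFS interval; tree maps node -> children."""
    labels, counter = {}, 0

    def dfs(node):
        nonlocal counter
        start = counter
        counter += 1
        for child in tree.get(node, []):
            dfs(child)
        labels[node] = (start, counter)
        counter += 1

    dfs(root)
    return labels

def is_ancestor(labels, u, v):
    """u is a proper ancestor of v iff u's interval strictly contains v's."""
    (us, ue), (vs, ve) = labels[u], labels[v]
    return us < vs and ve < ue

tree = {"a": ["b", "c"], "b": ["d"]}
L = label(tree, "a")
print(is_ancestor(L, "a", "d"))  # True
print(is_ancestor(L, "b", "c"))  # False
```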
{"title":"Branch Code: A Labeling Scheme for Efficient Query Answering on Trees","authors":"Yanghua Xiao, Ji Hong, Wanyun Cui, Zhenying He, Wei Wang, Guodong Feng","doi":"10.1109/ICDE.2012.71","DOIUrl":"https://doi.org/10.1109/ICDE.2012.71","url":null,"abstract":"Labeling schemes lie at the core of query processing for many tree-structured data such as XML data that is flooding the web. A labeling scheme that can simultaneously and efficiently support various relationship queries on trees (such as parent/children, descendant/ancestor, etc.), computation of lowest common ancestors (LCA) and update of trees, is desired for effective and efficient management of tree-structured data. Although a variety of labeling schemes such as prefix-based labeling, interval-based labeling and prime-based labeling as well as their variants have been available to us for encoding static and dynamic trees, these labeling schemes usually show weakness in one aspect or another. In this paper, we propose an integer-based labeling scheme branch code as well as its compressed version as our major solution to simultaneously support efficient query processing on both static and dynamic ordered trees with affordable storage cost. The proposed branch code can answer common queries on ordered trees in constant time, which comes at the cost of consuming O(N log N) storage. To reduce storage cost to O(N), a compressed branch code is further developed. We also give a relationship determination algorithm purely using compressed branch code, which is of quite low possibility to produce false positive results as verified by experimental results. With the support of splay trees, branch code can also support dynamic trees so that updates and queries can be implemented with O(log N) amortized cost. All the results above are either theoretically proved or verified by experimental studies.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124759687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Searching Uncertain Data Represented by Non-axis Parallel Gaussian Mixture Models
K. Haegler, F. Fiedler, C. Böhm
doi:10.1109/ICDE.2012.7
Efficient similarity search in uncertain data is a central problem in many modern applications such as biometric identification, stock market analysis, sensor networks, and medical imaging. In such applications, the feature vector of an object is not known exactly but is instead defined by a probability density function such as a Gaussian Mixture Model (GMM). Previous work is limited to axis-parallel Gaussian distributions; hence, correlations between different features are not considered in the similarity search. In this paper, we propose SUDN, a novel, efficient similarity search technique for general GMMs that makes no independence assumption on the attributes and approximates the actual components of a GMM in a conservative but tight way. A filter-refinement architecture guarantees no false dismissals (due to the conservativeness of the approximations) as well as good filter selectivity (due to their tightness). An extensive experimental evaluation of SUDN demonstrates a considerable speed-up of similarity queries on general GMMs and an increase in accuracy compared to existing approaches.
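The distinction the abstract draws is between axis-parallel (diagonal-covariance) Gaussians and general ones whose full covariance matrices capture feature correlations. A brief sketch of evaluating a full-covariance GMM density, the kind of object SUDN searches over (this illustrates the data model, not the SUDN approximation itself):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_pdf(x, weights, means, covs):
    """Density of a GMM with full (non-axis-parallel) covariance matrices."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

weights = [0.6, 0.4]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.array([[1.0, 0.8],    # off-diagonal entries encode the feature
                  [0.8, 1.0]]),  # correlations that axis-parallel models drop
        np.eye(2)]
print(gmm_pdf(np.array([0.5, 0.5]), weights, means, covs))
```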
{"title":"Searching Uncertain Data Represented by Non-axis Parallel Gaussian Mixture Models","authors":"K. Haegler, F. Fiedler, C. Böhm","doi":"10.1109/ICDE.2012.7","DOIUrl":"https://doi.org/10.1109/ICDE.2012.7","url":null,"abstract":"Efficient similarity search in uncertain data is a central problem in many modern applications such as biometric identification, stock market analysis, sensor networks, medical imaging, etc. In such applications, the feature vector of an object is not exactly known but is rather defined by a probability density function like a Gaussian Mixture Model (GMM). Previous work is limited to axis-parallel Gaussian distributions, hence, correlations between different features are not considered in the similarity search. In this paper, we propose a novel, efficient similarity search technique for general GMMs without independence assumption for the attributes, named SUDN, which approximates the actual components of a GMM in a conservative but tight way. A filter-refinement architecture guarantees no false dismissals, due to conservativity, as well as a good filter selectivity, due to the tightness of our approximations. An extensive experimental evaluation of SUDN demonstrates a considerable speed-up of similarity queries on general GMMs and an increase in accuracy compared to existing approaches.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131714904","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scalable Multi-query Optimization for SPARQL
Wangchao Le, Anastasios Kementsietsidis, S. Duan, Feifei Li
doi:10.1109/ICDE.2012.37
This paper revisits the classical problem of multi-query optimization in the context of RDF/SPARQL. We show that the techniques developed for relational and semi-structured data and query languages are hard, if not impossible, to extend to the RDF data model and to the graph query patterns expressed in SPARQL. In light of the NP-hardness of multi-query optimization for SPARQL, we propose heuristic algorithms that partition the input batch of queries into groups such that each group of queries can be optimized together. An essential component of the optimization is an efficient algorithm for discovering the common sub-structures of multiple SPARQL queries, together with an effective cost model for comparing candidate execution plans. Since our optimization techniques make no assumptions about the underlying SPARQL query engine, they are portable across different RDF stores. Extensive experimental studies on three popular RDF stores show that the proposed techniques are effective, efficient and scalable.
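The grouping step can be pictured with a simple proxy: treat each query as the set of its triple patterns and cluster queries that share patterns, so the shared part need only be evaluated once. A toy sketch under that assumption (the greedy clustering and the Jaccard threshold are our simplifications, not the paper's algorithm):

```python
def jaccard(a, b):
    """Overlap of two queries viewed as sets of triple patterns."""
    return len(a & b) / len(a | b)

def group_queries(queries, threshold=0.3):
    """Greedily cluster queries that share enough triple patterns."""
    groups = []
    for q in queries:
        for g in groups:
            if any(jaccard(q, other) >= threshold for other in g):
                g.append(q)
                break
        else:
            groups.append([q])
    return groups

q1 = frozenset({("?x", "rdf:type", ":Paper"), ("?x", ":author", "?a")})
q2 = frozenset({("?x", "rdf:type", ":Paper"), ("?x", ":year", "?y")})
q3 = frozenset({("?s", ":cites", "?t")})
for g in group_queries([q1, q2, q3]):
    print(len(g))  # q1 and q2 fall into one group (shared pattern); q3 stands alone
```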
{"title":"Scalable Multi-query Optimization for SPARQL","authors":"Wangchao Le, Anastasios Kementsietsidis, S. Duan, Feifei Li","doi":"10.1109/ICDE.2012.37","DOIUrl":"https://doi.org/10.1109/ICDE.2012.37","url":null,"abstract":"This paper revisits the classical problem of multi-query optimization in the context of RDF/SPARQL. We show that the techniques developed for relational and semi-structured data/query languages are hard, if not impossible, to be extended to account for RDF data model and graph query patterns expressed in SPARQL. In light of the NP-hardness of the multi-query optimization for SPARQL, we propose heuristic algorithms that partition the input batch of queries into groups such that each group of queries can be optimized together. An essential component of the optimization incorporates an efficient algorithm to discover the common sub-structures of multiple SPARQL queries and an effective cost model to compare candidate execution plans. Since our optimization techniques do not make any assumption about the underlying SPARQL query engine, they have the advantage of being portable across different RDF stores. The extensive experimental studies, performed on three popular RDF stores, show that the proposed techniques are effective, efficient and scalable.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"341 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122280751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating Probabilistic Queries over Uncertain Matching
Reynold Cheng, Jian Gong, D. Cheung, Jiefeng Cheng
doi:10.1109/ICDE.2012.14
A matching between two database schemas, generated by machine-learning techniques (e.g., COMA++), is often uncertain. Handling the uncertainty of schema matching has recently attracted a lot of research interest, because the quality of applications relies on the matching result. We study query evaluation over an inexact schema matching, represented as a set of "possible mappings" together with the probabilities that they are correct. Since the number of possible mappings can be large, evaluating queries through these mappings can be expensive. Observing that the possible mappings between two schemas often exhibit a high degree of overlap, we develop two efficient solutions. We also present a fast algorithm to compute the answers with the k highest probabilities. An extensive evaluation on real schemas shows that our approaches improve query performance by almost an order of magnitude.
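The underlying semantics can be sketched directly: evaluate the query under each possible mapping and accumulate each answer's probability over the mappings that produce it. The naive enumeration below is exactly what the paper's overlap-exploiting solutions avoid; names and data are ours:

```python
from collections import defaultdict

def query_answers(mappings, evaluate):
    """Possible-mapping semantics: P(answer) = sum of P(mapping) over the
    mappings under which the query returns that answer."""
    probs = defaultdict(float)
    for mapping, p in mappings:
        for answer in evaluate(mapping):
            probs[answer] += p
    return dict(probs)

# Toy setup: the uncertain mapping decides which source column feeds 'name'.
source = {"col_a": ["Ann", "Bo"], "col_b": ["Ann"]}
mappings = [({"name": "col_a"}, 0.75), ({"name": "col_b"}, 0.25)]
evaluate = lambda m: source[m["name"]]   # "SELECT name" under a given mapping
print(query_answers(mappings, evaluate))
# {'Ann': 1.0, 'Bo': 0.75} -- 'Ann' is returned under both possible mappings
```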
{"title":"Evaluating Probabilistic Queries over Uncertain Matching","authors":"Reynold Cheng, Jian Gong, D. Cheung, Jiefeng Cheng","doi":"10.1109/ICDE.2012.14","DOIUrl":"https://doi.org/10.1109/ICDE.2012.14","url":null,"abstract":"A matching between two database schemas, generated by machine learning techniques (e.g., COMA++), is often uncertain. Handling the uncertainty of schema matching has recently raised a lot of research interest, because the quality of applications rely on the matching result. We study query evaluation over an inexact schema matching, which is represented as a set of ``possible mappings'', as well as the probabilities that they are correct. Since the number of possible mappings can be large, evaluating queries through these mappings can be expensive. By observing the fact that the possible mappings between two schemas often exhibit a high degree of overlap, we develop two efficient solutions. We also present a fast algorithm to compute answers with the k highest probabilities. An extensive evaluation on real schemas shows that our approaches improve the query performance by almost an order of magnitude.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129985010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R2DB: A System for Querying and Visualizing Weighted RDF Graphs
Songling Liu, J. P. Cedeño, K. Candan, M. Sapino, Shengyu Huang, Xinsheng Li
doi:10.1109/ICDE.2012.134
Existing RDF query languages and RDF stores fail to support a large class of knowledge applications that associate utilities or costs with the available knowledge statements. A recent proposal includes (a) a ranked RDF (R2DF) specification that enhances RDF triples with application-specific weights and (b) a SPARankQL query language specification, which provides novel primitives on top of SPARQL to express top-k queries using traditional query patterns as well as novel flexible path predicates. We introduce and demonstrate R2DB, a database system for querying weighted RDF graphs. R2DB relies on the AR2Q query processing engine, which leverages novel index structures to support efficient ranked path search and includes query optimization strategies based on proximity and sub-result inter-arrival times. In addition to being the first data management system for the R2DF data model, R2DB provides an innovative features-of-interest (FoI) based method for visualizing large sets of query results (i.e., subgraphs of the data graph).
{"title":"R2DB: A System for Querying and Visualizing Weighted RDF Graphs","authors":"Songling Liu, J. P. Cedeño, K. Candan, M. Sapino, Shengyu Huang, Xinsheng Li","doi":"10.1109/ICDE.2012.134","DOIUrl":"https://doi.org/10.1109/ICDE.2012.134","url":null,"abstract":"Existing RDF query languages and RDF stores fail to support a large class of knowledge applications which associate utilities or costs on the available knowledge statements. A recent proposal includes (a) a ranked RDF (R2DF) specification to enhance RDF triples with an application specific weights and (b) a SPA Rank QL query language specification, which provides novel primitives on top of the SPARQL language to express top-k queries using traditional query patterns as well as novel flexible path predicates. We introduce and demonstrate R2DB, a database system for querying weighted RDF graphs. R2DB relies on the AR2Q query processing engine, which leverages novel index structures to support efficient ranked path search and includes query optimization strategies based on proximity and sub-result inter-arrival times. In addition to being the first data management system for the R2DF data model, R2DB also provides an innovative features-of-interest (FoI) based method for visualizing large sets of query results (i.e., sub graphs of the data graph).","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128392639","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel Top-K Similarity Join Algorithms Using MapReduce
Younghoon Kim, Kyuseok Shim
doi:10.1109/ICDE.2012.87
A wide range of applications requires finding the top-k most similar pairs of records in a given database. Computing such top-k similarity joins is challenging today, however, as a growing number of applications must deal with vast amounts of data. For such data-intensive applications, parallel execution on large clusters of commodity machines using the MapReduce paradigm has recently received a lot of attention. In this paper, we investigate how top-k similarity join algorithms can benefit from the popular MapReduce framework. We first develop divide-and-conquer and branch-and-bound algorithms. We then propose the all-pair partitioning and essential-pair partitioning methods to minimize the amount of data transferred between the map and reduce functions. Finally, experiments on both synthetic and real-life datasets confirm the effectiveness and scalability of our MapReduce algorithms.
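To ground the MapReduce framing: mappers compute local top-k candidate pairs over their partitions, and a reducer merges the local lists into a global top-k. A single-machine sketch of that flow (the partitioning here is naive and drops cross-partition pairs; the paper's all-pair and essential-pair methods are designed precisely to cover them efficiently):

```python
import heapq
from itertools import combinations

def similarity(a, b):
    """Toy similarity: the closer two numeric records, the higher the score."""
    return -abs(a - b)

def map_topk(partition, k):
    """Map phase: emit the local top-k most similar pairs within one partition."""
    pairs = ((similarity(a, b), (a, b)) for a, b in combinations(partition, 2))
    return heapq.nlargest(k, pairs)

def reduce_topk(local_results, k):
    """Reduce phase: merge local candidate lists into the global top-k."""
    return heapq.nlargest(k, (p for local in local_results for p in local))

records = [1, 2, 9, 10, 10, 4]
# Naive disjoint split: cross-partition pairs are missed here; all-pair
# partitioning replicates records so that every pair is covered.
partitions = [records[:3], records[3:]]
print(reduce_topk([map_topk(p, 2) for p in partitions], k=2))
```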
{"title":"Parallel Top-K Similarity Join Algorithms Using MapReduce","authors":"Younghoon Kim, Kyuseok Shim","doi":"10.1109/ICDE.2012.87","DOIUrl":"https://doi.org/10.1109/ICDE.2012.87","url":null,"abstract":"There is a wide range of applications that require finding the top-k most similar pairs of records in a given database. However, computing such top-k similarity joins is a challenging problem today, as there is an increasing trend of applications that expect to deal with vast amounts of data. For such data-intensive applications, parallel executions of programs on a large cluster of commodity machines using the MapReduce paradigm have recently received a lot of attention. In this paper, we investigate how the top-k similarity join algorithms can get benefits from the popular MapReduce framework. We first develop the divide-and-conquer and branch-and-bound algorithms. We next propose the all pair partitioning and essential pair partitioning methods to minimize the amount of data transfers between map and reduce functions. We finally perform the experiments with not only synthetic but also real-life data sets. Our performance study confirms the effectiveness and scalability of our MapReduce algorithms.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131187138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-version Concurrency via Timestamp Range Conflict Management
D. Lomet, A. Fekete, Rui Wang, Peter Ward
doi:10.1109/ICDE.2012.10
A database supporting multiple versions of records may use the versions to answer queries about past states or to increase concurrency by allowing reads and writes to proceed concurrently. We introduce a new concurrency control approach that enables all SQL isolation levels, including serializability, to exploit multiple versions for increased concurrency while also supporting transaction-time database functionality. The key insight is to manage, for each transaction, a range of possible timestamps that captures the impact of the conflicts that have occurred. Using these ranges as constraints often permits concurrent access where lock-based concurrency control would block, and it can substitute blocking for some of the aborts that are common in earlier multi-version concurrency techniques. Timestamp ranges can also be used to detect deadlocks conservatively without graph-based cycle detection. Thus, our multi-version support can enhance the performance of current-time data access through improved concurrency while supporting transaction-time functionality.
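The core mechanism, as the abstract describes it, is a per-transaction range of admissible commit timestamps that conflicts progressively narrow; an empty range signals the transaction cannot proceed as-is. A minimal model under our own naming (the paper additionally covers all isolation levels and deadlock handling):

```python
class Txn:
    """Transaction carrying a range [low, high) of admissible commit timestamps."""
    def __init__(self, low, high):
        self.low, self.high = low, high

    def must_follow(self, other):
        """A conflict forces self to commit after other: raise the lower bound."""
        self.low = max(self.low, other.low + 1)
        return self.low < self.high      # False: range empty -> block or abort

    def must_precede(self, other):
        """A conflict forces self to commit before other: cap the upper bound."""
        self.high = min(self.high, other.high - 1)
        return self.low < self.high

t1, t2 = Txn(0, 10), Txn(0, 2)
print(t2.must_follow(t1))          # True: t2's range narrows to [1, 2)
print(t2.must_follow(Txn(5, 10)))  # False: range empties, so t2 blocks or aborts
```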
{"title":"Multi-version Concurrency via Timestamp Range Conflict Management","authors":"D. Lomet, A. Fekete, Rui Wang, Peter Ward","doi":"10.1109/ICDE.2012.10","DOIUrl":"https://doi.org/10.1109/ICDE.2012.10","url":null,"abstract":"A database supporting multiple versions of records may use the versions to support queries of the past or to increase concurrency by enabling reads and writes to be concurrent. We introduce a new concurrency control approach that enables all SQL isolation levels including serializability to utilize multiple versions to increase concurrency while also supporting transaction time database functionality. The key insight is to manage a range of possible timestamps for each transaction that captures the impact of conflicts that have occurred. Using these ranges as constraints often permits concurrent access where lock based concurrency control would block. This can also allow blocking instead of some aborts that are common in earlier multi-version concurrency techniques. Also, timestamp ranges can be used to conservatively find deadlocks without graph based cycle detection. Thus, our multi-version support can enhance performance of current time data access via improved concurrency, while supporting transaction time functionality.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"108 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134389331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient Graph Similarity Joins with Edit Distance Constraints
Xiang Zhao, Chuan Xiao, Xuemin Lin, Wei Wang
doi:10.1109/ICDE.2012.91
Graphs are widely used to model complicated data semantics in applications such as bioinformatics, chemistry, social networks and pattern recognition. A recent trend is to tolerate noise arising from various sources, such as erroneous data entry, and to find similarity matches. In this paper, we study the graph similarity join problem, which returns pairs of graphs whose edit distances are no larger than a given threshold. Inspired by the q-gram idea for the string similarity problem, our solution extracts paths from graphs as features for indexing. We establish a lower bound on the number of common features required for a pair to be a candidate. An efficient algorithm is proposed that exploits both matching and mismatching features to improve the filtering and verification of candidates. Extensive experiments on publicly available datasets demonstrate that the proposed algorithm significantly outperforms existing approaches.
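The count-filtering idea mirrors string q-grams: if two graphs are within edit distance tau, they must share at least a certain number of path features, so pairs sharing too few are pruned before expensive edit-distance verification. A sketch under our own simplifications (edges as length-1 path features, and an assumed per-edit damage bound; the paper derives the exact bound for longer paths):

```python
from collections import Counter

def edge_features(edges):
    """Multiset of edges as crude stand-ins for the paper's path features."""
    return Counter(tuple(sorted(e)) for e in edges)

def count_filter(f1, f2, tau, per_edit=2):
    """Candidate iff the common-feature count meets the lower bound.
    per_edit = assumed maximum number of features one edit can destroy."""
    common = sum((f1 & f2).values())
    bound = max(sum(f1.values()), sum(f2.values())) - per_edit * tau
    return common >= bound

g1 = [("a", "b"), ("b", "c"), ("c", "d")]
g2 = [("a", "b"), ("b", "c"), ("c", "e")]
g3 = [("x", "y")]
print(count_filter(edge_features(g1), edge_features(g2), tau=1))  # True: kept for verification
print(count_filter(edge_features(g1), edge_features(g3), tau=1))  # False: pruned early
```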
{"title":"Efficient Graph Similarity Joins with Edit Distance Constraints","authors":"Xiang Zhao, Chuan Xiao, Xuemin Lin, Wei Wang","doi":"10.1109/ICDE.2012.91","DOIUrl":"https://doi.org/10.1109/ICDE.2012.91","url":null,"abstract":"Graphs are widely used to model complicated data semantics in many applications in bioinformatics, chemistry, social networks, pattern recognition, etc. A recent trend is to tolerate noise arising from various sources, such as erroneous data entry, and find similarity matches. In this paper, we study the graph similarity join problem that returns pairs of graphs such that their edit distances are no larger than a threshold. Inspired by the q-gram idea for string similarity problem, our solution extracts paths from graphs as features for indexing. We establish a lower bound of common features to generate candidates. An efficient algorithm is proposed to exploit both matching and mismatching features to improve the filtering and verification on candidates. We demonstrate the proposed algorithm significantly outperforms existing approaches with extensive experiments on publicly available datasets.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132800625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}