Latest publications: 2016 IEEE 32nd International Conference on Data Engineering (ICDE)

Data profiling
Pub Date: 2017-05-09 DOI: 10.1145/3035918.3054772
Ziawasch Abedjan, Lukasz Golab, Felix Naumann
One of the crucial requirements before consuming datasets for any application is to understand the dataset at hand and its metadata. The process of metadata discovery is known as data profiling. Profiling activities range from ad-hoc approaches, such as eyeballing random subsets of the data or formulating aggregation queries, to systematic inference of structural information and statistics of a dataset using dedicated profiling tools. In this tutorial, we highlight the importance of data profiling as part of any data-related use case, and discuss the area of data profiling by classifying data profiling tasks and reviewing the state-of-the-art data profiling systems and techniques. In particular, we discuss hard problems in data profiling, such as algorithms for dependency discovery and profiling algorithms for dynamic data and streams. We conclude with directions for future research in the area of data profiling. This tutorial is based on our survey on profiling relational data [1].
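To make the single-column side of this concrete, below is a minimal Python sketch of the kind of statistics a profiler gathers per column (row and null counts, distinct values, a uniqueness hint for key discovery). The table and column names are invented for illustration; this is not code from the tutorial or survey.

```python
from collections import Counter

def profile_column(values):
    """Basic single-column profile: counts, nulls, cardinality, key hint."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(counts),
        "is_unique": len(counts) == len(non_null) and bool(counts),  # key candidate hint
        "top_value": counts.most_common(1)[0] if counts else None,
    }

# Hypothetical table: the profile flags 'id' as a unique-column key candidate.
table = {
    "id":   [1, 2, 3, 4],
    "city": ["Berlin", "Berlin", "Waterloo", None],
}
for name, column in table.items():
    print(name, profile_column(column))
```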
{"title":"Data profiling","authors":"Ziawasch Abedjan, Lukasz Golab, Felix Naumann","doi":"10.1145/3035918.3054772","DOIUrl":"https://doi.org/10.1145/3035918.3054772","url":null,"abstract":"One of the crucial requirements before consuming datasets for any application is to understand the dataset at hand and its metadata. The process of metadata discovery is known as data profiling. Profiling activities range from ad-hoc approaches, such as eye-balling random subsets of the data or formulating aggregation queries, to systematic inference of structural information and statistics of a dataset using dedicated profiling tools. In this tutorial, we highlight the importance of data profiling as part of any data-related use-case, and discuss the area of data profiling by classifying data profiling tasks and reviewing the state-of-the-art data profiling systems and techniques. In particular, we discuss hard problems in data profiling, such as algorithms for dependency discovery and profiling algorithms for dynamic data and streams. We conclude with directions for future research in the area of data profiling. This tutorial is based on our survey on profiling relational data [1].","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"67 1","pages":"1432-1435"},"PeriodicalIF":0.0,"publicationDate":"2017-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83966265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 104
TemProRA: Top-k temporal-probabilistic results analysis
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498350
K. Papaioannou, Michael H. Böhlen
The study of time and probability, as two combined dimensions in database systems, has focused on the correct and efficient computation of probabilities and time intervals. However, there is a lack of analytical information that allows users to understand and tune the probability of time-varying result tuples. In this demonstration, we present TemProRA, a system that focuses on the analysis of the top-k temporal-probabilistic results of a query. We propose the Temporal Probabilistic Lineage Tree (TPLT), the Temporal Probabilistic Bubble Chart (TPBC) and the Temporal Probabilistic Column Chart (TPCC): for each output tuple, these three tools provide the user with the most important information needed to systematically modify the time-varying probability of result tuples. The effectiveness and usefulness of TemProRA are demonstrated through queries performed on a dataset built from data of Migros, the leading Swiss supermarket chain.
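As a toy illustration of ranking time-varying probabilistic results, the following sketch scores each result tuple by probability times interval length and keeps the top-k. The duration-weighted score and the data are assumptions made for illustration, not necessarily the ranking TemProRA itself applies.

```python
import heapq

# Result tuples as (id, start, end, probability); data is invented.
results = [
    ("r1", 1, 10, 0.4),
    ("r2", 3, 5, 0.9),
    ("r3", 2, 9, 0.6),
]

def score(tup):
    _, start, end, prob = tup
    return prob * (end - start)   # duration-weighted probability (assumed)

print(heapq.nlargest(2, results, key=score))   # top-2 most "durable" results
```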
{"title":"TemProRA: Top-k temporal-probabilistic results analysis","authors":"K. Papaioannou, Michael H. Böhlen","doi":"10.1109/ICDE.2016.7498350","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498350","url":null,"abstract":"The study of time and probability, as two combined dimensions in database systems, has focused on the correct and efficient computation of the probabilities and time intervals. However, there is a lack of analytical information that allows users to understand and tune the probability of time-varying result tuples. In this demonstration, we present TemProRA, a system that focuses on the analysis of the top-k temporal probabilistic results of a query. We propose the Temporal Probabilistic Lineage Tree (TPLT), the Temporal Probabilistic Bubble Chart (TPBC) and the Temporal Probabilistic Column Chart (TPCC): for each output tuple these three tools are created to provide the user with the most important information to systematically modify the time-varying probability of result tuples. The effectiveness and usefulness of TemProRA are demonstrated through queries performed on a dataset created based on data from Migros, the leading Swiss supermarket branch.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"23 1","pages":"1382-1385"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72767220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
Durable graph pattern queries on historical graphs
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498269
Konstantinos Semertzidis, E. Pitoura
In this paper, we focus on labeled graphs that evolve over time. Given a sequence of graph snapshots representing the state of the graph at different time instants, we seek to find the most durable matches of an input graph pattern query, that is, the matches that exist for the longest period of time. The straightforward way to address this problem is by running a state-of-the-art graph pattern algorithm at each snapshot and aggregating the results. However, for large networks this approach is computationally expensive, since all matches have to be generated at each snapshot, including those appearing only once. We propose a new approach that uses a compact representation of the sequence of graph snapshots, appropriate time indexes to prune the search space and a threshold on the duration of the pattern to determine the search order. We also present experimental results using real datasets that illustrate the efficiency and effectiveness of our approach.
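The compact-representation idea can be sketched in a few lines: store, for every edge, the set of time instants at which it exists, so that the durability of a candidate match is the intersection of its edges' time sets. The graph below is invented, and a match is simplified to a fixed edge set rather than the output of subgraph matching.

```python
from functools import reduce

# For every edge, the snapshots (time instants) in which it exists:
# a toy stand-in for the paper's compact snapshot representation.
edge_times = {
    ("a", "b"): {0, 1, 2},
    ("b", "c"): {0, 1},
    ("c", "a"): {0, 2},
}

def durability(match_edges):
    """Instants at which all edges of a candidate match coexist."""
    return reduce(set.intersection, (edge_times[e] for e in match_edges))

print(durability([("a", "b"), ("b", "c")]))   # {0, 1}: durable for 2 instants
print(durability([("a", "b"), ("c", "a")]))   # {0, 2}: non-contiguous lifetime
```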
{"title":"Durable graph pattern queries on historical graphs","authors":"Konstantinos Semertzidis, E. Pitoura","doi":"10.1109/ICDE.2016.7498269","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498269","url":null,"abstract":"In this paper, we focus on labeled graphs that evolve over time. Given a sequence of graph snapshots representing the state of the graph at different time instants, we seek to find the most durable matches of an input graph pattern query, that is, the matches that exist for the longest period of time. The straightforward way to address this problem is by running a state-of-the-art graph pattern algorithm at each snapshot and aggregating the results. However, for large networks this approach is computationally expensive, since all matches have to be generated at each snapshot, including those appearing only once. We propose a new approach that uses a compact representation of the sequence of graph snapshots, appropriate time indexes to prune the search space and a threshold on the duration of the pattern to determine the search order. We also present experimental results using real datasets that illustrate the efficiency and effectiveness of our approach.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"13 1","pages":"541-552"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73224554","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 43
QB2OLAP: Enabling OLAP on Statistical Linked Open Data
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498341
Jovan Varga, Lorena Etcheverry, A. Vaisman, Oscar Romero, T. Pedersen, Christian Thomsen
Publication and sharing of multidimensional (MD) data on the Semantic Web (SW) opens new opportunities for the use of On-Line Analytical Processing (OLAP). The RDF Data Cube (QB) vocabulary, the current standard for statistical data publishing, however, lacks key MD concepts such as dimension hierarchies and aggregate functions. QB4OLAP was proposed to remedy this. However, QB4OLAP requires extensive manual annotation and users must still write queries in SPARQL, the standard query language for RDF, which typical OLAP users are not familiar with. In this demo, we present QB2OLAP, a tool for enabling OLAP on existing QB data. Without requiring any RDF, QB(4OLAP), or SPARQL skills, it allows semi-automatic transformation of a QB data set into a QB4OLAP one via enrichment with QB4OLAP semantics, exploration of the enriched schema, and querying with the high-level OLAP language QL that exploits the QB4OLAP semantics and is automatically translated to SPARQL.
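To give a feel for the translation target, the sketch below (using the rdflib Python library) runs a SPARQL roll-up over a toy QB data set. The ex: dimension and measure properties are invented, and the query merely suggests the kind of SPARQL a high-level OLAP query would compile to; QL's actual syntax and QB2OLAP's translation rules are not shown here.

```python
from rdflib import Graph

# A toy QB data set: three observations over a 'region' dimension with a
# 'population' measure. The ex: properties are invented for illustration.
ttl = """
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix ex: <http://example.org/> .
ex:o1 a qb:Observation ; ex:region "north" ; ex:population 10 .
ex:o2 a qb:Observation ; ex:region "north" ; ex:population 20 .
ex:o3 a qb:Observation ; ex:region "south" ; ex:population 5 .
"""
g = Graph().parse(data=ttl, format="turtle")

# Roll-up by region: the flavour of SPARQL that a high-level OLAP query
# over QB4OLAP-enriched data would be compiled down to.
q = """
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX ex: <http://example.org/>
SELECT ?region (SUM(?pop) AS ?total) WHERE {
  ?o a qb:Observation ; ex:region ?region ; ex:population ?pop .
} GROUP BY ?region
"""
for row in g.query(q):
    print(row.region, row.total)
```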
{"title":"QB2OLAP: Enabling OLAP on Statistical Linked Open Data","authors":"Jovan Varga, Lorena Etcheverry, A. Vaisman, Oscar Romero, T. Pedersen, Christian Thomsen","doi":"10.1109/ICDE.2016.7498341","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498341","url":null,"abstract":"Publication and sharing of multidimensional (MD) data on the Semantic Web (SW) opens new opportunities for the use of On-Line Analytical Processing (OLAP). The RDF Data Cube (QB) vocabulary, the current standard for statistical data publishing, however, lacks key MD concepts such as dimension hierarchies and aggregate functions. QB4OLAP was proposed to remedy this. However, QB4OLAP requires extensive manual annotation and users must still write queries in SPARQL, the standard query language for RDF, which typical OLAP users are not familiar with. In this demo, we present QB2OLAP, a tool for enabling OLAP on existing QB data. Without requiring any RDF, QB(4OLAP), or SPARQL skills, it allows semi-automatic transformation of a QB data set into a QB4OLAP one via enrichment with QB4OLAP semantics, exploration of the enriched schema, and querying with the high-level OLAP language QL that exploits the QB4OLAP semantics and is automatically translated to SPARQL.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"2 1","pages":"1346-1349"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74353249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21
Crowdsourced POI labelling: Location-aware result inference and Task Assignment
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498229
Huiqi Hu, Yudian Zheng, Z. Bao, Guoliang Li, Jianhua Feng, Reynold Cheng
Identifying the labels of points of interest (POIs), aka POI labelling, provides significant benefits in location-based services. However, the quality of raw labels manually added by users or generated by artificial algorithms cannot be guaranteed. Such low-quality labels decrease usability and result in bad user experiences. In this paper, observing that crowdsourcing is a best fit for computer-hard tasks, we leverage crowdsourcing to improve the quality of POI labelling. To the best of our knowledge, this is the first work on crowdsourced POI labelling tasks. In particular, there are two sub-problems: (1) how to infer the correct labels for each POI based on workers' answers, and (2) how to effectively assign proper tasks to workers in order to make more accurate inferences for the next available workers. To address these two problems, we propose a framework consisting of an inference model and an online task assigner. The inference model measures the quality of a worker on a POI by elaborately exploiting (i) the worker's inherent quality, (ii) the spatial distance between the worker and the POI, and (iii) the POI influence, which provides reliable inference results once a worker submits an answer. As workers arrive dynamically, the online task assigner judiciously assigns proper tasks to them so as to benefit the inference. The inference model and task assigner work alternately to continuously improve the overall quality. We conduct extensive experiments on a real crowdsourcing platform, and the results on two real datasets show that our method significantly outperforms state-of-the-art approaches.
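A stripped-down version of the location-aware inference step might look as follows: each worker's vote on a label is weighted by the worker's inherent quality scaled by an exponential decay in distance to the POI. The decay function and all numbers are assumptions for illustration, and the paper's model additionally incorporates POI influence.

```python
import math
from collections import defaultdict

# Each answer: (worker_quality in [0,1], distance_km to the POI, label).
# Invented data; the exponential decay is an assumed distance model.
answers = [
    (0.9, 0.5, "cafe"),
    (0.6, 4.0, "bar"),
    (0.7, 1.0, "cafe"),
]

def weight(quality, distance_km, decay=1.0):
    return quality * math.exp(-decay * distance_km)

votes = defaultdict(float)
for quality, dist, label in answers:
    votes[label] += weight(quality, dist)

print(max(votes, key=votes.get), dict(votes))   # inferred label and vote mass
```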
{"title":"Crowdsourced POI labelling: Location-aware result inference and Task Assignment","authors":"Huiqi Hu, Yudian Zheng, Z. Bao, Guoliang Li, Jianhua Feng, Reynold Cheng","doi":"10.1109/ICDE.2016.7498229","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498229","url":null,"abstract":"Identifying the labels of points of interest (POIs), aka POI labelling, provides significant benefits in location-based services. However, the quality of raw labels manually added by users or generated by artificial algorithms cannot be guaranteed. Such low-quality labels decrease the usability and result in bad user experiences. In this paper, by observing that crowdsourcing is a best-fit for computer-hard tasks, we leverage crowdsourcing to improve the quality of POI labelling. To our best knowledge, this is the first work on crowdsourced POI labelling tasks. In particular, there are two sub-problems: (1) how to infer the correct labels for each POI based on workers' answers, and (2) how to effectively assign proper tasks to workers in order to make more accurate inference for next available workers. To address these two problems, we propose a framework consisting of an inference model and an online task assigner. The inference model measures the quality of a worker on a POI by elaborately exploiting (i) worker's inherent quality, (ii) the spatial distance between the worker and the POI, and (iii) the POI influence, which can provide reliable inference results once a worker submits an answer. As workers are dynamically coming, the online task assigner judiciously assigns proper tasks to them so as to benefit the inference. The inference model and task assigner work alternately to continuously improve the overall quality. We conduct extensive experiments on a real crowdsourcing platform, and the results on two real datasets show that our method significantly outperforms state-of-the-art approaches.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"14 1","pages":"61-72"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85945105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 90
Blocking for large-scale Entity Resolution: Challenges, algorithms, and practical examples
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498364
G. Papadakis, Themis Palpanas
Entity Resolution constitutes one of the cornerstone tasks for the integration of overlapping information sources. Due to its quadratic complexity, a large amount of research has focused on improving its efficiency so that it scales to Web Data collections, which are inherently voluminous and highly heterogeneous. The most common approach for this purpose is blocking, which clusters similar entities into blocks so that the pair-wise comparisons are restricted to the entities contained within each block. In this tutorial, we take a close look at blocking-based Entity Resolution, starting from the early blocking methods that were crafted for database integration. We highlight the challenges posed by contemporary heterogeneous, noisy, voluminous Web Data and explain why they render these schema-based techniques inapplicable. We continue with the presentation of blocking methods that have been developed for large-scale and heterogeneous information and are suitable for Web Data collections. We also explain how their efficiency can be further improved by meta-blocking and parallelization techniques. We conclude with a hands-on session that demonstrates the relative performance of several state-of-the-art techniques. The participants of the tutorial will put into practice all the topics discussed in the theory part, and will get familiar with a reference toolbox, which includes the most prominent techniques in the area and can be readily used to tackle Entity Resolution problems.
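For concreteness, here is a sketch of token blocking, one of the simplest schema-agnostic methods in this family: every token appearing in a record becomes a block key, and candidate pairs are generated only within blocks. The records are invented toy data.

```python
from collections import defaultdict
from itertools import combinations

# Toy records (invented). Token blocking ignores the schema entirely.
records = {
    1: "John Smith Berlin",
    2: "J. Smith Berlin",
    3: "Mary Jones Paris",
}

# Build one block per token.
blocks = defaultdict(set)
for rid, text in records.items():
    for token in text.lower().replace(".", "").split():
        blocks[token].add(rid)

# Compare only entities that share at least one block.
candidate_pairs = set()
for ids in blocks.values():
    candidate_pairs.update(combinations(sorted(ids), 2))

print(candidate_pairs)   # {(1, 2)}: records 3 shares no block with the others
```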
{"title":"Blocking for large-scale Entity Resolution: Challenges, algorithms, and practical examples","authors":"G. Papadakis, Themis Palpanas","doi":"10.1109/ICDE.2016.7498364","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498364","url":null,"abstract":"Entity Resolution constitutes one of the cornerstone tasks for the integration of overlapping information sources. Due to its quadratic complexity, a large amount of research has focused on improving its efficiency so that it scales to Web Data collections, which are inherently voluminous and highly heterogeneous. The most common approach for this purpose is blocking, which clusters similar entities into blocks so that the pair-wise comparisons are restricted to the entities contained within each block. In this tutorial, we take a close look on blocking-based Entity Resolution, starting from the early blocking methods that were crafted for database integration. We highlight the challenges posed by contemporary heterogeneous, noisy, voluminous Web Data and explain why they render inapplicable these schema-based techniques. We continue with the presentation of blocking methods that have been developed for large-scale and heterogeneous information and are suitable for Web Data collections. We also explain how their efficiency can be further improved by meta-blocking and parallelization techniques. We conclude with a hands-on session that demonstrates the relative performance of several, state-of-the-art techniques. The participants of the tutorial will put in practice all the topics discussed in the theory part, and will get familiar with a reference toolbox, which includes the most prominent techniques in the area and can be readily used to tackle Entity Resolution problems.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"19 1","pages":"1436-1439"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84183896","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
QPlain: Query by explanation
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498344
Daniel Deutch, Amir Gilad
To assist non-specialists in formulating database queries, multiple frameworks that automatically infer queries from a set of input and output examples have been proposed. While highly useful, a shortcoming of the approach is that if users can only provide a small set of examples, many inherently different queries may qualify. We observe that additional information about the examples, in the form of their explanations, is useful in significantly focusing the set of qualifying queries. We propose to demonstrate QPlain, a system that learns conjunctive queries from examples and their explanations. We capture explanations of different levels of granularity and detail by leveraging recently developed models for data provenance. Explanations are fed through an intuitive interface, compiled to the appropriate provenance model, and then used to derive proposed queries. We will demonstrate that it is feasible for non-specialists to provide examples with meaningful explanations, and that the presence of such explanations results in a much more focused set of queries that better match user intentions.
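The ambiguity described above is easy to reproduce in miniature: in the sketch below, several hypothetical candidate "queries" (plain predicates over a toy table) all reproduce the same output example, and only an explanation of why a tuple belongs to the answer narrows them down. All names and predicates are illustrative; QPlain itself learns conjunctive queries using provenance models rather than string matching.

```python
# Toy table and a single input/output example (all invented).
table = [
    {"name": "alice", "dept": "db", "salary": 90},
    {"name": "bob",   "dept": "ai", "salary": 90},
    {"name": "carol", "dept": "db", "salary": 60},
]
example_output = [{"name": "alice"}]

candidates = {
    "dept = db AND salary > 80": lambda r: r["dept"] == "db" and r["salary"] > 80,
    "salary > 80 AND name < b":  lambda r: r["salary"] > 80 and r["name"] < "b",
    "dept = db AND salary > 70": lambda r: r["dept"] == "db" and r["salary"] > 70,
}

def run(pred):
    return [{"name": r["name"]} for r in table if pred(r)]

qualifying = [q for q, p in candidates.items() if run(p) == example_output]
print(qualifying)   # all three candidates qualify from the example alone

# An explanation like "alice is in the answer *because* dept = db" rules out
# the purely salary/name-based candidate.
explained = [q for q in qualifying if "dept = db" in q]
print(explained)
```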
{"title":"QPlain: Query by explanation","authors":"Daniel Deutch, Amir Gilad","doi":"10.1109/ICDE.2016.7498344","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498344","url":null,"abstract":"To assist non-specialists in formulating database queries, multiple frameworks that automatically infer queries from a set of input and output examples have been proposed. While highly useful, a shortcoming of the approach is that if users can only provide a small set of examples, many inherently different queries may qualify. We observe that additional information about the examples, in the form of their explanations, is useful in significantly focusing the set of qualifying queries. We propose to demonstrate QPlain, a system that learns conjunctive queries from examples and their explanations. We capture explanations of different levels of granularity and detail, by leveraging recently developed models for data provenance. Explanations are fed through an intuitive interface, are compiled to the appropriate provenance model, and are then used to derive proposed queries. We will demonstrate that it is feasible for non-specialists to provide examples with meaningful explanations, and that the presence of such explanations result in a much more focused set of queries which better match user intentions.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"33 1","pages":"1358-1361"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87921528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 22
Reputation aggregation in peer-to-peer network using differential gossip algorithm
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498426
Ruchir Gupta, Y. N. Singh
In a peer-to-peer system, a node should estimate the reputation of other peers not only on the basis of its own interactions, but also on the basis of the experience of other nodes. A reputation aggregation mechanism implements the strategy for achieving this. Reputation aggregation in peer-to-peer networks is generally a very time- and resource-consuming process. This paper proposes a reputation aggregation algorithm that uses a variant of the gossip algorithm called differential gossip. Here, the reputation estimate is considered to have two parts: a common component that is the same at every node, and the information received from immediate neighbours based on their direct interactions with the node. Theoretical analysis and numerical results show that differential gossip is fast and requires fewer resources. The reputation computed using the proposed algorithm also shows good immunity to collusion.
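For intuition, the sketch below runs plain randomized gossip averaging of reputation estimates on a toy network, where estimates converge toward the network-wide mean. It models only the averaging backbone; the paper's differential gossip additionally separates a common component from neighbour-specific information, which is not captured here.

```python
import random

# Toy overlay network and per-node direct-interaction scores for one target
# peer (all values invented).
neighbours = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
estimate = {0: 0.9, 1: 0.5, 2: 0.7, 3: 0.1}

random.seed(42)
for _ in range(200):
    i = random.choice(list(neighbours))
    j = random.choice(neighbours[i])
    # Pairwise averaging preserves the sum, so all nodes drift to the mean.
    estimate[i] = estimate[j] = (estimate[i] + estimate[j]) / 2

print({n: round(v, 3) for n, v in estimate.items()})   # all close to 0.55
```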
{"title":"Reputation aggregation in peer-to-peer network using differential gossip algorithm","authors":"Ruchir Gupta, Y. N. Singh","doi":"10.1109/ICDE.2016.7498426","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498426","url":null,"abstract":"In a peer-to-peer system, a node should estimate reputation of other peers not only on the basis of its own interaction, but also on the basis of experience of other nodes. Reputation aggregation mechanism implements strategy for achieving this. Reputation aggregation in peer to peer networks is generally a very time and resource consuming process. This paper proposes a reputation aggregation algorithm that uses a variant of gossip algorithm called differential gossip. In this paper, estimate of reputation is considered to be having two parts, one common component which is same with every node, and the other one is the information received from immediate neighbours based on the neighbours' direct interaction with the node. Theoretical analysis and numerical results show that differential gossip is fast and requires lesser amount of resources. The reputation computed using the proposed algorithm also shows a good amount of immunity to the collusion.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"51 1","pages":"1562-1563"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87550213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Fast motif discovery in short sequences
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498321
Honglei Liu, Fangqiu Han, Hongjun Zhou, Xifeng Yan, K. Kosik
Motif discovery in sequence data is fundamental to many biological problems such as antibody biomarker identification. Recent advances in instrumental techniques make it possible to generate thousands of protein sequences at once, which raises a big-data issue for existing motif finding algorithms: they either work only at a small scale of several hundred sequences or have to trade accuracy for efficiency. In this work, we demonstrate that by intelligently clustering sequences, it is possible to significantly improve the scalability of all the existing motif finding algorithms without any loss of accuracy. An anchor based sequence clustering algorithm (ASC) is thus proposed to divide a sequence dataset into multiple smaller clusters so that sequences sharing the same motif are located in the same cluster. An existing motif finding algorithm can then be applied to each individual cluster to generate motifs. In the end, the results from multiple clusters are merged together as the final output. Experimental results show that our approach is generic and orders of magnitude faster than traditional motif finding algorithms. It can discover motifs from protein sequences at a scale that no existing algorithm can handle. In particular, ASC reduces the running time of a very popular motif finding algorithm, MEME, from weeks to a few minutes with even better accuracy.
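The cluster-then-search idea can be sketched as follows: group sequences by a shared k-mer anchor and run a motif finder only within each (much smaller) cluster. Choosing each sequence's lexicographically smallest k-mer as its anchor is a crude stand-in for ASC's actual anchor selection.

```python
from collections import defaultdict

def kmers(seq, k=3):
    """All length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

# Invented toy sequences.
sequences = ["ACGTAC", "TACGTT", "GGGCCA", "CCAGGG", "ACGTGG"]

clusters = defaultdict(list)
for s in sequences:
    anchor = min(kmers(s))          # stand-in anchor choice, not ASC's rule
    clusters[anchor].append(s)

for anchor, members in clusters.items():
    # A MEME-style motif search would now run per cluster instead of globally.
    print(anchor, members)
```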
{"title":"Fast motif discovery in short sequences","authors":"Honglei Liu, Fangqiu Han, Hongjun Zhou, Xifeng Yan, K. Kosik","doi":"10.1109/ICDE.2016.7498321","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498321","url":null,"abstract":"Motif discovery in sequence data is fundamental to many biological problems such as antibody biomarker identification. Recent advances in instrumental techniques make it possible to generate thousands of protein sequences at once, which raises a big data issue for the existing motif finding algorithms: They either work only in a small scale of several hundred sequences or have to trade accuracy for efficiency. In this work, we demonstrate that by intelligently clustering sequences, it is possible to significantly improve the scalability of all the existing motif finding algorithms without losing accuracy at all. An anchor based sequence clustering algorithm (ASC) is thus proposed to divide a sequence dataset into multiple smaller clusters so that sequences sharing the same motif will be located into the same cluster. Then an existing motif finding algorithm can be applied to each individual cluster to generate motifs. In the end, the results from multiple clusters are merged together as final output. Experimental results show that our approach is generic and orders of magnitude faster than traditional motif finding algorithms. It can discover motifs from protein sequences in the scale that no existing algorithm can handle. In particular, ASC reduces the running time of a very popular motif finding algorithm, MEME, from weeks to a few minutes with even better accuracy.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"10 1","pages":"1158-1169"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86624224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 16
Influence based cost optimization on user preference
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498283
Jianye Yang, Ying Zhang, W. Zhang, Xuemin Lin
The popularity of e-business and preference learning techniques has contributed a huge amount of product and user preference data. Analyzing the influence of an existing or new product among the users is critical to unlocking the great scientific and social-economic value of these data. In this paper, we study the problem of influence-based cost optimization for user preference and product data, which is fundamental in many real applications such as marketing and advertising. Generally, we aim to find a cost-optimal position for a new product such that it can attract at least k users, or a particular percentage of users, given the user preference functions and competitors' products. Although we show that the solution space of the problem can be reduced to a finite number of possible positions (points) by utilizing classical k-level computation techniques, the computation is still very expensive due to the high combinatorial complexity of the k-level problem. To alleviate this issue, we develop efficient pruning and query processing techniques to significantly improve the performance. In particular, our traverse-based 2-dimensional algorithm is very efficient, with time complexity O(n) where n is the number of user preference functions. For general multi-dimensional spaces, we develop a space-partition-based algorithm that significantly improves performance by utilizing cost-based, influence-based and local-dominance-based pruning techniques. We then show that the performance of the partition-based algorithm can be further enhanced by a sampling approach, where the problem reduces to the classical half-space intersection problem. We demonstrate the efficiency of our techniques with extensive experiments over real and synthetic datasets.
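As a baseline for intuition, the brute-force sketch below evaluates a coarse grid of candidate positions, counts for each position how many users' linear preference functions score it strictly above every competitor, and returns the cheapest position attracting at least k users. The cost model, weights and grid are invented, and this enumeration is precisely what the paper's k-level and partitioning techniques avoid.

```python
import itertools

# Invented 2-dimensional instance: users score a product p by w[0]*p[0] + w[1]*p[1]
# and pick whichever product scores highest.
users = [(0.8, 0.2), (0.3, 0.7), (0.5, 0.5)]      # preference weights
competitors = [(0.9, 0.1), (0.2, 0.8)]            # existing products
k = 2
cost = lambda p: p[0] + p[1]                       # toy cost model

def attracted(p):
    """Number of users for whom p beats every competitor."""
    wins = 0
    for w in users:
        score = lambda q: w[0] * q[0] + w[1] * q[1]
        if score(p) > max(score(c) for c in competitors):
            wins += 1
    return wins

# Coarse grid over [0, 1]^2; the paper reduces this infinite search space to
# finitely many k-level vertices instead of discretizing.
grid = [(x / 10, y / 10) for x, y in itertools.product(range(11), repeat=2)]
feasible = [p for p in grid if attracted(p) >= k]
best = min(feasible, key=cost)
print(best, cost(best), attracted(best))
```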
{"title":"Influence based cost optimization on user preference","authors":"Jianye Yang, Ying Zhang, W. Zhang, Xuemin Lin","doi":"10.1109/ICDE.2016.7498283","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498283","url":null,"abstract":"The popularity of e-business and preference learning techniques have contributed a huge amount of product and user preference data. Analyzing the influence of an existing or new product among the users is critical to unlock the great scientific and social-economic value of these data. In this paper, we advocate the problem of influence-based cost optimization for the user preference and product data, which is fundamental in many real applications such as marketing and advertising. Generally, we aim to find a cost optimal position for a new product such that it can attract at least k or a particular percentage of users for the given user preference functions and competitors' products. Although we show the solution space of our problem can be reduced to a finite number of possible positions (points) by utilizing the classical k-level computation techniques, the computation cost is still very expensive due to the nature of the high combinatorial complexity of the k-level problem. To alleviate this issue, we develop efficient pruning and query processing techniques to significantly improve the performance. In particular, our traverse-based 2-dimensional algorithm is very efficient with time complexity O(n) where n is the number of user preference functions. For general multi-dimensional spaces, we develop space partition based algorithm to significantly improve the performance by utilizing cost-based, influence-based and local dominance based pruning techniques. Then, we show that the performance of the partition based algorithm can be further enhanced by utilizing sampling approach, where the problem can be reduced to the classical half-space intersection problem. We demonstrate the efficiency of our techniques with extensive experiments over real and synthetic datasets.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"15 1","pages":"709-720"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76677213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8