
Latest publications: 2011 IEEE 27th International Conference on Data Engineering

Towards exploratory hypothesis testing and analysis
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767907
Guimei Liu, Mengling Feng, Yue Wang, L. Wong, See-Kiong Ng, Tzia Liang Mah, E. Lee
Hypothesis testing is a well-established tool for scientific discovery. Conventional hypothesis testing is carried out in a hypothesis-driven manner. A scientist must first formulate a hypothesis based on his/her knowledge and experience, and then devise a variety of experiments to test it. Given the rapid growth of data, it has become virtually impossible for a person to manually inspect all the data to find all the interesting hypotheses for testing. In this paper, we propose and develop a data-driven system for automatic hypothesis testing and analysis. We define a hypothesis as a comparison between two or more sub-populations. We find sub-populations for comparison using frequent pattern mining techniques and then pair them up for statistical testing. We also generate additional information for further analysis of the hypotheses that are deemed significant. We conducted a set of experiments to show the efficiency of the proposed algorithms, and the usefulness of the generated hypotheses. The results show that our system can help users (1) identify significant hypotheses; (2) isolate the reasons behind significant hypotheses; and (3) find confounding factors that form Simpson's Paradoxes with discovered significant hypotheses.
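The core operation the system automates — pairing two sub-populations and testing whether they differ significantly — can be sketched with a plain two-proportion z-test. This is an illustrative stand-in, not the authors' implementation; the kidney-stone figures below are the classic textbook instance of Simpson's paradox, the kind of confounding the system is designed to surface.

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided two-proportion z-test: do sub-populations A and B
    differ significantly in their success rate?"""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # two-sided p-value from the normal CDF via the error function
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Classic kidney-stone data: treatment B wins overall (289/350 vs 273/350),
# yet treatment A wins within each stone-size stratum -- stone size is a
# confounding factor forming a Simpson's paradox.
z, p = two_proportion_z(273, 350, 289, 350)
assert 81 / 87 > 234 / 270 and 192 / 263 > 55 / 80   # A better per stratum
assert 273 / 350 < 289 / 350                          # B better overall
```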
Citations: 19
SkyEngine: Efficient Skyline search engine for Continuous Skyline computations
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767944
Yu-Ling Hsueh, Roger Zimmermann, Wei-Shinn Ku, Yifan Jin
Skyline query processing has become an important feature in multi-dimensional, data-intensive applications. Such computations are especially challenging under dynamic conditions, when either snapshot queries need to be answered with short user response times or when continuous skyline queries need to be maintained efficiently over a set of objects that are frequently updated. To achieve high performance, we have recently designed the ESC algorithm, an Efficient update approach for Skyline Computations. ESC creates a pre-computed candidate skyline set behind the first skyline (a “second line of defense,” so to speak) that facilitates an incremental, two-stage skyline update strategy which results in a quicker query response time for the user. Our demonstration presents the two-threaded SkyEngine system that builds upon and extends the base-features of the ESC algorithm with innovative, user-oriented functionalities that are termed SkyAlert and AutoAdjust. These functions enable a data or service provider to be informed about and gain the opportunity of automatically promoting its data records to remain part of the skyline, if so desired. The SkyEngine demonstration includes both a server and a web browser based client. Finally, the SkyEngine system also provides visualizations that reveal its internal performance statistics.
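For background, the static skyline that ESC maintains incrementally under updates consists of the points not dominated by any other point. A minimal block-nested-loop sketch of that baseline (not the ESC algorithm itself):

```python
def dominates(p, q):
    """p dominates q (minimization): no worse in every dimension,
    strictly better in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Naive block-nested-loop skyline: keep each point that is
    not dominated by any other point."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical hotel data as (price, distance-to-beach) pairs.
hotels = [(50, 8), (60, 2), (80, 1), (70, 3), (90, 5)]
print(skyline(hotels))  # -> [(50, 8), (60, 2), (80, 1)]
```

ESC's contribution is avoiding this quadratic recomputation: it keeps a pre-computed candidate set behind the first skyline so that deletions and updates can be absorbed incrementally.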
Citations: 8
Integrating code search into the development session
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767948
Mu-Woong Lee, Seung-won Hwang, Sunghun Kim
To support rapid and efficient software development, we propose to demonstrate our tool, which integrates code search into the software development process. For example, a developer, while writing a module, can find a code piece sharing the same syntactic structure from a large code corpus representing the wisdom of other developers in the same team (or in the universe of open-source code). While commercial code search engines exist for this code universe, they treat software as text (and are thus oblivious to syntactic structure) and fail at finding semantically related code. Meanwhile, existing tools that search for syntactic clones do not focus on efficiency, targeting instead the "post-mortem" usage scenario of detecting clones after code development is completed. In clear contrast, we focus on optimizing efficiency for syntactic code search and making this search interactive over a large-scale corpus, to complement the existing two lines of research. In our demonstration, we will show how such interactive search supports rapid software development, as similarly claimed lately in the SE and HCI communities [1], [2]. As an enabling technology, we design efficient index building and traversal techniques, optimized for the code corpus and code search workload. Our tool can identify relevant code in a corpus of 1.7 million code pieces with sub-second response time, without compromising the accuracy obtained by a state-of-the-art tool, as we report in our extensive evaluation results in [3].
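The idea of matching on syntactic structure rather than raw text can be illustrated with Python's `ast` module: fingerprint a fragment by the shape of its syntax tree, ignoring identifiers and literal values. A toy sketch of structure-based matching, not the demonstrated tool:

```python
import ast

def syntactic_fingerprint(source):
    """Fingerprint a code fragment's syntactic shape: the sequence of
    AST node types, with identifier names and literal values ignored."""
    tree = ast.parse(source)
    shape = tuple(type(node).__name__ for node in ast.walk(tree))
    return hash(shape)

a = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s"
b = "def acc(vals):\n    t = 0\n    for v in vals:\n        t += v\n    return t"
c = "def total(xs):\n    return sum(xs)"

assert syntactic_fingerprint(a) == syntactic_fingerprint(b)  # same structure, renamed
assert syntactic_fingerprint(a) != syntactic_fingerprint(c)  # different structure
```

An index over such fingerprints lets structurally identical fragments be found without scanning the corpus; the paper's index and traversal techniques are what make this interactive at the scale of millions of code pieces.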
Citations: 12
Processing private queries over untrusted data cloud through privacy homomorphism
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767862
Haibo Hu, Jianliang Xu, C. Ren, Byron Choi
Query processing that preserves both the data privacy of the owner and the query privacy of the client is a new research problem, and it is increasingly important as cloud computing drives more businesses to outsource their data and querying services. However, most existing studies, including those on data outsourcing, address data privacy and query privacy separately and cannot be applied to this problem. In this paper, we propose a holistic and efficient solution that comprises a secure traversal framework and an encryption scheme based on privacy homomorphism. The framework is scalable to large datasets by leveraging an index-based approach. Based on this framework, we devise secure protocols for processing typical queries such as k-nearest-neighbor (kNN) queries on an R-tree index. Moreover, several optimization techniques are presented to improve the efficiency of the query processing protocols. Our solution is verified by both theoretical analysis and a performance study.
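The additive privacy homomorphism that such schemes build on can be demonstrated with a toy Paillier cryptosystem: multiplying two ciphertexts yields an encryption of the sum, so an untrusted server can combine encrypted values without learning them. The tiny primes below make this trivially breakable; it illustrates only the algebra, not the paper's protocols. Requires Python 3.9+ (`math.lcm`, modular inverse via three-argument `pow`).

```python
import math, random

# Toy Paillier cryptosystem -- illustrative only, NOT secure.
p, q = 47, 59
n = p * q
n2 = n * n
g = n + 1
lam = math.lcm(p - 1, q - 1)

def L(x):
    return (x - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)   # modular inverse, Python 3.8+

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

# Additive homomorphism: the server multiplies ciphertexts,
# the client decrypts the sum -- the server never sees m1 or m2.
m1, m2 = 42, 17
c = (encrypt(m1) * encrypt(m2)) % n2
assert decrypt(c) == m1 + m2
```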
Citations: 279
Implementing sentinels in the TARGIT BI suite
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767828
Morten Middelfart, T. Pedersen
This paper describes the implementation of so-called sentinels in the TARGIT BI Suite. Sentinels are a novel type of rule that can warn a user if changes in one or more measures in a multi-dimensional data cube are expected to cause a change in another measure critical to the user. Sentinels notify users based on previous observations, e.g., that revenue might drop within two months if an increase in customer problems combined with a decrease in website traffic is observed. In this paper we show how users, without any prior technical knowledge, can mine and use sentinels in the TARGIT BI Suite. We present in detail how sentinels are mined from data and how they are scored. We describe in detail how the sentinel mining algorithm is implemented in the TARGIT BI Suite, and show that our implementation is able to discover strong and useful sentinels that could not be found using sequential pattern mining or correlation techniques. We demonstrate, through extensive experiments, that mining and usage of sentinels is feasible with good performance for typical users on a real, operational data warehouse.
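A sentinel of the form "customer problems up and website traffic down ⇒ revenue drops within two periods" can be checked against historical series with a simple support count. This is a hypothetical simplification for illustration; the paper's mining and scoring are considerably more involved.

```python
def sentinel_support(problems, traffic, revenue, horizon=2):
    """Count how often the premise 'problems up AND traffic down'
    is followed by 'revenue down' within `horizon` periods."""
    hits = trials = 0
    for t in range(1, len(revenue) - horizon):
        premise = problems[t] > problems[t - 1] and traffic[t] < traffic[t - 1]
        if premise:
            trials += 1
            if any(revenue[t + k] < revenue[t] for k in range(1, horizon + 1)):
                hits += 1
    return hits, trials

# Toy monthly measures from a hypothetical data cube.
problems = [3, 5, 4, 6, 6, 8]
traffic  = [9, 7, 8, 6, 6, 5]
revenue  = [100, 100, 100, 90, 95, 80]
hits, trials = sentinel_support(problems, traffic, revenue)
# premise held at t=1 and t=3; revenue fell within 2 periods both times -> (2, 2)
```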
Citations: 10
Relational databases, virtualization, and the cloud
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767966
M. Ahrens, G. Alonso
Existing relational databases are facing significant challenges as the hardware infrastructure and the underlying platform change from single CPUs to virtualized multicore machines arranged in large clusters. The problems are both technical and related to the licensing models currently in place. In this short abstract we briefly outline the challenges faced by organizations trying to virtualize and bring existing relational databases into the cloud.
Citations: 4
Locality Sensitive Outlier Detection: A ranking driven approach
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767852
Ye Wang, S. Parthasarathy, S. Tatikonda
Outlier detection is fundamental to a variety of database and analytic tasks. Recently, distance-based outlier detection has emerged as a viable and scalable alternative to traditional statistical and geometric approaches. In this article we explore the role of ranking for the efficient discovery of distance-based outliers from large high dimensional data sets. Specifically, we develop a light-weight ranking scheme that is powered by locality sensitive hashing, which reorders the database points according to their likelihood of being an outlier. We provide theoretical arguments to justify the rationale for the approach and subsequently conduct an extensive empirical study highlighting the effectiveness of our approach over extant solutions. We show that our ranking scheme improves the efficiency of the distance-based outlier discovery process by up to 5-fold. Furthermore, we find that using our approach the top outliers can often be isolated very quickly, typically by scanning less than 3% of the data set.
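The ranking idea — hash points with locality sensitive hashing and examine first those that land in sparsely populated buckets — can be sketched with random-hyperplane LSH. An illustrative sketch under assumed parameters, not the paper's scheme:

```python
import random

def lsh_outlier_ranking(points, num_tables=8, num_planes=4, seed=7):
    """Rank points by a cheap outlier-likelihood proxy: across several
    random-hyperplane hash tables, count how many other points share a
    point's bucket; sparsely co-bucketed points are ranked first."""
    rng = random.Random(seed)
    dim = len(points[0])
    counts = [0] * len(points)
    for _ in range(num_tables):
        planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]
        buckets = {}
        sigs = []
        for pt in points:
            # signature = which side of each random hyperplane the point falls on
            sig = tuple(sum(a * b for a, b in zip(pt, h)) >= 0 for h in planes)
            sigs.append(sig)
            buckets[sig] = buckets.get(sig, 0) + 1
        for i, sig in enumerate(sigs):
            counts[i] += buckets[sig] - 1  # co-bucketed neighbors
    # fewest co-bucketed neighbors first = most outlier-like
    return sorted(range(len(points)), key=lambda i: counts[i])

cluster = [(1 + 0.1 * i, 1 - 0.1 * i) for i in range(10)]
points = cluster + [(-25.0, 30.0)]      # one far-away point
ranking = lsh_outlier_ranking(points)   # the outlier (index 10) should rank first
```

A full distance-based detector would then verify candidates in ranked order, which is how the reordering lets the top outliers be isolated after scanning only a small fraction of the data.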
Citations: 65
Automatic generation of mediated schemas through reasoning over data dependencies
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767913
Xiang Li, C. Quix, D. Kensche, Sandra Geisler, Lisong Guo
Mediated schemas lie at the center of the well-recognized data integration architecture. Classical data integration systems rely on a mediated schema created by human experts through an intensive design process; automatic generation of mediated schemas remains an open goal. We generate mediated schemas by merging multiple source schemas interrelated by tuple-generating dependencies (tgds). Schema merging is the process of consolidating multiple schemas into a unified view, a task that becomes particularly challenging when the schemas are highly heterogeneous and autonomous. Existing approaches fall short in various respects: restricted expressiveness of input mappings, lack of a data-level interpretation, output mappings that are not in a logical language (or are not given at all), and confinement to binary merging. We present a novel system that performs native n-ary schema merging using P2P-style tgds as input. Suited to the scenario of generating mediated schemas for data integration, the system opts for a minimal schema signature retaining all certain answers of conjunctive queries. Logical output mappings are generated to support the mediated schemas, which enable query answering and, in some cases, query rewriting.
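The semantics of a tgd can be illustrated with a single chase step: when a source tuple fires a dependency whose conclusion contains an existential variable, a fresh labeled null is invented to witness it. A toy example with a hypothetical Person/Dept dependency, far simpler than the paper's n-ary merging:

```python
import itertools

def chase_person_dept(persons, depts):
    """One chase step for the tgd
         Person(name, dept) -> exists m. Dept(dept, m):
    for every Person tuple whose dept has no Dept tuple,
    add one with a fresh labeled null for the manager."""
    fresh = itertools.count()
    depts = list(depts)
    known = {d for d, _ in depts}
    for _, dept in persons:
        if dept not in known:
            depts.append((dept, f"N{next(fresh)}"))  # labeled null N0, N1, ...
            known.add(dept)
    return depts

persons = [("alice", "db"), ("bob", "ai")]
depts = [("db", "carol")]
print(chase_person_dept(persons, depts))
# -> [('db', 'carol'), ('ai', 'N0')]
```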
Citations: 7
Precisely Serializable Snapshot Isolation (PSSI)
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767853
Stephen Revilak, P. O'Neil, E. O'Neil
Many popular database management systems provide snapshot isolation (SI) for concurrency control, either in addition to or in place of full serializability based on locking. Snapshot isolation was introduced in 1995 [2], with noted anomalies that can lead to serializability violations. Full serializability was provided in 2008 [4] and improved in 2009 [5] by aborting transactions in dangerous structures, which had been shown in 2005 [9] to be precursors to potential SI anomalies. This approach resulted in a runtime environment guaranteeing a serializable form of snapshot isolation (which we call SSI [4] or ESSI [5]) for arbitrary applications. But transactions in a dangerous structure frequently do not cause true anomalies so, as the authors point out, their method is conservative: it can cause unnecessary aborts. In the current paper, we demonstrate our PSSI algorithm to detect cycles in a snapshot isolation dependency graph and abort transactions to break the cycle. This algorithm provides a much more precise criterion to perform aborts. We have implemented our algorithm in an open source production database system (MySQL/InnoDB), and our performance study shows that PSSI throughput improves on ESSI, with significantly fewer aborts.
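The distinguishing step — aborting a transaction only when the dependency graph actually contains a cycle, rather than on every dangerous structure — reduces to cycle detection. A minimal sketch with recursive three-color DFS (illustrative; PSSI maintains this test incrementally at runtime):

```python
def has_cycle(graph):
    """Detect a cycle in a transaction dependency graph given as
    {txn: set of txns it has an edge to}, via three-color DFS."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in graph}

    def dfs(v):
        color[v] = GRAY                      # on the current DFS path
        for w in graph.get(v, ()):
            if color.get(w, WHITE) == GRAY:
                return True                  # back edge: dependency cycle
            if color.get(w, WHITE) == WHITE and dfs(w):
                return True
        color[v] = BLACK                     # fully explored
        return False

    return any(color[v] == WHITE and dfs(v) for v in graph)

# T1 -> T2 -> T3 -> T1 forms a cycle: under PSSI-style checking,
# one transaction in the cycle is aborted to break it.
assert has_cycle({"T1": {"T2"}, "T2": {"T3"}, "T3": {"T1"}})
assert not has_cycle({"T1": {"T2"}, "T2": {"T3"}, "T3": set()})
```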
Citations: 55
XClean: Providing valid spelling suggestions for XML keyword queries
Pub Date: 2011-04-11 DOI: 10.1109/ICDE.2011.5767847
Yifei Lu, Wei Wang, Jianxin Li, Chengfei Liu
An important facility to aid keyword search on XML data is suggesting alternative queries when user queries contain typographical errors. Query suggestion can thus improve users' search experience by avoiding empty results or results of poor quality. In this paper, we study the problem of effectively and efficiently providing quality query suggestions for keyword queries on an XML document. We illustrate certain biases in previous work and propose a principled and general framework, XClean, based on the state-of-the-art language model. Compared with previous methods, XClean can accommodate different error models and XML keyword query semantics without losing rigor. Algorithms have been developed that compute the top-k suggestions efficiently. We performed an extensive experimental study using two large-scale real datasets. The results demonstrate the effectiveness and efficiency of the proposed methods.
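The abstract's pipeline — generate candidate corrections for a mistyped keyword, then rank them to produce top-k suggestions — can be sketched as below. Corpus frequency stands in for the paper's language-model score, and `edits1`/`top_k_suggestions` are illustrative names, not XClean's actual API:

```python
import heapq
from collections import Counter

def edits1(word):
    """All strings at edit distance 1: deletes, transposes, replaces, inserts."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def top_k_suggestions(query_term, term_freq, k=3):
    """Rank edit-distance-1 candidates by corpus frequency (a toy stand-in
    for XClean's language-model score); return the k highest-scoring terms."""
    candidates = [(w, term_freq[w]) for w in edits1(query_term) if w in term_freq]
    return [w for w, _ in heapq.nlargest(k, candidates, key=lambda p: p[1])]
```

For example, with a term-frequency table built from the document's keywords, the transposition "qeury" is mapped back to "query". A full system would score multi-keyword queries jointly and fold in the XML-specific query semantics the abstract mentions.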
Citations: 26
Journal: 2011 IEEE 27th International Conference on Data Engineering