Cost-Effective Conceptual Design for Information Extraction
Arash Termehchy, A. Vakilian, Yodsawalai Chodpathumwan, M. Winslett
It is well established that extracting occurrences of entities in a collection of unstructured text documents and annotating them with their concepts improves the effectiveness of answering queries over the collection. However, creating and maintaining large annotated collections is very resource intensive. Because an enterprise's resources are limited and its users may have urgent information needs, it may have to select only a subset of the relevant concepts for extraction and annotation. We call this subset a conceptual design for the annotated collection. In this article, we introduce and formally define the problem of cost-effective conceptual design: given a collection, a set of relevant concepts, and a fixed budget, find the conceptual design that most improves the effectiveness of answering queries over the collection. We provide efficient algorithms for special cases of the problem and prove that it is generally NP-hard in the number of relevant concepts. We propose three efficient approximation algorithms: a greedy algorithm, approximate popularity maximization (APM for short), and approximate annotation-benefit maximization (AAM for short). We show that, if there are no constraints regarding the overlap of concepts, APM is a fully polynomial-time approximation scheme. We also prove that if the relevant concepts are mutually exclusive, the greedy algorithm delivers a constant approximation ratio if the concepts are equally costly, APM has a constant approximation ratio, and AAM is a fully polynomial-time approximation scheme. Our empirical results, using a Wikipedia collection and a search engine query log, validate the proposed formalization of the problem and show that APM and AAM efficiently compute conceptual designs. They also indicate that, in general, APM delivers the optimal conceptual designs if the relevant concepts are not mutually exclusive, whereas if the relevant concepts are mutually exclusive, the conceptual designs delivered by AAM improve the effectiveness of answering queries over the collection more than the solutions provided by APM.
{"title":"Cost-Effective Conceptual Design for Information Extraction","authors":"Arash Termehchy, A. Vakilian, Yodsawalai Chodpathumwan, M. Winslett","doi":"10.1145/2716321","DOIUrl":"https://doi.org/10.1145/2716321","url":null,"abstract":"It is well established that extracting and annotating occurrences of entities in a collection of unstructured text documents with their concepts improves the effectiveness of answering queries over the collection. However, it is very resource intensive to create and maintain large annotated collections. Since the available resources of an enterprise are limited and/or its users may have urgent information needs, it may have to select only a subset of relevant concepts for extraction and annotation. We call this subset a conceptual design for the annotated collection. In this article, we introduce and formally define the problem of cost-effective conceptual design where, given a collection, a set of relevant concepts, and a fixed budget, one likes to find a conceptual design that most improves the effectiveness of answering queries over the collection. We provide efficient algorithms for special cases of the problem and prove it is generally NP-hard in the number of relevant concepts. We propose three efficient approximations to solve the problem: a greedy algorithm, an approximate popularity maximization (APM for short), and approximate annotation-benefit maximization (AAM for short). We show that, if there are no constraints regrading the overlap of concepts, APM is a fully polynomial time approximation scheme. We also prove that if the relevant concepts are mutually exclusive, the greedy algorithm delivers a constant approximation ratio if the concepts are equally costly, APM has a constant approximation ratio, and AAM is a fully polynomial-time approximation scheme. Our empirical results using a Wikipedia collection and a search engine query log validate the proposed formalization of the problem and show that APM and AAM efficiently compute conceptual designs. They also indicate that, in general, APM delivers the optimal conceptual designs if the relevant concepts are not mutually exclusive. Also, if the relevant concepts are mutually exclusive, the conceptual designs delivered by AAM improve the effectiveness of answering queries over the collection more than the solutions provided by APM.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"44 1","pages":"12:1-12:39"},"PeriodicalIF":1.8,"publicationDate":"2015-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85700177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Editorial: The Best of Two Worlds -- Present Your TODS Paper at SIGMOD
Christian S. Jensen
It just became even more attractive to publish your research results in ACM Transactions on Database Systems: The leadership of ACM SIGMOD and TODS have decided to offer the authors of certain TODS articles the opportunity to present their article at the "next" SIGMOD conference. This agreement aims to make it more attractive to members of the SIGMOD community to publish in TODS, as well as to further enrich the technical program at the SIGMOD conferences. Journal and conference publication differ in a number of respects. In the following, I review important differences, from the perspective of journal publication, and present a case for publication in TODS. When a submission is received for consideration of publication in TODS, it is assigned to an Associate Editor who then is in charge of handling the submission and, in a sense, serves as the submission's ombudsman: The handling Associate Editor aims to do what is right for the submission and will take into account the author's responses to reviews. While the aim is to provide review results within 4 months, the journal's review process can accommodate special circumstances as needed to get things right. For example, additional reviews can be obtained in a review round, and an additional round of reviewing can be introduced. The traditional conference review process has a fixed schedule of deadlines and does not offer this flexibility. Some conferences have tried to achieve some of this flexibility by allowing one round of revision. Some conferences have also introduced procedures that may be viewed as a means of approximating the Associate Editor role found at journals. They have introduced program committee vice-chairs and meta-reviewers, and they have introduced author feedback. In my experience, these innovations to the conference review process are valuable but do not combine to yield the benefits of the journal review process. Specifically, what I call "hit-and-run" reviews still occur at times. These are superficial reviews that simply reject a paper without offering specific reasons. Key reasons why such reviews occur are that they are fast to do and that reviewers know they can get away with them because there is no time for iteration. I also believe that the vice-chair and meta-reviewer roles are not always effective, mainly due to tight deadlines: they often have to make accept/reject recommendations based only on the information already available. Another difference between the …
{"title":"Editorial: The Best of Two Worlds -- Present Your TODS Paper at SIGMOD","authors":"Christian S. Jensen","doi":"10.1145/2770931","DOIUrl":"https://doi.org/10.1145/2770931","url":null,"abstract":"It just became even more attractive to publish your research results in ACM Transactions on Database Systems: The leadership of ACM SIGMOD and TODS have decided to offer the authors of certain TODS articles the opportunity to present their article at the \" next \" SIGMOD conference. This agreement aims to make it more attractive to members of the SIGMOD community to publish in TODS, as well as to further enrich the technical program at the SIGMOD conferences. Journal and conference publication differ in a number of respects. In the following, I review important differences, from the perspective of journal publication, and present a case for publication in TODS. When a submission is received for consideration of publication in TODS, the submission is assigned to an Associate Editor who then is in charge of handling the submission and, in a sense, serves as the submission's ombudsman: The handling Associate Editor aims to do what is right for the submission and will take into account the author's responses to reviews. While the aim is to provide review results within 4 months, the journal's review process can accommodate special circumstances as needed to get things right. For example, additional reviews can be obtained in a review round, and an additional round of reviewing can be introduced. The traditional conference review process has a fixed schedule of deadlines and does not offer this flexibility. Some conferences have tried to achieve some of the flexibility by allowing one round of revision. Some conferences have also introduced procedures that may be viewed as a means of approximating the Associate Editor role as found at journals. They have introduced program committee vice-chairs and meta-reviewers, and they have introduced author feedback. In my experience, these innovations to the conference review process are valuable but do not combine to yield the benefits of the journal review process. Specifically, what I call \" hit-and-run \" reviews still occur at times. These are superficial reviews that simply reject a paper without offering specific reasons. Key reasons why such reviews occur is that they are fast to do and that reviewers know that they can get away with them because there is no time for iteration. And I believe that the vice-chair and meta-reviewer roles are not always effective, mainly due to tight deadlines. They simply often have to make accept/reject recommendations with the information already available. Another difference between the …","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"218 1","pages":"7:1-7:2"},"PeriodicalIF":1.8,"publicationDate":"2015-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79759739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deciding Determinism with Fairness for Simple Transducer Networks
Tom J. Ameloot
A distributed database system often operates in an asynchronous communication model where messages can be arbitrarily delayed. This communication model causes nondeterministic effects like unpredictable arrival orders of messages. Nonetheless, in general we want the distributed system to be deterministic; the system should produce the same output despite the nondeterministic effects on messages. Previously, two interpretations of determinism have been proposed. The first says that all infinite fair computation traces produce the same output. The second interpretation is a confluence notion, saying that all finite computation traces can still be extended to produce the same output. A decidability result for the confluence notion was previously obtained for so-called simple transducer networks, a model from the field of declarative networking. In the current article, we also present a decidability result for simple transducer networks, but this time for the first interpretation of determinism, with infinite fair computation traces. We also compare the expressivity of simple transducer networks under both interpretations.
{"title":"Deciding Determinism with Fairness for Simple Transducer Networks","authors":"Tom J. Ameloot","doi":"10.1145/2757215","DOIUrl":"https://doi.org/10.1145/2757215","url":null,"abstract":"A distributed database system often operates in an asynchronous communication model where messages can be arbitrarily delayed. This communication model causes nondeterministic effects like unpredictable arrival orders of messages. Nonetheless, in general we want the distributed system to be deterministic; the system should produce the same output despite the nondeterministic effects on messages.\u0000 Previously, two interpretations of determinism have been proposed. The first says that all infinite fair computation traces produce the same output. The second interpretation is a confluence notion, saying that all finite computation traces can still be extended to produce the same output. A decidability result for the confluence notion was previously obtained for so-called simple transducer networks, a model from the field of declarative networking. In the current article, we also present a decidability result for simple transducer networks, but this time for the first interpretation of determinism, with infinite fair computation traces. We also compare the expressivity of simple transducer networks under both interpretations.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"2 1","pages":"9:1-9:39"},"PeriodicalIF":1.8,"publicationDate":"2015-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76957811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Editorial: Updates to the Editorial Board
Christian S. Jensen
It is of paramount importance for a scholarly journal such as ACM Transactions on Database Systems to have a strong editorial board of respected, world-class scholars. The editorial board plays a fundamental role in attracting the best submissions, in ensuring insightful and timely handling of submissions, in maintaining the high scientific standards of the journal, and in maintaining the reputation of the journal. Indeed, the journal's associate editors, along with the reviewers and authors they work with, are the primary reason that TODS is a world-class journal. As of January 1, 2017, three associate editors—Paolo Ciaccia, Divyakant Agrawal, and Sihem Amer-Yahia—ended their terms, each having served on the editorial board for roughly 6 years. That said, they will stay on until they complete their current loads. Paolo, Divy, and Sihem have provided very substantial, high-caliber service to the journal and the database community. Specifically, they have lent their extensive experience, deep insight, and sound technical judgment to the journal. I have never seen them compromise on quality when handling submissions. Surely, they have had many other demands on their time during these past 6 years, many of which are better paid. We are all fortunate that they have donated their time and unique expertise to the journal and our community for half a dozen years. They deserve our recognition for their commitment to the scientific enterprise. As of January 1, 2017, three new associate editors have joined the editorial board:
• Feifei Li, University of Utah (https://www.cs.utah.edu/~lifeifei/)
• Kian-Lee Tan, National University of Singapore (https://www.comp.nus.edu.sg/~tankl/)
• Jeffrey Xu Yu, Chinese University of Hong Kong (http://www.se.cuhk.edu.hk/people/yu.html)
{"title":"Editorial: Updates to the Editorial Board","authors":"Christian S. Jensen","doi":"10.1145/2747020","DOIUrl":"https://doi.org/10.1145/2747020","url":null,"abstract":"It is of paramount importance for a scholarly journal such as ACM Transactions on Database Systems to have a strong editorial board of respected, world-class scholars. The editorial board plays a fundamental role in attracting the best submissions, in ensuring insightful and timely handling of submissions, in maintaining the high scientific standards of the journal, and in maintaining the reputation of the journal. Indeed, the journal’s associate editors, along with the reviewers and authors they work with, are the primary reason that TODS is a world-class journal. As of January 1, 2017, three associate editors—Paolo Ciaccia, Divyakant Agrawal, and Sihem Amer-Yahia—ended their terms, each having served on the editorial board for roughly 6 years. In addition, they will stay on until they complete their current loads. Paolo, Divy, and Sihem have provided very substantial, high-caliber service to the journal and the database community. Specifically, they have lent their extensive experience, deep insight, and sound technical judgment to the journal. I have never seen them compromise on quality when handling submissions. Surely, they have had many other demands on their time, many of which are better paid, during these past 6 years. We are all fortunate that they have donated their time and unique expertise to the journal and our community during half a dozen years. They deserve our recognition for their commitment to the scientific enterprise. As of January 1, 2017, three new associate editors have joined the editorial board: • Feifei Li, University of Utah (https://www.cs.utah.edu/∼lifeifei/) • Kian-Lee Tan, National University of Singapore (https://www.comp.nus.edu.sg/ ∼tankl/) • Jeffrey Xu Yu, Chinese University of Hong Kong (http://www.se.cuhk.edu.hk/people/ yu.html)","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"1 1","pages":"1e:1"},"PeriodicalIF":1.8,"publicationDate":"2015-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74470954","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient Computation of the Tree Edit Distance
Mateusz Pawlik, Nikolaus Augsten
We consider the classical tree edit distance between ordered labelled trees, which is defined as the minimum-cost sequence of node edit operations that transforms one tree into another. The state-of-the-art solutions for the tree edit distance are not satisfactory. The main competitors in the field either have optimal worst-case complexity but hit the worst case frequently, or they are very efficient for some tree shapes but degenerate for others. This leads to unpredictable and often infeasible runtimes, and there is no obvious way to choose between the algorithms. In this article, we present RTED, a robust tree edit distance algorithm. The asymptotic complexity of our algorithm is smaller than or equal to the complexity of the best competitors for any input instance; that is, our algorithm is both efficient and worst-case optimal. This is achieved by computing a dynamic decomposition strategy that depends on the input trees. RTED is shown optimal among all algorithms that use LRH (left-right-heavy) strategies, which include RTED and the fastest tree edit distance algorithms presented in the literature. In experiments on synthetic and real-world data, we empirically evaluate our solution and compare it to the state of the art.
{"title":"Efficient Computation of the Tree Edit Distance","authors":"Mateusz Pawlik, Nikolaus Augsten","doi":"10.1145/2699485","DOIUrl":"https://doi.org/10.1145/2699485","url":null,"abstract":"We consider the classical tree edit distance between ordered labelled trees, which is defined as the minimum-cost sequence of node edit operations that transform one tree into another. The state-of-the-art solutions for the tree edit distance are not satisfactory. The main competitors in the field either have optimal worst-case complexity but the worst case happens frequently, or they are very efficient for some tree shapes but degenerate for others. This leads to unpredictable and often infeasible runtimes. There is no obvious way to choose between the algorithms.\u0000 In this article we present RTED, a robust tree edit distance algorithm. The asymptotic complexity of our algorithm is smaller than or equal to the complexity of the best competitors for any input instance, that is, our algorithm is both efficient and worst-case optimal. This is achieved by computing a dynamic decomposition strategy that depends on the input trees. RTED is shown optimal among all algorithms that use LRH (left-right-heavy) strategies, which include RTED and the fastest tree edit distance algorithms presented in literature. In our experiments on synthetic and real-world data we empirically evaluate our solution and compare it to the state-of-the-art.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"36 1","pages":"3:1-3:40"},"PeriodicalIF":1.8,"publicationDate":"2015-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72825803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Online Updates on Data Warehouses via Judicious Use of Solid-State Storage
Manos Athanassoulis, Shimin Chen, A. Ailamaki, Phillip B. Gibbons, R. Stoica
Data warehouses have traditionally been optimized for read-only query performance, allowing only offline updates at night and essentially trading data freshness for performance. The need for 24x7 operations in global markets and the rise of online and other quickly reacting businesses make concurrent online updates increasingly desirable. Unfortunately, state-of-the-art approaches fall short of supporting fast analysis queries over fresh data. The conventional approach of performing updates in place can dramatically slow down query performance, while prior proposals using differential updates either require large in-memory buffers or may incur significant update migration cost. This article presents a novel approach for supporting online updates in data warehouses that overcomes the limitations of prior approaches by making judicious use of available SSDs to cache incoming updates. We model the problem of query processing with differential updates as a type of outer join between the data residing on disks and the updates residing on SSDs. We present MaSM algorithms for performing such joins and periodic migrations with small memory footprints, low query overhead, few SSD writes, efficient in-place migration of updates, and correct ACID support. We present detailed modeling of the proposed approach and provide proofs regarding the fundamental properties of the MaSM algorithms. Our experiments show that MaSM incurs at most 7% overhead both on synthetic range scans (varying range size from 4KB to 100GB) and in a TPC-H query replay study, while increasing update throughput by orders of magnitude.
{"title":"Online Updates on Data Warehouses via Judicious Use of Solid-State Storage","authors":"Manos Athanassoulis, Shimin Chen, A. Ailamaki, Phillip B. Gibbons, R. Stoica","doi":"10.1145/2699484","DOIUrl":"https://doi.org/10.1145/2699484","url":null,"abstract":"Data warehouses have been traditionally optimized for read-only query performance, allowing only offline updates at night, essentially trading off data freshness for performance. The need for 24x7 operations in global markets and the rise of online and other quickly reacting businesses make concurrent online updates increasingly desirable. Unfortunately, state-of-the-art approaches fall short of supporting fast analysis queries over fresh data. The conventional approach of performing updates in place can dramatically slow down query performance, while prior proposals using differential updates either require large in-memory buffers or may incur significant update migration cost.\u0000 This article presents a novel approach for supporting online updates in data warehouses that overcomes the limitations of prior approaches by making judicious use of available SSDs to cache incoming updates. We model the problem of query processing with differential updates as a type of outer join between the data residing on disks and the updates residing on SSDs. We present MaSM algorithms for performing such joins and periodic migrations, with small memory footprints, low query overhead, low SSD writes, efficient in-place migration of updates, and correct ACID support. We present detailed modeling of the proposed approach, and provide proofs regarding the fundamental properties of the MaSM algorithms. Our experimentation shows that MaSM incurs only up to 7% overhead both on synthetic range scans (varying range size from 4KB to 100GB) and in a TPC-H query replay study, while also increasing the update throughput by orders of magnitude.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"195 1","pages":"6:1-6:42"},"PeriodicalIF":1.8,"publicationDate":"2015-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74486823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Size Bounds for Factorised Representations of Query Results
Dan Olteanu, Jakub Závodný
We study two succinct representation systems for relational data based on relational algebra expressions with unions, Cartesian products, and singleton relations: f-representations, which employ algebraic factorisation using the distributivity of product over union, and d-representations, which are f-representations made further succinct by explicit sharing of repeated subexpressions. In particular, we study such representations for results of conjunctive queries. We derive tight asymptotic bounds for representation sizes and present algorithms to compute representations within these bounds. We compare the succinctness of f-representations and d-representations for results of equi-join queries, and relate them to fractional edge covers and fractional hypertree decompositions of the query hypergraph. Recent work has shown that f-representations can significantly boost the performance of query evaluation in centralised and distributed settings and of machine learning tasks.
{"title":"Size Bounds for Factorised Representations of Query Results","authors":"Dan Olteanu, Jakub Závodný","doi":"10.1145/2656335","DOIUrl":"https://doi.org/10.1145/2656335","url":null,"abstract":"We study two succinct representation systems for relational data based on relational algebra expressions with unions, Cartesian products, and singleton relations: f-representations, which employ algebraic factorisation using distributivity of product over union, and d-representations, which are f-representations where further succinctness is brought by explicit sharing of repeated subexpressions.\u0000 In particular we study such representations for results of conjunctive queries. We derive tight asymptotic bounds for representation sizes and present algorithms to compute representations within these bounds. We compare the succinctness of f-representations and d-representations for results of equi-join queries, and relate them to fractional edge covers and fractional hypertree decompositions of the query hypergraph.\u0000 Recent work showed that f-representations can significantly boost the performance of query evaluation in centralised and distributed settings and of machine learning tasks.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"154 1","pages":"2:1-2:44"},"PeriodicalIF":1.8,"publicationDate":"2015-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77478652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Analysis of Schemas with Access Restrictions
Michael Benedikt, P. Bourhis, Clemens Ley
We study verification of systems whose transitions consist of accesses to a Web-based data source. An access is a lookup on a relation within a relational database, fixing values for a set of positions in the relation. For example, a transition can represent access to a Web form, where the user is restricted to filling in values for a particular set of fields. We look at verifying properties of a schema describing the possible accesses of such a system. We present a language where one can describe the properties of an access path and also specify additional restrictions on accesses that are enforced by the schema. Our main property language, AccessLTL, is based on a first-order extension of linear-time temporal logic, interpreting access paths as sequences of relational structures. We also present a lower-level automaton model, A-automata, into which AccessLTL specifications can compile. We show that AccessLTL and A-automata can express static analysis problems related to “querying with limited access patterns” that have been studied in the database literature in the past, such as whether an access is relevant to answering a query and whether two queries are equivalent in the accessible data they can return. We prove decidability and complexity results for several restrictions and variants of AccessLTL and explain which properties of paths can be expressed in each restriction.
{"title":"Analysis of Schemas with Access Restrictions","authors":"Michael Benedikt, P. Bourhis, Clemens Ley","doi":"10.1145/2699500","DOIUrl":"https://doi.org/10.1145/2699500","url":null,"abstract":"We study verification of systems whose transitions consist of accesses to a Web-based data source. An access is a lookup on a relation within a relational database, fixing values for a set of positions in the relation. For example, a transition can represent access to a Web form, where the user is restricted to filling in values for a particular set of fields. We look at verifying properties of a schema describing the possible accesses of such a system. We present a language where one can describe the properties of an access path and also specify additional restrictions on accesses that are enforced by the schema. Our main property language, AccessLTL, is based on a first-order extension of linear-time temporal logic, interpreting access paths as sequences of relational structures. We also present a lower-level automaton model, A-automata, into which AccessLTL specifications can compile. We show that AccessLTL and A-automata can express static analysis problems related to “querying with limited access patterns” that have been studied in the database literature in the past, such as whether an access is relevant to answering a query and whether two queries are equivalent in the accessible data they can return. We prove decidability and complexity results for several restrictions and variants of AccessLTL and explain which properties of paths can be expressed in each restriction.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"11 1","pages":"5:1-5:46"},"PeriodicalIF":1.8,"publicationDate":"2015-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90441246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Time- and Space-Efficient Sliding Window Top-k Query Processing
K. Pripužić, Ivana Podnar Žarko, K. Aberer
A sliding window top-k (top-k/w) query monitors incoming data stream objects within a sliding window of size w to identify the k highest-ranked objects with respect to a given scoring function over time. Processing such queries is challenging because, even when an object is not a top-k/w object at the time it enters the processing system, it might become one in the future. Thus a set of potential top-k/w objects has to be stored in memory, and its size should be minimized to cope efficiently with high data streaming rates. Existing approaches typically store top-k/w and candidate sliding window objects in a k-skyband over a two-dimensional score-time space. However, due to continuous changes of the k-skyband, its maintenance is quite costly. The probabilistic k-skyband is a novel data structure that stores those sliding window objects with a significant probability of becoming top-k/w objects in the future. Continuous probabilistic k-skyband maintenance offers considerably improved runtime performance compared to k-skyband maintenance, especially for large values of k, at the expense of a small and controllable error rate. We propose two possible usages of the probabilistic k-skyband: (i) when it is used to process all sliding window objects, the resulting top-k/w algorithm is approximate and adequate for processing random-order data streams; (ii) when it is used to process only a subset of the most recent sliding window objects, it can improve the runtime performance of continuous k-skyband maintenance, resulting in a novel exact top-k/w algorithm. Our experimental evaluation systematically compares different top-k/w processing algorithms and shows that, while competing algorithms offer either time efficiency at the expense of space efficiency or vice versa, our algorithms based on the probabilistic k-skyband are both time and space efficient.
{"title":"Time- and Space-Efficient Sliding Window Top-k Query Processing","authors":"K. Pripužić, Ivana Podnar Žarko, K. Aberer","doi":"10.1145/2736701","DOIUrl":"https://doi.org/10.1145/2736701","url":null,"abstract":"A sliding window top-k (top-k/w) query monitors incoming data stream objects within a sliding window of size w to identify the k highest-ranked objects with respect to a given scoring function over time. Processing of such queries is challenging because, even when an object is not a top-k/w object at the time when it enters the processing system, it might become one in the future. Thus a set of potential top-k/w objects has to be stored in memory while its size should be minimized to efficiently cope with high data streaming rates. Existing approaches typically store top-k/w and candidate sliding window objects in a k-skyband over a two-dimensional score-time space. However, due to continuous changes of the k-skyband, its maintenance is quite costly. Probabilistic k-skyband is a novel data structure storing data stream objects from a sliding window with significant probability to become top-k/w objects in future. Continuous probabilistic k-skyband maintenance offers considerably improved runtime performance compared to k-skyband maintenance, especially for large values of k, at the expense of a small and controllable error rate. We propose two possible probabilistic k-skyband usages: (i) When it is used to process all sliding window objects, the resulting top-k/w algorithm is approximate and adequate for processing random-order data streams. (ii) When probabilistic k-skyband is used to process only a subset of most recent sliding window objects, it can improve the runtime performance of continuous k-skyband maintenance, resulting in a novel exact top-k/w algorithm. Our experimental evaluation systematically compares different top-k/w processing algorithms and shows that while competing algorithms offer either time efficiency at the expanse of space efficiency or vice-versa, our algorithms based on the probabilistic k-skyband are both time and space efficient.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"62 1","pages":"1:1-1:44"},"PeriodicalIF":1.8,"publicationDate":"2015-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91031478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimizing Batch Linear Queries under Exact and Approximate Differential Privacy
Ganzhao Yuan, Zhenjie Zhang, M. Winslett, Xiaokui Xiao, Y. Yang, Z. Hao
Differential privacy is a promising privacy-preserving paradigm for statistical query processing over sensitive data. It works by injecting random noise into each query result such that it is provably hard for the adversary to infer the presence or absence of any individual record from the published noisy results. The main objective in differentially private query processing is to maximize the accuracy of the query results while satisfying the privacy guarantees. Previous work, notably Li et al. [2010], has suggested that, with an appropriate strategy, processing a batch of correlated queries as a whole achieves considerably higher accuracy than answering them individually. However, to our knowledge there is currently no practical solution to find such a strategy for an arbitrary query batch; existing methods either return strategies of poor quality (often worse than naive methods) or require prohibitively expensive computations for even moderately large domains. Motivated by this, we propose the low-rank mechanism (LRM), the first practical differentially private technique for answering batch linear queries with high accuracy. LRM works for both exact (i.e., ε-) and approximate (i.e., (ε, δ)-) differential privacy definitions. We derive the utility guarantees of LRM and provide guidance on how to set the privacy parameters given the user's utility expectation. Extensive experiments using real data demonstrate that our proposed method consistently outperforms state-of-the-art query processing solutions under differential privacy by large margins.
{"title":"Optimizing Batch Linear Queries under Exact and Approximate Differential Privacy","authors":"Ganzhao Yuan, Zhenjie Zhang, M. Winslett, Xiaokui Xiao, Y. Yang, Z. Hao","doi":"10.1145/2699501","DOIUrl":"https://doi.org/10.1145/2699501","url":null,"abstract":"Differential privacy is a promising privacy-preserving paradigm for statistical query processing over sensitive data. It works by injecting random noise into each query result such that it is provably hard for the adversary to infer the presence or absence of any individual record from the published noisy results. The main objective in differentially private query processing is to maximize the accuracy of the query results while satisfying the privacy guarantees. Previous work, notably Li et al. [2010], has suggested that, with an appropriate strategy, processing a batch of correlated queries as a whole achieves considerably higher accuracy than answering them individually. However, to our knowledge there is currently no practical solution to find such a strategy for an arbitrary query batch; existing methods either return strategies of poor quality (often worse than naive methods) or require prohibitively expensive computations for even moderately large domains. Motivated by this, we propose a low-rank mechanism (LRM), the first practical differentially private technique for answering batch linear queries with high accuracy. LRM works for both exact (i.e., ε-) and approximate (i.e., (ε, Δ)-) differential privacy definitions. We derive the utility guarantees of LRM and provide guidance on how to set the privacy parameters, given the user's utility expectation. Extensive experiments using real data demonstrate that our proposed method consistently outperforms state-of-the-art query processing solutions under differential privacy, by large margins.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"3 1","pages":"11:1-11:47"},"PeriodicalIF":1.8,"publicationDate":"2015-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86862937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}