Cost-Effective Conceptual Design for Information Extraction
Arash Termehchy, A. Vakilian, Yodsawalai Chodpathumwan, M. Winslett
It is well established that extracting occurrences of entities in a collection of unstructured text documents and annotating them with their concepts improves the effectiveness of answering queries over the collection. However, creating and maintaining large annotated collections is very resource intensive. Because an enterprise's resources are limited and its users may have urgent information needs, it may have to select only a subset of the relevant concepts for extraction and annotation. We call this subset a conceptual design for the annotated collection. In this article, we introduce and formally define the problem of cost-effective conceptual design: given a collection, a set of relevant concepts, and a fixed budget, find the conceptual design that most improves the effectiveness of answering queries over the collection. We provide efficient algorithms for special cases of the problem and prove that it is generally NP-hard in the number of relevant concepts. We propose three efficient approximation algorithms: a greedy algorithm, approximate popularity maximization (APM for short), and approximate annotation-benefit maximization (AAM for short). We show that, if there are no constraints regarding the overlap of concepts, APM is a fully polynomial-time approximation scheme. We also prove that if the relevant concepts are mutually exclusive, the greedy algorithm delivers a constant approximation ratio if the concepts are equally costly, APM has a constant approximation ratio, and AAM is a fully polynomial-time approximation scheme. Our empirical results, using a Wikipedia collection and a search engine query log, validate the proposed formalization of the problem and show that APM and AAM efficiently compute conceptual designs. They also indicate that, in general, APM delivers the optimal conceptual designs if the relevant concepts are not mutually exclusive, whereas if the relevant concepts are mutually exclusive, the conceptual designs delivered by AAM improve the effectiveness of answering queries over the collection more than the solutions provided by APM.
{"title":"Cost-Effective Conceptual Design for Information Extraction","authors":"Arash Termehchy, A. Vakilian, Yodsawalai Chodpathumwan, M. Winslett","doi":"10.1145/2716321","DOIUrl":"https://doi.org/10.1145/2716321","url":null,"abstract":"It is well established that extracting and annotating occurrences of entities in a collection of unstructured text documents with their concepts improves the effectiveness of answering queries over the collection. However, it is very resource intensive to create and maintain large annotated collections. Since the available resources of an enterprise are limited and/or its users may have urgent information needs, it may have to select only a subset of relevant concepts for extraction and annotation. We call this subset a conceptual design for the annotated collection. In this article, we introduce and formally define the problem of cost-effective conceptual design where, given a collection, a set of relevant concepts, and a fixed budget, one likes to find a conceptual design that most improves the effectiveness of answering queries over the collection. We provide efficient algorithms for special cases of the problem and prove it is generally NP-hard in the number of relevant concepts. We propose three efficient approximations to solve the problem: a greedy algorithm, an approximate popularity maximization (APM for short), and approximate annotation-benefit maximization (AAM for short). We show that, if there are no constraints regrading the overlap of concepts, APM is a fully polynomial time approximation scheme. We also prove that if the relevant concepts are mutually exclusive, the greedy algorithm delivers a constant approximation ratio if the concepts are equally costly, APM has a constant approximation ratio, and AAM is a fully polynomial-time approximation scheme. Our empirical results using a Wikipedia collection and a search engine query log validate the proposed formalization of the problem and show that APM and AAM efficiently compute conceptual designs. They also indicate that, in general, APM delivers the optimal conceptual designs if the relevant concepts are not mutually exclusive. Also, if the relevant concepts are mutually exclusive, the conceptual designs delivered by AAM improve the effectiveness of answering queries over the collection more than the solutions provided by APM.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"44 1","pages":"12:1-12:39"},"PeriodicalIF":1.8,"publicationDate":"2015-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85700177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Editorial: The Best of Two Worlds -- Present Your TODS Paper at SIGMOD
Christian S. Jensen
It just became even more attractive to publish your research results in ACM Transactions on Database Systems: The leadership of ACM SIGMOD and TODS have decided to offer the authors of certain TODS articles the opportunity to present their article at the "next" SIGMOD conference. This agreement aims to make it more attractive to members of the SIGMOD community to publish in TODS, as well as to further enrich the technical program at the SIGMOD conferences. Journal and conference publication differ in a number of respects. In the following, I review important differences, from the perspective of journal publication, and present a case for publication in TODS. When a submission is received for consideration of publication in TODS, it is assigned to an Associate Editor who then is in charge of handling the submission and, in a sense, serves as the submission's ombudsman: The handling Associate Editor aims to do what is right for the submission and will take into account the author's responses to reviews. While the aim is to provide review results within 4 months, the journal's review process can accommodate special circumstances as needed to get things right. For example, additional reviews can be obtained in a review round, and an additional round of reviewing can be introduced. The traditional conference review process has a fixed schedule of deadlines and does not offer this flexibility. Some conferences have tried to achieve some of this flexibility by allowing one round of revision. Some conferences have also introduced procedures that may be viewed as a means of approximating the Associate Editor role found at journals. They have introduced program committee vice-chairs and meta-reviewers, and they have introduced author feedback. In my experience, these innovations to the conference review process are valuable but do not combine to yield the benefits of the journal review process. Specifically, what I call "hit-and-run" reviews still occur at times. These are superficial reviews that simply reject a paper without offering specific reasons. Key reasons why such reviews occur are that they are fast to do and that reviewers know they can get away with them because there is no time for iteration. I also believe that the vice-chair and meta-reviewer roles are not always effective, mainly due to tight deadlines: they often have to make accept/reject recommendations based only on the information already available. Another difference between the …
{"title":"Editorial: The Best of Two Worlds -- Present Your TODS Paper at SIGMOD","authors":"Christian S. Jensen","doi":"10.1145/2770931","DOIUrl":"https://doi.org/10.1145/2770931","url":null,"abstract":"It just became even more attractive to publish your research results in ACM Transactions on Database Systems: The leadership of ACM SIGMOD and TODS have decided to offer the authors of certain TODS articles the opportunity to present their article at the \" next \" SIGMOD conference. This agreement aims to make it more attractive to members of the SIGMOD community to publish in TODS, as well as to further enrich the technical program at the SIGMOD conferences. Journal and conference publication differ in a number of respects. In the following, I review important differences, from the perspective of journal publication, and present a case for publication in TODS. When a submission is received for consideration of publication in TODS, the submission is assigned to an Associate Editor who then is in charge of handling the submission and, in a sense, serves as the submission's ombudsman: The handling Associate Editor aims to do what is right for the submission and will take into account the author's responses to reviews. While the aim is to provide review results within 4 months, the journal's review process can accommodate special circumstances as needed to get things right. For example, additional reviews can be obtained in a review round, and an additional round of reviewing can be introduced. The traditional conference review process has a fixed schedule of deadlines and does not offer this flexibility. Some conferences have tried to achieve some of the flexibility by allowing one round of revision. Some conferences have also introduced procedures that may be viewed as a means of approximating the Associate Editor role as found at journals. They have introduced program committee vice-chairs and meta-reviewers, and they have introduced author feedback. In my experience, these innovations to the conference review process are valuable but do not combine to yield the benefits of the journal review process. Specifically, what I call \" hit-and-run \" reviews still occur at times. These are superficial reviews that simply reject a paper without offering specific reasons. Key reasons why such reviews occur is that they are fast to do and that reviewers know that they can get away with them because there is no time for iteration. And I believe that the vice-chair and meta-reviewer roles are not always effective, mainly due to tight deadlines. They simply often have to make accept/reject recommendations with the information already available. Another difference between the …","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"218 1","pages":"7:1-7:2"},"PeriodicalIF":1.8,"publicationDate":"2015-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79759739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deciding Determinism with Fairness for Simple Transducer Networks
Tom J. Ameloot
A distributed database system often operates in an asynchronous communication model where messages can be arbitrarily delayed. This communication model causes nondeterministic effects like unpredictable arrival orders of messages. Nonetheless, in general we want the distributed system to be deterministic; the system should produce the same output despite the nondeterministic effects on messages. Previously, two interpretations of determinism have been proposed. The first says that all infinite fair computation traces produce the same output. The second interpretation is a confluence notion, saying that all finite computation traces can still be extended to produce the same output. A decidability result for the confluence notion was previously obtained for so-called simple transducer networks, a model from the field of declarative networking. In the current article, we also present a decidability result for simple transducer networks, but this time for the first interpretation of determinism, with infinite fair computation traces. We also compare the expressivity of simple transducer networks under both interpretations.
{"title":"Deciding Determinism with Fairness for Simple Transducer Networks","authors":"Tom J. Ameloot","doi":"10.1145/2757215","DOIUrl":"https://doi.org/10.1145/2757215","url":null,"abstract":"A distributed database system often operates in an asynchronous communication model where messages can be arbitrarily delayed. This communication model causes nondeterministic effects like unpredictable arrival orders of messages. Nonetheless, in general we want the distributed system to be deterministic; the system should produce the same output despite the nondeterministic effects on messages.\u0000 Previously, two interpretations of determinism have been proposed. The first says that all infinite fair computation traces produce the same output. The second interpretation is a confluence notion, saying that all finite computation traces can still be extended to produce the same output. A decidability result for the confluence notion was previously obtained for so-called simple transducer networks, a model from the field of declarative networking. In the current article, we also present a decidability result for simple transducer networks, but this time for the first interpretation of determinism, with infinite fair computation traces. We also compare the expressivity of simple transducer networks under both interpretations.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"2 1","pages":"9:1-9:39"},"PeriodicalIF":1.8,"publicationDate":"2015-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76957811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Editorial: Updates to the Editorial Board
Christian S. Jensen
It is of paramount importance for a scholarly journal such as ACM Transactions on Database Systems to have a strong editorial board of respected, world-class scholars. The editorial board plays a fundamental role in attracting the best submissions, in ensuring insightful and timely handling of submissions, in maintaining the high scientific standards of the journal, and in maintaining the reputation of the journal. Indeed, the journal's associate editors, along with the reviewers and authors they work with, are the primary reason that TODS is a world-class journal. As of January 1, 2017, three associate editors—Paolo Ciaccia, Divyakant Agrawal, and Sihem Amer-Yahia—ended their terms, each having served on the editorial board for roughly 6 years. That said, they will stay on until they complete their current loads. Paolo, Divy, and Sihem have provided very substantial, high-caliber service to the journal and the database community. Specifically, they have lent their extensive experience, deep insight, and sound technical judgment to the journal. I have never seen them compromise on quality when handling submissions. Surely, they have had many other demands on their time during these past 6 years, many of which are better paid. We are all fortunate that they have donated their time and unique expertise to the journal and our community for half a dozen years. They deserve our recognition for their commitment to the scientific enterprise. As of January 1, 2017, three new associate editors have joined the editorial board:
• Feifei Li, University of Utah (https://www.cs.utah.edu/~lifeifei/)
• Kian-Lee Tan, National University of Singapore (https://www.comp.nus.edu.sg/~tankl/)
• Jeffrey Xu Yu, Chinese University of Hong Kong (http://www.se.cuhk.edu.hk/people/yu.html)
{"title":"Editorial: Updates to the Editorial Board","authors":"Christian S. Jensen","doi":"10.1145/2747020","DOIUrl":"https://doi.org/10.1145/2747020","url":null,"abstract":"It is of paramount importance for a scholarly journal such as ACM Transactions on Database Systems to have a strong editorial board of respected, world-class scholars. The editorial board plays a fundamental role in attracting the best submissions, in ensuring insightful and timely handling of submissions, in maintaining the high scientific standards of the journal, and in maintaining the reputation of the journal. Indeed, the journal’s associate editors, along with the reviewers and authors they work with, are the primary reason that TODS is a world-class journal. As of January 1, 2017, three associate editors—Paolo Ciaccia, Divyakant Agrawal, and Sihem Amer-Yahia—ended their terms, each having served on the editorial board for roughly 6 years. In addition, they will stay on until they complete their current loads. Paolo, Divy, and Sihem have provided very substantial, high-caliber service to the journal and the database community. Specifically, they have lent their extensive experience, deep insight, and sound technical judgment to the journal. I have never seen them compromise on quality when handling submissions. Surely, they have had many other demands on their time, many of which are better paid, during these past 6 years. We are all fortunate that they have donated their time and unique expertise to the journal and our community during half a dozen years. They deserve our recognition for their commitment to the scientific enterprise. As of January 1, 2017, three new associate editors have joined the editorial board: • Feifei Li, University of Utah (https://www.cs.utah.edu/∼lifeifei/) • Kian-Lee Tan, National University of Singapore (https://www.comp.nus.edu.sg/ ∼tankl/) • Jeffrey Xu Yu, Chinese University of Hong Kong (http://www.se.cuhk.edu.hk/people/ yu.html)","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"1 1","pages":"1e:1"},"PeriodicalIF":1.8,"publicationDate":"2015-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74470954","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient Computation of the Tree Edit Distance
Mateusz Pawlik, Nikolaus Augsten
We consider the classical tree edit distance between ordered labelled trees, which is defined as the minimum-cost sequence of node edit operations that transforms one tree into another. The state-of-the-art solutions for the tree edit distance are not satisfactory. The main competitors in the field either have optimal worst-case complexity but hit the worst case frequently, or they are very efficient for some tree shapes but degenerate for others. This leads to unpredictable and often infeasible runtimes, and there is no obvious way to choose between the algorithms. In this article, we present RTED, a robust tree edit distance algorithm. The asymptotic complexity of our algorithm is smaller than or equal to the complexity of the best competitors for any input instance; that is, our algorithm is both efficient and worst-case optimal. This is achieved by computing a dynamic decomposition strategy that depends on the input trees. RTED is shown optimal among all algorithms that use LRH (left-right-heavy) strategies, which include RTED and the fastest tree edit distance algorithms presented in the literature. In experiments on synthetic and real-world data, we empirically evaluate our solution and compare it to the state of the art.
{"title":"Efficient Computation of the Tree Edit Distance","authors":"Mateusz Pawlik, Nikolaus Augsten","doi":"10.1145/2699485","DOIUrl":"https://doi.org/10.1145/2699485","url":null,"abstract":"We consider the classical tree edit distance between ordered labelled trees, which is defined as the minimum-cost sequence of node edit operations that transform one tree into another. The state-of-the-art solutions for the tree edit distance are not satisfactory. The main competitors in the field either have optimal worst-case complexity but the worst case happens frequently, or they are very efficient for some tree shapes but degenerate for others. This leads to unpredictable and often infeasible runtimes. There is no obvious way to choose between the algorithms.\u0000 In this article we present RTED, a robust tree edit distance algorithm. The asymptotic complexity of our algorithm is smaller than or equal to the complexity of the best competitors for any input instance, that is, our algorithm is both efficient and worst-case optimal. This is achieved by computing a dynamic decomposition strategy that depends on the input trees. RTED is shown optimal among all algorithms that use LRH (left-right-heavy) strategies, which include RTED and the fastest tree edit distance algorithms presented in literature. In our experiments on synthetic and real-world data we empirically evaluate our solution and compare it to the state-of-the-art.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"36 1","pages":"3:1-3:40"},"PeriodicalIF":1.8,"publicationDate":"2015-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72825803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Online Updates on Data Warehouses via Judicious Use of Solid-State Storage
Manos Athanassoulis, Shimin Chen, A. Ailamaki, Phillip B. Gibbons, R. Stoica
Data warehouses have traditionally been optimized for read-only query performance, allowing only offline updates at night and essentially trading data freshness for performance. The need for 24x7 operations in global markets and the rise of online and other quickly reacting businesses make concurrent online updates increasingly desirable. Unfortunately, state-of-the-art approaches fall short of supporting fast analysis queries over fresh data. The conventional approach of performing updates in place can dramatically slow down query performance, while prior proposals using differential updates either require large in-memory buffers or may incur significant update migration cost. This article presents a novel approach for supporting online updates in data warehouses that overcomes the limitations of prior approaches by making judicious use of available SSDs to cache incoming updates. We model the problem of query processing with differential updates as a type of outer join between the data residing on disks and the updates residing on SSDs. We present MaSM algorithms for performing such joins and periodic migrations with small memory footprints, low query overhead, few SSD writes, efficient in-place migration of updates, and correct ACID support. We present detailed modeling of the proposed approach and provide proofs regarding the fundamental properties of the MaSM algorithms. Our experiments show that MaSM incurs at most 7% overhead both on synthetic range scans (varying range size from 4KB to 100GB) and in a TPC-H query replay study, while increasing update throughput by orders of magnitude.
{"title":"Online Updates on Data Warehouses via Judicious Use of Solid-State Storage","authors":"Manos Athanassoulis, Shimin Chen, A. Ailamaki, Phillip B. Gibbons, R. Stoica","doi":"10.1145/2699484","DOIUrl":"https://doi.org/10.1145/2699484","url":null,"abstract":"Data warehouses have been traditionally optimized for read-only query performance, allowing only offline updates at night, essentially trading off data freshness for performance. The need for 24x7 operations in global markets and the rise of online and other quickly reacting businesses make concurrent online updates increasingly desirable. Unfortunately, state-of-the-art approaches fall short of supporting fast analysis queries over fresh data. The conventional approach of performing updates in place can dramatically slow down query performance, while prior proposals using differential updates either require large in-memory buffers or may incur significant update migration cost.\u0000 This article presents a novel approach for supporting online updates in data warehouses that overcomes the limitations of prior approaches by making judicious use of available SSDs to cache incoming updates. We model the problem of query processing with differential updates as a type of outer join between the data residing on disks and the updates residing on SSDs. We present MaSM algorithms for performing such joins and periodic migrations, with small memory footprints, low query overhead, low SSD writes, efficient in-place migration of updates, and correct ACID support. We present detailed modeling of the proposed approach, and provide proofs regarding the fundamental properties of the MaSM algorithms. Our experimentation shows that MaSM incurs only up to 7% overhead both on synthetic range scans (varying range size from 4KB to 100GB) and in a TPC-H query replay study, while also increasing the update throughput by orders of magnitude.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"195 1","pages":"6:1-6:42"},"PeriodicalIF":1.8,"publicationDate":"2015-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74486823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Size Bounds for Factorised Representations of Query Results
Dan Olteanu, Jakub Závodný
We study two succinct representation systems for relational data based on relational algebra expressions with unions, Cartesian products, and singleton relations: f-representations, which employ algebraic factorisation using the distributivity of product over union, and d-representations, which are f-representations made further succinct by explicit sharing of repeated subexpressions. In particular, we study such representations for results of conjunctive queries. We derive tight asymptotic bounds for representation sizes and present algorithms to compute representations within these bounds. We compare the succinctness of f-representations and d-representations for results of equi-join queries, and relate them to fractional edge covers and fractional hypertree decompositions of the query hypergraph. Recent work has shown that f-representations can significantly boost the performance of query evaluation in centralised and distributed settings and of machine learning tasks.
{"title":"Size Bounds for Factorised Representations of Query Results","authors":"Dan Olteanu, Jakub Závodný","doi":"10.1145/2656335","DOIUrl":"https://doi.org/10.1145/2656335","url":null,"abstract":"We study two succinct representation systems for relational data based on relational algebra expressions with unions, Cartesian products, and singleton relations: f-representations, which employ algebraic factorisation using distributivity of product over union, and d-representations, which are f-representations where further succinctness is brought by explicit sharing of repeated subexpressions.\u0000 In particular we study such representations for results of conjunctive queries. We derive tight asymptotic bounds for representation sizes and present algorithms to compute representations within these bounds. We compare the succinctness of f-representations and d-representations for results of equi-join queries, and relate them to fractional edge covers and fractional hypertree decompositions of the query hypergraph.\u0000 Recent work showed that f-representations can significantly boost the performance of query evaluation in centralised and distributed settings and of machine learning tasks.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"154 1","pages":"2:1-2:44"},"PeriodicalIF":1.8,"publicationDate":"2015-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77478652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Analysis of Schemas with Access Restrictions
Michael Benedikt, P. Bourhis, Clemens Ley
We study verification of systems whose transitions consist of accesses to a Web-based data source. An access is a lookup on a relation within a relational database, fixing values for a set of positions in the relation. For example, a transition can represent access to a Web form, where the user is restricted to filling in values for a particular set of fields. We look at verifying properties of a schema describing the possible accesses of such a system. We present a language where one can describe the properties of an access path and also specify additional restrictions on accesses that are enforced by the schema. Our main property language, AccessLTL, is based on a first-order extension of linear-time temporal logic, interpreting access paths as sequences of relational structures. We also present a lower-level automaton model, A-automata, into which AccessLTL specifications can compile. We show that AccessLTL and A-automata can express static analysis problems related to “querying with limited access patterns” that have been studied in the database literature in the past, such as whether an access is relevant to answering a query and whether two queries are equivalent in the accessible data they can return. We prove decidability and complexity results for several restrictions and variants of AccessLTL and explain which properties of paths can be expressed in each restriction.
{"title":"Analysis of Schemas with Access Restrictions","authors":"Michael Benedikt, P. Bourhis, Clemens Ley","doi":"10.1145/2699500","DOIUrl":"https://doi.org/10.1145/2699500","url":null,"abstract":"We study verification of systems whose transitions consist of accesses to a Web-based data source. An access is a lookup on a relation within a relational database, fixing values for a set of positions in the relation. For example, a transition can represent access to a Web form, where the user is restricted to filling in values for a particular set of fields. We look at verifying properties of a schema describing the possible accesses of such a system. We present a language where one can describe the properties of an access path and also specify additional restrictions on accesses that are enforced by the schema. Our main property language, AccessLTL, is based on a first-order extension of linear-time temporal logic, interpreting access paths as sequences of relational structures. We also present a lower-level automaton model, A-automata, into which AccessLTL specifications can compile. We show that AccessLTL and A-automata can express static analysis problems related to “querying with limited access patterns” that have been studied in the database literature in the past, such as whether an access is relevant to answering a query and whether two queries are equivalent in the accessible data they can return. We prove decidability and complexity results for several restrictions and variants of AccessLTL and explain which properties of paths can be expressed in each restriction.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"11 1","pages":"5:1-5:46"},"PeriodicalIF":1.8,"publicationDate":"2015-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90441246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Time- and Space-Efficient Sliding Window Top-k Query Processing
K. Pripužić, Ivana Podnar Žarko, K. Aberer
A sliding window top-k (top-k/w) query monitors incoming data stream objects within a sliding window of size w to identify the k highest-ranked objects with respect to a given scoring function over time. Processing such queries is challenging because, even when an object is not a top-k/w object at the time it enters the processing system, it might become one in the future. Thus a set of potential top-k/w objects has to be stored in memory, and its size should be minimized to cope efficiently with high data streaming rates. Existing approaches typically store top-k/w and candidate sliding window objects in a k-skyband over a two-dimensional score-time space. However, due to continuous changes of the k-skyband, its maintenance is quite costly. The probabilistic k-skyband is a novel data structure that stores those sliding window objects with a significant probability of becoming top-k/w objects in the future. Continuous probabilistic k-skyband maintenance offers considerably improved runtime performance compared to k-skyband maintenance, especially for large values of k, at the expense of a small and controllable error rate. We propose two possible usages of the probabilistic k-skyband: (i) when it is used to process all sliding window objects, the resulting top-k/w algorithm is approximate and adequate for processing random-order data streams; (ii) when it is used to process only a subset of the most recent sliding window objects, it can improve the runtime performance of continuous k-skyband maintenance, resulting in a novel exact top-k/w algorithm. Our experimental evaluation systematically compares different top-k/w processing algorithms and shows that, while competing algorithms offer either time efficiency at the expense of space efficiency or vice versa, our algorithms based on the probabilistic k-skyband are both time and space efficient.
{"title":"Time- and Space-Efficient Sliding Window Top-k Query Processing","authors":"K. Pripužić, Ivana Podnar Žarko, K. Aberer","doi":"10.1145/2736701","DOIUrl":"https://doi.org/10.1145/2736701","url":null,"abstract":"A sliding window top-k (top-k/w) query monitors incoming data stream objects within a sliding window of size w to identify the k highest-ranked objects with respect to a given scoring function over time. Processing of such queries is challenging because, even when an object is not a top-k/w object at the time when it enters the processing system, it might become one in the future. Thus a set of potential top-k/w objects has to be stored in memory while its size should be minimized to efficiently cope with high data streaming rates. Existing approaches typically store top-k/w and candidate sliding window objects in a k-skyband over a two-dimensional score-time space. However, due to continuous changes of the k-skyband, its maintenance is quite costly. Probabilistic k-skyband is a novel data structure storing data stream objects from a sliding window with significant probability to become top-k/w objects in future. Continuous probabilistic k-skyband maintenance offers considerably improved runtime performance compared to k-skyband maintenance, especially for large values of k, at the expense of a small and controllable error rate. We propose two possible probabilistic k-skyband usages: (i) When it is used to process all sliding window objects, the resulting top-k/w algorithm is approximate and adequate for processing random-order data streams. (ii) When probabilistic k-skyband is used to process only a subset of most recent sliding window objects, it can improve the runtime performance of continuous k-skyband maintenance, resulting in a novel exact top-k/w algorithm. Our experimental evaluation systematically compares different top-k/w processing algorithms and shows that while competing algorithms offer either time efficiency at the expanse of space efficiency or vice-versa, our algorithms based on the probabilistic k-skyband are both time and space efficient.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"62 1","pages":"1:1-1:44"},"PeriodicalIF":1.8,"publicationDate":"2015-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91031478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimizing Batch Linear Queries under Exact and Approximate Differential Privacy
Ganzhao Yuan, Zhenjie Zhang, M. Winslett, Xiaokui Xiao, Y. Yang, Z. Hao
Differential privacy is a promising privacy-preserving paradigm for statistical query processing over sensitive data. It works by injecting random noise into each query result such that it is provably hard for the adversary to infer the presence or absence of any individual record from the published noisy results. The main objective in differentially private query processing is to maximize the accuracy of the query results while satisfying the privacy guarantees. Previous work, notably Li et al. [2010], has suggested that, with an appropriate strategy, processing a batch of correlated queries as a whole achieves considerably higher accuracy than answering them individually. However, to our knowledge there is currently no practical solution to find such a strategy for an arbitrary query batch; existing methods either return strategies of poor quality (often worse than naive methods) or require prohibitively expensive computations for even moderately large domains. Motivated by this, we propose the low-rank mechanism (LRM), the first practical differentially private technique for answering batch linear queries with high accuracy. LRM works for both exact (i.e., ε-) and approximate (i.e., (ε, δ)-) differential privacy definitions. We derive the utility guarantees of LRM and provide guidance on how to set the privacy parameters given the user's utility expectation. Extensive experiments using real data demonstrate that our proposed method consistently outperforms state-of-the-art query processing solutions under differential privacy by large margins.
{"title":"Optimizing Batch Linear Queries under Exact and Approximate Differential Privacy","authors":"Ganzhao Yuan, Zhenjie Zhang, M. Winslett, Xiaokui Xiao, Y. Yang, Z. Hao","doi":"10.1145/2699501","DOIUrl":"https://doi.org/10.1145/2699501","url":null,"abstract":"Differential privacy is a promising privacy-preserving paradigm for statistical query processing over sensitive data. It works by injecting random noise into each query result such that it is provably hard for the adversary to infer the presence or absence of any individual record from the published noisy results. The main objective in differentially private query processing is to maximize the accuracy of the query results while satisfying the privacy guarantees. Previous work, notably Li et al. [2010], has suggested that, with an appropriate strategy, processing a batch of correlated queries as a whole achieves considerably higher accuracy than answering them individually. However, to our knowledge there is currently no practical solution to find such a strategy for an arbitrary query batch; existing methods either return strategies of poor quality (often worse than naive methods) or require prohibitively expensive computations for even moderately large domains. Motivated by this, we propose a low-rank mechanism (LRM), the first practical differentially private technique for answering batch linear queries with high accuracy. LRM works for both exact (i.e., ε-) and approximate (i.e., (ε, Δ)-) differential privacy definitions. We derive the utility guarantees of LRM and provide guidance on how to set the privacy parameters, given the user's utility expectation. Extensive experiments using real data demonstrate that our proposed method consistently outperforms state-of-the-art query processing solutions under differential privacy, by large margins.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"3 1","pages":"11:1-11:47"},"PeriodicalIF":1.8,"publicationDate":"2015-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86862937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}