
Latest Publications in ACM Transactions on Database Systems

Dynamic Complexity under Definable Changes
IF 1.8 · Region 2, Computer Science · Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2017-01-10 · DOI: 10.1145/3241040
T. Schwentick, N. Vortmeier, T. Zeume
In the setting of dynamic complexity, the goal of a dynamic program is to maintain the result of a fixed query over an input database that is subject to changes, possibly using additional auxiliary relations. In other words, a dynamic program updates a materialized view whenever a base relation changes. The updates of the query result and the auxiliary relations are specified in first-order logic or, equivalently, relational algebra.

The original framework by Patnaik and Immerman considers only changes that insert or delete single tuples. This article extends the setting to definable changes, themselves specified by first-order queries on the database, and generalizes previous maintenance results to these more expressive change operations. More specifically, it is shown that the undirected reachability query is first-order maintainable under single-tuple changes and first-order defined insertions; likewise, the directed reachability query for directed acyclic graphs is first-order maintainable under insertions defined by quantifier-free first-order queries.

These results rely on bounded bridge properties, which say, roughly, that after the insertion of a defined set of edges, each connected pair of nodes is linked by some path with a bounded number of new edges. While this bound can, in general, be huge, it is shown to be small for insertion queries defined by unions of conjunctive queries. To illustrate that the results for this restricted setting could be practically relevant, they are complemented by an experimental study that compares the performance of dynamic programs under complex changes, dynamic programs under single-tuple changes, and recomputation from scratch.

The positive results are complemented by several inexpressibility results. For example, it is shown that, unlike for single-tuple insertions, dynamic programs that maintain the reachability query under definable, quantifier-free changes strictly need update formulas with quantifiers. Finally, further positive results unrelated to reachability are presented: for changes definable by parameter-free first-order formulas, all LOGSPACE-definable (and even AC1-definable) queries can be maintained by first-order dynamic programs.
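To make the single-tuple base case concrete, here is a minimal sketch (not the article's construction, which handles defined sets of insertions) of the classic first-order update rule for maintaining undirected reachability under single-edge insertions. The auxiliary relation `T` stores the reflexive-transitive closure of the edge relation; the update formula is first-order and quantifier-free in the changed tuple.

```python
def insert_edge(T, nodes, u, v):
    """Return the updated connectivity relation after inserting edge {u, v}.

    Update rule:
        T'(x, y) = T(x, y) or (T(x, u) and T(v, y)) or (T(x, v) and T(u, y))
    """
    return {(x, y) for x in nodes for y in nodes
            if (x, y) in T
            or ((x, u) in T and (v, y) in T)
            or ((x, v) in T and (u, y) in T)}

# Start with three isolated nodes: T is the identity relation.
nodes = {1, 2, 3}
T = {(n, n) for n in nodes}
T = insert_edge(T, nodes, 1, 2)   # connects 1 and 2
T = insert_edge(T, nodes, 2, 3)   # transitively connects 1 and 3
print((1, 3) in T)  # True
```

Deletions, and the defined bulk insertions the article studies, are where the difficulty lies; this rule only illustrates why single insertions are easy.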
Citations: 10
Bounded repairability for regular tree languages
IF 1.8 · Region 2, Computer Science · Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2016-08-08 · DOI: 10.1145/2274576.2274593
P. Bourhis, G. Puppis, Cristian Riveros, S. Staworko
We consider the problem of repairing unranked trees (e.g., XML documents) satisfying a given restriction specification R (e.g., a DTD) into unranked trees satisfying a given target specification T. Specifically, we focus on the question of whether one can get from any tree in a regular language R to some tree in another regular language T with a finite, uniformly bounded number of edit operations (i.e., deletions and insertions of nodes). We give effective characterizations of the pairs of specifications R and T for which such a uniform bound exists, and we study the complexity of the problem under different representations of the regular tree languages (e.g., non-deterministic stepwise automata, deterministic stepwise automata, DTDs). Finally, we point out some connections with the analogous problem for regular languages of words, which was previously studied in [6].
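The two edit operations being counted can be illustrated on unranked trees encoded as `(label, children)` pairs. This is a hypothetical sketch for intuition only: deleting a node promotes its children to its parent, and inserting a node absorbs a contiguous run of siblings as its children.

```python
def delete_child(tree, i):
    """Delete the i-th child of the root, promoting its children."""
    label, children = tree
    _, grandchildren = children[i]
    return (label, children[:i] + grandchildren + children[i + 1:])

def insert_child(tree, new_label, i, j):
    """Insert a new node below the root, adopting children i..j-1."""
    label, children = tree
    return (label, children[:i] + [(new_label, children[i:j])] + children[j:])

doc = ("r", [("a", [("b", []), ("c", [])]), ("d", [])])
# Deleting "a" promotes "b" and "c" to children of the root.
print(delete_child(doc, 0))  # ('r', [('b', []), ('c', []), ('d', [])])
```

The article asks when a uniform bound on the number of such operations suffices for every tree in R, independent of the tree's size.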
Citations: 5
The Dark Citations of TODS Papers and What to Do About It: Or: Cite the Journal Paper
IF 1.8 · Region 2, Computer Science · Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2016-06-30 · DOI: 10.1145/3003665.3003680
Christian S. Jensen
In contrast, the academic impact of the content of a paper can be measured by the number of citations to the paper. In some areas, it is easier to get citations than in others. However, when comparing two papers from the same area, one with many citations and one with few, the former can generally be considered the more interesting, relevant, important, and/or impactful one. The academic impact of a researcher can then be measured by the number of citations to their papers.
Citations: 12
Inferring Social Strength from Spatiotemporal Data
IF 1.8 · Region 2, Computer Science · Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2016-04-07 · DOI: 10.1145/2877200
Huy Pham, C. Shahabi, Yan Liu
The advent of geolocation technologies has generated unprecedentedly rich datasets of people’s location information at a very high fidelity. These location datasets can be used to study human behavior; for example, social studies have shown that people who are seen together frequently at the same place and same time are most probably socially related. In this article, we are interested in inferring these social connections by analyzing people’s location information; this is useful in a variety of application domains, from sales and marketing to intelligence analysis. In particular, we propose an entropy-based model (EBM) that not only infers social connections but also estimates the strength of social connections by analyzing people’s co-occurrences in space and time. We examine two independent methods, diversity and weighted frequency, through which co-occurrences contribute to the strength of a social connection. In addition, we take the characteristics of each location into consideration in order to compensate for cases where only limited location information is available. We also study the role of location semantics in improving our computation of social strength. We develop a parallel implementation of our algorithm using MapReduce to create a scalable and efficient solution for online applications. We conducted extensive sets of experiments with real-world datasets including both people’s location data and their social connections, where we used the latter as the ground truth to verify the results of applying our approach to the former. We show that our approach is valid across different networks and outperforms the competitors.
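The diversity intuition can be sketched with Shannon entropy. This is a hedged illustration, not the paper's exact formula: co-occurrences spread across many distinct locations (high entropy) are stronger evidence of a social tie than the same number of meetings at one public hotspot.

```python
import math
from collections import Counter

def diversity(cooccurrence_locations):
    """Entropy-based diversity of a pair's co-occurrence locations,
    exponentiated so it reads as an 'effective number' of distinct places.
    (Illustrative only; the EBM paper defines its own measures.)"""
    counts = Counter(cooccurrence_locations)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total)
                   for c in counts.values())
    return math.exp(entropy)

# Ten meetings at one cafe vs. ten meetings spread over five venues.
print(diversity(["cafe"] * 10))                                   # 1.0
print(diversity(["cafe", "gym", "park", "office", "cinema"] * 2))  # ~5.0
```

The second pair scores an effective five distinct meeting places, while the first collapses to one, even though both pairs met ten times.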
Citations: 20
Dichotomies for Queries with Negation in Probabilistic Databases
IF 1.8 · Region 2, Computer Science · Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2016-04-07 · DOI: 10.1145/2877203
Robert Fink, Dan Olteanu
This article charts the tractability frontier of two classes of relational algebra queries in tuple-independent probabilistic databases. The first class consists of queries with join, projection, selection, and negation but without repeating relation symbols and union. The second class consists of quantified queries that express the following binary relationships among sets of entities: set division, set inclusion, set equivalence, and set incomparability. Quantified queries are expressible in relational algebra using join, projection, nested negation, and repeating relation symbols. Each query in the two classes has either polynomial-time or #P-hard data complexity and the tractable queries can be recognised efficiently. Our result for the first query class extends a known dichotomy for conjunctive queries without self-joins to such queries with negation. For quantified queries, their tractability is sensitive to their outermost projection operator: They are tractable if no attribute representing set identifiers is projected away and #P-hard otherwise.
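The tuple-independence assumption is what makes the tractable side of such dichotomies work: because tuples are independent events, a safe query plan can combine probabilities directly. A minimal illustration (not from the article) for the Boolean query "does R contain any tuple":

```python
# Tuple-independent table R: each tuple is present independently
# with its own probability.
R = {"a": 0.5, "b": 0.5}

# P(exists x R(x)) = 1 - prod_i (1 - p_i), by independence.
p_empty = 1.0
for p in R.values():
    p_empty *= (1.0 - p)
p_nonempty = 1.0 - p_empty
print(p_nonempty)  # 0.75
```

The article's contribution is identifying exactly which queries with negation (and which quantified set relationships) admit such polynomial-time computation, and showing the rest are #P-hard.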
Citations: 37
ENFrame: A Framework for Processing Probabilistic Data
IF 1.8 · Region 2, Computer Science · Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2016-04-07 · DOI: 10.1145/2877205
Dan Olteanu, Sebastiaan J. van Schaik
This article introduces ENFrame, a framework for processing probabilistic data. Using ENFrame, users can write programs in a fragment of Python with constructs such as loops, list comprehension, aggregate operations on lists, and calls to external database engines. Programs are then interpreted probabilistically by ENFrame. We exemplify ENFrame on three clustering algorithms (k-means, k-medoids, and Markov clustering) and one classification algorithm (k-nearest-neighbour). A key component of ENFrame is an event language to succinctly encode correlations, trace the computation of user programs, and allow for computation of discrete probability distributions for program variables. We propose a family of sequential and concurrent, exact, and approximate algorithms for computing the probability of interconnected events. Experiments with k-medoids clustering and k-nearest-neighbour show orders-of-magnitude improvements of exact processing using ENFrame over naïve processing in each possible world, of approximate over exact, and of concurrent over sequential processing.
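The semantics ENFrame implements can be stated naively: run the deterministic user program in every possible world of the tuple-independent input and aggregate the outputs into a distribution. The sketch below is this exponential baseline, for intuition only; ENFrame's event language and algorithms exist precisely to avoid enumerating worlds.

```python
from itertools import product

def world_distribution(tuples, program):
    """tuples: list of (value, probability); program: list -> output.
    Returns the output distribution under possible-worlds semantics."""
    dist = {}
    for mask in product([False, True], repeat=len(tuples)):
        world = [v for (v, _), keep in zip(tuples, mask) if keep]
        p = 1.0
        for (_, pi), keep in zip(tuples, mask):
            p *= pi if keep else (1.0 - pi)
        out = program(world)
        dist[out] = dist.get(out, 0.0) + p
    return dist

# Distribution of the count of present tuples.
data = [("x", 0.5), ("y", 0.5)]
print(world_distribution(data, len))  # {0: 0.25, 1: 0.5, 2: 0.25}
```

A clustering program plugged in as `program` would, under these semantics, yield a distribution over clusterings; the paper's experiments show orders-of-magnitude gains over this naive enumeration.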
Citations: 2
Declarative Cleaning of Inconsistencies in Information Extraction
IF 1.8 · Region 2, Computer Science · Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2016-04-07 · DOI: 10.1145/2877202
Ronald Fagin, B. Kimelfeld, Frederick Reiss, Stijn Vansummeren
The population of a predefined relational schema from textual content, commonly known as Information Extraction (IE), is a pervasive task in contemporary computational challenges associated with Big Data. Since the textual content varies widely in nature and structure (from machine logs to informal natural language), it is notoriously difficult to write IE programs that unambiguously extract the sought information. For example, during extraction, an IE program could annotate a substring as both an address and a person name. When this happens, the extracted information is said to be inconsistent, and some way of removing inconsistencies is crucial for computing the final output. Industrial-strength IE systems like GATE and IBM SystemT therefore provide a built-in collection of cleaning operations to remove inconsistencies from extracted relations. These operations, however, are collected in an ad hoc fashion through use cases. Ideally, we would like to allow IE developers to declare their own policies. But existing cleaning operations are defined in an algorithmic way, and hence it is not clear how to extend the built-in operations without requiring low-level coding of internal or external functions.

We embark on the establishment of a framework for declarative cleaning of inconsistencies in IE through principles of database theory. Specifically, building upon the formalism of document spanners for IE, we adopt the concept of prioritized repairs, which has been recently proposed as an extension of the traditional database repairs to incorporate priorities among conflicting facts. We show that our framework captures the popular cleaning policies, as well as the POSIX semantics for extraction through regular expressions. We explore the problem of determining whether a cleaning declaration is unambiguous (i.e., always results in a single repair) and whether it increases the expressive power of the extraction language. We give both positive and negative results, some of which are general and some of which apply to policies used in practice.
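One popular cleaning policy of the kind the framework captures can be sketched in a few lines. This is a simplified illustration, not the article's formal machinery: given overlapping span annotations, keep spans greedily in priority order, discarding any span that overlaps an already-kept one.

```python
def clean(spans):
    """spans: list of (start, end, label, priority); higher priority wins.
    Greedy priority-based repair of overlapping annotations (illustrative)."""
    kept = []
    for s in sorted(spans, key=lambda t: -t[3]):
        if all(s[1] <= k[0] or s[0] >= k[1] for k in kept):
            kept.append(s)
    return sorted(kept)

annotations = [
    (0, 10, "address", 1),
    (3, 8, "person", 2),    # overlaps the address; higher priority
    (12, 15, "person", 1),
]
print(clean(annotations))  # [(3, 8, 'person', 2), (12, 15, 'person', 1)]
```

Note that this greedy policy happens to be unambiguous (one repair per input); the article studies exactly when a declared policy has that property.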
Citations: 28
BEVA: An Efficient Query Processing Algorithm for Error-Tolerant Autocompletion
IF 1.8 · Region 2, Computer Science · Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2016-04-07 · DOI: 10.1145/2877201
Xiaoling Zhou, Jianbin Qin, Chuan Xiao, Wei Wang, Xuemin Lin, Y. Ishikawa
Query autocompletion has become a standard feature in many search applications, especially for search engines. A recent trend is to support error-tolerant autocompletion, which increases usability significantly by matching prefixes of database strings while allowing a small number of errors. In this article, we systematically study the query processing problem for error-tolerant autocompletion with a given edit distance threshold. We propose a general framework that encompasses existing methods and characterizes different classes of algorithms and the minimum amount of information they need to maintain under different constraints. We then propose a novel evaluation strategy that achieves the minimum active node size by eliminating ancestor-descendant relationships among active nodes entirely. In addition, we characterize the essence of edit distance computation by a novel data structure named edit vector automaton (EVA). It enables us to compute new active nodes and their associated states efficiently by table lookups. In order to support large distance thresholds, we devise a partitioning scheme to reduce the size and construction cost of the automaton, which results in the universal partitioned EVA (UPEVA) to handle arbitrarily large thresholds. Our extensive evaluation demonstrates that our proposed method outperforms existing approaches in both space and time efficiencies.
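The problem BEVA optimizes can be stated via a naive baseline: the "active nodes" are the indexed prefixes within edit distance tau of the query. The hedged sketch below recomputes a full Levenshtein DP per prefix; BEVA's automaton-based evaluation exists precisely to avoid this redundant per-node state.

```python
def active_prefixes(words, query, tau):
    """Return all prefixes of the indexed words within edit distance
    tau of the query (naive baseline, for illustration only)."""
    prefixes = {""}
    for w in words:
        for i in range(1, len(w) + 1):
            prefixes.add(w[:i])
    active = set()
    for p in prefixes:
        # standard Levenshtein DP between prefix p and the query
        prev = list(range(len(query) + 1))
        for i, c in enumerate(p, 1):
            cur = [i]
            for j, q in enumerate(query, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                               prev[j - 1] + (c != q)))
            prev = cur
        if prev[-1] <= tau:
            active.add(p)
    return active

print(sorted(active_prefixes(["data", "dates"], "dat", 1)))
# ['da', 'dat', 'data', 'date']
```

In a trie-based index these prefixes correspond to trie nodes, and completions are enumerated from the subtrees below the active nodes.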
引用次数: 8
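The active-node concept in the abstract above (trie nodes whose path label is within the edit-distance threshold of the query string) can be illustrated with a plain dynamic-programming traversal of a trie. This is a minimal sketch of error-tolerant prefix matching, not the paper's EVA/UPEVA structures; `TrieNode`, `active_nodes`, and `completions` are our own illustrative names.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_word = False

def insert(root, word):
    """Add a string to the trie."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def active_nodes(root, query, tau):
    """Return (label, node) pairs whose path label is within edit distance tau of query."""
    # DP row: edit distance between the node's path label and each prefix of query.
    first_row = list(range(len(query) + 1))
    results = []
    stack = [(root, "", first_row)]
    while stack:
        node, label, row = stack.pop()
        if row[-1] <= tau:               # distance to the full query within threshold
            results.append((label, node))
        for ch, child in node.children.items():
            new_row = [row[0] + 1]
            for j in range(1, len(query) + 1):
                cost = 0 if query[j - 1] == ch else 1
                new_row.append(min(new_row[j - 1] + 1,   # insertion
                                   row[j] + 1,            # deletion
                                   row[j - 1] + cost))    # substitution/match
            if min(new_row) <= tau:      # prune: row minima never decrease below this
                stack.append((child, label + ch, new_row))
    return results

def completions(root, query, tau):
    """Collect all words below any active node (the error-tolerant completions)."""
    out = set()
    for label, node in active_nodes(root, query, tau):
        stack = [(node, label)]
        while stack:
            n, s = stack.pop()
            if n.is_word:
                out.add(s)
            for ch, c in n.children.items():
                stack.append((c, s + ch))
    return out
```

The pruning step is sound because the minimum value in a DP row cannot decrease as the path label grows, so once every entry exceeds the threshold no descendant can become active.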
Editorial: Updates to the Editorial Board
IF 1.8 Tier 2 Computer Science Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2016-04-01 DOI: 10.1145/2893581
Christian S. Jensen
ACM Transactions on Database Systems, pages 1e:1.
Citations: 0
SCANRAW: A Database Meta-Operator for Parallel In-Situ Processing and Loading
IF 1.8 Tier 2 Computer Science Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2015-10-23 DOI: 10.1145/2818181
Yu Cheng, Florin Rusu
Traditional databases incur a significant data-to-query delay because data must be loaded into the system before it can be queried. Since this is not acceptable in many domains that generate massive amounts of raw data (e.g., genomics), databases are discarded entirely. External tables, on the other hand, provide instant SQL querying over raw files, but their performance across a query workload is limited by the speed of repeatedly scanning, tokenizing, and parsing the entire file. In this article, we propose SCANRAW, a novel database meta-operator for in-situ processing over raw files that integrates data loading and external tables seamlessly while preserving their advantages: optimal performance across a query workload and zero time-to-query. We decompose loading and external table processing into atomic stages in order to identify common functionality, analyze alternative implementations, and discuss possible optimizations for each stage. Our major contribution is a parallel superscalar pipeline implementation that allows SCANRAW to take advantage of current multicore and many-core processors by overlapping the execution of independent stages. Moreover, SCANRAW overlaps query processing with loading by speculatively using the additional I/O bandwidth that arises during the conversion process to store data in the database, so that subsequent queries execute faster. As a result, SCANRAW makes intelligent use of the available system resources (CPU cycles and I/O bandwidth) by switching dynamically between tasks to ensure optimal performance. We implement SCANRAW in a state-of-the-art database system and evaluate its performance across a variety of synthetic and real-world datasets. Our results show that SCANRAW with speculative loading achieves the best possible performance for a query sequence at any point in the processing. Moreover, SCANRAW maximizes resource utilization for the entire workload execution while speculatively loading data and without interfering with normal query processing.
ACM Transactions on Database Systems, pages 19:1-19:45.
Citations: 20
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1