
Latest publications: Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory

Constant-delay enumeration for SLP-compressed documents
Martin Muñoz, Cristian Riveros
We study the problem of enumerating results from a query over a compressed document. The model we use for compression are straight-line programs (SLPs), which are defined by a context-free grammar that produces a single string. For our queries, we use a model called Annotated Automata, an extension of regular automata that allows annotations on letters. This model extends the notion of Regular Spanners as it allows arbitrarily long outputs. Our main result is an algorithm that evaluates such a query by enumerating all results with output-linear delay after a preprocessing phase which takes linear time on the size of the SLP, and cubic time over the size of the automaton. This is an improvement over Schmid and Schweikardt's result, which, with the same preprocessing time, enumerates with a delay that is logarithmic on the size of the uncompressed document. We achieve this through a persistent data structure named Enumerable Compact Sets with Shifts which guarantees output-linear delay under certain restrictions. These results imply constant-delay enumeration algorithms in the context of regular spanners. Further, we use an extension of annotated automata which utilizes succinctly encoded annotations to save an exponential factor from previous results that dealt with constant-delay enumeration over vset automata. Lastly, we extend our results in the same fashion Schmid and Schweikardt did to allow complex document editing while maintaining the constant delay guarantee.
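To make the compression model concrete, here is a minimal Python sketch of an SLP (illustrative only, not the paper's enumeration algorithm): every nonterminal has exactly one rule, either a single terminal or the concatenation of two nonterminals, so the grammar derives exactly one string, which can be exponentially longer than the grammar itself.

```python
# A straight-line program (SLP): each nonterminal has exactly one rule,
# either a single terminal character or a concatenation of two nonterminals.
# These rules are a hypothetical example producing "abababab".
slp = {
    "X1": "a",            # terminal rule
    "X2": "b",            # terminal rule
    "X3": ("X1", "X2"),   # X3 -> X1 X2  ("ab")
    "X4": ("X3", "X3"),   # X4 -> X3 X3  ("abab")
    "X5": ("X4", "X4"),   # X5 -> X4 X4  ("abababab")
}

def expand(symbol, rules, memo=None):
    """Fully expand a nonterminal to the string it derives.

    Memoised so each rule is expanded once; the output can still be
    exponentially longer than the grammar, which is exactly why
    enumeration algorithms avoid decompressing the document."""
    if memo is None:
        memo = {}
    if symbol in memo:
        return memo[symbol]
    rule = rules[symbol]
    if isinstance(rule, str):          # terminal
        result = rule
    else:                              # concatenation of two nonterminals
        left, right = rule
        result = expand(left, rules, memo) + expand(right, rules, memo)
    memo[symbol] = result
    return result

print(expand("X5", slp))  # -> "abababab" (8 characters from 5 rules)
```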
{"title":"Constant-delay enumeration for SLP-compressed documents","authors":"Martin Muñoz, Cristian Riveros","doi":"10.48550/arXiv.2209.12301","DOIUrl":"https://doi.org/10.48550/arXiv.2209.12301","url":null,"abstract":"We study the problem of enumerating results from a query over a compressed document. The model we use for compression are straight-line programs (SLPs), which are defined by a context-free grammar that produces a single string. For our queries, we use a model called Annotated Automata, an extension of regular automata that allows annotations on letters. This model extends the notion of Regular Spanners as it allows arbitrarily long outputs. Our main result is an algorithm that evaluates such a query by enumerating all results with output-linear delay after a preprocessing phase which takes linear time on the size of the SLP, and cubic time over the size of the automaton. This is an improvement over Schmid and Schweikardt's result, which, with the same preprocessing time, enumerates with a delay that is logarithmic on the size of the uncompressed document. We achieve this through a persistent data structure named Enumerable Compact Sets with Shifts which guarantees output-linear delay under certain restrictions. These results imply constant-delay enumeration algorithms in the context of regular spanners. Further, we use an extension of annotated automata which utilizes succinctly encoded annotations to save an exponential factor from previous results that dealt with constant-delay enumeration over vset automata. Lastly, we extend our results in the same fashion Schmid and Schweikardt did to allow complex document editing while maintaining the constant delay guarantee.","PeriodicalId":90482,"journal":{"name":"Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84960906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Conjunctive Queries with Free Access Patterns Under Updates
A. Kara, M. Nikolic, Dan Olteanu, Haozhe Zhang
We study the problem of answering conjunctive queries with free access patterns under updates. A free access pattern is a partition of the free variables of the query into input and output. The query returns tuples over the output variables given a tuple of values over the input variables. We introduce a fully dynamic evaluation approach for such queries. We also give a syntactic characterisation of those queries that admit constant time per single-tuple update and whose output tuples can be enumerated with constant delay given an input tuple. Finally, we chart the complexity trade-off between the preprocessing time, update time and enumeration delay for such queries. For a class of queries, our approach achieves optimal, albeit non-constant, update time and delay. Their optimality is predicated on the Online Matrix-Vector Multiplication conjecture. Our results recover prior work on the dynamic evaluation of conjunctive queries without access patterns.
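As an illustration of what a free access pattern means operationally, the following sketch (hypothetical relations and query, not the paper's update-aware data structures) evaluates Q(x; y) :- R(x, z), S(z, y), where x is the input variable and y the output variable.

```python
# Toy conjunctive query Q(x; y) :- R(x, z), S(z, y) with access pattern:
# x is input, y is output. Given a value for x, enumerate all y.
# (Hypothetical data; the paper's contribution is maintaining such
# enumeration efficiently under single-tuple updates.)
R = {("a", 1), ("a", 2), ("b", 2)}
S = {(1, "p"), (2, "q"), (2, "r")}

def answers(x):
    """Enumerate output values y for the given input x, without duplicates."""
    seen = set()
    for (rx, z) in R:
        if rx != x:
            continue
        for (sz, y) in S:
            if sz == z and y not in seen:
                seen.add(y)
                yield y

print(sorted(answers("a")))  # ['p', 'q', 'r']
```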
{"title":"Conjunctive Queries with Free Access Patterns Under Updates","authors":"A. Kara, M. Nikolic, Dan Olteanu, Haozhe Zhang","doi":"10.4230/LIPIcs.ICDT.2023.17","DOIUrl":"https://doi.org/10.4230/LIPIcs.ICDT.2023.17","url":null,"abstract":"We study the problem of answering conjunctive queries with free access patterns under updates. A free access pattern is a partition of the free variables of the query into input and output. The query returns tuples over the output variables given a tuple of values over the input variables. We introduce a fully dynamic evaluation approach for such queries. We also give a syntactic characterisation of those queries that admit constant time per single-tuple update and whose output tuples can be enumerated with constant delay given an input tuple. Finally, we chart the complexity trade-off between the preprocessing time, update time and enumeration delay for such queries. For a class of queries, our approach achieves optimal, albeit non-constant, update time and delay. Their optimality is predicated on the Online Matrix-Vector Multiplication conjecture. Our results recover prior work on the dynamic evaluation of conjunctive queries without access patterns.","PeriodicalId":90482,"journal":{"name":"Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76280925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Absolute Expressiveness of Subgraph-Based Centrality Measures
Andreas Pieris, J. Salas
In graph-based applications, a common task is to pinpoint the most important or "central" vertex in a (directed or undirected) graph, or rank the vertices of a graph according to their importance. To this end, a plethora of so-called centrality measures have been proposed in the literature. Such measures assess which vertices in a graph are the most important ones by analyzing the structure of the underlying graph. A family of centrality measures that are suited for graph databases has been recently proposed by relying on the following simple principle: the importance of a vertex in a graph is relative to the number of "relevant" connected subgraphs surrounding it; we refer to the members of this family as subgraph-based centrality measures. Although it has been shown that such measures enjoy several favourable properties, their absolute expressiveness remains largely unexplored. The goal of this work is to precisely characterize the absolute expressiveness of the family of subgraph-based centrality measures by considering both directed and undirected graphs. To this end, we characterize when an arbitrary centrality measure is a subgraph-based one, or a subgraph-based measure relative to the induced ranking. These characterizations provide us with technical tools that allow us to determine whether well-established centrality measures are subgraph-based. Such a classification, apart from being interesting in its own right, gives useful insights on the structural similarities and differences among existing centrality measures.
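For intuition, here is a brute-force sketch of one member of the subgraph-based family (the "all connected subgraphs" flavour), which scores a vertex by the number of connected induced subgraphs containing it. The graph and implementation are illustrative assumptions; the enumeration is exponential and only suitable for tiny inputs.

```python
from itertools import combinations

# Toy undirected graph (a path 1-2-3-4) as an adjacency dict.
G = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}

def is_connected(vertices):
    """Check that the subgraph induced by `vertices` is connected (DFS)."""
    vs = set(vertices)
    stack, seen = [next(iter(vs))], set()
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        stack.extend(G[v] & vs)
    return seen == vs

def subgraph_centrality(v):
    """Count the connected induced subgraphs that contain vertex v."""
    count = 0
    for size in range(1, len(G) + 1):
        for subset in combinations(G, size):
            if v in subset and is_connected(subset):
                count += 1
    return count

# Interior vertices of the path are more central than the endpoints:
print({v: subgraph_centrality(v) for v in G})  # {1: 4, 2: 6, 3: 6, 4: 4}
```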
{"title":"Absolute Expressiveness of Subgraph-Based Centrality Measures","authors":"Andreas Pieris, J. Salas","doi":"10.4230/LIPIcs.ICDT.2023.9","DOIUrl":"https://doi.org/10.4230/LIPIcs.ICDT.2023.9","url":null,"abstract":"In graph-based applications, a common task is to pinpoint the most important or ``central'' vertex in a (directed or undirected) graph, or rank the vertices of a graph according to their importance. To this end, a plethora of so-called centrality measures have been proposed in the literature. Such measures assess which vertices in a graph are the most important ones by analyzing the structure of the underlying graph. A family of centrality measures that are suited for graph databases has been recently proposed by relying on the following simple principle: the importance of a vertex in a graph is relative to the number of ``relevant'' connected subgraphs surrounding it; we refer to the members of this family as subgraph-based centrality measures. Although it has been shown that such measures enjoy several favourable properties, their absolute expressiveness remains largely unexplored. The goal of this work is to precisely characterize the absolute expressiveness of the family of subgraph-based centrality measures by considering both directed and undirected graphs. To this end, we characterize when an arbitrary centrality measure is a subgraph-based one, or a subgraph-based measure relative to the induced ranking. These characterizations provide us with technical tools that allow us to determine whether well-established centrality measures are subgraph-based. Such a classification, apart from being interesting in its own right, gives useful insights on the structural similarities and differences among existing centrality measures.","PeriodicalId":90482,"journal":{"name":"Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76970000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Probabilistic Query Evaluation with Bag Semantics
Martin Grohe, P. Lindner, Christoph Standke
We study the complexity of evaluating queries on probabilistic databases under bag semantics. We focus on self-join free conjunctive queries, and probabilistic databases where occurrences of different facts are independent, which is the natural generalization of tuple-independent probabilistic databases to the bag semantics setting. For set semantics, the data complexity of this problem is well understood, even for the more general class of unions of conjunctive queries: it is either in polynomial time, or #P-hard, depending on the query (Dalvi & Suciu, JACM 2012). A reasonably general model of bag probabilistic databases may have unbounded multiplicities. In this case, the probabilistic database is no longer finite, and a careful treatment of representation mechanisms is required. Moreover, the answer to a Boolean query is a probability distribution over (possibly all) non-negative integers, rather than a probability distribution over { true, false }. Therefore, we discuss two flavors of probabilistic query evaluation: computing expectations of answer tuple multiplicities, and computing the probability that a tuple is contained in the answer at most k times for some parameter k. Subject to mild technical assumptions on the representation systems, it turns out that expectations are easy to compute, even for unions of conjunctive queries. For query answer probabilities, we obtain a dichotomy between solvability in polynomial time and #P-hardness for self-join free conjunctive queries.
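The expectation computation can be illustrated on a toy join. The encoding below, where each fact carries a probability and a fixed multiplicity if present, is a simplifying assumption; the paper handles more general multiplicity distributions.

```python
# Toy bag-probabilistic database: each fact occurs independently with a
# given probability and, when present, with a fixed multiplicity.
R = {("a",): (0.5, 2), ("b",): (0.9, 1)}   # value -> (prob, multiplicity)
S = {("a",): (0.8, 3), ("b",): (0.5, 4)}

def expected_answer_multiplicity():
    """E[|Q|] for the Boolean join Q() :- R(x), S(x) under bag semantics.

    The answer multiplicity is the sum, over joining pairs, of the product
    of their multiplicities; by independence and linearity of expectation
    this factorises into a sum of products of expected multiplicities."""
    total = 0.0
    for x, (p_r, m_r) in R.items():
        if x in S:
            p_s, m_s = S[x]
            total += (p_r * m_r) * (p_s * m_s)
    return total

print(expected_answer_multiplicity())  # 0.5*2*0.8*3 + 0.9*1*0.5*4 = 4.2
```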
{"title":"Probabilistic Query Evaluation with Bag Semantics","authors":"Martin Grohe, P. Lindner, Christoph Standke","doi":"10.4230/LIPIcs.ICDT.2023.20","DOIUrl":"https://doi.org/10.4230/LIPIcs.ICDT.2023.20","url":null,"abstract":"We study the complexity of evaluating queries on probabilistic databases under bag semantics. We focus on self-join free conjunctive queries, and probabilistic databases where occurrences of different facts are independent, which is the natural generalization of tuple-independent probabilistic databases to the bag semantics setting. For set semantics, the data complexity of this problem is well understood, even for the more general class of unions of conjunctive queries: it is either in polynomial time, or #P-hard, depending on the query (Dalvi&Suciu, JACM 2012). A reasonably general model of bag probabilistic databases may have unbounded multiplicities. In this case, the probabilistic database is no longer finite, and a careful treatment of representation mechanisms is required. Moreover, the answer to a Boolean query is a probability distribution over (possibly all) non-negative integers, rather than a probability distribution over { true, false }. Therefore, we discuss two flavors of probabilistic query evaluation: computing expectations of answer tuple multiplicities, and computing the probability that a tuple is contained in the answer at most k times for some parameter k. Subject to mild technical assumptions on the representation systems, it turns out that expectations are easy to compute, even for unions of conjunctive queries. For query answer probabilities, we obtain a dichotomy between solvability in polynomial time and #P-hardness for self-join free conjunctive queries.","PeriodicalId":90482,"journal":{"name":"Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73106414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Improved Approximation and Scalability for Fair Max-Min Diversification
Raghavendra Addanki, A. Mcgregor, A. Meliou, Zafeiria Moumoulidou
Given an $n$-point metric space $(\mathcal{X},d)$ where each point belongs to one of $m=O(1)$ different categories or groups and a set of integers $k_1, \ldots, k_m$, the fair Max-Min diversification problem is to select $k_i$ points belonging to category $i \in [m]$, such that the minimum pairwise distance between selected points is maximized. The problem was introduced by Moumoulidou et al. [ICDT 2021] and is motivated by the need to down-sample large data sets in various applications so that the derived sample achieves a balance over diversity, i.e., the minimum distance between a pair of selected points, and fairness, i.e., ensuring enough points of each category are included. We prove the following results: 1. We first consider general metric spaces. We present a randomized polynomial time algorithm that returns a factor $2$-approximation to the diversity but only satisfies the fairness constraints in expectation. Building upon this result, we present a $6$-approximation that is guaranteed to satisfy the fairness constraints up to a factor $1-\epsilon$ for any constant $\epsilon$. We also present a linear time algorithm returning an $m+1$ approximation with exact fairness. The best previous result was a $3m-1$ approximation. 2. We then focus on Euclidean metrics. We first show that the problem can be solved exactly in one dimension. For constant dimensions, categories and any constant $\epsilon>0$, we present a $1+\epsilon$ approximation algorithm that runs in $O(nk) + 2^{O(k)}$ time where $k=k_1+\ldots+k_m$. We can improve the running time to $O(nk)+ \mathrm{poly}(k)$ at the expense of only picking $(1-\epsilon) k_i$ points from category $i \in [m]$. Finally, we present algorithms suitable to processing massive data sets including single-pass data stream algorithms and composable coresets for the distributed processing.
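As a specification of the problem (not one of the paper's approximation algorithms), the following brute-force sketch solves a tiny hypothetical instance exactly; it is exponential in the selection size and only serves to pin down the objective.

```python
from itertools import combinations

# Tiny fair Max-Min instance: 1-D points in two groups, k_A = k_B = 1,
# with distance d(x, y) = |x - y|. Data is a hypothetical example.
points = [(0.0, "A"), (1.0, "A"), (4.0, "B"), (5.0, "B")]
quota = {"A": 1, "B": 1}

def min_pairwise_distance(selection):
    return min(abs(p - q) for (p, _), (q, _) in combinations(selection, 2))

def brute_force_fair_maxmin():
    """Enumerate all selections meeting the per-group quotas exactly and
    return one maximising the minimum pairwise distance."""
    k = sum(quota.values())
    best, best_val = None, float("-inf")
    for sel in combinations(points, k):
        counts = {g: 0 for g in quota}
        for _, g in sel:
            counts[g] += 1
        if counts != quota:          # fairness: quotas must be met exactly
            continue
        val = min_pairwise_distance(sel)
        if val > best_val:
            best, best_val = sel, val
    return best, best_val

print(brute_force_fair_maxmin())  # ((0.0, 'A'), (5.0, 'B')), 5.0
```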
{"title":"Improved Approximation and Scalability for Fair Max-Min Diversification","authors":"Raghavendra Addanki, A. Mcgregor, A. Meliou, Zafeiria Moumoulidou","doi":"10.4230/LIPIcs.ICDT.2022.7","DOIUrl":"https://doi.org/10.4230/LIPIcs.ICDT.2022.7","url":null,"abstract":"Given an $n$-point metric space $(mathcal{X},d)$ where each point belongs to one of $m=O(1)$ different categories or groups and a set of integers $k_1, ldots, k_m$, the fair Max-Min diversification problem is to select $k_i$ points belonging to category $iin [m]$, such that the minimum pairwise distance between selected points is maximized. The problem was introduced by Moumoulidou et al. [ICDT 2021] and is motivated by the need to down-sample large data sets in various applications so that the derived sample achieves a balance over diversity, i.e., the minimum distance between a pair of selected points, and fairness, i.e., ensuring enough points of each category are included. We prove the following results: 1. We first consider general metric spaces. We present a randomized polynomial time algorithm that returns a factor $2$-approximation to the diversity but only satisfies the fairness constraints in expectation. Building upon this result, we present a $6$-approximation that is guaranteed to satisfy the fairness constraints up to a factor $1-epsilon$ for any constant $epsilon$. We also present a linear time algorithm returning an $m+1$ approximation with exact fairness. The best previous result was a $3m-1$ approximation. 2. We then focus on Euclidean metrics. We first show that the problem can be solved exactly in one dimension. For constant dimensions, categories and any constant $epsilon>0$, we present a $1+epsilon$ approximation algorithm that runs in $O(nk) + 2^{O(k)}$ time where $k=k_1+ldots+k_m$. We can improve the running time to $O(nk)+ poly(k)$ at the expense of only picking $(1-epsilon) k_i$ points from category $iin [m]$. Finally, we present algorithms suitable to processing massive data sets including single-pass data stream algorithms and composable coresets for the distributed processing.","PeriodicalId":90482,"journal":{"name":"Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-01-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86322004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Certifiable Robustness for Nearest Neighbor Classifiers
Austen Z. Fan, Paraschos Koutris
ML models are typically trained using large datasets of high quality. However, training datasets often contain inconsistent or incomplete data. To tackle this issue, one solution is to develop algorithms that can check whether a prediction of a model is certifiably robust. Given a learning algorithm that produces a classifier and given an example at test time, a classification outcome is certifiably robust if it is predicted by every model trained across all possible worlds (repairs) of the uncertain (inconsistent) dataset. This notion of robustness falls naturally under the framework of certain answers. In this paper, we study the complexity of certifying robustness for a simple but widely deployed classification algorithm, $k$-Nearest Neighbors ($k$-NN). Our main focus is on inconsistent datasets when the integrity constraints are functional dependencies (FDs). For this setting, we establish a dichotomy in the complexity of certifying robustness w.r.t. the set of FDs: the problem either admits a polynomial time algorithm, or it is coNP-hard. Additionally, we exhibit a similar dichotomy for the counting version of the problem, where the goal is to count the number of possible worlds that predict a certain label. As a byproduct of our study, we also establish the complexity of a problem related to finding an optimal subset repair that may be of independent interest.
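A small sketch of the definition on hypothetical data: brute force over all repairs, which is exactly the exponential blow-up the paper's dichotomy avoids whenever a polynomial-time algorithm exists.

```python
from itertools import product

# Inconsistent training set: an FD says each id determines one
# (point, label) fact, but id 2 has two conflicting facts.
# A repair keeps exactly one fact per id. (Hypothetical data.)
facts = {
    1: [((0.0,), "red")],
    2: [((2.0,), "red"), ((2.5,), "blue")],   # FD violation on id 2
    3: [((6.0,), "blue")],
}

def nn_label(train, x):
    """1-NN prediction (k = 1) under Euclidean distance in one dimension."""
    return min(train, key=lambda t: abs(t[0][0] - x))[1]

def certifiably_robust(x):
    """The prediction for x is certifiably robust iff every repair
    yields the same label; this check is exponential in the number
    of FD violations."""
    labels = {nn_label(repair, x) for repair in product(*facts.values())}
    return len(labels) == 1, labels

print(certifiably_robust(0.5))   # (True, {'red'})          -- repairs agree
print(certifiably_robust(2.2))   # (False, {'red', 'blue'}) -- repairs disagree
```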
{"title":"Certifiable Robustness for Nearest Neighbor Classifiers","authors":"Austen Z. Fan, Paraschos Koutris","doi":"10.4230/LIPIcs.ICDT.2022.6","DOIUrl":"https://doi.org/10.4230/LIPIcs.ICDT.2022.6","url":null,"abstract":"ML models are typically trained using large datasets of high quality. However, training datasets often contain inconsistent or incomplete data. To tackle this issue, one solution is to develop algorithms that can check whether a prediction of a model is certifiably robust. Given a learning algorithm that produces a classifier and given an example at test time, a classification outcome is certifiably robust if it is predicted by every model trained across all possible worlds (repairs) of the uncertain (inconsistent) dataset. This notion of robustness falls naturally under the framework of certain answers. In this paper, we study the complexity of certifying robustness for a simple but widely deployed classification algorithm, $k$-Nearest Neighbors ($k$-NN). Our main focus is on inconsistent datasets when the integrity constraints are functional dependencies (FDs). For this setting, we establish a dichotomy in the complexity of certifying robustness w.r.t. the set of FDs: the problem either admits a polynomial time algorithm, or it is coNP-hard. Additionally, we exhibit a similar dichotomy for the counting version of the problem, where the goal is to count the number of possible worlds that predict a certain label. As a byproduct of our study, we also establish the complexity of a problem related to finding an optimal subset repair that may be of independent interest.","PeriodicalId":90482,"journal":{"name":"Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91270733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Characterising Fixed Parameter Tractability for Query Evaluation over Guarded TGDs
C. Feier
We consider the parameterized complexity of evaluating Ontology Mediated Queries (OMQ) based on Guarded TGDs (GTGD) and Unions of Conjunctive Queries, in the case where relational symbols have unrestricted arity and where the parameter is the size of the OMQ. We establish exact criteria for fixed-parameter tractable (fpt) evaluation of recursively enumerable (r.e.) classes of such OMQs (under the widely held Exponential Time Hypothesis). One of the main technical tools introduced in the paper is an fpt-reduction from deciding parameterized uniform CSPs to parameterized OMQ evaluation. The reduction preserves measures known to be essential for classifying r.e. classes of parameterized uniform CSPs: submodular width (according to the well known result of Marx for unrestricted-arity schemas) and treewidth (according to the well known result of Grohe for bounded-arity schemas). As such, it can be employed to obtain hardness results for evaluation of r.e. classes of parameterized OMQs based on GTGD both in the unrestricted and in the bounded arity case. Previously, for bounded arity schemas, this has been tackled using a technique requiring full introspection into the construction employed by Grohe.
{"title":"Characterising Fixed Parameter Tractability for Query Evaluation over Guarded TGDs","authors":"C. Feier","doi":"10.4230/LIPIcs.ICDT.2022.12","DOIUrl":"https://doi.org/10.4230/LIPIcs.ICDT.2022.12","url":null,"abstract":"We consider the parameterized complexity of evaluating Ontology Mediated Queries (OMQ) based on Guarded TGDs (GTGD) and Unions of Conjunctive Queries, in the case where relational symbols have unrestricted arity and where the parameter is the size of the OMQ. We establish exact criteria for fixed-parameter tractable (fpt) evaluation of recursively enumerable (r.e.) classes of such OMQs (under the widely held Exponential Time Hypothesis). One of the main technical tools introduced in the paper is an fpt-reduction from deciding parameterized uniform CSPs to parameterized OMQ evaluation. The reduction preserves measures known to be essential for classifying r.e. classes of parameterized uniform CSPs: submodular width (according to the well known result of Marx for unrestricted-arity schemas) and treewidth (according to the well known result of Grohe for bounded-arity schemas). As such, it can be employed to obtain hardness results for evaluation of r.e. classes of parameterized OMQs based on GTGD both in the unrestricted and in the bounded arity case. Previously, for bounded arity schemas, this has been tackled using a technique requiring full introspection into the construction employed by Grohe. 2012 ACM Subject Classification Theory of computation → Database theory","PeriodicalId":90482,"journal":{"name":"Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78343398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Inference of Shape Graphs for Graph Databases
B. Groz, Aurélien Lemay, S. Staworko, Piotr Wieczorek
We investigate the problem of constructing a shape graph that describes the structure of a given graph database. We employ the framework of grammatical inference, where the objective is to find an inference algorithm that is both sound, i.e., always producing a schema that validates the input graph, and complete, i.e., able to produce any schema, within a given class of schemas, provided that a sufficiently informative input graph is presented. We identify a number of fundamental limitations that preclude feasible inference. We present inference algorithms based on natural approaches that allow us to infer schemas that we argue to be of practical importance.
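To illustrate what soundness means in this setting, here is a deliberately simple inference sketch; the graph encoding is a hypothetical assumption, and real shape graphs are richer (e.g., with cardinality constraints).

```python
# Toy edge-labelled graph: (source, edge label, target) triples, plus a
# node-labelling that assigns each node to a type.
nodes = {"n1": "Person", "n2": "Person", "n3": "City"}
edges = [("n1", "livesIn", "n3"), ("n2", "livesIn", "n3"), ("n1", "knows", "n2")]

def infer_shapes():
    """A simple, *sound* inference: for each node type, collect the
    outgoing (edge label, target type) pairs actually observed.
    Every input node trivially conforms to the inferred shape;
    completeness relative to a schema class is the hard part."""
    shapes = {}
    for src, lbl, tgt in edges:
        shapes.setdefault(nodes[src], set()).add((lbl, nodes[tgt]))
    for label in nodes.values():       # types with no outgoing edges
        shapes.setdefault(label, set())
    return shapes

print(infer_shapes())
# e.g. {'Person': {('livesIn', 'City'), ('knows', 'Person')}, 'City': set()}
```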
{"title":"Inference of Shape Graphs for Graph Databases","authors":"B. Groz, Aurélien Lemay, S. Staworko, Piotr Wieczorek","doi":"10.4230/LIPIcs.ICDT.2022.14","DOIUrl":"https://doi.org/10.4230/LIPIcs.ICDT.2022.14","url":null,"abstract":"We investigate the problem of constructing a shape graph that describes the structure of a given graph database. We employ the framework of grammatical inference , where the objective is to find an inference algorithm that is both sound , i.e., always producing a schema that validates the input graph, and complete , i.e., able to produce any schema, within a given class of schemas, provided that a sufficiently informative input graph is presented. We identify a number of fundamental limitations that preclude feasible inference. We present inference algorithms based on natural approaches that allow to infer schemas that we argue to be of practical importance.","PeriodicalId":90482,"journal":{"name":"Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88980408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
Linear Programs with Conjunctive Queries
Florent Capelli, Nicolas Crosetti, Joachim Niehren, J. Ramon
In this paper, we study the problem of optimizing a linear program whose variables are the answers to a conjunctive query. For this we propose the language LP(CQ) for specifying linear programs whose constraints and objective functions depend on the answer sets of conjunctive queries. We contribute an efficient algorithm for solving programs in a fragment of LP(CQ). The naive approach constructs a linear program having as many variables as there are elements in the answer set of the queries. Our approach constructs a linear program having the same optimal value but fewer variables. This is done by exploiting the structure of the conjunctive queries using generalized hypertree decompositions of small width to factorize elements of the answer set together. We illustrate the various applications of LP(CQ) programs on three examples: optimizing deliveries of resources, minimizing noise for differential privacy, and computing the s-measure of patterns in graphs as needed for data mining.
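The naive encoding mentioned above can be sketched directly (hypothetical delivery data; one LP variable per query answer, which is precisely what the paper's factorized construction avoids):

```python
from scipy.optimize import linprog

# Toy LP(CQ)-style program: the conjunctive query
#   Q(w, c) :- Stock(w), Route(w, c)
# yields one LP variable per answer (warehouse, city): the amount shipped
# along that route. Maximise total shipment subject to warehouse capacities.
stock = {"w1": 10.0, "w2": 5.0}                  # warehouse -> capacity
route = [("w1", "c1"), ("w1", "c2"), ("w2", "c1")]
answers = [(w, c) for (w, c) in route if w in stock]

c = [-1.0] * len(answers)                        # linprog minimises: negate
A_ub, b_ub = [], []
for w, cap in stock.items():                     # one capacity row per warehouse
    A_ub.append([1.0 if a[0] == w else 0.0 for a in answers])
    b_ub.append(cap)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * len(answers))
print(dict(zip(answers, res.x)), -res.fun)       # total shipped = 15.0
```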
{"title":"Linear Programs with Conjunctive Queries","authors":"Florent Capelli, Nicolas Crosetti, Joachim Niehren, J. Ramon","doi":"10.4230/LIPIcs.ICDT.2022.5","DOIUrl":"https://doi.org/10.4230/LIPIcs.ICDT.2022.5","url":null,"abstract":"In this paper, we study the problem of optimizing a linear program whose variables are the answers to a conjunctive query. For this we propose the language LP(CQ) for specifying linear programs whose constraints and objective functions depend on the answer sets of conjunctive queries. We contribute an efficient algorithm for solving programs in a fragment of LP(CQ). The naive approach constructs a linear program having as many variables as there are elements in the answer set of the queries. Our approach constructs a linear program having the same optimal value but fewer variables. This is done by exploiting the structure of the conjunctive queries using generalized hypertree decompositions of small width to factorize elements of the answer set together. We illustrate the various applications of LP(CQ) programs on three examples: optimizing deliveries of resources, minimizing noise for differential privacy, and computing the s-measure of patterns in graphs as needed for data mining. We","PeriodicalId":90482,"journal":{"name":"Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88348284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
On the Hardness of Category Tree Construction
Shay Gershtein, Uri Avron, Ido Guy, T. Milo, Slava Novgorodov
Category trees, or taxonomies, are rooted trees where each node, called a category, corresponds to a set of related items. The construction of taxonomies has been studied in various domains, including e-commerce, document management, and question answering. Multiple algorithms for automating construction have been proposed, employing a variety of clustering approaches and crowdsourcing. However, no formal model to capture such categorization problems has been devised, and their complexity has not been studied. To address this, we propose in this work a combinatorial model that captures many practical settings and show that the aforementioned empirical approach has been warranted, as we prove strong inapproximability bounds for various problem variants and special cases when the goal is to produce a categorization of the maximum utility. In our model, the input is a set of $n$ weighted item sets that the tree would ideally contain as categories. Each category, rather than perfectly match the corresponding input set, is allowed to exceed a given threshold for a given similarity function. The goal is to produce a tree that maximizes the total weight of the sets for which it contains a matching category. A key parameter is an upper bound on the number of categories an item may belong to, which produces the hardness of the problem, as initially each item may be contained in an arbitrary number of input sets. For this model, we prove inapproximability bounds, of order $\tilde{\Theta}(\sqrt{n})$ or $\tilde{\Theta}(n)$, for various problem variants and special cases, loosely justifying the aforementioned heuristic approach. Our work includes reductions based on parameterized randomized constructions that highlight how various problem parameters and properties of the input may affect the hardness. Moreover, for the special case where the category must be identical to the corresponding input set, we devise an algorithm whose approximation guarantee depends solely on a more granular parameter, allowing improved worst-case guarantees. Finally, we also generalize our results to DAG-based and non-hierarchical categorization.
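The utility objective can be made concrete with a small sketch; the choice of Jaccard as the similarity function, the threshold, and the data are illustrative assumptions, and the hard optimisation part (choosing the categories) is omitted.

```python
# Utility of a candidate categorisation: sum the weights of the input
# sets that some chosen category matches above the similarity threshold.
input_sets = {                       # ideal categories, with weights
    "shoes":  ({"sneaker", "boot", "sandal"}, 3.0),
    "sports": ({"sneaker", "ball", "racket"}, 2.0),
}
categories = [{"sneaker", "boot", "sandal"}, {"ball", "racket"}]
tau = 0.7                            # similarity threshold

def jaccard(a, b):
    return len(a & b) / len(a | b)

def utility():
    """Total weight of input sets matched by at least one category."""
    total = 0.0
    for items, weight in input_sets.values():
        if any(jaccard(items, cat) >= tau for cat in categories):
            total += weight
    return total

# "shoes" is matched exactly (similarity 1.0); the best match for
# "sports" is 2/3 < 0.7, so only 3.0 is collected.
print(utility())  # 3.0
```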
{"title":"On the Hardness of Category Tree Construction","authors":"Shay Gershtein, Uri Avron, Ido Guy, T. Milo, Slava Novgorodov","doi":"10.4230/LIPIcs.ICDT.2022.4","DOIUrl":"https://doi.org/10.4230/LIPIcs.ICDT.2022.4","url":null,"abstract":"Category trees, or taxonomies, are rooted trees where each node, called a category, corresponds to a set of related items. The construction of taxonomies has been studied in various domains, including e-commerce, document management, and question answering. Multiple algorithms for automating construction have been proposed, employing a variety of clustering approaches and crowdsourcing. However, no formal model to capture such categorization problems has been devised, and their complexity has not been studied. To address this, we propose in this work a combinatorial model that captures many practical settings and show that the aforementioned empirical approach has been warranted, as we prove strong inapproximability bounds for various problem variants and special cases when the goal is to produce a categorization of the maximum utility. In our model, the input is a set of n weighted item sets that the tree would ideally contain as categories. Each category, rather than perfectly match the corresponding input set, is allowed to exceed a given threshold for a given similarity function. The goal is to produce a tree that maximizes the total weight of the sets for which it contains a matching category. A key parameter is an upper bound on the number of categories an item may belong to, which produces the hardness of the problem, as initially each item may be contained in an arbitrary number of input sets. For this model, we prove inapproximability bounds, of order ˜Θ( √ n ) or ˜Θ( n ), for various problem variants and special cases, loosely justifying the aforementioned heuristic approach. Our work includes reductions based on parameterized randomized constructions that highlight how various problem parameters and properties of the input may affect the hardness. Moreover, for the special case where the category must be identical to the corresponding input set, we devise an algorithm whose approximation guarantee depends solely on a more granular parameter, allowing improved worst-case guarantees. Finally, we also generalize our results to DAG-based and non-hierarchical categorization.","PeriodicalId":90482,"journal":{"name":"Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72814429","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1