
Latest publications in Advances in database technology : proceedings. International Conference on Extending Database Technology

Efficient Maintenance of Agree-Sets Against Dynamic Datasets
Khalid Belhajjame
Constraint discovery is a fundamental task in data profiling, which involves identifying the dependencies that are satisfied by a dataset. As published datasets are increasingly dynamic, a number of researchers have begun to investigate the problem of dependency discovery in dynamic datasets. Proposals so far in this area can be viewed as schema-based in the sense that they model and explore the solution space using a lattice built on the basis of the attributes (columns) of the dataset. It is recognized that proposals that belong to this class, like their static counterparts, tend to perform well for datasets with a large number of tuples but a small number of attributes. The second class of proposals, which has been examined for static datasets (but not in dynamic settings), is data-driven and is known to perform well for datasets with a large number of attributes and a small number of tuples. The main bottleneck of this class of solutions is the generation of agree-sets, which involves pairwise comparison of the tuples in the dataset. We present in this paper DynASt, a system for the efficient maintenance of agree-sets in dynamic datasets. We investigate the performance of DynASt and its scalability in terms of the number of tuples and the number of attributes of the target dataset. We also show that it outperforms existing (static and dynamic) state-of-the-art solutions for datasets with a large number of attributes.
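To make the bottleneck concrete: the agree-set of a pair of tuples is the set of attributes on which the two tuples hold equal values. The brute-force Python sketch below illustrates the baseline pairwise generation (not DynASt's incremental maintenance) and shows why it is quadratic in the number of tuples.

```python
from itertools import combinations

def agree_sets(tuples, attributes):
    """Compute the agree-set of every tuple pair by brute force.

    The quadratic pairwise comparison here is exactly the bottleneck
    that incremental maintenance seeks to avoid."""
    result = set()
    for t1, t2 in combinations(tuples, 2):
        agreeing = frozenset(a for a in attributes if t1[a] == t2[a])
        result.add(agreeing)
    return result

# Toy relation: one pair agrees on 'city', the other pairs on nothing.
rows = [
    {"name": "Ana", "city": "Paris"},
    {"name": "Bob", "city": "Paris"},
    {"name": "Cem", "city": "Izmir"},
]
print(agree_sets(rows, ["name", "city"]))
# {frozenset({'city'}), frozenset()}
```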
{"title":"Efficient Maintenance of Agree-Sets Against Dynamic Datasets","authors":"Khalid Belhajjame","doi":"10.48786/edbt.2023.02","DOIUrl":"https://doi.org/10.48786/edbt.2023.02","url":null,"abstract":"Constraint discovery is a fundamental task in data profiling, which involves identifying the dependencies that are satisfied by a dataset. As published datasets are increasingly dynamic, a number of researchers have begun to investigate the problem of dependencies’ discovery in dynamic datasets. Proposals this far in this area can be viewed as schema-based in the sense that they model and explore the solution space using a lattice built on the basis of the attributes (columns) of the dataset. It is recognized that proposals that belong to this class, like their static counterpart, tend to perform well for datasets with a large number of tuples but a small number of attributes. The second class of proposals that have been examined for static datasets (but not in dynamic settings) is data-driven and is known to perform well for datasets with a large number of attributes and a small number of tuples. The main bottleneck of this class of solutions is the generation of agree-sets, which involves pairwise comparison of the tuples in the dataset. We present in this paper DynASt , a system for the efficient maintenance of agree-sets in dynamic datasets. We investigate the performance of DynASt and its scalability in terms of the number of tuples and the number of attributes of the target dataset. We also show that it outperforms existing (static and dynamic) state-of-the-art solutions for datasets with a large number of attributes.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"28 1","pages":"14-26"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85184961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Tagger: A Tool for the Discovery of Tagged Unions in JSON Schema Extraction
Stefan Klessinger, Michael Fruth, Valentin Gittinger, Meike Klettke, U. Störl, Stefanie Scherzinger
This tool demo features an original approach to model inference or schema extraction from collections of JSON documents: We automatically detect tagged unions, an established design pattern in hand-crafted schemas for conditionally declaring subtypes. Our “Tagger” approach is based on the discovery of conditional functional dependencies in a relational encoding of JSON objects. We have integrated our prototype implementation in an open source tool for managing data models in schema-flexible NoSQL data stores. Demo participants can interactively apply different schema extraction algorithms to real-world inputs, and compare the extracted schemas with those produced by “Tagger”.
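As a rough illustration of what a tagged union looks like in JSON data (a simplified stand-in for Tagger's actual technique, which discovers conditional functional dependencies over a relational encoding), the sketch below flags attributes whose value functionally determines the document's key set:

```python
from collections import defaultdict

def candidate_tags(documents):
    """Return attributes whose value determines the document's key set,
    i.e. candidates for the discriminant of a tagged union."""
    tags = []
    keys_per_value = defaultdict(lambda: defaultdict(set))
    for doc in documents:
        for attr, value in doc.items():
            keys_per_value[attr][value].add(frozenset(doc.keys()))
    for attr, by_value in keys_per_value.items():
        # attr is a tag candidate if every value maps to exactly one shape
        # and at least two distinct shapes exist overall.
        shapes = [next(iter(s)) for s in by_value.values() if len(s) == 1]
        if len(shapes) == len(by_value) and len(set(shapes)) > 1:
            tags.append(attr)
    return tags

docs = [
    {"kind": "circle", "radius": 1.0},
    {"kind": "rect", "width": 2.0, "height": 3.0},
    {"kind": "circle", "radius": 4.5},
]
print(candidate_tags(docs))  # ['kind']
```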
{"title":"Tagger: A Tool for the Discovery of Tagged Unions in JSON Schema Extraction","authors":"Stefan Klessinger, Michael Fruth, Valentin Gittinger, Meike Klettke, U. Störl, Stefanie Scherzinger","doi":"10.48786/edbt.2023.75","DOIUrl":"https://doi.org/10.48786/edbt.2023.75","url":null,"abstract":"This tool demo features an original approach to model inference or schema extraction from collections of JSON documents: We automatically detect tagged unions, an established design pattern in hand-crafted schemas for conditionally declaring subtypes. Our “Tagger” approach is based on the discovery of conditional functional dependencies in a relational encoding of JSON objects. We have integrated our prototype implementation in an open source tool for managing data models in schema-flexible NoSQL data stores. Demo participants can interactively apply different schema extraction algorithms to real-world inputs, and compare the extracted schemas with those produced by “Tagger”.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"2096 1","pages":"827-830"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86552486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Blue Elephants Inspecting Pandas: Inspection and Execution of Machine Learning Pipelines in SQL
Maximilian E. Schüle
Data preprocessing, the step of transforming data into a suitable format for training a model, rarely happens within database systems but rather in external Python libraries and thus requires extraction from the database systems first. However, database systems are tuned for efficient data access and offer aggregate functions to calculate the distribution frequencies necessary to detect the under- or overrepresentation of a certain value within the data (bias). We argue that database systems with SQL are capable of executing machine learning pipelines and of efficiently discovering technical biases introduced by data preprocessing. Therefore, we present a set of SQL queries to cover data preprocessing and data inspection: during preprocessing, we annotate the tuples with an identifier to compute the distribution frequency of columns. To inspect distribution changes, we join the preprocessed dataset with the original one on the tuple identifier and use aggregate functions to count the number of occurrences per sensitive column. This allows us to detect operations which filter out tuples and thus introduce a technical bias, even for columns that preprocessing has removed. To automatically generate such queries, our implementation extends the mlinspect project to transpile existing data preprocessing pipelines written in Python to SQL queries, while maintaining detailed inspection results using views or common table expressions (CTEs). The evaluation shows that a modern beyond-main-memory database system, i.e., Umbra, accelerates the runtime for preprocessing and inspection. Even PostgreSQL, as a disk-based database system, shows performance similar to Umbra for inspection when materialising views.
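A minimal sketch of the inspection pattern described above, using SQLite from Python so the example is self-contained (the table and column names are illustrative, not mlinspect's generated SQL): tuples carry an identifier, the preprocessed relation is joined back to the original on that identifier, and an aggregate counts occurrences per sensitive value before and after the pipeline step.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE patients (tid INTEGER PRIMARY KEY, age_group TEXT, score REAL);
    INSERT INTO patients VALUES (1,'young',0.9),(2,'old',0.4),
                                (3,'old',0.2),(4,'young',0.8);
    -- A filtering step of the 'pipeline': keep high scores only.
    CREATE VIEW filtered AS SELECT * FROM patients WHERE score > 0.5;
""")

# Join the preprocessed view back on the tuple identifier and compare
# the distribution frequency of the sensitive column before and after.
query = """
    SELECT p.age_group,
           COUNT(p.tid) AS before_cnt,
           COUNT(f.tid) AS after_cnt
    FROM patients p LEFT JOIN filtered f ON p.tid = f.tid
    GROUP BY p.age_group;
"""
for row in con.execute(query):
    print(row)  # ('old', 2, 0) reveals filtering removed all 'old' tuples
```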
{"title":"Blue Elephants Inspecting Pandas: Inspection and Execution of Machine Learning Pipelines in SQL","authors":"Maximilian E. Schüle","doi":"10.48786/edbt.2023.04","DOIUrl":"https://doi.org/10.48786/edbt.2023.04","url":null,"abstract":"Data preprocessing, the step of transforming data into a suitable format for training a model, rarely happens within database systems but rather in external Python libraries and thus requires extraction from the database systems first. However, database systems are tuned for efficient data access and offer aggregate functions to calculate the distribution frequencies necessary to detect the under- or overrepresentation of a certain value within the data (bias). We argue that database systems with SQL are capable of executing machine learning pipelines as well as discovering technical biases—introduced by data preprocessing—efficiently. Therefore, we present a set of SQL queries to cover data preprocessing and data inspection: During preprocessing, we annotate the tuples with an identifier to compute the distribution frequency of columns. To inspect distribution changes, we join the prepro-cessed dataset with the original one on the tuple identifier and use aggregate functions to count the number of occurrences per sensitive column. This allows us to detect operations which filter out tuples and thus introduce a technical bias even for columns preprocessing has removed. To automatically generate such queries, our implementation extends the mlinspect project to transpile existing data preprocessing pipelines written in Python to SQL queries, while maintaining detailed inspection results using views or common table expressions (CTEs). The evaluation proves that a modern beyond main-memory database system, i.e. Umbra, accelerates the runtime for preprocessing and inspection. Even PostgreSQL as a disk-based database system shows similar performance for inspection to Umbra when materialising views.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"1 1","pages":"40-52"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87720589","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Learned Selection Strategy for Lightweight Integer Compression Algorithms
Lucas Woltmann, Patrick Damme, Claudio Hartmann, Dirk Habich, Wolfgang Lehner
Data compression has recently experienced a revival in the domain of in-memory column stores. In this field, a large corpus of lightweight integer compression algorithms plays a dominant role since all columns are typically encoded as sequences of integer values. Unfortunately, there is no single best integer compression algorithm, and the best algorithm depends on data and hardware properties. For this reason, selecting the best-fitting integer compression algorithm becomes more important and is an interesting tuning knob for optimization. However, traditional selection strategies require a profound knowledge of the (de-)compression algorithms for decision-making. This limits the broad applicability of the selection strategies. To counteract this, we propose a novel learned selection strategy by considering integer compression algorithms as independent black boxes. This black-box approach ensures broad applicability and requires machine learning-based methods to model the required knowledge for decision-making. Most importantly, we show that a local approach, where every algorithm is modeled individually, plays a crucial role. Moreover, our learned selection strategy generalizes through user-data-independence. Finally, we evaluate our approach and compare it against existing selection strategies to show the benefits of our learned selection strategy.
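The local black-box approach can be sketched as one regression model per algorithm, each predicting a cost from simple data features; the selector then picks the algorithm with the best prediction. The sketch below assumes scikit-learn as the learner and uses placeholder algorithm names, features, and synthetic training measurements:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# One local model per compression algorithm, each treated as a black box.
algorithms = ["FOR", "DELTA", "DICT", "RLE"]          # illustrative names
models = {a: RandomForestRegressor(random_state=0) for a in algorithms}

def data_features(column):
    """Hypothetical featurization: simple statistics of an integer column."""
    c = np.asarray(column)
    return [c.min(), c.max(), c.std(), len(np.unique(c)) / len(c)]

# Training: measured runtimes per algorithm on profiled sample columns.
train_cols = [np.random.randint(0, h, 1000) for h in (10, 10**3, 10**6)]
X = [data_features(c) for c in train_cols]
for a in algorithms:
    runtimes = np.random.rand(len(train_cols))        # stand-in measurements
    models[a].fit(X, runtimes)

def select(column):
    """Pick the algorithm whose local model predicts the lowest runtime."""
    x = [data_features(column)]
    return min(algorithms, key=lambda a: models[a].predict(x)[0])

print(select(np.random.randint(0, 100, 1000)))
```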
{"title":"Learned Selection Strategy for Lightweight Integer Compression Algorithms","authors":"Lucas Woltmann, Patrick Damme, Claudio Hartmann, Dirk Habich, Wolfgang Lehner","doi":"10.48786/edbt.2023.47","DOIUrl":"https://doi.org/10.48786/edbt.2023.47","url":null,"abstract":"Data compression has recently experienced a revival in the domain of in-memory column stores. In this field, a large corpus of lightweight integer compression algorithms plays a dominant role since all columns are typically encoded as sequences of integer values. Unfortunately, there is no single-best integer compression algorithm and the best algorithm depends on data and hardware properties. For this reason, selecting the best-fitting integer compression algorithm becomes more important and is an interesting tuning knob for optimization. However, traditional selection strategies require a profound knowledge of the (de-)compression algorithms for decision-making. This limits the broad applicability of the selection strategies. To counteract this, we propose a novel learned selection strategy by consider-ing integer compression algorithms as independent black boxes. This black-box approach ensures broad applicability and requires machine learning-based methods to model the required knowledge for decision-making. Most importantly, we show that a local approach, where every algorithm is modeled individually, plays a crucial role. Moreover, our learned selection strategy is generalized by user-data-independence. Finally, we evaluate our approach and compare our approach against existing selection strategies to show the benefits of our learned selection strategy .","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"77 1","pages":"552-564"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80984513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Enhanced Featurization of Queries with Mixed Combinations of Predicates for ML-based Cardinality Estimation
Magnus Müller, Lucas Woltmann, Wolfgang Lehner
Estimating query result sizes is a critical task in areas like query optimization. For some years now it has been popular to apply machine learning to this problem. However, surprisingly, there has been very little research yet on how to present queries to a machine learning model. Machine learning models do not simply consume SQL strings. Instead, a SQL string is transformed into a numerical representation. This transformation is called query featurization and is defined by a query featurization technique (QFT). This paper is concerned with QFTs for queries with many selection predicates. In particular, we consider queries that contain both predicates over different attributes and multiple predicates per attribute. We identify a desired property of query featurization and present three novel QFTs. To the best of our knowledge, we are the first to featurize queries with mixed combinations of predicates, i.e., containing both conjunctions and disjunctions. Our QFTs are model-independent and can serve as the query featurization layer for different machine learning model types. In our evaluation, we combine our QFTs with three different machine learning models. We demonstrate that the estimation accuracy of machine learning models significantly depends on the QFT used. In addition, we compare our best combination of QFT and machine learning model to various existing cardinality estimators.
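To illustrate what query featurization means in practice, the sketch below encodes each selection predicate as a one-hot attribute, a one-hot operator, and a normalized constant, and flags the boolean connective. This is a deliberately naive QFT for illustration only; the paper's three techniques are richer.

```python
import numpy as np

ATTRS = ["age", "salary", "dept"]      # illustrative schema
OPS = ["<", "<=", "=", ">=", ">"]

def featurize_predicate(attr, op, value, domain_max):
    """Encode one selection predicate as a fixed-length numeric vector:
    one-hot attribute + one-hot operator + min-max-normalized constant."""
    v = np.zeros(len(ATTRS) + len(OPS) + 1)
    v[ATTRS.index(attr)] = 1.0
    v[len(ATTRS) + OPS.index(op)] = 1.0
    v[-1] = value / domain_max
    return v

def featurize_query(predicates, is_disjunction):
    """Naive query featurization: sum the predicate vectors and append a
    flag for the boolean connective (AND vs. OR). This merely shows what
    'turning a SQL string into numbers' means."""
    vecs = [featurize_predicate(*p) for p in predicates]
    return np.concatenate([np.sum(vecs, axis=0), [float(is_disjunction)]])

# WHERE age < 30 OR age > 60   (attribute domain maximum 100)
q = featurize_query([("age", "<", 30, 100), ("age", ">", 60, 100)],
                    is_disjunction=True)
print(q)
```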
{"title":"Enhanced Featurization of Queries with Mixed Combinations of Predicates for ML-based Cardinality Estimation","authors":"Magnus Müller, Lucas Woltmann, Wolfgang Lehner","doi":"10.48786/edbt.2023.22","DOIUrl":"https://doi.org/10.48786/edbt.2023.22","url":null,"abstract":"Estimating query result sizes is a critical task in areas like query optimization. For some years now it has been popular to apply machine learning to this problem. However, surprisingly, there has been very little research yet on how to present queries to a machine learning model. Machine learning models do not simply consume SQL strings. Instead, a SQL string is transformed into a numerical representation. This transformation is called query featurization and is defined by a query featurization technique (QFT). This paper is concerned with QFTs for queries with many selection predicates. In particular, we consider queries that contain both predicates over different attributes and multiple predicates per attribute. We identify a desired property of query featurization and present three novel QFTs. To the best of our knowledge, we are the first to featurize queries with mixed combinations of predicates, i.e., containing both conjunctions and disjunctions. Our QFTs are model-independent and can serve as the query featurization layer for different machine learning model types. In our evaluation, we combine our QFTs with three different machine learning models. We demonstrate that the estimation accuracy of machine learning models significantly depends on the QFT used. In addition, we compare our best combination of QFT and machine learning model to various existing cardinality estimators.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"17 1","pages":"273-284"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78782454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Example-Driven Exploratory Analytics over Knowledge Graphs
Matteo Lissandrini, K. Hose, T. Pedersen
Due to their expressive power, Knowledge Graphs (KGs) have received increasing interest not only as means to structure and integrate heterogeneous information but also as a native storage format for large amounts of knowledge and statistical data. Therefore, analytical queries over KG data, typically stored as RDF, have become increasingly important. Yet, formulating such queries represents a difficult task for users that are not familiar with the query language (typically SPARQL) and the structure of the dataset at hand. To overcome this limitation, we propose Re2xOLAP: the first comprehensive interactive approach that allows users to reverse-engineer and refine RDF exploratory OLAP queries over KGs containing statistical data. Thus, Re2xOLAP enables users to perform KG exploratory analytics without writing any query at all. We achieve this goal by first reverse-engineering analytical SPARQL queries from a small set of user-provided examples; then, given the reverse-engineered query, we propose intuitive and explainable exploratory query refinements to iteratively help the user obtain the desired information. Our experiments on real-world large-scale KGs show that Re2xOLAP can efficiently reverse-engineer analytical SPARQL queries solely based on a small set of input examples. Additionally, we demonstrate the expressive power of our interactive refinement methods by showing that Re2xOLAP allows users to navigate hundreds of thousands of different exploration paths with just a few interactions.
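The kind of exploratory OLAP query that such reverse engineering produces can be sketched as a SPARQL aggregate template. In the hypothetical helper below the property IRIs are supplied directly, whereas Re2xOLAP infers them from the user-provided examples:

```python
def build_olap_sparql(dimension_prop, measure_prop, agg="SUM"):
    """Assemble the kind of exploratory OLAP query that example-driven
    reverse engineering would output: group observations by a dimension
    and aggregate a measure. Property IRIs here are placeholders."""
    return f"""
    SELECT ?dim ({agg}(?val) AS ?agg_val)
    WHERE {{
        ?obs <{dimension_prop}> ?dim ;
             <{measure_prop}>   ?val .
    }}
    GROUP BY ?dim
    """

print(build_olap_sparql(
    "http://example.org/refArea",      # hypothetical dimension property
    "http://example.org/population"))  # hypothetical measure property
```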
{"title":"Example-Driven Exploratory Analytics over Knowledge Graphs","authors":"Matteo Lissandrini, K. Hose, T. Pedersen","doi":"10.48786/edbt.2023.09","DOIUrl":"https://doi.org/10.48786/edbt.2023.09","url":null,"abstract":"Due to their expressive power, Knowledge Graphs (KGs) have received increasing interest not only as means to structure and integrate heterogeneous information but also as a native stor-age format for large amounts of knowledge and statistical data. Therefore, analytical queries over KG data, typically stored as RDF, have become increasingly important. Yet, formulating such queries represents a difficult task for users that are not familiar with the query language (typically SPARQL) and the structure of the dataset at hand. To overcome this limitation, we propose Re2xOLAP: the first comprehensive interactive approach that allows to reverse-engineer and refine RDF exploratory OLAP queries over KGs containing statistical data. Thus, Re2xOLAP enables to perform KG exploratory analytics without requiring the user to write any query at all. We achieve this goal by first reverse-engineering analytical SPARQL queries from a small set of user-provided examples and then, given the reverse-engineered query, we propose intuitive and explainable exploratory query refinements to iteratively help the user obtain the desired information. Our experiments on real-world large-scale KGs show that Re2xOLAP can efficiently reverse-engineer analytical SPARQL queries solely based on a small set of input examples. Additionally, we demonstrate the expressive power of our interactive refinement methods by showing that Re2xOLAP allows users to navigate hundreds of thousands of different exploration paths with just a few interactions.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"107 1","pages":"105-117"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91315196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Comprehensive Evaluation of Algorithms for Unrestricted Graph Alignment
Konstantinos Skitsas, Karol Orlowski, Judith Hermanns, D. Mottin, Panagiotis Karras
The graph alignment problem calls for finding a matching between the nodes of one graph and those of another graph, in a way that they correspond to each other by some fitness measure. Over the last years, several graph alignment algorithms have been proposed and evaluated on diverse datasets and quality measures. Typically, a newly proposed algorithm is compared to previously proposed ones on some specific datasets, types of noise, and quality measures where the new proposal achieves superiority over the previous ones. However, no systematic comparison of the proposed algorithms has been attempted on the same benchmarks. This paper fills this gap by conducting an extensive, thorough, and commensurable evaluation of state-of-the-art graph alignment algorithms. Our results highlight the value of overlooked solutions and an unprecedented effect of graph density on performance, hence call for further work.
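The quality measures mentioned above can be made concrete: node correctness, the fraction of nodes mapped to their ground-truth counterpart, is one widely reported measure. A minimal sketch:

```python
def node_correctness(alignment, ground_truth):
    """Fraction of nodes mapped to their true counterpart; one of the
    standard quality measures an alignment benchmark reports."""
    hits = sum(1 for u, v in alignment.items() if ground_truth.get(u) == v)
    return hits / len(ground_truth)

# Ground truth: node i of graph G1 corresponds to the listed node of G2.
truth     = {0: "a", 1: "b", 2: "c", 3: "d"}
predicted = {0: "a", 1: "d", 2: "c", 3: "b"}   # nodes 1 and 3 swapped
print(node_correctness(predicted, truth))       # 0.5
```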
{"title":"Comprehensive Evaluation of Algorithms for Unrestricted Graph Alignment","authors":"Konstantinos Skitsas, Karol Orlowski, Judith Hermanns, D. Mottin, Panagiotis Karras","doi":"10.48786/edbt.2023.21","DOIUrl":"https://doi.org/10.48786/edbt.2023.21","url":null,"abstract":"The graph alignment problem calls for finding a matching between the nodes of one graph and those of another graph, in a way that they correspond to each other by some fitness measure. Over the last years, several graph alignment algorithms have been proposed and evaluated on diverse datasets and quality measures. Typically, a newly proposed algorithm is compared to previously proposed ones on some specific datasets, types of noise, and quality measures where the new proposal achieves superiority over the previous ones. However, no systematic comparison of the proposed algorithms has been attempted on the same benchmarks. This paper fills this gap by conducting an extensive, thorough, and commensurable evaluation of state-of-the-art graph alignment algorithms. Our results highlight the value of overlooked solutions and an unprecedented effect of graph density on performance, hence call for further work.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"62 1","pages":"260-272"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84497800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Progressive Entity Resolution over Incremental Data
Leonardo Gazzarri, Melanie Herschel
Entity Resolution (ER) algorithms identify entity profiles corresponding to the same real-world entity among one or multiple data sets. Modern challenges for ER are posed by the volume, variety, and velocity that characterize Big Data. While progressive ER aims to efficiently solve the problem under time constraints by prioritizing useful work over superfluous work, incremental ER aims to incrementally produce results as new data increments come in. This paper presents algorithms that combine these two approaches in the context of streaming and heterogeneous data. The overall goal is to maximize the chances of spotting duplicates of a given entity profile at a moment closest to its arrival time (early quality), without relying on any schema information, while being sufficiently efficient to process large volumes of fast streaming data without compromising the eventual quality (by cutting too many corners for efficiency). Experiments validate that our algorithms are the first to support incremental and progressive ER and, compared to state-of-the-art incremental approaches, improve early quality, eventual quality, and system efficiency by progressively and adaptively performing the unexecuted comparisons that are more likely to match while waiting for the next stream input increment.
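A toy sketch of how progressiveness and incrementality can combine (the blocking key, similarity function, and threshold below are illustrative placeholders, not the paper's algorithms): as each profile arrives, only profiles in its block are compared, and pending comparisons are emitted best-first.

```python
import heapq

def progressive_matches(stream, block_key, similarity, threshold=0.8):
    """Sketch of progressive, incremental ER: as each profile arrives,
    only profiles sharing its blocking key become candidate matches, and
    candidate comparisons are ordered so the most promising run first."""
    blocks, queue = {}, []
    for profile in stream:                      # increments arrive over time
        key = block_key(profile)
        for other in blocks.get(key, []):
            sim = similarity(profile, other)
            heapq.heappush(queue, (-sim, profile["id"], other["id"]))
        blocks.setdefault(key, []).append(profile)
        # Emit the best pending comparisons first (early quality).
        while queue and -queue[0][0] >= threshold:
            sim, a, b = heapq.heappop(queue)
            yield a, b, -sim

people = [
    {"id": 1, "name": "Jon Smith"},
    {"id": 2, "name": "John Smith"},
    {"id": 3, "name": "Mary Jones"},
]
sim = lambda p, q: 1.0 if p["name"][0] == q["name"][0] else 0.1  # toy metric
for a, b, s in progressive_matches(people, lambda p: p["name"][0], sim):
    print(a, b, s)   # 2 1 1.0
```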
{"title":"Progressive Entity Resolution over Incremental Data","authors":"Leonardo Gazzarri, Melanie Herschel","doi":"10.48786/edbt.2023.07","DOIUrl":"https://doi.org/10.48786/edbt.2023.07","url":null,"abstract":"Entity Resolution (ER) algorithms identify entity profiles corresponding to the same real-world entity among one or multiple data sets. Modern challenges for ER are posed by volume, variety, and velocity that characterize Big Data. While progressive ER aims to efficiently solve the problem under time constraints by prioritizing useful work over superfluous work, incremental ER aims to incrementally produce results as new data increments come in. This paper presents algorithms that combine these two approaches in the context of streaming and heterogeneous data. The overall goal is to maximize the chances to spot duplicates to a given entity profile in a moment closest to its arrival time (early quality), without relying on any schema information, while being sufficiently efficient to process large volumes of fast streaming data without compromising the eventual quality (by cutting too many corners for efficiency). Experiments validate that our algorithms are the first to support incremental and progressive ER and, compared to state-of-the-art incremental approaches, improve early quality, eventual quality, and system efficiency by progressively and adaptively performing the unexecuted comparisons that are more likely to match when waiting for the next stream input increment.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"37 1","pages":"80-91"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86463306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Density-Based Geometry Compression for LiDAR Point Clouds
Xibo Sun, Qiong Luo
LiDAR (Light Detection and Ranging) sensors produce 3D point clouds that capture the surroundings, and these data are used in applications such as autonomous driving, traffic monitoring, and remote surveys. LiDAR point clouds are usually compressed for efficient transmission and storage. However, to achieve a high compression ratio, existing work often sacrifices the geometric accuracy of the data, which hurts the effectiveness of downstream applications. Therefore, we propose a system that achieves a high compression ratio while preserving geometric accuracy. In our method, we first perform density-based clustering to distinguish the dense points from the sparse ones, because they are suitable for different compression methods. The clustering algorithm is optimized for our purpose and its parameter values are set to preserve accuracy. We then compress the dense points with an octree, and organize the sparse ones into polylines to reduce the redundancy. We further propose to compress the sparse points on the polylines by their spherical coordinates considering the properties of both the LiDAR sensors and the real-world scenes. Finally, we design suitable schemes to compress the remaining sparse points not on any polyline. Experimental results on DBGC, our prototype system, show that our scheme compressed large-scale real-world datasets by up to 19 times with an error bound under 0.02 meters for scenes of thousands of cubic meters. This result, together with the fast compression speed of DBGC, demonstrates the online compression of LiDAR data with high accuracy. Our source code is publicly available at https://github.com/RapidsAtHKUST/DBGC.
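One ingredient of the approach, encoding points in spherical rather than Cartesian coordinates, can be sketched as follows (a simplified illustration under assumed parameters, not DBGC's actual encoding). Spinning LiDARs sample azimuth and elevation on a near-regular grid, which makes these coordinates cheap to encode, and a uniform quantization step of 0.04 on the range yields an error bound of 0.02 meters:

```python
import numpy as np

def to_spherical(points):
    """Convert N x 3 Cartesian LiDAR points to (range, azimuth, elevation).
    Spinning LiDARs sample azimuth/elevation on a near-regular grid, so
    these coordinates compress far better than raw x, y, z."""
    x, y, z = points.T
    r = np.sqrt(x**2 + y**2 + z**2)
    azimuth = np.arctan2(y, x)
    elevation = np.arcsin(z / r)
    return np.stack([r, azimuth, elevation], axis=1)

def quantize(values, step):
    """Uniform quantization with an explicit error bound of step / 2."""
    return np.round(values / step).astype(np.int32)

pts = np.random.uniform(-50.0, 50.0, (1000, 3))
sph = to_spherical(pts)
codes = quantize(sph[:, 0], step=0.04)   # 0.02 m error bound on range
print(codes[:5])
```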
{"title":"Density-Based Geometry Compression for LiDAR Point Clouds","authors":"Xibo Sun, Qiong Luo","doi":"10.48786/edbt.2023.30","DOIUrl":"https://doi.org/10.48786/edbt.2023.30","url":null,"abstract":"LiDAR (Light Detection and Ranging) sensors produce 3D point clouds that capture the surroundings, and these data are used in applications such as autonomous driving, tra � c monitoring, and remote surveys. LiDAR point clouds are usually compressed for e � cient transmission and storage. However, to achieve a high compression ratio, existing work often sacri � ces the geometric accuracy of the data, which hurts the e � ectiveness of downstream applications. Therefore, we propose a system that achieves a high compression ratio while preserving geometric accuracy. In our method, we � rst perform density-based clustering to distinguish the dense points from the sparse ones, because they are suitable for di � erent compression methods. The clustering algorithm is optimized for our purpose and its parameter values are set to preserve accuracy. We then compress the dense points with an octree, and organize the sparse ones into polylines to reduce the redundancy. We further propose to compress the sparse points on the polylines by their spherical coordinates considering the properties of both the LiDAR sensors and the real-world scenes. Finally, we design suitable schemes to compress the remaining sparse points not on any polyline. Experimental results on DBGC, our prototype system, show that our scheme compressed large-scale real-world datasets by up to 19 times with an error bound under 0.02 meters for scenes of thousands of cubic meters. This result, together with the fast compression speed of DBGC, demonstrates the online compression of LiDAR data with high accuracy. Our source code is publicly available at https://github.com/RapidsAtHKUST/DBGC.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"44 1","pages":"378-390"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83061379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A new PET for Data Collection via Forms with Data Minimization, Full Accuracy and Informed Consent
N. Anciaux, S. Frittella, Baptiste Joffroy, Benjamin Nguyen, Guillaume Scerri
Privacy laws and principles such as data minimization and informed consent are supposed to protect citizens from the over-collection of personal data. Nevertheless, current processes, which mainly rely on filling in forms, are still based on practices that lead to over-collection. Indeed, any citizen wishing to apply for a benefit (or service) will transmit all their personal data involved in the evaluation of the eligibility criteria. The resulting problem of over-collection affects millions of individuals, with considerable volumes of information collected. This compliance problem concerns both public and private organizations (e.g., social services, banks, insurance companies) because it involves non-trivial issues that hinder the implementation of data minimization by developers. In this paper, we propose a new modeling approach that enables data minimization and informed choices for the users, for any decision problem modeled using classical logic, which covers a wide range of practical cases. Our data minimization solution uses game-theoretic notions to explain and quantify the privacy payoff for the user. We show how our algorithms can be applied to practical case studies as a new PET for minimal, fully accurate (all due services must be preserved), and informed data collection.
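The core idea, revealing only an attribute subset that already fixes the eligibility decision, can be sketched by brute force over finite toy domains (the rule, attributes, and domains below are hypothetical; the paper's logic-based modeling and game-theoretic payoff analysis go well beyond this):

```python
from itertools import combinations, product

def minimal_disclosure(eligible, attrs, user, domains):
    """Smallest attribute subset whose values alone fix the decision:
    for every completion of the hidden attributes, `eligible` returns
    the same verdict as on the user's full record. Brute force; the
    paper's logic-based modeling is far more scalable."""
    verdict = eligible(user)
    for k in range(len(attrs) + 1):
        for subset in combinations(attrs, k):
            hidden = [a for a in attrs if a not in subset]
            if all(eligible({**{a: user[a] for a in subset},
                             **dict(zip(hidden, combo))}) == verdict
                   for combo in product(*(domains[a] for a in hidden))):
                return subset, verdict
    # Unreachable: disclosing every attribute always fixes the decision.

# Toy benefit rule: income below 2000 OR more than 2 dependents.
rule = lambda r: r["income"] < 2000 or r["dependents"] > 2
doms = {"income": [1000, 3000], "dependents": [0, 1, 2, 3]}
person = {"income": 1500, "dependents": 0}
print(minimal_disclosure(rule, ["income", "dependents"], person, doms))
# (('income',), True) -- 'dependents' never needs to be revealed
```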
{"title":"A new PET for Data Collection via Forms with Data Minimization, Full Accuracy and Informed Consent","authors":"N. Anciaux, S. Frittella, Baptiste Joffroy, Benjamin Nguyen, Guillaume Scerri","doi":"10.48786/edbt.2024.08","DOIUrl":"https://doi.org/10.48786/edbt.2024.08","url":null,"abstract":"The advent of privacy laws and principles such as data minimization and informed consent are supposed to protect citizens from over-collection of personal data. Nevertheless, current processes, mainly through filling forms are still based on practices that lead to over-collection. Indeed, any citizen wishing to apply for a benefit (or service) will transmit all their personal data involved in the evaluation of the eligibility criteria. The resulting problem of over-collection affects millions of individuals, with considerable volumes of information collected. If this problem of compliance concerns both public and private organizations (e.g., social services, banks, insurance companies), it is because it faces non-trivial issues, which hinder the implementation of data minimization by developers. In this paper, we propose a new modeling approach that enables data minimization and informed choices for the users, for any decision problem modeled using classical logic, which covers a wide range of practical cases. Our data minimization solution uses game theoretic notions to explain and quantify the privacy payoff for the user. We show how our algorithms can be applied to practical cases study as a new PET for minimal, fully accurate (all due services must be preserved) and informed data collection.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"57 1","pages":"81-93"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80497929","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1