Efficient Maintenance of Agree-Sets Against Dynamic Datasets
Khalid Belhajjame. Advances in Database Technology: Proceedings of EDBT 2023, pp. 14–26. DOI: 10.48786/edbt.2023.02

Constraint discovery is a fundamental task in data profiling: identifying the dependencies that a dataset satisfies. As published datasets are increasingly dynamic, researchers have begun to investigate dependency discovery in dynamic datasets. Proposals thus far in this area can be viewed as schema-based, in the sense that they model and explore the solution space using a lattice built over the attributes (columns) of the dataset. Like their static counterparts, proposals in this class tend to perform well for datasets with a large number of tuples but a small number of attributes. A second class of proposals, so far examined only for static datasets, is data-driven and is known to perform well for datasets with a large number of attributes and a small number of tuples. The main bottleneck of this class of solutions is the generation of agree-sets, which involves pairwise comparison of the tuples in the dataset. We present DynASt, a system for the efficient maintenance of agree-sets in dynamic datasets. We investigate the performance of DynASt and its scalability in the number of tuples and attributes of the target dataset, and show that it outperforms existing (static and dynamic) state-of-the-art solutions for datasets with a large number of attributes.
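At the heart of the bottleneck described above, the agree-set of a tuple pair is simply the set of attributes on which the two tuples hold equal values. A minimal Python sketch (illustrating the concept only, not DynASt's maintenance algorithm; the example relation is invented) shows why naive generation is quadratic in the number of tuples:

```python
from itertools import combinations

def agree_set(t1, t2, attributes):
    """Attributes on which two tuples hold equal values."""
    return frozenset(a for a in attributes if t1[a] == t2[a])

def all_agree_sets(relation, attributes):
    """Naive generation: one comparison per tuple pair, i.e. O(n^2)."""
    return {agree_set(t1, t2, attributes)
            for t1, t2 in combinations(relation, 2)}

rows = [
    {"city": "Lyon",  "zip": "69001", "country": "FR"},
    {"city": "Lyon",  "zip": "69002", "country": "FR"},
    {"city": "Paris", "zip": "75001", "country": "FR"},
]
attrs = ["city", "zip", "country"]
print(all_agree_sets(rows, attrs))
# two distinct agree-sets: {city, country} and {country}
```

Maintaining this collection under tuple inserts and deletes, rather than recomputing all pairs from scratch, is exactly what a dynamic approach must optimize.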
Tagger: A Tool for the Discovery of Tagged Unions in JSON Schema Extraction
Stefan Klessinger, Michael Fruth, Valentin Gittinger, Meike Klettke, U. Störl, Stefanie Scherzinger. Advances in Database Technology: Proceedings of EDBT 2023, pp. 827–830. DOI: 10.48786/edbt.2023.75

This tool demo features an original approach to model inference (schema extraction) from collections of JSON documents: we automatically detect tagged unions, an established design pattern in hand-crafted schemas for conditionally declaring subtypes. Our “Tagger” approach is based on the discovery of conditional functional dependencies in a relational encoding of JSON objects. We have integrated our prototype implementation into an open-source tool for managing data models in schema-flexible NoSQL data stores. Demo participants can interactively apply different schema extraction algorithms to real-world inputs and compare the extracted schemas with those produced by “Tagger”.
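The tagged-union pattern the tool targets can be pictured with a toy heuristic: a key acts as a discriminator if each of its values always co-occurs with exactly one set of sibling keys. The sketch below only illustrates the pattern; Tagger itself discovers conditional functional dependencies over a relational encoding, which this code does not reproduce, and the sample documents are invented:

```python
from collections import defaultdict

def candidate_tags(docs):
    """Toy discriminator detection: a key qualifies if every one of its
    values determines exactly one set of sibling keys."""
    keys = set.intersection(*(set(d) for d in docs))  # keys present in every doc
    tags = {}
    for k in keys:
        variants = defaultdict(set)
        for d in docs:
            variants[d[k]].add(frozenset(set(d) - {k}))
        if len(variants) > 1 and all(len(v) == 1 for v in variants.values()):
            tags[k] = {val: sorted(next(iter(v))) for val, v in variants.items()}
    return tags

docs = [
    {"kind": "circle", "radius": 1.0},
    {"kind": "rect", "width": 2, "height": 3},
    {"kind": "circle", "radius": 2.5},
]
print(candidate_tags(docs))
```

On this input, "kind" is reported as a tag: the value "circle" implies a radius field, while "rect" implies width and height.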
Blue Elephants Inspecting Pandas: Inspection and Execution of Machine Learning Pipelines in SQL
Maximilian E. Schüle. Advances in Database Technology: Proceedings of EDBT 2023, pp. 40–52. DOI: 10.48786/edbt.2023.04

Data preprocessing, the step of transforming data into a suitable format for training a model, rarely happens within database systems, but rather in external Python libraries, and thus requires extracting the data from the database system first. However, database systems are tuned for efficient data access and offer aggregate functions to calculate the distribution frequencies needed to detect the under- or overrepresentation of a certain value within the data (bias). We argue that database systems with SQL are capable of executing machine learning pipelines as well as efficiently discovering technical biases introduced by data preprocessing. We therefore present a set of SQL queries that cover data preprocessing and data inspection: during preprocessing, we annotate the tuples with an identifier to compute the distribution frequency of columns. To inspect distribution changes, we join the preprocessed dataset with the original one on the tuple identifier and use aggregate functions to count the number of occurrences per sensitive column. This allows us to detect operations that filter out tuples and thus introduce a technical bias, even for columns that preprocessing has removed. To generate such queries automatically, our implementation extends the mlinspect project to transpile existing data preprocessing pipelines written in Python to SQL queries, while maintaining detailed inspection results using views or common table expressions (CTEs). The evaluation shows that a modern beyond-main-memory database system, namely Umbra, accelerates the runtime of preprocessing and inspection. Even PostgreSQL, as a disk-based database system, matches Umbra's inspection performance when materialising views.
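The inspection idea, joining the preprocessed data back to the original on the tuple identifier and counting occurrences per sensitive column, can be sketched in plain SQL. The snippet below uses Python's built-in sqlite3 with an invented toy table; it does not reproduce the paper's mlinspect-to-SQL transpiler or the Umbra system:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE raw(id INTEGER PRIMARY KEY, age INT, grp TEXT);
    INSERT INTO raw VALUES (1, 25, 'A'), (2, 40, 'B'),
                           (3, 17, 'B'), (4, 33, 'A');
    -- preprocessing step: filter out minors (keeps the tuple identifier)
    CREATE VIEW prepped AS SELECT id, age FROM raw WHERE age >= 18;
""")
# Inspection: count per sensitive group before vs. after preprocessing,
# even though the 'grp' column itself was dropped by the pipeline.
rows = con.execute("""
    SELECT r.grp,
           COUNT(*)    AS before_cnt,
           COUNT(p.id) AS after_cnt
    FROM raw r LEFT JOIN prepped p ON p.id = r.id
    GROUP BY r.grp ORDER BY r.grp
""").fetchall()
print(rows)  # group B lost a tuple: a technical bias introduced by the filter
```

The LEFT JOIN is what makes removed tuples visible: filtered rows simply produce NULLs on the preprocessed side, so `COUNT(p.id)` drops them while `COUNT(*)` still sees them.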
Learned Selection Strategy for Lightweight Integer Compression Algorithms
Lucas Woltmann, Patrick Damme, Claudio Hartmann, Dirk Habich, Wolfgang Lehner. Advances in Database Technology: Proceedings of EDBT 2023, pp. 552–564. DOI: 10.48786/edbt.2023.47

Data compression has recently experienced a revival in the domain of in-memory column stores. In this field, a large corpus of lightweight integer compression algorithms plays a dominant role, since all columns are typically encoded as sequences of integer values. Unfortunately, there is no single best integer compression algorithm: the best choice depends on data and hardware properties. For this reason, selecting the best-fitting integer compression algorithm becomes increasingly important and is an interesting tuning knob for optimization. However, traditional selection strategies require profound knowledge of the (de-)compression algorithms for decision-making, which limits their broad applicability. To counteract this, we propose a novel learned selection strategy that treats integer compression algorithms as independent black boxes. This black-box approach ensures broad applicability and requires machine-learning-based methods to model the knowledge needed for decision-making. Most importantly, we show that a local approach, where every algorithm is modeled individually, plays a crucial role. Moreover, our learned selection strategy generalizes because it is independent of the user's data. Finally, we evaluate our approach against existing selection strategies to show the benefits of our learned selection strategy.
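The "local approach" advocated above, one model per algorithm treated as a black box, can be sketched as follows. The two algorithm names, the single feature, and the hand-rolled least-squares cost models are all placeholders; the paper's actual models and algorithm corpus are not reproduced:

```python
def fit_1d(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Training data per algorithm (invented): feature = value range of the
# column, target = observed cost. Each algorithm gets its own local model.
history = {
    "delta":      ([10, 100, 1000], [1.0, 1.5, 9.0]),
    "dictionary": ([10, 100, 1000], [2.0, 2.2, 2.5]),
}
models = {alg: fit_1d(xs, ys) for alg, (xs, ys) in history.items()}

def select(feature):
    """Selection = argmin over the independent per-algorithm predictions."""
    cost = {alg: a * feature + b for alg, (a, b) in models.items()}
    return min(cost, key=cost.get)

print(select(20), select(900))  # delta dictionary
```

The point of the local design is that adding a new algorithm only requires training its own model; none of the existing models need to know it exists.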
Enhanced Featurization of Queries with Mixed Combinations of Predicates for ML-based Cardinality Estimation
Magnus Müller, Lucas Woltmann, Wolfgang Lehner. Advances in Database Technology: Proceedings of EDBT 2023, pp. 273–284. DOI: 10.48786/edbt.2023.22

Estimating query result sizes is a critical task in areas like query optimization. For some years now it has been popular to apply machine learning to this problem. Surprisingly, however, there has been very little research on how to present queries to a machine learning model. Machine learning models do not simply consume SQL strings; instead, a SQL string is transformed into a numerical representation. This transformation is called query featurization and is defined by a query featurization technique (QFT). This paper is concerned with QFTs for queries with many selection predicates. In particular, we consider queries that contain both predicates over different attributes and multiple predicates per attribute. We identify a desirable property of query featurization and present three novel QFTs. To the best of our knowledge, we are the first to featurize queries with mixed combinations of predicates, i.e., containing both conjunctions and disjunctions. Our QFTs are model-independent and can serve as the query featurization layer for different types of machine learning models. In our evaluation, we combine our QFTs with three different machine learning models and demonstrate that the estimation accuracy significantly depends on the QFT used. In addition, we compare our best combination of QFT and machine learning model to various existing cardinality estimators.
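What query featurization means in practice can be made concrete with a deliberately simplistic QFT: one normalized (lo, hi) slot per attribute for a conjunctive range query. This is not one of the paper's three QFTs (those additionally handle disjunctions and multiple predicates per attribute), and the domains and query are invented:

```python
def featurize(predicates, domains):
    """Encode a conjunctive range query as one (lo, hi) pair per attribute,
    normalized to [0, 1]. An attribute without a predicate contributes the
    full range, so every query maps to a fixed-length numeric vector."""
    vec = []
    for attr, (dmin, dmax) in domains.items():
        lo, hi = predicates.get(attr, (dmin, dmax))
        width = dmax - dmin
        vec += [(lo - dmin) / width, (hi - dmin) / width]
    return vec

domains = {"age": (0, 100), "salary": (0, 200_000)}
# SELECT ... WHERE age BETWEEN 30 AND 50
print(featurize({"age": (30, 50)}, domains))  # [0.3, 0.5, 0.0, 1.0]
```

A model trained on such vectors never sees SQL text; the fixed layout (two slots per attribute, in a fixed attribute order) is what lets it compare queries numerically.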
Example-Driven Exploratory Analytics over Knowledge Graphs
Matteo Lissandrini, K. Hose, T. Pedersen. Advances in Database Technology: Proceedings of EDBT 2023, pp. 105–117. DOI: 10.48786/edbt.2023.09

Due to their expressive power, Knowledge Graphs (KGs) have received increasing interest, not only as a means to structure and integrate heterogeneous information but also as a native storage format for large amounts of knowledge and statistical data. Analytical queries over KG data, typically stored as RDF, have therefore become increasingly important. Yet, formulating such queries is difficult for users who are not familiar with the query language (typically SPARQL) and the structure of the dataset at hand. To overcome this limitation, we propose Re2xOLAP: the first comprehensive interactive approach for reverse-engineering and refining RDF exploratory OLAP queries over KGs containing statistical data. Re2xOLAP thus enables KG exploratory analytics without requiring the user to write any query at all. We achieve this goal by first reverse-engineering analytical SPARQL queries from a small set of user-provided examples; then, given the reverse-engineered query, we propose intuitive and explainable exploratory query refinements to iteratively help the user obtain the desired information. Our experiments on real-world large-scale KGs show that Re2xOLAP can efficiently reverse-engineer analytical SPARQL queries solely from a small set of input examples. Additionally, we demonstrate the expressive power of our interactive refinement methods by showing that Re2xOLAP allows users to navigate hundreds of thousands of different exploration paths with just a few interactions.
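Reverse-engineering an analytical query from example outputs, the first step of the pipeline above, can be pictured on a toy relational table: enumerate (group-by dimension, aggregate) candidates and keep those whose result contains the user's examples. Re2xOLAP works over SPARQL and RDF statistical data, which this hypothetical sketch does not model; the table and example row are invented:

```python
data = [
    {"country": "DK", "year": 2020, "sales": 10},
    {"country": "DK", "year": 2021, "sales": 14},
    {"country": "FR", "year": 2020, "sales": 7},
]

def candidates(examples):
    """Enumerate (group_by, aggregate) pairs whose result over `data`
    contains every user-provided example (key -> aggregated value)."""
    found = []
    for dim in ("country", "year"):
        for agg_name, agg in (("sum", sum), ("max", max)):
            groups = {}
            for row in data:
                groups.setdefault(row[dim], []).append(row["sales"])
            result = {k: agg(v) for k, v in groups.items()}
            if all(result.get(k) == v for k, v in examples.items()):
                found.append((dim, agg_name))
    return found

# the user provides one example output row: DK -> 24
print(candidates({"DK": 24}))  # [('country', 'sum')]
```

With more examples the candidate set shrinks, which is why a small set of example rows can already pin down the intended query.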
Comprehensive Evaluation of Algorithms for Unrestricted Graph Alignment
Konstantinos Skitsas, Karol Orlowski, Judith Hermanns, D. Mottin, Panagiotis Karras. Advances in Database Technology: Proceedings of EDBT 2023, pp. 260–272. DOI: 10.48786/edbt.2023.21

The graph alignment problem calls for finding a matching between the nodes of one graph and those of another, such that matched nodes correspond to each other under some fitness measure. Over the last years, several graph alignment algorithms have been proposed and evaluated on diverse datasets and quality measures. Typically, a newly proposed algorithm is compared to previous ones on the specific datasets, types of noise, and quality measures where the new proposal achieves superiority. However, no systematic comparison of the proposed algorithms has been attempted on the same benchmarks. This paper fills this gap by conducting an extensive, thorough, and commensurable evaluation of state-of-the-art graph alignment algorithms. Our results highlight the value of overlooked solutions and an unprecedented effect of graph density on performance, and hence call for further work.
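One fitness measure commonly used in such benchmarks is node correctness: the fraction of nodes an algorithm maps to their ground-truth counterpart. A minimal sketch of that measure (the specific metrics and datasets used in the paper are not reproduced here, and the alignment is invented):

```python
def node_correctness(alignment, ground_truth):
    """Fraction of nodes mapped to their true counterpart."""
    hits = sum(1 for u, v in ground_truth.items() if alignment.get(u) == v)
    return hits / len(ground_truth)

gt = {0: "a", 1: "b", 2: "c"}
print(node_correctness({0: "a", 1: "c", 2: "c"}, gt))  # 2 of 3 nodes correct
```

Measures like this make otherwise incomparable algorithms commensurable, which is precisely what a common benchmark requires.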
Progressive Entity Resolution over Incremental Data
Leonardo Gazzarri, Melanie Herschel. Advances in Database Technology: Proceedings of EDBT 2023, pp. 80–91. DOI: 10.48786/edbt.2023.07

Entity Resolution (ER) algorithms identify entity profiles that correspond to the same real-world entity within one or across multiple data sets. Modern challenges for ER are posed by the volume, variety, and velocity that characterize Big Data. While progressive ER aims to efficiently solve the problem under time constraints by prioritizing useful work over superfluous work, incremental ER aims to produce results incrementally as new data increments arrive. This paper presents algorithms that combine these two approaches in the context of streaming, heterogeneous data. The overall goal is to maximize the chances of spotting duplicates of a given entity profile as close as possible to its arrival time (early quality), without relying on any schema information, while remaining efficient enough to process large volumes of fast streaming data without compromising the eventual quality (by cutting too many corners for efficiency). Experiments validate that our algorithms are the first to support incremental and progressive ER and, compared to state-of-the-art incremental approaches, improve early quality, eventual quality, and system efficiency by progressively and adaptively performing those pending comparisons that are most likely to match while waiting for the next stream increment.
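The combination of progressive and incremental ER can be caricatured with a priority queue: each arriving profile schedules comparisons against earlier profiles, and between increments only the most promising comparisons are executed. Everything below (the token-overlap similarity, the threshold, the budget) is an invented stand-in for the paper's blocking and scheduling machinery:

```python
import heapq

def similarity(a, b):
    """Cheap similarity proxy: Jaccard overlap of whitespace tokens."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

class ProgressiveER:
    def __init__(self):
        self.profiles, self.queue, self.matches = [], [], []

    def ingest(self, profile):
        # Schedule comparisons, but do not execute them yet: the max-heap
        # (negated score) keeps the most promising pair on top.
        for other in self.profiles:
            heapq.heappush(self.queue, (-similarity(profile, other),
                                        profile, other))
        self.profiles.append(profile)

    def resolve(self, budget, threshold=0.5):
        # Spend a limited comparison budget on the best-ranked pairs only.
        for _ in range(min(budget, len(self.queue))):
            neg_s, a, b = heapq.heappop(self.queue)
            if -neg_s >= threshold:
                self.matches.append((a, b))

er = ProgressiveER()
for p in ["john smith ny", "john smith nyc", "jane doe la"]:
    er.ingest(p)
er.resolve(budget=1)  # only the single best-ranked comparison is executed
print(er.matches)
```

The budget models the time between stream increments: work that does not fit is not thrown away but stays queued, ordered by match likelihood, for the next idle window.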
LiDAR (Light Detection and Ranging) sensors produce 3D point clouds that capture the surroundings, and these data are used in applications such as autonomous driving, tra � c monitoring, and remote surveys. LiDAR point clouds are usually compressed for e � cient transmission and storage. However, to achieve a high compression ratio, existing work often sacri � ces the geometric accuracy of the data, which hurts the e � ectiveness of downstream applications. Therefore, we propose a system that achieves a high compression ratio while preserving geometric accuracy. In our method, we � rst perform density-based clustering to distinguish the dense points from the sparse ones, because they are suitable for di � erent compression methods. The clustering algorithm is optimized for our purpose and its parameter values are set to preserve accuracy. We then compress the dense points with an octree, and organize the sparse ones into polylines to reduce the redundancy. We further propose to compress the sparse points on the polylines by their spherical coordinates considering the properties of both the LiDAR sensors and the real-world scenes. Finally, we design suitable schemes to compress the remaining sparse points not on any polyline. Experimental results on DBGC, our prototype system, show that our scheme compressed large-scale real-world datasets by up to 19 times with an error bound under 0.02 meters for scenes of thousands of cubic meters. This result, together with the fast compression speed of DBGC, demonstrates the online compression of LiDAR data with high accuracy. Our source code is publicly available at https://github.com/RapidsAtHKUST/DBGC.
{"title":"Density-Based Geometry Compression for LiDAR Point Clouds","authors":"Xibo Sun, Qiong Luo","doi":"10.48786/edbt.2023.30","DOIUrl":"https://doi.org/10.48786/edbt.2023.30","url":null,"abstract":"LiDAR (Light Detection and Ranging) sensors produce 3D point clouds that capture the surroundings, and these data are used in applications such as autonomous driving, tra � c monitoring, and remote surveys. LiDAR point clouds are usually compressed for e � cient transmission and storage. However, to achieve a high compression ratio, existing work often sacri � ces the geometric accuracy of the data, which hurts the e � ectiveness of downstream applications. Therefore, we propose a system that achieves a high compression ratio while preserving geometric accuracy. In our method, we � rst perform density-based clustering to distinguish the dense points from the sparse ones, because they are suitable for di � erent compression methods. The clustering algorithm is optimized for our purpose and its parameter values are set to preserve accuracy. We then compress the dense points with an octree, and organize the sparse ones into polylines to reduce the redundancy. We further propose to compress the sparse points on the polylines by their spherical coordinates considering the properties of both the LiDAR sensors and the real-world scenes. Finally, we design suitable schemes to compress the remaining sparse points not on any polyline. Experimental results on DBGC, our prototype system, show that our scheme compressed large-scale real-world datasets by up to 19 times with an error bound under 0.02 meters for scenes of thousands of cubic meters. This result, together with the fast compression speed of DBGC, demonstrates the online compression of LiDAR data with high accuracy. Our source code is publicly available at https://github.com/RapidsAtHKUST/DBGC.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. 
International Conference on Extending Database Technology","volume":"44 1","pages":"378-390"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83061379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
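The abstract above notes that sparse points on polylines are compressed via their spherical coordinates, since a LiDAR sensor natively measures range and angles. A minimal sketch of the underlying coordinate transform follows; the function names are illustrative and are not taken from the DBGC codebase, and any real compressor would additionally quantize the range and angle values.

```python
import math

def to_spherical(x, y, z):
    """Convert a Cartesian LiDAR point to spherical coordinates
    (range r, azimuth theta, elevation phi)."""
    r = math.sqrt(x * x + y * y + z * z)
    theta = math.atan2(y, x)                   # azimuth in [-pi, pi]
    phi = math.asin(z / r) if r > 0 else 0.0   # elevation angle
    return r, theta, phi

def to_cartesian(r, theta, phi):
    """Inverse transform: spherical coordinates back to Cartesian."""
    return (r * math.cos(phi) * math.cos(theta),
            r * math.cos(phi) * math.sin(theta),
            r * math.sin(phi))
```

Storing `(r, theta, phi)` instead of `(x, y, z)` exploits the fact that points along a scan line vary smoothly in the angular dimensions, which makes them more compressible; the transform itself is lossless up to floating-point error.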
N. Anciaux, S. Frittella, Baptiste Joffroy, Benjamin Nguyen, Guillaume Scerri
The advent of privacy laws and principles such as data minimization and informed consent is supposed to protect citizens from over-collection of personal data. Nevertheless, current processes, which mainly rely on filling in forms, are still based on practices that lead to over-collection. Indeed, any citizen wishing to apply for a benefit (or service) will transmit all their personal data involved in the evaluation of the eligibility criteria. The resulting problem of over-collection affects millions of individuals, with considerable volumes of information collected. This compliance problem concerns both public and private organizations (e.g., social services, banks, insurance companies) because it raises non-trivial issues that hinder the implementation of data minimization by developers. In this paper, we propose a new modeling approach that enables data minimization and informed choices for the users, for any decision problem modeled using classical logic, which covers a wide range of practical cases. Our data minimization solution uses game-theoretic notions to explain and quantify the privacy payoff for the user. We show how our algorithms can be applied to practical case studies as a new PET for minimal, fully accurate (all due services must be preserved) and informed data collection.
{"title":"A new PET for Data Collection via Forms with Data Minimization, Full Accuracy and Informed Consent","authors":"N. Anciaux, S. Frittella, Baptiste Joffroy, Benjamin Nguyen, Guillaume Scerri","doi":"10.48786/edbt.2024.08","DOIUrl":"https://doi.org/10.48786/edbt.2024.08","url":null,"abstract":"The advent of privacy laws and principles such as data minimization and informed consent is supposed to protect citizens from over-collection of personal data. Nevertheless, current processes, which mainly rely on filling in forms, are still based on practices that lead to over-collection. Indeed, any citizen wishing to apply for a benefit (or service) will transmit all their personal data involved in the evaluation of the eligibility criteria. The resulting problem of over-collection affects millions of individuals, with considerable volumes of information collected. This compliance problem concerns both public and private organizations (e.g., social services, banks, insurance companies) because it raises non-trivial issues that hinder the implementation of data minimization by developers. In this paper, we propose a new modeling approach that enables data minimization and informed choices for the users, for any decision problem modeled using classical logic, which covers a wide range of practical cases. Our data minimization solution uses game-theoretic notions to explain and quantify the privacy payoff for the user. We show how our algorithms can be applied to practical case studies as a new PET for minimal, fully accurate (all due services must be preserved) and informed data collection.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. 
International Conference on Extending Database Technology","volume":"57 1","pages":"81-93"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80497929","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
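The abstract above models eligibility decisions in classical logic and seeks data minimization: the applicant should disclose only what is needed to determine the decision. The following brute-force sketch illustrates the idea for boolean attributes; it is a hypothetical illustration, not the paper's algorithm (which also involves game-theoretic payoffs). A subset of attributes suffices when the decision is the same for every possible valuation of the undisclosed attributes.

```python
from itertools import combinations, product

def minimal_disclosure(criteria, user_data):
    """Return a smallest subset of the user's boolean attributes whose
    values already determine the eligibility decision, no matter how
    the undisclosed attributes might be valued."""
    attrs = list(user_data)
    decision = criteria(user_data)
    for k in range(len(attrs) + 1):
        for subset in combinations(attrs, k):
            fixed = {a: user_data[a] for a in subset}
            hidden = [a for a in attrs if a not in subset]
            # Check that every completion of the hidden attributes
            # yields the same decision as the full data.
            if all(criteria({**fixed, **dict(zip(hidden, vals))}) == decision
                   for vals in product([False, True], repeat=len(hidden))):
                return fixed
    return user_data  # unreachable: the full set always determines the decision
```

For example, with the (made-up) criterion "resident and (student or unemployed)", a non-resident applicant need only disclose `{"resident": False}`: the outcome is fixed regardless of the other fields. The exponential enumeration is only viable for small forms; it serves purely to make the minimization objective concrete.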