Fair Spatial Indexing: A Paradigm for Group Spatial Fairness
Sina Shaham, Gabriel Ghinita, Cyrus Shahabi
DOI: 10.48786/edbt.2024.14 | pp. 150-161 | Published: 2024-01-01 | Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11531788/pdf/

Machine learning (ML) plays an increasing role in decision-making tasks that directly affect individuals, e.g., loan approvals or job-applicant screening. There is significant concern that, without special provisions, individuals from under-privileged backgrounds may not get equitable access to services and opportunities. Existing research studies fairness with respect to protected attributes such as gender, race, or income, but the impact of location data on fairness has been largely overlooked. With the widespread adoption of mobile apps, geospatial attributes are increasingly used in ML, and their potential to introduce unfair bias is significant, given their high correlation with protected attributes. We propose techniques to mitigate location bias in machine learning. Specifically, we consider the issue of miscalibration when dealing with geospatial attributes. We focus on spatial group fairness and propose a spatial indexing algorithm that accounts for fairness. Our KD-tree-inspired approach significantly improves fairness while maintaining high learning accuracy, as shown by extensive experimental results on real data.
Computing Generic Abstractions from Application Datasets
Nelly Barret, I. Manolescu, P. Upadhyay
DOI: 10.48786/edbt.2024.09 | pp. 94-107 | Published: 2024-01-01

Digital data plays a central role in the sciences, journalism, environmental studies, the digital humanities, and beyond. Open Data sharing initiatives have led to many large, interesting datasets being shared online. Some of these are RDF graphs, but other formats such as CSV files, relational tables, property graphs, and JSON or XML documents are also frequent. Practitioners need to understand a dataset to decide whether it suits their needs. Datasets may come with a schema and/or a summary; however, the former is not always provided, and the latter is often too technical for non-IT users. To overcome these limitations, we present an end-to-end dataset abstraction approach which (i) applies to any (semi)structured data model; (ii) computes a description meant for human users, in the form of an Entity-Relationship diagram; and (iii) integrates Information Extraction and data profiling to classify dataset entities among a large set of intelligible categories. We implemented our approach in a system called Abstra and detail its performance on various datasets.
Data Coverage for Detecting Representation Bias in Image Datasets: A Crowdsourcing Approach
Melika Mousavi, N. Shahbazi, Abolfazl Asudeh
DOI: 10.48550/arXiv.2306.13868 | pp. 47-60 | Published: 2023-06-24
Existing machine learning models have repeatedly been shown to underperform for minority groups, mainly due to biases in data. In particular, datasets, especially social data, are often not representative of minorities. In this paper, we consider the problem of identifying representation bias in image datasets without explicit attribute values. Using the notion of data coverage to detect a lack of representation, we develop multiple crowdsourcing approaches. Our core approach, at a high level, is a divide-and-conquer algorithm that applies a search-space pruning strategy to efficiently identify whether a dataset lacks proper coverage for a given group. We provide a theoretical analysis of our algorithm, including a tight upper bound on its performance that guarantees its near-optimality. Using this algorithm as the core, we propose multiple heuristics to reduce the coverage-detection cost across different cases with multiple intersectional and non-intersectional groups. We demonstrate that pre-trained predictors are not reliable, and hence not sufficient, for detecting representation bias in the data. Finally, we adjust our core algorithm to utilize existing models for predicting image group(s) to minimize the coverage-identification cost. We conduct extensive experiments, including live experiments on Amazon Mechanical Turk, to validate our problem and evaluate our algorithms' performance.
{"title":"Data Coverage for Detecting Representation Bias in Image Datasets: A Crowdsourcing Approach","authors":"Melika Mousavi, N. Shahbazi, Abolfazl Asudeh","doi":"10.48550/arXiv.2306.13868","DOIUrl":"https://doi.org/10.48550/arXiv.2306.13868","url":null,"abstract":"Existing machine learning models have proven to fail when it comes to their performance for minority groups, mainly due to biases in data. In particular, datasets, especially social data, are often not representative of minorities. In this paper, we consider the problem of representation bias identification on image datasets without explicit attribute values. Using the notion of data coverage for detecting a lack of representation, we develop multiple crowdsourcing approaches. Our core approach, at a high level, is a divide and conquer algorithm that applies a search space pruning strategy to efficiently identify if a dataset misses proper coverage for a given group. We provide a different theoretical analysis of our algorithm, including a tight upper bound on its performance which guarantees its near-optimality. Using this algorithm as the core, we propose multiple heuristics to reduce the coverage detection cost across different cases with multiple intersectional/non-intersectional groups. We demonstrate how the pre-trained predictors are not reliable and hence not sufficient for detecting representation bias in the data. Finally, we adjust our core algorithm to utilize existing models for predicting image group(s) to minimize the coverage identification cost. We conduct extensive experiments, including live experiments on Amazon Mechanical Turk to validate our problem and evaluate our algorithms' performance.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"3 9 1","pages":"47-60"},"PeriodicalIF":0.0,"publicationDate":"2023-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84356880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Auditing for Spatial Fairness
Dimitris Sacharidis, G. Giannopoulos, George Papastefanatos, K. Stefanidis
DOI: 10.48550/arXiv.2302.12333 | pp. 485-491 | Published: 2023-02-23
This paper studies algorithmic fairness when the protected attribute is location. To handle protected attributes that are continuous, such as age or income, the standard approach is to discretize the domain into predefined groups and compare algorithmic outcomes across groups. However, applying this idea to location raises concerns of gerrymandering and may introduce statistical bias. Prior work addresses these concerns, but only for regularly spaced locations, and raises other issues, most notably an inability to discern regions that are likely to exhibit spatial unfairness. In line with established notions of algorithmic fairness, we define spatial fairness as the statistical independence of outcomes from location. This translates into requiring that, for each region of space, the distribution of outcomes is identical inside and outside the region. To allow for localized discrepancies in the distribution of outcomes, we compare how well two competing hypotheses explain the observed outcomes. The null hypothesis assumes spatial fairness, while the alternative allows different distributions inside and outside regions. Their goodness of fit is then assessed by a likelihood-ratio test. If there is no significant difference in how well the two hypotheses explain the observed outcomes, we conclude that the algorithm is spatially fair.
{"title":"Auditing for Spatial Fairness","authors":"Dimitris Sacharidis, G. Giannopoulos, George Papastefanatos, K. Stefanidis","doi":"10.48550/arXiv.2302.12333","DOIUrl":"https://doi.org/10.48550/arXiv.2302.12333","url":null,"abstract":"This paper studies algorithmic fairness when the protected attribute is location. To handle protected attributes that are continuous, such as age or income, the standard approach is to discretize the domain into predefined groups, and compare algorithmic outcomes across groups. However, applying this idea to location raises concerns of gerrymandering and may introduce statistical bias. Prior work addresses these concerns but only for regularly spaced locations, while raising other issues, most notably its inability to discern regions that are likely to exhibit spatial unfairness. Similar to established notions of algorithmic fairness, we define spatial fairness as the statistical independence of outcomes from location. This translates into requiring that for each region of space, the distribution of outcomes is identical inside and outside the region. To allow for localized discrepancies in the distribution of outcomes, we compare how well two competing hypotheses explain the observed outcomes. The null hypothesis assumes spatial fairness, while the alternate allows different distributions inside and outside regions. Their goodness of fit is then assessed by a likelihood ratio test. If there is no significant difference in how well the two hypotheses explain the observed outcomes, we conclude that the algorithm is spatially fair.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"35 1","pages":"485-491"},"PeriodicalIF":0.0,"publicationDate":"2023-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74184759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
TransEdge: Supporting Efficient Read Queries Across Untrusted Edge Nodes
Abhishek A. Singh, Aasim Khan, S. Mehrotra, Faisal Nawab
DOI: 10.48550/arXiv.2302.08019 | pp. 684-696 | Published: 2023-02-16
We propose Transactional Edge (TransEdge), a distributed transaction-processing system for untrusted environments such as edge computing systems. What distinguishes TransEdge is its focus on efficient support for read-only transactions. TransEdge allows reading from different partitions consistently using one round in most cases and no more than two rounds in the worst case. TransEdge's design is centered around a dependency-tracking scheme that spans its consensus and transaction-processing protocols. Our performance evaluation shows that TransEdge's snapshot read-only transactions achieve a 9-24x speedup compared to current Byzantine fault-tolerant systems.
{"title":"TransEdge: Supporting Efficient Read Queries Across Untrusted Edge Nodes","authors":"Abhishek A. Singh, Aasim Khan, S. Mehrotra, Faisal Nawab","doi":"10.48550/arXiv.2302.08019","DOIUrl":"https://doi.org/10.48550/arXiv.2302.08019","url":null,"abstract":"We propose Transactional Edge (TransEdge), a distributed transaction processing system for untrusted environments such as edge computing systems. What distinguishes TransEdge is its focus on efficient support for read-only transactions. TransEdge allows reading from different partitions consistently using one round in most cases and no more than two rounds in the worst case. TransEdge design is centered around this dependency tracking scheme including the consensus and transaction processing protocols. Our performance evaluation shows that TransEdge's snapshot read-only transactions achieve an 9-24x speedup compared to current byzantine systems.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"22 1","pages":"684-696"},"PeriodicalIF":0.0,"publicationDate":"2023-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91113987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Implementing and Evaluating E2LSH on Storage
Yuuichi Nakanishi, Kazuhiro Hiwada, Yosuke Bando, Tomoya Suzuki, H. Kajihara, Shintarou Sano, Tatsuro Endo, Tatsuo Shiozawa
DOI: 10.48786/edbt.2023.35 | pp. 437-449 | Published: 2023-01-01

Locality-sensitive hashing (LSH) is one of the most widely used approaches to approximate nearest neighbor search (ANNS) in high-dimensional spaces. The first work on LSH for the Euclidean distance, E2LSH, showed how ANNS can be solved efficiently at sublinear query time in the database size with theoretically guaranteed accuracy, although it required a large hash-index size. Since then, several LSH variants with much smaller index sizes have been proposed. Their query time is linear or superlinear, but they have been shown to run effectively faster because they require fewer I/Os when the index is stored on hard disk drives, and because they also permit in-memory execution with modern DRAM capacities. In this paper, we show that E2LSH is regaining the advantage in query speed with the advent of modern flash storage devices such as solid-state drives (SSDs). We evaluate E2LSH in a modern single-node computing environment and analyze its computational cost and I/O cost, from which we derive storage performance requirements for its external-memory execution. Our analysis indicates that E2LSH on a single consumer-grade SSD can run faster than the state-of-the-art small-index methods executed in memory. It also indicates that E2LSH with emerging high-performance storage devices and interfaces can approach in-memory E2LSH speeds. We implement a simple adaptation of E2LSH to external memory, E2LSH-on-Storage (E2LSHoS), and evaluate it on practical large datasets of up to one billion objects using different combinations of modern storage devices and interfaces. We demonstrate that our E2LSHoS implementation runs much faster than small-index methods and approaches in-memory E2LSH speeds, and that its query time scales sublinearly with database size beyond the index-size limit of in-memory E2LSH.
An Efficient Approach for Indoor Facility Location Selection
Yeasir Rayhan, T. Hashem, M. A. Cheema, Hua Lu, Mohammed Eunus Ali
DOI: 10.48786/edbt.2023.53 | pp. 632-644 | Published: 2023-01-01
The advancement of indoor location-aware technologies enables a wide range of location-based services in indoor spaces. In this paper, we formulate a novel Indoor Facility Location Selection (IFLS) query that finds the optimal location for placing a new facility (e.g., a coffee station) in an indoor venue (e.g., a university building) such that the maximum distance of all clients (e.g., staff/students) to their nearest facility is minimized. To the best of our knowledge, we are the first to address this problem in an indoor setting. We first adapt the state-of-the-art road-network solution to indoor settings, which exposes the limitations of existing approaches for solving our problem in an indoor space. We therefore propose an efficient approach that prunes the search space in terms of the number of clients considered and the total number of facilities retrieved from the database, thus reducing the total number of indoor distance calculations required. The key idea of our approach is to use a single pass over a state-of-the-art index for an indoor space and to reuse the nearest-neighbor computation of clients to prune irrelevant facilities and clients. We evaluate the performance of both approaches on four indoor datasets. Our approach achieves a speedup of 2.84× to 71.29× on synthetic data and 97.74× on real data over the baseline.
{"title":"An Efficient Approach for Indoor Facility Location Selection","authors":"Yeasir Rayhan, T. Hashem, M. A. Cheema, Hua Lu, Mohammed Eunus Ali","doi":"10.48786/edbt.2023.53","DOIUrl":"https://doi.org/10.48786/edbt.2023.53","url":null,"abstract":"The advancement of indoor location-aware technologies enables a wide range of location based services in indoor spaces. In this paper, we formulate a novel Indoor Facility Location Selection (IFLS) query that finds the optimal location for placing a new facility (e.g., a coffee station) in an indoor venue (e.g., a university building) such that the maximum distance of all clients (e.g., staffs/students) to their nearest facility is minimized. To the best of our knowledge we are the first to address this problem in an indoor setting. We first adapt the state-of-the-art solution in road networks for indoor settings, which exposes the limitations of existing approaches to solve our problem in an indoor space. Therefore, we propose an efficient approach which prunes the search space in terms of the number of clients considered, and the total number of facilities retrieved from the database, thus reducing the total number of indoor distance calculations required. The key idea of our approach is to use a single pass on a state-of-the-art index for an indoor space, and reuse the nearest neighbor computation of clients to prune irrelevant facilities and clients. We evaluate the performance of both approaches on four indoor datasets. Our approach achieves a speedup from 2 . 84 × to 71 . 29 × for synthetic data and 97 . 74 × for real data over the baseline.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"33 1","pages":"632-644"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77819825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Simplified Architecture for Fast, Adaptive Compilation and Execution of SQL Queries
Immanuel Haffner, J. Dittrich
DOI: 10.48786/edbt.2023.01 | pp. 1-13 | Published: 2023-01-01

Query compilation is crucial to efficiently execute query plans. In the past decade, we have witnessed considerable progress in this field, including compilation with LLVM, adaptively switching from interpretation to compiled code, and adaptively switching from non-optimized to optimized code. All of these ideas aim to reduce latency and/or increase throughput. However, these approaches require immense engineering effort, a considerable part of which goes into reengineering very fundamental techniques from the compiler-construction community, such as register allocation or machine-code generation, techniques studied in that field for decades. In this paper, we argue …
Supporting Complex Query Time Enrichment for Analytics
Dhrubajyoti Ghosh, Peeyush Gupta, S. Mehrotra, Shantanu Sharma
DOI: 10.48786/edbt.2023.08 | pp. 92-104 | Published: 2023-01-01
Several application domains require data to be enriched prior to its use. Data enrichment is often performed using expensive machine learning models (e.g., models for face detection) that interpret low-level data into semantically meaningful observations. Collecting and enriching data offline before loading it into a database is infeasible if one desires online analysis of data as it arrives. Enriching data on the fly at insertion could result in redundant work (if applications require only a fraction of the data to be enriched) and could create a bottleneck (if enrichment functions are expensive). Any scalable solution requires enrichment during query processing. This paper explores two different architectures for integrating enrichment into query processing: a loosely coupled approach, wherein enrichment is performed outside of the DBMS, and a tightly coupled approach, wherein it is performed within the DBMS. The paper addresses the challenges of increased query latency due to query-time enrichment.
{"title":"Supporting Complex Query Time Enrichment For Analytics","authors":"Dhrubajyoti Ghosh, Peeyush Gupta, S. Mehrotra, Shantanu Sharma","doi":"10.48786/edbt.2023.08","DOIUrl":"https://doi.org/10.48786/edbt.2023.08","url":null,"abstract":"Several application domains require data to be enriched prior to its use. Data enrichment is often performed using expensive machine learning models to interpret low-level data ( e . g ., models for face detection) into semantically meaningful observation. Col-lecting and enriching data offline before loading it to a database is infeasible if one desires online analysis on data as it arrives. Enriching data on the fly at insertion could result in redundant work (if applications require only a fraction of the data to be enriched) and could result in a bottleneck (if enrichment functions are expensive). Any scalable solution requires enrichment during query processing. This paper explores two different architectures for integrating enrichment into query processing – a loosely coupled approach wherein enrichment is performed outside of the DBMS and a tightly coupled approach wherein it is performed within the DBMS. The paper addresses the challenges of increased query latency due to query time enrichment.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"91 1","pages":"92-104"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80872252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FELIP: A Local Differentially Private Approach to Frequency Estimation on Multidimensional Datasets
José S. Costa Filho, Javam C. Machado
DOI: 10.48786/edbt.2023.56 | pp. 671-683 | Published: 2023-01-01

Local Differential Privacy (LDP) allows answering queries on users' data while maintaining their privacy. Queries are often issued over multidimensional datasets with categorical and numeric dimensions. In this paper, we tackle the problem of answering counting queries over such datasets under LDP. In the setting without a trusted central agent, each user's private dimensions are first perturbed locally to preserve privacy and then sent to an aggregator, which estimates the answers to queries. We build our approach on the existing idea of using grids: users' dimensions are mapped into grids, which are perturbed and sent to the aggregator so that it can estimate the real data distributions and answer different queries over the collected dimensions. Finer-grained grids lead to greater error due to noise, while coarser-grained ones result in greater error due to bias. We propose optimizing the construction of grids, taking a number of different factors into consideration to obtain better accuracy. We also propose adaptively selecting, based on the grid characteristics, the LDP algorithm that provides the better utility. We conduct experiments on real and synthetic datasets and compare our solution with existing approaches.