Efficiently Archiving Photos under Storage Constraints
S. Davidson, Shay Gershtein, T. Milo, Slava Novgorodov, May Shoshan
EDBT 2023, pp. 591-603 · DOI: 10.48786/edbt.2023.50
Our ability to collect data is rapidly outstripping our ability to effectively store and use it. Organizations therefore face tough decisions about which data to archive (or dispose of) to meet their business goals. We address this general problem in the context of image data (photos) by proposing which photos to archive so as to meet an online storage budget. The decision is based on factors such as usage patterns and their relative importance, the quality and size of each photo, the relevance of a photo to a usage pattern, the similarity between different photos, and policy requirements specifying which photos must be retained. We formalize the photo archival problem, analyze its complexity, and give two approximation algorithms: one with an optimal approximation guarantee, and another, more scalable one with both worst-case and data-dependent guarantees. Based on these algorithms we implemented an end-to-end system, PHOcus, and discuss how to automatically derive its inputs in many settings. An extensive experimental study on public as well as private datasets demonstrates the effectiveness and efficiency of PHOcus. Furthermore, a user study with business analysts in a real e-commerce application shows that it can save a tremendous amount of human effort and yield unexpected insights.
{"title":"Efficiently Archiving Photos under Storage Constraints","authors":"S. Davidson, Shay Gershtein, T. Milo, Slava Novgorodov, May Shoshan","doi":"10.48786/edbt.2023.50","DOIUrl":"https://doi.org/10.48786/edbt.2023.50","url":null,"abstract":"Our ability to collect data is rapidly outstripping our ability to effectively store and use it. Organizations are therefore facing tough decisions of what data to archive (or dispose of) to effectively meet their business goals. We address this general problem in the context of image data (photos) by proposing which photos to archive to meet an online storage budget. The decision is based on factors such as usage patterns and their relative importance, the quality and size of a photo, the relevance of a photo for a usage pattern, the similarity between different photos, as well as policy requirements of what photos must be retained. We formalize the photo archival problem, analyze its complexity, and give two approximation algorithms. One algorithm comes with an optimal approximation guarantee and another, more scalable, algorithm that comes with both worst-case and data-dependent guarantees. Based on these algorithms we implemented an end-to-end system, PHOcus, and discuss how to automatically derive the inputs for this system in many settings. An extensive experimental study based on public as well as private datasets demonstrates the effectiveness and efficiency of PHOcus. Furthermore, a user study using business analysts in a real e-commerce application shows that it can save a tremendous amount of human effort and yield unexpected insights.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"34 1","pages":"591-603"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85088942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Subset Approach to Efficient Skyline Computation
Dominique H. Li
EDBT 2023, pp. 391-403 · DOI: 10.48786/edbt.2023.31
Skyline query processing is essential to the database community. Many algorithms have been designed to perform efficient skyline computation; they can generally be categorized as sorting-based or partitioning-based, according to the mechanism used to reduce dominance tests. Sorting-based skyline algorithms first sort all points with respect to a monotone score function, for instance the sum of all values of a point, so that the dominance tests can be bounded by the score function; partitioning-based algorithms create partitions of the dataset so that dominance tests can be confined to partitions. On the other hand, incomparability between points is an important property: if two points are incomparable, then any dominance test between them is unnecessary. Indeed, state-of-the-art skyline algorithms effectively reduce dominance tests by taking incomparability into account. In this paper, we present a subset-based approach that integrates subspace-based incomparability into existing sorting-based skyline algorithms and can therefore significantly reduce the total number of dominance tests on large multidimensional datasets. Our theoretical and experimental studies show that the proposed subset approach boosts existing sorting-based skyline algorithms, making them comparable to the state-of-the-art algorithms and even faster on uniform independent data.
{"title":"Subset Approach to Efficient Skyline Computation","authors":"Dominique H. Li","doi":"10.48786/edbt.2023.31","DOIUrl":"https://doi.org/10.48786/edbt.2023.31","url":null,"abstract":"Skyline query processing is essential to the database commu-nity. Many algorithms have been designed to perform efficient skyline computation, which can be generally categorized into sorting-based and partitioning-based by considering the different mechanisms to reduce the dominance tests. Sorting-based skyline algorithms first sort all points with respect to a monotone score function, for instance the sum of all values of a point, then the dominance tests can be bounded by the score function; partitioning-based algorithms create partitions from the dataset so that the dominance tests can be limited in partitions. On the other hand, the incomparability between points has been considered as an important property, that is, if two points are incomparable, then any dominance test between them is unnec-essary. In fact, the state-of-the-art skyline algorithms effectively reduce the dominance tests by taking the incomparability into account. In this paper, we present a subset-based approach that allows to integrate subspace-based incomparability to existing sorting-based skyline algorithms and can therefore significantly reduce the total number of dominance tests in large multidimensional datasets. Our theoretical and experimental studies show that the proposed subset approach boosts existing sorting-based skyline algorithms and makes them comparable to the state-of-the-art algorithms and even faster with uniform independent data.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"63 1","pages":"391-403"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83858579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ExtremeEarth: Managing Water Availability for Crops Using Earth Observation and Machine Learning
F. Appel, H. Bach, S. Migdall, Manolis Koubarakis, G. Stamoulis, D. Bilidas, D. Pantazi, L. Bruzzone, C. Paris, Giulio Weikmann
EDBT 2023, pp. 749-756 · DOI: 10.48786/edbt.2023.62
Food security, especially in a changing Earth environment, is one of the most challenging issues of this century. Population growth, increased food consumption and the challenges of climate change will extend over the next decades. To deal with these, both regional and global measures are necessary. Biomass production, and thus yield, will need to be increased in a sustainable way. It is important to minimize the risks of yield loss even under more extreme environmental conditions, while making sure not to deplete or damage the available resources. Two measures are most important for this: irrigation and fertilization. While fertilization relies mainly on industrial goods, irrigation requires reliable water resources in the area being farmed, from either groundwater or surface water. Regarding surface water, a large portion of the world's freshwater is linked to snowfall, snow storage and the seasonal release of the water. All these components are subject to increased variability due to climate change and the …
{"title":"ExtremeEarth: Managing Water Availability for Crops Using Earth Observation and Machine Learning","authors":"F. Appel, H. Bach, S. Migdall, Manolis Koubarakis, G. Stamoulis, D. Bilidas, D. Pantazi, L. Bruzzone, C. Paris, Giulio Weikmann","doi":"10.48786/edbt.2023.62","DOIUrl":"https://doi.org/10.48786/edbt.2023.62","url":null,"abstract":"Food security, especially in a changing Earth environment, is one of the most challenging issues of this century. Population growth, increased food consumption and the challenges of climate change will extend over the next decades. To deal with these, both regional and global measures are necessary. Biomass production and thus yield will need to be increased in a sustainable way. It is important to minimize the risks of yield loss even under more extreme environmental conditions, while making sure not to deplete or damage the available resources. Two measures are most important for this: irrigation and fertilization. While fertilization relies mainly on industrial goods, irrigation requires reliable water resources in the area that is being farmed, either from groundwater or surface water. Regarding surface water, a large portion of the world’s fresh-water is linked to snowfall, snow storage and seasonal release of the water. All these components are subject to increased variability due to climate change and the","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"3 1","pages":"749-756"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88470977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Balancing Utility and Fairness in Submodular Maximization
Yanhao Wang, Yuchen Li, F. Bonchi, Ying Wang
EDBT, pp. 1-14 · DOI: 10.48786/edbt.2024.01
Submodular function maximization is a fundamental combinatorial optimization problem with many applications, including data summarization, influence maximization, and recommendation. In many of these problems, the goal is to find a solution that maximizes the average utility over all users, where each user's utility is defined by a monotone submodular function. However, when the population of users is composed of several demographic groups, another critical question is whether the utility is fairly distributed across the groups. Although the utility and fairness objectives are both desirable, they may contradict each other, and, to the best of our knowledge, little attention has been paid to optimizing them jointly. To fill this gap, we propose a new problem called Bicriteria Submodular Maximization (BSM) to balance utility and fairness. Specifically, it requires finding a fixed-size solution that maximizes the utility function, subject to the value of the fairness function not falling below a threshold. Since BSM is inapproximable within any constant factor, we focus on designing efficient instance-dependent approximation schemes. Our algorithmic proposal comprises two methods, with different approximation factors, obtained by converting a BSM instance into instances of other submodular optimization problems. Using real-world and synthetic datasets, we showcase applications of our proposed methods in three submodular maximization problems: maximum coverage, influence maximization, and facility location.
{"title":"Balancing Utility and Fairness in Submodular Maximization","authors":"Yanhao Wang, Yuchen Li, F. Bonchi, Ying Wang","doi":"10.48786/edbt.2024.01","DOIUrl":"https://doi.org/10.48786/edbt.2024.01","url":null,"abstract":"Submodular function maximization is a fundamental combinatorial optimization problem with plenty of applications -- including data summarization, influence maximization, and recommendation. In many of these problems, the goal is to find a solution that maximizes the average utility over all users, for each of whom the utility is defined by a monotone submodular function. However, when the population of users is composed of several demographic groups, another critical problem is whether the utility is fairly distributed across different groups. Although the emph{utility} and emph{fairness} objectives are both desirable, they might contradict each other, and, to the best of our knowledge, little attention has been paid to optimizing them jointly. To fill this gap, we propose a new problem called emph{Bicriteria Submodular Maximization} (BSM) to balance utility and fairness. Specifically, it requires finding a fixed-size solution to maximize the utility function, subject to the value of the fairness function not being below a threshold. Since BSM is inapproximable within any constant factor, we focus on designing efficient instance-dependent approximation schemes. Our algorithmic proposal comprises two methods, with different approximation factors, obtained by converting a BSM instance into other submodular optimization problem instances. Using real-world and synthetic datasets, we showcase applications of our proposed methods in three submodular maximization problems: maximum coverage, influence maximization, and facility location.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"51 1","pages":"1-14"},"PeriodicalIF":0.0,"publicationDate":"2022-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75586044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Integration of Skyline Queries into Spark SQL
Pub Date: 2022-10-07 · DOI: 10.48550/arXiv.2210.03718 · EDBT 2023, pp. 337-350
Lukas Grasmann, R. Pichler, Alexander Selzer
Skyline queries are frequently used in data analytics and multi-criteria decision support applications to filter relevant information from large amounts of data. Apache Spark is a popular framework for processing big, distributed data, and it even provides a convenient SQL-like interface via the Spark SQL module. However, skyline queries are not natively supported and require tedious rewriting to fit the SQL standard or Spark's SQL-like language. The goal of our work is to fill this gap. We provide a full-fledged integration of the skyline operator into Spark SQL, which allows for a simple, easy-to-use syntax for expressing skyline queries. Moreover, our empirical results show that this integrated solution by far outperforms a solution based on rewriting skyline queries into standard SQL.
{"title":"Integration of Skyline Queries into Spark SQL","authors":"Lukas Grasmann, R. Pichler, Alexander Selzer","doi":"10.48550/arXiv.2210.03718","DOIUrl":"https://doi.org/10.48550/arXiv.2210.03718","url":null,"abstract":"Skyline queries are frequently used in data analytics and multi-criteria decision support applications to filter relevant information from big amounts of data. Apache Spark is a popular framework for processing big, distributed data. The framework even provides a convenient SQL-like interface via the Spark SQL module. However, skyline queries are not natively supported and require tedious rewriting to fit the SQL standard or Spark's SQL-like language. The goal of our work is to fill this gap. We thus provide a full-fledged integration of the skyline operator into Spark SQL. This allows for a simple and easy to use syntax to input skyline queries. Moreover, our empirical results show that this integrated solution of skyline queries by far outperforms a solution based on rewriting into standard SQL.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"46 1","pages":"337-350"},"PeriodicalIF":0.0,"publicationDate":"2022-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80187603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Frequency Estimation of Evolving Data Under Local Differential Privacy
Pub Date: 2022-10-01 · DOI: 10.48550/arXiv.2210.00262 · EDBT 2023, pp. 512-525
Héber H. Arcolezi, Carlos Pinzón, C. Palamidessi, S. Gambs
Collecting and analyzing evolving longitudinal data has become a common practice. One possible approach to protect users' privacy in this context is to use local differential privacy (LDP) protocols, which ensure privacy protection for all users even in the case of a breach or data misuse. Existing LDP data collection protocols such as Google's RAPPOR and Microsoft's dBitFlipPM can have longitudinal privacy loss linear in the domain size k, which is excessive for large domains, such as Internet domains. To solve this issue, in this paper we introduce a new LDP data collection protocol for longitudinal frequency monitoring named LOngitudinal LOcal HAshing (LOLOHA), with formal privacy guarantees. In addition, the privacy-utility trade-off of our protocol is linear only in a reduced domain size 2 ≤ g ≪ k. LOLOHA combines a domain reduction approach via local hashing with double randomization to minimize the privacy leakage incurred by data updates. As demonstrated by our theoretical analysis as well as our experimental evaluation, LOLOHA achieves utility competitive with current state-of-the-art protocols, while reducing the longitudinal privacy budget consumption by up to k/g orders of magnitude.
{"title":"Frequency Estimation of Evolving Data Under Local Differential Privacy","authors":"Héber H. Arcolezi, Carlos Pinz'on, C. Palamidessi, S. Gambs","doi":"10.48550/arXiv.2210.00262","DOIUrl":"https://doi.org/10.48550/arXiv.2210.00262","url":null,"abstract":"Collecting and analyzing evolving longitudinal data has become a common practice. One possible approach to protect the users' privacy in this context is to use local differential privacy (LDP) protocols, which ensure the privacy protection of all users even in the case of a breach or data misuse. Existing LDP data collection protocols such as Google's RAPPOR and Microsoft's dBitFlipPM can have longitudinal privacy linear to the domain size k, which is excessive for large domains, such as Internet domains. To solve this issue, in this paper we introduce a new LDP data collection protocol for longitudinal frequency monitoring named LOngitudinal LOcal HAshing (LOLOHA) with formal privacy guarantees. In addition, the privacy-utility trade-off of our protocol is only linear with respect to a reduced domain size $2leq g ll k$. LOLOHA combines a domain reduction approach via local hashing with double randomization to minimize the privacy leakage incurred by data updates. As demonstrated by our theoretical analysis as well as our experimental evaluation, LOLOHA achieves a utility competitive to current state-of-the-art protocols, while substantially minimizing the longitudinal privacy budget consumption by up to k/g orders of magnitude.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"59 1","pages":"512-525"},"PeriodicalIF":0.0,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90850424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Offset-value coding in database query processing
Pub Date: 2022-09-30 · DOI: 10.48550/arXiv.2210.00034 · EDBT 2023, pp. 464-470
G. Graefe, Thanh Do
Recent work shows how offset-value coding speeds up database query execution, not only sorting but also duplicate removal and grouping (aggregation) in sorted streams, order-preserving exchange (shuffle), merge join, and more. It already saves thousands of CPUs in Google's Napa and F1 Query systems, e.g., in grouping algorithms and in log-structured merge-forests. In order to realize the full benefit of interesting orderings, however, query execution algorithms must not only consume and exploit offset-value codes but also produce offset-value codes for the next operator in the pipeline. Our research has sought ways to produce offset-value codes without comparing successive output rows one-by-one, column-by-column. This short paper introduces a new theorem and, based on its proof and a simple corollary, describes in detail how order-preserving algorithms (from filter to merge join and even shuffle) can compute offset-value codes for their outputs. These computations are surprisingly simple and very efficient.
{"title":"Offset-value coding in database query processing","authors":"G. Graefe, Thanh Do","doi":"10.48550/arXiv.2210.00034","DOIUrl":"https://doi.org/10.48550/arXiv.2210.00034","url":null,"abstract":"Recent work shows how offset-value coding speeds up database query execution, not only sorting but also duplicate removal and grouping (aggregation) in sorted streams, order-preserving exchange (shuffle), merge join, and more. It already saves thousands of CPUs in Google's Napa and F1 Query systems, e.g., in grouping algorithms and in log-structured merge-forests. In order to realize the full benefit of interesting orderings, however, query execution algorithms must not only consume and exploit offset-value codes but also produce offset-value codes for the next operator in the pipeline. Our research has sought ways to produce offset-value codes without comparing successive output rows one-by-one, column-by-column. This short paper introduces a new theorem and, based on its proof and a simple corollary, describes in detail how order-preserving algorithms (from filter to merge join and even shuffle) can compute offset-value codes for their outputs. These computations are surprisingly simple and very efficient.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"5 1","pages":"464-470"},"PeriodicalIF":0.0,"publicationDate":"2022-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83841912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
bloomRF: On Performing Range-Queries in Bloom-Filters with Piecewise-Monotone Hash Functions and Prefix Hashing
Pub Date: 2022-07-11 · DOI: 10.48550/arXiv.2207.04789 · EDBT 2023, pp. 131-143
B. Mößner, Christian Riegger, Arthur Bernhardt, Ilia Petrov
We introduce bloomRF as a unified method for approximate membership testing that supports both point and range queries. As a first core idea, bloomRF introduces novel prefix hashing to efficiently encode range information in the hash code of the key itself. As a second key concept, bloomRF proposes novel piecewise-monotone hash functions that preserve local order and support fast range lookups with fewer memory accesses. bloomRF has near-optimal space complexity and constant query complexity. Although bloomRF is designed for integer domains, it supports floating-point values and can serve as a multi-attribute filter. The evaluation in RocksDB and in a standalone library shows that it is more efficient than existing point-range filters, outperforming them by up to 4x across a range of settings and distributions, while keeping the false-positive rate low.
{"title":"bloomRF: On Performing Range-Queries in Bloom-Filters with Piecewise-Monotone Hash Functions and Prefix Hashing","authors":"B. Mößner, Christian Riegger, Arthur Bernhardt, Ilia Petrov","doi":"10.48550/arXiv.2207.04789","DOIUrl":"https://doi.org/10.48550/arXiv.2207.04789","url":null,"abstract":"We introduce bloomRF as a unified method for approximate membership testing that supports both point- and range-queries. As a first core idea, bloomRF introduces novel prefix hashing to efficiently encode range information in the hash-code of the key itself. As a second key concept, bloomRF proposes novel piecewise-monotone hash-functions that preserve local order and support fast range-lookups with fewer memory accesses. bloomRF has near-optimal space complexity and constant query complexity. Although, bloomRF is designed for integer domains, it supports floating-points, and can serve as a multi-attribute filter. The evaluation in RocksDB and in a standalone library shows that it is more efficient and outperforms existing point-range-filters by up to 4x across a range of settings and distributions, while keeping the false-positive rate low.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"22 1","pages":"131-143"},"PeriodicalIF":0.0,"publicationDate":"2022-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88751086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
User Customizable and Robust Geo-Indistinguishability for Location Privacy
Primal Pappachan, Chenxi Qiu, A. Squicciarini, Vishnu Sharma Hunsur Manjunath
EDBT 2023, pp. 658-670 · DOI: 10.48786/edbt.2023.55
Location obfuscation functions generated by existing systems for ensuring location privacy are monolithic and do not allow users to customize their obfuscation range. This can lead to the user being mapped to undesirable locations (e.g., shady neighborhoods) by the location-requesting services. Modifying the obfuscation function generated by a centralized server on the user side can result in poor privacy, as the original function is not robust against such updates. Users themselves might find it challenging to understand the parameters involved in obfuscation mechanisms (e.g., obfuscation range and granularity of location representation) and therefore struggle to set realistic trade-offs between privacy, utility, and customization. In this paper, we propose a new framework, CORGI (CustOmizable Robust Geo-Indistinguishability), which generates location obfuscation functions that are robust against user customization while providing strong privacy guarantees based on the Geo-Indistinguishability paradigm. CORGI utilizes a tree representation of a given region to assist users in specifying their privacy and customization requirements. The server side of CORGI takes these requirements as inputs and generates an obfuscation function that satisfies the Geo-Indistinguishability requirements and is robust against customization on the user side. The obfuscation function is returned to the user, who can then choose to update it (e.g., its obfuscation range or granularity of location representation). Experimental results on a real dataset demonstrate that CORGI can efficiently generate obfuscation matrices that are more robust to customization by users.
{"title":"User Customizable and Robust Geo-Indistinguishability for Location Privacy","authors":"Primal Pappachan, Chenxi Qiu, A. Squicciarini, Vishnu Sharma Hunsur Manjunath","doi":"10.48786/edbt.2023.55","DOIUrl":"https://doi.org/10.48786/edbt.2023.55","url":null,"abstract":"Location obfuscation functions generated by existing systems for ensuring location privacy are monolithic and do not allow users to customize their obfuscation range. This can lead to the user being mapped in undesirable locations (e.g., shady neighborhoods) to the location-requesting services. Modifying the obfuscation function generated by a centralized server on the user side can result in poor privacy as the original function is not robust against such updates. Users themselves might find it challenging to understand the parameters involved in obfuscation mechanisms (e.g., obfuscation range and granularity of location representation) and therefore struggle to set realistic trade-offs between privacy, utility, and customization. In this paper, we propose a new framework called, CORGI, i.e., CustOmizable Robust Geo-Indistinguishability, which generates location obfuscation functions that are robust against user customization while providing strong privacy guarantees based on the Geo-Indistinguishability paradigm. CORGI utilizes a tree representation of a given region to assist users in specifying their privacy and customization requirements. The server side of CORGI takes these requirements as inputs and generates an obfuscation function that satisfies Geo-Indistinguishability requirements and is robust against customization on the user side. The obfuscation function is returned to the user who can then choose to update the obfuscation function (e.g., obfuscation range, granularity of location representation). The experimental results on a real dataset demonstrate that CORGI can efficiently generate obfuscation matrices that are more robust to the customization by users.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"45 3","pages":"658-670"},"PeriodicalIF":0.0,"publicationDate":"2022-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72634522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unsupervised Space Partitioning for Nearest Neighbor Search
Pub Date: 2022-06-16 · DOI: 10.48550/arXiv.2206.08091 · EDBT 2023, pp. 351-363
Abrar Fahim, Mohammed Eunus Ali, M. A. Cheema
Approximate Nearest Neighbor Search (ANNS) in high-dimensional spaces is crucial for many real-life applications (e.g., e-commerce, web search, multimedia) dealing with an abundance of data. This paper proposes an end-to-end learning framework that couples the partitioning (one critical step of ANNS) and learning-to-search steps using a custom loss function. A key advantage of our proposed solution is that it does not require any expensive pre-processing of the dataset, which is one of the critical limitations of the state-of-the-art approach. We achieve this edge by formulating a multi-objective custom loss function that does not need ground-truth labels to quantify the quality of a given data-space partition, making it entirely unsupervised. We also propose an ensembling technique that trains an ensemble of models, with varying input weights to the loss function, to enhance search quality. On several standard ANNS benchmarks, we show that our method beats the state-of-the-art space partitioning method and the ubiquitous K-means clustering method, while using fewer parameters and shorter offline training times. We also show that incorporating our space-partitioning strategy into state-of-the-art ANNS techniques such as ScaNN can improve their performance significantly. Finally, we present our unsupervised partitioning approach as a promising alternative to many widely used clustering methods, such as K-means clustering and DBSCAN.
{"title":"Unsupervised Space Partitioning for Nearest Neighbor Search","authors":"Abrar Fahim, Mohammed Eunus Ali, M. A. Cheema","doi":"10.48550/arXiv.2206.08091","DOIUrl":"https://doi.org/10.48550/arXiv.2206.08091","url":null,"abstract":"Approximate Nearest Neighbor Search (ANNS) in high dimensional spaces is crucial for many real-life applications (e.g., e-commerce, web, multimedia, etc.) dealing with an abundance of data. This paper proposes an end-to-end learning framework that couples the partitioning (one critical step of ANNS) and learning-to-search steps using a custom loss function. A key advantage of our proposed solution is that it does not require any expensive pre-processing of the dataset, which is one of the critical limitations of the state-of-the-art approach. We achieve the above edge by formulating a multi-objective custom loss function that does not need ground truth labels to quantify the quality of a given data-space partition, making it entirely unsupervised. We also propose an ensembling technique by adding varying input weights to the loss function to train an ensemble of models to enhance the search quality. On several standard benchmarks for ANNS, we show that our method beats the state-of-the-art space partitioning method and the ubiquitous K-means clustering method while using fewer parameters and shorter offline training times. We also show that incorporating our space-partitioning strategy into state-of-the-art ANNS techniques such as ScaNN can improve their performance significantly. Finally, we present our unsupervised partitioning approach as a promising alternative to many widely used clustering methods, such as K-means clustering and DBSCAN.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"36 1","pages":"351-363"},"PeriodicalIF":0.0,"publicationDate":"2022-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81640004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}